updated tiny interpreter; now can print integers; 511 bytes
Kragen Javier Sitaker
kragen at pobox.com
Sat Nov 24 03:37:01 EST 2007
# Simple token-threaded "Forth" interpreter. -*- asm -*-
# by Kragen Javier Sitaker; dedicated to the public domain,
# i.e. I relinquish whatever exclusive rights copyright law
# gives me with regard to this work.
# Major parts taken from Richard W.M. Jones's public-domain
# JONESFORTH 42 by Richard W.M. Jones <rich at annexia.org>
# http://annexia.org/forth
# This program just outputs "hello, world, hello" under Linux.
# to compile:
# gcc -m32 -nostdlib -static -o tokthr2 tokthr2.S
### Why Small Things are Interesting
# There are still a lot of computers out there that have tens of
# kilobytes of memory or less, and they cost a lot less than,
# say, a cellphone. Cellphones are apparently still too
# expensive for half the world's population. I want to see how
# close I can get to having a comfortable programming
# environment in a smaller device.
# Some smallish microcontroller chips from five different
# manufacturers, with current Digi-Key prices:
# Name bytes RAM bytes ROM MHz price
# ATtiny2313 128 2048 20 US$1.38
# ATMega48-20AU 512 4096 20 US$1.62
# MSP430F1111AIPW 128 2264 8 US$2.43
# LPC2101 2048 8192 70 US$2.52
# H8/300H Tiny 1536 8192 12 US$3.58
# M16C/R8C/Tiny/1B 1024 16384 12 US$3.54
# SX28AC/SS 136 3072 50 US$2.79
# More practically and short-termly, small projects can take
# less time to finish, and I feel like I need to learn about
# different approaches to implementing programming languages.
### Why This is Small
# The normal Forth representation of a function is as an array
# of pointers to the other functions it calls, in sequence; a
# few of those other functions may move the interpreter pointer
# around in that array, or snarf up a constant that's stored in
# the array, or stuff like that, but for the most part, the
# functions just get called in sequence. This is called
# "threaded code" and it's fairly compact, especially on 16-bit
# systems where the pointers are only two bytes.
# A traditional approach taken by Forth implementations to
# reduce code size even further is called "token threading".
# Rather than making arrays of 16-bit or 32-bit pointers, they
# make lists of 8-bit indices into an array of pointers. This
# has two advantages:
# 1. the indices are one fourth the size of a list of 32-bit
# pointers;
# 2. it is possible to save these lists of indices somewhere
# outside of memory and continue to use them even after
# making some changes to the code, as long as the same
# indices in the table have the same meanings. So, for
# example, you could write some boot firmware in this
# "bytecode".
# It also has some disadvantages:
# 1. You run out of space in the table. Even a fairly minimal
# full Forth system contains close to 256 subroutines. You
# can mitigate this by packing, say, two 12-bit pointers
# every three bytes, or maybe by having a special bytecode
# that looks up the next byte in an extended table.
# 2. It's slower and makes the machine-code part of the program
# take more space. The traditional LODSW; JMP AX version of
# $NEXT from the eForth Model, which fetches and executes the
# next execution token in the threaded list, is three bytes
# and two instructions; my 'next' here is 41 bytes and 14
# instructions, which is big enough that I jump to it (2
# bytes) rather than making an assembler macro. Which blows
# your branch target buffers to pieces. Oh well. The
# performance penalty is probably two orders of magnitude
# over native code, but I haven't measured it yet. I
# measured an earlier version on my 700MHz PIII laptop on an
# empty loop at about a factor of 3.5 over simple
# direct-threading, which in turn is on the order of 10 times
# slower than machine code.
# Anyway, so this is an example program built using this
# technique. It implements two Forthlike stacks and interpreted
# subroutines, but not yet the ability to define new subroutines
# at run-time.
### What's Here
# I've implemented all of the primitives from C. H. Ting and
# Bill Muench's public-domain (?) eForth Model 1.0, except for
# the following:
# - I haven't implemented their lowercase "next" (as in FOR
# NEXT) because I think it's a bad idea, it's complex, and it
# can be implemented at a higher level if you really need it;
# - I didn't implement !IO because it's a no-op in this context;
# - I haven't yet implemented ?RX, although I think it's
# possible to implement it on top of syscall5, using select();
# However, most of it is untested and therefore probably broken.
# Procedure call and return and the system calls do work.
# Currently registers are used as follows:
# %esi --- interpreter pointer; points to next byte to execute
# %ebp --- return stack pointer; points to last thing pushed. This stack,
# like the other one, grows downwards.
# %esp --- data stack pointer; points to last thing pushed. This is
# the processor's standard stack pointer; "push" and "pop"
# instructions use it, which makes assembly code to use it
# quite concise. The Intel "call" and "ret" instructions
# would also use this stack, but they aren't used in this
# program.
# flags --- the "down" direction flag must be cleared.
# Fortunately this seems to be the case by default.
# It's probably missing a couple of primitives needed because of
# the token-threading implementation strategy; the address of
# the token table probably needs to be knowable, at least.
# Direct and indirect threading, the normal Forth approaches to
# allowing unrestricted coexistence of words written in assembly
# language and interpreted Forth, both had heavy space costs
# here --- close to 100% for the bytecode currently in the
# system. So the inner interpreter checks, for each bytecode,
# whether it is in the range of bytecodes whose interpretations
# are in native code, and picks the relevant code path. This
# avoids consuming any space per-word for this distinction, but with
# what I assume is a heavy performance cost.
### How Small This Is
# eForth 1.0's machine-code part seems to be 171 instructions
# and 399 bytes, including some data that's mixed in there with
# it.
# Last I checked, this program uses only 19 different
# instructions: jmp, jz; push, pop, lodsb, lodsl, xchg, mov;
# movsbl, inc, and, xor, or, lea, cdq, rcl, add, idiv; and int.
# At the moment, this program is 229 bytes in 125 machine-code
# instructions, plus 13 bytes of read-only data, 104 bytes of
# token table for the 52 currently-defined words, and another
# 143 bytes of bytecode in those words, and then another 22
# bytes in the main program, for a total of 511 bytes.
# One important thing that's missing here is the dictionary
# structure, which will minimally use up another hundred bytes
# or so.
### Other Things I Tried
# I tried switching to caching the top of the data stack in a
# register, on the theory that it would shorten things like
# 'and'. Currently 'and' is pop %eax; pop %ebx; and %ebx,
# %eax; push %eax; jmp next. If top of stack is cached in %eax
# instead of being stored in memory, this becomes pop %ebx; and
# %ebx, %eax; jmp next, which is considerably shorter.
# However, most things don't change, and other things become
# longer due to the extra work to save top-of-stack. I tried
# using both %ebx and %eax as the top-of-stack cache.
# In the version using %ebx as top-of-stack, the total size of the
# machine-code part was 216 bytes, 115 instructions, compared to 197
# bytes, 112 instructions for the version using the current strategy.
# In the version using %eax as top-of-stack, it was only 215 bytes,
# but that's still worse than 197.
# In previous versions, all routines were machine-code routines
# that you could just jmp to. High-level bytecode words began
# with "call dolist", which took the saved %eip off the stack
# and stuck it in %esi. Unfortunately, that added 5 bytes to
# each bytecode word; as I write this, the bytecode region is
# 143 bytes and contains 24 word definitions --- so 5 bytes each
# would have been 120 bytes of overhead, or 84%! It also would
# have required a region to be both executable and writable to
# support run-time routine definition, which is kind of a pain
# thse days.
# In previous versions, the token-table entries were 32 bits
# each (instead of 16 bits as they are now), which added another
# 2 bytes of overhead per word. There are currently 52 words,
# so that's another 104 bytes of shaved overhead.
### What's Wrong With This Program
# - It's a long way from doing anything useful.
# - Only these words are tested: hello, sub1, type, comma,
# world, newline, dolit_s8, dot, bye, exit, branch_on_0,
# c_bang, drop, dup, swap, negative, umplus, divmod,
# syscall5, syscall3, zero, syscall1, rpop, rpush, one,
# dolit_32, neg1, add, emit, tuck, udot, udot_nospc,
# udot_nonzero, branch, invert, add1, negate, xor. These are
# not tested, and therefore may be broken: execute, bang, at,
# c_at, rp_at, rp_bang, sp_at, sp_bang, over, and, or, r_at.
# - There's no dictionary structure yet.
# - It probably needs another couple of primitives.
# - There's no checking for stack overflow or underflow, but
# they will break things.
### The Beginning of the Program
# .. include system library header so we know __NR_exit = 1 and
# __NR_write = 4
#include <asm/unistd.h>
### The token table
## I was frustrated with the unreadability of my
## bytecode lists; I was counting token table entries
## by hand and writing bytecodes numerically. So I
## wrote a macro to help.
## Note that we are using a separate .subsection
## directive because gas 2.17 doesn't support putting
## that in the .pushsection line, even though it is
## documented to do so; see message from Maciej
## W. Rozycki on 2007-10-11, subject "Re: How to use
## .pushsection?",
## http://sourceware.org/ml/binutils/2007-10/msg00176.html
## for more details)
## The first few entries in the table of bytecodes are
## all defined in machine code; the rest are all
## defined in bytecode. The inner interpreter examines
## each bytecode to determine which category it falls
## in in order to figure out how to execute it,
## including what base address to add its offset to.
## This sucks for extensibility but rocks for
## compactness.
.macro define_bytecode name, origin
.pushsection .data # save current position, go to data section
.subsection 1 # and subsection 1, where we put the addrs
b_\name = (. - token_table) / 2 # define b_foo as the index of this ptr
.ifeq b_\name - 256
.error "\name got bytecode 256"
.endif
#.int \name # insert pointer which will be resolved next:
.short \name - \origin # insert offset which will be resolved next
.popsection # return to where we were, and
\name: # define the name
.endm
.macro defasm name
define_bytecode \name, machine_code_primitives
.endm
.macro defbytes name
define_bytecode \name, bytecode_start
.endm
.data 1 # Start putting stuff in data subsection 1
.align 4
## table to define the "bytecode" instructions
token_table:
.data 3
instructions:
# And here is the actual "program" in that bytecode.
.byte b_hello # string "hello" and count
.byte b_sub1 # subtract 1 from count: "hell"
.byte b_type # spit it out
.byte b_comma, b_type, b_world, b_type # ", world"
.byte b_comma, b_type, b_hello, b_type, b_newline, b_type
# test the "dot" command to print out numbers
.byte b_dolit_s8, -120, b_dot
.byte b_dolit_s8, 104, b_dot, b_newline, b_type
.byte b_bye
### The Return Stack
# We put Forth return addresses here, but programs can also use
# it for other purposes.
.bss
.space 4096
initial_return_stack_pointer:
### Initialization
.text # the following stuff goes in the text segment
.global _start # declare _start as a global symbol
# (otherwise ld won't be able to find it)
_start: # this is the entry point for ELF I guess
mov $initial_return_stack_pointer, %ebp
mov $instructions, %esi # %esi is the interpreter pointer register
jmp next # and now we start the interpreter.
# (somewhat silly since we could just
# fall through..)
### The Machine-Code Primitives
# Also next (aka the address interpreter or inner interpreter)
# is in this section.
machine_code_primitives:
# dolit_s8 takes a signed 8-bit literal from the instruction
# stream and pushes it onto the stack.
defasm dolit_s8
lodsb
movsbl %al, %eax
push %eax
jmp next
defasm dolit_32 # more general dolit
lodsl
push %eax
jmp next
defasm exit # Return from a colon defn.
xchg %ebp, %esp
pop %esi
xchg %ebp, %esp
jmp next
defasm execute # Run xt on data stack.
pop %eax # Here 'xt' is the one-byte token.
jmp execute_eax
# Branch if top of stack is 0 (implementing IF).
# Both branch instructions take a signed byte offset from the bytecode
# stream.
defasm branch_on_0
pop %eax
and %eax, %eax
jz branch
inc %esi # skip 1-byte jump offset
jmp next
defasm branch
lodsb
movsbl %al, %eax # same insn size as cbtw; cwde
add %eax, %esi
jmp next
# Store a cell.
defasm bang
pop %ebx
pop (%ebx) # I'm amazed this is legal
jmp next
# Fetch a cell.
defasm at
pop %ebx
push (%ebx) # I'm amazed this is legal too
jmp next
# Store a byte.
defasm c_bang
pop %ebx
pop %eax
mov %al, (%ebx) # push and pop don't do bytes
jmp next
# Fetch a byte.
defasm c_at
pop %ebx
xor %eax, %eax
mov (%ebx), %al
push %eax
jmp next
# Get the return stack pointer.
defasm rp_at
push %ebp
jmp next
# Set the return stack pointer.
defasm rp_bang
pop %ebp
jmp next
# Pop the return stack to the data stack ( R> )
defasm rpop
push (%ebp)
lea 4(%ebp), %ebp # add or xchg/pop: same size
jmp next
# Push the return stack from the data stack ( >R )
defasm rpush
lea -4(%ebp), %ebp
pop (%ebp)
jmp next
# Get the data stack pointer (before it gets pushed).
defasm sp_at
push %esp # safe on 286 and later
jmp next
# Set the data stack pointer.
defasm sp_bang
pop %esp
jmp next
# Pop the stack.
defasm drop
pop %eax
jmp next
# Push a copy of TOS.
# eForth 1.0 used BX to index the stack here, for a couple of
# reasons: on the 8086, SP got decremented prior to the fetch,
# and also wasn't valid as a base or index register.
defasm dup
push (%esp)
jmp next
# Swap top two stack items ("exch" in PostScript)
defasm swap
pop %eax
pop %ebx
push %eax
push %ebx
jmp next
# Stack manipulation ( w1 w2 -- w1 w2 w1 )
# technically not necessary, but it's so easy and tiny
defasm over
push 4(%esp)
# jmp next fall through because "next" is next
# "next" fetches the next bytecode and runs it. It's placed
# here in the middle of the bytecode definitions so that more
# of them can use the short two-byte jump form to get to it.
next:
xor %eax, %eax # set %eax to 0
xor %ebx, %ebx # clear high half of %ebx
lodsb # load %al from where %esi points
# (%esi is the interpreter pointer)
execute_eax:
## load offset of new word into %ebx
mov token_table(,%eax,2), %bx # bx := token_table[eax * 2bytes]
cmp $last_asm_bytecode, %eax
jbe next_primitive # if primitive, handle primitive word
## otherwise, handle a bytecode definition or "colon list"
# save old %esi on return stack
xchg %ebp, %esp
push %esi
xchg %ebp, %esp
lea bytecode_start(%ebx), %esi
jmp next
next_primitive:
lea machine_code_primitives(%ebx), %ebx
jmp *%ebx
# Push true if n negative. ( n -- f )
defasm negative
pop %eax
cdq
push %edx
jmp next
# Bitwise operators:
defasm and
pop %eax
pop %ebx
and %ebx, %eax
push %eax
jmp next
defasm or
pop %eax
pop %ebx
or %ebx, %eax
push %eax
jmp next
defasm xor
pop %eax
pop %ebx
xor %ebx, %eax
push %eax
jmp next
# add two unsigned numbers, returning sum and carry.
# ( u1 u2 -- u3 cy )
defasm umplus
xor %ecx, %ecx
pop %eax
pop %ebx
add %ebx, %eax
rcl $1, %ecx
push %eax
push %ecx
jmp next
# Divide double-precision by single-precision, unsigned.
# UM/MOD from eForth. ( udl udh un -- ur uq )
defasm divmod
pop %ebx
pop %edx
pop %eax
idiv %ebx
push %edx
push %eax
jmp next
# Copy the top of the return stack onto the data stack.
# May be the traditional Forth word "I".
defasm r_at
push (%ebp)
jmp next
# syscall5:
# Linux system call with up to 5 arguments
# This is no longer the fashionable way to make system calls
# in Linux. Now you're supposed to use SYSENTER on newer
# CPUs, and rather than have you figure out which one to use,
# the kernel mmaps a chunk of code called a VDSO into your
# memory space at a random address and tells you where to
# find it using the ELF auxiliary vector. Then you're
# supposed to invoke the dynamic linker or something to parse
# the ELF executable mysteriously manifested in this way by
# the kernel, and then resolve an undefined symbol in libc
# into calls to it. See "What is linux-gate.so.1?"
# http://www.trilithium.com/johan/2005/08/linux-gate/
# "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01
# http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html
# "About ELF Auxiliary Vectors" by Manu Garg
# http://manugarg.googlepages.com/aboutelfauxiliaryvectors
# But the old int $0x80 approach still works, thank goodness,
# because all of that is *way* more than these ten
# instructions.
defasm syscall5
pop %edi
## we have to save %esi for the interpreter
mov %esi, -4(%ebp)
pop %esi
pop %edx
pop %ecx
pop %ebx
pop %eax
int $0x80
push %eax
mov -4(%ebp), %esi
jmp next
last_asm_bytecode = b_syscall5
### Basic Interpreted Words
## a macro for defining interpreted words
## Because after I left off b_exit once, I wasted a long
## time trying to figure out what was wrong
.macro def name, bytes:vararg
defbytes \name
.byte \bytes
.byte b_exit
.endm
.data 2 # separate subsection from token table
bytecode_start:
# System call with three arguments.
def syscall3, b_zero, b_zero, b_syscall5
# System call with one argument.
def syscall1, b_zero, b_zero, b_syscall3
def bye, b_dolit_s8,__NR_exit, b_zero, b_syscall1 # exit program
def zero, b_dolit_s8,0 # push 0
def one, b_dolit_s8,1
# This word outputs a string whose address and count are on
# the stack. ( b u -- )
defbytes type
.byte b_rpush, b_rpush # move two args onto rstack
# system call is __NR_write:
.byte b_dolit_s8,__NR_write
.byte b_one # push constant 1: stdout
.byte b_rpop, b_rpop # move two args back from rstack
.byte b_syscall3 # call syscall with 3 args
.byte b_drop # discard return value
.byte b_exit # return
# The next few words exist just to poke string addresses
# and lengths onto the stack so "type" can print them.
.macro make_counted_string name, contents
defbytes \name
.byte b_dolit_32 # dolit_32 pushes a 32-bit
.int string_\name # literal --- an addr, here
# now push literal length and return
.byte b_dolit_s8,string_length_\name, b_exit
.pushsection .rodata # define the actual string:
string_\name:
.ascii "\contents"
## Here we count the length of the string --- computers
## are for counting bytes so people don't have to!
string_length_\name = . - string_\name
.popsection
.endm
make_counted_string hello, "hello"
make_counted_string world, "world"
make_counted_string comma, ", "
make_counted_string newline, "\n"
### Some More Basic Words
def neg1, b_dolit_s8, -1 # ( -- -1 )
def add, b_umplus, b_drop # ( a b -- a+b ) drop the carry
def sub1, b_neg1, b_add # ( n -- n-1 )
def rot, b_rpush, b_swap, b_rpop, b_swap # ( a b c -- b c a )
def unrot, b_rot, b_rot # ( a b c -- c a b )
def tuck b_dup, b_unrot # ( a b -- b a b )
# emit: output a single byte. eForth calls this "TX!".
# This version is 11 bytes, including the buffer byte, plus the 2-byte
# token table pointer. a machine-code version I wrote the other day
# was 28 bytes. However, I also added rot, unrot, and tuck to support
# this function, and they total 11 bytes, plus 6 bytes of overhead.
# For a total of 11+2+11+6 = 30 bytes. Not winning yet on size over
# x86 asm! But we're getting close.
emit_buffer:
.byte 0
defbytes emit
.byte b_dolit_32
.int emit_buffer
.byte b_tuck # save a copy of address for b_type
.byte b_c_bang # store into emit buffer
.byte b_one, b_type, b_exit # output one-byte buffer
### "u." prints out an unsigned number.
# I had a version of this in x86 machine code in 52 bytes (23
# instructions), essentially exactly the same code as here.
# This is 31 bytes, plus 6 bytes of overhead, plus I had
# to define b_divmod (9 bytes plus 2 bytes overhead). Now we are
# starting to win!
defbytes udot # print space after number
.byte b_udot_nospc, b_dolit_s8, 0x20, b_emit, b_exit
defbytes udot_nospc # print number without space
.byte b_dup, b_branch_on_0,2, b_udot_nonzero, b_exit
.byte b_drop, b_dolit_s8, '0, b_emit, b_exit
defbytes udot_nonzero
.byte b_zero, b_dolit_s8,10, b_divmod # divide by 10
.byte b_dup, b_branch_on_0,3, b_udot_nonzero # recurse if nonzero
.byte b_branch,1 # else
.byte b_drop # drop zero quotient
.byte b_dolit_s8, '0, b_add, b_emit # print digit
.byte b_exit
### Add signed numeric output, ".". This cost 20 bytes plus 8 bytes
# of overhead, but added some fundamental numeric operations; only 12
# of those 28 bytes are specific to "."
# logical not: return true for 0, false (0) otherwise
#def zeq, b_branch_on_0,2, b_zero, b_exit, b_neg1
# logical bitwise not
def invert, b_dolit_s8, -1, b_xor
def add1, b_one, b_add
# arithmetic negation
def negate, b_invert, b_add1
# print signed number
defbytes dot
.byte b_dup, b_negative, b_branch_on_0,4
.byte b_dolit_s8, '-, b_emit, b_negate # in the negative case
.byte b_udot, b_exit
### Obviously the next thing to do is to add ".S", print the
# stack, so that I can stop having to investigate problems by
# using gdb.
More information about the Kragen-hacks
mailing list