"tweetable" "symbolic" hex COM loader
Kragen Javier Sitaker
kragen at canonical.org
Mon May 21 01:11:18 EDT 2012
On Wed, May 09, 2012 at 06:39:25PM +0200, Dave Long wrote:
> (hey Kragen, how did the bytebeat performances go?)
It went well! Although the Gameboy guy before me, running LSDJ, was a really
tough act to follow. I uploaded a recording to
<http://canonical.org/~kragen/bytebeat/la-cigale.wav.gz>, and I uploaded the
software to <https://github.com/kragen/pytebeat>.
> Apropos the bootstrapping thread[0], here's another hex loader:
>
> 0000000: 31c9 bf00 03ba 8a01 b40a cd21 a18b 013c 1..........!...<
> 0000010: 047c 17b8 0001 01c8 bb00 0202 1e8c 0189 .|..............
> 0000020: 07be 8c01 a5a5 9041 ebdb 31c0 a320 0289 .......A..1.. ..
> 0000030: cd31 c9be 0003 bf00 02bb 0100 31c9 31d2 .1..........1.1.
> 0000040: b800 42cd 21b8 0040 cd21 ac31 d231 c0ac ..B.!.. at .!.1.1..
> 0000050: 3c20 740f bb01 0101 cb29 da01 f889 c38b < t......)......
> 0000060: 0701 c231 c0ac 0c20 d410 d503 2c09 c0e0 ...1... ....,...
> 0000070: 0401 c2ac 0c20 d410 d503 2c09 01c2 9090 ..... ....,.....
> 0000080: b402 cd21 4139 e975 c1c3 5000 ...!A9.u..P.
Here's the disassembly, for my benefit and for whoever else is reading this.
kragen at VOSTRO9:~/devel$ objdump -m i8086 -b binary --adjust-vma=0x100 -D loader.com
loader.com: file format binary
Disassembly of section .data:
00000100 <.data>:
100: 31 c9 xor %cx,%cx
102: bf 00 03 mov $0x300,%di
105: ba 8a 01 mov $0x18a,%dx
108: b4 0a mov $0xa,%ah
10a: cd 21 int $0x21
10c: a1 8b 01 mov 0x18b,%ax
10f: 3c 04 cmp $0x4,%al
111: 7c 17 jl 0x12a
113: b8 00 01 mov $0x100,%ax
116: 01 c8 add %cx,%ax
118: bb 00 02 mov $0x200,%bx
11b: 02 1e 8c 01 add 0x18c,%bl
11f: 89 07 mov %ax,(%bx)
121: be 8c 01 mov $0x18c,%si
124: a5 movsw %ds:(%si),%es:(%di)
125: a5 movsw %ds:(%si),%es:(%di)
126: 90 nop
127: 41 inc %cx
128: eb db jmp 0x105
12a: 31 c0 xor %ax,%ax
12c: a3 20 02 mov %ax,0x220
12f: 89 cd mov %cx,%bp
131: 31 c9 xor %cx,%cx
133: be 00 03 mov $0x300,%si
136: bf 00 02 mov $0x200,%di
139: bb 01 00 mov $0x1,%bx
13c: 31 c9 xor %cx,%cx
13e: 31 d2 xor %dx,%dx
140: b8 00 42 mov $0x4200,%ax
143: cd 21 int $0x21
145: b8 00 40 mov $0x4000,%ax
148: cd 21 int $0x21
14a: ac lods %ds:(%si),%al
14b: 31 d2 xor %dx,%dx
14d: 31 c0 xor %ax,%ax
14f: ac lods %ds:(%si),%al
150: 3c 20 cmp $0x20,%al
152: 74 0f je 0x163
154: bb 01 01 mov $0x101,%bx
157: 01 cb add %cx,%bx
159: 29 da sub %bx,%dx
15b: 01 f8 add %di,%ax
15d: 89 c3 mov %ax,%bx
15f: 8b 07 mov (%bx),%ax
161: 01 c2 add %ax,%dx
163: 31 c0 xor %ax,%ax
165: ac lods %ds:(%si),%al
166: 0c 20 or $0x20,%al
168: d4 10 aam $0x10
16a: d5 03 aad $0x3
16c: 2c 09 sub $0x9,%al
16e: c0 e0 04 shl $0x4,%al
171: 01 c2 add %ax,%dx
173: ac lods %ds:(%si),%al
174: 0c 20 or $0x20,%al
176: d4 10 aam $0x10
178: d5 03 aad $0x3
17a: 2c 09 sub $0x9,%al
17c: 01 c2 add %ax,%dx
17e: 90 nop
17f: 90 nop
180: b4 02 mov $0x2,%ah
182: cd 21 int $0x21
184: 41 inc %cx
185: 39 e9 cmp %bp,%cx
187: 75 c1 jne 0x14a
189: c3 ret
18a: 50 push %ax
...
> The main advantage of keying in the extra 100 bytes is to gain
> branch relocation with (mono-)symbolic labels, facilitating
> alternations and repetitions, and hence paving the way for real (or
> at least multiple character) symbol tables and automated handling of
> other relocations[1]. Its format is fairly restrictive[2], with the
> first 4 characters of each line defining an output byte:
> 0: defines a label
> 1: calculates a pc-relative reference to a label
> 2: high nybble value
> 3: low nybble value
That's pretty impressive for 140 bytes! And probably a much more practical
approach than hand-assembling the next stage of everything on paper before
entering it into the loader. Octal might be more practical than hex, though.
It does sort of assume you already have an interactive text editor, though.
(Or, say, a paper tape reader, a knife, and Scotch tape.) But, if you had to
write an interactive editor, maybe it would make sense to write a binary editor
first, rather than a text editor? That was the approach I was taking in that
undebugged 78-byte editor in the tail end of my original post.
> As an example, here's a load file for a variant of fr-16:
> b8
> 13
> 00 # mov ax,13
> cd
> 10 # int 10
> c4
> 2f # les bp,[bx]
> l aa # loop: stosb
> 11
> f8 # adc ax,di
> 15
> 32
> 11 # adc ax,1132
> eb
> l00 # jmp loop
> (or see the footnotes for the "source"[3])
Very nice example. Did you come across a variant of DOS where AX was nonzero
at startup?
> In principle, there's one branch too many[4] in this code; it should
> be possible to run the second pass entirely in straight-line code.
> On the other hand, 15 additional bytes and slight modifications to
> the input format suffice to provide absolute relocations without any
> additional logic...
:)
The global assembler label is a remarkably powerful tool. I wonder, though, if
it might be a bit too powerful; perhaps the Forth approach (two separate
stack-based mechanisms for backward and forward jumps, plus an updatable symbol
table for calls to earlier-defined routines, giving each symbol a scope from
its own definition until the next definition of the same name) is actually
better.
The code to handle those constructs in Stoneknifeforth, which also uses
single-letter symbols, is as follows. First, the code for the calls to
earlier-defined routines (and variables):
: Type Four* header 6144 + + ; ( Table of definition Types: 1=code, 2=data )
: Addr Type 1024 + ; ( table of definition Addresses )
( register a colon definition )
: Colon %flush dup Type 1 xchg ! HERE @ xchg Addr ! ;
( register a variable definition )
: Var %flush dup Type 2 xchg ! HERE @ xchg Addr ! ;
( Stores word on top of stack at HERE, praise be for non-aligned access. )
: Encode %flush HERE @ ! 4 forward ;
: Lit 80 . ( push %eax ) 184 . ( load immediate %eax ) Encode ; ( literal)
( Compiles a user-defined subroutine or variable, using Type and Addr )
: &dispatch dup Type @ 1 = [ ( if it’s a subroutine...)
Restack 232 . ( CALL instruction; now we have to compute the offset )
HERE @ 4 + ( find address of next instruction )
xchg Addr @ xchg - ( find offset to subroutine )
Encode
Restack ; ] ( okay, and otherwise, it’s a variable... )
Addr @ addr Lit ;
("Restack" here switches between the operand stack and the return stack. As an
optimization, it has a single-instruction peephole buffer so that Restack
Restack is a no-op, and %flush flushes that buffer. So this code could be even
simpler.)
Then jumps forward:
( JZ: compile a forward conditional jump )
: JZ 133 . 192 . ( test %eax, %eax ) 88 . ( pop %eax )
116 . HERE @ 0 . ( jz 0 ) ;
( Patch: resolve a forward conditional jump by backpatching )
( 1 - is needed because the saved address is one byte inside the jnz
instruction, but the offset should be calculated from the beginning of the
next instruction )
: Patch dup %flush HERE @ xchg - 1 - xchg store ;
Then jumps backward:
( jnz: compile a backward conditional jump )
: jnz 133 . 192 . 88 . 117 . H @ - 1 - . ( jnz whatever ) ;
The forward and backward jumps are invoked by these cases in the Dispatch
routine:
dup '[ = [ pop JZ ; ] ( [ compiles a conditional jump,
pushing an address for backpatching )
dup '] = [ pop Patch ; ] ( backpatches the address on the compile stack )
dup '{ = [ pop %flush HERE @ ; ] ( merely pushes an address to jump back to )
dup '} = [ pop jnz ; ] ( compiles a conditional jump to that address )
It seems to me like the extra complexity of having those jumps use a stack
rather than the symbol table is almost nil (like maybe one line of code), and
it means that the symbol table doesn't have to support forward references, and
can also handle multiple definitions with the same name.
(I really should get around to making SKF run on modern Linuxes again. There's
some kind of corner it's cutting in ELF generation that causes the binaries it
generates to no longer load, and the ELF loader is not forthcoming with error
messages...)
Barely relatedly, linking is apparently a progressively bigger bottleneck in
builds of big C and even C++ projects, because you can compile each source file
on a separate core, but then the linker has to deal with the entire corpus at
once. It occurred to me that a mapreduce approach to linking ought to be quite
parallelizable:
1. Map: Break your input files up into (virtual address, symbol name,
relocation type, code snippet) tuples, where snippets of literal code are
either dragged along with the last relocation before them or associated with a
null symbol;
2. Sort these by symbol name;
3. Reduce: For each set of tuples for a given symbol name, resolve the
"U" unresolved relocations using the defining relocation (or throw an error),
producing a new set of (virtual address, code snippet) tuples (augmented, I
suppose, by some metadata);
4. Sort these by virtual address;
5. Concatenate them to produce the final program.
It seems like that approach ought to provide some kind of speedup for
sufficiently large programs.
> [1] in fact, if one adds enough evaluation to an assembler it
> becomes indistinguishable from a linker ... conversely, a
> sufficiently advanced linker is indistinguishable from an assembler.
Indeed. And the reason that compilers are called "compilers" is that many of
the early compilers had mostly the job of collecting together the appropriate
set of library routines, providing the programmer a slightly higher level
instruction set for input that was essentially machine code. For some decades,
some people tried to promote the more accurate term "translators", but that
seems to have failed by 1980.
Kragen
More information about the Kragen-discuss
mailing list