A more in depth look at the precise code tricks used to fit a whole compiler in 446 bytes of assembly code.
This post is Part 2 of a series of posts about my project to create a C compiler within a boot sector. The code for the project is on github.
In Part 1, I described the architecture of my compiler and how several of the design choices were to save space. In this post I go into detail about the more notable space savings tricks (that I can remember, since I did not document them as I went). This post will require an understanding of x86 assembly, but as a quick refresher:
There are 7 general purpose registers that I can
trivially use: ax, cx, dx, bx, bp, si, di
. The
ordering of the registers is intentional, it's not just the
letters a, b, c, d, but "accumulator", "data", "count", and
"base", and that order is how they are represented in
x86.
The sp
register is also technically a
general purpose register, but I need it to remain mostly
untouched for using push
, pop
,
call
, and ret
in my code.
The si
and di
registers also
have designated purposes in my code. si
stores
the current pointer into the C source code, so it can be
read with lodsb
, which is only one byte, and
di
stores the pointer to the next byte of
machine code to emit, so that the stos
family
of instructions can be used to emit code with fewer
bytes.
This leaves 5 registers for me to play with. Only
bx
and bp
can be used as a
pointer, which sometimes influences register shuffling, but
did not come into play much in my compiler.
All of these registers are 16 bits large, and I can access the high 8 bits or low 8 bits independently if needed. I technically can access 32 bit versions of the registers and do 32 bit operations, but that takes up significantly more code space and I only have 446 bytes of code to work with, so I avoid doing it whenever possible.
The space saving tricks start only a few instructions in:
;; clear ident->addr map
xor ax, ax
mov cx, 0x8000 ; write 0x8000 dwords = 0x2_0000 bytes
mov di, 0xFFFF ; |
inc edi ; | start pointer is 0x1_0000
a32 rep stosd
mov si, ax ; read source from fs:0000
The region of memory at addresses 0x1_0000
through 0x2_FFFF
is used to store the pointer
of generated variables and functions. It must be cleared
before use, as memory is not necessarily zeroed. The
stosd
instruction stores a 4 byte value
specified in eax
into the memory pointed to by
edi
and increments edi
by 4. The
rep
prefix means "repeat this instruction
ecx
times", and the a32
prefix is
used because this needs 32 bits of address to specify the
right pointer. We zeroed ax
above to store
zeroes. The 16-bit move of 0xFFFF
into
di
and then 32-bit increment of
edi
gets 0x1_0000
into the
register in one fewer byte than mov edi, 0x1_0000
.
Additionally, while a 0 value exists in a convenient
register, it gets put into si
for later
use.
In the serial port initialization code, a few tricks are used.
mov dx, COM1_PORT + 3
push dx
We use the COM1_PORT + 3
port in the future,
so we push a copy of this value to the stack for later use.
a push
and pop
pair on a 16-bit general purpose register is only two bytes
of instructions, where as another mov dx, COM1_PORT + 3
is three bytes.
mov dl, COM1_PORT & 0xFF
All of the ports we need are within the range
0x3F8
through 0x3FD
, so the top
byte of the port number can stay the same. This saves one
byte on this move.
inc dx
The next port to write to is simply one higher than the
previous port, so using inc
allows using only one byte instead of three.
dec dx ; PORT + 2
Similar to inc
,
dec
is also only one byte.
In the code that loads the C program to compile from the
serial port, I use the same trick of only modifying the low
byte of the dx
register to save two bytes in
total. Unfortunately there's not much else to be done here
for space savings.
In the main code of the compiler there's a few tricks to
save space that come from the design of the compiler. There
are only two real types that my compiler knows about,
int
and int*
. Because there are
only two types, there only needs to be one bit to
distinguish the type. I put that bit in the address of the
variable or function. Anything with type int*
get assigned an address that is odd, and anything else gets
assigned an address that is even. This also lets me easily
accept void
on functions, by ignoring the type
completely, since functions don't have return values.
stosb ; | align to 2 bytes
and di, 0xFFFE ; | (note: this is 3 bytes, assembler shortens it)
call next_token
; token can only be int, int*, or void here
cmp al, (TokenKind.INT_PTR & 0xFF)
jne ._no_ptr
stosb ; increment by 1 to mark as int ptr
._no_ptr:
The first two instructions align the address to two
bytes, increasing if needed. The and
cannot be used on its own because the address may be odd and
have something meaningful at that odd address. The cmp
checks whether the type of a function or variable is
int*
, and if it is, increments the address
again to make it odd.
The cmp
in this code does a trick that we will see several times
throughout this code. When only a specific set of tokens are
allowed, instead of comparing all 16 bits of the token to an
expected constant, we can compare only the low 8 bits, if
they're unique across the allowed tokens. In this case, only
int
(token 0x18F4
),
int*
(token 0xF982
), and
void
(token 0x2C7A
) are allowed to
begin a function or variable declaration. Since the low
bytes of those three tokens are unique, the compiler can
save one byte by comparing only the low byte.
This trick is used again after getting the name of the function or variable:
; token can only be a (){ or ;
; which have different low bytes
cmp al, (TokenKind.FN_ARGS & 0xFF)
Jumping all the way to next_token
, there's
some tricky math designed to optimize for space when
calculating the index of this variable in the massive
address map that was cleared at the start of the code.
; set gs to correctly address the highest nibble of the index table
; this is either 0x1000 for the low half, or 0x2000 for the high half
xor cx, cx ; clears the carry to ensure a 0 gets rotated in
; and makes sure that cx is 0
rcl bx, 1 ; rotate the high bit of bx into the carry register
rcr ch, 4 ; cx holds 0x0000 or 0x1000
add ch, 0x10 ; 0x1000 or 0x2000
mov gs, cx
; we don't restore bx here, because we need to multiply by 2 anyway
; this forms a 17 bit address with a constant offset of 0x1_0000
mov cx, word gs:[bx]
We need to form a 17 bit address effectively, which
requires setting the gs
segment register to
0x1000
or 0x2000
to select
0x1_XXXX
or 0x2_XXXX
respectively.
By rotating the high bit of bx
into the carry
flag and then back out into cx
, we can easily
create the values that we need. To save some space, that
add
is on ch
instead of cx
, since only
the high nibble needs to be changed.
The function _block
handles parsing and
generating a block of code within a { }
pair.
This is used for functions, if
, and
while
, and can be called recursively to handle
nested blocks. It does some very weird control flow to save
space, typically by using a conditional jump instead of a
jump and call
.
However, this means that the callers need to be able to jump
back to the loop. For a couple functions, this is done by
jumping back directly, but for most functions a fake return
address is pushed to the stack so that they can ret
to the correct location.
Comments are only allowed within a block, at the start of a statement. This saves some space in the code that gets tokens so that it doesn't have to keep track of the previous character in the token. Additionally, only single line comments exist, multi-line comments are not valid.
Actually emitting code is done in a weird way. The stosb
and stosw
instructions are used to emit a vast majority of code. They
store one or two bytes (respectively) to the address
specified in di
and then increment
di
by the number of bytes written. This means
that sequential stosb
simply write bytes in order. There's also some weird code
that I called mov_bx_action
that is used
extensively in code generation.
; emits a sequence
; mov bx, <CONST>
; <2 BYTES>
; cx must hold the constant to write to bx and dx must hold the next 2 bytes to write
; this is a common enough pattern that it's worth it
; clobbers ax, exchanges cx and dx
mov_bx_action:
mov al, 0xBB ; 0xBB encodes `mov bx, ...`
stosb
xchg ax, cx ; constant to load into bx
stosw
.shared:
xchg ax, dx ; action to happen after
stosw
.end:
ret
It is used to emit a move of a constant (normally an
address) into bx
, and then emits two more bytes
afterwards. This was a common enough pattern to be worth a
dedicated function to deduplicate. I also use the end of
this function as a "write dx
into code", by
calling the xchg ax, dx
line.
When generating code for expressions, the compiler has to
be able to tell the difference between a number literal and
a variable to generate the correct code. This is the
difference between 1 + 1
and
some_variable + 1
. The former only needs to add
two constants where as the latter has to look up the value
of the variable at runtime.
Because my compiler has no way to differentiate between
constant numbers and identifiers, determining the correct
code isn't as trivial. If an identifier has a zero entry in
the big map of all known variables, the number of the
identifier is used as a literal. Otherwise, it's considered
to be a variable and that address is used for looking up the
value. To save a couple bytes, the jcxz
instruction is used, which jumps if cx
is
zero.
To generate a binary operation, the compiler has to look up the appropriate code based on the binary operator. Because the compiler knows the operator must be one of the binary operators, it can do the same trick with comparing only the low byte of the identifier to find the correct one. The code to generate is partially stored in a large table that I am not satisfied with, but has been the most compact way to do this that I have found so far.
db TokenKind.PLUS & 0xFF
db 0x01, 0xC8 ; add ax, cx
db TokenKind.MINUS & 0xFF
db 0x29, 0xC8 ; sub ax, cx
db TokenKind.OR & 0xFF
db 0x09, 0xC8 ; or ax, cx
db TokenKind.AND & 0xFF
db 0x21, 0xC8 ; and ax, cx
db TokenKind.SHL & 0xFF
db 0xD3, 0xE0 ; shl ax, cl
db TokenKind.SHR & 0xFF
db 0xD3, 0xE8 ; shr ax, cl
._binop_cmp_start:
db TokenKind.EQUAL_EQUAL & 0xFF
db 0x94, 0xC0 ; sete last two bytes
db TokenKind.NOT_EQUAL & 0xFF
db 0x95, 0xC0 ; setne last two bytes
db TokenKind.LESS & 0xFF
db 0x9C, 0xC0 ; setl last two bytes
The last 3 entries in this table are special in that they
are only part of a sequence of bytes to emit. The code to
emit for these conditions is actually cmp ax, cx; setCC al
.
This compares ax
to cx
and then
sets al
to 1 if some condition CC
is met, otherwise sets it to 0
. The table of
bytes includes the byte that changes between the different
setCC
instructions, and then the code that uses
these bytes knows that if the found operator is at least at
address _binop_cmp_start
, it's a comparison,
and to emit the extra bytes needed.
The rest of the space savings are general
micro-optimizations. Proper management of registers to avoid
having to save and restore values, use of xor reg, reg
to zero a register, fiddling with jumps to optimize space,
and other such 1 byte optimizations.
If anyone knowledgeable about 16 bit x86 assembly has any
ideas for saving space, feel free to open an issue/pr on
github! It is getting harder and harder to find space for
improvements or bug fixes. Notably, the result of
comparison/equality operators is not 0 or 1, it's actually
maintaining the high byte of ax from the left hand side of
the operator. This is fine for if
and
while
since they just check for non-zero, but
means that the result of the comparison can't reliably be
used as a value.