Squeezing a Compiler Into Only 446 Bytes

A more in depth look at the precise code tricks used to fit a whole compiler in 446 bytes of assembly code.

This post is Part 2 of a series of posts about my project to create a C compiler within a boot sector. The code for the project is on github.

In Part 1, I described the architecture of my compiler and how several of the design choices were to save space. In this post I go into detail about the more notable space savings tricks (that I can remember, since I did not document them as I went). This post will require an understanding of x86 assembly, but as a quick refresher:

There are 7 general purpose registers that I can trivially use: ax, cx, dx, bx, bp, si, di. The ordering of the registers is intentional, it's not just the letters a, b, c, d, but "accumulator", "data", "count", and "base", and that order is how they are represented in x86.

The sp register is also technically a general purpose register, but I need it to remain mostly untouched for using push, pop, call, and ret in my code.

The si and di registers also have designated purposes in my code. si stores the current pointer into the C source code, so it can be read with lodsb, which is only one byte, and di stores the pointer to the next byte of machine code to emit, so that the stos family of instructions can be used to emit code with fewer bytes.

This leaves 5 registers for me to play with. Only bx and bp can be used as a pointer, which sometimes influences register shuffling, but did not come into play much in my compiler.

All of these registers are 16 bits large, and I can access the high 8 bits or low 8 bits independently if needed. I technically can access 32 bit versions of the registers and do 32 bit operations, but that takes up significantly more code space and I only have 446 bytes of code to work with, so I avoid doing it whenever possible.

Saving Space

;; clear ident->addr map
xor ax, ax
mov cx, 0x8000 ; write 0x8000 dwords = 0x2_0000 bytes
mov di, 0xFFFF ; |
inc edi        ; | start pointer is 0x1_0000
a32 rep stosd

mov si, ax     ; read source from fs:0000

The region of memory at addresses 0x1_0000 through 0x2_FFFF is used to store the pointer of generated variables and functions. It must be cleared before use, as memory is not necessarily zeroed. The stosd instruction stores a 4 byte value specified in eax into the memory pointed to by edi and increments edi by 4. The rep prefix means "repeat this instruction ecx times", and the a32 prefix is used because this needs 32 bits of address to specify the right pointer. We zeroed ax above to store zeroes. The 16-bit move of 0xFFFF into di and then 32-bit increment of edi gets 0x1_0000 into the register in one fewer byte than mov edi, 0x1_0000.

Additionally, while a 0 value exists in a convenient register, it gets put into si for later use.

mov dx, COM1_PORT + 3
push dx

We use the COM1_PORT + 3 port in the future, so we push a copy of this value to the stack for later use. a push and pop pair on a 16-bit general purpose register is only two bytes of instructions, where as another mov dx, COM1_PORT + 3 is three bytes.

mov dl, COM1_PORT & 0xFF

All of the ports we need are within the range 0x3F8 through 0x3FD, so the top byte of the port number can stay the same. This saves one byte on this move.

inc dx

The next port to write to is simply one higher than the previous port, so using inc allows using only one byte instead of three.

dec dx ; PORT + 2

In the code that loads the C program to compile from the serial port, I use the same trick of only modifying the low byte of the dx register to save two bytes in total. Unfortunately there's not much else to be done here for space savings.

In the main code of the compiler there's a few tricks to save space that come from the design of the compiler. There are only two real types that my compiler knows about, int and int*. Because there are only two types, there only needs to be one bit to distinguish the type. I put that bit in the address of the variable or function. Anything with type int* get assigned an address that is odd, and anything else gets assigned an address that is even. This also lets me easily accept void on functions, by ignoring the type completely, since functions don't have return values.

stosb          ; | align to 2 bytes
and di, 0xFFFE ; | (note: this is 3 bytes, assembler shortens it)
call next_token
; token can only be int, int*, or void here
cmp al, (TokenKind.INT_PTR & 0xFF)
jne ._no_ptr
stosb ; increment by 1 to mark as int ptr
._no_ptr:

The first two instructions align the address to two bytes, increasing if needed. The and cannot be used on its own because the address may be odd and have something meaningful at that odd address. The cmp checks whether the type of a function or variable is int*, and if it is, increments the address again to make it odd.

The cmp in this code does a trick that we will see several times throughout this code. When only a specific set of tokens are allowed, instead of comparing all 16 bits of the token to an expected constant, we can compare only the low 8 bits, if they're unique across the allowed tokens. In this case, only int (token 0x18F4), int* (token 0xF982), and void (token 0x2C7A) are allowed to begin a function or variable declaration. Since the low bytes of those three tokens are unique, the compiler can save one byte by comparing only the low byte.

; token can only be a (){ or ;
; which have different low bytes
cmp al, (TokenKind.FN_ARGS & 0xFF)

Jumping all the way to next_token, there's some tricky math designed to optimize for space when calculating the index of this variable in the massive address map that was cleared at the start of the code.

; set gs to correctly address the highest nibble of the index table
; this is either 0x1000 for the low half, or 0x2000 for the high half
xor cx, cx   ; clears the carry to ensure a 0 gets rotated in
             ; and makes sure that cx is 0
rcl bx, 1    ; rotate the high bit of bx into the carry register
rcr ch, 4    ; cx holds 0x0000 or 0x1000
add ch, 0x10 ; 0x1000 or 0x2000
mov gs, cx
; we don't restore bx here, because we need to multiply by 2 anyway
; this forms a 17 bit address with a constant offset of 0x1_0000
mov cx, word gs:[bx]

We need to form a 17 bit address effectively, which requires setting the gs segment register to 0x1000 or 0x2000 to select 0x1_XXXX or 0x2_XXXX respectively. By rotating the high bit of bx into the carry flag and then back out into cx, we can easily create the values that we need. To save some space, that add is on ch instead of cx, since only the high nibble needs to be changed.

The function _block handles parsing and generating a block of code within a { } pair. This is used for functions, if, and while, and can be called recursively to handle nested blocks. It does some very weird control flow to save space, typically by using a conditional jump instead of a jump and call. However, this means that the callers need to be able to jump back to the loop. For a couple functions, this is done by jumping back directly, but for most functions a fake return address is pushed to the stack so that they can ret to the correct location.

Comments are only allowed within a block, at the start of a statement. This saves some space in the code that gets tokens so that it doesn't have to keep track of the previous character in the token. Additionally, only single line comments exist, multi-line comments are not valid.

Actually emitting code is done in a weird way. The stosb and stosw instructions are used to emit a vast majority of code. They store one or two bytes (respectively) to the address specified in di and then increment di by the number of bytes written. This means that sequential stosb simply write bytes in order. There's also some weird code that I called mov_bx_action that is used extensively in code generation.

; emits a sequence
; mov bx, <CONST>
; <2 BYTES>
; cx must hold the constant to write to bx and dx must hold the next 2 bytes to write
; this is a common enough pattern that it's worth it
; clobbers ax, exchanges cx and dx
mov_bx_action:
    mov al, 0xBB ; 0xBB encodes `mov bx, ...`
    stosb
    xchg ax, cx ; constant to load into bx
    stosw
    .shared:
    xchg ax, dx ; action to happen after
    stosw
    .end:
    ret

It is used to emit a move of a constant (normally an address) into bx, and then emits two more bytes afterwards. This was a common enough pattern to be worth a dedicated function to deduplicate. I also use the end of this function as a "write dx into code", by calling the xchg ax, dx line.

When generating code for expressions, the compiler has to be able to tell the difference between a number literal and a variable to generate the correct code. This is the difference between 1 + 1 and some_variable + 1. The former only needs to add two constants where as the latter has to look up the value of the variable at runtime.

Because my compiler has no way to differentiate between constant numbers and identifiers, determining the correct code isn't as trivial. If an identifier has a zero entry in the big map of all known variables, the number of the identifier is used as a literal. Otherwise, it's considered to be a variable and that address is used for looking up the value. To save a couple bytes, the jcxz instruction is used, which jumps if cx is zero.

To generate a binary operation, the compiler has to look up the appropriate code based on the binary operator. Because the compiler knows the operator must be one of the binary operators, it can do the same trick with comparing only the low byte of the identifier to find the correct one. The code to generate is partially stored in a large table that I am not satisfied with, but has been the most compact way to do this that I have found so far.

db TokenKind.PLUS & 0xFF
  db 0x01, 0xC8 ; add ax, cx
db TokenKind.MINUS & 0xFF
  db 0x29, 0xC8 ; sub ax, cx
db TokenKind.OR & 0xFF
  db 0x09, 0xC8 ; or  ax, cx
db TokenKind.AND & 0xFF
  db 0x21, 0xC8 ; and ax, cx
db TokenKind.SHL & 0xFF
  db 0xD3, 0xE0 ; shl ax, cl
db TokenKind.SHR & 0xFF
  db 0xD3, 0xE8 ; shr ax, cl
._binop_cmp_start:
db TokenKind.EQUAL_EQUAL & 0xFF
  db 0x94, 0xC0 ; sete last two bytes
db TokenKind.NOT_EQUAL & 0xFF
  db 0x95, 0xC0 ; setne last two bytes
db TokenKind.LESS & 0xFF
  db 0x9C, 0xC0 ; setl last two bytes

The last 3 entries in this table are special in that they are only part of a sequence of bytes to emit. The code to emit for these conditions is actually cmp ax, cx; setCC al. This compares ax to cx and then sets al to 1 if some condition CC is met, otherwise sets it to 0. The table of bytes includes the byte that changes between the different setCC instructions, and then the code that uses these bytes knows that if the found operator is at least at address _binop_cmp_start, it's a comparison, and to emit the extra bytes needed.

The rest of the space savings are general micro-optimizations. Proper management of registers to avoid having to save and restore values, use of xor reg, reg to zero a register, fiddling with jumps to optimize space, and other such 1 byte optimizations.

If anyone knowledgeable about 16 bit x86 assembly has any ideas for saving space, feel free to open an issue/pr on github! It is getting harder and harder to find space for improvements or bug fixes. Notably, the result of comparison/equality operators is not 0 or 1, it's actually maintaining the high byte of ax from the left hand side of the operator. This is fine for if and while since they just check for non-zero, but means that the result of the comparison can't reliably be used as a value.