Writing a Compiler in 446 Bytes

An overview of the design of the C compiler, describing a lot of the limitations that were intentional design choices for the sake of making the implementation smaller.

This post is Part 1 of a series of posts about my project to create a C compiler within a boot sector. The code for the project is on github.

home · part0 Part 1 part2


In Part 0, I talked about how x86 computers boot up from nothing, and the environment that the first code on the system gets. Part 1 will be about what I did in that limited and hostile environment and how I managed to fit a compiler (even a very bad one) in 446 bytes.

The Code

This project uses nasm as the assembler of choice. No particular reason for nasm over anything else, though I was set on using Intel Syntax for the assembly, which does eliminate some choices.

The first order of business is, of course, to set up a basic file that can boot, even if it does nothing.

BITS 16 ; real mode is 16 bit code
org 0x7C00 ; BIOS drops us at this address

main:
    jmp $

TIMES 0x1BE - ($ - $$) db 0x00

; THIS WOULD NORMALLY BE THE MBR
; REMOVED FOR SIMPLICITY (and qemu doesn't need it)
TIMES 64 db 0x00

dw 0xAA55 ; boot signature

This code is the bare minimum to run. The Master Boot Record is not shown in that snippet because it doesn't really matter for our purposes. Additionally, the label main is not special. The first byte in the file will be run, so you need to be careful to never put anything before main, or have a jump to main as the first thing. The label is simply there for organizational purposes.

This code also doesn't do anything. jmp $ means "jump to the start of this line". This will repeat forever, keeping the CPU in an infinite loop. Assembling the code with nasm -w+all -w+error -f bin boot.asm -o boot.bin results in a 512 byte binary file, which is our boot sector. Running this sector with qemu-system-x86_64 -drive format=raw,file=boot.bin -s greets us with a screen, with some text that says "booting from hard disk…", and nothing happens. This is expected though, we're not doing anything. The -s flag tells qemu to allow a debugger to connect, so we can connect gdb with gdb --init-eval-command="target remote :1234". We will see that the program is indeed at 0x7C00 and if we step forward, it remains there forever.

This is what the development cycle looks like, make some changes to the asm, build it, launch qemu, and then inspect the state with a debugger to see if things are correct. Because the output of this compiler is x86 machine code bytes in memory (and due to space limitations), all debugging is done though gdb. There's not much that's useful to print to an output, and even if there were, inspecting the generated program will require looking at raw memory anyway. I am now intimately familiar with the code that my compiler generates, to the point that sometimes I don't even use a disassembler to see what's wrong, I can identify it by the patterns I expect.

If you are interested in working in this environment yourself, good luck. I find it highly interesting, but it's not a particularly useful skill these days. If you're actually interested in modern bootloaders and operating systems, UEFI is what everything uses now, which provides a much nicer 64 bit environment and standardized APIs for doing the things that a bootloader needs to do. Writing a Real Bootloader is a very different project, but one that provides similar joy in my opinion, and you might end up with something you could actually use. Piece by piece replacing every part of your computer's software stack with something you created is ideal, in my opinion, and it would be super fun to run a bootloader I wrote, but that may be a future project.

Implementing the Compiler

Note: this project is still in progress. Specific implementation details may be outdated. As of the latest update (2024-09-24), almost no work is being put into the compiler, but instead into writing programs for this compiler to run.

Input Program

The first thing that the compiler needs to do is actually get a source program. I chose to send the program over the serial port so that the binary does not have to include the source. This also makes it possible to type a program by hand, if you connect to the serial port, though there is no support for editing or backspace or moving up or down lines.

The basic code to read from the serial port looks like this

mov di, 0x6000 ; location in memory reserved for the source code

._load_program:
  mov dx, COM1_PORT + 5 ; status port
  in al, dx ; read from the status port
  and al, 0x01 ; bit 0 is set if there is a character ready
  jz ._load_program ; loop until the character is ready
  mov dx, COM1_PORT ; data port
  in al, dx ; read from the data port
  stosb ; *di = al; di += 1;
  or al, al ; exit when a 0x00 byte is read
  jnz ._load_program

There is additional code that is not shown to initialize the serial port, since it has to be told what rate to send bits and other configuration. This code simply places the input program in memory, where it will later be read by the compiler.

Tokenization

The compiler needs to split the input program into tokens. In most compilers this is a fairly complicated process to allow for optional whitespace and line breaks and other characters. Typically x=4*(x+2) should act the same as

x    =
  4 * ( x + 2 )

My compiler, of course, throws that out the window. Tokens are delimited by whitespace. All characters with ascii value 0x20 or less are considered "whitespace". If there's not a whitespace, my compiler treats it as still the same token. x=4*(x+2) is considered One Token to my compiler. This greatly simplifies the implementation of the tokenizer, at great cost to readability of code. However, since C allows lots of weird whitespace, my compiler requiring it means that this is still a subset of C. This is what some code might look like in my version of C:

int meow ;
int main (){
  meow = 20 ;
  while( 1 == 1 ){
    read_char ();
    print_char ();
  }
}

At the surface, this doesn't look so bad. However, all of those weird spaces are required. There must be whitespace:

This is surprisingly difficult to get right. Formatters will scream at you if you try to code like this, so you just have to remember the rules. Additionally, there's no error handling. At all. If the compiler expects a token, that token better exist exactly as the compiler expects. If it doesn't, all hell will break loose and you will certainly not get the output you expect (most of the time the compiler crashes and starts executing random memory).

The tokenizer is deceptively simple. In general its algorithm is:

<skip whitespace>
ident = 0
get_character()
while character > 0x20
    ident = 10 * ident + (character - 0x30)
    get_character()
return ident

This is a terrible algorithm for identifying anything. I copied the idea from another project that aimed to make a small C compiler1. The idea here is that integer literals get converted into their value, and everything else just so happens to produce a number that looks Varied Enough for use. This however does mean that it is impossible to differentiate between numbers and not numbers, for example ; and 11 have the same value. This is fortunately normally not an issue, as users will write well formed programs which always specific formats, so we never need to make that distinction. It also means that collisions can occur between names. The names get_next_cluster and set_next_cluster happen to have the same 16 bit identifier, and this has caused me pain a couple times when writing code. The best workaround that I have found is making sure the last few characters of a name are different, because they are less likely to get "pushed out" of the identifier and stop mattering. next_cluster_get and next_cluster_set are names I now use in code.

The compiler only needs to know about a few different tokens and they are as follows:

Name Value Meaning
COMMENT 0xFFF5 //
INT 0x18F4 int
INT_PTR 0xF982 int*
OPEN_PAREN 0xFFF8 (
CLOSE_PAREN 0xFFF9 )
CLOSE_BRACE 0x004D }
FN_ARGS 0xFCE5 (){
FN_CALL 0xFCA5 ();
SEMICOLON 0x000B ;
IF_START 0x1858 if(
WHILE_START 0xDA02 while(
ASM_START 0x973E asm("
DOT_BYTE 0x9491 .byte
ASM_END 0xFA4D ");
STAR 0xFFFA *
AND 0xFFF6 &
RETURN 0x7DA7 return;
PLUS 0xFFFB +
MINUS 0xFFFD -
OR 0x004C |
XOR 0x002E ^
SHL 0x0084 <<
SHR 0x009A >>
EQUAL_EQUAL 0x008F !=
NOT_EQUAL 0xFF77 ==
LESS 0x000C <

Some notes on these tokens:

The grammar for this implementation of C is also very limited. Function calls can only appear alone, as a statement (since they can't return a value). Declaring and assigning to variables must be separate, there's no variable initializers. Only global variables exist. Local variables require a surprising amount of implementation work to do with scopes and the stack, so you can only declare global variables. The lack of local variables is possibly one of the worst "features" of this language. Declaring variables away from their usage makes it easier to misuse them, and leads to awful conventions to prevent collisions of names, like int _read_char_state ; (this variable really wants to be a local named state).

Compiler Architecture

Several registers have meanings that are consistent throughout the compiler:

The compiler processes tokens by getting tokens sequentially and using recursive descent. It starts by looking for top level declarations, which are variables and functions. The type of the declaration is read first. Variables must have type int or int* while functions may have type void, int, or int*, but should use void, since return values do not exist. If the type of a declaration is int*, the location of the declaration generated program is incremented by 1. All other declarations are aligned to 2 bytes, so this allows the compiler to determine if a variable is a pointer or not without having to store that data in a separate table. For all declarations, first an entry is placed in a location of memory that acts as an ident -> pogram location map. This map is 128KiB large, since there's 65536 possible ident numbers and each one uses two bytes to represent the location of the declaration in the generated program. To distinguish a function declaration from a variable declaration, the code looks for a (){ token after the name of the variable or function. Since functions cannot have arguments, this suffices to identify all functions.

Generating code for variables is very simple - the compiled code pointer is incremented by 2 to reserve space for the variable, and the compiler loops to check for the next top level declaration.

The rest of the compiler's code is dedicated to generating code for functions, including all the different types of statements and expressions that can be within them.

There are 7 top level statements within a function body:

Function Calls

This was actually the very first thing that I implemented within the compiler, because it's surprisingly simple! The call x86 instruction implements everything we need, since we don't have locals or arguments. There exists a call <register> form too, so the compiler just generates

mov bx, ADDRESS
call bx

To obtain the address to call, the compiler looks up the address of the function in the ident -> address map. This is a constant over the lifetime of the program, so the generated code only needs to load that constant and call it.

Assignment

Variable assignment (var = expr) and assignment through a pointer (* ptr = expr) share most of their logic, but are some of the most complicated statements. Assignment has to remember the variable that it is assigning to, then parse and codegen a complicated arbitrary expression, and then store the result of that expression in the variable, or in the case of assignment through a pointer, in the memory location pointed to. The generated code for the assignment itself is fairly simple, it assumes that there's a value in ax and then stores it to the appropriate location.

; assume expr result in ax

; variable assignment
mov bx, VARIABLE_ADDRESS
mov word [bx], ax ; store to the variable

; assignment through a pointer
mov bx, VARIABLE_ADDRESS
mov bx, word [bx] ; get the pointer's value
mov word [bx], ax ; store to that address

Expressions

A vast majority of the complexity of the compiler is in the expression parsing and codegen required for assignment. Expressions can be either unary expressions (a single Thing, like a number, a variable, or a parenthesized expression), or binary expressions, which are a unary expression followed by a binary operator, and then another unary expression.

Some notes on the implementation:

The code for parsing and generating the unary expressions on the left hand side and right hand side of the binary operator is identical. Since the generated code always writes the value of the unary expression to ax and uses bx for addresses, we need to save the left hand side of the operator on the stack so that the right hand side can generate into ax, and then restore back so that the left hand side is in ax and the right hand side is in cx. Thankfully this is only a few bytes to emit the saves and restores. By ensuring that the LHS of the expression is in ax and the RHS is in cx, the code that emits the binary operator only needs to pick a sequence of bytes depending on the operator. It might emit add ax, cx to add two values or cmp ax, cx; sete al to check if two values are equal, but it doesn't concern itself with complicated register allocation.

There are a lot of assumptions in the binary operator code that assume that the user wrote a valid binary operator. Only the last byte of the idents are checked, meaning that there are 256 different ident numbers (and an infinite number of strings) that the compiler will consider to be a binary operator. If the token provided is not a binary operator, it's highly likely that the compiler is going to enter an infinite loop.

Numbers and variables are indistinguishable. The way that the compiler knows a variable is being accessed is if an ident number has an entry in the address table. This means that certain integer literals cannot be used. However, it should be possible to use the sum or difference of integer literals to form any integer that you want. I have yet to encounter this problem in my code, so it's hopefully Unlikely Enough. This is something that cannot be improved however, not without completely redoing the way that idents are managed.

In C, doing pointer + integer adds sizeof(type) * integer bytes to the pointer. Additionally, subtracting pointers divides the result of the subtraction by sizeof(type), giving the number of elements between the pointers. These are both special cased in the compiler. If the left hand side of a binary operation is a pointer (determined by its variable address being odd) and the operator is a +, twice as much is added to the pointer. If the left hand side is a pointer and the operator is a -, the result of the addition is divided by 2. If the left hand side is a pointer and the operator is anything else, normal behavior is used, which allows ==, !=, and < to work without concern. If the left hand side of an expression is not a pointer, none of these special cases are used.

Note: When writing the first version of the previous paragraph, I actually found two different bugs in how I was handling pointers! First, I was always multiplying the right hand side by 2, meaning that ptr == ptr was actually doing ptr == (ptr * 2) Then, once I fixed that, I realized that ptr - ptr is not equivalent to ptr - (ptr * 2) and that when I tested the original fix, I happened to pick a pair of values where that was the case.

if and while

Weirdly, implementing if and while is not very complicated. The basic structure of an if is

<expression> ; result in ax
or al, al
je end ; skip the block if the condition is not met
<block>
end:

This means that the expression code and the block code (used for generating function bodies) can both be reused. The compiler only needs to emit a conditional jump before the block. One complicating factor is that the target for the conditional jump is not known until the block is generated. This is solved by storing the location of the jump and then fixing it to point to the correct location after generating the block.

while can be composed similarly

while_start:
    <expression> ; result in ax
    or al, al
    je while_end
    <block>
    jmp while_start ; loop to check the condition again
while_end:

This is very similar to an if, but with an extra jmp at the end, and a slightly different jump target if the condition fails. And indeed, in the compiler, that's exactly how it's implemented. The compiler generates an if, adjusts the jump target for the if forwards by 3 bytes and then generates a jump back to the start of the while loop.

All of these design choices, with all of their flaws, are specifically to reduce the size of the compiler. As much as possible needs to be ripped out of the language to reduce the number of things to implement, and then the implementation needs to be exceptionally optimized to be able to fit everything left. Even the choice to omit > and >= is critical, implementing those would cost 6 bytes under the current implementation, but 6 bytes is not worth the convenience of those operators.

The next post will be about exactly how I saved so much space in the implementation of the compiler. There's actually only a few particularly tricky things that I had to do to get this space, but I will go over all of them in detail.