An overview of the design of the C compiler, describing a lot of the limitations that were intentional design choices for the sake of making the implementation smaller.
This post is Part 1 of a series of posts about my project to create a C compiler within a boot sector. The code for the project is on github.
In Part 0, I talked about how x86 computers boot up from nothing, and the environment that the first code on the system gets. Part 1 will be about what I did in that limited and hostile environment and how I managed to fit a compiler (even a very bad one) in 446 bytes.
This project uses nasm
as the assembler of
choice. No particular reason for nasm over anything else,
though I was set on using Intel Syntax for the assembly,
which does eliminate some choices.
The first order of business is, of course, to set up a basic file that can boot, even if it does nothing.
BITS 16 ; real mode is 16 bit code
org 0x7C00 ; BIOS drops us at this address
main:
jmp $
TIMES 0x1BE - ($ - $$) db 0x00
; THIS WOULD NORMALLY BE THE MBR
; REMOVED FOR SIMPLICITY (and qemu doesn't need it)
TIMES 64 db 0x00
dw 0xAA55 ; boot signature
This code is the bare minimum to run. The Master Boot
Record is not shown in that snippet because it doesn't
really matter for our purposes. Additionally, the label
main
is not special. The first byte in the file
will be run, so you need to be careful to never put anything
before main, or have a jump to main as the first thing. The
label is simply there for organizational purposes.
This code also doesn't do anything. jmp $
means "jump to the start of this line". This will repeat
forever, keeping the CPU in an infinite loop. Assembling the
code with nasm -w+all -w+error -f bin boot.asm -o boot.bin
results in a 512 byte binary file, which is our boot sector.
Running this sector with qemu-system-x86_64 -drive format=raw,file=boot.bin -s
greets us with a screen, with some text that says "booting
from hard disk…", and nothing happens. This is expected
though, we're not doing anything. The -s
flag
tells qemu to allow a debugger to connect, so we can connect
gdb with gdb --init-eval-command="target remote :1234"
.
We will see that the program is indeed at
0x7C00
and if we step forward, it remains there
forever.
This is what the development cycle looks like, make some changes to the asm, build it, launch qemu, and then inspect the state with a debugger to see if things are correct. Because the output of this compiler is x86 machine code bytes in memory (and due to space limitations), all debugging is done though gdb. There's not much that's useful to print to an output, and even if there were, inspecting the generated program will require looking at raw memory anyway. I am now intimately familiar with the code that my compiler generates, to the point that sometimes I don't even use a disassembler to see what's wrong, I can identify it by the patterns I expect.
If you are interested in working in this environment yourself, good luck. I find it highly interesting, but it's not a particularly useful skill these days. If you're actually interested in modern bootloaders and operating systems, UEFI is what everything uses now, which provides a much nicer 64 bit environment and standardized APIs for doing the things that a bootloader needs to do. Writing a Real Bootloader is a very different project, but one that provides similar joy in my opinion, and you might end up with something you could actually use. Piece by piece replacing every part of your computer's software stack with something you created is ideal, in my opinion, and it would be super fun to run a bootloader I wrote, but that may be a future project.
Note: this project is still in progress. Specific implementation details may be outdated. As of the latest update (2024-09-24), almost no work is being put into the compiler, but instead into writing programs for this compiler to run.
The first thing that the compiler needs to do is actually get a source program. I chose to send the program over the serial port so that the binary does not have to include the source. This also makes it possible to type a program by hand, if you connect to the serial port, though there is no support for editing or backspace or moving up or down lines.
The basic code to read from the serial port looks like this
mov di, 0x6000 ; location in memory reserved for the source code
._load_program:
mov dx, COM1_PORT + 5 ; status port
in al, dx ; read from the status port
and al, 0x01 ; bit 0 is set if there is a character ready
jz ._load_program ; loop until the character is ready
mov dx, COM1_PORT ; data port
in al, dx ; read from the data port
stosb ; *di = al; di += 1;
or al, al ; exit when a 0x00 byte is read
jnz ._load_program
There is additional code that is not shown to initialize the serial port, since it has to be told what rate to send bits and other configuration. This code simply places the input program in memory, where it will later be read by the compiler.
The compiler needs to split the input program into
tokens. In most compilers this is a fairly complicated
process to allow for optional whitespace and line breaks and
other characters. Typically x=4*(x+2)
should
act the same as
=
x 4 * ( x + 2 )
My compiler, of course, throws that out the window.
Tokens are delimited by whitespace. All characters with
ascii value 0x20
or less are considered
"whitespace". If there's not a whitespace, my compiler
treats it as still the same token. x=4*(x+2)
is
considered One Token to my compiler. This greatly simplifies
the implementation of the tokenizer, at great cost to
readability of code. However, since C allows lots
of weird whitespace, my compiler requiring
it means that this is still a subset of C. This is
what some code might look like in my version of C:
int meow ;
int main (){
= 20 ;
meow while( 1 == 1 ){
();
read_char ();
print_char }
}
At the surface, this doesn't look so bad. However, all of those weird spaces are required. There must be whitespace:
(){
(more on this later)=
, and the right hand side of an
assignment();
to call it (once again a weird token)if
,
while
, and asm
(more weird tokens
here)This is surprisingly difficult to get right. Formatters will scream at you if you try to code like this, so you just have to remember the rules. Additionally, there's no error handling. At all. If the compiler expects a token, that token better exist exactly as the compiler expects. If it doesn't, all hell will break loose and you will certainly not get the output you expect (most of the time the compiler crashes and starts executing random memory).
The tokenizer is deceptively simple. In general its algorithm is:
<skip whitespace>
ident = 0
get_character()
while character > 0x20
ident = 10 * ident + (character - 0x30)
get_character()
return ident
This is a terrible algorithm
for identifying anything. I copied the idea from another
project that aimed to make a small C compiler1. The idea here is that
integer literals get converted into their value, and
everything else just so happens to produce a number that
looks Varied Enough for use. This however does mean that it
is impossible to differentiate between numbers and not
numbers, for example ;
and 11
have
the same value. This is fortunately normally not an
issue, as users will write well formed programs which always
specific formats, so we never need to make
that distinction. It also means that collisions can occur
between names. The names get_next_cluster
and
set_next_cluster
happen to have the same 16 bit
identifier, and this has caused me pain a couple times when
writing code. The best workaround that I have found is
making sure the last few characters of a name are different,
because they are less likely to get "pushed out" of the
identifier and stop mattering. next_cluster_get
and next_cluster_set
are names I now use in
code.
The compiler only needs to know about a few different tokens and they are as follows:
Name | Value | Meaning |
---|---|---|
COMMENT |
0xFFF5 |
// |
INT |
0x18F4 |
int |
INT_PTR |
0xF982 |
int* |
OPEN_PAREN |
0xFFF8 |
( |
CLOSE_PAREN |
0xFFF9 |
) |
CLOSE_BRACE |
0x004D |
} |
FN_ARGS |
0xFCE5 |
(){ |
FN_CALL |
0xFCA5 |
(); |
SEMICOLON |
0x000B |
; |
IF_START |
0x1858 |
if( |
WHILE_START |
0xDA02 |
while( |
ASM_START |
0x973E |
asm(" |
DOT_BYTE |
0x9491 |
.byte |
ASM_END |
0xFA4D |
"); |
STAR |
0xFFFA |
* |
AND |
0xFFF6 |
& |
RETURN |
0x7DA7 |
return; |
PLUS |
0xFFFB |
+ |
MINUS |
0xFFFD |
- |
OR |
0x004C |
| |
XOR |
0x002E |
^ |
SHL |
0x0084 |
<< |
SHR |
0x009A |
>> |
EQUAL_EQUAL |
0x008F |
!= |
NOT_EQUAL |
0xFF77 |
== |
LESS |
0x000C |
< |
Some notes on these tokens:
(){
is one token. Functions may not
have arguments. Use global variables instead.();
(same as above)if(
is one token. This has tripped me up
several times, since in many languages, the convention is to
place a space between the if
and the
(
.while(
(same as if)return;
is one token. Functions may
not return values*. Use global variables
instead.<
exists. Swap the order of the
arguments if you want a greater than comparison.asm("
, .byte
, and
");
tokens exist to implement the
asm
extension. This is compatible with gcc and
clang's extensions. It is used like
asm(" .byte 235 ; .byte 254 ; ");
. Only decimal
literals exist, so this is very inconvenient to write.The grammar for this implementation of C is also very
limited. Function calls can only appear alone, as a
statement (since they can't return a value). Declaring and
assigning to variables must be separate, there's no variable
initializers. Only global variables
exist. Local variables require a
surprising amount of implementation work to do with scopes
and the stack, so you can only declare global variables. The
lack of local variables is possibly one of the worst
"features" of this language. Declaring variables away from
their usage makes it easier to misuse them, and leads to
awful conventions to prevent collisions of names, like
int _read_char_state ;
(this variable really
wants to be a local named state
).
Several registers have meanings that are consistent throughout the compiler:
ax
is used for holding token values or for
emitting generated codebx
is scratch, typically pointers, because
only bx
, si
, and di
can be used for pointers in real mode, and the other two are
in usecx
holds the location in memory where a
specific function or variable is storeddx
is mostly used for generating code,
otherwise scratchbp
is used to hold the value of the left
hand side of an expression, to correctly implement pointer
mathsi
is the pointer to the start of the next
token in the source codedi
is the pointer to the next place to emit
bytes for the compiled codeThe compiler processes tokens by getting tokens
sequentially and using recursive descent. It starts by
looking for top level declarations, which are variables and
functions. The type of the declaration is read first.
Variables must have type int
or
int*
while functions may have type
void
, int
, or int*
,
but should use void
, since
return values do not exist. If the type of a declaration is
int*
, the location of the declaration generated
program is incremented by 1. All other declarations are
aligned to 2 bytes, so this allows the compiler to determine
if a variable is a pointer or not without having to store
that data in a separate table. For all declarations, first
an entry is placed in a location of memory that acts as an
ident -> pogram location map. This map is 128KiB large,
since there's 65536 possible ident numbers and each one uses
two bytes to represent the location of the declaration in
the generated program. To distinguish a function declaration
from a variable declaration, the code looks for a
(){
token after the name of the variable or
function. Since functions cannot have arguments, this
suffices to identify all functions.
Generating code for variables is very simple - the compiled code pointer is incremented by 2 to reserve space for the variable, and the compiler loops to check for the next top level declaration.
The rest of the compiler's code is dedicated to generating code for functions, including all the different types of statements and expressions that can be within them.
There are 7 top level statements within a function body:
if
while
return
asm
This was actually the very first thing that I implemented
within the compiler, because it's surprisingly simple! The
call
x86 instruction implements everything we
need, since we don't have locals or arguments. There exists
a call <register>
form too, so the
compiler just generates
mov bx, ADDRESS
call bx
To obtain the address to call, the compiler looks up the address of the function in the ident -> address map. This is a constant over the lifetime of the program, so the generated code only needs to load that constant and call it.
Variable assignment (var = expr
) and
assignment through a pointer (* ptr = expr
)
share most of their logic, but are some of the most
complicated statements. Assignment has to remember the
variable that it is assigning to, then parse and codegen a
complicated arbitrary expression, and then store the result
of that expression in the variable, or in the case of
assignment through a pointer, in the memory location pointed
to. The generated code for the assignment itself is fairly
simple, it assumes that there's a value in ax
and then stores it to the appropriate location.
; assume expr result in ax
; variable assignment
mov bx, VARIABLE_ADDRESS
mov word [bx], ax ; store to the variable
; assignment through a pointer
mov bx, VARIABLE_ADDRESS
mov bx, word [bx] ; get the pointer's value
mov word [bx], ax ; store to that address
A vast majority of the complexity of the compiler is in the expression parsing and codegen required for assignment. Expressions can be either unary expressions (a single Thing, like a number, a variable, or a parenthesized expression), or binary expressions, which are a unary expression followed by a binary operator, and then another unary expression.
Some notes on the implementation:
The code for parsing and generating the unary expressions
on the left hand side and right hand side of the binary
operator is identical. Since the generated code always
writes the value of the unary expression to ax
and uses bx
for addresses, we need to save the
left hand side of the operator on the stack so that the
right hand side can generate into ax
, and then
restore back so that the left hand side is in
ax
and the right hand side is in
cx
. Thankfully this is only a few bytes to emit
the saves and restores. By ensuring that the LHS of the
expression is in ax
and the RHS is in
cx
, the code that emits the binary operator
only needs to pick a sequence of bytes depending on the
operator. It might emit add ax, cx
to add two values or cmp ax, cx; sete al
to check if two values are equal, but it doesn't concern
itself with complicated register allocation.
There are a lot of assumptions in the binary operator code that assume that the user wrote a valid binary operator. Only the last byte of the idents are checked, meaning that there are 256 different ident numbers (and an infinite number of strings) that the compiler will consider to be a binary operator. If the token provided is not a binary operator, it's highly likely that the compiler is going to enter an infinite loop.
Numbers and variables are indistinguishable. The way that the compiler knows a variable is being accessed is if an ident number has an entry in the address table. This means that certain integer literals cannot be used. However, it should be possible to use the sum or difference of integer literals to form any integer that you want. I have yet to encounter this problem in my code, so it's hopefully Unlikely Enough. This is something that cannot be improved however, not without completely redoing the way that idents are managed.
In C, doing pointer + integer
adds sizeof(type) * integer
bytes to the pointer. Additionally, subtracting pointers
divides the result of the subtraction by sizeof(type)
,
giving the number of elements between the
pointers. These are both special cased in the compiler. If
the left hand side of a binary operation is a pointer
(determined by its variable address being odd) and the
operator is a +
, twice as much is added to the
pointer. If the left hand side is a pointer and the operator
is a -
, the result of the addition is divided
by 2. If the left hand side is a pointer and the operator is
anything else, normal behavior is used, which allows
==
, !=
, and <
to
work without concern. If the left hand side of an expression
is not a pointer, none of these special cases are used.
Note: When writing the first version of the previous
paragraph, I actually found two different bugs in how
I was handling pointers! First, I was always
multiplying the right hand side by 2, meaning that
ptr == ptr
was actually doing ptr == (ptr * 2)
Then, once I fixed that, I realized that ptr - ptr
is not equivalent to ptr - (ptr * 2)
and that when I tested the original fix, I happened to
pick a pair of values where that was the case.
if
and
while
Weirdly, implementing if
and
while
is not very complicated. The basic
structure of an if
is
> ; result in ax
<expressionor al, al
je end ; skip the block if the condition is not met
>
<blockend:
This means that the expression code and the block code (used for generating function bodies) can both be reused. The compiler only needs to emit a conditional jump before the block. One complicating factor is that the target for the conditional jump is not known until the block is generated. This is solved by storing the location of the jump and then fixing it to point to the correct location after generating the block.
while
can be composed similarly
while_start:
> ; result in ax
<expressionor al, al
je while_end
>
<blockjmp while_start ; loop to check the condition again
while_end:
This is very similar to an if
, but
with an extra jmp
at the end, and a slightly
different jump target if the condition fails. And indeed, in
the compiler, that's exactly how it's implemented. The
compiler generates an if
, adjusts the jump
target for the if
forwards by 3 bytes and then
generates a jump back to the start of the while loop.
All of these design choices, with all of their flaws, are
specifically to reduce the size of the compiler. As much as
possible needs to be ripped out of the language to reduce
the number of things to implement, and then the
implementation needs to be exceptionally optimized to be
able to fit everything left. Even the choice to omit
>
and >=
is critical,
implementing those would cost 6 bytes under the current
implementation, but 6 bytes is not worth the convenience of
those operators.
The next post will be about exactly how I saved so much space in the implementation of the compiler. There's actually only a few particularly tricky things that I had to do to get this space, but I will go over all of them in detail.