Ramblings on the architecture of compilers
August 10, 2020 · 1,952 words
Heads up: when I say this is a "rambling" I mean it really is a rambling sort of stream of consciousness.
Computers execute instructions which are written as a series of numbers. A computer is essentially a really elaborate set of circuitry: you feed instructions in and it executes them. Over time this circuitry has become increasingly complex and much smaller (really, really small, like microscope small), but the underlying principle – feed in instructions, the computer executes them – is the same.
It turns out that writing instructions this way is inefficient, slow and difficult. Worse still, the instructions a computer can execute directly are limited to sequences of binary digits, and those sequences are specific to each type of machine. This seriously slows down the process of developing software, because you have to rewrite your software for every machine you want to run it on.
Long story short, people have found ways to get around this. The first step taken was one of convenience: text abbreviations were developed which corresponded to specific binary sequences, meaning that one could just type a word like "load" instead of a long run of ones and zeros. Before such a program is run, a separate program (an assembler) carries out what amounts to a find-and-replace operation, swapping each abbreviation for its corresponding binary sequence.
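To make the idea concrete, here's a minimal sketch in Rust of that find-and-replace step. The mnemonics and the binary sequences they map to are entirely made up for illustration; a real assembler also has to handle operands, labels and much more.

use std::collections::HashMap;

// A toy "assembler": swap each text abbreviation for the binary
// sequence it stands for. The mnemonics and encodings are invented
// for illustration; real instruction encodings are far more involved.
fn assemble(source: &str) -> Vec<u8> {
    let table: HashMap<&str, Vec<u8>> = HashMap::from([
        ("load", vec![0b0000_0001]),
        ("add", vec![0b0000_0010]),
        ("store", vec![0b0000_0011]),
    ]);
    source
        .split_whitespace()
        .flat_map(|word| table.get(word).cloned().unwrap_or_default())
        .collect()
}

fn main() {
    // "load add store" becomes the byte sequence [1, 2, 3].
    println!("{:?}", assemble("load add store"));
}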
While definitely a step up in terms of programming ergonomics, this still required expensive rewrites, because code remained platform-specific and had to be re-engineered every time it was moved to a new machine. This problem was effectively solved by the introduction of what's termed a "high level programming language." High level means that the language is more abstracted from the underlying machine when compared to other languages (such as "machine code" – the sequences of binary digits which are fed into the computer as instructions – and "assembly" – the mnemonics for sequences of binary digits).
A high level language allows you to write code that feels a bit more like English than strings of ones and zeroes. For example, take this program written in a programming language called Rust.
fn main() {
    let items = vec![1, 2, 3, 4, 5];
    for item in items {
        println!("{}", item);
    }
}
When it's run, it outputs:
1
2
3
4
5
Were we to write this in assembly, it wouldn't be a fun experience. The assembly code the compiler produces for this program looks something like this:
push rbp
mov rbp, rsp
push r15
push r14
push r13
push r12
push rbx
sub rsp, 120
mov edi, 20
mov esi, 4
call ___rust_alloc
test rax, rax
je LBB5_7
mov rbx, rax
movaps xmm0, xmmword ptr [rip + LCPI5_0]
movups xmmword ptr [rax], xmm0
mov dword ptr [rax + 16], 5
add rax, 20
mov qword ptr [rbp - 152], rbx
mov qword ptr [rbp - 144], 5
mov qword ptr [rbp - 136], rbx
mov qword ptr [rbp - 120], rax
mov qword ptr [rbp - 128], rax
lea r14, [rbx + 4]
mov dword ptr [rbp - 44], 1
lea rax, [rbp - 44]
mov qword ptr [rbp - 64], rax
mov r13, qword ptr [rip + __ZN4core3fmt3num3imp52_$LT$impl$u20$core..fmt..Display$u20$for$u20$i32$GT$3fmt17hdf536815cc1fb34aE@GOTPCREL]
mov qword ptr [rbp - 56], r13
lea r15, [rip + l___unnamed_2]
mov qword ptr [rbp - 112], r15
mov qword ptr [rbp - 104], 2
mov qword ptr [rbp - 96], 0
lea r12, [rbp - 64]
mov qword ptr [rbp - 80], r12
mov qword ptr [rbp - 72], 1
lea rdi, [rbp - 112]
call std::io::stdio::_print
lea r14, [rbx + 8]
mov eax, dword ptr [rbx + 4]
mov dword ptr [rbp - 44], eax
lea rax, [rbp - 44]
mov qword ptr [rbp - 64], rax
mov qword ptr [rbp - 56], r13
mov qword ptr [rbp - 112], r15
mov qword ptr [rbp - 104], 2
mov qword ptr [rbp - 96], 0
mov qword ptr [rbp - 80], r12
mov qword ptr [rbp - 72], 1
lea rdi, [rbp - 112]
call std::io::stdio::_print
lea r14, [rbx + 12]
mov eax, dword ptr [rbx + 8]
mov dword ptr [rbp - 44], eax
lea rax, [rbp - 44]
mov qword ptr [rbp - 64], rax
mov qword ptr [rbp - 56], r13
mov qword ptr [rbp - 112], r15
mov qword ptr [rbp - 104], 2
mov qword ptr [rbp - 96], 0
mov qword ptr [rbp - 80], r12
mov qword ptr [rbp - 72], 1
lea rdi, [rbp - 112]
call std::io::stdio::_print
mov r14, rbx
add r14, 16
mov eax, dword ptr [rbx + 12]
mov dword ptr [rbp - 44], eax
lea rax, [rbp - 44]
mov qword ptr [rbp - 64], rax
mov qword ptr [rbp - 56], r13
mov qword ptr [rbp - 112], r15
mov qword ptr [rbp - 104], 2
mov qword ptr [rbp - 96], 0
mov qword ptr [rbp - 80], r12
mov qword ptr [rbp - 72], 1
lea rdi, [rbp - 112]
call std::io::stdio::_print
mov r14, qword ptr [rbp - 120]
mov eax, dword ptr [rbx + 16]
mov dword ptr [rbp - 44], eax
lea rax, [rbp - 44]
mov qword ptr [rbp - 64], rax
mov qword ptr [rbp - 56], r13
mov qword ptr [rbp - 112], r15
mov qword ptr [rbp - 104], 2
mov qword ptr [rbp - 96], 0
mov qword ptr [rbp - 80], r12
mov qword ptr [rbp - 72], 1
lea rdi, [rbp - 112]
call std::io::stdio::_print
mov esi, 20
mov edx, 4
mov rdi, rbx
call ___rust_dealloc
add rsp, 120
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
LBB5_7:
mov edi, 20
mov esi, 4
call alloc::alloc::handle_alloc_error
LBB5_8:
mov rbx, rax
mov qword ptr [rbp - 136], r14
lea rdi, [rbp - 152]
call core::ptr::drop_in_place
mov rdi, rbx
call __Unwind_Resume
ud2
In short, writing in a higher-level language is much nicer than writing in a lower-level one. High-level languages aren't a new innovation by any means (they've been around since the late 1950s), but over time they've been refined hugely. The tools for turning programs written in higher-level languages into machine code are marvels of modern engineering: they often rack up millions of lines of code, and the amount of theory underpinning them is immense. They're fascinating though, and although the compilers used in production are immensely sophisticated, writing your own hobby language can be really simple and easy to do. This is because compilers are incredibly modular and you can add and remove pieces from them as you see fit.
These higher-level languages are translated into code the computer can run by a piece of software called a compiler. For the end user this is where the story pretty much ends: the program is translated and run. For the person writing the translator (the compiler) it's a different story. Every processor has its own instruction set, which means that a different compiler has to be implemented for every processor (though families of processors – e.g. Intel's i3, i5 and i7 – often share the same machine code). Fortunately, there's a large amount of code which can be reused between different compiler implementations.
This is capitalised on extensively in modern compiler design. The ecosystem of processors is vast, especially when one begins to factor in other devices capable of computing (e.g. graphics processing units), so people have built handy abstractions to help with the design of compilers. One such abstraction is the idea of "passes", where the compiler goes through multiple steps to reach its end goal (translating a programming language into binary). An example of one of these passes is "tokenisation", where the input source code is transformed into tokens.
For example, the input
fn main() {
    let x = 1;
}
might be tokenised as a stream looking something like (KEYWORD.Function), (IDENTIFIER, "main"), (KEYWORD.OpeningParen), (KEYWORD.ClosingParen), (KEYWORD.OpeningBrace), (KEYWORD.Let), (IDENTIFIER, "x"), (OPERATOR.Equals), (INTEGER, 1), (KEYWORD.SemiColon), (KEYWORD.ClosingBrace). This processes the input so that later stages can handle it more easily. It's common to then build a "tree" from the tokens and walk all the nodes in the tree (visiting them all in turn and outputting some code for each node).
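To make that concrete, here's a minimal sketch in Rust of what a tokenisation pass for the snippet above might look like. The token names are my own invention, and a real lexer would also track source positions, report errors and cover far more of the language.

#[derive(Debug)]
enum Token {
    Function,           // "fn"
    Let,                // "let"
    Identifier(String), // e.g. "main", "x"
    Integer(i64),       // e.g. 1
    Equals,
    OpenParen,
    CloseParen,
    OpenBrace,
    CloseBrace,
    SemiColon,
}

// Walk the input character by character, grouping characters into tokens.
fn tokenise(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            c if c.is_whitespace() => { chars.next(); }
            '=' => { chars.next(); tokens.push(Token::Equals); }
            '(' => { chars.next(); tokens.push(Token::OpenParen); }
            ')' => { chars.next(); tokens.push(Token::CloseParen); }
            '{' => { chars.next(); tokens.push(Token::OpenBrace); }
            '}' => { chars.next(); tokens.push(Token::CloseBrace); }
            ';' => { chars.next(); tokens.push(Token::SemiColon); }
            c if c.is_ascii_digit() => {
                // Consume a run of digits and turn it into an integer token.
                let mut n = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_digit() { n.push(d); chars.next(); } else { break; }
                }
                tokens.push(Token::Integer(n.parse().unwrap()));
            }
            c if c.is_alphabetic() || c == '_' => {
                // Consume a word, then decide if it's a keyword or an identifier.
                let mut word = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_alphanumeric() || d == '_' { word.push(d); chars.next(); } else { break; }
                }
                tokens.push(match word.as_str() {
                    "fn" => Token::Function,
                    "let" => Token::Let,
                    _ => Token::Identifier(word),
                });
            }
            _ => { chars.next(); } // skip anything this sketch doesn't recognise
        }
    }
    tokens
}

fn main() {
    println!("{:?}", tokenise("fn main() {\n    let x = 1;\n}"));
}

Running this prints the token stream, which a later pass (the parser) would then assemble into a tree.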
It turns out that this can be improved even further! If we have multiple programming languages, we can share code between them, because a lot of the code in compilers for different programming languages is very similar. Compilers are often bundled together into a single application which includes multiple compilers. For example, gcc (the GNU Compiler Collection) is a collection of compilers which are grouped together.
Going even further, compilers implemented separately can do this as well. A lot of the internals can be extracted into separate libraries so that multiple compilers can use them. Often compilers are divided into "frontends" and "backends," which communicate through the use of an "intermediate language" – a language which both the frontend and the backend "speak." The frontend transforms its input into the shared language and the backend turns this shared language into machine code.
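Here's a tiny sketch of that division in Rust. The intermediate language is invented for illustration (real ones, such as LLVM IR, are far richer): the frontend produces IR, and the backend turns that IR into made-up assembly text, so a second backend targeting a different processor could reuse the same frontend unchanged.

// A tiny invented intermediate language for a stack machine.
#[derive(Debug)]
enum Ir {
    LoadConst(i64), // push a constant onto the stack
    Add,            // pop two values, push their sum
    Print,          // pop a value and print it
}

// A "frontend" for an imaginary source language: to keep the sketch
// short it just hard-codes the IR for `print(1 + 2)`.
fn frontend() -> Vec<Ir> {
    vec![Ir::LoadConst(1), Ir::LoadConst(2), Ir::Add, Ir::Print]
}

// One "backend": lower each IR instruction to a line of (made-up)
// assembly. Another backend could emit code for a different processor.
fn backend(ir: &[Ir]) -> String {
    ir.iter()
        .map(|instr| match instr {
            Ir::LoadConst(n) => format!("push {}", n),
            Ir::Add => "add".to_string(),
            Ir::Print => "call print".to_string(),
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    println!("{}", backend(&frontend()));
}

The frontend never needs to know which backend will consume its output, which is exactly the property that lets many languages share one set of backends.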
Shouldn't we try this with other pieces of software – make it modular and easily interchangeable? The answer is that we are, but only in some regards. Inspired by the Unix philosophy (make small programs which do only one thing and build big programs by composing the small ones; there's more to it, but that's a high-level overview), WebAssembly is a specification for a common language. Unlike our compilers' intermediate languages, however, WebAssembly is designed to be directly executed. This makes it possible for programs written in different languages to interact with each other with limited overhead (often data has to be duplicated and encoded by one language before being decoded by another to overcome the differences between languages).
While interesting and important, such specifications are about realising such a philosophy for the programmer. When it comes to runtime (i.e. when a computer program is run), the user typically has extremely limited ability to change things. Our interoperable standards are many, and they lie beneath the surface like sleeping beasts. We need to wake them and allow them to smash through the ice which has built up. (When I talk about the "surface" I mean technology that the average person uses, like search engines and social media platforms.)
The downside to such an approach – interoperability and modularity – is that it places a burden of additional complexity on the user. Choice is a great thing to have, but many people are likely to be unsure of what to choose. We don't just need to make applications interoperable; we need to make them interoperable in a way which makes them more usable. The problem is a complex one and requires some serious thought. How do we decompose applications in such a way that software becomes more usable rather than less? How do we ensure that applications which offer a large set of common features, but also slightly different ones, can interoperate? Is it possible to convince our tech giants to go for such a model?
This modularity has worked in compilers because a compiler is essentially a composition of a number of different services, sharing some code to save effort.
Such a model might be a better alternative to antitrust, which is going to be a messy process. Obviously, the current model we have is bad, and I personally believe antitrust is better than the status quo. Forcing tech companies to open up their platforms and make them interoperable might be better than antitrust, though. Maybe it's even a form of antitrust?
This is something I'm curious to investigate further, particularly around drawing inspiration from existing pieces of software which handle complexity and scale effectively.