Ramblings on the architecture of compilers

Heads up: when I say this is a "rambling" I mean exactly that: a stream-of-consciousness sort of post.

Computers execute instructions which are written as a series of numbers. A computer is essentially a really elaborate set of circuitry to which you can feed in instructions which it will execute. Over time this circuitry has become increasingly complex and much smaller (really, really small, like microscope small), but the underlying principle (you feed in instructions and the computer executes them) is the same.

The instructions a computer can execute directly are limited to sequences of binary digits, and it turns out that writing those instructions by hand is inefficient, slow and difficult. Worse still, the instructions are often specific to a particular type of machine. This seriously slows down the process of developing software, because you have to rewrite your software for every machine you want to run it on.

Long story short, people have found ways to get around this. The first step taken was one of convenience: text abbreviations were developed which corresponded to specific binary sequences. This meant that one could just type words like "load" instead of a long sequence of ones and zeros. Before a program written with these abbreviations is run, a separate program (an assembler) carries out what is essentially a find-and-replace operation, swapping each abbreviation for its corresponding binary sequence.
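
For example (a hypothetical but representative pairing; the exact bytes vary from processor to processor, and these are for a typical x86 machine), a programmer could type the abbreviation on the left instead of the bytes on the right:

mov eax, 1    ->    B8 01 00 00 00    (put the number 1 into the register called eax)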

While definitely a step up in terms of programming ergonomics, this still required expensive rewrites: code remained platform-specific and had to be re-engineered every time it was moved to a new machine. This problem was effectively solved by the introduction of what's termed a "high level programming language." High level means that the language is more abstracted from the underlying machine when compared to other languages (such as "machine code" – the sequences of binary digits which are fed into the computer as instructions – and "assembly" – the text abbreviations for those binary sequences).

A high level language allows you to write code that feels a bit more like English than strings of ones and zeroes. For example, take this program written in a programming language called Rust.

fn main() {
    let items = vec![1, 2, 3, 4, 5];
    for item in items {
        println!("{}", item);
    }
}

When it's run, it outputs:

1
2
3
4
5

Were we to write this in assembly, it wouldn't be a fun experience. The assembly code the compiler produces for this program looks something like this:

 push    rbp
 mov     rbp, rsp
 push    r15
 push    r14
 push    r13
 push    r12
 push    rbx
 sub     rsp, 120
 mov     edi, 20
 mov     esi, 4
 call    ___rust_alloc
 test    rax, rax
 je      LBB5_7
 mov     rbx, rax
 movaps  xmm0, xmmword ptr [rip + LCPI5_0]
 movups  xmmword ptr [rax], xmm0
 mov     dword ptr [rax + 16], 5
 add     rax, 20
 mov     qword ptr [rbp - 152], rbx
 mov     qword ptr [rbp - 144], 5
 mov     qword ptr [rbp - 136], rbx
 mov     qword ptr [rbp - 120], rax
 mov     qword ptr [rbp - 128], rax
 lea     r14, [rbx + 4]
 mov     dword ptr [rbp - 44], 1
 lea     rax, [rbp - 44]
 mov     qword ptr [rbp - 64], rax
 mov     r13, qword ptr [rip + __ZN4core3fmt3num3imp52_$LT$impl$u20$core..fmt..Display$u20$for$u20$i32$GT$3fmt17hdf536815cc1fb34aE@GOTPCREL]
 mov     qword ptr [rbp - 56], r13
 lea     r15, [rip + l___unnamed_2]
 mov     qword ptr [rbp - 112], r15
 mov     qword ptr [rbp - 104], 2
 mov     qword ptr [rbp - 96], 0
 lea     r12, [rbp - 64]
 mov     qword ptr [rbp - 80], r12
 mov     qword ptr [rbp - 72], 1
 lea     rdi, [rbp - 112]
 call    std::io::stdio::_print
 lea     r14, [rbx + 8]
 mov     eax, dword ptr [rbx + 4]
 mov     dword ptr [rbp - 44], eax
 lea     rax, [rbp - 44]
 mov     qword ptr [rbp - 64], rax
 mov     qword ptr [rbp - 56], r13
 mov     qword ptr [rbp - 112], r15
 mov     qword ptr [rbp - 104], 2
 mov     qword ptr [rbp - 96], 0
 mov     qword ptr [rbp - 80], r12
 mov     qword ptr [rbp - 72], 1
 lea     rdi, [rbp - 112]
 call    std::io::stdio::_print
 lea     r14, [rbx + 12]
 mov     eax, dword ptr [rbx + 8]
 mov     dword ptr [rbp - 44], eax
 lea     rax, [rbp - 44]
 mov     qword ptr [rbp - 64], rax
 mov     qword ptr [rbp - 56], r13
 mov     qword ptr [rbp - 112], r15
 mov     qword ptr [rbp - 104], 2
 mov     qword ptr [rbp - 96], 0
 mov     qword ptr [rbp - 80], r12
 mov     qword ptr [rbp - 72], 1
 lea     rdi, [rbp - 112]
 call    std::io::stdio::_print
 mov     r14, rbx
 add     r14, 16
 mov     eax, dword ptr [rbx + 12]
 mov     dword ptr [rbp - 44], eax
 lea     rax, [rbp - 44]
 mov     qword ptr [rbp - 64], rax
 mov     qword ptr [rbp - 56], r13
 mov     qword ptr [rbp - 112], r15
 mov     qword ptr [rbp - 104], 2
 mov     qword ptr [rbp - 96], 0
 mov     qword ptr [rbp - 80], r12
 mov     qword ptr [rbp - 72], 1
 lea     rdi, [rbp - 112]
 call    std::io::stdio::_print
 mov     r14, qword ptr [rbp - 120]
 mov     eax, dword ptr [rbx + 16]
 mov     dword ptr [rbp - 44], eax
 lea     rax, [rbp - 44]
 mov     qword ptr [rbp - 64], rax
 mov     qword ptr [rbp - 56], r13
 mov     qword ptr [rbp - 112], r15
 mov     qword ptr [rbp - 104], 2
 mov     qword ptr [rbp - 96], 0
 mov     qword ptr [rbp - 80], r12
 mov     qword ptr [rbp - 72], 1
 lea     rdi, [rbp - 112]
 call    std::io::stdio::_print
 mov     esi, 20
 mov     edx, 4
 mov     rdi, rbx
 call    ___rust_dealloc
 add     rsp, 120
 pop     rbx
 pop     r12
 pop     r13
 pop     r14
 pop     r15
 pop     rbp
 ret
LBB5_7:
 mov     edi, 20
 mov     esi, 4
 call    alloc::alloc::handle_alloc_error
LBB5_8:
 mov     rbx, rax
 mov     qword ptr [rbp - 136], r14
 lea     rdi, [rbp - 152]
 call    core::ptr::drop_in_place
 mov     rdi, rbx
 call    __Unwind_Resume
 ud2

In short, writing in a higher-level language is much nicer than writing in a lower-level one. High-level languages aren't a new innovation by any means (they've been around since the late 1950s), but over time they've been refined hugely. The tools for turning programs written in higher-level languages into machine code are marvels of modern engineering: they often rack up millions of lines of code, and the amount of theory underpinning them is immense. They're fascinating though, and although the software used in production is immensely sophisticated, writing your own hobby language can be really simple and easy to do. This is because compilers are incredibly modular: you can add and remove pieces from them as you see fit.

These higher-level languages are translated into code which the computer can run by a piece of software called a compiler. For the end user this is pretty much where the story ends: the program is translated and run. For the person writing the translator (the compiler) it's a different story. Every processor has its own instruction set, which means that a different compiler has to be implemented for every processor (though families of processors, e.g. Intel's i3, i5 and i7 ranges, will often share the same machine code). Fortunately, there's a large amount of code which can be re-used between different compiler implementations.

This is capitalised on extensively in modern compiler design. The ecosystem of processors is vast, especially when one begins to factor in other devices capable of computing (e.g. graphics processing units). People have built handy abstractions to help with the design of compilers. One such abstraction is the idea of "passes", where the compiler goes through multiple steps to reach its end goal (translating a programming language into binary). An example of one of these "passes" might be "tokenisation", where the input program text is transformed into tokens.

For example, the input

fn main() {
    let x = 1;
}

Might be tokenised as a stream looking something like (KEYWORD.Function), (IDENTIFIER, "main"), (SYMBOL.OpenParen), (SYMBOL.CloseParen), (SYMBOL.OpeningBrace), (KEYWORD.Let), (IDENTIFIER, "x"), (OPERATOR.Equals), (INTEGER, 1), (SYMBOL.SemiColon), (SYMBOL.ClosingBrace). This processes the input so that later stages can handle it more easily. It's common to then build a "tree" out of the tokens and walk all the nodes in that tree (visiting them all in turn and outputting some code for each node).
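
To make that a bit more concrete, here's a minimal sketch of what a tokenisation pass could look like in Rust. The Token type, its categories and the tokenise function are all invented for this illustration (a real lexer has to handle strings, comments, error reporting and much more), but the shape of the pass is representative: scan the characters, group them, emit tokens.

// A minimal tokeniser sketch. The Token type and tokenise function are
// invented for illustration; they are not taken from any real compiler.
#[derive(Debug)]
enum Token {
    Keyword(String),    // "fn", "let"
    Identifier(String), // "main", "x"
    Integer(i64),       // 1
    Symbol(char),       // '(', ')', '{', '}', '=', ';'
}

fn tokenise(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = source.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c.is_ascii_digit() {
            // Group a run of digits into a single integer token.
            let mut number = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_ascii_digit() { number.push(d); chars.next(); } else { break; }
            }
            tokens.push(Token::Integer(number.parse().unwrap()));
        } else if c.is_alphabetic() || c == '_' {
            // Group a run of letters into a keyword or an identifier.
            let mut word = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_alphanumeric() || d == '_' { word.push(d); chars.next(); } else { break; }
            }
            if word == "fn" || word == "let" {
                tokens.push(Token::Keyword(word));
            } else {
                tokens.push(Token::Identifier(word));
            }
        } else {
            // Everything else becomes a single-character symbol token.
            tokens.push(Token::Symbol(c));
            chars.next();
        }
    }
    tokens
}

fn main() {
    println!("{:?}", tokenise("fn main() {\n    let x = 1;\n}"));
}

Run on the snippet above, this prints something like [Keyword("fn"), Identifier("main"), Symbol('('), Symbol(')'), Symbol('{'), Keyword("let"), Identifier("x"), Symbol('='), Integer(1), Symbol(';'), Symbol('}')], which later passes (parsing into a tree, walking the tree to output code) can work with far more easily than raw text.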

It turns out that this can be improved even further! If we have multiple programming languages, we can share code between their compilers, because a lot of a compiler's code looks much the same regardless of which language it compiles. Compilers are therefore often bundled together into a single software application. For example, gcc (the GNU Compiler Collection) groups together compilers for C, C++, Fortran and several other languages.

Going even further, compilers implemented separately can do this as well: a lot of the internals can be extracted into separate libraries so that multiple compilers can use them. Often compilers are divided into "frontends" and "backends." Frontends and backends communicate through the use of an "intermediate language." This is a language which both the frontend and the backend "speak." The frontend transforms its input into the shared language and the backend turns this shared language into machine code.
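
As a toy sketch of that idea in Rust (the instruction set here is completely invented; real intermediate languages such as LLVM IR or GCC's GIMPLE are vastly richer), the shared language might be nothing more than a small list of instructions that every frontend knows how to produce and every backend knows how to consume:

// A toy "intermediate language" shared between frontends and backends.
// The Ir instruction set and the pretend assembly emitted below are invented
// purely for illustration.
#[derive(Debug)]
enum Ir {
    LoadConst(i64), // push a constant onto an abstract stack
    Add,            // pop two values, push their sum
    Print,          // pop a value and print it
}

// A "frontend" for some hypothetical source language produces IR...
fn frontend() -> Vec<Ir> {
    // ...here, hard-coded IR for a program that prints 2 + 3.
    vec![Ir::LoadConst(2), Ir::LoadConst(3), Ir::Add, Ir::Print]
}

// ...and a "backend" turns the shared IR into machine-specific code
// (represented here by made-up assembly text).
fn backend(program: &[Ir]) -> String {
    let mut output = String::new();
    for instruction in program {
        match instruction {
            Ir::LoadConst(n) => output.push_str(&format!("    push {}\n", n)),
            Ir::Add => output.push_str("    pop b\n    pop a\n    add a, b\n    push a\n"),
            Ir::Print => output.push_str("    pop a\n    call print\n"),
        }
    }
    output
}

fn main() {
    // Any number of frontends could feed the same backend, and vice versa.
    println!("{}", backend(&frontend()));
}

Swap in a backend for a different processor and every frontend keeps working unchanged; add a frontend for a new language and every existing backend can already compile it. That re-use is exactly what the frontend/backend split buys.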

Shouldn't we try this with other pieces of software? Make it modular and easily interchangeable. The answer is that we are, but only in some regards. Inspired by the Unix philosophy (make small programs which do only one thing, and build big programs by composing the small ones; there's more to it, but that's a high-level overview), WebAssembly is a specification for a common language. Unlike a compiler's internal intermediate language, however, WebAssembly is designed to be directly executed. This makes it possible for programs written in different languages to interact with each other with limited overhead (often data has to be duplicated and encoded by one language before being decoded by another to bridge the differences between them).

While interesting and important, such specifications are about realising this philosophy for the programmer. When it comes to runtime (i.e. when a computer program is actually run), the user typically has extremely limited ability to change things. Our interoperable standards are many, and they lie beneath the surface like sleeping beasts. We need to awaken them and allow them to smash through the ice which has built up. When I talk about the "surface" I mean the technology the average person uses, like search engines and social media platforms.

The downside to such an approach (interoperability and modularity) is that it places a burden of additional complexity on the user. Choice is a great thing to have, but many people are likely to be unsure of what to use. We don't just need to make applications interoperable; we need to make them interoperable in a way which makes them more usable. The problem is a complex one and requires some serious thought. How do we decompose applications in such a way that software becomes more usable rather than less? How do we ensure that applications which offer a large set of common features, but also slightly different ones, can interoperate? Is it possible to convince our tech giants to go for such a model?

This modularity has worked in compilers because a compiler is essentially a composition of a number of different services, sharing code between them to save effort.

Such a model might be a better alternative to antitrust, which is going to be a messy process. Obviously, the current model we have is bad, and I personally believe antitrust is better than the status quo. Forcing tech companies to open up their platforms and make them interoperable might be better than antitrust, though. Maybe it's even a form of antitrust?

This is something I'm curious to investigate further, particularly around drawing inspiration from existing pieces of software which handle complexity and scale effectively.

