Newt's Cove

Compilers: Query, Key, Value? (draft)

2025-05-25

Notice: This is a really really early DRAFT. Here be dragons.

Compilers are very pointer heavy, as they often use linked lists for lists of operations, trees, graphs, and if they use it, single static assignment (SSA) form is riddled with pointers (SSA is a program representation where the result of an operation is a pointer to that operation, essentially meaning that a program in SSA is a graph with 'no' statements). This is a huge blow to performance, as compilers like LLVM (used in Clang) and GCC spend an enormous amount of time dereferencing pointers! Arrays are used, but heavy pointer use reigns supreme in compiler implementations.

However, I believe that this might change in the future. I've found that some recent compiler implementations have been moving towards an alternative implementation strategy known as data oriented design (and in compilers, the term 'data oriented compiler design' is sometimes thrown around). The idea is that instead of utilizing pointers extensively through the compiler, many data structures are flattened into a representation that fits nicely into an array (and yes, that even includes trees and graphs). Some recent compiler implementations that have moved towards data oriented design are the Carbon and Zig language compilers. The benefit of this new paradigm is primarily performance due to increased data locality. To summarize why data locality leads to improved performance: with pointers, the processor constantly is jumping around memory, but if it can stay in the same area of memory, then it can cache that general area of memory (and cached memory accesses are significantly faster!).

Recently I've been writing a toy compiler in the Rust programming language, and I've actually been utilizing this data-oriented approach...but not for performance! You see, data-oriented design can actually lead to better memory safety if done right. In my implementation, I represent a graph as an array of nodes. Instead of pointers, I use indices into that array. Where before you would need to separately allocate each node, the array only needs to be allocated once! This makes it much easier to be certain you haven't made a memory allocation mistake. In my case, making my data structure's memory management easier to reason about is much more important than in a language with manual memory management like C++ ...because Rust doesn't let me make mistakes. If the compiler can't figure out where the life of every variable begins and ends, it simply refuses to compile the program. So in the 'safe' subset of Rust, I really don't have any other option than to use data-oriented design (in this case, as is very often the case, Rust's strictness encourages you to write better and more performant code, which can be both annoying and refreshing).

Also, here's a fun fact: 'data-oriented design' originated from the field of game development! A common architecture in that field is 'Entity-Component-System' (ECS, and no, I did not make a grammatical error because ECS is not a 'system'), which is an architectural design philosophy I've found very fascinating recently. I feel like the overlap between ECS, arena/stack/pool allocation, data oriented design/compiler design, and (if you bend your mind a little bit) even relational databases is perhaps larger than most people realize, which I think is just absolutely fascinating.