Sunday, July 2, 2023

Notes on Compilers

Lately I've been revisiting compiler engineering and linguistics. It reminded me of Herbert Gross' lectures on calculus. We have definitions, which allow us to create rules - relations between definitions - and from these two things, we can derive objectives or strategies. And in both software compilation and reverse engineering, we do a similar thing, analyzing syntax to build a tree of definitions, which in turn allows us to infer rules and relations, and objectively construct or deconstruct a thing.

Compilation

When building software, we usually compile the code we've written, and something like gcc abstracts away a bunch of work that we usually don't see and therefore don't think about.

First we have a 'lexer' that performs lexical analysis on the source code. It operates on the raw text: encodings, whitespace, comments. This often involves a finite state machine, and in the end it produces tokens.
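As a rough sketch (my own illustration, not gcc's actual lexer), a lexer for the small arithmetic expressions used later in this post might look something like this: skip whitespace, then classify each run of characters as a token.

#include <ctype.h>
#include <stdio.h>

/* Token categories for a tiny arithmetic language. */
typedef enum { TOK_NUMBER, TOK_OPERATOR, TOK_LPAREN, TOK_RPAREN, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];
} Token;

/* Read the next token starting at *pos, advancing *pos past it. */
Token nextToken(const char* src, int* pos) {
    Token tok = { TOK_END, "" };
    while (isspace((unsigned char)src[*pos])) (*pos)++;   /* skip whitespace */
    char c = src[*pos];
    if (c == '\0') return tok;                            /* end of input */
    if (isdigit((unsigned char)c)) {                      /* a run of digits is one number token */
        int i = 0;
        while (isdigit((unsigned char)src[*pos]) && i < 31) tok.text[i++] = src[(*pos)++];
        tok.text[i] = '\0';
        tok.kind = TOK_NUMBER;
    } else {                                              /* single-character tokens */
        tok.text[0] = c;
        tok.text[1] = '\0';
        tok.kind = (c == '(') ? TOK_LPAREN : (c == ')') ? TOK_RPAREN : TOK_OPERATOR;
        (*pos)++;
    }
    return tok;
}

int main(void) {
    const char* src = "(7 - 1) * 60";
    int pos = 0;
    for (Token t = nextToken(src, &pos); t.kind != TOK_END; t = nextToken(src, &pos))
        printf("token: %s\n", t.text);
    return 0;
}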

Then we have a parser that takes those tokens and uses them to forge an abstract syntax tree. This handles the relationships between tokens, fleshing out structures, nesting, and so on. The grammar being parsed can be context-free or not, and parsing often involves a pushdown automaton. It can be driven by various grammars; I'll cover these in the grammars section.

After a parser has processed its data, there's semantic analysis. This can be thought of as a kind of "verification" of correctness with regard to the abstract syntax tree that's been generated. So, it performs name resolution, input validation, and type checking.
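For instance, both of the following lines of C are perfectly well-formed as far as the parser is concerned (a small illustration of my own); it is semantic analysis, not the grammar, that complains about them:

int x = "hello";  /* parses fine; the type checker flags assigning a string (a pointer) to an int */
y = 42;           /* parses fine; name resolution fails if y was never declared */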

After this, the abstract syntax tree is converted to an IR, or intermediate representation, and then to machine code. This involves modeling and optimizing the program for a particular target machine architecture.

Grammars

First we have LL, or "Left-to-Right, Leftmost derivation." This is a top-down strategy, meaning that the parser begins at the root of the syntax tree and works its way down to the leaves. The parser predicts which production rule to apply based on the next input symbol, referred to as the "look-ahead." LL parsers can also look ahead by more than one token, and so there exist variants denoted LL(1), LL(2), LL(n), etc., where n indicates the number of tokens of look-ahead.
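To illustrate (a made-up grammar fragment, not any particular language), suppose we have the productions:

Stmt -> "if" "(" Expr ")" Stmt
Stmt -> "while" "(" Expr ")" Stmt
Stmt -> Identifier "=" Expr ";"

An LL(1) parser peeks at the single next token: if it sees "if" it commits to the first production, "while" to the second, and an identifier to the third, without ever backing up.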

On the other hand, LR stands for "Left-to-Right, Rightmost derivation" (performed in reverse). This is precisely the opposite strategy: a bottom-up technique, where parsing begins with the input symbols, the leaves of the syntax tree, and works its way up to the root.

And then we have LALR, "Look-Ahead LR." This is a variant of LR designed to be far more practical than canonical LR parsers: by merging states into a simplified version of the LR parsing tables, it reduces memory requirements while still handling most grammars of practical interest (a slightly smaller class than full LR).

LL parsers start from the top of the syntax tree and work down, while LR parsers start from the input and build upward. Thus we say that LL parsers are predictive, while LR parsers are reductive.

In this sense, predictive and reductive are merely terms that denote the aforementioned behavior: That is to say, LL parsers predict which production to use based on the "look-ahead," while LR parsers reduce input symbols using production rules.

The parenthesized numbers denote the amount of look-ahead: LL(n) and LR(n) are grammars whose parsers use n tokens of look-ahead to make decisions. A greater n allows for handling more complex grammars, but it may cost more memory.

Last, we have LALR parsers which, as described above, keep the bottom-up LR approach but compress its tables in an attempt to optimize for efficiency.

In Practice

Let us first define grammar. We say that a grammar describes a language: it uses some set of symbols and adheres to some set of constraints, and from these it produces clauses and axioms. Within that set of constraints there follows some formal or informal syntax, or ways to infer meanings. There are a few branches of language processing in software, which we can broadly generalize as follows. The first is straightforward:

Source -> [compiler] -> target program

In this first example, we push our source into a compiler and it emits a program. We also have another common language processor, which is simply an interpreter. We give this interpreter some source and input and receive output.

Source and/or input -> [interpreter] -> output

And last, we have a third kind of compiler, which first compiles a program into bytecode that is then interpreted by a virtual machine. In this vein, we can write code once and have it interpreted elsewhere. This is also the setting for Just-in-Time compilers: rather than interpreting the bytecode, a JIT translates it into machine instructions at run time, just before that code is executed:

Source -> [translator] -> Intermediate program -> [VM] -> Output

Similarly, a program might first pass through a preprocessor that generates specific features, or a compiler might emit assembly targeting a particular computer architecture.

But how does this function? What mechanisms and abstractions make it possible? A compiler essentially works by reading a stream of characters, performing lexical, syntactic, and semantic analysis, generating some intermediate code, and then operating on that code to produce code for a particular target. If that didn't come through in the paragraphs above, to borrow a very nice infographic from the Dragon Book, we begin with a simple mathematical statement to be translated: position = initial + rate * 60:
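The figure itself isn't reproduced here, but roughly speaking (paraphrasing the Dragon Book's front-end stages from memory rather than copying the figure), the statement passes through something like this:

position = initial + rate * 60
-> [lexical analysis] -> tokens: <id, position> <=> <id, initial> <+> <id, rate> <*> <60>
-> [syntax analysis] -> a tree with '=' at the root, 'position' as its left child, and as its right child a '+' node over 'initial' and a '*' node over 'rate' and 60
-> [semantic analysis] -> the integer 60 is coerced to a float, via something like inttofloat(60)
-> [intermediate code generation] -> three-address code along the lines of:

t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3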

In the above example, we see that the compiler performs an analysis and generates a syntax tree according to a context-free grammar. Each node of the tree denotes a construct occurring in the text. The term context-free grammar is kind of misleading, however; it's used to denote a grammar whose rules can be applied regardless of context. From Wikipedia:

... a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context.

In parsing, a compiler operates on a program whose structure is not known in advance. It uses non-terminal symbols and production rules as a strategy to generate a syntax tree. If you don't yet know what terminal and non-terminal symbols are, I will explain them in the next few paragraphs.

"Context-free" doesn't mean that the grammar itself exists in a vacuum. Ironically, without the surrounding abstractions, a context-free grammar would not be able to conceptually disambiguate itself.

Abstract syntax trees in compiler-land are akin to the way we might computationally parse a sentence in natural language. From a technical perspective, this might look like recursively iterating over a set of data to interpret a grammar. From Wikipedia, a recursive descent parser is defined as:

A kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

Note that terminals and nonterminals are much the same as in linguistics. For example, the sentence "John hit the ball" is composed of a noun and a verb phrase, or predicate. The verb phrase is constituted by a verb and a noun phrase. And in turn, that noun phrase is constituted by a determiner and a noun.

The words "sentence," "noun," "verb phrase," "noun phrase," and "determiner" can be thought of as non-terminals. They are non-terminating identifiers which we can expand to describe other words, in contrast to the words "John," "hit," "the," and "ball," which are terminals. "John" is a noun, a subject that is part of a sentence. Inversely, "noun" is a non-terminal, as it can stand in for many different words.
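Written out as context-free production rules (my own rendering of this classic example), that structure looks something like:

Sentence -> NounPhrase VerbPhrase
VerbPhrase -> Verb NounPhrase
NounPhrase -> Determiner Noun | Noun
Determiner -> "the"
Noun -> "John" | "ball"
Verb -> "hit"

The names on the left are non-terminals, which can always be expanded further; the quoted words are terminals, where the derivation stops.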

Every context-free grammar can be recognized by an equivalent pushdown automaton (in general a nondeterministic one). A pushdown automaton is essentially a finite state machine with a stack. A close relative, the finite state transducer, is a finite state machine that both reads input and produces output. Formally, a finite state transducer is defined as a 7-tuple: $M = (Q, \Sigma, \Delta, \delta, \sigma, q_{0}, F)$, where $Q$ is a set of states, $\Sigma$ is an input alphabet, $\Delta$ is an output alphabet, $\delta$ is a transition relation that maps $Q \times \Sigma \rightarrow Q$, $\sigma$ is an output relation that maps $Q \times \Sigma \rightarrow \Delta$, $q_{0}$ is a start state, and $F$ is the set of final states.

You can find some notes on finite state transducers here titled "Finite-State Transducers for Text Rewriting". And to cite a simple illustration from the aforementioned blog:

In this perhaps overly potent simplification of an example, the machine has two states, as well as an input language and an output language. Its input alphabet is {a, b}, and its output alphabet is {x, y}. It replaces input 'a' with 'x', and 'b' with 'y'. For input 'a', it outputs 'x'. For input 'ab', it outputs 'xy'. For input 'abb', it outputs 'xyy', and so on.
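Here is a minimal sketch of that machine in C (my own illustration, not code from the linked post; and since the mapping never depends on history, a single state suffices in this sketch):

#include <stdio.h>

/* A trivial finite state transducer over input alphabet {a, b} and
 * output alphabet {x, y}: for each input symbol, emit one output symbol. */
static void transduce(const char* input) {
    for (const char* p = input; *p != '\0'; p++) {
        if (*p == 'a')
            putchar('x');
        else if (*p == 'b')
            putchar('y');
        else
            return; /* a symbol outside the input alphabet: reject the string */
    }
    putchar('\n');
}

int main(void) {
    transduce("a");   /* prints x   */
    transduce("ab");  /* prints xy  */
    transduce("abb"); /* prints xyy */
    return 0;
}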

A Toy Example

We can assemble some of the concepts we've discussed here, combining recursive descent and constructing a parse tree using the C programming language, implementing our own simple machine to rewrite text. Consider the equation: ((7-1)*(0+6)). We will define this in a C program, create nodes so we can build a syntax tree, then carve out several functions to recursively parse and rewrite the equation, following operator precedence, converting it to postfix notation. That is, we will move its operators to the right-hand side of each part of the equation, so it reads like this: 7 1 - 0 6 + *

So, we'll need to parse two flavors of characters here: our parentheses and operators '(', ')', '+', '-', '*' and '/', and our digits '7', '1', '0', '6'. We'll break the expression parsing into several functions, in the spirit of good programming.
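The snippets below lean on a Node structure and a createNode helper defined in the gist linked at the end; reconstructed here (it presumably looks something like this) so the functions are self-contained:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* A binary syntax tree node holding a single character:
 * either a digit ('0'-'9') or an operator ('+', '-', '*', '/'). */
typedef struct Node {
    char value;
    struct Node* left;
    struct Node* right;
} Node;

Node* createNode(char value) {
    Node* node = malloc(sizeof(Node));
    node->value = value;
    node->left = NULL;
    node->right = NULL;
    return node;
}

/* Forward declarations: the parse functions call one another recursively. */
Node* parseExpression(const char* input, int* index);
Node* parseTerm(const char* input, int* index);
Node* parseParenthesis(const char* input, int* index);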

Our first function:

Node* parseParenthesis(const char* input, int* index) {
    if (input[*index] == '(') {
        (*index)++; // Move past the opening parenthesis

        Node* expression = parseExpression(input, index);

        if (input[*index] == ')') {
            (*index)++; // Move past the closing parenthesis
        }

        return expression;
    } else if (isdigit(input[*index])) {
        Node* number = createNode(input[*index]);
        (*index)++; // Move past the digit
        return number;
    }

    return NULL; // Unexpected character: neither '(' nor a digit
}

And next, numerical expressions. In each function, we immediately and recursively call the next. When our parsing begins in main(), it calls parseExpression, which calls parseTerm, which calls parseParenthesis and detects an open parenthesis. The process repeats until the first digit is found and a node is created for it. The index is incremented past the digit, the number node is returned, and the parse functions are called again. This idiom repeats, breaking each part of the expression down by parentheses. Inside, we parse the left and right sides of each expression. When an operator is reached, we move past it, parse the right side, then create a new node for the operator and make it the root, with the previously parsed subtree as its left child. Printed in post-order, this effectively shifts our operators '+', '-', '*', '/' to the right side of each parenthesized part of the expression.

Node* parseExpression(const char* input, int* index) {
    Node* left = parseTerm(input, index);

    while (input[*index] == '+' || input[*index] == '-') {
        char operator = input[*index];
        (*index)++; // Move past the operator

        Node* right = parseTerm(input, index);

        Node* root = createNode(operator);
        root->left = left;
        root->right = right;

        left = root;
    }

    return left;
}

We then repeat this idiom with the multiplication and division operators. Because parseTerm is called from inside parseExpression, '*' and '/' bind more tightly than '+' and '-', which is how operator precedence falls out of the structure of the functions.

Node* parseTerm(const char* input, int* index) {
    Node* left = parseParenthesis(input, index);

    while (input[*index] == '*' || input[*index] == '/') {
        char operator = input[*index];
        (*index)++; // Move past the operator

        Node* right = parseParenthesis(input, index);

        Node* root = createNode(operator);
        root->left = left;
        root->right = right;

        left = root;
    }

    return left;
}

Our output is now in postfix notation: 7 1 - 0 6 + *. We printed the nodes of the tree in post-order, left subtree, then right subtree, then the node itself, which rearranges each part of the equation left to right with its operator last. This is a simplistic example, intended to be a demonstration of how parsing can work to create a sort of transducer in order to rewrite something.
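For completeness, the printing step and the driver (again reconstructed; the gist's version may differ slightly) are just a post-order traversal of the tree, which is what puts the operators last:

/* Post-order traversal: left subtree, right subtree, then the node itself. */
void printPostfix(Node* node) {
    if (node == NULL) return;
    printPostfix(node->left);
    printPostfix(node->right);
    printf("%c ", node->value);
}

int main(void) {
    const char* input = "((7-1)*(0+6))";
    int index = 0;
    Node* root = parseExpression(input, &index);
    printPostfix(root); /* prints: 7 1 - 0 6 + * */
    printf("\n");
    return 0;
}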

Gist here. As far as I can tell, this is also how clang and various other C parsers work, utilizing a hand-written recursive descent algorithm. Of course, there's much more work a compiler does beyond lexical and syntactic analysis: it must disambiguate what is and isn't part of the language, apply further inner abstractions delineating what is allowed and what is not, and perform intermediate code generation and optimization. I'll write more about this in a later post.


Grammar as universal structure

Lacan once quipped that "It is with the appearance of language that the dimension of truth emerges." But we also find that, far beyond words, the universe, in a very empirical way, is rich with language. Music, for example, is a place where we find grammatical structure. We begin with a particular tonal space, which we then conceptually divide chromatically or into semi-tonal structures. Then, notes can be combined to form chords with various extensions, augmentations, inversions, and so on, and perhaps follow a series of notes which, also derived from the grammar, lead the whole of the music. And in a much wider view, we would be well off to note quantum field theory, which posits a model of the universe as a set of harmonic oscillators.

This is much like linguistic structure: characters to words, words to sentences, and so on. When we talk about linguistics, we think of the phonetic structure of sounds; the morphological structure, that is, the structure of words; then the syntax of the language, the way sentences are organized; and then the semantic structure defining the meaning of the words and sentences.

The universal sentence structure is grammatically divided into two parts, a subject and a predicate. For example, consider this excerpt from "Engineering a Compiler":

Punctuation marks, such as colons, semicolons, commas, and various brackets, can be represented by their character representations.

We have a subject: "Punctuation marks, such as colons, semicolons, commas, and various brackets..." And we also have a predicate, constituted by a verb and an object: "...can be represented by their character representations."

While "Engineering a Compiler" is a great read about how regular expressions are translated into automata through syntax analysis—in a very metaphysical vein, in some sense, compilers aren't necessarily just about compilers. This becomes quite Freudian and Lacanian when we consider lexical (and syntactical) analysis on an interpersonal and cultural level, decompiling language and meaning. But as for silicon machines: RIP Lacan, there's some probability you would have loved parsers and compilers.
