Compilers
1. Compiler Overview
1.1 Phases
Source Code → Lexical Analysis (tokens) → Syntax Analysis (parse tree) → Semantic Analysis (annotated tree) → Intermediate Representation (IR) → Code Optimization (optimized IR) → Code Generation (target code) → Target Program1.2 Symbol Table
A data structure maintained throughout compilation storing:
| Entry | Description |
|---|---|
| Name | Lexeme / identifier |
| Kind | Variable, function, type, etc. |
| Type | Int, float, struct, array |
| Scope | Block, function, file |
| Offset | Memory location |
| Attributes | Array dimensions, parameters |
2. Lexical Analysis
2.1 Tokens and Patterns
A token is a pair .
Token types: Keywords, identifiers, literals, operators, delimiters, comments (discarded).
Example:
Source: int x = 42 + y;Tokens: <TYPE, "int"> <ID, "x"> <OP, "="> <INT_LIT, "42"> <OP, "+"> <ID, "y"> <DELIM, ";">2.2 Regular Expressions for Tokens
| Token | Regular Expression | | ---------- | ---------------------------------- | ---- | ----- | ------ | --- | ---- | --- | --- | | Identifier | [a-zA-Z_][a-zA-Z0-9_]* | | Integer | [0-9]+ | | Float | [0-9]+\.[0-9]+([eE][+-]?[0-9]+)? | | Keyword | if | else | while | return | int | ... | | Operator | + | - | \* | / | == | != | <= | >= | | Whitespace | [ \t\n\r]+ (skip) |
2.3 Finite Automata for Lexing
Build a DFA from token regexes using Thompson”s construction + subset construction. The DFA accepts the longest matching prefix (maximal munch).
LEXER(source): pos = 0 tokens = [] while pos < len(source): match = LONGEST_MATCH(source, pos, DFA) if match == null: error("invalid token at position", pos) tokens.append(match.token) pos += match.length return tokens2.4 Handling Ambiguity
Maximal munch: Always match the longest possible token. 123abc → 123 abc, not 1 2 3 abc.
Keyword priority: Keywords override identifiers. if → keyword if, not identifier if.
3. Parsing
3.1 Context-Free Grammars
A grammar defines the syntactic structure.
Expression grammar:
E → E + T | TT → T * F | FF → ( E ) | id | num3.2 Parse Trees and Derivations
Leftmost derivation: Always expand the leftmost variable.
Rightmost derivation: Always expand the rightmost variable (used in bottom-up parsing).
3.3 Ambiguity and Precedence
Ambiguous grammar for expressions:
E → E + E | E * E | ( E ) | idDisambiguated with precedence:
E → E + T | T // + has lower precedenceT → T * F | F // * has higher precedenceF → ( E ) | idAssociativity:
- Left-associative ():
- Right-associative ():
3.4 LL Parsing (Top-Down)
LL(1): Left-to-right scan, Leftmost derivation, 1 token of lookahead.
Requirements:
- No left recursion.
- The grammar is factored (no common prefixes).
Eliminating left recursion:
// Replace: A → Aα | β// With:A → βA'A' → αA' | εLeft factoring:
// Replace: A → αβ1 | αβ2// With:A → αA'A' → β1 | β2LL(1) parsing table construction:
For each production A → α: For each terminal a in FIRST(α): add A → α to TABLE[A, a] If ε ∈ FIRST(α): For each terminal b in FOLLOW(A): add A → α to TABLE[A, b]FIRST and FOLLOW sets:
LL(1) condition: No multiple entries in the parsing table.
3.5 LR Parsing (Bottom-Up)
LR(k): Left-to-right scan, Rightmost derivation in reverse, tokens lookahead.
Shift-reduce parsing:
LR_PARSE(input): stack = [$] push start state lookahead = next_token(input) while true: s = top of stack if ACTION[s, lookahead] == SHIFT t: push lookahead push state t lookahead = next_token(input) elif ACTION[s, lookahead] == REDUCE A → α: pop 2|α| symbols push A push GOTO[prev_state, A] output production A → α elif ACTION[s, lookahead] == ACCEPT: return success else: error("syntax error")LR parsing table = ACTION + GOTO:
| State | ACTION | GOTO |
|---|---|---|
| Shift to state, Reduce by production | Go to state on nonterminal |
3.6 LR Variants
| Variant | Power | Table Size | Construction Cost |
|---|---|---|---|
| SLR(1) | Least powerful | Small | Low |
| LALR(1) | Medium | Medium | Medium |
| LR(1) | Most powerful | Large | High |
SLR(1): Uses FOLLOW sets for reduce decisions. May miss valid reductions.
LALR(1): Merges LR(1) states with the same core. Used by Yacc, Bison.
LR(1): Full power. Each state has unique lookahead context. Rarely needed in practice.
3.7 Recursive Descent (Hand-Written LL)
EXPR(): result = TERM() while token == '+': advance() right = TERM() result = BinOp('+', result, right) return result
TERM(): result = FACTOR() while token == '*': advance() right = FACTOR() result = BinOp('*', result, right) return result
FACTOR(): if token == '(': advance() result = EXPR() expect(')') return result elif token is ID: name = token.lexeme advance() return Var(name) elif token is NUM: val = token.value advance() return Num(val) else: error("unexpected token")Pros: Clear, extensible, good error messages. Cons: Manual construction, no automatic generation.
4. Abstract Syntax Tree (AST)
4.1 From Parse Tree to AST
The parse tree preserves all grammar details. The AST simplifies to essential structure:
Parse tree: (E (E (T (F id:x))) + (T (T (F num:42)) * (T (F id:y))))AST: BinOp('+', Var('x'), BinOp('*', Num(42), Var('y')))Rules for AST construction:
- Remove punctuation tokens (parentheses, commas).
- Collapse chains of single-child nodes.
- Use operator-specific node types.
4.2 AST Node Types
AST_NODE: | IntLit(value) | FloatLit(value) | BoolLit(value) | Var(name) | BinOp(op, left, right) | UnaryOp(op, operand) | If(cond, then_branch, else_branch) | While(cond, body) | FuncCall(name, args) | FuncDef(name, params, body) | Return(expr) | Block(statements) | ArrayLit(elements) | IndexExpr(array, index)5. Semantic Analysis
5.1 Type Checking
Ensure operations are applied to correctly typed operands.
TYPE_CHECK(node): switch node.type: case IntLit: return INT case Var: return lookup_type(node.name) case BinOp: check_types(node.op, TYPE_CHECK(node.left), TYPE_CHECK(node.right)) case FuncCall: params = TYPE_CHECK each arg sig = lookup_function(node.name) if sig.params != params: error "type mismatch" return sig.return_type case If: if TYPE_CHECK(node.cond) != BOOL: error "condition must be boolean" TYPE_CHECK(node.then_branch) if node.else_branch: TYPE_CHECK(node.else_branch)Type coercion rules:
| Expression | Rule |
|---|---|
int + int | int |
int + float | float (coerce int to float) |
int / int | int (integer division) |
5.2 Type Inference
Hindley-Milner type inference (used in ML, Haskell):
Algorithm W: Γ = {} (empty type environment) infer(e, Γ): case e: x (variable): return instantiate(Γ(x)) λx.e (abstraction): α = fresh type variable (τ, Γ') = infer(e, Γ ∪ {x : α}) return (α → τ, Γ') e1 e2 (application): (τ1, Γ') = infer(e1, Γ) (τ2, Γ'') = infer(e2, Γ') α = fresh type variable unify(τ1, τ2 → α) return (α, Γ'') let x = e1 in e2: (τ1, Γ') = infer(e1, Γ) (τ2, Γ'') = infer(e2, Γ' ∪ {x : generalize(τ1, Γ')}) return (τ2, Γ'')Unification algorithm:
UNIFY(t1, t2): if t1 == t2: return if t1 is type variable and t1 not in t2: substitute(t1, t2) if t2 is type variable and t2 not in t1: substitute(t2, t1) if t1 = T(s1,...,sn) and t2 = T(s1,...,sm) and n == m: for i = 1 to n: UNIFY(si, ti) else: error "type mismatch"5.3 Scope Resolution
Scoping rules:
- Static (lexical) scoping: Name resolution depends on where the function is defined.
- Dynamic scoping: Name resolution depends on where the function is called.
x = 10
function f(): return x // static: x = 10; dynamic: depends on caller
function g(): x = 20 return f()Symbol table implementation: Nested hash maps or a stack of scopes.
6. Intermediate Representations
6.1 Three-Address Code
Instructions with at most one operator, two operands, and one result:
x = y op z // binary operationx = op y // unary operationx = y // copygoto L // unconditional jumpif x goto L // conditional jumpif x op y goto L // conditional branchparam x // function parametercall f, n // function call with n argsreturn x // function returnx = y[i] // array accessx[i] = y // array storeExample:
Source: a = (b + c) * dTAC: t1 = b + c t2 = t1 * d a = t26.2 Static Single Assignment (SSA)
Every variable is defined exactly once. New versions introduced at merge points using -functions.
Before SSA: x = 1 if cond: x = 2 y = x + 3
After SSA: x_1 = 1 if cond: x_2 = 2 x_3 = φ(x_1, x_2) y_1 = x_3 + 3Dominance: Node dominates node if every path from entry to passes through .
Dominance frontier: The set of nodes where dominance ends — exactly where -functions are placed.
INSERT_PHI_FUNCTIONS(): for each variable v: worklist = all blocks that define v while worklist not empty: block = worklist.remove() for each df in dominance_frontier(block): if df has no φ-function for v: insert φ-function for v at df add df to worklist6.3 Control Flow Graph (CFG)
A directed graph where:
- Nodes are basic blocks (sequences of instructions with one entry, one exit, no branches in between).
- Edges represent possible control flow.
Basic block construction:
BUILD_BASIC_BLOCKS(instructions): leaders = first instruction leaders += targets of branches/jumps leaders += instructions after branches/jumps for each leader: basic_block = instructions from leader to next leader - 17. Optimization
7.1 Optimization Levels
| Level | When Applied | Scope |
|---|---|---|
| Local | Within basic block | Single block |
| Global | Within a function | Single function with CFG |
| Interprocedural | Across functions | Whole program |
7.2 Common Optimizations
Constant folding:
// Beforex = 3 + 5// Afterx = 8Constant propagation:
// Beforex = 10y = x + 5// Afterx = 10y = 15Dead code elimination:
// Beforex = compute() // x never usedy = x + 1 // y never usedreturn 42// Afterreturn 42Common subexpression elimination:
// Beforet1 = a + bt2 = a + b // same computation// Aftert1 = a + bt2 = t1 // reuseLoop invariant code motion:
// Beforefor i = 0 to n: x = a[i] * c // c doesn't change// Aftertemp = cfor i = 0 to n: x = a[i] * tempStrength reduction:
x * 2 → x + x // or x << 1x ** 2 → x * xx / 8 → x >> 3 // for unsigned integersLoop unrolling:
// Beforefor i = 0 to 99: a[i] = b[i] + c[i]// Afterfor i = 0; i < 96; i += 4: a[i] = b[i] + c[i] a[i+1] = b[i+1] + c[i+1] a[i+2] = b[i+2] + c[i+2] a[i+3] = b[i+3] + c[i+3]7.3 Data-Flow Analysis
Reaching definitions: Which definitions of a variable may reach each program point?
Iterate until convergence (fixed point).
Other analyses:
| Analysis | Purpose |
|---|---|
| Live variable analysis | Optimize register allocation |
| Available expressions | Enable CSE |
| Very busy expressions | Enable code motion |
| Constant propagation | Fold constants |
7.4 Register Allocation
Problem: Map an unlimited set of virtual registers to a finite set of physical registers.
Graph coloring approach:
- Build an interference graph: vertices = virtual registers, edges = simultaneous liveness.
- Color the graph with colors (physical registers).
- If coloring fails, spill a variable (move to memory).
REGISTER_ALLOCATE(interference_graph, k_registers): stack = [] while nodes remain in interference_graph: if node has degree < k: push node to stack remove node and edges else: spill one node (store to memory) while stack not empty: node = stack.pop() assign color different from all neighbors return allocationNP-hard as a general principle; heuristic algorithms (simplify, coalesce, freeze, spill-and-reload) work well in practice.
8. Code Generation
8.1 Code Generation from TAC
// TAC: // Assembly:t1 = b + c mov eax, [b]t2 = t1 * d add eax, [c]a = t2 mov ecx, [d] imul eax, ecx mov [a], eax8.2 Instruction Selection
Map IR operations to target machine instructions. Can use:
- Pattern matching: Match IR patterns against instruction templates.
- Tree pattern matching: AST tree rewriting with optimal cover (dynamic programming / BURG).
8.3 Instruction Scheduling
Reorder instructions to avoid pipeline stalls:
// Before (stall on dependency):load r1, [addr] // latency: 3 cyclesadd r2, r1, r3 // stalls 2 cycles waiting for r1store r2, [out]
// After (reorder to hide latency):load r1, [addr] // start load earlymov r4, [other] // independent workadd r5, r4, r6 // more independent workadd r2, r1, r3 // r1 ready nowstore r2, [out]8.4 Target Code
x86-64 example:
; int add(int a, int b) { return a + b; }add: push rbp mov rbp, rsp mov eax, edi ; first arg in edi add eax, esi ; second arg in esi pop rbp ret9. Common Pitfalls
Left recursion in LL grammars causes infinite loops. A rule like causes a recursive descent parser to loop. Always eliminate left recursion before building an LL parser.
FIRST/FOLLOW set computation errors. Incorrect sets lead to parsing table conflicts. Remember: only if can derive the empty string.
Forgetting to insert -functions at dominance frontiers. SSA requires -functions precisely at points where control flow merges. Placing them incorrectly leads to wrong data flow.
Incorrect interference graph for register allocation. Two variables interfere only if they are simultaneously live. If one’s definition kills the other before use, they do not interfere.
Confusing parse trees and ASTs. Parse trees preserve all grammar productions (including punctuation). ASTs abstract away syntactic details to represent semantic structure.
Semantic analysis after code generation. Type errors and scope violations must be caught during semantic analysis, before optimization and code generation. Fixing them later is much harder.
Ignoring phase ordering issues. Some optimizations interfere with others. The order of optimization passes affects the final result. Common order: SSA construction → constant propagation → dead code elimination → loop optimizations → register allocation.
Worked Examples
Example 1: Constructing an LL(1) Parse Table
Problem: Given the grammar: E -> T E’, E’ -> + T E’ | epsilon, T -> F T’, T’ -> * F T’ | epsilon, F -> ( E ) | id. Construct the FIRST and FOLLOW sets for E’. Solution: FIRST(E’) = FIRST(T E’) = {id, (} union {+} = {id, (, +}. FIRST(T) = {id, (}. FOLLOW(E’) = FOLLOW(E) = {, )}. The LL(1) parse table entry for E' with lookahead +: E' -> + T E'. For lookahead id or (: E' -> epsilon. For lookahead or ): E’ -> epsilon. Since epsilon is in FIRST(E’) and FOLLOW(E’) overlaps with FIRST(E’), entries for FOLLOW symbols map to the epsilon production.
Example 2: Three-Address Code Generation
Problem: Generate three-address code for: x = (a + b) _ (c - d). Solution: t1 = a + b; t2 = c - d; t3 = t1 _ t2; x = t3. The three-address code uses temporary variables t1, t2, t3. Each instruction has at most one operator on the right side. A basic block contains these four instructions in sequence.
Summary
- Lexical analysis converts source to tokens using regex and finite automata.
- Parsing builds parse trees from tokens using LL (top-down) or LR (bottom-up) methods.
- ASTs abstract away grammar details to represent program structure.
- Semantic analysis performs type checking, type inference (Hindley-Milner), and scope resolution.
- IR (three-address code, SSA) provides a platform-independent program representation.
- Optimization includes constant folding, dead code elimination, CSE, loop optimization, and register allocation via graph coloring.
- Code generation maps IR to target machine instructions with instruction selection and scheduling.
Cross-References
| Topic | Link |
|---|---|
| Automata and Formal Languages | View |
| Algorithm Design | View |
| Complexity Theory | View |