Skip to content

Compilers

1. Compiler Overview

1.1 Phases

Source Code
→ Lexical Analysis (tokens)
→ Syntax Analysis (parse tree)
→ Semantic Analysis (annotated tree)
→ Intermediate Representation (IR)
→ Code Optimization (optimized IR)
→ Code Generation (target code)
→ Target Program

1.2 Symbol Table

A data structure maintained throughout compilation storing:

EntryDescription
NameLexeme / identifier
KindVariable, function, type, etc.
TypeInt, float, struct, array
ScopeBlock, function, file
OffsetMemory location
AttributesArray dimensions, parameters

2. Lexical Analysis

2.1 Tokens and Patterns

A token is a pair type,lexeme\langle \text{type}, \text{lexeme} \rangle.

Token types: Keywords, identifiers, literals, operators, delimiters, comments (discarded).

Example:

Source: int x = 42 + y;
Tokens: <TYPE, "int"> <ID, "x"> <OP, "="> <INT_LIT, "42"> <OP, "+"> <ID, "y"> <DELIM, ";">

2.2 Regular Expressions for Tokens

| Token | Regular Expression | | ---------- | ---------------------------------- | ---- | ----- | ------ | --- | ---- | --- | --- | | Identifier | [a-zA-Z_][a-zA-Z0-9_]* | | Integer | [0-9]+ | | Float | [0-9]+\.[0-9]+([eE][+-]?[0-9]+)? | | Keyword | if | else | while | return | int | ... | | Operator | + | - | \* | / | == | != | <= | >= | | Whitespace | [ \t\n\r]+ (skip) |

2.3 Finite Automata for Lexing

Build a DFA from token regexes using Thompson”s construction + subset construction. The DFA accepts the longest matching prefix (maximal munch).

LEXER(source):
pos = 0
tokens = []
while pos < len(source):
match = LONGEST_MATCH(source, pos, DFA)
if match == null:
error("invalid token at position", pos)
tokens.append(match.token)
pos += match.length
return tokens

2.4 Handling Ambiguity

Maximal munch: Always match the longest possible token. 123abc123 abc, not 1 2 3 abc.

Keyword priority: Keywords override identifiers. if → keyword if, not identifier if.

3. Parsing

3.1 Context-Free Grammars

A grammar G=(V,Σ,R,S)G = (V, \Sigma, R, S) defines the syntactic structure.

Expression grammar:

E → E + T | T
T → T * F | F
F → ( E ) | id | num

3.2 Parse Trees and Derivations

Leftmost derivation: Always expand the leftmost variable.

EE+TT+TF+Tid+Tid+TFid+FFid+ididE \Rightarrow E + T \Rightarrow T + T \Rightarrow F + T \Rightarrow \mathbf{id} + T \Rightarrow \mathbf{id} + T * F \Rightarrow \mathbf{id} + F * F \Rightarrow \mathbf{id} + \mathbf{id} * \mathbf{id}

Rightmost derivation: Always expand the rightmost variable (used in bottom-up parsing).

3.3 Ambiguity and Precedence

Ambiguous grammar for expressions:

E → E + E | E * E | ( E ) | id

Disambiguated with precedence:

E → E + T | T // + has lower precedence
T → T * F | F // * has higher precedence
F → ( E ) | id

Associativity:

  • Left-associative (abc=(ab)ca - b - c = (a - b) - c): EE op TE \to E \text{ op } T
  • Right-associative (a=b=ca=(b=c)a = b = c \Rightarrow a = (b = c)): ET op EE \to T \text{ op } E

3.4 LL Parsing (Top-Down)

LL(1): Left-to-right scan, Leftmost derivation, 1 token of lookahead.

Requirements:

  1. No left recursion.
  2. The grammar is factored (no common prefixes).

Eliminating left recursion:

// Replace: A → Aα | β
// With:
A → βA'
A' → αA' | ε

Left factoring:

// Replace: A → αβ1 | αβ2
// With:
A → αA'
A' → β1 | β2

LL(1) parsing table construction:

For each production A → α:
For each terminal a in FIRST(α):
add A → α to TABLE[A, a]
If ε ∈ FIRST(α):
For each terminal b in FOLLOW(A):
add A → α to TABLE[A, b]

FIRST and FOLLOW sets:

FIRST(α)={a:αaβ}{ϵ} if αϵ\text{FIRST}(\alpha) = \{a : \alpha \Rightarrow^* a\beta\} \cup \{\epsilon\} \text{ if } \alpha \Rightarrow^* \epsilon

FOLLOW(A)={a:SαAaβ}{$ if SαA}\text{FOLLOW}(A) = \{a : S \Rightarrow^* \alpha Aa\beta\} \cup \{\$ \text{ if } S \Rightarrow^* \alpha A\}

LL(1) condition: No multiple entries in the parsing table.

3.5 LR Parsing (Bottom-Up)

LR(k): Left-to-right scan, Rightmost derivation in reverse, kk tokens lookahead.

Shift-reduce parsing:

LR_PARSE(input):
stack = [$]
push start state
lookahead = next_token(input)
while true:
s = top of stack
if ACTION[s, lookahead] == SHIFT t:
push lookahead
push state t
lookahead = next_token(input)
elif ACTION[s, lookahead] == REDUCE A → α:
pop 2|α| symbols
push A
push GOTO[prev_state, A]
output production A → α
elif ACTION[s, lookahead] == ACCEPT:
return success
else:
error("syntax error")

LR parsing table = ACTION + GOTO:

StateACTIONGOTO
Shift to state, Reduce by productionGo to state on nonterminal

3.6 LR Variants

VariantPowerTable SizeConstruction Cost
SLR(1)Least powerfulSmallLow
LALR(1)MediumMediumMedium
LR(1)Most powerfulLargeHigh

SLR(1): Uses FOLLOW sets for reduce decisions. May miss valid reductions.

LALR(1): Merges LR(1) states with the same core. Used by Yacc, Bison.

LR(1): Full power. Each state has unique lookahead context. Rarely needed in practice.

3.7 Recursive Descent (Hand-Written LL)

EXPR():
result = TERM()
while token == '+':
advance()
right = TERM()
result = BinOp('+', result, right)
return result
TERM():
result = FACTOR()
while token == '*':
advance()
right = FACTOR()
result = BinOp('*', result, right)
return result
FACTOR():
if token == '(':
advance()
result = EXPR()
expect(')')
return result
elif token is ID:
name = token.lexeme
advance()
return Var(name)
elif token is NUM:
val = token.value
advance()
return Num(val)
else:
error("unexpected token")

Pros: Clear, extensible, good error messages. Cons: Manual construction, no automatic generation.

4. Abstract Syntax Tree (AST)

4.1 From Parse Tree to AST

The parse tree preserves all grammar details. The AST simplifies to essential structure:

Parse tree: (E (E (T (F id:x))) + (T (T (F num:42)) * (T (F id:y))))
AST: BinOp('+', Var('x'), BinOp('*', Num(42), Var('y')))

Rules for AST construction:

  • Remove punctuation tokens (parentheses, commas).
  • Collapse chains of single-child nodes.
  • Use operator-specific node types.

4.2 AST Node Types

AST_NODE:
| IntLit(value)
| FloatLit(value)
| BoolLit(value)
| Var(name)
| BinOp(op, left, right)
| UnaryOp(op, operand)
| If(cond, then_branch, else_branch)
| While(cond, body)
| FuncCall(name, args)
| FuncDef(name, params, body)
| Return(expr)
| Block(statements)
| ArrayLit(elements)
| IndexExpr(array, index)

5. Semantic Analysis

5.1 Type Checking

Ensure operations are applied to correctly typed operands.

TYPE_CHECK(node):
switch node.type:
case IntLit: return INT
case Var: return lookup_type(node.name)
case BinOp: check_types(node.op, TYPE_CHECK(node.left), TYPE_CHECK(node.right))
case FuncCall:
params = TYPE_CHECK each arg
sig = lookup_function(node.name)
if sig.params != params: error "type mismatch"
return sig.return_type
case If:
if TYPE_CHECK(node.cond) != BOOL: error "condition must be boolean"
TYPE_CHECK(node.then_branch)
if node.else_branch: TYPE_CHECK(node.else_branch)

Type coercion rules:

ExpressionRule
int + intint
int + floatfloat (coerce int to float)
int / intint (integer division)

5.2 Type Inference

Hindley-Milner type inference (used in ML, Haskell):

Algorithm W:
Γ = {} (empty type environment)
infer(e, Γ):
case e:
x (variable): return instantiate(Γ(x))
λx.e (abstraction): α = fresh type variable
(τ, Γ') = infer(e, Γ ∪ {x : α})
return (α → τ, Γ')
e1 e2 (application): (τ1, Γ') = infer(e1, Γ)
(τ2, Γ'') = infer(e2, Γ')
α = fresh type variable
unify(τ1, τ2 → α)
return (α, Γ'')
let x = e1 in e2: (τ1, Γ') = infer(e1, Γ)
(τ2, Γ'') = infer(e2, Γ' ∪ {x : generalize(τ1, Γ')})
return (τ2, Γ'')

Unification algorithm:

UNIFY(t1, t2):
if t1 == t2: return
if t1 is type variable and t1 not in t2: substitute(t1, t2)
if t2 is type variable and t2 not in t1: substitute(t2, t1)
if t1 = T(s1,...,sn) and t2 = T(s1,...,sm) and n == m:
for i = 1 to n: UNIFY(si, ti)
else: error "type mismatch"

5.3 Scope Resolution

Scoping rules:

  • Static (lexical) scoping: Name resolution depends on where the function is defined.
  • Dynamic scoping: Name resolution depends on where the function is called.
x = 10
function f():
return x // static: x = 10; dynamic: depends on caller
function g():
x = 20
return f()

Symbol table implementation: Nested hash maps or a stack of scopes.

6. Intermediate Representations

6.1 Three-Address Code

Instructions with at most one operator, two operands, and one result:

x = y op z // binary operation
x = op y // unary operation
x = y // copy
goto L // unconditional jump
if x goto L // conditional jump
if x op y goto L // conditional branch
param x // function parameter
call f, n // function call with n args
return x // function return
x = y[i] // array access
x[i] = y // array store

Example:

Source: a = (b + c) * d
TAC:
t1 = b + c
t2 = t1 * d
a = t2

6.2 Static Single Assignment (SSA)

Every variable is defined exactly once. New versions introduced at merge points using ϕ\phi-functions.

Before SSA:
x = 1
if cond:
x = 2
y = x + 3
After SSA:
x_1 = 1
if cond:
x_2 = 2
x_3 = φ(x_1, x_2)
y_1 = x_3 + 3

Dominance: Node dd dominates node nn if every path from entry to nn passes through dd.

Dominance frontier: The set of nodes where dominance ends — exactly where ϕ\phi-functions are placed.

INSERT_PHI_FUNCTIONS():
for each variable v:
worklist = all blocks that define v
while worklist not empty:
block = worklist.remove()
for each df in dominance_frontier(block):
if df has no φ-function for v:
insert φ-function for v at df
add df to worklist

6.3 Control Flow Graph (CFG)

A directed graph where:

  • Nodes are basic blocks (sequences of instructions with one entry, one exit, no branches in between).
  • Edges represent possible control flow.

Basic block construction:

BUILD_BASIC_BLOCKS(instructions):
leaders = first instruction
leaders += targets of branches/jumps
leaders += instructions after branches/jumps
for each leader:
basic_block = instructions from leader to next leader - 1

7. Optimization

7.1 Optimization Levels

LevelWhen AppliedScope
LocalWithin basic blockSingle block
GlobalWithin a functionSingle function with CFG
InterproceduralAcross functionsWhole program

7.2 Common Optimizations

Constant folding:

// Before
x = 3 + 5
// After
x = 8

Constant propagation:

// Before
x = 10
y = x + 5
// After
x = 10
y = 15

Dead code elimination:

// Before
x = compute() // x never used
y = x + 1 // y never used
return 42
// After
return 42

Common subexpression elimination:

// Before
t1 = a + b
t2 = a + b // same computation
// After
t1 = a + b
t2 = t1 // reuse

Loop invariant code motion:

// Before
for i = 0 to n:
x = a[i] * c // c doesn't change
// After
temp = c
for i = 0 to n:
x = a[i] * temp

Strength reduction:

x * 2 → x + x // or x << 1
x ** 2 → x * x
x / 8 → x >> 3 // for unsigned integers

Loop unrolling:

// Before
for i = 0 to 99:
a[i] = b[i] + c[i]
// After
for i = 0; i < 96; i += 4:
a[i] = b[i] + c[i]
a[i+1] = b[i+1] + c[i+1]
a[i+2] = b[i+2] + c[i+2]
a[i+3] = b[i+3] + c[i+3]

7.3 Data-Flow Analysis

Reaching definitions: Which definitions of a variable may reach each program point?

OUT[B]=gen[B](IN[B]kill[B])\text{OUT}[B] = \text{gen}[B] \cup (\text{IN}[B] - \text{kill}[B]) IN[B]=Ppred(B)OUT[P]\text{IN}[B] = \bigcup_{P \in \text{pred}(B)} \text{OUT}[P]

Iterate until convergence (fixed point).

Other analyses:

AnalysisPurpose
Live variable analysisOptimize register allocation
Available expressionsEnable CSE
Very busy expressionsEnable code motion
Constant propagationFold constants

7.4 Register Allocation

Problem: Map an unlimited set of virtual registers to a finite set of physical registers.

Graph coloring approach:

  1. Build an interference graph: vertices = virtual registers, edges = simultaneous liveness.
  2. Color the graph with kk colors (physical registers).
  3. If coloring fails, spill a variable (move to memory).
REGISTER_ALLOCATE(interference_graph, k_registers):
stack = []
while nodes remain in interference_graph:
if node has degree < k:
push node to stack
remove node and edges
else:
spill one node (store to memory)
while stack not empty:
node = stack.pop()
assign color different from all neighbors
return allocation

NP-hard as a general principle; heuristic algorithms (simplify, coalesce, freeze, spill-and-reload) work well in practice.

8. Code Generation

8.1 Code Generation from TAC

// TAC: // Assembly:
t1 = b + c mov eax, [b]
t2 = t1 * d add eax, [c]
a = t2 mov ecx, [d]
imul eax, ecx
mov [a], eax

8.2 Instruction Selection

Map IR operations to target machine instructions. Can use:

  • Pattern matching: Match IR patterns against instruction templates.
  • Tree pattern matching: AST tree rewriting with optimal cover (dynamic programming / BURG).

8.3 Instruction Scheduling

Reorder instructions to avoid pipeline stalls:

// Before (stall on dependency):
load r1, [addr] // latency: 3 cycles
add r2, r1, r3 // stalls 2 cycles waiting for r1
store r2, [out]
// After (reorder to hide latency):
load r1, [addr] // start load early
mov r4, [other] // independent work
add r5, r4, r6 // more independent work
add r2, r1, r3 // r1 ready now
store r2, [out]

8.4 Target Code

x86-64 example:

; int add(int a, int b) { return a + b; }
add:
push rbp
mov rbp, rsp
mov eax, edi ; first arg in edi
add eax, esi ; second arg in esi
pop rbp
ret

9. Common Pitfalls

  1. Left recursion in LL grammars causes infinite loops. A rule like EE+TE \to E + T causes a recursive descent parser to loop. Always eliminate left recursion before building an LL parser.

  2. FIRST/FOLLOW set computation errors. Incorrect sets lead to parsing table conflicts. Remember: ϵFIRST(α)\epsilon \in \text{FIRST}(\alpha) only if α\alpha can derive the empty string.

  3. Forgetting to insert ϕ\phi-functions at dominance frontiers. SSA requires ϕ\phi-functions precisely at points where control flow merges. Placing them incorrectly leads to wrong data flow.

  4. Incorrect interference graph for register allocation. Two variables interfere only if they are simultaneously live. If one’s definition kills the other before use, they do not interfere.

  5. Confusing parse trees and ASTs. Parse trees preserve all grammar productions (including punctuation). ASTs abstract away syntactic details to represent semantic structure.

  6. Semantic analysis after code generation. Type errors and scope violations must be caught during semantic analysis, before optimization and code generation. Fixing them later is much harder.

  7. Ignoring phase ordering issues. Some optimizations interfere with others. The order of optimization passes affects the final result. Common order: SSA construction → constant propagation → dead code elimination → loop optimizations → register allocation.

Worked Examples

Example 1: Constructing an LL(1) Parse Table

Problem: Given the grammar: E -> T E’, E’ -> + T E’ | epsilon, T -> F T’, T’ -> * F T’ | epsilon, F -> ( E ) | id. Construct the FIRST and FOLLOW sets for E’. Solution: FIRST(E’) = FIRST(T E’) = {id, (} union {+} = {id, (, +}. FIRST(T) = {id, (}. FOLLOW(E’) = FOLLOW(E) = {, )}. The LL(1) parse table entry for E' with lookahead +: E' -> + T E'. For lookahead id or (: E' -> epsilon. For lookahead or ): E’ -> epsilon. Since epsilon is in FIRST(E’) and FOLLOW(E’) overlaps with FIRST(E’), entries for FOLLOW symbols map to the epsilon production.

Example 2: Three-Address Code Generation

Problem: Generate three-address code for: x = (a + b) _ (c - d). Solution: t1 = a + b; t2 = c - d; t3 = t1 _ t2; x = t3. The three-address code uses temporary variables t1, t2, t3. Each instruction has at most one operator on the right side. A basic block contains these four instructions in sequence.

Summary

  • Lexical analysis converts source to tokens using regex and finite automata.
  • Parsing builds parse trees from tokens using LL (top-down) or LR (bottom-up) methods.
  • ASTs abstract away grammar details to represent program structure.
  • Semantic analysis performs type checking, type inference (Hindley-Milner), and scope resolution.
  • IR (three-address code, SSA) provides a platform-independent program representation.
  • Optimization includes constant folding, dead code elimination, CSE, loop optimization, and register allocation via graph coloring.
  • Code generation maps IR to target machine instructions with instruction selection and scheduling.

Cross-References

TopicLink
Automata and Formal LanguagesView
Algorithm DesignView
Complexity TheoryView