Skip to content

Context-Free Languages

3.1 Context-Free Grammars

A context-free grammar (CFG) is a 4-tuple G=(V,Σ,R,S)G = (V, \Sigma, R, S) where:

  • VV is a finite set of variables (non-terminals).
  • Σ\Sigma is a finite set of terminals.
  • RV×(VΣ)R \subseteq V \times (V \cup \Sigma)^* is a finite set of production rules.
  • SVS \in V is the start variable.

A production AαA \to \alpha means variable AA can be replaced by string α\alpha.

Example. CFG for L={anbn:n0}L = \{a^n b^n : n \geq 0\}:

SaSbεS \to aSb \mid \varepsilon

Example. CFG for balanced parentheses:

S(S)SSεS \to (S) \mid SS \mid \varepsilon

Example. CFG for {w{0,1}:w=wR}\{w \in \{0,1\}^* : w = w^R\} (palindromes):

S0S01S101εS \to 0S0 \mid 1S1 \mid 0 \mid 1 \mid \varepsilon

Example. CFG for L={aibjck:i=j+k}L = \{a^i b^j c^k : i = j + k\}:

SaSbT,TaTcεS \to aSb \mid T, \quad T \to aTc \mid \varepsilon

The variable SS generates the aibia^i b^i part, and TT generates the akcka^k c^k part. Since SaiSbiaiTbiai+kckbiS \Rightarrow^* a^i S b^i \Rightarrow a^i T b^i \Rightarrow^* a^{i+k} c^k b^iThe total number Of bS and cS equals the total number of aS.

Example. CFG for strings with equal numbers of 0S and 1S:

S0S11S0SSεS \to 0S1 \mid 1S0 \mid SS \mid \varepsilon

The first two rules add a matched pair (in either order); the third concatenates two balanced Strings; the fourth handles the empty string.

Example. CFG for the language of all strings over {a,b}\{a, b\} that are not palindromes:

SaAbbAaaSabSbab,AaAaaAbbAabAbabS \to aAb \mid bAa \mid aSa \mid bSb \mid a \mid b, \quad A \to aAa \mid aAb \mid bAa \mid bAb \mid a \mid b

Here SS generates non-palindromes and AA generates arbitrary strings of length 1\geq 1.

Worked Example: CFG for $\{a^i b^j : 2i = j \mathrm{ or} 2j = i\}$

SS1S2,S1aS1bbε,S2aaS2bεS \to S_1 \mid S_2, \quad S_1 \to aS_1bb \mid \varepsilon, \quad S_2 \to aaS_2b \mid \varepsilon

S1S_1 generates {aib2i}\{a^i b^{2i}\} and S2S_2 generates {a2jbj}\{a^{2j} b^j\}. Their union is the desired Language.

Proof of correctness for S1S_1. By induction: S1nanS1b2nanb2nS_1 \Rightarrow^n a^n S_1 b^{2n} \Rightarrow a^n b^{2n}. So L(S1)={anb2n:n0}L(S_1) = \{a^n b^{2n} : n \geq 0\}. Similarly L(S2)={a2nbn:n0}L(S_2) = \{a^{2n} b^n : n \geq 0\}. Hence L(S)=L(S1)L(S2)L(S) = L(S_1) \cup L(S_2)Which is exactly the target language. \blacksquare

3.2 Parse Trees and Ambiguity

A parse tree (derivation tree) for a string ww according to grammar GG is a tree where:

  • The root is labelled SS.
  • Internal nodes are labelled with variables.
  • Leaves are labelled with terminals (or ε\varepsilon).
  • Children of a node labelled AA are the symbols of the right-hand side of a production AαA \to \alpha in left-to-right order.

A CFG is ambiguous if some string in its language has two or more distinct parse trees (equivalently, Two or more leftmost derivations or two or more rightmost derivations).

Example (ambiguous). SS+SS×SidS \to S + S \mid S \times S \mid \mathrm{id}. The string id+id×id\mathrm{id} + \mathrm{id} \times \mathrm{id} has two parse trees: S+(S×S)S + (S \times S) and (S+S)×S(S + S) \times S.

Removing ambiguity. Some ambiguous grammars can be made unambiguous by rewriting the productions To enforce a particular evaluation order. For the arithmetic expression grammar:

EE+TT,TT×FF,F(E)idE \to E + T \mid T, \quad T \to T \times F \mid F, \quad F \to (E) \mid \mathrm{id}

This grammar is unambiguous and enforces the standard precedence (×\times before ++) and left Associativity.

Worked Example: Proving a grammar is unambiguous

Consider the grammar: SaSbabS \to aSb \mid ab. We prove it is unambiguous.

Proof. Any string in L(G)L(G) has the form anbna^n b^n for some n1n \geq 1. We show by induction On nn that anbna^n b^n has exactly one parse tree.

Base (n=1n = 1): abab has only the parse tree using SabS \to ab.

Inductive step: an+1bn+1a^{n+1} b^{n+1} must use SaSbS \to aSb (the rule SabS \to ab produces only Strings of length 2). The inner SS must derive anbna^n b^nWhich by IH has a unique parse tree. Hence an+1bn+1a^{n+1} b^{n+1} has exactly one parse tree. \blacksquare

Example. L={aibjck:i=jorj=k}L = \{a^i b^j c^k : i = j \mathrm{ or} j = k\} is inherently ambiguous. (Proof omitted; Any grammar must have two competing mechanisms for the two conditions, and these Interfere.)

3.3 Chomsky Normal Form

A CFG is in Chomsky Normal Form (CNF) if every production is of the form:

  • ABCA \to BC (two variables), or
  • AaA \to a (single terminal), or
  • SεS \to \varepsilon (only the start variable, and SS does not appear on the right of any rule).

Theorem 3.1. Every CFG has an equivalent CFG in Chomsky Normal Form.

Proof (conversion algorithm).

  1. Add a new start variable S0S_0 with rule S0SS_0 \to S (ensures SS does not appear on any right side if it generates ε\varepsilon).
  2. Remove all ε\varepsilon-productions AεA \to \varepsilon (except possibly S0εS_0 \to \varepsilon). For each occurrence of AA in a right-hand side, add a variant without AA.
  3. Remove all unit productions ABA \to B. For each chain ABA \Rightarrow^* BAdd AαA \to \alpha for each BαB \to \alpha.
  4. Convert remaining rules: replace terminals aa in right-hand sides of length 2\geq 2 with new variables TaT_a and rules TaaT_a \to a. Replace right-hand sides of length 3\geq 3 by introducing intermediate variables. \blacksquare
Worked Example: Converting a CFG to CNF

Convert the following grammar to CNF:

SAbB,AaAε,BbBεS \to AbB, \quad A \to aA \mid \varepsilon, \quad B \to bB \mid \varepsilon

Step 1: Add new start variable. S0SS_0 \to S.

Step 2: Remove ε\varepsilon-productions. Both AA and BB are nullable. For each production Containing AA or BBAdd variants with nullable symbols removed:

  • SAbBAbaBabS \to AbB \mid Ab \mid aB \mid ab
  • AaAaA \to aA \mid a
  • BbBbB \to bB \mid b
  • S0SS_0 \to S

Step 3: Remove unit productions. S0SS_0 \to S is a unit production. Replace with all SS-productions:

  • S0AbBAbaBabS_0 \to AbB \mid Ab \mid aB \mid ab

Step 4: Convert long right-hand sides and terminals.

Introduce TaaT_a \to a and TbbT_b \to b:

  • SATbBATbTaBTaTbS \to AT_bB \mid AT_b \mid T_aB \mid T_aT_b
  • S0ATbBATbTaBTaTbS_0 \to AT_bB \mid AT_b \mid T_aB \mid T_aT_b

Handle ATbBAT_bB (length 3): introduce C1ATbC_1 \to AT_bThen SC1BS \to C_1B and S0C1BS_0 \to C_1B.

Final CNF grammar:

S0C1BATbTaBTaTbS_0 \to C_1B \mid AT_b \mid T_aB \mid T_aT_b SC1BATbTaBTaTbS \to C_1B \mid AT_b \mid T_aB \mid T_aT_b ATaATaA \to T_aA \mid T_a BTbBTbB \to T_bB \mid T_b C1ATbC_1 \to AT_b Taa,TbbT_a \to a, \quad T_b \to b

3.4 Pushdown Automata

A pushdown automaton (PDA) is a 6-tuple M=(Q,Σ,Γ,δ,q0,F)M = (Q, \Sigma, \Gamma, \delta, q_0, F) where:

  • QQ is a finite set of states.
  • Σ\Sigma is the input alphabet.
  • Γ\Gamma is the stack alphabet.
  • δ:Q×(Σ{ε})×(Γ{ε})P(Q×(Γ{ε}))\delta : Q \times (\Sigma \cup \{\varepsilon\}) \times (\Gamma \cup \{\varepsilon\}) \to \mathcal{P}(Q \times (\Gamma \cup \{\varepsilon\})) is the transition function.
  • q0Qq_0 \in Q is the start state.
  • FQF \subseteq Q is the set of accepting states.

A PDA can read input, push symbols onto the stack, pop symbols from the stack, and make ε\varepsilon-moves (without consuming input or changing the stack).

Example. PDA for L={anbn:n0}L = \{a^n b^n : n \geq 0\}:

  1. Push a marker \text{\}$ onto the stack.
  2. For each aPush X onto the stack.
  3. For each bPop X from the stack.
  4. Accept if the stack is empty (just the marker) and all input is consumed.

Formally: states q0q_0 (start), q1q_1 (pushing), q2q_2 (popping), q3q_3 (accept).

Worked Example: PDA for palindromes $\{ww^R : w \in \{0,1\}^*\}$

Design a PDA for L={wwR:w{0,1}}L = \{ww^R : w \in \{0,1\}^*\}.

Idea: Push the first half of the input onto the stack. Nondeterministically guess when the Middle is reached, then pop and compare with the remaining input.

States: q0q_0 (pushing), q1q_1 (popping/comparing), q2q_2 (accept).

Transitions:

  • Push phase (q0q_0):

  • (q0,0,ε)(q0,0)(q_0, 0, \varepsilon) \to (q_0, 0) — push 0.

  • (q0,1,ε)(q0,1)(q_0, 1, \varepsilon) \to (q_0, 1) — push 1.

  • (q0,ε,ε)(q1,ε)(q_0, \varepsilon, \varepsilon) \to (q_1, \varepsilon) — guess the midpoint.

  • Pop phase (q1q_1):

  • (q1,0,0)(q1,ε)(q_1, 0, 0) \to (q_1, \varepsilon) — match 0.

  • (q1,1,1)(q1,ε)(q_1, 1, 1) \to (q_1, \varepsilon) — match 1.

  • (q1,ε,ε)(q2,ε)(q_1, \varepsilon, \varepsilon) \to (q_2, \varepsilon) — accept if stack empty and input consumed.

Accept: {q2}\{q_2\}.

Correctness. If w=uuRw = uu^RThe PDA pushes uuGuesses the midpoint, then pops and matches uRu^R. If wuuRw \neq uu^RNo sequence of guesses leads to acceptance. \blacksquare

3.5 Equivalence of CFG and PDA

Theorem 3.2. A language is context-free if and only if some PDA recognises it.

Proof (sketch).

CFG to PDA: The PDA simulates a leftmost derivation. It maintains the current sentential form On the stack. At each step, it pops a variable from the stack and pushes the right-hand side of a Production for that variable.

PDA to CFG: Given PDA MMConstruct a grammar whose variables encode pairs of states (p,q)(p, q) Meaning “the PDA can go from state pp to state qqPopping everything pushed onto the stack.” The productions simulate the PDA”s transitions. \blacksquare

Theorem 3.2a (CFG to PDA construction). Let G=(V,Σ,R,S)G = (V, \Sigma, R, S) be a CFG. Then there exists a PDA MM with L(M)=L(G)L(M) = L(G).

Proof. Construct M = (\{q_0, q_1, q_2\}, \Sigma, V \cup \Sigma \cup \{\}, \delta, q_0, {q_2})$:

  1. (q_0, \varepsilon, \varepsilon) \to (q_1, S\)$ — push the start variable and bottom marker.
  2. For each AαRA \to \alpha \in R: (q1,ε,A)(q1,α)(q_1, \varepsilon, A) \to (q_1, \alpha) — replace a variable with its production.
  3. For each aΣa \in \Sigma: (q1,a,a)(q1,ε)(q_1, a, a) \to (q_1, \varepsilon) — match a terminal on input with stack.
  4. (q_1, \varepsilon, \) \to (q_2, $)$ — accept when only the marker remains.

The PDA maintains the current sentential form (minus terminals already matched) on the stack. When only \text{\}remains,thederivationiscompleteandallinputhasbeenconsumed.remains, the derivation is complete and all input has been consumed.\blacksquare$

3.6 Pumping Lemma for Context-Free Languages

Theorem 3.3. If LL is context-free, there exists pp such that for every wLw \in L with wp|w| \geq p, ww can be decomposed as w=uvxyzw = uvxyz satisfying:

  1. vy>0|vy| \gt 0.
  2. vxyp|vxy| \leq p.
  3. uvixyizLuv^ixy^iz \in L for all i0i \geq 0.

Proof. Let GG be a CFG in CNF with kk variables. Any parse tree of height hh generates a string Of length at most 2h12^{h-1}. Set p=2kp = 2^k. For wp|w| \geq pThe parse tree has height >k\gt k So some path repeats a variable. The substring generated between the two occurrences can be pumped. \blacksquare

Example. L={anbncn:n0}L = \{a^n b^n c^n : n \geq 0\} is not context-free.

Proof. Assume pumping length pp. Let w=apbpcpw = a^p b^p c^p. Since vxyp|vxy| \leq pThe substring vxyvxy cannot span all three letter types. Case analysis:

  • If vxyvxy is within the aS or bS or cS: pumping changes only one count, breaking the equality.
  • If vxyvxy spans aS and bS: pumping changes the a and b counts but not cBreaking equality.
  • If vxyvxy spans bS and cS: analogous.

In all cases, uv2xy2zLuv^2xy^2z \notin L. \blacksquare

Example. L={aibjck:i<j<k}L = \{a^i b^j c^k : i \lt j \lt k\} is not context-free.

Proof. Assume pumping length pp. Let w=apbp+1cp+2w = a^p b^{p+1} c^{p+2}. Since vxyp|vxy| \leq p:

  • If vxyvxy lies entirely within one block: pumping up (i=2i = 2) increases only one count, but the gap between adjacent counts is only 1 or 2, so doubling the pumped count violates the strict inequalities.
  • If vxyvxy spans two blocks: pumping changes two adjacent counts by the same additive amount. The gap between those counts is 1, so increasing both by the same positive amount makes them equal, violating i<j<ki \lt j \lt k.

In all cases, uv2xy2zLuv^2xy^2z \notin L. \blacksquare

Worked Example: $\{a^n b^n a^n : n \geq 0\}$ is not context-free

Proof. Assume pumping length pp. Let w=apbpapw = a^p b^p a^p. Since vxyp|vxy| \leq pThe substring vxyvxy cannot span all three blocks. Case analysis:

  • vxyvxy within the first apa^p block: pumping down (i=0i = 0) reduces the first count only. uv0xy0z=apkbpapLuv^0xy^0z = a^{p-k}b^pa^p \notin L for k>0k \gt 0.
  • vxyvxy within the bpb^p block: pumping changes the bb count only, breaking equality.
  • vxyvxy within the last apa^p block: analogous.
  • vxyvxy spans the first apa^p and bpb^p: pumping changes first aa and bb counts but not the last aa count.
  • vxyvxy spans bpb^p and last apa^p: analogous.

In all cases, pumping produces a string not in LL. \blacksquare

3.7 The CYK Parsing Algorithm

The Cocke—Younger—Kasami (CYK) algorithm determines membership in a context-free language When the grammar is in Chomsky Normal Form.

Theorem 3.4. Given a CFG GG in CNF and a string ww of length nnThe CYK algorithm decides Whether wL(G)w \in L(G) in O(n3G)O(n^3 \cdot |G|) time.

Algorithm. Construct a table T[i,j]T[i, j] for 1ijn1 \leq i \leq j \leq nWhere T[i,j]T[i, j] is the set Of variables that can derive the substring wiwi+1wjw_i w_{i+1} \cdots w_j.

  1. Base case (j=1j = 1): T[i, i] = \{A : A \to w_i \mathrm{ is a rule in G\}.
  2. Recursive case (j>1j \gt 1): For each split kk with ik<ji \leq k \lt j: T[i,j]:=T[i,j]{A:ABCR,BT[i,k],CT[k+1,j]}T[i, j] \mathrel{{:}{=}} T[i, j] \cup \{A : A \to BC \in R, B \in T[i, k], C \in T[k+1, j]\}
  3. Answer: wL(G)w \in L(G) iff ST[1,n]S \in T[1, n].

Proof of correctness. In CNF, every derivation of a string of length \ell involves exactly 1\ell - 1 binary productions. The table considers every possible “root” of the derivation tree For each substring, and every possible split of that substring into two parts. By induction on the Substring length, T[i,j]T[i, j] contains exactly those variables that derive wiwjw_i \cdots w_j. \blacksquare

Time complexity. The table has O(n2)O(n^2) entries. Each entry considers at most nn split points And checks all G|G| rules, giving O(n3G)O(n^3 \cdot |G|) total time.

Worked Example: CYK on a small grammar

Grammar (CNF): SABBCS \to AB \mid BC, ABAaA \to BA \mid a, BCCbB \to CC \mid b, CABaC \to AB \mid a.

String: w=baw = ba.

Length 1:

  • T[1,1]T[1,1]: w1=bw_1 = bSo T[1,1]={B}T[1,1] = \{B\} (since BbB \to b).
  • T[2,2]T[2,2]: w2=aw_2 = aSo T[2,2]={A,C}T[2,2] = \{A, C\} (since AaA \to a and CaC \to a).

Length 2:

  • T[1,2]T[1,2]: split at k=1k = 1. Check all pairs (XT[1,1],YT[2,2])(X \in T[1,1], Y \in T[2,2]):
  • X=B,Y=AX = B, Y = A: SBAS \to BA? No. BBAB \to BA? No. ABAA \to BA? No. CBAC \to BA? No.
  • X=B,Y=CX = B, Y = C: SBCS \to BC? Yes — add SS.
  • X=B,Y=AX = B, Y = A: already checked. So T[1,2]={S}T[1,2] = \{S\}.

Since ST[1,2]S \in T[1,2]The string baba is in L(G)L(G). The parse tree is SBCS \to BC Where BbB \to b and CaC \to a.