Skip to content

Regular Languages

2.1 Finite Automata

A deterministic finite automaton (DFA) is a 5-tuple M=(Q,Σ,δ,q0,F)M = (Q, \Sigma, \delta, q_0, F) where:

  • QQ is a finite set of states.
  • Σ\Sigma is the input alphabet.
  • δ:Q×ΣQ\delta : Q \times \Sigma \to Q is the transition function.
  • q0Qq_0 \in Q is the start state.
  • FQF \subseteq Q is the set of accepting (final) states.

MM accepts a string w=w1w2wnw = w_1 w_2 \cdots w_n if the sequence of states q0,δ(q0,w1),δ(δ(q0,w1),w2),q_0, \delta(q_0, w_1), \delta(\delta(q_0, w_1), w_2), \ldots ends in a state in FF.

Example. DFA for strings over {0,1}\{0, 1\} containing the substring 01:

M=({q0,q1,q2},{0,1},δ,q0,{q2})M = (\{q_0, q_1, q_2\}, \{0, 1\}, \delta, q_0, \{q_2\})

Stateδ(,0)\delta(\cdot, 0)δ(,1)\delta(\cdot, 1)
q0q_0q1q_1q0q_0
q1q_1q1q_1q2q_2
q2q_2q2q_2q2q_2

A nondeterministic finite automaton (NFA) allows multiple transitions from a state on the same Symbol, and ε\varepsilon-transitions (transitions without consuming input). Formally, δ:Q×(Σ{ε})P(Q)\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to \mathcal{P}(Q).

MM accepts ww if there exists some path of transitions consuming ww that ends in a state in FF.

Worked Example: DFA for binary numbers divisible by 3

Design a DFA over Σ={0,1}\Sigma = \{0, 1\} that accepts exactly those strings (interpreted as binary Numbers, most significant bit first) that are divisible by 3.

We track the value of the number read so far, modulo 3. After reading prefix xxLet r=xmod3r = x \bmod 3. Reading a new bit bb appends bb to the right: the new value is 2r+bmod32r + b \bmod 3.

States: q0q_0 (remainder 0), q1q_1 (remainder 1), q2q_2 (remainder 2). Start state: q0q_0 (the empty prefix has value 0). Accept state: q0q_0.

M=({q0,q1,q2},{0,1},δ,q0,{q0})M = (\{q_0, q_1, q_2\}, \{0, 1\}, \delta, q_0, \{q_0\})

Stateδ(,0)\delta(\cdot, 0)δ(,1)\delta(\cdot, 1)
q0q_0q0q_0q1q_1
q1q_1q2q_2q0q_0
q2q_2q1q_1q2q_2

Correctness. By induction on input length. Base: x=εx = \varepsilon, val(ε)=0\mathrm{val}(\varepsilon) = 0 DFA is in q0q_0. Step: if after xx the DFA is in qrq_r (where r=val(x)mod3r = \mathrm{val}(x) \bmod 3), Then reading bb moves to q(2r+b)mod3q_{(2r+b) \bmod 3}Which equals qval(xb)mod3q_{\mathrm{val}(xb) \bmod 3}. \blacksquare

Worked Example: NFA for strings ending in `01`

Construct an NFA over {0,1}\{0, 1\} that accepts strings ending in the substring 01.

States: q0q_0 (start), q1q_1 (have seen a 0), q2q_2 (accept, have seen 01).

Stateδ(,0)\delta(\cdot, 0)δ(,1)\delta(\cdot, 1)
q0q_0{q0,q1}\{q_0, q_1\}{q0}\{q_0\}
q1q_1\emptyset{q2}\{q_2\}
q2q_2{q1}\{q_1\}\emptyset

Accept: {q2}\{q_2\}. The NFA is nondeterministic because from q0q_0 on input 0It can either stay In q0q_0 or move to q1q_1.

2.2 Equivalence of DFA and NFA

Theorem 2.1. For every NFA NNThere exists a DFA DD such that L(N)=L(D)L(N) = L(D).

Proof (subset construction). Given NFA N=(Q,Σ,δ,q0,F)N = (Q, \Sigma, \delta, q_0, F)Construct DFA D=(Q",Σ,δ,q0,F)D = (Q", \Sigma, \delta', q_0', F') where:

  • Q=P(Q)Q' = \mathcal{P}(Q) (each state of DD is a subset of states of NN).
  • q0=εclosure({q0})q_0' = \varepsilon\mathrm{-closure(\{q_0\})}.
  • δ(S,a)=εclosure(qSδ(q,a))\delta'(S, a) = \varepsilon\mathrm{-closure(\bigcup_{q \in S} \delta(q, a))} for SQS \subseteq Q aΣa \in \Sigma.
  • F={SQ:SF}F' = \{S \subseteq Q : S \cap F \neq \emptyset\}.

Every string accepted by NN has some path through NN‘s states. The subset construction tracks all Possible states NN could be in after reading each symbol. DD accepts exactly when at least one Possibility in NN leads to acceptance. \blacksquare

Corollary 2.2. NFA, DFA, and NFA with ε\varepsilon-transitions all recognise exactly the class of Regular languages.

Worked Example: Subset construction (NFA to DFA)

Convert the NFA from the “strings ending in 01” example to a DFA via the subset construction.

NFA states: {q0,q1,q2}\{q_0, q_1, q_2\}, Σ={0,1}\Sigma = \{0, 1\}, F={q2}F = \{q_2\}No ε\varepsilon-transitions.

Start state: {q0}\{q_0\}.

DFA Stateδ(,0)\delta'(\cdot, 0)δ(,1)\delta'(\cdot, 1)Accept?
{q0}\{q_0\}{q0,q1}\{q_0, q_1\}{q0}\{q_0\}No
{q0,q1}\{q_0, q_1\}{q0,q1}\{q_0, q_1\}{q0,q2}\{q_0, q_2\}No
{q0,q2}\{q_0, q_2\}{q0,q1}\{q_0, q_1\}{q0}\{q_0\}Yes

The DFA has 3 reachable states. {q1}\{q_1\}, {q2}\{q_2\}And {q1,q2}\{q_1, q_2\} are unreachable.

2.3 Regular Expressions

Definition. Regular expressions over Σ\Sigma are defined inductively:

  1. \emptyset (empty set), ε\varepsilon (empty string), and aa for each aΣa \in \Sigma are regex.
  2. If R1R_1 and R2R_2 are regex, then (R1R2)(R_1 \cup R_2), (R1R2)(R_1 \cdot R_2)And (R1)(R_1^*) are regex.
  3. Nothing else is a regex.

Shorthand: R+R^+ means RRR \cdot R^*. R?R? means (Rε)(R \cup \varepsilon).

Examples:

  • (01)00(01)(0 \cup 1)^* 00 (0 \cup 1)^*: strings containing 00.
  • 101011^* 01^* 01^*: strings with exactly two 0S.
  • (01)(0 \cup 1)^*: all strings over {0,1}\{0, 1\}.
  • \emptyset: the empty language.

Regex identities. The following algebraic laws hold for regular expressions and can be used To simplify or reason about them:

IdentityDescription
R=R=\emptyset \cdot R = R \cdot \emptyset = \emptysetAnnihilator for concatenation
εR=Rε=R\varepsilon \cdot R = R \cdot \varepsilon = RIdentity for concatenation
R=R=R\emptyset \cup R = R \cup \emptyset = RIdentity for union
=ε\emptyset^* = \varepsilonKleene star of empty set
ε=ε\varepsilon^* = \varepsilonKleene star of empty string
(R)=R(R^*)^* = R^*Idempotence of Kleene star
R(ST)=(RS)TR \cdot (S \cdot T) = (R \cdot S) \cdot TAssociativity of concatenation
RS=SRR \cup S = S \cup RCommutativity of union
R(ST)=(RS)TR \cup (S \cup T) = (R \cup S) \cup TAssociativity of union
RR=RR \cup R = RIdempotence of union
R(ST)=RSRTR(S \cup T) = RS \cup RTDistributivity

These identities can be used to prove that two regex describe the same language, analogous to Simplifying algebraic expressions. Two regex R1R_1 and R2R_2 are equivalent (written R1R2R_1 \equiv R_2) If L(R1)=L(R2)L(R_1) = L(R_2).

2.4 Equivalence of Regular Expressions and Finite Automata

Theorem 2.3 (Kleene). A language is regular if and only if it is described by some regular Expression.

Proof (sketch).

Regex to NFA: Induction on the structure of the regex.

  • Base cases: \emptyset, ε\varepsilon, aa each have trivial NFAs.
  • Union: Combine two NFAs with a new start state and ε\varepsilon-transitions to each.
  • Concatenation: Connect the accept states of the first NFA to the start state of the second via ε\varepsilon-transitions.
  • Kleene star: Add ε\varepsilon-transitions from a new start state to the NFA’s start and from the NFA’s accept states back to the start state.

NFA to regex: State elimination. Systematically eliminate states of the NFA one at a time, Updating the transition labels (which are regex) between remaining states. The final label from start To accept is the equivalent regex. \blacksquare

Thompson’s construction. The regex-to-NFA translation can be made fully explicit. For each Sub-expression of the regex, we build a small NFA fragment with exactly one entry and one exit State, connected by ε\varepsilon-transitions. The construction guarantees that the NFA has at most One accept state, no transitions into the start state, and no transitions out of the accept state.

Theorem 2.3a (Thompson’s construction correctness). For every regular expression RR over Σ\SigmaThompson’s construction produces an NFA NRN_R with L(NR)=L(R)L(N_R) = L(R)And NRN_R has O(R)O(|R|) States and transitions.

Proof. By structural induction on RR.

  • Base cases:

  • R=R = \emptyset: NRN_R has a start state and an accept state with no transitions. L(NR)==L()L(N_R) = \emptyset = L(\emptyset).

  • R=εR = \varepsilon: NRN_R has a start state, an accept state, and an ε\varepsilon-transition between them. L(NR)={ε}=L(ε)L(N_R) = \{\varepsilon\} = L(\varepsilon).

  • R=aR = a for aΣa \in \Sigma: NRN_R has a start state, an accept state, and a transition on aa. L(NR)={a}=L(a)L(N_R) = \{a\} = L(a).

  • Inductive cases:

  • R=R1R2R = R_1 \cup R_2: By IH, L(NR1)=L(R1)L(N_{R_1}) = L(R_1) and L(NR2)=L(R2)L(N_{R_2}) = L(R_2). Thompson adds a new start ss and accept ff with ε\varepsilon-transitions to the start states of NR1N_{R_1} and NR2N_{R_2}And ε\varepsilon-transitions from their accept states to ff. Any accepting path goes through exactly one sub-NFA, so L(NR)=L(R1)L(R2)=L(R)L(N_R) = L(R_1) \cup L(R_2) = L(R).

  • R=R1R2R = R_1 \cdot R_2: Thompson connects the accept state of NR1N_{R_1} to the start state of NR2N_{R_2} via ε\varepsilon-transitions. A string wL(NR)w \in L(N_R) iff w=w1w2w = w_1 w_2 where w1L(NR1)w_1 \in L(N_{R_1}) and w2L(NR2)w_2 \in L(N_{R_2})I.e., wL(R1)L(R2)=L(R)w \in L(R_1) \cdot L(R_2) = L(R).

  • R=R1R = R_1^*: Thompson adds a new start ss and accept ffWith ε\varepsilon-transitions from ss to ff (allowing zero repetitions) and from ss to the start of NR1N_{R_1}And from the accept of NR1N_{R_1} back to ss. Any accepting path corresponds to zero or more traversals of NR1N_{R_1}So L(NR)=L(R1)=L(R)L(N_R) = L(R_1)^* = L(R).

In all cases, the number of states added is a constant per operator, so NR=O(R)|N_R| = O(|R|). \blacksquare

2.5 DFA Minimisation and the Myhill-Nerode Theorem

Theorem 2.4 (Myhill-Nerode). The following are equivalent for a language LL:

  1. LL is regular.
  2. The Myhill-Nerode equivalence relation has finitely many equivalence classes.
  3. LL is the union of some of the equivalence classes.

Definition. Two strings x,yx, y are distinguishable with respect to LL if there exists zz Such that exactly one of xz,yzxz, yz is in LL. The Myhill-Nerode equivalence L\equiv_L is: xLyx \equiv_L y iff xx and yy are not distinguishable.

Proof of Theorem 2.4.

(1) \Rightarrow (2): Let D=(Q,Σ,δ,q0,F)D = (Q, \Sigma, \delta, q_0, F) be a DFA for LL. Define xyx \sim y iff δ(q0,x)=δ(q0,y)\delta^*(q_0, x) = \delta^*(q_0, y) (i.e., DD reaches the same state on xx and yy). Then \sim has at most Q|Q| equivalence classes. We show =L\sim = \equiv_L. If xyx \sim yThen for All zz, δ(q0,xz)=δ(q0,yz)\delta^*(q_0, xz) = \delta^*(q_0, yz)So xzLxz \in L iff yzLyz \in LMeaning xLyx \equiv_L y. Conversely, if x̸Lyx \not\equiv_L yThere exists zz with xzLxz \in L and yzLyz \notin L (or vice versa), so δ(q0,xz)δ(q0,yz)\delta^*(q_0, xz) \neq \delta^*(q_0, yz)Hence x≁yx \not\sim y.

(2) \Rightarrow (3): Trivial, since LL consists of all strings whose equivalence class is one That contains at least one string in LL.

(3) \Rightarrow (1): Suppose L\equiv_L has finitely many equivalence classes C1,,CkC_1, \ldots, C_k. Construct a DFA with one state per equivalence class, start state [x]L[x]_{\equiv_L} for x=εx = \varepsilon Transition δ([x],a)=[xa]\delta'([x], a) = [xa]And accept states = those classes contained in LL. This DFA is Well-defined because L\equiv_L is a right congruence: if xLyx \equiv_L yThen xaLyaxa \equiv_L ya for all aΣa \in \Sigma. \blacksquare

Worked Example: Myhill-Nerode classes for $L = \{0^n 1^n : n \geq 0\}$

We show LL is not regular by exhibiting infinitely many pairwise distinguishable strings.

Claim: the strings 00,01,02,03,0^0, 0^1, 0^2, 0^3, \ldots are pairwise distinguishable with respect to LL.

Proof. For iji \neq j with i<ji \lt jConsider the suffix z=1iz = 1^i. Then:

  • 0i1i=0i1iL0^i \cdot 1^i = 0^i 1^i \in L.
  • 0j1i=0j1iL0^j \cdot 1^i = 0^j 1^i \notin L (since j>ij \gt i).

Therefore 0i̸L0j0^i \not\equiv_L 0^j. Since there are infinitely many such strings, L\equiv_L has Infinitely many equivalence classes, so LL is not regular by the Myhill-Nerode theorem. \blacksquare

This …/1-number-and-algebra/3_proof-and-logic technique is often cleaner than the pumping lemma, as it avoids case analysis over Decompositions.

Corollary 2.5. The minimum-state DFA for LL has exactly as many states as there are Equivalence classes of L\equiv_L.

Minimisation algorithm:

1. Start with two partitions: F (accepting states) and Q \ F (non-accepting).
2. Repeatedly refine: for each pair (p, q) in the same partition,
if delta(p, a) and delta(q, a) are in different partitions for some a,
split p and q into different partitions.
3. Repeat until no further splitting occurs.
4. Each partition becomes a state in the minimised DFA.

This algorithm runs in O(n2k)O(n^2 k) time where nn is the number of states and k=Σk = |\Sigma|.

2.6 Pumping Lemma for Regular Languages

Theorem 2.6 (Pumping Lemma). If LL is regular, then there exists a constant pp (the pumping Length) such that for every string wLw \in L with wp|w| \geq p, ww can be decomposed as w=xyzw = xyz satisfying:

  1. y>0|y| \gt 0 (the pumped part is non-empty).
  2. xyp|xy| \leq p (the pumpable part is within the first pp symbols).
  3. xyizLxy^iz \in L for all i0i \geq 0.

Proof. Let DD be a DFA for LL with nn states. Set p=np = n. For any string ww of length n\geq n The path through DD visits at least n+1n + 1 states, so by the pigeonhole principle, some state is Visited twice within the first pp symbols. The substring yy between the two visits can be repeated Or removed (xyizxy^iz) without affecting acceptance. \blacksquare

Using the pumping lemma to prove non-regularity. To show LL is not regular, assume it is, let pp Be the pumping length, and exhibit a string wLw \in L with wp|w| \geq p such that every Decomposition w=xyzw = xyz satisfying (1) and (2) has some ii with xyizLxy^iz \notin L.

Example. L={0n1n:n0}L = \{0^n 1^n : n \geq 0\} is not regular.

Proof. Assume LL is regular with pumping length pp. Let w=0p1pLw = 0^p 1^p \in L. By (2), xyp|xy| \leq pSo yy consists only of 0S. Let y=k>0|y| = k \gt 0. Then xy0z=0pk1pLxy^0 z = 0^{p-k} 1^p \notin L (since pkpp - k \neq p). Contradiction. \blacksquare

Example. L={ww:w{0,1}}L = \{ww : w \in \{0,1\}^*\} is not regular.

Proof. Assume pumping length pp. Let w=0p10p1Lw = 0^p 1 0^p 1 \in L. Since xyp|xy| \leq p, y=0ky = 0^k for Some k>0k \gt 0. Then xy0z=0pk10p1Lxy^0 z = 0^{p-k} 1 0^p 1 \notin L (the two halves have different lengths). \blacksquare

Example. L={0n1m:nm}L = \{0^n 1^m : n \neq m\} is not regular.

Proof. Assume LL is regular with pumping length pp. Since regular languages are closed under Complement, L={0n1m:n=m}{w:w01}\overline{L} = \{0^n 1^m : n = m\} \cup \{w : w \notin 0^* 1^*\} would also be regular. Then L01={0n1n:n0}\overline{L} \cap 0^* 1^* = \{0^n 1^n : n \geq 0\} would be regular (since 010^* 1^* is regular And regular languages are closed under intersection). But {0n1n:n0}\{0^n 1^n : n \geq 0\} is not regular — Contradiction. \blacksquare

Worked Example: Proving $\{w : n_0(w) = n_1(w)\}$ is not regular

Let L={w{0,1}:n0(w)=n1(w)}L = \{w \in \{0,1\}^* : n_0(w) = n_1(w)\}.

Proof. Assume LL is regular with pumping length pp. Let w=0p1pLw = 0^p 1^p \in L. By (2), xyp|xy| \leq pSo y=0ky = 0^k for some k1k \geq 1. Then xy0z=0pk1pxy^0z = 0^{p-k}1^pWhich has pkp - k zeros and pp ones. Since k1k \geq 1 pkpp - k \neq pSo xy0zLxy^0z \notin L. Contradiction. \blacksquare

Worked Example: $L = \{a^{n!} : n \geq 0\}$ is not regular

Proof. Assume pumping length pp. Let w=ap!Lw = a^{p!} \in L with w=p!p|w| = p! \geq p. By (2), xyp|xy| \leq pSo y=aky = a^k for some 1kp1 \leq k \leq p. Choose i=(p!+k)/k=p!/k+1i = (p! + k) / k = p!/k + 1 (an integer since kpk \leq p). Then xyiz=ap!+(i1)k=ap!+p!=a2p!xy^iz = a^{p! + (i-1)k} = a^{p! + p!} = a^{2 \cdot p!}. But 2p!2 \cdot p! is not a factorial for p2p \geq 2 (since (p+1)!/(p!)2=(p+1)/p!<1(p+1)! / (p!)^2 = (p+1)/p! \lt 1 for p3p \geq 3 And 22!=4n!2 \cdot 2! = 4 \neq n! for any nn). Hence xyizLxy^iz \notin L. \blacksquare

:::caution Common Pitfall The pumping lemma says that for every decomposition satisfying (1) and (2), there exists a Pumping that fails. To prove non-regularity, you must show that all valid decompositions lead To a contradiction. A single decomposition that works is insufficient to disprove the lemma. The converse of the pumping lemma is false: if a language satisfies the pumping condition, it is Not necessarily regular.

2.7 Closure Properties of Regular Languages

Regular languages are closed under:

OperationProof technique
UnionNFA union construction
IntersectionDFA product construction
ComplementSwap accepting and non-accepting states
ConcatenationNFA concatenation construction
Kleene starNFA star construction
ReversalReverse all transitions, swap start/F
DifferenceABA \cap \overline{B}
HomomorphismApply the homomorphism to each symbol
Inverse homomorphismPre-image construction

Theorem 2.7 (Intersection via product construction). If L1L_1 and L2L_2 are regular, then L1L2L_1 \cap L_2 is regular.

Proof. Let D1=(Q1,Σ,δ1,q1,F1)D_1 = (Q_1, \Sigma, \delta_1, q_1, F_1) and D2=(Q2,Σ,δ2,q2,F2)D_2 = (Q_2, \Sigma, \delta_2, q_2, F_2). Construct D=(Q1×Q2,Σ,δ,(q1,q2),F1×F2)D = (Q_1 \times Q_2, \Sigma, \delta, (q_1, q_2), F_1 \times F_2) where δ((r1,r2),a)=(δ1(r1,a),δ2(r2,a))\delta((r_1, r_2), a) = (\delta_1(r_1, a), \delta_2(r_2, a)). Then DD accepts ww iff both D1D_1 and D2D_2 accept wwI.e., wL1L2w \in L_1 \cap L_2. \blacksquare

Theorem 2.7. If L1L_1 is regular and L2L_2 is not regular, then L1L2L_1 \cap L_2 may or may not be Regular. Closure properties do not apply when one operand is non-regular.

:::