Context-Free Grammars (CFG)

CS305 -- Formal Language Theory

"The grammar that builds languages beyond the reach of finite automata."



The Big Picture: Chomsky Hierarchy

Where do context-free grammars fit in the world of formal languages?

+-----------------------------------------------------------+
| Type 0: Recursively Enumerable (Turing Machines)          |
|                                                           |
| +---------------------------------------------------+     |
| | Type 1: Context-Sensitive (Linear Bounded Auto.)  |     |
| |                                                   |     |
| | +-------------------------------------------+     |     |
| | | Type 2: Context-Free (Pushdown Auto.)     |     |     |
| | |                                           |     |     |
| | | +-----------------------------------+     |     |     |
| | | | Type 3: Regular (Finite Auto.)    |     |     |     |
| | | |                                   |     |     |     |
| | | | a*, ab+, (a|b)*                   |     |     |     |
| | | +-----------------------------------+     |     |     |
| | |                                           |     |     |
| | | a^n b^n, balanced parens, palindromes     |     |     |
| | +-------------------------------------------+     |     |
| |                                                   |     |
| | a^n b^n c^n                                       |     |
| +---------------------------------------------------+     |
|                                                           |
| { descriptions of Turing machines that halt }             |
+-----------------------------------------------------------+

Key Idea

Each level is strictly more powerful than the one inside it. Today we jump from Type 3 (regular) to Type 2 (context-free). The new superpower: a stack (via pushdown automata) or equivalently, recursive production rules.


Motivation: The Limits of Regular Languages

What DFAs/NFAs CAN'T do

The Pumping Lemma showed us these are NOT regular:

  • { a^n b^n | n ≥ 0 } -- equal counts of a's then b's
  • Balanced parentheses -- (()(())), but not (()(
  • Palindromes -- strings that read the same forwards and backwards
  • Nested structures -- HTML tags, math expressions

Why DFAs fail

Finite automata have finite memory (just the current state). They can't "count" unboundedly or "match" things seen earlier.

The common pattern

All these languages need NESTING:

  a^n b^n:          aaaa....bbbb
                    |          |
                    +--match!--+

  Balanced parens:  ( ( ) ( ( ) ) )
                    | |_| | |_| | |
                    |     |_____| |
                    |_____________|

  Palindromes:      a b c b a
                    |   |   |
                    +---+---+  match!

Analogy

A DFA is like a person counting on their fingers -- they run out. A CFG is like a person with a notepad: they can write down what to remember and check it later.


What is a Grammar?

Intuition: Grammars as Recipes

A grammar is a set of rewriting rules that tell you how to build strings in a language, step by step.

Recipe for a SENTENCE:

  SENTENCE --> SUBJECT VERB OBJECT
  SUBJECT  --> "the dog" | "a cat"
  VERB     --> "chased" | "ate"
  OBJECT   --> "the ball" | "a fish"

One derivation:

  SENTENCE => SUBJECT VERB OBJECT
           => "the dog" VERB OBJECT
           => "the dog" "chased" OBJECT
           => "the dog" "chased" "the ball"

Key Terminology

  • Variables (non-terminals): Placeholders that get replaced. Written in UPPERCASE. Examples: S, A, B, SENTENCE
  • Terminals: The actual characters in the final string. Written in lowercase. Examples: a, b, 0, 1, +, (, )
  • Productions: The rewriting rules. "A --> w" means "A can be replaced by w"
  • Start symbol: Where every derivation begins (usually S)

Analogy

Think of variables as categories in a recipe book. "DESSERT" isn't something you eat -- it's a category that expands into "chocolate cake" or "apple pie." Terminals are the actual food you eat!


Formal Definition of a CFG

Definition

A context-free grammar is a 4-tuple G = (V, T, P, S) where:

Component   Description
V           A finite set of variables (non-terminals)
T           A finite set of terminals (the alphabet)
P           A finite set of productions of the form A → w, where A ∈ V and w ∈ (V ∪ T)*
S           The start symbol, S ∈ V

Why "Context-Free"?

Every production has a single variable on the left side: A → w. The variable A can be replaced regardless of context (what's around it). In context-sensitive grammars, the surrounding symbols matter.

Example: { a^n b^n | n ≥ 0 }

G = (V, T, P, S)
  V = { S }
  T = { a, b }
  P = { S → aSb, S → ε }
  Start = S

Let's trace how this generates aabb:

S ==> aSb      (used S → aSb)
  ==> aaSbb    (used S → aSb)
  ==> aaεbb    (used S → ε)
  =   aabb

The recursion in S → aSb "wraps" matching a's and b's around each other. That's the power regular languages don't have!
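The wrapping effect of S → aSb can be sketched in a few lines of Python. This is only an illustration: the function name `derive_anbn` is made up for this example, and string replacement stands in for applying a production to the single S in the sentential form.

```python
def derive_anbn(n):
    """Return the sentential forms in the derivation of a^n b^n from S."""
    current = "S"
    forms = [current]
    for _ in range(n):
        current = current.replace("S", "aSb")  # apply S → aSb
        forms.append(current)
    current = current.replace("S", "")         # apply S → ε
    forms.append(current)
    return forms


print(" ==> ".join(derive_anbn(2)))  # S ==> aSb ==> aaSbb ==> aabb
```

Each pass through the loop adds one a on the left and one matching b on the right, which is exactly the pairing a DFA cannot track.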


Derivations: Leftmost vs. Rightmost

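A derivation is a sequence of rule applications that transforms S into a string of terminals. A leftmost derivation always expands the leftmost variable; a rightmost derivation always expands the rightmost one.

Grammar: S → AB, A → aA | a, B → bB | b

Leftmost derivation of aabb:   S ==> AB ==> aAB ==> aaB ==> aabB ==> aabb
Rightmost derivation of aabb:  S ==> AB ==> AbB ==> Abb ==> aAbb ==> aabb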

Key Idea

For an unambiguous grammar, different derivation orders produce the same parse tree. The derivation order is just the order you visit the tree -- the tree itself is what matters.
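To make the "order you visit the tree" idea concrete, here is a small sketch that reads both a leftmost and a rightmost derivation off one parse tree. The tuple encoding of trees and the names `render` and `derivation` are assumptions made for this example only.

```python
# A tree is a tuple (variable, child, child, ...); a terminal leaf is a plain string.
TREE = ("S", ("A", "a", ("A", "a")), ("B", "b", ("B", "b")))


def render(form):
    """Show a sentential form: variables by name, terminals as themselves."""
    return "".join(x[0] if isinstance(x, tuple) else x for x in form)


def derivation(tree, leftmost=True):
    """List the sentential forms obtained by expanding one variable at a time."""
    form = [tree]
    steps = [render(form)]
    while any(isinstance(x, tuple) for x in form):
        pending = [i for i, x in enumerate(form) if isinstance(x, tuple)]
        i = pending[0] if leftmost else pending[-1]
        form[i:i + 1] = form[i][1:]  # replace the variable by its children
        steps.append(render(form))
    return steps


print(" ==> ".join(derivation(TREE, leftmost=True)))
print(" ==> ".join(derivation(TREE, leftmost=False)))
```

Both calls walk the same tree and end in the same string; only the order of expansion differs.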


Classic Example: Simple Arithmetic Grammar

Grammar for Arithmetic:

  E → E + T | T
  T → T * F | F
  F → ( E ) | id

E = Expression, T = Term, F = Factor

This grammar encodes precedence:

  • * binds tighter than + (multiplication first)
  • Both are left-associative
  • Parentheses override everything

Analogy

Think of E, T, F as layers of "binding strength." F is the tightest (atoms and parens). T groups multiplications. E groups additions. Like layers of an onion -- inner layers bind first.

Derive: id + id * id

Leftmost derivation:

  E ==> E + T          (E → E + T)
    ==> T + T          (E → T)
    ==> F + T          (T → F)
    ==> id + T         (F → id)
    ==> id + T * F     (T → T * F)
    ==> id + F * F     (T → F)
    ==> id + id * F    (F → id)
    ==> id + id * id   (F → id)

Notice!

The grammar forces "id * id" to be grouped under T, while "+" connects at the E level. This gives multiplication higher precedence than addition.
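A hedged sketch of how the E/T/F layering turns into evaluation order: the evaluator below has one function per variable. Two assumptions that are not in the slides: integer literals stand in for id, and the left-recursive rules are rewritten as loops (E → T (+ T)*, T → F (* F)*), a common trick for recursive-descent parsing that keeps the same precedence and left associativity.

```python
import re


def evaluate(text):
    tokens = re.findall(r"\d+|[+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expr():    # E level: sums of terms
        value = term()
        while peek() == "+":
            advance()
            value += term()
        return value

    def term():    # T level: products of factors
        value = factor()
        while peek() == "*":
            advance()
            value *= factor()
        return value

    def factor():  # F level: parenthesised expression or a number
        tok = advance()
        if tok == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return result


print(evaluate("2 + 3 * 4"))    # 14
print(evaluate("(2 + 3) * 4"))  # 20
```

Because `expr` can only reach `*` by calling `term`, every product is finished before the surrounding sum continues.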


Parse Trees

A parse tree is a visual representation of a derivation. It shows the structure of how a string is generated.

Rules for Parse Trees

  • Root = start symbol S
  • Internal nodes = variables (non-terminals)
  • Leaves = terminals or ε (read left-to-right = the string)
  • Each internal node + its children = one production rule

Analogy: Family Tree

A parse tree is like a family tree for strings. The start symbol S is the ancestor. Each production rule is a "parent has these children." The terminals at the bottom are the youngest generation -- the actual string!

Example: S → aSb | ε

Parse tree for aabb:

        S
       /|\
      / | \
     a  S  b
       /|\
      / | \
     a  S  b
        |
        ε

Reading the leaves left to right: a a ε b b = aabb

Key Idea

The tree structure shows the nesting. The outer S wraps a...b around the inner S. This is the recursive structure that DFAs cannot capture.
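Reading the leaves left to right can be sketched as a short recursive function. The nested-tuple encoding of the tree and the name `leaves` are assumptions for this illustration.

```python
# A tree is a tuple (variable, child, child, ...); a leaf is a plain string.
AABB_TREE = ("S", "a", ("S", "a", ("S", "ε"), "b"), "b")


def leaves(tree):
    """Return the leaves of a parse tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree[1:]:
        result.extend(leaves(child))
    return result


frontier = leaves(AABB_TREE)
word = "".join(leaf for leaf in frontier if leaf != "ε")  # ε contributes nothing
print(frontier, word)
```

The nesting of the tuples mirrors the nesting in the string: the outer S contributes the outer a...b pair, the inner S the inner pair.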


Parse Tree Builder: id + id * id

Grammar: E → E+T | T,   T → T*F | F,   F → (E) | id

Completed parse tree for id + id * id (built in 8 derivation steps):

             E
          /  |  \
         E   +   T
         |     / | \
         T    T  *  F
         |    |     |
         F    F     id
         |    |
         id   id

Why This Tree is Correct

The * operation sits deeper in the tree (under T), so it is evaluated first. The + is at the top (under E), so it is evaluated last. This gives us: id + (id * id).


Ambiguity Explorer: id + id * id

The ambiguous grammar E → E+E | E*E | (E) | id gives TWO parse trees!

Tree 1: (id + id) * id

        E
      / | \
     E  *  E
    /|\    |
   E + E   id
   |   |
   id  id

Tree 2: id + (id * id)

      E
    / | \
   E  +  E
   |    /|\
   id  E * E
       |   |
       id  id

The "Dangling Else" Problem

Another famous ambiguity: if E then if E then S else S -- does "else" belong to the outer or inner "if"? Most languages resolve this by matching "else" to the nearest unmatched "if."


Why Ambiguity Matters

Different parse trees mean different meanings. The tree defines the evaluation order!

Parsing: 2 + 3 * 4

Tree A: (2 + 3) * 4          Tree B: 2 + (3 * 4)

        E                            E
       /|\                          /|\
      E * E                        E + E
     /|\  |                        |  /|\
    E + E 4                        2 E * E
    |   |                            |   |
    2   3                            3   4

  = 5 * 4                        = 2 + 12
  = 20  WRONG!                   = 14  CORRECT!

Ambiguity = Multiple Interpretations

  • Compilers need exactly one parse tree per program
  • If two trees exist, the compiler might pick the wrong one
  • Different trees → different compiled code → different results
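A minimal sketch of why this matters in practice: the function below enumerates every parse tree that the ambiguous grammar E → E+E | E*E | number allows for a token list and returns the value each tree evaluates to. The name `parses` is hypothetical, and numbers replace id so the trees can be evaluated.

```python
def parses(tokens):
    """Values of all parse trees under E → E+E | E*E | number.

    Tokens alternate number, operator, number, ...; each choice of the
    top-level operator corresponds to a different parse tree.
    """
    if len(tokens) == 1:
        return [int(tokens[0])]
    values = []
    for i in range(1, len(tokens), 2):  # operators sit at odd positions
        op = tokens[i]
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                values.append(left + right if op == "+" else left * right)
    return values


print(parses("2 + 3 * 4".split()))
```

For 2 + 3 * 4 the list has two entries, one per tree, and they disagree. The number of trees grows quickly with the number of operators.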

Key Idea

Ambiguity is a property of the grammar, not the language. A language might have an ambiguous grammar but also an unambiguous one. The fix: rewrite the grammar to enforce the intended structure.

Caution

You cannot "test" for ambiguity in general -- it is undecidable whether an arbitrary CFG is ambiguous!


Eliminating Ambiguity: See the Difference


The Recipe for Eliminating Ambiguity

  1. Create one variable per precedence level
  2. Lowest precedence at the top (start symbol)
  3. Each level references the next tighter level
  4. Left-recursive rules → left-associative operators
  5. Right-recursive rules → right-associative operators
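The recipe above can be sketched as a small generator that turns a precedence table into layered productions. Everything here is illustrative: the function name `stratify`, the variable names E0, E1, ..., and the right-associative `^` operator in the usage line are assumptions, not part of the slides.

```python
def stratify(levels):
    """Build one rule per precedence level.

    levels: list of (operator, associativity) pairs, lowest precedence
    first; associativity is "left" or "right".
    """
    rules = []
    for i, (op, assoc) in enumerate(levels):
        cur, nxt = f"E{i}", f"E{i + 1}"
        if assoc == "left":    # left recursion → left-associative
            rules.append(f"{cur} → {cur} {op} {nxt} | {nxt}")
        else:                  # right recursion → right-associative
            rules.append(f"{cur} → {nxt} {op} {cur} | {nxt}")
    # tightest level: atoms and parentheses, looping back to the start symbol
    rules.append(f"E{len(levels)} → ( E0 ) | id")
    return rules


for rule in stratify([("+", "left"), ("*", "left"), ("^", "right")]):
    print(rule)
```

With only + and * in the table, the output is the E/T/F grammar from earlier under different variable names.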

Chomsky Normal Form (CNF)

A restricted but equally powerful form of context-free grammars.

CNF Rules

Every production must be one of exactly two forms:

  • A → BC   (two variables, no terminals)
  • A → a     (exactly one terminal)

Plus optionally: S → ε (only for the start symbol, only if ε is in the language)

CNF:          NOT CNF:
S → AB        S → AaB
A → BC        A → BCD
A → a         A → B
B → b         A → aB
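The two allowed forms are easy to check mechanically. The sketch below assumes a grammar is stored as a dict mapping each variable to a list of right-hand sides (tuples of symbols), with the dict keys serving as the variable set; that encoding and the name `is_cnf` are choices made for this example. It does not check the side condition that a start symbol with S → ε should not appear on any right-hand side.

```python
def is_cnf(grammar, start="S"):
    """True if every production is A → BC, A → a, or start → ε."""
    variables = set(grammar)
    for head, bodies in grammar.items():
        for body in bodies:
            two_variables = len(body) == 2 and all(s in variables for s in body)
            one_terminal = len(body) == 1 and body[0] not in variables
            start_epsilon = body == () and head == start
            if not (two_variables or one_terminal or start_epsilon):
                return False
    return True


# The left column above, plus an assumed rule C → c so that C is defined.
good = {"S": [("A", "B")], "A": [("B", "C"), ("a",)], "B": [("b",)], "C": [("c",)]}
# S → AaB mixes a terminal into a body of length three.
bad = {"S": [("A", "a", "B")], "A": [("a",)], "B": [("b",)]}
print(is_cnf(good), is_cnf(bad))
```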

Why CNF Matters

  • CYK Algorithm: A parsing algorithm that works in O(n³) time -- but it requires the grammar to be in CNF
  • Proofs: Many theoretical results are easier to prove when the grammar has a restricted form
  • Binary trees: CNF guarantees every parse tree is essentially binary (each internal node has either two variable children or a single terminal child)

Analogy

CNF is like putting equations in "standard form" in algebra. It doesn't change what the equation describes -- it just reorganizes it into a form that's easier to work with systematically.


Converting to CNF: The 4-Step Recipe

Step 1        Step 2        Step 3         Step 4
Remove        Remove        Break long     Replace lone
ε-prods       unit prods    productions    terminals

A → ε         A → B         A → BCD        A → aB
  |             |              |              |
  v             v              v              v
Propagate     Substitute    A → BX         A → T_a B
nullable      chains        X → CD         T_a → a

Step 1: Remove ε-Productions

  • Find all nullable variables (those that can derive ε)
  • For each production with a nullable var on the right, add versions with and without it
  • Delete all A → ε rules (except possibly S → ε)
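Step 1 can be sketched as a fixed-point computation followed by an expansion. The dict-of-tuples grammar encoding and the function names are assumptions for this example. Note that this sketch drops every ε-rule; if ε is in the language, the start rule S → ε has to be kept or re-added separately.

```python
from itertools import product


def nullable_variables(grammar):
    """Variables that can derive ε (computed by iterating to a fixed point)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head not in nullable and any(
                all(s in nullable for s in body) for body in bodies
            ):
                nullable.add(head)
                changed = True
    return nullable


def remove_epsilon(grammar):
    """For each body, add every version with nullable symbols kept or dropped."""
    nullable = nullable_variables(grammar)
    result = {}
    for head, bodies in grammar.items():
        new_bodies = set()
        for body in bodies:
            options = [((s,), ()) if s in nullable else ((s,),) for s in body]
            for choice in product(*options):
                new_body = tuple(sym for part in choice for sym in part)
                if new_body:  # skip the empty body: no ε-rules remain
                    new_bodies.add(new_body)
        result[head] = new_bodies
    return result


# S → ASB | ε,  A → aAS | a,  B → SbS | A | bb
g = {
    "S": [("A", "S", "B"), ()],
    "A": [("a", "A", "S"), ("a",)],
    "B": [("S", "b", "S"), ("A",), ("b", "b")],
}
print(nullable_variables(g))
print(remove_epsilon(g)["B"])
```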

Step 2: Remove Unit Productions

  • A unit production is A → B (single variable)
  • If A → B and B → w, replace with A → w
  • Repeat until no unit productions remain
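One way to sketch Step 2 is via unit closures rather than repeated substitution: for each variable, collect every variable reachable through unit productions alone, then copy over their non-unit bodies. This is a swapped-in but equivalent formulation; the encoding (dict of variable to list of tuples) and the name `remove_unit` are assumptions for the example.

```python
def remove_unit(grammar):
    """Return an equivalent grammar with no productions of the form A → B."""
    variables = set(grammar)

    def is_unit(body):
        return len(body) == 1 and body[0] in variables

    result = {}
    for head in grammar:
        # every variable reachable from head using unit productions only
        reachable, stack = {head}, [head]
        while stack:
            current = stack.pop()
            for body in grammar[current]:
                if is_unit(body) and body[0] not in reachable:
                    reachable.add(body[0])
                    stack.append(body[0])
        result[head] = {
            body for v in reachable for body in grammar[v] if not is_unit(body)
        }
    return result


# A → aA | a,  B → A | bb   (B → A is the unit production)
g = {"A": [("a", "A"), ("a",)], "B": [("A",), ("b", "b")]}
print(remove_unit(g)["B"])
```

Working with closures avoids looping forever on cyclic unit rules such as A → B, B → A.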

Step 3: Fix Long Productions

  • If A → B_1 B_2 ... B_k where k > 2
  • Break into pairs using new variables:
  • A → B_1 C_1, C_1 → B_2 C_2, ..., C_(k-2) → B_(k-1) B_k
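Step 3 can be sketched as peeling one symbol at a time off the front of a long body. The fresh variable names X1, X2, ... are hypothetical and assumed not to clash with names already in the grammar; the encoding and the name `break_long` are likewise choices for this example.

```python
def break_long(grammar):
    """Split every body longer than 2 into a chain of two-symbol bodies."""
    result = {}
    counter = 0
    for head, bodies in grammar.items():
        for body in bodies:
            current = head
            while len(body) > 2:
                counter += 1
                fresh = f"X{counter}"  # assumed-unused helper variable
                result.setdefault(current, []).append((body[0], fresh))
                current, body = fresh, body[1:]
            result.setdefault(current, []).append(body)
    return result


print(break_long({"A": [("B", "C", "D")]}))
```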

Step 4: Fix Terminal Mixing

  • If a production of length 2 or more contains a terminal, like A → aB or A → ab
  • Replace each such terminal a with a new variable T_a
  • Add T_a → a
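A sketch of Step 4 under the same assumed encoding (dict of variable to list of tuples, keys are the variables). The proxy names of the form T_a follow the slide's notation; the function name `lift_terminals` is made up for this example.

```python
def lift_terminals(grammar):
    """Replace terminals inside bodies of length >= 2 by proxy variables."""
    variables = set(grammar)
    result = {head: [] for head in grammar}
    for head, bodies in grammar.items():
        for body in bodies:
            if len(body) < 2:  # A → a is already in the right form
                result[head].append(body)
                continue
            new_body = []
            for symbol in body:
                if symbol in variables:
                    new_body.append(symbol)
                else:
                    proxy = f"T_{symbol}"
                    result.setdefault(proxy, [(symbol,)])  # add T_a → a once
                    new_body.append(proxy)
            result[head].append(tuple(new_body))
    return result


print(lift_terminals({"A": [("a", "B"), ("a",)], "B": [("b", "b")]}))
```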

Order matters!

Do the steps in order (1 → 2 → 3 → 4). Each step can create situations the next step fixes.


CNF Conversion: Step-Through

Walk through the CNF conversion of: S → ASB | ε,   A → aAS | a,   B → SbS | A | bb


Result

Every production ends up as A → BC or A → a, plus one ε-rule for the start symbol because ε is in this language. The grammar is in CNF, ready for the CYK parsing algorithm!
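For reference, here is a compact sketch of CYK on a CNF grammar. The encoding (dict of variable to list of tuples) and the example grammar are assumptions: the grammar below is a hand-written CNF grammar for a^n b^n with n ≥ 1, not the result of the conversion on this slide.

```python
def cyk(grammar, word, start="S"):
    """Decide whether a CNF grammar generates word (dynamic programming)."""
    n = len(word)
    if n == 0:
        return () in grammar.get(start, [])
    # table[i][l] = set of variables that derive word[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        for head, bodies in grammar.items():
            if (ch,) in bodies:
                table[i][1].add(head)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for head, bodies in grammar.items():
                    for body in bodies:
                        if (len(body) == 2
                                and body[0] in table[i][split]
                                and body[1] in table[i + split][length - split]):
                            table[i][length].add(head)
    return start in table[0][n]


# S → AX | AB,  X → SB,  A → a,  B → b
ANBN = {
    "S": [("A", "X"), ("A", "B")],
    "X": [("S", "B")],
    "A": [("a",)],
    "B": [("b",)],
}
print(cyk(ANBN, "aabb"), cyk(ANBN, "aab"))
```

The three nested loops over length, start position and split point are where the O(n³) bound comes from.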


Properties of Context-Free Languages

Closure Properties

Operation                  Closed?
Union                      YES
Concatenation              YES
Kleene Star                YES
Intersection               NO
Complement                 NO
Intersection with Regular  YES

How to prove closure under union

Given CFGs G1 (start S1) and G2 (start S2), rename variables if necessary so the two grammars share none, keep all productions of both, and add a new start symbol S with the rule: S → S1 | S2. Done!
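The union construction can be sketched directly. Variables of the two grammars are suffixed with _1 and _2 so they cannot clash; this renaming scheme, the dict-of-tuples encoding, and the name `union_grammar` are assumptions for this example.

```python
def union_grammar(g1, start1, g2, start2):
    """Combine two CFGs into one whose language is the union of theirs."""

    def tagged(grammar, tag):
        rename = {v: f"{v}_{tag}" for v in grammar}
        return {
            rename[head]: [tuple(rename.get(s, s) for s in body) for body in bodies]
            for head, bodies in grammar.items()
        }

    combined = {"S": [(f"{start1}_1",), (f"{start2}_2",)]}  # S → S1 | S2
    combined.update(tagged(g1, 1))
    combined.update(tagged(g2, 2))
    return combined


anbn = {"S": [("a", "S", "b"), ()]}   # S → aSb | ε
cstar = {"S": [("c", "S"), ()]}       # S → cS | ε
print(union_grammar(anbn, "S", cstar, "S"))
```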

NOT Closed Under Intersection

L1 = { a^n b^n c^m | n,m ≥ 0 }   (match a's and b's) -- CFL!
L2 = { a^m b^n c^n | n,m ≥ 0 }   (match b's and c's) -- CFL!

L1 ∩ L2 = { a^n b^n c^n | n ≥ 0 }

This is NOT context-free! (Provable by CFL pumping lemma)
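The set identity L1 ∩ L2 = { a^n b^n c^n } can be sanity-checked by brute force on short strings. This does not prove anything about context-freeness; it only confirms the intersection is what the slide says. The helper names are made up for this sketch.

```python
import re
from itertools import product


def blocks(word):
    """Lengths of the a-, b- and c-blocks if word has the shape a*b*c*, else None."""
    m = re.fullmatch(r"(a*)(b*)(c*)", word)
    return tuple(len(part) for part in m.groups()) if m else None


def in_l1(word):  # a^n b^n c^m
    counts = blocks(word)
    return counts is not None and counts[0] == counts[1]


def in_l2(word):  # a^m b^n c^n
    counts = blocks(word)
    return counts is not None and counts[1] == counts[2]


def is_anbncn(word):
    n = len(word) // 3
    return word == "a" * n + "b" * n + "c" * n


# every string over {a, b, c} up to length 6
words = ["".join(p) for k in range(7) for p in product("abc", repeat=k)]
agree = all((in_l1(w) and in_l2(w)) == is_anbncn(w) for w in words)
print(agree)
```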

Consequence

Since CFLs are closed under union, and L1 ∩ L2 = complement(complement(L1) ∪ complement(L2)) by De Morgan's law, closure under complement would imply closure under intersection. Intersection fails, so complement must fail too!


Inherently Ambiguous Languages

Some context-free languages are so "tangled" that every possible grammar for them is ambiguous.

Definition

A CFL L is inherently ambiguous if every CFG that generates L is ambiguous. There is no way to "fix" the grammar -- the ambiguity is built into the language itself.

Classic Example

L = { a^i b^j c^k | i=j OR j=k } In other words: either the a's match the b's, OR the b's match the c's (or both).

Why it's inherently ambiguous

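Consider: a^n b^n c^n

This string is in L for two independent reasons:
  - i = j = n (a's match b's)
  - j = k = n (b's match c's)

Intuitively, any grammar must handle the two cases with separate rules, and both sets of rules derive a^n b^n c^n, giving it two parse trees. (The rigorous proof is harder and is usually done with Ogden's lemma.)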

Analogy

Imagine a language with two overlapping "reasons" a string can be included. When both reasons apply simultaneously, the grammar can derive the string along either path, so the string has two different trees. It's like a Venn diagram overlap that can't be un-overlapped.

Important Distinction

An ambiguous grammar might be fixable (rewrite it). An inherently ambiguous language cannot be fixed -- no grammar for it is unambiguous.


CFG vs. Regular: Head-to-Head Comparison

Feature        Regular Languages                Context-Free Languages
Machine model  DFA / NFA                        Pushdown Automaton (PDA)
Memory         Finite (states only)             Finite states plus an unbounded stack
Described by   Regular expressions              Context-free grammars
Closure        ∪, ∩, *, complement, concat      ∪, *, concat (NOT ∩, complement)
Parsing        O(n) -- linear scan              O(n³) CYK; O(n) for some subclasses
Pumping lemma  x y^k z (one pump)               u v^k x y^k z (two pumps)
Can do         a*b*, (ab)*, keyword matching    a^n b^n, balanced parens, palindromes
Can't do       a^n b^n, matching, counting      a^n b^n c^n, cross-serial dependencies
Relationship   Every regular language is context-free, but NOT vice versa

Key Idea

Regular languages are a proper subset of CFLs. A DFA is just a PDA that never uses its stack. So anything a DFA can do, a PDA can also do -- plus more.


Real-World Applications of CFGs

Compilers & Programming Languages

Source code: if (x > 0) { y = x + 1; }

            STATEMENT
           /    |    \
         IF   COND   BLOCK
         |    /  \      |
         if  x > 0   ASSIGN
                     /  |  \
                    y   =  EXPR
                          / | \
                         x  +  1

Most programming languages define their syntax with a CFG that describes what valid programs look like. Compilers use parsers (LL, LR, LALR) to build parse trees from source code.

XML / HTML

Nested tags are inherently context-free:

<div><p>Hello <b>world</b></p></div>
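The nesting discipline of tags is the same stack discipline a PDA uses. The checker below is a rough sketch that assumes bare tags only (no attributes, no self-closing tags, no comments); the name `well_nested` is made up for this example, and it is not a real HTML parser.

```python
import re


def well_nested(markup):
    """True if every opening tag is closed in last-opened, first-closed order."""
    stack = []
    for closing, name in re.findall(r"<(/?)(\w+)>", markup):
        if not closing:
            stack.append(name)                 # push on an opening tag
        elif not stack or stack.pop() != name:
            return False                       # closing tag does not match the top
    return not stack                           # everything opened was closed


print(well_nested("<div><p>Hello <b>world</b></p></div>"))
print(well_nested("<div><p>Hello</div></p>"))
```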

Natural Language Processing

        SENTENCE
        /      \
      NP        VP
      |        /  \
     Det      V    NP
      |       |   /  \
    "the"     |  Det   N
            "ate" |    |
                 "a" "fish"

Linguists use CFGs to model the structure of human language sentences.

Other Applications

  • JSON / YAML parsing -- nested data formats
  • Mathematical expressions -- calculators, CAS
  • DNA/RNA structure -- folding patterns modeled by stochastic CFGs
  • Protocol specification -- BNF grammars in RFCs

Analogy

CFGs are the blueprints of structured languages. Wherever you see nesting, hierarchy, or recursive structure, there's likely a CFG underneath.


Summary & Cheat Sheet

The Big Ideas

CFG = (V, T, P, S)
=====================
V = variables (non-terminals)
T = terminals (alphabet)
P = productions (rewrite rules)
S = start symbol

"Context-free" means: Left side of every rule is a SINGLE variable. No context needed.

Power:   Regular ⊂ Context-Free
Machine: DFA ⊂ PDA (+ stack)

What to Remember

  • CFGs can express nesting and matching
  • Parse trees show derivation structure
  • Ambiguity = multiple parse trees for one string
  • CNF: A → BC | a (useful for CYK parsing)
  • Closed under ∪, concat, * but NOT ∩ or complement

Common Grammar Patterns

a^n b^n:          S → aSb | ε
palindromes:      S → aSa | bSb | a | b | ε
balanced parens:  S → SS | (S) | ε
arithmetic:       E → E+T | T
                  T → T*F | F
                  F → (E) | id

CNF Conversion Steps

1. Remove ε-productions (propagate nullables)
2. Remove unit productions (substitute chains)
3. Break long productions (A→BCD ==> A→BX, X→CD)
4. Fix terminal mixing (A→aB ==> A→T_aB, T_a→a)

Coming Next

Pushdown Automata (PDA) -- the machine model equivalent of CFGs. Think of it as an NFA with a stack. We will prove that PDAs recognize exactly the class of languages that CFGs generate.
