Skip to content

Database Systems (Advanced)

1. Advanced Relational Algebra

1.1 Extended Relational Algebra

Beyond the basic operations (selection, projection, join, etc.), extended relational algebra includes:

Grouping and aggregation: γF(R)\gamma_{F}(R) where FF is a list of aggregate functions and grouping attributes.

γdept,AVG(salary)avg_sal(Employee)\gamma_{\text{dept}, \text{AVG}(salary) \to \text{avg}\_sal}(\text{Employee})

Generalised projection: Allows arithmetic expressions in the projection list.

πname,salary×12annual(Employee)\pi_{\text{name}, \text{salary} \times 12 \to \text{annual}(\text{Employee})}

Outer join: Preserves unmatched tuples from one or both relations.

  • Left outer join RLSR \bowtie_L S: All tuples from RRWith NULLs for unmatched SS attributes.
  • Right outer join RRSR \bowtie_R S: All tuples from SSWith NULLs for unmatched RR attributes.
  • Full outer join RFSR \bowtie_F S: All tuples from both relations.

Recursive closure: Not expressible in basic relational algebra. Requires recursive CTEs or Datalog.

Division: R÷SR \div S returns tuples tt such that for every tuple sSs \in S, (t,s)R(t, s) \in R.

Theorem 1.1. Division can be expressed using basic relational algebra:

R÷S=πA(R)πA(πA(R)×SR)R \div S = \pi_{A}(R) - \pi_{A}\left(\pi_{A}(R) \times S - R\right)

Where AA is the set of attributes of RR not in SS.

Proof. Let tπA(R)t \in \pi_A(R). We need to show tR÷St \in R \div S if and only if tπA(πA(R)×SR)t \notin \pi_A(\pi_A(R) \times S - R).

(\Rightarrow) If tR÷St \in R \div SThen for every sSs \in S, (t,s)R(t, s) \in R. So (t,s)πA(R)×SR(t, s) \notin \pi_A(R) \times S - R for any ssHence tπA(πA(R)×SR)t \notin \pi_A(\pi_A(R) \times S - R).

(\Leftarrow) If tπA(πA(R)×SR)t \notin \pi_A(\pi_A(R) \times S - R)Then there is no sSs \in S such that (t,s)R(t, s) \notin R. This means for every sSs \in S, (t,s)R(t, s) \in RSo tR÷St \in R \div S. \blacksquare

Worked Example: Relational Division

Find students who have taken all courses offered by the CS department.

RR = (student, course) pairs from Enrolment. SS = courses offered by CS department: πcourse(σdept="CS’(Course))\pi_{\text{course}(\sigma_{\text{dept}=\text{"CS'}(\text{Course}))}}.

R÷SR \div S gives the students enrolled in every CS course.

Example: R={(Alice,CS101),(Alice,CS201),(Bob,CS101),(Carol,CS101),(Carol,CS201)}R = \{(\text{Alice}, \text{CS101}), (\text{Alice}, \text{CS201}), (\text{Bob}, \text{CS101}), (\text{Carol}, \text{CS101}), (\text{Carol}, \text{CS201})\}

S={CS101,CS201}S = \{\text{CS101}, \text{CS201}\}

πA(R)={Alice,Bob,Carol}\pi_A(R) = \{\text{Alice}, \text{Bob}, \text{Carol}\}

πA(R)×S={(Alice,CS101),(Alice,CS201),(Bob,CS101),(Bob,CS201),(Carol,CS101),(Carol,CS201)}\pi_A(R) \times S = \{(\text{Alice}, \text{CS101}), (\text{Alice}, \text{CS201}), (\text{Bob}, \text{CS101}), (\text{Bob}, \text{CS201}), (\text{Carol}, \text{CS101}), (\text{Carol}, \text{CS201})\}

πA(R)×SR={(Bob,CS201)}\pi_A(R) \times S - R = \{(\text{Bob}, \text{CS201})\}

πA(πA(R)×SR)={Bob}\pi_A(\pi_A(R) \times S - R) = \{\text{Bob}\}

R÷S={Alice,Carol}{Bob}={Alice,Carol}R \div S = \{\text{Alice}, \text{Carol}\} - \{\text{Bob}\} = \{\text{Alice}, \text{Carol}\}

Alice and Carol have taken all CS courses. Bob has not taken CS201.

1.2 Query Optimisation via Algebra

The query optimiser transforms a query into an equivalent but more efficient form using algebraic equivalences.

Key equivalences:

  1. Selection pushdown: σθ(RS)=σθ(R)S\sigma_\theta(R \bowtie S) = \sigma_\theta(R) \bowtie S if θ\theta only involves attributes of RR.
  2. Projection pushdown: Push projections as early as possible to reduce intermediate result sizes.
  3. Join reordering: RS=SRR \bowtie S = S \bowtie R (commutativity); (RS)T=R(ST)(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T) (associativity).
  4. Selection splitting: σθ1θ2(R)=σθ1(σθ2(R))\sigma_{\theta_1 \wedge \theta_2}(R) = \sigma_{\theta_1}(\sigma_{\theta_2}(R)).
  5. Cascade of selections: σθ1(σθ2(R))=σθ1θ2(R)\sigma_{\theta_1}(\sigma_{\theta_2}(R)) = \sigma_{\theta_1 \wedge \theta_2}(R).

Theorem 1.2. Selection pushdown is always beneficial: σθ(RS)\sigma_\theta(R \bowtie S) with θ\theta on RR costs O(RS)O(|R \bowtie S|) without pushdown but O(σθ(R)+σθ(R)S/R)O(|\sigma_\theta(R)| + |\sigma_\theta(R)| \cdot |S| / |R|) with pushdown (using hash join), which is always less.

1.3 Tuple Relational Calculus

Syntax: {tu(R(u)t.A=u.A)}\{t \mid \exists u (R(u) \wedge t.A = u.A \wedge \ldots)\}

Safety. A calculus expression is safe if its result is finite. Unsafe expressions can produce infinite relations (e.g., {t¬R(t)}\{t \mid \neg R(t)\} is the complement of RRWhich is infinite if the domain is infinite).

Theorem 1.3 (Codd). Every safe relational calculus query can be expressed in relational algebra, and vice versa. Therefore, relational algebra and safe relational calculus have the same expressive power.

2. Advanced SQL

2.1 Window Function Analysis

Window functions compute values across a set of table rows related to the current row, without collapsing rows (unlike GROUP BY).

Syntax:

FUNCTION(args) OVER (
[PARTITION BY col1, col2, ...]
[ORDER BY col3, col4, ...]
[frame_clause]
)

Frame specification:

ROWSRANGEBETWEENframestartANDframeend\text{ROWS} | RANGE BETWEEN frame_start AND frame_end

Frame unitMeaning
ROWSPhysical offset from current row
RANGELogical offset (based on ORDER BY value)
UNBOUNDED PRECEDINGFrom the first row of the partition
UNBOUNDED FOLLOWINGTo the last row of the partition
CURRENT ROWCurrent row (ROWS: just this row; RANGE: peers included)
n PRECEDINGnn rows/range values before current
n FOLLOWINGnn rows/range values after current

Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (when ORDER BY is present), or ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (when no ORDER BY).

Worked Example: Running Totals and Rankings

Table: Sales(salesperson, month, amount)

salespersonmonthamount
AliceJan5000
AliceFeb7000
AliceMar6000
BobJan3000
BobFeb4000
BobMar5000

Running total per salesperson:

SELECT salesperson, month, amount,
SUM(amount) OVER (
PARTITION BY salesperson
ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM Sales;

Result: | salesperson | month | amount | running_total | |-------------|-------|--------|---------------| | Alice | Jan | 5000 | 5000 | | Alice | Feb | 7000 | 12000 | | Alice | Mar | 6000 | 18000 | | Bob | Jan | 3000 | 3000 | | Bob | Feb | 4000 | 7000 | | Bob | Mar | 5000 | 12000 |

Ranking (RANK vs DENSE_RANK vs ROW_NUMBER):

SELECT salesperson, month, amount,
RANK() OVER (PARTITION BY salesperson ORDER BY amount DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY salesperson ORDER BY amount DESC) AS drnk,
ROW_NUMBER() OVER (PARTITION BY salesperson ORDER BY amount DESC) AS rn
FROM Sales;

If Alice had amounts 7000, 7000, 5000: RANK = (1, 1, 3), DENSE_RANK = (1, 1, 2), ROW_NUMBER = (1, 2, 3).

2.2 Advanced Query Patterns

Pivoting (rows to columns):

SELECT salesperson,
SUM(CASE WHEN month = 'Jan' THEN amount ELSE 0 END) AS jan_total,
SUM(CASE WHEN month = 'Feb' THEN amount ELSE 0 END) AS feb_total,
SUM(CASE WHEN month = 'Mar' THEN amount ELSE 0 END) AS mar_total
FROM Sales
GROUP BY salesperson;

Unpivoting (columns to rows):

SELECT salesperson, month, amount
FROM MonthlySales
UNPIVOT (amount FOR month IN (jan_total, feb_total, mar_total));

Gaps and Islands. Identifying consecutive sequences (e.g., consecutive login days):

WITH numbered AS (
SELECT user_id, login_date,
login_date - ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY login_date
) * INTERVAL '1 day' AS grp
FROM Logins
)
SELECT user_id, MIN(login_date) AS streak_start, MAX(login_date) AS streak_end, COUNT(*) AS streak_length
FROM numbered
GROUP BY user_id, grp;

2.3 Recursive CTEs for Graph Traversal

WITH RECURSIVE reachable(node, depth, path) AS (
SELECT start_node, 0, ARRAY[start_node]
FROM Edges WHERE start_node = 'A'
UNION ALL
SELECT e.end_node, r.depth + 1, r.path || e.end_node
FROM reachable r JOIN Edges e ON r.node = e.start_node
WHERE NOT (e.end_node = ANY(r.path)) -- prevent cycles
)
SELECT * FROM reachable WHERE depth <= 5;

:::caution Common Pitfall Recursive CTEs without cycle detection will loop forever on cyclic graphs. Always include a cycle check (e.g., tracking the path or visited nodes) or a depth limit. PostgreSQL supports the CYCLE clause for automatic cycle detection. :::

3. Advanced Normalisation

3.1 Join Dependencies and Fifth Normal Form (5NF)

A join dependency (JD) {R1,R2,,Rk}\bowtie\{R_1, R_2, \ldots, R_k\} holds over RR if R=R1R2RkR = R_1 \bowtie R_2 \bowtie \cdots \bowtie R_k.

Definition. A relation RR is in 5NF (Project-Join Normal Form) if for every non-trivial join dependency {R1,,Rk}\bowtie\{R_1, \ldots, R_k\} that holds over RREach RiR_i is a superkey of RR.

5NF generalises 4NF: every 4NF violation is also a 5NF violation, but not vice versa.

Worked Example: 5NF Violation

Relation: Teaching(course, teacher, textbook).

Constraint: Each course may have multiple teachers and multiple textbooks, but the teacher-textbook combination is independent (any teacher can use any textbook for a course they teach).

courseteachertextbook
CS101SmithKnuth
CS101SmithCormen
CS101JonesKnuth
CS101JonesCormen

This satisfies 4NF (no multi-valued dependencies beyond those implied by the key).

But the JD {(course,teacher),(course,textbook)}\bowtie\{(course, teacher), (course, textbook)\} holds: the relation is the join of its projections on (course, teacher) and (course, textbook).

Neither (course, teacher) nor (course, textbook) is a superkey. Therefore, the relation violates 5NF.

Decomposition:

  • R1R_1(course, teacher): {(CS101,Smith),(CS101,Jones)}\{(\text{CS101}, \text{Smith}), (\text{CS101}, \text{Jones})\}
  • R2R_2(course, textbook): {(CS101,Knuth),(CS101,Cormen)}\{(\text{CS101}, \text{Knuth}), (\text{CS101}, \text{Cormen})\}

This avoids the redundancy: adding a new teacher only requires adding one row to R1R_1Not nn rows (one per textbook).

3.2 Domain-Key Normal Form (DKNF/6NF)

Definition. A relation is in Domain-Key Normal Form (DKNF) if every constraint is a logical consequence of domain constraints and key constraints.

DKNF is the “ultimate” normal form: every relation in DKNF is free from insertion, deletion, and update anomalies. However, not every relation can be decomposed into DKNF while preserving dependencies.

Theorem 3.1 (Fagin, 1981). There is no algorithm to decompose an arbitrary relation schema into DKNF. This is in contrast to BCNF and 4NF, which can always be achieved via lossless decomposition.

3.3 Normalisation Algorithm Summary

Normal FormConditionDecomposition always possible?Dependency preserving?
1NFAtomic valuesyesYes
2NFNo partial dependenciesYesYes
3NFNo transitive dependenciesYesYes
BCNFNo non-trivial FDs with non-superkey LHSYesNot always
4NFNo non-trivial MVDs with non-superkey LHSYesNot always
5NFNo non-trivial JDs with non-superkey componentsNot alwaysNot always
DKNFOnly domain + key constraintsNot alwaysN/A

4. Query Optimisation in Depth

4.1 Cost-Based Optimisation

The optimiser estimates the cost of each query plan and chooses the cheapest one.

Cost model parameters:

ParameterSymbolDescription
BBNumber of disk blocks
RRNumber of records
V(A,R)V(A, R)Number of distinct values of attribute AA in RR
bfrbfrBlocking factor (records per block)
frtf_{rt}Fan-out of internal nodes of a B+ tree

Cost estimates for selection:

Selection typeCost (blocks)
Primary key equality (B+ tree)h+1h + 1 (tree height + leaf)
Secondary key equality (B+ tree)h+n/frth + \lceil n / f_{rt} \rceil
Linear scanR/bfr\lceil R / bfr \rceil
Comparison on sorted filelog2B+s/bfr\lceil \log_2 B \rceil + \lceil s / bfr \rceil

Theorem 4.1. For a selection σA=v(R)\sigma_{A=v}(R) with V(A,R)V(A, R) distinct values, the expected cost of a secondary B+ tree index lookup is R/(V(A,R)frt)+h+1\lceil R / (V(A, R) \cdot f_{rt}) \rceil + h + 1.

4.2 Join Algorithm Costs

AlgorithmCostBest when
Nested-loopBr+BrBsB_r + B_r \cdot B_sSmall relations, no index
Block nested-loopBr+Br/M1BsB_r + \lceil B_r / M - 1 \rceil \cdot B_sMM blocks of memory
Index nested-loopBr+Rr(costofindexlookuponS)B_r + R_r \cdot (\text{cost} of index lookup on S)Index on join attribute of SS
Sort-mergeBr+Bs+sortcostB_r + B_s + \text{sort} costBoth relations large
Hash join3(Br+Bs)3(B_r + B_s)Equi-join, no order required

Where MM is the number of available memory blocks.

Sort-merge join cost: O(BrlogM1Br+BslogM1Bs+Br+Bs)O(B_r \log_{M-1} B_r + B_s \log_{M-1} B_s + B_r + B_s).

Hash join cost: Build phase: scan RR and build hash table (BrB_r blocks read + written). Probe phase: scan SS and probe (BsB_s blocks read). Total: approximately 3(Br+Bs)3(B_r + B_s) (reading + writing both phases).

Worked Example: Choosing a Join Algorithm

RR: 2000 blocks, 100,000 records. SS: 500 blocks, 25,000 records. Memory: 52 blocks. Join on R.A=S.BR.A = S.B.

Block nested-loop: Br+Br/(M1)Bs=2000+2000/51500=2000+40×500=22,000B_r + \lceil B_r / (M-1) \rceil \cdot B_s = 2000 + \lceil 2000 / 51 \rceil \cdot 500 = 2000 + 40 \times 500 = 22,000 blocks.

Sort-merge: Sort RR: 2×2000×log2(2000/51)/512 \times 2000 \times \lceil \log_2(2000/51) \rceil / 51… More precisely, O(BrlogM1Br)=O(2000log512000)O(2000×1.88)=O(3760)O(B_r \log_{M-1} B_r) = O(2000 \log_{51} 2000) \approx O(2000 \times 1.88) = O(3760). Sort SS: O(500log51500)O(500×1.46)=O(730)O(500 \log_{51} 500) \approx O(500 \times 1.46) = O(730). Merge: 2000+500=25002000 + 500 = 2500. Total: 7000\approx 7000 blocks.

Hash join: 3(2000+500)=75003(2000 + 500) = 7500 blocks. But we need Mmin(Br,Bs)M \geq \lceil \min(B_r, B_s) \rceil… Actually, we need MBr+BsM \geq \sqrt{B_r + B_s} for the single-pass hash join. 2500=5052=M\sqrt{2500} = 50 \leq 52 = M. So single-pass hash join works.

Hybrid hash join: Partition RR into 2000/50=40\lceil 2000/50 \rceil = 40 partitions, each fitting in memory. Cost: 3×2000+3×500=75003 \times 2000 + 3 \times 500 = 7500.

Sort-merge (7000) is slightly better than hash join (7500) here, but both are far superior to block nested-loop (22,000).

If RR had an index on AA: Index nested-loop: Bs+Rs×costperlookupB_s + R_s \times \text{cost} per lookup. If the index gives O(1)O(1) lookup: 500+25000×1=25500500 + 25000 \times 1 = 25500 (worse, due to random I/O). If clustered index: 500+25000/100=750500 + 25000/100 = 750 (much better, since each lookup reads a page with 100 matching records on average).

4.3 The System R Optimiser

The System R optimiser (Selinger et al., 1979) was the first cost-based query optimiser. Key ideas:

  1. Dynamic programming. Build optimal plans bottom-up for each subset of relations in the FROM clause.
  2. Interesting orders. Only keep plans that produce output in an order useful for subsequent operations (e.g., sorted for sort-merge join or ORDER BY).
  3. Stored …/4-statistics-and-probability/2_statistics. Use catalog …/4-statistics-and-probability/2_statistics (table sizes, index cardinalities, value distributions) for cost estimation.

Theorem 4.2. The System R optimiser examines O(2n)O(2^n) join orderings for nn relations, using dynamic programming to avoid redundant computation.

5. Transaction Management in Depth

5.1 Serialisability Theory

Conflict serialisability. A schedule is conflict-serialisable if it can be transformed into a serial schedule by swapping non-conflicting operations (operations on different data items or both reads).

Precedence graph (dependency graph). Directed graph where nodes are transactions and there is an edge TiTjT_i \to T_j if TiT_i executes a write before TjT_j reads or writes the same data item.

Theorem 5.1. A schedule is conflict-serialisable if and only if its precedence graph is acyclic.

Proof. (\Rightarrow) If the schedule is conflict-serialisable, there exists an equivalent serial schedule. In this serial schedule, transactions are totally ordered. The precedence graph must be consistent with this order (no cycles).

(\Leftarrow) If the precedence graph is acyclic, perform a topological sort of the transactions. Execute the transactions in this topological order. By construction, all conflicting operations are in the same order as in the original schedule, so the serial schedule is equivalent. \blacksquare

Worked Example: Conflict Serialisability

Schedule (operations on data items A,BA, B):

T1:r(A),w(A),r(B),w(B)T_1: r(A), w(A), r(B), w(B) T2:r(A),w(A),r(B),w(B)T_2: r(A), w(A), r(B), w(B)

Schedule: r1(A),r2(A),w1(A),r1(B),w2(A),r2(B),w1(B),w2(B)r_1(A), r_2(A), w_1(A), r_1(B), w_2(A), r_2(B), w_1(B), w_2(B)

Precedence graph:

  • r2(A)r_2(A) before w1(A)w_1(A): T2T_2 reads AA before T1T_1 writes AA. Conflict: T2T1T_2 \to T_1 (read-write conflict on AA).
  • w1(A)w_1(A) before w2(A)w_2(A): T1T2T_1 \to T_2 (write-write conflict on AA).
  • r1(B)r_1(B) before r2(B)r_2(B): No conflict (both reads).
  • r1(B)r_1(B) before w1(B)w_1(B): Same transaction, no edge.
  • r2(B)r_2(B) before w1(B)w_1(B): T2T_2 reads BB before T1T_1 writes BB. Conflict: T2T1T_2 \to T_1 (read-write conflict on BB).
  • w1(B)w_1(B) before w2(B)w_2(B): T1T2T_1 \to T_2 (write-write conflict on BB).

Precedence graph: T1T2T_1 \to T_2 (from w1(A)w_1(A) before w2(A)w_2(A) and w1(B)w_1(B) before w2(B)w_2(B)) AND T2T1T_2 \to T_1 (from r2(A)r_2(A) before w1(A)w_1(A) and r2(B)r_2(B) before w1(B)w_1(B)).

The graph has a cycle: T1T2T1T_1 \to T_2 \to T_1. Therefore, this schedule is not conflict-serialisable.

5.2 View Serialisability

A schedule is view-serialisable if it is equivalent to a serial schedule under view equivalence:

  1. Initial read: If TiT_i reads the initial value of AAThen in the serial schedule, TiT_i must also read the initial value.
  2. Updated read: If TiT_i reads the value of AA written by TjT_jThen in the serial schedule, TiT_i must read the value written by TjT_j.
  3. Final write: If TiT_i performs the final write of AAThen in the serial schedule, TiT_i must perform the final write.

Theorem 5.2. Every conflict-serialisable schedule is view-serialisable, but not vice versa. Determining view serialisability is NP-complete.

5.3 Two-Phase Locking (2PL)

Basic 2PL. A transaction has two phases:

  1. Growing phase: Acquire locks, do not release any.
  2. Shrinking phase: Release locks, do not acquire any.

Theorem 5.3. 2PL guarantees conflict serialisability.

Proof. Consider any two conflicting operations pi(A)p_i(A) and qj(A)q_j(A) in transactions TiT_i and TjT_j. By 2PL, both operations acquire locks on AA. Since a lock can be held by only one transaction at a time, one must precede the other. If TiT_i acquires the lock first, then in the precedence graph, TiTjT_i \to T_j. If TjT_j acquires the lock first, TjTiT_j \to T_i. In either case, there is a consistent ordering. Since this holds for every conflicting pair, the precedence graph is acyclic, and by Theorem 5.1, the schedule is conflict-serialisable. \blacksquare

Variants of 2PL:

VariantDescriptionDeadlock?
Basic 2PLNo restrictions on when locks are releasedPossible
Strict 2PLAll locks held until commit/abortPossible
Rigorous 2PLAll locks held until commit/abort (same as strict for writes, but also reads)Possible

:::caution Common Pitfall 2PL prevents conflicts but does not prevent deadlocks. Deadlock prevention (e.g., wait-die, wound-wait) or detection (wait-for graph) is still needed. Additionally, 2PL may not be necessary: there exist conflict-serialisable schedules that cannot be produced by any 2PL protocol. :::

5.4 Multi-Version Concurrency Control (MVCC)

MVCC maintains multiple versions of each data item, allowing readers to access a consistent snapshot without blocking writers.

Key idea: Each transaction sees a snapshot of the database as of its start time. Writers create new versions.

PostgreSQL MVCC:

  • Each row has system columns: xmin (transaction ID that created the row), xmax (transaction ID that deleted or updated the row), ctid (physical location).
  • A transaction can see a row if xmin is committed and xmax is either uncommitted or not set.
  • UPDATE creates a new row version (with the old version’s xmax set).

MVCC vs locking:

AspectLockingMVCC
Read-write conflictsReaders block writersNo blocking
Write-write conflictsSerialisedFirst committer wins, others abort
Storage overheadLowHigher (multiple versions)
Garbage collectionNot neededNeeded (VACUUM in PostgreSQL)
Phantom readsPrevented by predicates locksNot always prevented

Snapshot Isolation. Each transaction reads from a snapshot as of its start. Writes are checked for conflicts at commit time: if two transactions modify the same row, the later committer aborts.

Theorem 5.4. Snapshot isolation prevents dirty reads, non-repeatable reads, and phantom reads (within a single transaction), but does not guarantee serialisability. In particular, write skew is possible under snapshot isolation.

Write skew example:

  • T1T_1 reads rows where x+y=10x + y = 10Checks x>3x > 3Updates x:=x1x := x - 1.
  • T2T_2 reads rows where x+y=10x + y = 10Checks y>3y > 3Updates y:=y1y := y - 1.
  • Both commit successfully under snapshot isolation, but the invariant x+y7x + y \geq 7 may be violated.

5.5 Log-Based Recovery

WAL (Write-Ahead Logging). Before writing a modified page to disk, the log record describing the modification must be written to the stable log.

WAL protocol: logi\log_i must be on stable storage before the data page containing the update of operation ii is written to disk.

ARIES recovery algorithm (used in most commercial DBMS):

  1. Analysis phase: Scan the log from the last checkpoint to identify dirty pages and active transactions.
  2. Redo phase: Scan the log from the last checkpoint, redoing all committed (and possibly uncommitted) updates to restore the database to the state at the time of the crash.
  3. Undo phase: Undo all updates of uncommitted transactions.

Theorem 5.5. ARIES guarantees that after recovery, the database reflects all committed transactions and no uncommitted transactions (atomicity and durability).

6. Distributed Databases

6.1 Two-Phase Commit (2PC)

Phase 1 (Voting): The coordinator sends a PREPARE message to all participants. Each participant votes YES (if it can commit) or NO (if it cannot).

Phase 2 (Decision): If all participants vote YES, the coordinator sends COMMIT. If any votes NO or times out, the coordinator sends ABORT.

Failure scenarios:

FailureHandling
Participant crashes before votingCoordinator times out, sends ABORT
Participant crashes after voting YESOn recovery, ask coordinator for decision
Coordinator crashes before sending decisionParticipants timeout, elect new coordinator or block
Coordinator crashes after sending decisionParticipants re-send decision when asked

Theorem 6.1. 2PC guarantees atomicity: all participants commit or all abort. However, 2PC is a blocking protocol: if the coordinator crashes, participants may block indefinitely.

6.2 Three-Phase Commit (3PC)

3PC adds a pre-commit phase to eliminate blocking:

  1. Voting: Same as 2PC Phase 1.
  2. Pre-commit: Coordinator sends PRE-COMMIT to all YES voters. Participants acknowledge.
  3. Commit: Coordinator sends COMMIT. Participants commit and acknowledge.

Theorem 6.2. 3PC is non-blocking if at most kk of 3k+13k + 1 processes fail (failure detector is accurate).

Proof sketch. After the pre-commit phase, no participant can unilaterally decide to abort (they have all voted YES). If the coordinator fails, the remaining participants can communicate and determine the outcome: if any participant has received PRE-COMMIT, they know the decision is COMMIT. If no participant has received PRE-COMMIT, they know the decision is ABORT (since some participant must have voted NO or timed out). \blacksquare

6.3 Consensus Algorithms

Paxos. A consensus algorithm that guarantees safety (agreement, integrity, validity) and liveness (eventual agreement) under asynchronous network conditions with f<n/2f < n/2 failures.

Raft. A more understandable alternative to Paxos:

  1. Leader election: Nodes elect a leader using randomised timeouts.
  2. Log replication: Leader appends entries to its log, replicates to followers.
  3. Safety: A log entry is committed once a majority of nodes have replicated it.

Theorem 6.3. Raft guarantees that committed entries are never lost and that all servers eventually agree on the same log entries.

10. Advanced Indexing Techniques

10.1 Bitmap Indexes

A bitmap index stores bitmaps for each distinct value of an attribute. For a column with VV distinct values and nn rows, the index consists of VV bitmaps, each of nn bits.

Example. Column “colour” with values \{Red, Green, Blue\} and 8 rows:

RowColour
1Red
2Green
3Blue
4Red
5Red
6Green
7Blue
8Red

Bitmaps:

  • Red: 10111001
  • Green: 01000100
  • Blue: 00100010

Query: “colour = Red OR colour = Green”: Red | Green = 11111101. Rows 1, 2, 4, 5, 6, 8.

Theorem 10.1. Bitmap indexes are most effective when the cardinality of the indexed column is low (few distinct values). The space usage is O(nV)O(n \cdot V) bits.

Bitwise operations. AND, OR, NOT on bitmaps take O(n/w)O(n/w) time where ww is the machine word size (64). Modern CPUs process 64 bits per cycle, so scanning 1 million rows takes 1000000/64156251000000 / 64 \approx 15625 operations.

Encoded bitmap indexes. For high-cardinality columns, use compression techniques:

  • WAH (Word-Aligned Hybrid): Run-length encoding with word-aligned literals.
  • Roaring bitmaps: Split the 32-bit value range into containers of 2162^{16} values. Each container is either a bitmap (if dense) or a sorted array (if sparse).

10.2 Hash Index Variants

Extendible hashing. Dynamically grows the hash table by doubling the number of buckets. Uses a global depth dd and per-bucket local depth dd'.

Linear hashing. Grows the hash table one bucket at a time, using a split pointer. No overflow buckets needed on average.

Comparison:

PropertyExtendibleLinear
Bucket splitTriggered by overflowGradual (one per insertion)
DirectoryYes (size 2d2^d)No directory needed
Worst-case lookup2 disk accesses1 disk access
Space utilisation~69%~77%

10.3 Multi-Dimensional Indexes

R-tree. A balanced tree for spatial data (rectangles, points). Each node stores a minimum bounding rectangle (MBR) covering all its children.

Operations:

  • Search: Starting at the root, check if the query rectangle overlaps the node’s MBR. If so, recurse into children. O(n11/d+k)O(n^{1-1/d} + k) where dd is the number of dimensions.
  • Insert: Find the leaf whose MBR needs the least enlargement. Split if overflow.
  • Delete: Find the leaf, remove the entry, condense the tree.

Theorem 10.2. An R-tree with nn rectangles in dd dimensions has O(logn)O(\log n) height (in practice) and supports point queries in O(logn)O(\log n) average time.

Worked Example: R-tree Construction

Insert rectangles into an R-tree with maximum 3 entries per node:

R1 = (1, 1, 3, 3), R2 = (2, 2, 4, 4), R3 = (5, 5, 7, 7), R4 = (6, 1, 8, 3), R5 = (1, 5, 3, 7).

Insert R1: Root = leaf {R1}. MBR = (1, 1, 3, 3). Insert R2: Leaf {R1, R2}. MBR = (1, 1, 4, 4). Insert R3: Leaf {R1, R2, R3}. MBR = (1, 1, 7, 7). Full (3 entries). Insert R4: Leaf is full. Split.

  • R1, R2 are close together. R3 is far. Best split: {R1, R2} and {R3, R4}.
  • Split 1 MBR: (1, 1, 4, 4). Split 2 MBR: (5, 1, 8, 7).
  • Create internal node with two children. Insert R5: Which leaf to insert into? R5 = (1, 5, 3, 7).
  • Enlargement for split 1 (1,1,4,4): new MBR = (1, 1, 4, 7), enlargement = 43 - 33 = 3.
  • Enlargement for split 2 (5,1,8,7): no overlap, no enlargement needed… Wait, R5 (1,5,3,7) does overlap with (5,1,8,7)? No, R5’s x-range [1,3] and split 2’s x-range [5,8] don’t overlap. So R5 goes into split 1.
  • Split 1: {R1, R2, R5}. MBR = (1, 1, 4, 7). Full again!
  • Split split 1: {R1, R2} and {R5}. Or {R1, R5} and {R2}.
  • Choose {R1, R2} (MBR 1,1,4,4) and {R5} (MBR 1,5,3,7). Minimum total enlargement.

Final R-tree:

Root (MBR: 1,1,8,7)
/ \
[R1,R2] (1,1,4,4) [R3,R4] (5,1,8,7)
| \
[R3] [R4] [R5] (1,5,3,7)

Wait, this structure doesn’t work cleanly. R-trees require careful split algorithms. The key takeaway is that R-trees group nearby spatial objects together and split when nodes overflow.

10.4 GiST (Generalised Search Tree)

GiST is a framework for building index structures for arbitrary data types and predicates. B-trees and R-trees are special cases of GiST.

Key idea. The user provides:

  • A consistent predicate: does a query match a tree entry?
  • A union method: compute the bounding predicate of a set of entries.
  • A penalty method: cost of inserting an entry into a subtree.
  • A pick-split method: how to split an overflowing node.

Theorem 10.3. GiST is correct (never misses results) if the user-supplied consistent predicate is sound: if consistent(entry, query) returns false, then no item in entry’s subtree matches query.

11. Advanced SQL Patterns

11.1 Recursive CTEs for Hierarchical Data

Employee-manager hierarchy:

WITH RECURSIVE hierarchy AS (
SELECT emp_id, name, manager_id, 1 AS level
FROM Employee
WHERE manager_id IS NULL
UNION ALL
SELECT e.emp_id, e.name, e.manager_id, h.level + 1
FROM Employee e
JOIN hierarchy h ON e.manager_id = h.emp_id
)
SELECT * FROM hierarchy ORDER BY level, name;

11.2 Window Functions for Time Series

Moving average:

SELECT date, value,
AVG(value) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7day
FROM StockPrices
WHERE symbol = 'AAPL';

Year-over-year growth:

SELECT year, quarter, revenue,
revenue - LAG(revenue, 4) OVER (PARTITION BY symbol ORDER BY year, quarter) AS yoy_change,
(revenue - LAG(revenue, 4) OVER (PARTITION BY symbol ORDER BY year, quarter))
/ LAG(revenue, 4) OVER (PARTITION BY symbol ORDER BY year, quarter) * 100 AS yoy_pct
FROM Financials
WHERE symbol = 'ACME';

11.3 Materialised Views

A materialised view stores the result of a query physically, like a table. Unlike a regular view (which is just a stored query), materialised views can be queried without recomputing.

Benefits:

  • Pre-computed results for expensive queries.
  • Can be indexed.
  • Query rewrite: the optimiser may automatically use a materialised view to answer a query.

Maintenance:

ApproachDescriptionOverhead
Complete refreshRecompute from scratchHigh, but simple
Incremental (fast refresh)Apply only changes since last refreshLow, but complex
On commitRefresh after every transactionHighest consistency

Theorem 11.1. Materialised view maintenance adds overhead to every update on the base tables. The overhead is proportional to the number of rows affected and the complexity of the view query.

12. Advanced Transaction Patterns

12.1 Savepoints

Savepoints allow partial rollback within a transaction:

BEGIN;
UPDATE Accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT sp1;
UPDATE Accounts SET balance = balance - 200 WHERE id = 2;
-- Oops, should have been 50, not 200
ROLLBACK TO sp1;
UPDATE Accounts SET balance = balance - 50 WHERE id = 2;
COMMIT;

12.2 Isolation Level Anomalies

A comprehensive summary of anomalies by isolation level:

AnomalyRead UncommittedRead CommittedRepeatable ReadSerializable
Dirty readPossiblePreventedPreventedPrevented
Non-repeatable readPossiblePossiblePreventedPrevented
Phantom readPossiblePossiblePossiblePrevented
Write skewPossiblePossiblePossiblePrevented
Read skewPossiblePossiblePreventedPrevented

Read skew: T1T_1 reads AA and BB, T2T_2 updates AA, T1T_1 reads AA again and sees a different value. Prevented by Repeatable Read (locks on read rows).

Write skew: T1T_1 reads rows where x+y>10x + y > 10Updates xx; T2T_2 reads same rows, updates yy. Both commit, but x+yx + y may now be 10\leq 10. NOT prevented by Repeatable Read (requires Serializable).

12.3 Optimistic Concurrency Control (OCC)

Phases:

  1. Read phase: Read data into private workspace, record read set.
  2. Validation phase: At commit time, check if any data in the read set was modified by another transaction.
  3. Write phase: If validation passes, apply writes. Otherwise, abort and retry.

Validation types:

TypeCheckCost
Backward validationCheck against committed transactions that finished after our read phase startedO(committedsincestart)O(\text{committed} since start)
Forward validationCheck against active transactions that started before our validation phaseO(currentlyactive)O(\text{currently} active)

Theorem 12.1. OCC is serialisable if the validation phase ensures that for every pair of overlapping transactions (reading/writing the same data), at most one commits.

14. Distributed Database Patterns

14.1 Partitioning (Sharding)

Horizontal partitioning. Rows of a table are distributed across nodes based on a partition key.

Partitioning strategies:

StrategyDescriptionHotspot risk
Hash partitioningf(key)modnf(\text{key}) \mod nLow
Range partitioningKey ranges assigned to nodesHigh (skewed ranges)
Directory-basedLookup service maps keys to nodesLow (directory is a bottleneck)

Consistent hashing. Nodes and keys are placed on a hash ring (e.g., 00 to 212812^{128} - 1). Each key is assigned to the first node clockwise. Virtual nodes (replicas on the ring) improve balance.

Theorem 14.1. With nn real nodes and kk virtual nodes per real node, the expected ratio of the most-loaded to least-loaded node is O(logn/k)O(\log n / k).

14.2 Replication

Single-leader replication. One node accepts writes; others replicate asynchronously or synchronously.

ModeLatencyDurabilityConsistency
SynchronousHigh (wait for all replicas)HighStrong (linearisable)
Semi-synchronousMedium (wait for majority)MediumStrong (with quorum)
AsynchronousLow (no wait)LowEventual

Multi-leader replication. Any node accepts writes. Conflicts are resolved using CRDTs, last-write-wins, or application-specific logic.

Theorem 14.2. Under asynchronous replication, if the leader fails, any data not yet replicated to followers is lost. This is unavoidable (FLP impossibility: in an asynchronous system, consensus is impossible with even one failure).

14.3 Consistency Models

Linearisability. The strongest consistency model: operations appear to execute atomically in real-time order.

Sequential consistency. Operations appear to execute atomically in some total order (but not necessarily real-time order).

Eventual consistency. If no new writes are made, eventually all reads return the same value.

Consistency model hierarchy:

LinearisableSequentialCausalEventual\text{Linearisable} \subset \text{Sequential} \subset \text{Causal} \subset \text{Eventual}

Worked Example: Consistency Anomalies

Consider a social media system with two replicas (US and EU):

Linearisable. User A posts “Hello” in US. User B in EU immediately reads A’s post. This is guaranteed under linearisability: the write is visible to all subsequent reads.

Eventual consistency. User A posts “Hello” in US. User B in EU may not see “Hello” for several seconds. But eventually, B will see it.

Write skew under snapshot isolation:

  • Transaction T1 (US): Read count of “likes” on post P (value = 10). Check count > 5. Update count to 9 (unlike).
  • Transaction T2 (EU): Read count of “likes” on post P (value = 10). Check count > 5. Update count to 9 (unlike).
  • Both commit under snapshot isolation (no conflict detected since they read the same value).
  • Final count = 9, but two unlikes happened. The correct count should be 8.

This is prevented by serialisable isolation (which would detect the conflict).

15. Advanced Query Processing

15.1 Sort-Merge Join Details

Phase 1: Sort. Both relations are sorted on the join attribute.

If the relations fit in memory (Br+BsMB_r + B_s \leq M): use in-memory sort. If not: use external merge sort.

External merge sort cost for relation RR with BB blocks and MM memory blocks:

Sort(R)=2B(1+logM1B/M)\text{Sort}(R) = 2B \left(1 + \lceil \log_{M-1} \lceil B / M \rceil \rceil \right)

Phase 2: Merge. Both sorted relations are scanned simultaneously, outputting matching pairs.

Merge cost: Br+BsB_r + B_s (one pass over each sorted relation).

Total sort-merge join cost: Sort(R)+Sort(S)+Br+Bs\text{Sort}(R) + \text{Sort}(S) + B_r + B_s.

15.2 Hash Join Details

Build phase. Scan the smaller relation RR and build an in-memory hash table using the join attribute as the key. If RR fits in M1M - 1 blocks: single-pass hash join.

Probe phase. Scan the larger relation SS and probe the hash table for matches.

Grace hash join (when RR does not fit in memory):

  1. Partition phase: Hash both relations into k=min(Br,Bs)/(M1)k = \lceil \min(B_r, B_s) / (M - 1) \rceil partitions on disk. Each partition of RR must fit in memory.
  2. Build + probe phase: For each partition pair (Ri,Si)(R_i, S_i)Load RiR_i into memory, build hash table, and probe with SiS_i.

Cost: Partition: 2(Br+Bs)2(B_r + B_s). Build + probe: Br+BsB_r + B_s. Total: 3(Br+Bs)3(B_r + B_s).

Theorem 15.1. Grace hash join works correctly if the hash function distributes tuples uniformly and the partition of RR fits in memory for each partition.

15.3 Index Nested-Loop Join Cost

For each tuple in RRLook up matching tuples in SS using an index on the join attribute.

Cost=Br+Rr(costperprobe)\text{Cost} = B_r + R_r \cdot (\text{cost} per probe)

Where RrR_r is the number of records in RR and the cost per probe depends on the index:

Index typeCost per probe
B+ tree (clustered)h+matchingrecords/fanout\lceil h + \text{matching} records / \text{fan}-out \rceil
B+ tree (unclustered)h+matchingrecords\lceil h + \text{matching} records \rceil
Hash index1.2\approx 1.2 (average)
Worked Example: Join Cost Comparison

RR: 10,000 blocks, 500,000 records. SS: 5,000 blocks, 250,000 records. Memory: 101 blocks. Equi-join on R.A=S.BR.A = S.B. SS has a clustered B+ tree index on BB (height 3, fan-out 100).

Block nested-loop: Br+Br/(M1)Bs=10000+10000/1005000=10000+100×5000=510,000B_r + \lceil B_r / (M-1) \rceil \cdot B_s = 10000 + \lceil 10000/100 \rceil \cdot 5000 = 10000 + 100 \times 5000 = 510,000 blocks.

Sort-merge: Sort RR: 2×10000×(1+log100100)=400002 \times 10000 \times (1 + \lceil \log_{100} 100 \rceil) = 40000. Sort SS: 2×5000×(1+log10050)=200002 \times 5000 \times (1 + \lceil \log_{100} 50 \rceil) = 20000. Merge: 10000+5000=1500010000 + 5000 = 15000. Total: 75,00075,000 blocks.

Hash join: 3(10000+5000)=45,0003(10000 + 5000) = 45,000 blocks. Check: need M15000122M \geq \sqrt{15000} \approx 122. We have 101 < 122. Need graceful hash join. Cost: 3×15000=45,0003 \times 15000 = 45,000 plus partition overhead. Still best.

Index nested-loop: Br+Rr×probecost=10000+500000×(3+matchingrecordsperprobe/100)B_r + R_r \times \text{probe} cost = 10000 + 500000 \times (3 + \text{matching} records per probe / 100). If the average matching records per probe is 2 (500000/250000 = 2): cost = 10000+500000×3.02=1,520,00010000 + 500000 \times 3.02 = 1,520,000 blocks. Much worse due to random I/O.

Ranking: Hash join (45K) < Sort-merge (75K) < Block nested-loop (510K) < Index nested-loop (1.5M).

17. Advanced SQL: Analytic Functions and Patterns

17.1 Percentile and Distribution Functions

-- Median salary per department
SELECT department_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) AS median_discrete
FROM Employee
GROUP BY department_id;

PERCENTILE_CONT interpolates (continuous); PERCENTILE_DISC returns an actual value from the data set (discrete).

17.2 Pattern Matching with MATCH_RECOGNIZE (SQL:2016)

SELECT *
FROM StockTicks
MATCH_RECOGNIZE (
ORDER BY tick_time
MEASURES FIRST(A.price) AS start_price, LAST(B.price) AS end_price
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST B
PATTERN (A+ B+)
DEFINE A AS A.price < PREV(A.price),
B AS B.price > PREV(B.price)
) AS v_shaped;

This finds “V-shaped” price patterns (consecutive decreases followed by consecutive increases).

17.3 Temporal Tables (System-Versioned)

SQL:2011 introduced system-versioned temporal tables, which automatically maintain historical data:

CREATE TABLE Employee (
emp_id INT PRIMARY KEY,
name VARCHAR(100),
salary DECIMAL(10,2),
sys_start TIMESTAMP GENERATED ALWAYS AS ROW START,
sys_end TIMESTAMP GENERATED ALWAYS AS ROW END,
PERIOD FOR SYSTEM_TIME (sys_start, sys_end)
) WITH SYSTEM VERSIONING;
-- Query data as of a specific point in time
SELECT * FROM Employee FOR SYSTEM_TIME AS OF '2025-01-01 00:00:00';
-- Query all historical changes
SELECT * FROM Employee FOR SYSTEM_TIME FROM '2025-01-01' TO '2025-12-31';

17.4 JSON and XML Support

JSON functions (PostgreSQL):

-- Create table with JSONB column
CREATE TABLE events (
id SERIAL PRIMARY KEY,
data JSONB NOT NULL
);
-- Insert
INSERT INTO events (data) VALUES ('{"type": "click", "user": "alice", "page": "/home"}');
-- Query
SELECT data->>'user' AS user, data->'page' AS page
FROM events
WHERE data->>'type' = 'click';
-- Index on JSONB
CREATE INDEX idx_events_type ON events USING gin ((data->>'type'));

Theorem 17.1. JSONB in PostgreSQL stores data in a decomposed binary format, enabling efficient indexing and querying. JSON (without the B) stores the exact text, which is faster to insert but slower to query.

18. Distributed Transactions: Two-Phase Commit Details

18.1 2PC Message Flow

Coordinator Participant 1 Participant 2
| | |
|--- PREPARE --------->| |
|--- PREPARE ----------------------------->|
| | |
|<-- VOTE_COMMIT -----| |
|<-- VOTE_COMMIT -------------------------|
| | |
|--- COMMIT --------->| |
|--- COMMIT ----------------------------->|
|<-- ACK -------------| |
|<-- ACK ----------------------------------|

Failure scenarios:

  1. Participant crashes before voting: Coordinator times out, sends ABORT.
  2. Participant crashes after voting YES: On recovery, participant asks coordinator for the decision.
  3. Coordinator crashes after collecting all votes but before sending decision: Participants timeout and query other participants. If any participant knows the decision (received COMMIT/ABORT before crash), it can inform others.
  4. Coordinator crashes after sending COMMIT to some but not all participants: Committed participants have committed. On recovery, coordinator re-sends COMMIT to remaining participants.

18.2 2PC Performance Analysis

Theorem 18.1. 2PC requires at least 2 RTTs (one for voting, one for decision) plus the disk I/O for logging. In a wide-area network with 100 ms RTT, 2PC adds at least 200 ms to each distributed transaction.

Optimisation: Presumed abort. If a participant does not know the outcome, it can assume ABORT (since COMMIT must be recorded in the coordinator’s log before sending). This reduces the coordinator’s recovery work.

Optimisation: Collected commit. Participants send their votes along with their redo log records. The coordinator writes all votes and the commit record in a single log flush.

19. Problem Set

7.1 Relational Algebra (Problems 1—3)

Problem 1. Express the following in relational algebra: “Find the names of students who have taken all courses offered by the department in which they are majoring.” Assume schemas: Student(name, dept), Course(cname, dept), Enrolment(name, cname).

Problem 2. Prove the equivalence: σθ1θ2(RS)=σθ1(R)σθ2(S)\sigma_{\theta_1 \wedge \theta_2}(R \bowtie S) = \sigma_{\theta_1}(R) \bowtie \sigma_{\theta_2}(S) when θ1\theta_1 involves only attributes of RR and θ2\theta_2 involves only attributes of SS.

Problem 3. Given relations R(A,B)R(A, B) and S(B,C)S(B, C) with R=10000|R| = 10\,000 records (100 blocks), S=5000|S| = 5\,000 records (50 blocks), and 101 blocks of memory, compare the cost of hash join, sort-merge join, and block nested-loop join.

7.2 SQL and Normalisation (Problems 4—7)

Problem 4. Write a single SQL query that, for each department, returns the top 3 employees by salary. Use window functions.

Problem 5. Consider the relation R(A,B,C,D,E)R(A, B, C, D, E) with functional dependencies: ABCAB \to C, CDC \to D, DED \to E, EAE \to A. Find all candidate keys, decompose into BCNF, and check if the decomposition is dependency-preserving.

Problem 6. The relation Teaching(course, teacher, textbook, room) has the constraint: “The teacher and textbook assigned to a course section are independent.” What normal form does this violate? Decompose accordingly.

Problem 7. Prove that if a relation is in BCNF, it is also in 3NF. Give an example of a relation in 3NF but not BCNF.

7.3 Transactions and Recovery (Problems 8—11)

Problem 8. Consider the following schedule: w1(A),r2(A),w2(B),r3(B),w3(C),r1(C)w_1(A), r_2(A), w_2(B), r_3(B), w_3(C), r_1(C). Draw the precedence graph and determine if the schedule is conflict-serialisable.

Problem 9. Explain how snapshot isolation allows write skew, and give a concrete example with bank accounts where the invariant “total balance >= 0” is violated.

Problem 10. A database uses ARIES recovery. The log contains (at the time of crash):

[LSN 001] BEGIN T1
[LSN 002] UPDATE T1 A: old=5, new=10
[LSN 003] BEGIN T2
[LSN 004] UPDATE T2 B: old=3, new=7
[LSN 005] COMMIT T1
[LSN 006] UPDATE T2 C: old=1, new=4
[LSN 007] CHECKPOINT (dirty: A, B, C; active: T2)

Describe the analysis, redo, and undo phases.

Problem 11. In 2PC, the coordinator crashes after sending PREPARE to all participants but before receiving all votes. Participant P1 votes YES, P2 votes NO. What happens when the coordinator recovers?

7.4 Distributed Databases (Problems 12—15)

Problem 12. Explain why 2PC is a blocking protocol. Give a specific failure scenario where participants block indefinitely.

Problem 13. Compare Paxos and Raft. In what situations is Raft preferred? What are Raft’s limitations compared to Paxos?

Problem 14. A distributed database uses synchronous replication with 3 replicas. What is the maximum number of node failures that can be tolerated while maintaining strong consistency?

Problem 15. Explain the CAP theorem. If a distributed database chooses consistency and partition tolerance (CP), what trade-offs does it make in terms of availability?

Solution to Problem 5

FDs: ABCAB \to C, CDC \to D, DED \to E, EAE \to A.

Closure computation:

AB+={A,B}AB^+ = \{A, B\} ABC+={A,B,C,D,E}ABC^+ = \{A, B, C, D, E\} (via CDC \to D, DED \to E, EAE \to A)

So ABAB is a candidate key. Similarly:

BC+={B,C,D,E,A}BC^+ = \{B, C, D, E, A\} (via CDC \to D, DED \to E, EAE \to A). So BCBC is a candidate key.

CD+={C,D,E,A}CD^+ = \{C, D, E, A\} (no BBSo not a candidate key).

DE+={D,E,A}DE^+ = \{D, E, A\} (no BBNot a key).

CE+={C,E,A,D}CE^+ = \{C, E, A, D\} (no BBNot a key).

BD+={B,D,E,A}BD^+ = \{B, D, E, A\} (no CCNot a key).

BE+={B,E,A}BE^+ = \{B, E, A\} (no CCNot a key).

AE+={A,E}AE^+ = \{A, E\} (not a key).

AC+={A,C,D,E}AC^+ = \{A, C, D, E\} (no BBNot a key).

AD+={A,D,E}AD^+ = \{A, D, E\} (no BBNot a key).

Candidate keys: {AB,BC}\{AB, BC\}.

BCNF decomposition:

CDC \to D violates BCNF (LHS CC is not a superkey). Decompose RR into:

  • R1(C,D)R_1(C, D) with CDC \to D (BCNF, key = CC)
  • R2(A,B,C,E)R_2(A, B, C, E) with ABCEAB \to CE, EAE \to A, CEABCE \to AB… Wait, let me recompute.

Actually, R2R_2 has attributes {A,B,C,E}\{A, B, C, E\} and the restricted FDs are ABCAB \to C, EAE \to A.

EAE \to A violates BCNF (LHS EE is not a superkey of R2R_2). Superkeys of R2R_2 include ABAB, BCBC (since BCDBC \to D is lost but BCBC in R2R_2: BCCBC \to CNot useful). Actually, BCBC is not a key in R2R_2 because we lost DD.

Keys of R2R_2: ABAB is a key (ABCAB \to CAnd with CC we need… ABC?ABC \to ? in R2R_2: CC doesn’t give us EE in R2R_2. So ABAB gives {A,B,C}\{A, B, C\}Not {A,B,C,E}\{A, B, C, E\}. So ABAB is NOT a key in R2R_2!

Hmm, let me recompute. In the original relation, ABAB is a key. But after removing DDThe remaining FDs are ABCAB \to C and EAE \to A.

AB+={A,B,C}AB^+ = \{A, B, C\} in R2R_2. Not a superkey (missing EE).

BE+={B,E,A,C}={A,B,C,E}BE^+ = \{B, E, A, C\} = \{A, B, C, E\} (via EAE \to A, ABCAB \to C). So BEBE is a key!

CE+={C,E,A,B}={A,B,C,E}CE^+ = \{C, E, A, B\} = \{A, B, C, E\} (via EAE \to A, ABCAB \to C). So CECE is a key!

BC+={B,C}BC^+ = \{B, C\} (no applicable FD). Not a key.

AE+={A,E}AE^+ = \{A, E\}. Not a key.

Keys of R2R_2: {BE,CE}\{BE, CE\}.

EAE \to A violates BCNF. Decompose R2R_2 into:

  • R2a(E,A)R_{2a}(E, A) with EAE \to A (BCNF, key = EE)
  • R2b(B,C,E)R_{2b}(B, C, E) with restricted FDs: BECBE \to C, CEBCE \to B… Wait, BECBE \to C comes from ABCAB \to C restricted to {B,C,E}\{B, C, E\}: we lose the dependency since AA is not in R2bR_{2b}.

Actually, ABCAB \to C restricted to R2b(B,C,E)R_{2b}(B, C, E): the LHS is ABAB but AR2bA \notin R_{2b}So this FD is lost. The only remaining FDs in R2bR_{2b} are trivial. So R2bR_{2b} is in BCNF with key BEBE (or CECE).

BCNF decomposition: R1(C,D)R_1(C, D), R2a(E,A)R_{2a}(E, A), R2b(B,C,E)R_{2b}(B, C, E).

Dependency preservation: CDC \to D is preserved (in R1R_1). EAE \to A is preserved (in R2aR_{2a}). ABCAB \to C is NOT preserved (lost in the decomposition). DED \to E is NOT preserved (lost).

So the decomposition is NOT dependency-preserving.

If you get this wrong, revise: Sections 3 and 4.

Solution to Problem 8

Schedule: w1(A),r2(A),w2(B),r3(B),w3(C),r1(C)w_1(A), r_2(A), w_2(B), r_3(B), w_3(C), r_1(C).

Conflicting operations (same data item, at least one write):

w1(A)w_1(A) and r2(A)r_2(A): T1T_1 writes AA before T2T_2 reads AA. Edge: T1T2T_1 \to T_2. w2(B)w_2(B) and r3(B)r_3(B): T2T_2 writes BB before T3T_3 reads BB. Edge: T2T3T_2 \to T_3. w3(C)w_3(C) and r1(C)r_1(C): T3T_3 writes CC before T1T_1 reads CC. Edge: T3T1T_3 \to T_1.

Precedence graph: T1T2T3T1T_1 \to T_2 \to T_3 \to T_1.

This contains a cycle (T1T2T3T1T_1 \to T_2 \to T_3 \to T_1). Therefore, the schedule is not conflict-serialisable.

If you get this wrong, revise: Section 5.1.

Common Pitfalls

  • Confusing 2NF and 3NF. 2NF removes partial dependencies; 3NF removes transitive dependencies. Fix: A relation in 3NF is also in 2NF; check for non-prime attributes depending on other non-prime attributes.
  • Wrong isolation level. Read uncommitted: dirty reads possible. Serializable: no anomalies but lowest concurrency. Fix: Balance consistency and performance; most applications use Read Committed or Repeatable Read.
  • Confusing the CAP theorem trade-offs. A distributed system can guarantee at most 2 of: Consistency, Availability, Partition tolerance. Fix: Network partitions are inevitable; choose between CP (consistent but unavailable) and AP (available but eventually consistent).

Worked Examples

Example 1: Normalisation

Problem. Relation R(A, B, C, D) with FDs: AB → C, C → D. Is R in 3NF?

Solution. Key: AB. C depends on AB (partial dependency on non-prime B? No — C depends on the full key AB). C → D: D depends on C, which is non-prime. This is a transitive dependency, violating 3NF.

Decompose: R1(A, B, C), R2(C, D). Both are in 3NF.

\blacksquare

Example 2: SQL join

Problem. Students(ID, Name, DeptID) and Departments(DeptID, DeptName). Write SQL to list all students with their department names.

Solution. SELECT s.Name, d.DeptName FROM Students s INNER JOIN Departments d ON s.DeptID = d.DeptID;

\blacksquare

Summary

  • Normalisation: 1NF (atomic), 2NF (no partial dependencies), 3NF (no transitive dependencies), BCNF.
  • ACID properties: Atomicity, Consistency, Isolation, Durability.
  • SQL: DDL (CREATE, ALTER, DROP), DML (SELECT, INSERT, UPDATE, DELETE), DCL (GRANT, REVOKE).
  • CAP theorem: distributed systems trade off consistency, availability, and partition tolerance.