Database Systems (Advanced)
1. Advanced Relational Algebra
1.1 Extended Relational Algebra
Beyond the basic operations (selection, projection, join, etc.), extended relational algebra includes:
Grouping and aggregation: where is a list of aggregate functions and grouping attributes.
Generalised projection: Allows arithmetic expressions in the projection list.
Outer join: Preserves unmatched tuples from one or both relations.
- Left outer join : All tuples from With NULLs for unmatched attributes.
- Right outer join : All tuples from With NULLs for unmatched attributes.
- Full outer join : All tuples from both relations.
Recursive closure: Not expressible in basic relational algebra. Requires recursive CTEs or Datalog.
Division: returns tuples such that for every tuple , .
Theorem 1.1. Division can be expressed using basic relational algebra:
Where is the set of attributes of not in .
Proof. Let . We need to show if and only if .
() If Then for every , . So for any Hence .
() If Then there is no such that . This means for every , So .
Worked Example: Relational Division
Find students who have taken all courses offered by the CS department.
= (student, course) pairs from Enrolment. = courses offered by CS department: .
gives the students enrolled in every CS course.
Example:
Alice and Carol have taken all CS courses. Bob has not taken CS201.
1.2 Query Optimisation via Algebra
The query optimiser transforms a query into an equivalent but more efficient form using algebraic equivalences.
Key equivalences:
- Selection pushdown: if only involves attributes of .
- Projection pushdown: Push projections as early as possible to reduce intermediate result sizes.
- Join reordering: (commutativity); (associativity).
- Selection splitting: .
- Cascade of selections: .
Theorem 1.2. Selection pushdown is always beneficial: with on costs without pushdown but with pushdown (using hash join), which is always less.
1.3 Tuple Relational Calculus
Syntax:
Safety. A calculus expression is safe if its result is finite. Unsafe expressions can produce infinite relations (e.g., is the complement of Which is infinite if the domain is infinite).
Theorem 1.3 (Codd). Every safe relational calculus query can be expressed in relational algebra, and vice versa. Therefore, relational algebra and safe relational calculus have the same expressive power.
2. Advanced SQL
2.1 Window Function Analysis
Window functions compute values across a set of table rows related to the current row, without collapsing rows (unlike GROUP BY).
Syntax:
FUNCTION(args) OVER ( [PARTITION BY col1, col2, ...] [ORDER BY col3, col4, ...] [frame_clause])Frame specification:
| Frame unit | Meaning |
|---|---|
ROWS | Physical offset from current row |
RANGE | Logical offset (based on ORDER BY value) |
UNBOUNDED PRECEDING | From the first row of the partition |
UNBOUNDED FOLLOWING | To the last row of the partition |
CURRENT ROW | Current row (ROWS: just this row; RANGE: peers included) |
n PRECEDING | rows/range values before current |
n FOLLOWING | rows/range values after current |
Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (when ORDER BY is present), or ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (when no ORDER BY).
Worked Example: Running Totals and Rankings
Table: Sales(salesperson, month, amount)
| salesperson | month | amount |
|---|---|---|
| Alice | Jan | 5000 |
| Alice | Feb | 7000 |
| Alice | Mar | 6000 |
| Bob | Jan | 3000 |
| Bob | Feb | 4000 |
| Bob | Mar | 5000 |
Running total per salesperson:
SELECT salesperson, month, amount, SUM(amount) OVER ( PARTITION BY salesperson ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_totalFROM Sales;Result: | salesperson | month | amount | running_total | |-------------|-------|--------|---------------| | Alice | Jan | 5000 | 5000 | | Alice | Feb | 7000 | 12000 | | Alice | Mar | 6000 | 18000 | | Bob | Jan | 3000 | 3000 | | Bob | Feb | 4000 | 7000 | | Bob | Mar | 5000 | 12000 |
Ranking (RANK vs DENSE_RANK vs ROW_NUMBER):
SELECT salesperson, month, amount, RANK() OVER (PARTITION BY salesperson ORDER BY amount DESC) AS rnk, DENSE_RANK() OVER (PARTITION BY salesperson ORDER BY amount DESC) AS drnk, ROW_NUMBER() OVER (PARTITION BY salesperson ORDER BY amount DESC) AS rnFROM Sales;If Alice had amounts 7000, 7000, 5000: RANK = (1, 1, 3), DENSE_RANK = (1, 1, 2), ROW_NUMBER = (1, 2, 3).
2.2 Advanced Query Patterns
Pivoting (rows to columns):
SELECT salesperson, SUM(CASE WHEN month = 'Jan' THEN amount ELSE 0 END) AS jan_total, SUM(CASE WHEN month = 'Feb' THEN amount ELSE 0 END) AS feb_total, SUM(CASE WHEN month = 'Mar' THEN amount ELSE 0 END) AS mar_totalFROM SalesGROUP BY salesperson;Unpivoting (columns to rows):
SELECT salesperson, month, amountFROM MonthlySalesUNPIVOT (amount FOR month IN (jan_total, feb_total, mar_total));Gaps and Islands. Identifying consecutive sequences (e.g., consecutive login days):
WITH numbered AS ( SELECT user_id, login_date, login_date - ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY login_date ) * INTERVAL '1 day' AS grp FROM Logins)SELECT user_id, MIN(login_date) AS streak_start, MAX(login_date) AS streak_end, COUNT(*) AS streak_lengthFROM numberedGROUP BY user_id, grp;2.3 Recursive CTEs for Graph Traversal
WITH RECURSIVE reachable(node, depth, path) AS ( SELECT start_node, 0, ARRAY[start_node] FROM Edges WHERE start_node = 'A' UNION ALL SELECT e.end_node, r.depth + 1, r.path || e.end_node FROM reachable r JOIN Edges e ON r.node = e.start_node WHERE NOT (e.end_node = ANY(r.path)) -- prevent cycles)SELECT * FROM reachable WHERE depth <= 5;:::caution Common Pitfall Recursive CTEs without cycle detection will loop forever on cyclic graphs. Always include a cycle check (e.g., tracking the path or visited nodes) or a depth limit. PostgreSQL supports the CYCLE clause for automatic cycle detection. :::
3. Advanced Normalisation
3.1 Join Dependencies and Fifth Normal Form (5NF)
A join dependency (JD) holds over if .
Definition. A relation is in 5NF (Project-Join Normal Form) if for every non-trivial join dependency that holds over Each is a superkey of .
5NF generalises 4NF: every 4NF violation is also a 5NF violation, but not vice versa.
Worked Example: 5NF Violation
Relation: Teaching(course, teacher, textbook).
Constraint: Each course may have multiple teachers and multiple textbooks, but the teacher-textbook combination is independent (any teacher can use any textbook for a course they teach).
| course | teacher | textbook |
|---|---|---|
| CS101 | Smith | Knuth |
| CS101 | Smith | Cormen |
| CS101 | Jones | Knuth |
| CS101 | Jones | Cormen |
This satisfies 4NF (no multi-valued dependencies beyond those implied by the key).
But the JD holds: the relation is the join of its projections on (course, teacher) and (course, textbook).
Neither (course, teacher) nor (course, textbook) is a superkey. Therefore, the relation violates 5NF.
Decomposition:
- (course, teacher):
- (course, textbook):
This avoids the redundancy: adding a new teacher only requires adding one row to Not rows (one per textbook).
3.2 Domain-Key Normal Form (DKNF/6NF)
Definition. A relation is in Domain-Key Normal Form (DKNF) if every constraint is a logical consequence of domain constraints and key constraints.
DKNF is the “ultimate” normal form: every relation in DKNF is free from insertion, deletion, and update anomalies. However, not every relation can be decomposed into DKNF while preserving dependencies.
Theorem 3.1 (Fagin, 1981). There is no algorithm to decompose an arbitrary relation schema into DKNF. This is in contrast to BCNF and 4NF, which can always be achieved via lossless decomposition.
3.3 Normalisation Algorithm Summary
| Normal Form | Condition | Decomposition always possible? | Dependency preserving? |
|---|---|---|---|
| 1NF | Atomic values | yes | Yes |
| 2NF | No partial dependencies | Yes | Yes |
| 3NF | No transitive dependencies | Yes | Yes |
| BCNF | No non-trivial FDs with non-superkey LHS | Yes | Not always |
| 4NF | No non-trivial MVDs with non-superkey LHS | Yes | Not always |
| 5NF | No non-trivial JDs with non-superkey components | Not always | Not always |
| DKNF | Only domain + key constraints | Not always | N/A |
4. Query Optimisation in Depth
4.1 Cost-Based Optimisation
The optimiser estimates the cost of each query plan and chooses the cheapest one.
Cost model parameters:
| Parameter | Symbol | Description |
|---|---|---|
| Number of disk blocks | ||
| Number of records | ||
| Number of distinct values of attribute in | ||
| Blocking factor (records per block) | ||
| Fan-out of internal nodes of a B+ tree |
Cost estimates for selection:
| Selection type | Cost (blocks) |
|---|---|
| Primary key equality (B+ tree) | (tree height + leaf) |
| Secondary key equality (B+ tree) | |
| Linear scan | |
| Comparison on sorted file |
Theorem 4.1. For a selection with distinct values, the expected cost of a secondary B+ tree index lookup is .
4.2 Join Algorithm Costs
| Algorithm | Cost | Best when |
|---|---|---|
| Nested-loop | Small relations, no index | |
| Block nested-loop | blocks of memory | |
| Index nested-loop | Index on join attribute of | |
| Sort-merge | Both relations large | |
| Hash join | Equi-join, no order required |
Where is the number of available memory blocks.
Sort-merge join cost: .
Hash join cost: Build phase: scan and build hash table ( blocks read + written). Probe phase: scan and probe ( blocks read). Total: approximately (reading + writing both phases).
Worked Example: Choosing a Join Algorithm
: 2000 blocks, 100,000 records. : 500 blocks, 25,000 records. Memory: 52 blocks. Join on .
Block nested-loop: blocks.
Sort-merge: Sort : … More precisely, . Sort : . Merge: . Total: blocks.
Hash join: blocks. But we need … Actually, we need for the single-pass hash join. . So single-pass hash join works.
Hybrid hash join: Partition into partitions, each fitting in memory. Cost: .
Sort-merge (7000) is slightly better than hash join (7500) here, but both are far superior to block nested-loop (22,000).
If had an index on : Index nested-loop: . If the index gives lookup: (worse, due to random I/O). If clustered index: (much better, since each lookup reads a page with 100 matching records on average).
4.3 The System R Optimiser
The System R optimiser (Selinger et al., 1979) was the first cost-based query optimiser. Key ideas:
- Dynamic programming. Build optimal plans bottom-up for each subset of relations in the FROM clause.
- Interesting orders. Only keep plans that produce output in an order useful for subsequent operations (e.g., sorted for sort-merge join or ORDER BY).
- Stored …/4-statistics-and-probability/2_statistics. Use catalog …/4-statistics-and-probability/2_statistics (table sizes, index cardinalities, value distributions) for cost estimation.
Theorem 4.2. The System R optimiser examines join orderings for relations, using dynamic programming to avoid redundant computation.
5. Transaction Management in Depth
5.1 Serialisability Theory
Conflict serialisability. A schedule is conflict-serialisable if it can be transformed into a serial schedule by swapping non-conflicting operations (operations on different data items or both reads).
Precedence graph (dependency graph). Directed graph where nodes are transactions and there is an edge if executes a write before reads or writes the same data item.
Theorem 5.1. A schedule is conflict-serialisable if and only if its precedence graph is acyclic.
Proof. () If the schedule is conflict-serialisable, there exists an equivalent serial schedule. In this serial schedule, transactions are totally ordered. The precedence graph must be consistent with this order (no cycles).
() If the precedence graph is acyclic, perform a topological sort of the transactions. Execute the transactions in this topological order. By construction, all conflicting operations are in the same order as in the original schedule, so the serial schedule is equivalent.
Worked Example: Conflict Serialisability
Schedule (operations on data items ):
Schedule:
Precedence graph:
- before : reads before writes . Conflict: (read-write conflict on ).
- before : (write-write conflict on ).
- before : No conflict (both reads).
- before : Same transaction, no edge.
- before : reads before writes . Conflict: (read-write conflict on ).
- before : (write-write conflict on ).
Precedence graph: (from before and before ) AND (from before and before ).
The graph has a cycle: . Therefore, this schedule is not conflict-serialisable.
5.2 View Serialisability
A schedule is view-serialisable if it is equivalent to a serial schedule under view equivalence:
- Initial read: If reads the initial value of Then in the serial schedule, must also read the initial value.
- Updated read: If reads the value of written by Then in the serial schedule, must read the value written by .
- Final write: If performs the final write of Then in the serial schedule, must perform the final write.
Theorem 5.2. Every conflict-serialisable schedule is view-serialisable, but not vice versa. Determining view serialisability is NP-complete.
5.3 Two-Phase Locking (2PL)
Basic 2PL. A transaction has two phases:
- Growing phase: Acquire locks, do not release any.
- Shrinking phase: Release locks, do not acquire any.
Theorem 5.3. 2PL guarantees conflict serialisability.
Proof. Consider any two conflicting operations and in transactions and . By 2PL, both operations acquire locks on . Since a lock can be held by only one transaction at a time, one must precede the other. If acquires the lock first, then in the precedence graph, . If acquires the lock first, . In either case, there is a consistent ordering. Since this holds for every conflicting pair, the precedence graph is acyclic, and by Theorem 5.1, the schedule is conflict-serialisable.
Variants of 2PL:
| Variant | Description | Deadlock? |
|---|---|---|
| Basic 2PL | No restrictions on when locks are released | Possible |
| Strict 2PL | All locks held until commit/abort | Possible |
| Rigorous 2PL | All locks held until commit/abort (same as strict for writes, but also reads) | Possible |
:::caution Common Pitfall 2PL prevents conflicts but does not prevent deadlocks. Deadlock prevention (e.g., wait-die, wound-wait) or detection (wait-for graph) is still needed. Additionally, 2PL may not be necessary: there exist conflict-serialisable schedules that cannot be produced by any 2PL protocol. :::
5.4 Multi-Version Concurrency Control (MVCC)
MVCC maintains multiple versions of each data item, allowing readers to access a consistent snapshot without blocking writers.
Key idea: Each transaction sees a snapshot of the database as of its start time. Writers create new versions.
PostgreSQL MVCC:
- Each row has system columns:
xmin(transaction ID that created the row),xmax(transaction ID that deleted or updated the row),ctid(physical location). - A transaction can see a row if
xminis committed andxmaxis either uncommitted or not set. - UPDATE creates a new row version (with the old version’s
xmaxset).
MVCC vs locking:
| Aspect | Locking | MVCC |
|---|---|---|
| Read-write conflicts | Readers block writers | No blocking |
| Write-write conflicts | Serialised | First committer wins, others abort |
| Storage overhead | Low | Higher (multiple versions) |
| Garbage collection | Not needed | Needed (VACUUM in PostgreSQL) |
| Phantom reads | Prevented by predicates locks | Not always prevented |
Snapshot Isolation. Each transaction reads from a snapshot as of its start. Writes are checked for conflicts at commit time: if two transactions modify the same row, the later committer aborts.
Theorem 5.4. Snapshot isolation prevents dirty reads, non-repeatable reads, and phantom reads (within a single transaction), but does not guarantee serialisability. In particular, write skew is possible under snapshot isolation.
Write skew example:
- reads rows where Checks Updates .
- reads rows where Checks Updates .
- Both commit successfully under snapshot isolation, but the invariant may be violated.
5.5 Log-Based Recovery
WAL (Write-Ahead Logging). Before writing a modified page to disk, the log record describing the modification must be written to the stable log.
WAL protocol: must be on stable storage before the data page containing the update of operation is written to disk.
ARIES recovery algorithm (used in most commercial DBMS):
- Analysis phase: Scan the log from the last checkpoint to identify dirty pages and active transactions.
- Redo phase: Scan the log from the last checkpoint, redoing all committed (and possibly uncommitted) updates to restore the database to the state at the time of the crash.
- Undo phase: Undo all updates of uncommitted transactions.
Theorem 5.5. ARIES guarantees that after recovery, the database reflects all committed transactions and no uncommitted transactions (atomicity and durability).
6. Distributed Databases
6.1 Two-Phase Commit (2PC)
Phase 1 (Voting): The coordinator sends a PREPARE message to all participants. Each participant votes YES (if it can commit) or NO (if it cannot).
Phase 2 (Decision): If all participants vote YES, the coordinator sends COMMIT. If any votes NO or times out, the coordinator sends ABORT.
Failure scenarios:
| Failure | Handling |
|---|---|
| Participant crashes before voting | Coordinator times out, sends ABORT |
| Participant crashes after voting YES | On recovery, ask coordinator for decision |
| Coordinator crashes before sending decision | Participants timeout, elect new coordinator or block |
| Coordinator crashes after sending decision | Participants re-send decision when asked |
Theorem 6.1. 2PC guarantees atomicity: all participants commit or all abort. However, 2PC is a blocking protocol: if the coordinator crashes, participants may block indefinitely.
6.2 Three-Phase Commit (3PC)
3PC adds a pre-commit phase to eliminate blocking:
- Voting: Same as 2PC Phase 1.
- Pre-commit: Coordinator sends PRE-COMMIT to all YES voters. Participants acknowledge.
- Commit: Coordinator sends COMMIT. Participants commit and acknowledge.
Theorem 6.2. 3PC is non-blocking if at most of processes fail (failure detector is accurate).
Proof sketch. After the pre-commit phase, no participant can unilaterally decide to abort (they have all voted YES). If the coordinator fails, the remaining participants can communicate and determine the outcome: if any participant has received PRE-COMMIT, they know the decision is COMMIT. If no participant has received PRE-COMMIT, they know the decision is ABORT (since some participant must have voted NO or timed out).
6.3 Consensus Algorithms
Paxos. A consensus algorithm that guarantees safety (agreement, integrity, validity) and liveness (eventual agreement) under asynchronous network conditions with failures.
Raft. A more understandable alternative to Paxos:
- Leader election: Nodes elect a leader using randomised timeouts.
- Log replication: Leader appends entries to its log, replicates to followers.
- Safety: A log entry is committed once a majority of nodes have replicated it.
Theorem 6.3. Raft guarantees that committed entries are never lost and that all servers eventually agree on the same log entries.
10. Advanced Indexing Techniques
10.1 Bitmap Indexes
A bitmap index stores bitmaps for each distinct value of an attribute. For a column with distinct values and rows, the index consists of bitmaps, each of bits.
Example. Column “colour” with values \{Red, Green, Blue\} and 8 rows:
| Row | Colour |
|---|---|
| 1 | Red |
| 2 | Green |
| 3 | Blue |
| 4 | Red |
| 5 | Red |
| 6 | Green |
| 7 | Blue |
| 8 | Red |
Bitmaps:
- Red: 10111001
- Green: 01000100
- Blue: 00100010
Query: “colour = Red OR colour = Green”: Red | Green = 11111101. Rows 1, 2, 4, 5, 6, 8.
Theorem 10.1. Bitmap indexes are most effective when the cardinality of the indexed column is low (few distinct values). The space usage is bits.
Bitwise operations. AND, OR, NOT on bitmaps take time where is the machine word size (64). Modern CPUs process 64 bits per cycle, so scanning 1 million rows takes operations.
Encoded bitmap indexes. For high-cardinality columns, use compression techniques:
- WAH (Word-Aligned Hybrid): Run-length encoding with word-aligned literals.
- Roaring bitmaps: Split the 32-bit value range into containers of values. Each container is either a bitmap (if dense) or a sorted array (if sparse).
10.2 Hash Index Variants
Extendible hashing. Dynamically grows the hash table by doubling the number of buckets. Uses a global depth and per-bucket local depth .
Linear hashing. Grows the hash table one bucket at a time, using a split pointer. No overflow buckets needed on average.
Comparison:
| Property | Extendible | Linear |
|---|---|---|
| Bucket split | Triggered by overflow | Gradual (one per insertion) |
| Directory | Yes (size ) | No directory needed |
| Worst-case lookup | 2 disk accesses | 1 disk access |
| Space utilisation | ~69% | ~77% |
10.3 Multi-Dimensional Indexes
R-tree. A balanced tree for spatial data (rectangles, points). Each node stores a minimum bounding rectangle (MBR) covering all its children.
Operations:
- Search: Starting at the root, check if the query rectangle overlaps the node’s MBR. If so, recurse into children. where is the number of dimensions.
- Insert: Find the leaf whose MBR needs the least enlargement. Split if overflow.
- Delete: Find the leaf, remove the entry, condense the tree.
Theorem 10.2. An R-tree with rectangles in dimensions has height (in practice) and supports point queries in average time.
Worked Example: R-tree Construction
Insert rectangles into an R-tree with maximum 3 entries per node:
R1 = (1, 1, 3, 3), R2 = (2, 2, 4, 4), R3 = (5, 5, 7, 7), R4 = (6, 1, 8, 3), R5 = (1, 5, 3, 7).
Insert R1: Root = leaf {R1}. MBR = (1, 1, 3, 3). Insert R2: Leaf {R1, R2}. MBR = (1, 1, 4, 4). Insert R3: Leaf {R1, R2, R3}. MBR = (1, 1, 7, 7). Full (3 entries). Insert R4: Leaf is full. Split.
- R1, R2 are close together. R3 is far. Best split:
{R1, R2}and{R3, R4}. - Split 1 MBR: (1, 1, 4, 4). Split 2 MBR: (5, 1, 8, 7).
- Create internal node with two children. Insert R5: Which leaf to insert into? R5 = (1, 5, 3, 7).
- Enlargement for split 1 (1,1,4,4): new MBR = (1, 1, 4, 7), enlargement = 43 - 33 = 3.
- Enlargement for split 2 (5,1,8,7): no overlap, no enlargement needed… Wait, R5 (1,5,3,7) does overlap with (5,1,8,7)? No, R5’s x-range [1,3] and split 2’s x-range [5,8] don’t overlap. So R5 goes into split 1.
- Split 1:
{R1, R2, R5}. MBR = (1, 1, 4, 7). Full again! - Split split 1:
{R1, R2}and{R5}. Or{R1, R5}and{R2}. - Choose
{R1, R2}(MBR 1,1,4,4) and{R5}(MBR 1,5,3,7). Minimum total enlargement.
Final R-tree:
Root (MBR: 1,1,8,7) / \ [R1,R2] (1,1,4,4) [R3,R4] (5,1,8,7) | \ [R3] [R4] [R5] (1,5,3,7)Wait, this structure doesn’t work cleanly. R-trees require careful split algorithms. The key takeaway is that R-trees group nearby spatial objects together and split when nodes overflow.
10.4 GiST (Generalised Search Tree)
GiST is a framework for building index structures for arbitrary data types and predicates. B-trees and R-trees are special cases of GiST.
Key idea. The user provides:
- A consistent predicate: does a query match a tree entry?
- A union method: compute the bounding predicate of a set of entries.
- A penalty method: cost of inserting an entry into a subtree.
- A pick-split method: how to split an overflowing node.
Theorem 10.3. GiST is correct (never misses results) if the user-supplied consistent predicate is sound: if consistent(entry, query) returns false, then no item in entry’s subtree matches query.
11. Advanced SQL Patterns
11.1 Recursive CTEs for Hierarchical Data
Employee-manager hierarchy:
WITH RECURSIVE hierarchy AS ( SELECT emp_id, name, manager_id, 1 AS level FROM Employee WHERE manager_id IS NULL UNION ALL SELECT e.emp_id, e.name, e.manager_id, h.level + 1 FROM Employee e JOIN hierarchy h ON e.manager_id = h.emp_id)SELECT * FROM hierarchy ORDER BY level, name;11.2 Window Functions for Time Series
Moving average:
SELECT date, value, AVG(value) OVER ( ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS moving_avg_7dayFROM StockPricesWHERE symbol = 'AAPL';Year-over-year growth:
SELECT year, quarter, revenue, revenue - LAG(revenue, 4) OVER (PARTITION BY symbol ORDER BY year, quarter) AS yoy_change, (revenue - LAG(revenue, 4) OVER (PARTITION BY symbol ORDER BY year, quarter)) / LAG(revenue, 4) OVER (PARTITION BY symbol ORDER BY year, quarter) * 100 AS yoy_pctFROM FinancialsWHERE symbol = 'ACME';11.3 Materialised Views
A materialised view stores the result of a query physically, like a table. Unlike a regular view (which is just a stored query), materialised views can be queried without recomputing.
Benefits:
- Pre-computed results for expensive queries.
- Can be indexed.
- Query rewrite: the optimiser may automatically use a materialised view to answer a query.
Maintenance:
| Approach | Description | Overhead |
|---|---|---|
| Complete refresh | Recompute from scratch | High, but simple |
| Incremental (fast refresh) | Apply only changes since last refresh | Low, but complex |
| On commit | Refresh after every transaction | Highest consistency |
Theorem 11.1. Materialised view maintenance adds overhead to every update on the base tables. The overhead is proportional to the number of rows affected and the complexity of the view query.
12. Advanced Transaction Patterns
12.1 Savepoints
Savepoints allow partial rollback within a transaction:
BEGIN;UPDATE Accounts SET balance = balance - 100 WHERE id = 1;SAVEPOINT sp1;UPDATE Accounts SET balance = balance - 200 WHERE id = 2;-- Oops, should have been 50, not 200ROLLBACK TO sp1;UPDATE Accounts SET balance = balance - 50 WHERE id = 2;COMMIT;12.2 Isolation Level Anomalies
A comprehensive summary of anomalies by isolation level:
| Anomaly | Read Uncommitted | Read Committed | Repeatable Read | Serializable |
|---|---|---|---|---|
| Dirty read | Possible | Prevented | Prevented | Prevented |
| Non-repeatable read | Possible | Possible | Prevented | Prevented |
| Phantom read | Possible | Possible | Possible | Prevented |
| Write skew | Possible | Possible | Possible | Prevented |
| Read skew | Possible | Possible | Prevented | Prevented |
Read skew: reads and , updates , reads again and sees a different value. Prevented by Repeatable Read (locks on read rows).
Write skew: reads rows where Updates ; reads same rows, updates . Both commit, but may now be . NOT prevented by Repeatable Read (requires Serializable).
12.3 Optimistic Concurrency Control (OCC)
Phases:
- Read phase: Read data into private workspace, record read set.
- Validation phase: At commit time, check if any data in the read set was modified by another transaction.
- Write phase: If validation passes, apply writes. Otherwise, abort and retry.
Validation types:
| Type | Check | Cost |
|---|---|---|
| Backward validation | Check against committed transactions that finished after our read phase started | |
| Forward validation | Check against active transactions that started before our validation phase |
Theorem 12.1. OCC is serialisable if the validation phase ensures that for every pair of overlapping transactions (reading/writing the same data), at most one commits.
14. Distributed Database Patterns
14.1 Partitioning (Sharding)
Horizontal partitioning. Rows of a table are distributed across nodes based on a partition key.
Partitioning strategies:
| Strategy | Description | Hotspot risk |
|---|---|---|
| Hash partitioning | Low | |
| Range partitioning | Key ranges assigned to nodes | High (skewed ranges) |
| Directory-based | Lookup service maps keys to nodes | Low (directory is a bottleneck) |
Consistent hashing. Nodes and keys are placed on a hash ring (e.g., to ). Each key is assigned to the first node clockwise. Virtual nodes (replicas on the ring) improve balance.
Theorem 14.1. With real nodes and virtual nodes per real node, the expected ratio of the most-loaded to least-loaded node is .
14.2 Replication
Single-leader replication. One node accepts writes; others replicate asynchronously or synchronously.
| Mode | Latency | Durability | Consistency |
|---|---|---|---|
| Synchronous | High (wait for all replicas) | High | Strong (linearisable) |
| Semi-synchronous | Medium (wait for majority) | Medium | Strong (with quorum) |
| Asynchronous | Low (no wait) | Low | Eventual |
Multi-leader replication. Any node accepts writes. Conflicts are resolved using CRDTs, last-write-wins, or application-specific logic.
Theorem 14.2. Under asynchronous replication, if the leader fails, any data not yet replicated to followers is lost. This is unavoidable (FLP impossibility: in an asynchronous system, consensus is impossible with even one failure).
14.3 Consistency Models
Linearisability. The strongest consistency model: operations appear to execute atomically in real-time order.
Sequential consistency. Operations appear to execute atomically in some total order (but not necessarily real-time order).
Eventual consistency. If no new writes are made, eventually all reads return the same value.
Consistency model hierarchy:
Worked Example: Consistency Anomalies
Consider a social media system with two replicas (US and EU):
Linearisable. User A posts “Hello” in US. User B in EU immediately reads A’s post. This is guaranteed under linearisability: the write is visible to all subsequent reads.
Eventual consistency. User A posts “Hello” in US. User B in EU may not see “Hello” for several seconds. But eventually, B will see it.
Write skew under snapshot isolation:
- Transaction T1 (US): Read count of “likes” on post P (value = 10). Check count > 5. Update count to 9 (unlike).
- Transaction T2 (EU): Read count of “likes” on post P (value = 10). Check count > 5. Update count to 9 (unlike).
- Both commit under snapshot isolation (no conflict detected since they read the same value).
- Final count = 9, but two unlikes happened. The correct count should be 8.
This is prevented by serialisable isolation (which would detect the conflict).
15. Advanced Query Processing
15.1 Sort-Merge Join Details
Phase 1: Sort. Both relations are sorted on the join attribute.
If the relations fit in memory (): use in-memory sort. If not: use external merge sort.
External merge sort cost for relation with blocks and memory blocks:
Phase 2: Merge. Both sorted relations are scanned simultaneously, outputting matching pairs.
Merge cost: (one pass over each sorted relation).
Total sort-merge join cost: .
15.2 Hash Join Details
Build phase. Scan the smaller relation and build an in-memory hash table using the join attribute as the key. If fits in blocks: single-pass hash join.
Probe phase. Scan the larger relation and probe the hash table for matches.
Grace hash join (when does not fit in memory):
- Partition phase: Hash both relations into partitions on disk. Each partition of must fit in memory.
- Build + probe phase: For each partition pair Load into memory, build hash table, and probe with .
Cost: Partition: . Build + probe: . Total: .
Theorem 15.1. Grace hash join works correctly if the hash function distributes tuples uniformly and the partition of fits in memory for each partition.
15.3 Index Nested-Loop Join Cost
For each tuple in Look up matching tuples in using an index on the join attribute.
Where is the number of records in and the cost per probe depends on the index:
| Index type | Cost per probe |
|---|---|
| B+ tree (clustered) | |
| B+ tree (unclustered) | |
| Hash index | (average) |
Worked Example: Join Cost Comparison
: 10,000 blocks, 500,000 records. : 5,000 blocks, 250,000 records. Memory: 101 blocks. Equi-join on . has a clustered B+ tree index on (height 3, fan-out 100).
Block nested-loop: blocks.
Sort-merge: Sort : . Sort : . Merge: . Total: blocks.
Hash join: blocks. Check: need . We have 101 < 122. Need graceful hash join. Cost: plus partition overhead. Still best.
Index nested-loop: . If the average matching records per probe is 2 (500000/250000 = 2): cost = blocks. Much worse due to random I/O.
Ranking: Hash join (45K) < Sort-merge (75K) < Block nested-loop (510K) < Index nested-loop (1.5M).
17. Advanced SQL: Analytic Functions and Patterns
17.1 Percentile and Distribution Functions
-- Median salary per departmentSELECT department_id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary, PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) AS median_discreteFROM EmployeeGROUP BY department_id;PERCENTILE_CONT interpolates (continuous); PERCENTILE_DISC returns an actual value from the data set (discrete).
17.2 Pattern Matching with MATCH_RECOGNIZE (SQL:2016)
SELECT *FROM StockTicksMATCH_RECOGNIZE ( ORDER BY tick_time MEASURES FIRST(A.price) AS start_price, LAST(B.price) AS end_price ONE ROW PER MATCH AFTER MATCH SKIP TO LAST B PATTERN (A+ B+) DEFINE A AS A.price < PREV(A.price), B AS B.price > PREV(B.price)) AS v_shaped;This finds “V-shaped” price patterns (consecutive decreases followed by consecutive increases).
17.3 Temporal Tables (System-Versioned)
SQL:2011 introduced system-versioned temporal tables, which automatically maintain historical data:
CREATE TABLE Employee ( emp_id INT PRIMARY KEY, name VARCHAR(100), salary DECIMAL(10,2), sys_start TIMESTAMP GENERATED ALWAYS AS ROW START, sys_end TIMESTAMP GENERATED ALWAYS AS ROW END, PERIOD FOR SYSTEM_TIME (sys_start, sys_end)) WITH SYSTEM VERSIONING;
-- Query data as of a specific point in timeSELECT * FROM Employee FOR SYSTEM_TIME AS OF '2025-01-01 00:00:00';
-- Query all historical changesSELECT * FROM Employee FOR SYSTEM_TIME FROM '2025-01-01' TO '2025-12-31';17.4 JSON and XML Support
JSON functions (PostgreSQL):
-- Create table with JSONB columnCREATE TABLE events ( id SERIAL PRIMARY KEY, data JSONB NOT NULL);
-- InsertINSERT INTO events (data) VALUES ('{"type": "click", "user": "alice", "page": "/home"}');
-- QuerySELECT data->>'user' AS user, data->'page' AS pageFROM eventsWHERE data->>'type' = 'click';
-- Index on JSONBCREATE INDEX idx_events_type ON events USING gin ((data->>'type'));Theorem 17.1. JSONB in PostgreSQL stores data in a decomposed binary format, enabling efficient indexing and querying. JSON (without the B) stores the exact text, which is faster to insert but slower to query.
18. Distributed Transactions: Two-Phase Commit Details
18.1 2PC Message Flow
Coordinator Participant 1 Participant 2 | | | |--- PREPARE --------->| | |--- PREPARE ----------------------------->| | | | |<-- VOTE_COMMIT -----| | |<-- VOTE_COMMIT -------------------------| | | | |--- COMMIT --------->| | |--- COMMIT ----------------------------->| |<-- ACK -------------| | |<-- ACK ----------------------------------|Failure scenarios:
- Participant crashes before voting: Coordinator times out, sends ABORT.
- Participant crashes after voting YES: On recovery, participant asks coordinator for the decision.
- Coordinator crashes after collecting all votes but before sending decision: Participants timeout and query other participants. If any participant knows the decision (received COMMIT/ABORT before crash), it can inform others.
- Coordinator crashes after sending COMMIT to some but not all participants: Committed participants have committed. On recovery, coordinator re-sends COMMIT to remaining participants.
18.2 2PC Performance Analysis
Theorem 18.1. 2PC requires at least 2 RTTs (one for voting, one for decision) plus the disk I/O for logging. In a wide-area network with 100 ms RTT, 2PC adds at least 200 ms to each distributed transaction.
Optimisation: Presumed abort. If a participant does not know the outcome, it can assume ABORT (since COMMIT must be recorded in the coordinator’s log before sending). This reduces the coordinator’s recovery work.
Optimisation: Collected commit. Participants send their votes along with their redo log records. The coordinator writes all votes and the commit record in a single log flush.
19. Problem Set
7.1 Relational Algebra (Problems 1—3)
Problem 1. Express the following in relational algebra: “Find the names of students who have taken all courses offered by the department in which they are majoring.” Assume schemas: Student(name, dept), Course(cname, dept), Enrolment(name, cname).
Problem 2. Prove the equivalence: when involves only attributes of and involves only attributes of .
Problem 3. Given relations and with records (100 blocks), records (50 blocks), and 101 blocks of memory, compare the cost of hash join, sort-merge join, and block nested-loop join.
7.2 SQL and Normalisation (Problems 4—7)
Problem 4. Write a single SQL query that, for each department, returns the top 3 employees by salary. Use window functions.
Problem 5. Consider the relation with functional dependencies: , , , . Find all candidate keys, decompose into BCNF, and check if the decomposition is dependency-preserving.
Problem 6. The relation Teaching(course, teacher, textbook, room) has the constraint: “The teacher and textbook assigned to a course section are independent.” What normal form does this violate? Decompose accordingly.
Problem 7. Prove that if a relation is in BCNF, it is also in 3NF. Give an example of a relation in 3NF but not BCNF.
7.3 Transactions and Recovery (Problems 8—11)
Problem 8. Consider the following schedule: . Draw the precedence graph and determine if the schedule is conflict-serialisable.
Problem 9. Explain how snapshot isolation allows write skew, and give a concrete example with bank accounts where the invariant “total balance >= 0” is violated.
Problem 10. A database uses ARIES recovery. The log contains (at the time of crash):
[LSN 001] BEGIN T1[LSN 002] UPDATE T1 A: old=5, new=10[LSN 003] BEGIN T2[LSN 004] UPDATE T2 B: old=3, new=7[LSN 005] COMMIT T1[LSN 006] UPDATE T2 C: old=1, new=4[LSN 007] CHECKPOINT (dirty: A, B, C; active: T2)Describe the analysis, redo, and undo phases.
Problem 11. In 2PC, the coordinator crashes after sending PREPARE to all participants but before receiving all votes. Participant P1 votes YES, P2 votes NO. What happens when the coordinator recovers?
7.4 Distributed Databases (Problems 12—15)
Problem 12. Explain why 2PC is a blocking protocol. Give a specific failure scenario where participants block indefinitely.
Problem 13. Compare Paxos and Raft. In what situations is Raft preferred? What are Raft’s limitations compared to Paxos?
Problem 14. A distributed database uses synchronous replication with 3 replicas. What is the maximum number of node failures that can be tolerated while maintaining strong consistency?
Problem 15. Explain the CAP theorem. If a distributed database chooses consistency and partition tolerance (CP), what trade-offs does it make in terms of availability?
Solution to Problem 5
FDs: , , , .
Closure computation:
(via , , )
So is a candidate key. Similarly:
(via , , ). So is a candidate key.
(no So not a candidate key).
(no Not a key).
(no Not a key).
(no Not a key).
(no Not a key).
(not a key).
(no Not a key).
(no Not a key).
Candidate keys: .
BCNF decomposition:
violates BCNF (LHS is not a superkey). Decompose into:
- with (BCNF, key = )
- with , , … Wait, let me recompute.
Actually, has attributes and the restricted FDs are , .
violates BCNF (LHS is not a superkey of ). Superkeys of include , (since is lost but in : Not useful). Actually, is not a key in because we lost .
Keys of : is a key (And with we need… in : doesn’t give us in . So gives Not . So is NOT a key in !
Hmm, let me recompute. In the original relation, is a key. But after removing The remaining FDs are and .
in . Not a superkey (missing ).
(via , ). So is a key!
(via , ). So is a key!
(no applicable FD). Not a key.
. Not a key.
Keys of : .
violates BCNF. Decompose into:
- with (BCNF, key = )
- with restricted FDs: , … Wait, comes from restricted to : we lose the dependency since is not in .
Actually, restricted to : the LHS is but So this FD is lost. The only remaining FDs in are trivial. So is in BCNF with key (or ).
BCNF decomposition: , , .
Dependency preservation: is preserved (in ). is preserved (in ). is NOT preserved (lost in the decomposition). is NOT preserved (lost).
So the decomposition is NOT dependency-preserving.
If you get this wrong, revise: Sections 3 and 4.
Solution to Problem 8
Schedule: .
Conflicting operations (same data item, at least one write):
and : writes before reads . Edge: . and : writes before reads . Edge: . and : writes before reads . Edge: .
Precedence graph: .
This contains a cycle (). Therefore, the schedule is not conflict-serialisable.
If you get this wrong, revise: Section 5.1.
Common Pitfalls
- Confusing 2NF and 3NF. 2NF removes partial dependencies; 3NF removes transitive dependencies. Fix: A relation in 3NF is also in 2NF; check for non-prime attributes depending on other non-prime attributes.
- Wrong isolation level. Read uncommitted: dirty reads possible. Serializable: no anomalies but lowest concurrency. Fix: Balance consistency and performance; most applications use Read Committed or Repeatable Read.
- Confusing the CAP theorem trade-offs. A distributed system can guarantee at most 2 of: Consistency, Availability, Partition tolerance. Fix: Network partitions are inevitable; choose between CP (consistent but unavailable) and AP (available but eventually consistent).
Worked Examples
Example 1: Normalisation
Problem. Relation R(A, B, C, D) with FDs: AB → C, C → D. Is R in 3NF?
Solution. Key: AB. C depends on AB (partial dependency on non-prime B? No — C depends on the full key AB). C → D: D depends on C, which is non-prime. This is a transitive dependency, violating 3NF.
Decompose: R1(A, B, C), R2(C, D). Both are in 3NF.
Example 2: SQL join
Problem. Students(ID, Name, DeptID) and Departments(DeptID, DeptName). Write SQL to list all students with their department names.
Solution. SELECT s.Name, d.DeptName FROM Students s INNER JOIN Departments d ON s.DeptID = d.DeptID;
Summary
- Normalisation: 1NF (atomic), 2NF (no partial dependencies), 3NF (no transitive dependencies), BCNF.
- ACID properties: Atomicity, Consistency, Isolation, Durability.
- SQL: DDL (CREATE, ALTER, DROP), DML (SELECT, INSERT, UPDATE, DELETE), DCL (GRANT, REVOKE).
- CAP theorem: distributed systems trade off consistency, availability, and partition tolerance.