Skip to content

Databases

1. Relational Model

1.1 Fundamentals

A relation is a set of tuples over a relation schema R(A1,A2,,An)R(A_1, A_2, \ldots, A_n) where each AiA_i is an attribute from domain DiD_i.

Key terms:

  • Tuple: A single row in a relation.
  • Attribute: A named column in a relation.
  • Domain: The set of allowed values for an attribute.
  • Degree (arity): Number of attributes.
  • Cardinality: Number of tuples.

1.2 Keys

Key TypeDefinition
SuperkeyA set of attributes that uniquely identifies a tuple
Candidate keyA minimal superkey (no proper subset is a superkey)
Primary keyA chosen candidate key
Foreign keyReferences the primary key of another relation
Composite keyA key consisting of multiple attributes

1.3 Relational Algebra

Selection (σ\sigma): Filter rows.

σcondition(R)={tR:t satisfies condition}\sigma_{\text{condition}}(R) = \{t \in R : t \text{ satisfies condition}\}

Projection (π\pi): Select columns.

πA1,A2,(R)={t[A1,A2,]:tR}\pi_{A_1, A_2, \ldots}(R) = \{t[A_1, A_2, \ldots] : t \in R\}

Join (\bowtie): Combine relations.

RconditionS=σcondition(R×S)R \bowtie_{\text{condition}} S = \sigma_{\text{condition}}(R \times S)

Natural join (\bowtie): Join on all shared attributes, eliminating duplicates.

Outer joins: Left (\leftarrow \bowtie), right (\bowtie \rightarrow), full (\leftarrow \bowtie \rightarrow): include unmatched tuples with NULLs.

Other operations:

  • Union (RSR \cup S): Tuples in RR or SS (same schema).
  • Difference (RSR - S): Tuples in RR but not SS.
  • Intersection (RSR \cap S): Tuples in both.
  • Cartesian product (R×SR \times S): All combinations.
  • Rename (ρx(R)\rho_x(R)): Rename relation to xx.

2. SQL

2.1 DDL (Data Definition Language)

CREATE TABLE Students (
student_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
gpa DECIMAL(3,2) CHECK (gpa >= 0.0 AND gpa <= 4.0),
major VARCHAR(50),
enrollment_date DATE DEFAULT CURRENT_DATE
);
CREATE TABLE Courses (
course_id INT PRIMARY KEY,
title VARCHAR(200) NOT NULL,
credits INT CHECK (credits > 0 AND credits <= 6),
department VARCHAR(50)
);
CREATE TABLE Enrollments (
student_id INT REFERENCES Students(student_id),
course_id INT REFERENCES Courses(course_id),
semester VARCHAR(20),
grade CHAR(2),
PRIMARY KEY (student_id, course_id, semester)
);
DROP TABLE Enrollments;
ALTER TABLE Students ADD COLUMN email VARCHAR(255) UNIQUE;

2.2 DML (Data Manipulation Language)

INSERT INTO Students VALUES (1, "Alice', 3.9, 'CS', '2024-09-01');
UPDATE Students SET gpa = 3.95 WHERE student_id = 1;
DELETE FROM Students WHERE student_id = 1;

2.3 Queries

Basic SELECT:

SELECT name, gpa FROM Students WHERE major = 'CS' ORDER BY gpa DESC;

Joins:

SELECT s.name, c.title, e.grade
FROM Students s
JOIN Enrollments e ON s.student_id = e.student_id
JOIN Courses c ON e.course_id = c.course_id
WHERE e.semester = 'Fall 2025';

Aggregate functions:

SELECT major, AVG(gpa) AS avg_gpa, COUNT(*) AS count
FROM Students
GROUP BY major
HAVING COUNT(*) >= 5;

Subqueries:

-- Correlated subquery
SELECT name FROM Students s
WHERE EXISTS (
SELECT 1 FROM Enrollments e
WHERE e.student_id = s.student_id AND e.grade = 'A+'
);
-- Scalar subquery
SELECT name, (
SELECT COUNT(*) FROM Enrollments
WHERE Enrollments.student_id = Students.student_id
) AS course_count
FROM Students;

Window functions:

SELECT name, major, gpa,
RANK() OVER (PARTITION BY major ORDER BY gpa DESC) AS rank_in_major,
SUM(gpa) OVER (PARTITION BY major) AS total_gpa
FROM Students;

3. Normalization

3.1 Functional Dependencies

A functional dependency XYX \to Y holds in relation RR if for any two tuples t1,t2t_1, t_2 in RR: if t1[X]=t2[X]t_1[X] = t_2[X], then t1[Y]=t2[Y]t_1[Y] = t_2[Y].

Armstrong’s Axioms:

  1. Reflexivity: If YXY \subseteq X, then XYX \to Y.
  2. Augmentation: If XYX \to Y, then XZYZXZ \to YZ.
  3. Transitivity: If XYX \to Y and YZY \to Z, then XZX \to Z.

Derived rules:

  • Union: XYX \to Y and XZX \to Z implies XYZX \to YZ.
  • Decomposition: XYZX \to YZ implies XYX \to Y and XZX \to Z.
  • Pseudotransitivity: XYX \to Y and WYZWY \to Z implies WXZWX \to Z.

3.2 Closure and Minimal Cover

Attribute closure X+X^+: The set of all attributes functionally determined by XX.

CLOSURE(X, F):
result = X
repeat:
changed = false
for each FD Y -> Z in F:
if Y ⊆ result and Z ⊈ result:
result = result ∪ Z
changed = true
until not changed
return result

Minimal cover: A minimal set of FDs equivalent to FF where:

  1. No FD can be derived from others (irredundant).
  2. Every FD has a single attribute on the right.
  3. No attribute on the left can be removed.

3.3 Normal Forms

1NF (First Normal Form): All attributes have atomic (indivisible) values. No repeating groups.

2NF: 1NF and no partial dependencies. Every non-prime attribute is fully dependent on the entire candidate key.

XY is a partial dependency if Y is non-prime and X is a proper subset of a candidate keyX \to Y \text{ is a partial dependency if } Y \text{ is non-prime and } X \text{ is a proper subset of a candidate key}

3NF: 2NF and no transitive dependencies. For every non-trivial FD XYX \to Y:

X is a superkeyorY is a prime attributeX \text{ is a superkey} \quad \text{or} \quad Y \text{ is a prime attribute}

BCNF (Boyce-Codd Normal Form): For every non-trivial FD XYX \to Y:

X is a superkeyX \text{ is a superkey}

Decomposition into BCNF:

DECOMPOSE(R, F):
find FD X -> Y that violates BCNF
R1 = X ∪ Y
R2 = attributes of R minus (Y - X)
recursively decompose R1 and R2

3.4 Normal Form Comparison

Normal FormEliminatesMay have
1NFMultivalued attributesPartial, transitive deps
2NFPartial dependenciesTransitive dependencies
3NFTransitive dependenciesBCNF violations
BCNFAll FDs not from superkeysLossy decomposition possible

4. Indexing

4.1 B+ Trees

A B+ tree of order dd is a balanced search tree optimized for disk access:

  • Internal nodes: Store up to 2d2d keys and 2d+12d+1 pointers.
  • Leaf nodes: Store keys and values (or pointers to records). All at the same depth.
  • Leaf links: Leaves connected in a linked list for range queries.

Structure:

[10 | 20 | 30]
/ | | \
[1,5,8] [12,15] [22,25] [32,35,40]
→→→→→→→→→→→→→→→→→→→→→→→ (linked)

Operations:

OperationTimeDisk Accesses
SearchO(logdn)O(\log_d n)O(logdn)O(\log_d n)
InsertO(logdn)O(\log_d n)O(logdn)O(\log_d n)
DeleteO(logdn)O(\log_d n)O(logdn)O(\log_d n)
Range queryO(logdn+k)O(\log_d n + k)O(logdn+k)O(\log_d n + k)

Choice of dd: dd is chosen so that one node fits in one disk block.

4.2 Hash Indexes

Static hashing: Use a hash function to map keys to bucket locations.

HASH_INDEX(key):
bucket = hash(key) % num_buckets
return lookup(bucket, key)

Dynamic hashing approaches:

  • Extendible hashing: Directory of pointers to buckets; directory doubles when a bucket overflows.
  • Linear hashing: Buckets split incrementally in a round-robin fashion.
AspectB+ TreeHash Index
Exact matchO(logn)O(\log n)O(1)O(1) avg
Range queryEfficientInefficient
InsertO(logn)O(\log n)O(1)O(1) avg
SpaceVariableVariable
OrderMaintains key orderNo ordering

4.3 Bitmap Indexes

For low-cardinality columns. Store a bitmap per distinct value:

Color: | Red | Blue | Green |
--------------------------
R1 | 1 | 0 | 0 |
R2 | 0 | 1 | 0 |
R3 | 1 | 0 | 0 |
R4 | 0 | 0 | 1 |

Query: AND/OR bitmaps with bitwise operations. Very fast for filtering.

4.4 Composite Indexes

Multi-column B+ tree: Index on (A,B)(A, B) supports queries on AA alone or (A,B)(A, B), but not BB alone.

Covering index: An index that contains all columns needed by a query, avoiding table lookups.

5. Transactions

5.1 ACID Properties

PropertyDescription
AtomicityAll operations succeed or none do (all-or-nothing)
ConsistencyTransaction transforms database from one valid state to another
IsolationConcurrent transactions don’t interfere
DurabilityOnce committed, changes survive crashes

5.2 Isolation Levels

LevelDirty ReadNon-repeatable ReadPhantom Read
Read UncommittedYesYesYes
Read CommittedNoYesYes
Repeatable ReadNoNoYes
SerializableNoNoNo

Anomaly definitions:

  • Dirty read: Reading uncommitted data from another transaction.
  • Non-repeatable read: Same query returns different results within a transaction.
  • Phantom read: New rows appear in a repeated range query.

5.3 Two-Phase Locking (2PL)

Growing phase: Acquire locks, do not release any. Shrinking phase: Release locks, do not acquire any.

TWO_PHASE_LOCKING(transaction):
// Growing phase
for each operation in transaction:
acquire required locks (read/write)
execute operation
// Shrinking phase
release all locks

Variants:

  • Strict 2PL: All locks held until commit/abort. Prevents cascading aborts.
  • Rigorous 2PL: Write locks held until commit. A common practical choice.

5.4 MVCC (Multi-Version Concurrency Control)

Instead of locking, maintain multiple versions of each row:

ROW(id=1, name='Alice', version_start=1, version_end=3)
ROW(id=1, name='Bob', version_start=3, version_end=INF) // current
  • Readers never block writers (read old version).
  • Writers never block readers (write new version).
  • Garbage collection removes old versions no longer visible to any active transaction.

Used by PostgreSQL, MySQL (InnoDB), Oracle.

5.5 Concurrency Control Comparison

MethodLocking OverheadDeadlocksPerformance (read-heavy)Performance (write-heavy)
2PLHighYesModerateGood
MVCCLowNoExcellentModerate
OptimisticLowNoGoodPoor (high conflict)

6. Query Optimization

6.1 Query Processing Pipeline

SQL Query
→ Parsing (AST)
→ Query Rewrite (view expansion, subquery flattening)
→ Logical Plan (relational algebra tree)
→ Physical Plan (join method, access path selection)
→ Execution

6.2 Relational Algebra Equivalences

Used to transform queries for efficiency:

RuleEquivalence
Selection pushdownσc(RS)=σc(R)S\sigma_c(R \bowtie S) = \sigma_c(R) \bowtie S if cc uses only RR
Join commutativityRS=SRR \bowtie S = S \bowtie R
Join associativity(RS)T=R(ST)(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)
Selection cascadeσc1(σc2(R))=σc1c2(R)\sigma_{c1}(\sigma_{c2}(R)) = \sigma_{c1 \wedge c2}(R)
Projection pushdownPush projections as early as possible

6.3 Join Algorithms

Nested-loop join:

for each r in R:
for each s in S:
if r.key == s.key:
output (r, s)

Cost: O(RS)O(|R| \cdot |S|). O(R)O(|R|) memory.

Block nested-loop join:

for each block_b in R (blocks of size M-1):
load block_b into memory
for each block_s in S:
for each r in block_b:
for each s in block_s:
if r.key == s.key: output (r, s)

Cost: O(RMS)O(\frac{|R|}{M} \cdot |S|) where MM = memory pages.

Hash join:

// Build phase
hash_table = {}
for each r in R:
hash_table[hash(r.key)].append(r)
// Probe phase
for each s in S:
for each r in hash_table[hash(s.key)]:
if r.key == s.key:
output (r, s)

Cost: O(R+S)O(|R| + |S|). O(min(R,S))O(\min(|R|, |S|)) memory.

Sort-merge join:

sort R by key
sort S by key
merge R and S on key, outputting matches

Cost: O(RlogR+SlogS)O(|R|\log|R| + |S|\log|S|). Good for range joins and sorted output.

6.4 Cost Estimation

Selectivity: Fraction of rows satisfying a condition.

sel(A=v)=1V(A,R)\text{sel}(A = v) = \frac{1}{|V(A,R)|}

sel(A>v)distinct values above vV(A,R)\text{sel}(A > v) \approx \frac{\text{distinct values above } v}{|V(A,R)|}

sel(A LIKE ’%x%’)13 (heuristic)\text{sel}(A \text{ LIKE '\%x\%'}) \approx \frac{1}{3} \text{ (heuristic)}

Join size estimation:

RSRSmax(V(R.key),V(S.key))|R \bowtie S| \approx \frac{|R| \cdot |S|}{\max(V(R.key), V(S.key))}

6.5 EXPLAIN and Query Plans

EXPLAIN ANALYZE
SELECT s.name, AVG(e.grade_num)
FROM Students s
JOIN Enrollments e ON s.student_id = e.student_id
WHERE s.major = 'CS'
GROUP BY s.name;

The query planner chooses the cheapest plan based on cost estimates.

7. Common Pitfalls

  1. Not normalizing enough. Storing denormalized data leads to update anomalies, insertion anomalies, and deletion anomalies. Aim for at least 3NF.

  2. Over-normalizing. Excessive normalization causes many joins, degrading read performance. Denormalize deliberately when reads dominate and data is rarely updated.

  3. Missing indexes on foreign keys. Joins on unindexed foreign keys cause full table scans. Always index columns used in JOIN, WHERE, and ORDER BY clauses.

  4. Choosing wrong isolation level. Serializable prevents all anomalies but is the slowest. Read Committed is in most cases sufficient for most applications.

  5. N+1 query problem. Executing one query to get a list, then N individual queries for related data. Fix with JOINs or batched fetching.

  6. Ignoring the query optimizer. Writing overly complex SQL when a simpler equivalent would let the optimizer choose a better plan. Always check EXPLAIN output.

  7. Cascading deletes without consideration. Foreign key ON DELETE CASCADE can accidentally delete large amounts of data. Use it carefully, and consider soft deletes for audit trails.

Worked Examples

Example 1: SQL Query with Joins and Aggregation

Problem: Given tables Students(sid, name, major_id) and Majors(mid, name, dept), write a query to find the department with the most students. Solution: SELECT m.dept, COUNT(s.sid) AS num_students FROM Majors m JOIN Students s ON m.mid = s.major_id GROUP BY m.dept ORDER BY num_students DESC LIMIT 1;

Example 2: Normalisation to 3NF

Problem: A relation R(A, B, C, D, E) has functional dependencies: AB -> C, C -> D, D -> E. Normalise to 3NF. Solution: Candidate keys: {A, B}. Partial dependencies: C depends on AB (not partial). No partial dependencies (since AB is the key). Transitive dependencies: C -> D and D -> E violate 3NF (D and E do not depend on the key directly, only through C). Decompose into R1(A, B, C), R2(C, D), R3(D, E). All are in 3NF. Lossless join is guaranteed because R1 intersection R2 = {C} is a candidate key for R2.

Summary

  • The relational model organizes data into relations with keys and foreign keys, manipulated via relational algebra.
  • SQL provides DDL (schema definition), DML (data manipulation), joins, subqueries, aggregation, and window functions.
  • Normalization (1NF → BCNF) eliminates anomalies through functional dependency analysis and decomposition.
  • Indexing (B+ trees, hash indexes, bitmap indexes) accelerates query performance, each suited to different access patterns.
  • Transactions enforce ACID properties via isolation levels, 2PL, and MVCC.
  • Query optimization transforms logical plans using equivalence rules, selects join algorithms, and estimates costs.

Cross-References

TopicLink
Distributed SystemsView
NetworkingView
Operating SystemsView