Databases
1. Relational Model
1.1 Fundamentals
A relation is a set of tuples over a relation schema where each is an attribute from domain .
Key terms:
- Tuple: A single row in a relation.
- Attribute: A named column in a relation.
- Domain: The set of allowed values for an attribute.
- Degree (arity): Number of attributes.
- Cardinality: Number of tuples.
1.2 Keys
| Key Type | Definition |
|---|---|
| Superkey | A set of attributes that uniquely identifies a tuple |
| Candidate key | A minimal superkey (no proper subset is a superkey) |
| Primary key | A chosen candidate key |
| Foreign key | References the primary key of another relation |
| Composite key | A key consisting of multiple attributes |
1.3 Relational Algebra
Selection (): Filter rows.
Projection (): Select columns.
Join (): Combine relations.
Natural join (): Join on all shared attributes, eliminating duplicates.
Outer joins: Left (), right (), full (): include unmatched tuples with NULLs.
Other operations:
- Union (): Tuples in or (same schema).
- Difference (): Tuples in but not .
- Intersection (): Tuples in both.
- Cartesian product (): All combinations.
- Rename (): Rename relation to .
2. SQL
2.1 DDL (Data Definition Language)
CREATE TABLE Students ( student_id INT PRIMARY KEY, name VARCHAR(100) NOT NULL, gpa DECIMAL(3,2) CHECK (gpa >= 0.0 AND gpa <= 4.0), major VARCHAR(50), enrollment_date DATE DEFAULT CURRENT_DATE);
CREATE TABLE Courses ( course_id INT PRIMARY KEY, title VARCHAR(200) NOT NULL, credits INT CHECK (credits > 0 AND credits <= 6), department VARCHAR(50));
CREATE TABLE Enrollments ( student_id INT REFERENCES Students(student_id), course_id INT REFERENCES Courses(course_id), semester VARCHAR(20), grade CHAR(2), PRIMARY KEY (student_id, course_id, semester));
DROP TABLE Enrollments;ALTER TABLE Students ADD COLUMN email VARCHAR(255) UNIQUE;2.2 DML (Data Manipulation Language)
INSERT INTO Students VALUES (1, "Alice', 3.9, 'CS', '2024-09-01');
UPDATE Students SET gpa = 3.95 WHERE student_id = 1;
DELETE FROM Students WHERE student_id = 1;2.3 Queries
Basic SELECT:
SELECT name, gpa FROM Students WHERE major = 'CS' ORDER BY gpa DESC;Joins:
SELECT s.name, c.title, e.gradeFROM Students sJOIN Enrollments e ON s.student_id = e.student_idJOIN Courses c ON e.course_id = c.course_idWHERE e.semester = 'Fall 2025';Aggregate functions:
SELECT major, AVG(gpa) AS avg_gpa, COUNT(*) AS countFROM StudentsGROUP BY majorHAVING COUNT(*) >= 5;Subqueries:
-- Correlated subquerySELECT name FROM Students sWHERE EXISTS ( SELECT 1 FROM Enrollments e WHERE e.student_id = s.student_id AND e.grade = 'A+');
-- Scalar subquerySELECT name, ( SELECT COUNT(*) FROM Enrollments WHERE Enrollments.student_id = Students.student_id) AS course_countFROM Students;Window functions:
SELECT name, major, gpa, RANK() OVER (PARTITION BY major ORDER BY gpa DESC) AS rank_in_major, SUM(gpa) OVER (PARTITION BY major) AS total_gpaFROM Students;3. Normalization
3.1 Functional Dependencies
A functional dependency holds in relation if for any two tuples in : if , then .
Armstrong’s Axioms:
- Reflexivity: If , then .
- Augmentation: If , then .
- Transitivity: If and , then .
Derived rules:
- Union: and implies .
- Decomposition: implies and .
- Pseudotransitivity: and implies .
3.2 Closure and Minimal Cover
Attribute closure : The set of all attributes functionally determined by .
CLOSURE(X, F): result = X repeat: changed = false for each FD Y -> Z in F: if Y ⊆ result and Z ⊈ result: result = result ∪ Z changed = true until not changed return resultMinimal cover: A minimal set of FDs equivalent to where:
- No FD can be derived from others (irredundant).
- Every FD has a single attribute on the right.
- No attribute on the left can be removed.
3.3 Normal Forms
1NF (First Normal Form): All attributes have atomic (indivisible) values. No repeating groups.
2NF: 1NF and no partial dependencies. Every non-prime attribute is fully dependent on the entire candidate key.
3NF: 2NF and no transitive dependencies. For every non-trivial FD :
BCNF (Boyce-Codd Normal Form): For every non-trivial FD :
Decomposition into BCNF:
DECOMPOSE(R, F): find FD X -> Y that violates BCNF R1 = X ∪ Y R2 = attributes of R minus (Y - X) recursively decompose R1 and R23.4 Normal Form Comparison
| Normal Form | Eliminates | May have |
|---|---|---|
| 1NF | Multivalued attributes | Partial, transitive deps |
| 2NF | Partial dependencies | Transitive dependencies |
| 3NF | Transitive dependencies | BCNF violations |
| BCNF | All FDs not from superkeys | Lossy decomposition possible |
4. Indexing
4.1 B+ Trees
A B+ tree of order is a balanced search tree optimized for disk access:
- Internal nodes: Store up to keys and pointers.
- Leaf nodes: Store keys and values (or pointers to records). All at the same depth.
- Leaf links: Leaves connected in a linked list for range queries.
Structure:
[10 | 20 | 30] / | | \ [1,5,8] [12,15] [22,25] [32,35,40] →→→→→→→→→→→→→→→→→→→→→→→ (linked)Operations:
| Operation | Time | Disk Accesses |
|---|---|---|
| Search | ||
| Insert | ||
| Delete | ||
| Range query |
Choice of : is chosen so that one node fits in one disk block.
4.2 Hash Indexes
Static hashing: Use a hash function to map keys to bucket locations.
HASH_INDEX(key): bucket = hash(key) % num_buckets return lookup(bucket, key)Dynamic hashing approaches:
- Extendible hashing: Directory of pointers to buckets; directory doubles when a bucket overflows.
- Linear hashing: Buckets split incrementally in a round-robin fashion.
| Aspect | B+ Tree | Hash Index |
|---|---|---|
| Exact match | avg | |
| Range query | Efficient | Inefficient |
| Insert | avg | |
| Space | Variable | Variable |
| Order | Maintains key order | No ordering |
4.3 Bitmap Indexes
For low-cardinality columns. Store a bitmap per distinct value:
Color: | Red | Blue | Green |--------------------------R1 | 1 | 0 | 0 |R2 | 0 | 1 | 0 |R3 | 1 | 0 | 0 |R4 | 0 | 0 | 1 |Query: AND/OR bitmaps with bitwise operations. Very fast for filtering.
4.4 Composite Indexes
Multi-column B+ tree: Index on supports queries on alone or , but not alone.
Covering index: An index that contains all columns needed by a query, avoiding table lookups.
5. Transactions
5.1 ACID Properties
| Property | Description |
|---|---|
| Atomicity | All operations succeed or none do (all-or-nothing) |
| Consistency | Transaction transforms database from one valid state to another |
| Isolation | Concurrent transactions don’t interfere |
| Durability | Once committed, changes survive crashes |
5.2 Isolation Levels
| Level | Dirty Read | Non-repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | Yes | Yes | Yes |
| Read Committed | No | Yes | Yes |
| Repeatable Read | No | No | Yes |
| Serializable | No | No | No |
Anomaly definitions:
- Dirty read: Reading uncommitted data from another transaction.
- Non-repeatable read: Same query returns different results within a transaction.
- Phantom read: New rows appear in a repeated range query.
5.3 Two-Phase Locking (2PL)
Growing phase: Acquire locks, do not release any. Shrinking phase: Release locks, do not acquire any.
TWO_PHASE_LOCKING(transaction): // Growing phase for each operation in transaction: acquire required locks (read/write) execute operation
// Shrinking phase release all locksVariants:
- Strict 2PL: All locks held until commit/abort. Prevents cascading aborts.
- Rigorous 2PL: Write locks held until commit. A common practical choice.
5.4 MVCC (Multi-Version Concurrency Control)
Instead of locking, maintain multiple versions of each row:
ROW(id=1, name='Alice', version_start=1, version_end=3)ROW(id=1, name='Bob', version_start=3, version_end=INF) // current- Readers never block writers (read old version).
- Writers never block readers (write new version).
- Garbage collection removes old versions no longer visible to any active transaction.
Used by PostgreSQL, MySQL (InnoDB), Oracle.
5.5 Concurrency Control Comparison
| Method | Locking Overhead | Deadlocks | Performance (read-heavy) | Performance (write-heavy) |
|---|---|---|---|---|
| 2PL | High | Yes | Moderate | Good |
| MVCC | Low | No | Excellent | Moderate |
| Optimistic | Low | No | Good | Poor (high conflict) |
6. Query Optimization
6.1 Query Processing Pipeline
SQL Query → Parsing (AST) → Query Rewrite (view expansion, subquery flattening) → Logical Plan (relational algebra tree) → Physical Plan (join method, access path selection) → Execution6.2 Relational Algebra Equivalences
Used to transform queries for efficiency:
| Rule | Equivalence |
|---|---|
| Selection pushdown | if uses only |
| Join commutativity | |
| Join associativity | |
| Selection cascade | |
| Projection pushdown | Push projections as early as possible |
6.3 Join Algorithms
Nested-loop join:
for each r in R: for each s in S: if r.key == s.key: output (r, s)Cost: . memory.
Block nested-loop join:
for each block_b in R (blocks of size M-1): load block_b into memory for each block_s in S: for each r in block_b: for each s in block_s: if r.key == s.key: output (r, s)Cost: where = memory pages.
Hash join:
// Build phasehash_table = {}for each r in R: hash_table[hash(r.key)].append(r)
// Probe phasefor each s in S: for each r in hash_table[hash(s.key)]: if r.key == s.key: output (r, s)Cost: . memory.
Sort-merge join:
sort R by keysort S by keymerge R and S on key, outputting matchesCost: . Good for range joins and sorted output.
6.4 Cost Estimation
Selectivity: Fraction of rows satisfying a condition.
Join size estimation:
6.5 EXPLAIN and Query Plans
EXPLAIN ANALYZESELECT s.name, AVG(e.grade_num)FROM Students sJOIN Enrollments e ON s.student_id = e.student_idWHERE s.major = 'CS'GROUP BY s.name;The query planner chooses the cheapest plan based on cost estimates.
7. Common Pitfalls
Not normalizing enough. Storing denormalized data leads to update anomalies, insertion anomalies, and deletion anomalies. Aim for at least 3NF.
Over-normalizing. Excessive normalization causes many joins, degrading read performance. Denormalize deliberately when reads dominate and data is rarely updated.
Missing indexes on foreign keys. Joins on unindexed foreign keys cause full table scans. Always index columns used in JOIN, WHERE, and ORDER BY clauses.
Choosing wrong isolation level. Serializable prevents all anomalies but is the slowest. Read Committed is in most cases sufficient for most applications.
N+1 query problem. Executing one query to get a list, then N individual queries for related data. Fix with JOINs or batched fetching.
Ignoring the query optimizer. Writing overly complex SQL when a simpler equivalent would let the optimizer choose a better plan. Always check EXPLAIN output.
Cascading deletes without consideration. Foreign key ON DELETE CASCADE can accidentally delete large amounts of data. Use it carefully, and consider soft deletes for audit trails.
Worked Examples
Example 1: SQL Query with Joins and Aggregation
Problem: Given tables Students(sid, name, major_id) and Majors(mid, name, dept), write a query to find the department with the most students. Solution: SELECT m.dept, COUNT(s.sid) AS num_students FROM Majors m JOIN Students s ON m.mid = s.major_id GROUP BY m.dept ORDER BY num_students DESC LIMIT 1;
Example 2: Normalisation to 3NF
Problem: A relation R(A, B, C, D, E) has functional dependencies: AB -> C, C -> D, D -> E. Normalise to 3NF. Solution: Candidate keys: {A, B}. Partial dependencies: C depends on AB (not partial). No partial dependencies (since AB is the key). Transitive dependencies: C -> D and D -> E violate 3NF (D and E do not depend on the key directly, only through C). Decompose into R1(A, B, C), R2(C, D), R3(D, E). All are in 3NF. Lossless join is guaranteed because R1 intersection R2 = {C} is a candidate key for R2.
Summary
- The relational model organizes data into relations with keys and foreign keys, manipulated via relational algebra.
- SQL provides DDL (schema definition), DML (data manipulation), joins, subqueries, aggregation, and window functions.
- Normalization (1NF → BCNF) eliminates anomalies through functional dependency analysis and decomposition.
- Indexing (B+ trees, hash indexes, bitmap indexes) accelerates query performance, each suited to different access patterns.
- Transactions enforce ACID properties via isolation levels, 2PL, and MVCC.
- Query optimization transforms logical plans using equivalence rules, selects join algorithms, and estimates costs.
Cross-References
| Topic | Link |
|---|---|
| Distributed Systems | View |
| Networking | View |
| Operating Systems | View |