Fundamental Data Structures
2.1 Arrays and Linked Lists
Array. Contiguous memory, access by index, insertion/deletion (shift required).
| Operation | Array | Singly Linked List | Doubly Linked List |
|---|---|---|---|
| Access by index | |||
| Search (by value) | |||
| Insert at head | |||
| Insert at tail | * | ||
| Delete at head | |||
| Delete at tail | * | ||
| Insert at position | * | ||
| Space overhead |
*Given a pointer to the node or its predecessor.
Linked List. Each node stores data and a pointer to the next node. insertion/deletion at Head (given pointer), access by position.
Doubly Linked List. Each node has pointers to both next and previous nodes. insertion And deletion at any position (given pointers to the node and its neighbours).
Worked Example: Reversing a Singly Linked List
To reverse a singly linked list in time and space:
prev = nullcurr = headwhile curr != null: next = curr.next curr.next = prev prev = curr curr = nexthead = prevAt each iteration, we move one node to the front of the reversed list. After iterations, all nodes are reversed.
2.2 Stacks and Queues
Stack. Last-in, first-out (LIFO). Operations: push, pop, peek, all .
Queue. First-in, first-out (FIFO). Operations: enqueue, dequeue, peek, all .
Implementation. Both can be implemented with arrays (with circular buffer) or linked lists.
Worked Example: Implementing a Queue with Two Stacks
A queue can be implemented using two stacks in_stack and out_stack.
enqueue(x): push ontoin_stack. .dequeue(): ifout_stackis empty, pop all elements fromin_stackand push ontoout_stack. Then pop fromout_stack.
The dequeue operation is amortised: each element is moved from in_stack to out_stack at most once across all operations. Over operations, the total number of element moves is at most Giving amortised per dequeue.
2.3 Hash Tables
A hash table maps keys to values using a hash function .
Collision resolution:
- Chaining: Each bucket is a linked list. Average case: where is the load factor.
- Open addressing (linear probing): Insert into the next available slot. Average case: .
Theorem 2.1 (Uniform Hashing). Under the assumption of simple uniform hashing, the expected Length of a chain in a hash table with chaining is .
Proof. Under simple uniform hashing, each of the keys is equally likely to hash to any of the slots. For any given key The expected number of other keys that hash to the same slot as is . Including itself, the expected chain length is .
Theorem 2.2 (Successful Search with Chaining). Under simple uniform hashing, the expected time for a successful search in a hash table with chaining is .
Proof. The expected length of the list containing the searched element is Since other elements are distributed uniformly. On average, the searched element is halfway through its list, giving comparisons (we examine elements before the target plus the target itself). This equals .
Double Hashing. Uses two hash functions: Where is relatively prime to . This avoids the clustering problems of linear probing.
Worked Example: Choosing a Good Hash Function
For integer keys, a common choice is the multiplication method: where (the inverse golden ratio).
For string keys, the polynomial rolling hash: where is a prime (e.g., 31 or 127).
Both avoid clustering better than simple modulo hashing when the input distribution is adversarial.
2.4 Trees
Binary Search Tree (BST). For each node: left subtree values node value, right subtree values node value.
- Search, insert, delete: where is the height.
- For a balanced BST, .
Theorem 2.3. The expected height of a randomly built BST with distinct keys is .
Proof. Let be the height of a BST built from random keys. Let . Then . Using the indicator random variable technique, for some constant . Taking logs gives by Jensen”s inequality.
2.4.1 AVL Trees
Definition. An AVL tree is a BST where for every node, the heights of its left and right subtrees differ by at most 1. The balance factor of a node is the height of its left subtree minus the height of its right subtree; valid balance factors are .
Rotations. After insertion or deletion, the balance factor may become . This is fixed by rotations:
Right rotation (LL case): Balance factor at node Balance factor at left child .
A B/ \ / \B C → D A/ \ / \D E E CLeft rotation (RR case): Mirror of right rotation.
Left-Right rotation (LR case): Balance factor at Balance factor at left child . First left-rotate Then right-rotate .
Right-Left rotation (RL case): Mirror of LR case.
Theorem 2.4. An AVL tree with nodes has height .
Proof. Let be the minimum number of nodes in an AVL tree of height . We have , And for . This is the Fibonacci recurrence, giving . Using where We get So .
Corollary 2.5. All AVL tree operations (search, insert, delete) run in time.
Proof. Each operation visits nodes along a root-to-leaf path, and each rotation is . Since at most rotations are needed (at most 2 per level), the total cost is .
2.4.2 Red-Black Trees
Definition. A red-black tree is a BST satisfying:
- Every node is either red or black.
- The root is black.
- Every leaf (NIL sentinel) is black.
- If a node is red, both its children are black.
- For every node, all simple paths from the node to descendant leaves contain the same number of black nodes (the black-height).
Theorem 2.6. A red-black tree with internal nodes has height .
Proof. Let be the root. By property 5, at least half the nodes on any root-to-leaf path are black (by property 4, no two reds are adjacent), so the black-height . Let be the minimum number of internal nodes in a subtree of height and black-height . Then (proved by induction on Using the fact that each child has black-height at least and the root is black). Since and We get .
Insertion. Insert as in a standard BST (colour the new node red), then fix violations by recolouring and rotating up to times.
Deletion. More complex than insertion. When a black node is removed, we may need to fix the black-height property by recolouring and rotating. The fix-up procedure takes time.
2.4.3 B-Trees
Definition. A B-tree of minimum degree is a rooted tree satisfying:
- Every node has at most keys.
- Every non-root node has at least keys (hence at least children).
- The root has at least 1 key.
- All leaves appear at the same depth.
Theorem 2.7. A B-tree with keys and minimum degree has height .
Proof. The root has at least 1 key and 2 children. At depth 1, each node has at least keys and children. At depth The number of nodes is at least Each with at least keys. So the total number of keys is at least . Setting gives .
Worked Example: Red-Black Tree Insertion Trace
Insert keys 7, 3, 18, 10, 22, 8, 11, 26 into an initially empty red-black tree.
Insert 7: Root is coloured black.
Insert 3: Left child of 7, coloured red. . Valid (root is black, no red-red violation).
Insert 18: Right child of 7, coloured red. . Valid.
Insert 10: Insert as left child of 18, coloured red. Now 18 has red child 10. Uncle of 10 is 3(R) — Case 1 (uncle is red): recolour uncle 3 black, parent 18 black, grandparent 7 red. Grandparent 7 is root, so recolour 7 black.
Result: with under .
Insert 22: Right child of 18, coloured red. Uncle of 22 is 10(R) — Case 1 again: recolour 10 black, 22 black, 18 red. Parent 18 is not root, grandparent is 7(B). Now 7 has right child 18(R). Check 7’s left child: 3(B). No violation.
Insert 8: Left child of 10, coloured red. Uncle of 8 is 22(R) — Case 1: recolour 8 black, 22 black, 10 red. Now 10(R) is left child of 18(B). Check grandparent 7: right child 18(B) with left child 10(R). No red-red violation.
Insert 11: Right child of 10, coloured red. Now 10(R) has both children red — wait, 10 is red and 11 is red: violation. Uncle of 11 is 8(R) — Case 1: recolour 8 black, 11 black, 10 red. Grandparent 18(B): 10(R) is left child, 22(B) is right child. No violation.
Insert 26: Right child of 22, coloured red. Uncle of 26 is 10(R) — Case 1: recolour 10 black, 26 black, 22 red. Now 22(R) is right child of 18(B). Check left child of 18: 10(B). Valid.
2.5 Skip Lists
A skip list is a probabilistic data structure that provides expected-time search, insertion, and deletion. It consists of multiple levels of linked lists.
Structure:
- Level 0 is a standard linked list containing all elements in sorted order.
- Each element is promoted to level 1 with probability ( ).
- Each level- element is promoted to level independently with probability .
- A sentinel head node links to the first element at every level.
Theorem 2.8. The expected number of levels in a skip list with elements is .
Proof. An element reaches level with probability . The expected number of elements at level is . The highest non-empty level is approximately .
Theorem 2.9. Search, insert, and delete in a skip list take expected time.
Proof. The search path at each level moves right to find an element whose successor at level is greater than the target, then drops to level . At each level, the expected number of rightward steps is at most (a geometric distribution argument). With levels, the total expected work is .
Worked Example: Skip List Search
Search for key 25 in a skip list containing keys with the following level structure (probability ):
Level 3: [HEAD] -------------------------> [31] ---------->Level 2: [HEAD] ---------> [12] -------------------------->Level 1: [HEAD] -> [7] -> [12] -> [19] -> [25] -> [31] -> [42]Level 0: [HEAD] -> [7] -> [12] -> [19] -> [25] -> [31] -> [42]Search starts at HEAD, level 3:
- Move right to 31. Drop to level 2.
- At level 2, move right to 12. Move right — next is NIL. Drop to level 1.
- At level 1, move right to 19. Move right to 25. . Found!
The search examined keys 31, 12, 19, 25 — 4 comparisons across 3 levels.
Worked Example: Skip List Insertion
Insert key 15 into the skip list above.
- Search for insertion position: traverse levels 3, 2, 1 to find that 15 goes between 12 and 19.
- Insert 15 at level 0 (always). Randomly promote: suppose coin flip says promote to level 1. Insert at level 1. Flip again: promote to level 2. Insert at level 2. Flip again: don’t promote. Stop.
Level 3: [HEAD] -------------------------> [31] ---------->Level 2: [HEAD] ---------> [12] -> [15] ------------------>Level 1: [HEAD] -> [7] -> [12] -> [15] -> [19] -> [25] -> [31] -> [42]Level 0: [HEAD] -> [7] -> [12] -> [15] -> [19] -> [25] -> [31] -> [42]Insertion takes expected time (search path + insertions at promoted levels).
2.6 Bloom Filters
A Bloom filter is a space-efficient probabilistic data structure for set membership testing. It may produce false positives but never false negatives.
Structure: A bit array of bits (initially all 0) and independent hash functions .
Operations:
- Insert : set bits to 1.
- Query : return true if all are set to 1.
Theorem 2.10. After inserting elements into a Bloom filter of size with hash functions, the probability of a false positive is approximately .
Proof. After inserting elements, each of the bits is set to 0 with probability . The probability that all bits for a query element are set is .
Optimal number of hash functions. The false positive rate is minimised when Giving a minimum rate of approximately .
Worked Example: Bloom Filter Configuration
We want to store URLs with a false positive rate of at most 1%.
From the formula: .
Optimal . At the optimal point, the false positive rate is .
We need So .
Take bits KB. Then So use hash functions.
This requires only 11.5 KB of memory versus approximately 600 KB for storing the URLs as 60-byte strings.
Worked Example: Bloom Filter Operations Trace
A Bloom filter has bits, hash functions: , .
Insert 15: , . Set bit 5. Bit array: 0000010000
Insert 22: , . Set bits 2, 6. Bit array: 0010010100
Insert 37: , . Set bits 1, 7. Bit array: 0110010110
Query 22: (set), (set). Result: possibly in set (true positive).
Query 42: (set), (set). Result: possibly in set (false positive!).
Query 50: (not set). Result: definitely not in set (true negative).
2.7 Heaps
A binary heap is a complete binary tree satisfying the heap property:
- Max-heap: parent children.
- Min-heap: parent children.
Implemented as an array: parent of node is at ; children at and .
Operations: insert Extract-max/min Peek .
Theorem 2.11 (Build-Heap). buildHeap on an array of elements runs in time.
Proof. buildHeap calls heapify on each of the non-leaf nodes. The cost of heapify at a node of height is . The total cost is Since .
Worked Example: Build-Heap Step by Step
Build a max-heap from the array .
The array represents the tree:
4 / \ 10 3 / \ / \ 5 1 8 2Non-leaf nodes (indices down to 0): nodes 3, 10, 4.
Heapify index 2 (value 3): Children are 8 and 2. Swap 3 with 8. Array: .
4 / \ 10 8 / \ / \ 5 1 3 2Heapify index 1 (value 10): Children are 5 and 1. 10 is largest. No swap.
Heapify index 0 (value 4): Children are 10 and 8. Swap with 10. Array: .
10 / \ 4 8 / \ / \ 5 1 3 2Now heapify index 1 (value 4): Children are 5 and 1. Swap with 5. Array: .
10 / \ 5 8 / \ / \ 4 1 3 2This is a valid max-heap. Total operations: 3 swaps = .
2.8 Union-Find (Disjoint Set Union)
The Union-Find data structure maintains a partition of a set into disjoint subsets, supporting:
- MakeSet(): Create a singleton set .
- Find(): Return the representative of the set containing .
- Union(): Merge the sets containing and .
Implementation with path compression and union by rank:
- Each set is represented as a rooted tree.
Findfollows parent pointers and compresses the path (makes every node on the path point directly to the root).Unionattaches the shorter tree under the root of the taller tree (by rank/size).
Theorem 2.12. With both path compression and union by rank, the amortised time per operation is Where is the inverse Ackermann function.
Proof (outline). The inverse Ackermann function grows so slowly that for all practical values of (). The …/1-number-and-algebra/3proof-and-logic uses a potential function argument on the levels of the tree. Each node is assigned a _level based on its rank. The key insight is that path compression prevents the number of nodes at high levels from growing. The total number of Find operations that can charge to any level- node is bounded by where is a fast-growing function. Summing over all levels gives total, hence amortised per operation.
Worked Example: Union-Find for Connected Components
Given an undirected graph with vertices and edges, determine the connected components.
for each vertex v: MakeSet(v)for each edge (u, v): if Find(u) != Find(v): Union(u, v)After processing all edges, vertices in the same connected component have the same representative. Total time: .
2.9 Graphs
A graph can be represented by:
- Adjacency matrix: if . Space: . Edge lookup: .
- Adjacency list: For each vertex, store a list of neighbours. Space: . Iterating over neighbours: .
| Operation | Adjacency Matrix | Adjacency List |
|---|---|---|
| Space | ||
| Check edge | ||
| Iterate neighbours | ||
| Add edge | ||
| Remove edge |
:::caution Common Pitfall Choosing the wrong graph representation can make an algorithm asymptotically slower. Use adjacency matrices for dense graphs () and adjacency lists for sparse graphs (). For example, BFS with an adjacency matrix takes but with adjacency lists takes .
:::