Algorithms and Data Structures
1. Algorithm Analysis
1.1 Asymptotic Notation
Definition. $f(n) = O(g(n))$ if there exist constants $c > 0$ and $n_0$ such that $f(n) \le c \cdot g(n)$ for all $n \ge n_0$.
Definition. $f(n) = \Omega(g(n))$ if there exist constants $c > 0$ and $n_0$ such that $f(n) \ge c \cdot g(n)$ for all $n \ge n_0$.
Definition. $f(n) = \Theta(g(n))$ if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.
Definition. $f(n) = o(g(n))$ if $\lim_{n \to \infty} f(n)/g(n) = 0$.
Theorem 1.1. $f(n) = O(g(n))$ if and only if $g(n) = \Omega(f(n))$.
Theorem 1.2 (Limit Rule). If $\lim_{n \to \infty} f(n)/g(n) = c$ where $0 < c < \infty$, then $f(n) = \Theta(g(n))$. If the limit is $0$, then $f(n) = o(g(n))$. If the limit is $\infty$, then $f(n) = \omega(g(n))$.
1.2 Common Complexity Classes
| Class | Name | Example |
|---|---|---|
| $O(1)$ | Constant | Array access by index |
| $O(\log n)$ | Logarithmic | Binary search |
| $O(n)$ | Linear | Linear search |
| $O(n \log n)$ | Log-linear | Merge sort, heap sort |
| $O(n^2)$ | Quadratic | Bubble sort, insertion sort |
| $O(n^3)$ | Cubic | Naive matrix multiplication |
| $O(2^n)$ | Exponential | Naive Fibonacci, subset problems |
| $O(n!)$ | Factorial | Generating all permutations |
1.3 Amortised Analysis
Amortised analysis gives the average cost per operation over a sequence of operations, even when individual operations may be expensive.
Methods:
- Aggregate analysis: Compute the total cost $T(n)$ of $n$ operations, then divide by $n$.
- Accounting method: Assign an amortised cost to each operation; the sum of amortised costs must be an upper bound on the total actual cost.
- Potential method: Define a potential function $\Phi$; the amortised cost of the $i$-th operation is $\hat{c}_i = c_i + \Phi(D_i) - \Phi(D_{i-1})$.
Example (Dynamic Array). A dynamic array doubles in size when full. Insertion is amortised $O(1)$: most insertions cost $O(1)$; an occasional resize costs $O(n)$, but is paid for by the surplus from previous insertions.
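The doubling argument can be checked empirically. The sketch below is illustrative (the class name and the `copies` counter are my own, not from the source); it counts every element moved by a resize and confirms the total stays below $2n$:

```python
class DynamicArray:
    """Array that doubles its backing store when full (illustrative sketch)."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity
        self.copies = 0  # total elements moved by all resizes so far

    def append(self, value):
        if self.size == self.capacity:
            # O(n) resize: allocate double the space and copy everything over.
            self.capacity *= 2
            new_data = [None] * self.capacity
            for i in range(self.size):
                new_data[i] = self.data[i]
            self.copies += self.size
            self.data = new_data
        self.data[self.size] = value
        self.size += 1

arr = DynamicArray()
for i in range(1000):
    arr.append(i)
# Copy work is 1 + 2 + 4 + ... + 512 = 1023 < 2 * 1000, so each append is O(1) amortised.
```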
2. Fundamental Data Structures
2.1 Arrays and Linked Lists
Array. Contiguous memory, $O(1)$ access by index, $O(n)$ insertion/deletion (shift required).
Linked List. Each node stores data and a pointer to the next node. $O(1)$ insertion/deletion at the head (given a pointer), $O(n)$ access by position.
Doubly Linked List. Each node has pointers to both next and previous nodes. $O(1)$ insertion and deletion at any position (given pointers to the node and its neighbours).
2.2 Stacks and Queues
Stack. Last-in, first-out (LIFO). Operations: push, pop, peek, all $O(1)$.
Queue. First-in, first-out (FIFO). Operations: enqueue, dequeue, peek, all $O(1)$.
Implementation. Both can be implemented with arrays (with circular buffer) or linked lists.
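In Python, a plain list already gives an $O(1)$ stack, and `collections.deque` gives an $O(1)$ queue (a minimal sketch of the standard-library route, not a from-scratch implementation):

```python
from collections import deque

# Stack: list append/pop at the tail are O(1) amortised.
stack = []
stack.append(1)
stack.append(2)
top = stack.pop()        # LIFO: removes and returns 2

# Queue: deque supports O(1) appends/pops at both ends.
queue = deque()
queue.append(1)
queue.append(2)
front = queue.popleft()  # FIFO: removes and returns 1
```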
2.3 Hash Tables
A hash table maps keys to values using a hash function $h \colon U \to \{0, 1, \dots, m-1\}$.
Collision resolution:
- Chaining: Each bucket is a linked list. Average case: $O(1 + \alpha)$, where $\alpha = n/m$ is the load factor.
- Open addressing (linear probing): Insert into the next available slot. Average case: $O\big(1/(1 - \alpha)\big)$ probes per operation.
Theorem 2.1 (Uniform Hashing). Under the assumption of simple uniform hashing, the expected length of a chain in a hash table with chaining is $\alpha = n/m$.
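A chaining table is short to sketch. The version below (class name, bucket lists, and the 0.75 resize threshold are all illustrative choices, not from the source) keeps $\alpha$ bounded by doubling $m$, so chains stay $O(1)$ expected length:

```python
class ChainedHashTable:
    """Hash table with separate chaining; doubles when the load factor exceeds 0.75."""

    def __init__(self, m=8):
        self.m = m                     # number of buckets
        self.n = 0                     # number of stored pairs
        self.buckets = [[] for _ in range(m)]

    def _bucket(self, key):
        return self.buckets[hash(key) % self.m]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:               # overwrite an existing key in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.n += 1
        if self.n / self.m > 0.75:     # keep alpha bounded -> O(1) expected chains
            self._resize(2 * self.m)

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self, new_m):
        items = [pair for bucket in self.buckets for pair in bucket]
        self.m, self.n = new_m, 0
        self.buckets = [[] for _ in range(new_m)]
        for k, v in items:             # rehash everything into the new buckets
            self.put(k, v)
```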
2.4 Trees
Binary Search Tree (BST). For each node: left subtree values $<$ node value, right subtree values $>$ node value.
- Search, insert, delete: $O(h)$, where $h$ is the height.
- For a balanced BST, $h = O(\log n)$.
AVL Tree. A self-balancing BST where the heights of the left and right subtrees of every node differ by at most 1. All operations: $O(\log n)$ worst case.
Red-Black Tree. A self-balancing BST with colour properties. All operations: $O(\log n)$.
B-Tree. A generalised balanced search tree used in databases and file systems. All leaves are at the same depth. Each internal node of an order-$m$ B-tree has between $\lceil m/2 \rceil$ and $m$ children (the root may have fewer).
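The $O(h)$ bound for unbalanced BST operations is easy to see in code (a minimal sketch without deletion or balancing; helper names are my own):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def bst_insert(root, key):
    """Insert key into an (unbalanced) BST; O(h) for a tree of height h."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    elif key > root.key:
        root.right = bst_insert(root.right, key)
    return root  # duplicate keys are ignored

def bst_search(root, key):
    """Walk one root-to-leaf path, so at most h comparisons."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None
```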
2.5 Heaps
A binary heap is a complete binary tree satisfying the heap property:
- Max-heap: parent $\ge$ children.
- Min-heap: parent $\le$ children.
Implemented as an array (0-indexed): the parent of node $i$ is at $\lfloor (i-1)/2 \rfloor$; its children are at $2i+1$ and $2i+2$.
Operations: insert $O(\log n)$, extract-max/min $O(\log n)$, peek $O(1)$.
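Python's `heapq` module maintains exactly this array layout as a min-heap in a plain list, so the index arithmetic can be verified directly:

```python
import heapq

h = []
for x in [5, 1, 4, 2, 3]:
    heapq.heappush(h, x)    # O(log n) sift-up per insert

# Heap property under the 0-indexed layout: h[(i-1)//2] <= h[i] for every i > 0.
assert all(h[(i - 1) // 2] <= h[i] for i in range(1, len(h)))

smallest = heapq.heappop(h)  # O(log n) extract-min
```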
2.6 Graphs
A graph can be represented by:
- Adjacency matrix: $A[i][j] = 1$ if $(i, j) \in E$. Space: $O(V^2)$. Edge lookup: $O(1)$.
- Adjacency list: For each vertex, store a list of neighbours. Space: $O(V + E)$. Iterating over the neighbours of $v$: $O(\deg(v))$.
3. Sorting Algorithms
3.1 Merge Sort
Algorithm. Divide the array in half, recursively sort each half, then merge.
Theorem 3.1. Merge sort runs in $\Theta(n \log n)$ time in all cases (best, average, worst).
Proof. The recurrence is $T(n) = 2T(n/2) + \Theta(n)$. By the Master theorem (case 2): $a = 2$, $b = 2$, $f(n) = \Theta(n) = \Theta(n^{\log_b a})$, so $T(n) = \Theta(n \log n)$.
Theorem 3.2. Merge sort is stable and requires $O(n)$ auxiliary space.
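A short sketch of the divide-merge structure (returning a new list rather than sorting in place):

```python
def merge_sort(a):
    """Stable merge sort; O(n log n) time, O(n) auxiliary space."""
    if len(a) <= 1:
        return a[:]
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    # Merge the two sorted halves in linear time.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:   # <= keeps equal elements in order (stability)
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```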
3.2 Quicksort
Algorithm. Choose a pivot, partition the array into elements $\le$ pivot and $>$ pivot, recursively sort each partition.
Theorem 3.3. Quicksort runs in $O(n \log n)$ expected time and $O(n^2)$ worst-case time.
Proof (expected case). If the pivot is chosen uniformly at random, the expected number of comparisons is $O(n \log n)$ by an indicator random variable argument.
The worst case occurs when the pivot is always the smallest or largest element (e.g., an already sorted array with a first-element pivot): $O(n^2)$.
Theorem 3.4. Quicksort is not stable in general but uses $O(\log n)$ expected stack space.
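A sketch of in-place quicksort with the random pivot assumed by Theorem 3.3 (I use the Lomuto partition scheme here for brevity; the source does not specify one):

```python
import random

def quicksort(a, lo=0, hi=None):
    """In-place quicksort with a uniformly random pivot: expected O(n log n)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = random.randint(lo, hi)      # random pivot avoids the sorted-input worst case
    a[p], a[hi] = a[hi], a[p]
    pivot, i = a[hi], lo
    for j in range(lo, hi):         # Lomuto partition: a[lo:i] holds elements <= pivot
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]       # pivot lands at its final position i
    quicksort(a, lo, i - 1)
    quicksort(a, i + 1, hi)
```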
3.3 Heapsort
Algorithm. Build a max-heap in $O(n)$ time, then repeatedly extract the maximum.
Theorem 3.5. Heapsort runs in $O(n \log n)$ time in all cases, uses $O(1)$ auxiliary space, and is not stable.
3.4 Lower Bound on Comparison Sorting
Theorem 3.6. Any comparison-based sorting algorithm requires $\Omega(n \log n)$ comparisons in the worst case.
Proof. A comparison-based sort can be modelled as a decision tree. The tree must have at least $n!$ leaves (one for each permutation). A binary tree with $n!$ leaves has height at least $\log_2(n!) = \Omega(n \log n)$.
The lower bound applies only to comparison-based sorting. Non-comparison sorts like radix sort can achieve $O(n)$ time for integers in a bounded range.
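As an illustration of sorting without comparisons, here is counting sort, the usual building block of radix sort (a sketch for non-negative integers below a known bound $k$; the function name is my own):

```python
def counting_sort(a, k):
    """Sort non-negative integers < k in O(n + k) time, with no element comparisons."""
    counts = [0] * k
    for x in a:                 # tally how often each value occurs
        counts[x] += 1
    out = []
    for value, c in enumerate(counts):
        out.extend([value] * c)  # emit each value as many times as it was seen
    return out
```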
4. Graph Algorithms
4.1 Breadth-First Search (BFS)
BFS explores the graph level by level from a source vertex.
Theorem 4.1. BFS runs in $O(V + E)$ time and discovers shortest paths (in terms of number of edges) from the source to all reachable vertices.
Proof. BFS uses a queue. When a vertex is dequeued, all its undiscovered neighbours are enqueued. Since vertices are processed in FIFO order, the first time a vertex is discovered is via the shortest path.
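The queue-based argument above translates directly into code (a sketch assuming an adjacency-list dict; names are illustrative):

```python
from collections import deque

def bfs_distances(adj, source):
    """Edge-count shortest distances from source; O(V + E) with a FIFO queue."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first discovery is via a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```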
4.2 Depth-First Search (DFS)
DFS explores as far as possible along each branch before backtracking.
Theorem 4.2. DFS runs in $O(V + E)$ time and can be used to detect cycles, find connected components, compute topological orderings, and identify strongly connected components.
4.3 Dijkstra's Algorithm
Problem. Find shortest paths from a single source in a weighted graph with non-negative edge weights.
Algorithm. Maintain a priority queue of vertices keyed by their current shortest-path estimate. Repeatedly extract the vertex with minimum distance and relax its edges.
Theorem 4.3. Dijkstra's algorithm with a binary heap runs in $O((V + E) \log V)$ time. With a Fibonacci heap: $O(V \log V + E)$.
Proof. Each vertex is extracted from the priority queue at most once: $O(V \log V)$. Each edge is relaxed at most once, at $O(\log V)$ per heap update: $O(E \log V)$. Total: $O((V + E) \log V)$.
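A binary-heap sketch using `heapq` (since `heapq` has no decrease-key, this version pushes duplicate entries and skips stale ones, which preserves the same asymptotic bound):

```python
import heapq

def dijkstra(adj, source):
    """Shortest-path distances for non-negative weights; O((V + E) log V).

    adj maps each vertex to a list of (neighbour, weight) pairs.
    """
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                        # stale entry: u was already settled
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))  # lazy insertion instead of decrease-key
    return dist
```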
4.4 Bellman-Ford Algorithm
Problem. Find shortest paths from a single source, allowing negative edge weights (but no negative cycles).
Theorem 4.4. Bellman-Ford runs in $O(VE)$ time. It correctly detects negative-weight cycles.
Proof. After $k$ iterations, all shortest paths using at most $k$ edges are found. After $V - 1$ iterations, all shortest paths (which use at most $V - 1$ edges) are correct. A $V$-th iteration that updates any distance indicates a negative cycle.
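The $V - 1$ relaxation rounds plus the extra detection round can be sketched as (edge-list representation and the `None`-on-cycle convention are my own choices):

```python
def bellman_ford(edges, n, source):
    """Vertices are 0..n-1; edges is a list of (u, v, w).

    Returns the distance list, or None if a negative cycle is reachable.
    """
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0
    for _ in range(n - 1):            # n-1 rounds: shortest paths use <= n-1 edges
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:             # an n-th improving round implies a negative cycle
        if dist[u] + w < dist[v]:
            return None
    return dist
```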
4.5 Floyd-Warshall Algorithm
Problem. Find all-pairs shortest paths.
Algorithm. For $k = 1, \dots, V$: for each pair $(i, j)$, check if going through vertex $k$ improves the path, i.e. whether $d[i][k] + d[k][j] < d[i][j]$.
Theorem 4.5. Floyd-Warshall runs in $O(V^3)$ time and $O(V^2)$ space.
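The triple loop is the whole algorithm (a sketch over a distance matrix with `float('inf')` for missing edges and 0 on the diagonal):

```python
def floyd_warshall(dist):
    """All-pairs shortest paths on an n x n distance matrix, updated in place; O(V^3)."""
    n = len(dist)
    for k in range(n):                 # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```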
4.6 Minimum Spanning Tree
Kruskal's Algorithm. Sort edges by weight, add edges to the MST as long as they don't create a cycle (using Union-Find). $O(E \log E)$.
Prim's Algorithm. Start from any vertex, repeatedly add the minimum-weight edge connecting the tree to a non-tree vertex (using a priority queue). $O(E \log V)$ with a binary heap.
Theorem 4.6 (Cut Property). For any cut of the graph, the minimum-weight edge crossing the cut belongs to some MST.
Theorem 4.7 (Cycle Property). For any cycle, if the maximum-weight edge on the cycle is unique, it does not belong to any MST.
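Kruskal's algorithm is short to sketch once Union-Find is in hand (this version uses path compression only and returns just the MST weight; the representation of edges as `(w, u, v)` tuples is my own):

```python
def kruskal(n, edges):
    """MST weight for vertices 0..n-1; edges is a list of (w, u, v). O(E log E)."""
    parent = list(range(n))

    def find(x):                      # Union-Find root with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0
    for w, u, v in sorted(edges):     # consider edges in increasing weight order
        ru, rv = find(u), find(v)
        if ru != rv:                  # skip any edge that would close a cycle
            parent[ru] = rv
            total += w
    return total
```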
5. Dynamic Programming
5.1 Principles
Dynamic programming (DP) solves problems by:
- Overlapping subproblems: The same subproblems are solved repeatedly.
- Optimal substructure: The optimal solution contains optimal solutions to subproblems.
Approaches:
- Top-down (memoisation): Recursive with caching.
- Bottom-up (tabulation): Fill a table iteratively from small subproblems to large.
5.2 Common Patterns
1D DP. $dp[i]$ depends on $dp[j]$ for $j < i$. Example: Fibonacci, longest increasing subsequence.
2D DP. $dp[i][j]$ depends on $dp[i'][j']$ for $(i', j')$ in some set of previously computed cells. Example: edit distance, matrix chain multiplication, longest common subsequence.
Interval DP. $dp[i][j]$ depends on $dp[i][k]$ and $dp[k+1][j]$, where $i \le k < j$. Example: optimal BST, matrix chain multiplication.
5.3 Worked Example: Longest Common Subsequence
Problem. Given sequences $X = x_1 \dots x_m$ and $Y = y_1 \dots y_n$, find the LCS.
Recurrence:
$$dp[i][j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ dp[i-1][j-1] + 1 & \text{if } x_i = y_j \\ \max(dp[i-1][j],\ dp[i][j-1]) & \text{otherwise} \end{cases}$$
Time: $O(mn)$. Space: $O(mn)$ (can be reduced to $O(\min(m, n))$ for the length only).
Proof of correctness. If $x_i = y_j$, some LCS of $x_1 \dots x_i$ and $y_1 \dots y_j$ ends with this common symbol, so $dp[i][j] = dp[i-1][j-1] + 1$. If $x_i \ne y_j$, the LCS either excludes $x_i$ or excludes $y_j$, giving the max of the two subproblems.
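The recurrence with the $O(\min(m, n))$ space reduction can be sketched as follows (rolling a single previous row; the function name is my own):

```python
def lcs_length(x, y):
    """LCS length via the standard recurrence; O(mn) time, O(min(m, n)) space."""
    if len(y) > len(x):
        x, y = y, x                      # keep the rolling row short
    prev = [0] * (len(y) + 1)            # dp row for the previous prefix of x
    for xi in x:
        curr = [0] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr[j] = prev[j - 1] + 1            # extend the diagonal match
            else:
                curr[j] = max(prev[j], curr[j - 1])  # drop one symbol or the other
        prev = curr
    return prev[-1]
```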
6. NP-Completeness
6.1 P, NP, and NP-Completeness
P: The class of decision problems solvable in polynomial time by a deterministic Turing machine.
NP: The class of decision problems solvable in polynomial time by a non-deterministic Turing machine. Equivalently, problems whose "yes" instances have polynomial-time verifiable certificates.
NP-hard: A problem $X$ is NP-hard if every problem in NP can be reduced to $X$ in polynomial time.
NP-complete: A problem is NP-complete if it is in NP and NP-hard.
Theorem 6.1. If any NP-complete problem is in P, then P = NP.
6.2 Polynomial-Time Reductions
A polynomial-time reduction from problem $A$ to problem $B$, written $A \le_p B$, is a polynomial-time algorithm that transforms instances of $A$ into instances of $B$, preserving the answer.
Lemma 6.1. If $A \le_p B$ and $B \in \text{P}$, then $A \in \text{P}$.
Lemma 6.2. If $A \le_p B$ and $A$ is NP-hard, then $B$ is NP-hard.
6.3 Classic NP-Complete Problems
SAT (Cook-Levin Theorem, 1971). Given a Boolean formula in CNF, is there a satisfying assignment?
3-SAT. SAT restricted to clauses with exactly 3 literals.
Vertex Cover. Given a graph $G$ and an integer $k$, is there a vertex cover of size $\le k$?
Travelling Salesman Problem (decision version). Given a weighted graph and a bound $B$, is there a tour of total weight $\le B$?
Subset Sum. Given a set of integers and a target $t$, is there a subset summing to $t$?
Clique. Given a graph $G$ and an integer $k$, does $G$ contain a clique of size $\ge k$?
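NP-completeness rules out known polynomial algorithms in the input size, but not pseudo-polynomial ones: Subset Sum above has a classic DP running in $O(n \cdot t)$, polynomial in the *value* of the target rather than its bit length (a sketch for non-negative integers; the set-based formulation is my own):

```python
def subset_sum(nums, target):
    """Pseudo-polynomial DP for Subset Sum: O(n * target) for non-negative ints."""
    reachable = {0}                    # sums achievable with some subset seen so far
    for x in nums:
        # Either skip x (keep old sums) or take it (shift old sums by x).
        reachable |= {s + x for s in reachable if s + x <= target}
    return target in reachable
```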
6.4 Proof Strategy for NP-Completeness
To prove a problem $X$ is NP-complete:
- Show $X \in \text{NP}$ (polynomial-time verifiable certificate).
- Show a known NP-complete problem $Y$ reduces to $X$: $Y \le_p X$.
Example. 3-SAT $\le_p$ Vertex Cover: construct a graph from the 3-SAT formula where each variable and each clause becomes a small gadget of vertices, and edges enforce the constraint that a satisfying assignment corresponds to a vertex cover.
NP-hardness does not mean the problem is unsolvable. It means there is no known polynomial-time algorithm. Many NP-complete problems have efficient approximation algorithms or can be solved exactly for practical input sizes using branch-and-bound or SAT solvers.