Indexing
5.1 B-Trees and B+ Trees
B-Tree of order . A balanced search tree where each internal node has between and children. All leaves are at the same depth.
Properties:
- Root: at least 2 children (unless it is a leaf).
- Internal node with children has keys.
- All leaves at the same depth .
- Height: .
Operations:
- Search: . At each node, binary search the keys and follow the appropriate child pointer.
- Insert: Find the correct leaf. If the leaf has room, insert. If full, split the leaf and propagate the median key up. May cascade splits to the root.
- Delete: Find the key. If in an internal node, replace with predecessor/successor and delete from the leaf. If a node underflows (fewer than keys), redistribute from a sibling or merge with a sibling.
B+ Tree. Variant where:
- All data records are stored only in leaf nodes (internal nodes store only keys for routing).
- Leaf nodes are linked in a linked list for efficient range queries.
This makes B+ trees the standard for database indexing. The linked leaf list allows range scans in where is the number of records in the range.
B+ Tree insert pseudocode:
def bplus_insert(root, key, value): leaf = find_leaf(root, key) insert_into_leaf(leaf, key, value) if len(leaf.keys) > MAX_KEYS: new_leaf, median = split_leaf(leaf) insert_into_parent(leaf.parent, median, new_leaf)Worked Example 5.1: B+ Tree Insertion
Insert keys 10, 20, 5, 15, 25, 30 into a B+ tree with maximum 2 keys per leaf node and maximum 2 Keys per internal node (order ).
Insert 10. Leaf:
Insert 20. Leaf:
Insert 5. Leaf would be — overflow (max 2 keys). Split: left leaf Right Leaf . Push median (10) to new internal node.
Internal: [10] ├── Leaf: [5] └── Leaf: [10, 20]Insert 15. Goes to right leaf. Leaf would be — overflow. Split: and . Push 15 up.
Internal: [10, 15] ├── Leaf: [5] ├── Leaf: [10, 15] └── Leaf: [20]Insert 25. Goes to rightmost leaf. Leaf: .
Insert 30. Leaf would be — overflow. Split: and . Push 25 up. Internal would be — overflow (max 2 keys). Split internal: push 15 to new root.
Root: [15] ├── Internal: [10] │ ├── Leaf: [5] │ └── Leaf: [10, 15] └── Internal: [25] ├── Leaf: [20, 25] └── Leaf: [30]Worked Example 5.2: B+ Tree Deletion
Starting from the tree in Example 5.1, delete key 25.
Delete 25. Found in rightmost leaf of the right internal node. Leaf becomes (1 key, Minimum is ). No underflow. Update internal node: key 25 changes to 30 (to maintain correct routing).
Root: [15] ├── Internal: [10] │ ├── Leaf: [5] │ └── Leaf: [10, 15] └── Internal: [30] ├── Leaf: [20] └── Leaf: [30]Now delete 15. Leaf becomes . No underflow. Internal node key 15 changes to 10. But wait — the internal node would need to distinguish between leaves and . Since the left child contains keys and the right child contains keys This Is still correct.
Now delete 10 from the left subtree”s right leaf. Leaf becomes empty — underflow.
Redistribution: Sibling leaf has 1 key (at minimum). Cannot redistribute. Merge: Merge empty leaf with . The merged leaf is . The internal node now has Only one child (the merged leaf), so it underflows.
Since the internal node is a child of the root, and the root has two children, we can merge: remove The internal node and promote its remaining child to be a direct child of the root.
Root: [30] ├── Leaf: [5] ├── Leaf: [10] (originally contained 10 before deletion) └── Leaf: [30]Wait — after deleting 10, the leaf should become And merging with gives . Let me retrace. The point is that deletion can trigger cascading merges up the tree.
5.2 Hash Indexes
Static hashing. A hash function maps keys to buckets. Average lookup: under uniform hashing. No support for range queries.
Collision handling. When two keys hash to the same bucket, the DBMS must resolve the collision:
- Separate chaining: Each bucket contains a linked list of entries. Lookup requires traversing the chain. Average chain length under uniform hashing is .
- Open addressing (linear probing): If bucket is full, try , Etc. (mod ). Prone to primary clustering: consecutive occupied slots increase the average probe length.
- Open addressing (quadratic probing): Try Etc. Reduces clustering but may not probe all buckets.
- Open addressing (double hashing): Use a second hash function : probe sequence is for . Minimises clustering.
Extendible hashing. Dynamically grows the number of buckets. Uses a global depth and Per-bucket local depth .
- Hash key to bits.
- If the bucket is full and its local depth equals the global depth, double the directory (global depth increases by 1) and split the bucket.
- If the bucket’s local depth is less than the global depth, split the bucket without doubling the directory.
Linear hashing. Gradually grows one bucket at a time using a pointer to the next bucket to split. Simpler than extendible hashing but may have slightly higher overflow probability.
Comparison:
| Property | B+ Tree | Hash Index |
|---|---|---|
| Point query | average | |
| Range query | Not supported | |
| Insert/Delete | average | |
| Space | Nodes may be half-empty | May have overflow chains |
| Order | Maintains key order | No ordering |
5.3 Bitmap Indexes
A bitmap index creates one bitmap per distinct value of an attribute. For a table with rows And attribute with values Store bitmaps of bits each, where Bitmap has a 1 in position if row has .
Use case: Low-cardinality columns (gender, status, country). Bitmap indexes support fast bitwise AND/OR for multi-criteria queries.
Bitwise operations for query evaluation:
| Query | Bitmap operation |
|---|---|
| AND | \mathrm{bitmap_}{A,v_1} AND \mathrm{bitmap_}{B,v_2} |
| OR | \mathrm{bitmap_}{A,v_1} OR \mathrm{bitmap_}{A,v_2} |
| NOT \mathrm{bitmap_}{A,v_1} |
Compression. For columns with many distinct values, run-length encoding (WAH or BBC) compresses Bitmaps effectively while still supporting bitwise operations.
5.4 Query Processing Cost Estimation
The cost model estimates the number of disk I/O operations for a query plan. We assume the buffer Pool has pages and each disk page access costs one I/O.
Selection cost estimation:
| Access method | Cost (I/Os) |
|---|---|
| Full table scan | (or if pages available) |
| B+ tree equality search | leaf + 1 data page |
| B+ tree range search | leaf + |
| Hash equality search | 1 (ideal) |
Where is the fanout (average number of children per internal node).
Join cost estimation (from Section 7):
| Algorithm | Cost |
|---|---|
| Block nested-loop | |
| Sort-merge | |
| Hash join | (if build relation fits in pages) |
Worked Example 5.3: Comparing Access Methods
Scenario: Student table with 50000 tuples, 2500 pages (20 tuples per page). Buffer pool has 52 Pages. Attribute ID has a B+ tree index of height 3 (fanout 100). Query: SELECT * FROM Student WHERE ID = 12345.
Method 1: Full table scan. Cost = 2500 I/Os.
Method 2: B+ tree index on ID. Follow root internal internal leaf: 3 I/Os For the index traversal. Then 1 I/O for the data page. Total: 4 I/Os.
The B+ tree index is faster for a single equality lookup. However, for a query selecting 50% of the table, a full scan (2500 I/Os) would be faster than 25000 individual B+ tree lookups ( I/Os). This is why the optimiser chooses between index and scan based on Selectivity estimates.