Operating Systems (Advanced)
1. Advanced Process Management
1.1 Process Creation and Fork
The fork() system call creates a new process by duplicating the calling process. The new process (child) is an exact copy of the parent, except for its PID, PPID, and resource …/4-statistics-and-probability/2_statistics.
Copy-on-Write (COW). Modern Unix systems use COW semantics for fork(): the parent”s pages are shared (read-only) between parent and child. A page is copied only when either process writes to it.
Theorem 1.1. With COW, fork() takes time in the common case (no immediate writes), but the worst-case cost of subsequent page faults is where is the number of pages.
Proof. fork() only creates a new page table and copies page table entries (marking pages as read-only). This takes for the page table setup but for actual page copies. Each subsequent write triggers a page fault, copying one page. In the worst case, every page is written, giving total page fault cost.
vfork(). An alternative to fork() that does not copy page tables. The child runs in the parent’s address space and must call exec() or _exit() before returning. Faster than fork() but dangerous if misused.
| Property | fork() + COW | fork() (traditional) | vfork() |
|---|---|---|---|
| Page table copy | Yes (shared, COW) | Yes (full copy) | No |
| Address space | Separate | Separate | Shared with parent |
| Safety | Safe | Safe | Unsafe if child modifies memory |
| Speed | Fast | Slow ( pages) | Very fast |
Worked Example: fork() Process Tree
Consider the following code:
#include <stdio.h>#include <unistd.h>
int main() { fork(); fork(); fork(); printf("Hello\n"); return 0;}How many “Hello” lines are printed?
Each fork() doubles the number of processes. After 3 forks: processes. Each prints “Hello”, so 8 lines are printed.
Process tree (P = parent, C = child):
P0├── C1 (1st fork)│ ├── C3 (2nd fork, from C1)│ │ ├── C7 (3rd fork, from C3)│ │ └── P_C3 (3rd fork parent, from C3)│ └── P_C1 (2nd fork parent, from C1)│ ├── C5 (3rd fork, from P_C1)│ └── P_P_C1 (3rd fork parent, from P_C1)└── P0_c (1st fork parent, from P0) ├── C2 (2nd fork, from P0_c) │ ├── C6 (3rd fork, from C2) │ └── P_C2 (3rd fork parent, from C2) └── P_P0_c (2nd fork parent, from P0_c) ├── C4 (3rd fork, from P_P0_c) └── P_P_P0_c (3rd fork parent, from P_P0_c)1.2 Thread Libraries and Models
Three thread models:
- 1:1 (kernel-level): Each user thread maps to one kernel thread. Examples: Linux pthreads, Windows threads.
- Advantage: True parallelism on multi-core; blocking system calls block only the calling thread.
- Disadvantage: Thread creation is expensive (kernel involvement); limited by kernel resources.
- N:1 (user-level): All user threads map to one kernel thread. Examples: GNU Pth (historical).
- Advantage: Fast creation and switching (no kernel involvement); unlimited threads (within memory).
- Disadvantage: Cannot exploit multi-core; blocking system call blocks all threads.
- M:N (hybrid): user threads map to kernel threads (). Examples: Solaris threads (historical), Go goroutines (with M:N scheduling).
- Advantage: Combines benefits of both; scheduler can adapt.
- Disadvantage: Complex implementation; scheduling decisions are harder.
| Property | 1:1 | N:1 | M:N |
|---|---|---|---|
| Parallelism | Full | None | Partial |
| Creation cost | High | Low | Low |
| Context switch | Kernel | User-space | User-space |
| Blocking I/O | Per-thread | Blocks all | Blocks one LWP |
| Implementation | Kernel | Library | Library + kernel |
POSIX threads (pthreads). The standard API for C/C++ on Unix systems.
Key functions:
| Function | Purpose |
|---|---|
pthread_create | Create a new thread |
pthread_join | Wait for thread termination |
pthread_exit | Terminate the calling thread |
pthread_mutex_lock/unlock | Lock/unlock a mutex |
pthread_cond_wait/signal | Wait/signal a condition variable |
pthread_cancel | Request cancellation of a thread |
1.3 Real-Time Scheduling
Real-time systems require tasks to complete within strict timing constraints.
Hard real-time: Missing a deadline is a system failure (e.g., pacemakers, avionics). Soft real-time: Missing a deadline degrades performance but is not catastrophic (e.g., video streaming, games).
Scheduling algorithms for real-time systems:
Rate-Monotonic Scheduling (RMS). Fixed-priority preemptive scheduling where priority is inversely proportional to the period.
Theorem 1.2 (Liu and Layland, 1973). A set of periodic tasks with utilisation is schedulable under RMS, where is the computation time and is the period of task .
Proof (sufficiency). The critical instant for a task occurs when it is released simultaneously with all higher-priority tasks. The worst-case response time of task is . For tasks with equal utilisation The bound is derived by considering the maximum interference from higher-priority tasks at the critical instant. As This bound approaches .
Utilisation bounds for RMS:
| Bound | |
|---|---|
| 1 | 1.000 |
| 2 | 0.828 |
| 3 | 0.779 |
| 4 | 0.756 |
| 5 | 0.743 |
Earliest Deadline First (EDF). Dynamic-priority preemptive scheduling where the task with the nearest deadline has the highest priority.
Theorem 1.3. EDF can schedule any task set with total utilisation . This is optimal among all preemptive uniprocessor scheduling algorithms.
Proof. If The total demand (computation time) in any interval does not exceed . Since EDF always prioritises the task that must complete soonest, no deadline is missed if the total demand is feasible.
Worked Example: Rate-Monotonic Scheduling
Three periodic tasks:
- Task 1: , ,
- Task 2: , ,
- Task 3: , ,
Total utilisation: .
RMS bound for : .
Since The task set is schedulable by RMS (sufficient condition).
Priority assignment (RMS): Task 1 has highest priority (shortest period ), then Task 2 (), then Task 3 ().
Timeline (first 20 time units):
Time: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20Task1: [==] [==] [==] [==] [==]Task2: [====] [====]Task3: [===] [===]Let me verify more carefully. At : all tasks arrive. Task 1 (highest priority) runs for 2 units (0—2). Task 2 runs for 2 units (2—4). Task 3 runs for 3 units (4—7).
At : Task 1 released again. Preempt Task 3 (at … Wait, Task 3 finishes at ). Task 1 runs at But Task 3 is still running (4—7). Preempt Task 3 at . Task 1 runs 5—7. Task 3 resumes at … But Task 1 already used 5—7. Task 3 had completed 1 unit (4—5), needs 2 more.
Let me redo:
- : T1 arrives, runs 0—2.
- : T2 runs 2—4.
- : T3 runs 4—5.
- : T1 arrives (period 5), preempts T3. T1 runs 5—7.
- : T3 resumes. T3 needs 2 more units (1 done at 4—5). T3 runs 7—9.
- : T2? No, T2 next arrives at . T3 finishes at . Idle 9—10.
- : T1 arrives, runs 10—12. T2 arrives at But T1 has higher priority.
- : T2 runs 12—14.
- : T3? T3 next arrives at . Idle 14—15.
- : T1 arrives, runs 15—17.
- : Idle 17—20.
- : T1, T2, T3 all arrive again.
Deadlines: T1 at 5, 10, 15, 20 — all met. T2 at 10, 20 — all met. T3 at 20 — met.
All deadlines satisfied.
:::caution Common Pitfall The RMS utilisation bound is a sufficient condition, not necessary. A task set with may still be schedulable under RMS. The response time analysis (RTA) provides a necessary and sufficient test but is more complex to compute. :::
2. Advanced Memory Management
2.1 Multi-Level Page Tables
A multi-level page table reduces the memory required to store the page table for sparse address spaces.
Two-level page table. The virtual address is split into three parts:
| Field | Bits (32-bit) | Purpose |
|---|---|---|
| Outer (p1) | 10 bits | Index into page directory |
| Inner (p2) | 10 bits | Index into page table |
| Offset | 12 bits | Offset within page |
Theorem 2.1. A two-level page table for a 32-bit address space with 4 KB pages requires at most 4 MB of page table memory (1 page directory + 1024 page tables), compared to 4 MB for a flat page table. For a sparse address space using only pages, the two-level table uses KB.
Proof. The page directory is one page (4 KB, holding 1024 entries of 4 bytes each). Each page table is also one page. With page tables actually used (one per 4 MB region that has any mapped pages), the total memory is KB. A flat page table always uses bytes = 4 MB.
Four-level page tables (x86-64). 64-bit virtual addresses use 4 levels of page tables (PML4, PDP, PD, PT) with 9 bits per level and 12 bits of offset, supporting 48-bit virtual addresses ( TB).
2.2 Inverted Page Tables
An inverted page table stores one entry per physical frame (rather than per virtual page). Each entry maps a frame to the (process ID, virtual page) pair that occupies it.
Advantages: Memory proportional to physical memory size, not virtual address space size. Critical for 64-bit systems.
Disadvantages: Searching for a virtual page requires hashing the (PID, VPN) pair. TLBs mitigate this cost.
2.3 Demand Paging Analysis
Under demand paging with a local replacement policy (each process gets a fixed number of frames), the working set model defines the set of pages a process needs within a time window :
Theorem 2.2 (Working Set Theorem). A process with working set size needs at least frames to avoid thrashing. If allocated fewer frames, the page fault rate increases dramatically.
2.4 Thrashing and Load Control
Thrashing occurs when the system spends more time paging than executing useful work.
Detection. A process is thrashing if its page fault rate exceeds a threshold (e.g., measured by the ratio of productive CPU time to page fault handling time).
Prevention strategies:
- Working set model: Admit a process only if there are enough free frames for its working set.
- Page fault frequency (PFF): Dynamically adjust frames based on page fault rate. If too high, allocate more frames. If too low, take some away.
- L=S criterion: The locality model ensures (locality size) is sufficient for (degree of multiprogramming).
Worked Example: Multi-Level Page Table Lookup
A 32-bit system with 4 KB pages and a two-level page table. Virtual address: 0x2A4B0C1C.
Break down the address:
- Binary: 0010 1010 0100 1011 0000 1100 0001 1100
- p1 (bits 31—22): 0010 1010 10 = 0x0AA = 170
- p2 (bits 21—12): 10 0100 1011 = 0x12B = 299
- offset (bits 11—0): 0000 1100 0001 1100 = 0x0C1C = 3100
Steps:
- Read CR3 to find the page directory base address.
- Read page directory entry at index 170. If present bit is 0, page fault. Otherwise, get the page table base address.
- Read page table entry at index 299. If present bit is 0, page fault. Otherwise, get the physical frame number.
- Compute physical address: frame number || offset.
If page directory entry 170 gives frame 0x3F and page table entry 299 gives frame 0x1A5: Physical address = 0x1A5 << 12 | 0x0C1C = 0x1A50C1C.
TLB hit: If the TLB caches the mapping for this virtual page, steps 1—3 are skipped, reducing the translation to one memory access.
3. Advanced Synchronisation
3.1 Futexes
A futex (fast userspace mutex) is a Linux synchronisation primitive that avoids kernel transitions in the non-contended case.
Mechanism:
- Try to atomically decrement the futex value (userspace, using
cmpxchg). - If the value was 1 (unlocked), the lock is acquired. No kernel call.
- If the value was 0 (contended), call
futex(FUTEX_WAIT)to sleep in the kernel. - When releasing, set the value to 1 and call
futex(FUTEX_WAKE)to wake one waiter.
Performance. In the uncontended case, futexes are 10—100x faster than traditional mutexes (no kernel transition). In the contended case, they have similar overhead to kernel mutexes.
3.2 Read-Copy-Update (RCU)
RCU is a synchronisation mechanism optimised for read-mostly data structures. Readers access data without any locking; writers create copies and update pointers atomically.
RCU protocol:
- Reader: dereference the pointer. No memory barriers or locks needed (on most architectures).
- Writer: Allocate a new copy of the data structure, modify the copy, atomically update the pointer to point to the new copy.
- Reclamation: After a grace period (all pre-existing readers have finished), free the old copy.
Grace period. Determined by waiting for all CPUs to pass through a quiescent state (no RCU readers active).
Advantages: Zero overhead for readers; suitable for high-read workloads (e.g., routing tables, file system metadata).
Disadvantages: Writers are expensive (copy the entire structure); not suitable for write-heavy workloads.
Theorem 3.1. RCU guarantees that no reader accesses freed memory, provided that: (1) readers do not sleep while holding an RCU reference, and (2) the old data is freed only after a grace period.
3.3 Sequence Locks (Seqlocks)
A sequence lock allows readers to proceed without locking but requires them to verify that the data was not modified during the read.
Protocol:
- Writer: Increment sequence number, write data, increment sequence number again.
- Reader: Read sequence number, read data, read sequence number again. If the two sequence numbers differ (or are odd), retry.
Use case: Very small, frequently read data where writes are rare (e.g., gettimeofdayStatistics counters).
3.4 Memory Ordering
Modern processors reorder memory operations for performance. The programmer must use memory barriers to enforce ordering when necessary.
Types of reordering:
| Type | Description | Example |
|---|---|---|
| Store-store | Stores reordered | Store A, Store B may become Store B, Store A |
| Load-load | Loads reordered | Load A, Load B may become Load B, Load A |
| Store-load | Store followed by load reordered | Store A, Load B may become Load B, Store A |
| Load-store | Load followed by store reordered | Load A, Store B may become Store B, Load A |
Memory barrier instructions:
| Barrier | x86-64 | ARM | Effect |
|---|---|---|---|
| Full barrier | mfence | dmb ish | All preceding memory ops complete before any subsequent op |
| Write barrier | (implicit) | dmb ishst | All stores complete before subsequent stores |
| Read barrier | (implicit on x86) | dmb ishld | All loads complete before subsequent loads |
| Acquire | lock prefix | ldar | Subsequent loads/stores cannot move before this |
| Release | (varies) | stlr | Preceding loads/stores cannot move after this |
:::caution Common Pitfall X86-64 has a stronger memory model (TSO — Total Store Order) than ARM and RISC-V. Code that works correctly on x86-64 may fail on ARM due to reordering. Always use proper synchronisation primitives (mutexes, atomic operations with explicit ordering) rather than relying on the hardware memory model. :::
Worked Example: Double-Checked Locking with Memory Ordering
The classic double-checked locking pattern for lazy initialisation:
// INCORRECT (without proper memory ordering)if (ptr == NULL) { // First check (no lock) lock(m); if (ptr == NULL) { // Second check (with lock) ptr = malloc(sizeof(*ptr)); init(ptr); } unlock(m);}return ptr;This is broken on ARM/POWER because: (1) the compiler may reorder the write to ptr before init(ptr)And (2) the processor may reorder the store to ptr before stores inside init(ptr). A reader on another CPU could see ptr != NULL but find the object not fully initialised.
Correct version using C11 atomics:
atomic_ptr ptr = ATOMIC_VAR_INIT(NULL);
if (atomic_load_explicit(&ptr, memory_order_acquire) == NULL) { lock(m); if (atomic_load_explicit(&ptr, memory_order_relaxed) == NULL) { obj_t *p = malloc(sizeof(*p)); init(p); atomic_store_explicit(&ptr, p, memory_order_release); } unlock(m);}return atomic_load_explicit(&ptr, memory_order_acquire);The memory_order_release on the store ensures all prior writes (including init) are visible before the store to ptr. The memory_order_acquire on the load ensures all subsequent reads see the effects of the release store.
4. Advanced File Systems
4.1 Journaling in Detail
Journaling file systems (ext4, NTFS, XFS) maintain a log (journal) of changes before committing them to the main file system structures.
Journaling modes:
| Mode | What is journaled | Performance | Safety |
|---|---|---|---|
| Writeback | Metadata only | Best | Metadata consistent, data may be lost |
| Ordered | Metadata + data written before metadata committed | Good | Data written before metadata, no stale data |
| Data | Metadata + data | Worst | Full consistency guarantee |
Journal structure. A circular buffer on disk divided into blocks. Each transaction records:
- Transaction ID (monotonically increasing).
- Block descriptors: (block number, old data, new data) or (block number, new data) for new blocks.
- Commit block: Marks the transaction as complete.
Recovery algorithm:
- Find the last committed transaction in the journal.
- Replay all committed but not-yet-applied transactions.
- Discard any uncommitted transactions (partial writes due to crash).
Theorem 4.1. Journaling guarantees file system consistency after a crash, provided that the journal write reaches disk before the main file system writes.
4.2 Log-Structured File Systems
A log-structured file system (LFS) writes all modifications (data and metadata) sequentially to a log, avoiding random writes.
Key ideas:
- The entire disk is a log. New data is always appended to the end.
- Inodes and data blocks are written to the log.
- An inode map (in memory and periodically checkpointed to disk) maps inode numbers to their locations in the log.
- Garbage collection reclaims space from obsolete log segments.
Segment cleaning. When the log fills up, segments are cleaned by reading live blocks from partially-full segments and writing them to new segments. The cleaned segments are then available for new writes.
Cost-benefit model for segment selection: Clean the segment with the highest ratio of dead blocks to live blocks (most benefit for least cost):
Where is the utilisation of segment .
Theorem 4.2 (Rosenblum and Ousterhout). With optimal segment cleaning, LFS write throughput approaches the raw disk bandwidth for high utilisation, and requires 1.5—2x the write bandwidth for 90% utilisation.
4.3 Copy-on-Write File Systems
ZFS and Btrfs use copy-on-write for all modifications:
- When a block is modified, write the new block to a free location.
- Update the parent pointer to point to the new block (recursively up to the root).
- The old block is freed after the transaction commits.
Advantages:
- Snapshots are free (just keep the old root pointer).
- Always consistent on disk (either old or new root, never a mix).
- Built-in RAID and checksumming.
Disadvantages:
- Write amplification (small writes trigger many block copies).
- Fragmentation (related blocks may be scattered).
5. I/O Systems in Depth
5.1 I/O Scheduling Algorithms
The OS must decide the order in which to service I/O requests to minimise total seek time and rotational latency.
SCAN (Elevator) variants:
| Algorithm | Direction | Fairness |
|---|---|---|
| SCAN | Reverses at disk ends | Unfair to middle tracks |
| C-SCAN | One-way, jumps to start | Fair, but wastes time on jump |
| LOOK | Reverses at last request | Better than SCAN |
| C-LOOK | One-way, jumps to first request | Best overall |
Theorem 5.1. For a disk with pending requests, SSTF (Shortest Seek Time First) minimises total head movement but may cause starvation. SCAN/C-SCAN guarantee bounded waiting time.
Worked Example: Disk Scheduling Comparison
Disk with 200 cylinders (0—199). Request queue: 98, 183, 37, 122, 14, 124, 65, 67. Head at 53, moving toward 0.
FCFS: 53 -> 98 -> 183 -> 37 -> 122 -> 14 -> 124 -> 65 -> 67. Total = = cylinders.
SSTF: Always go to the nearest request. 53 -> 65 (12) -> 67 (2) -> 37 (30) -> 14 (23) -> 98 (84) -> 122 (24) -> 124 (2) -> 183 (59). Total = cylinders.
SCAN: Moving toward 0. 53 -> 37 (16) -> 14 (23) -> 0 (14) -> 65 (65) -> 67 (2) -> 98 (31) -> 122 (24) -> 124 (2) -> 183 (59). Total = cylinders.
C-SCAN: Moving toward 0, then jump to 199. 53 -> 37 (16) -> 14 (23) -> 0 (14) -> 199 (199) -> 183 (16) -> 124 (59) -> 122 (2) -> 98 (24) -> 67 (31) -> 65 (2). Total = cylinders.
C-LOOK: Moving toward 0, then jump to highest request. 53 -> 37 (16) -> 14 (23) -> 0 (14) -> 183 (183) -> 124 (59) -> 122 (2) -> 98 (24) -> 67 (31) -> 65 (2). Total = cylinders.
SSTF and SCAN are best here (236 cylinders), but SSTF may starve requests far from the head.
5.2 Direct Memory Access (DMA)
DMA allows I/O devices to transfer data directly to/from memory without CPU involvement for each byte/word.
DMA transfer process:
- CPU programs the DMA controller with: source address, destination address, transfer count, and operation (read/write).
- CPU continues executing other tasks.
- DMA controller transfers data between the device and memory, stealing bus cycles as needed.
- When the transfer is complete, DMA sends an interrupt to the CPU.
Bus mastering. Modern devices (PCIe) act as bus masters, performing DMA independently. The DMA engine is on the device, not a separate controller.
Scatter-gather DMA. A single DMA operation can transfer data to/from multiple non-contiguous memory locations, specified by a list of (address, length) pairs.
Theorem 5.2. DMA reduces CPU overhead for I/O from (programmed I/O, one interrupt per byte) to (one interrupt per transfer).
6. Virtualisation in Depth
6.1 Hardware-Assisted Virtualisation
Intel VT-x and AMD-V provide hardware support for efficient virtualisation:
- VMX root mode: The hypervisor runs in a more privileged mode than the guest OS.
- VMCS (Virtual Machine Control Structure): Stores the guest state and controls transitions between guest and hypervisor.
- VM entry/exit: Transitions between guest and hypervisor mode. A VM exit is triggered by privileged instructions, I/O, or page faults.
EPT (Extended Page Tables) / NPT (Nested Page Tables). Two-dimensional page tables mapping (guest virtual -> guest physical -> host physical) addresses. Without EPT, every guest TLB miss requires two page table walks (shadow page tables). With EPT, the hardware handles the nested translation in a single walk.
Theorem 6.1. EPT reduces the overhead of guest page table walks from (shadow page tables, levels per walk, two walks) to (single nested walk).
6.2 Paravirtualisation
In paravirtualisation, the guest OS is modified to cooperate with the hypervisor:
- Replace privileged instructions with hypercalls (analogous to system calls but directed to the hypervisor).
- Replace the guest’s page table management with hypervisor-managed shadow page tables.
- Examples: Xen (PV guests), VMware Tools (partial paravirtualisation).
Advantage: Lower overhead than full virtualisation (no need to trap and emulate every privileged instruction).
Disadvantage: Requires modifying the guest OS kernel. Not possible for proprietary OSes (without vendor support).
6.3 Containers vs Virtual Machines
| Property | Virtual Machines | Containers |
|---|---|---|
| Isolation | Hardware-level | OS-level (shared kernel) |
| Startup time | Minutes | Seconds/milliseconds |
| Resource overhead | GBs (full OS) | MBs (shared kernel) |
| Density | 10s per host | 100s—1000s per host |
| Security | Strong (hardware isolation) | Weaker (kernel shared) |
| Portability | Good (self-contained) | Good (images) |
| Performance | Near-native (with VT-x) | Near-native |
7. Security in Depth
7.1 Capability-Based Security
A capability is an unforgeable token that identifies an object and the operations permitted on it.
Principle: Access to resources is controlled by possession of capabilities, not by identity (as in ACL-based systems).
Advantages: Principle of least privilege is enforced ; capabilities can be delegated.
Disadvantages: Revocation is difficult (once a capability is given, the receiver can copy it).
7.2 Sandbox Technologies
Seccomp (Secure Computing Mode). Linux mechanism to restrict the system calls a process can make.
- Seccomp-BPF: Uses BPF programs to filter system calls by number and arguments.
- Seccomp-notify: Allows a userspace process to handle filtered system calls (for mediated access).
Namespaces. Linux namespaces provide isolation for:
- PID: Separate process ID space.
- Network: Separate network interfaces, routing tables, firewall rules.
- Mount: Separate file system mount points.
- UTS: Separate hostname.
- IPC: Separate System V IPC and POSIX message queues.
- User: Separate UID/GID mappings.
- Cgroup: Separate resource control groups.
7.3 Mandatory Access Control (MAC)
SELinux (Security-Enhanced Linux). Implements MAC using security policies:
- Subjects: Processes.
- Objects: Files, sockets, etc.
- Security context: (user, role, type, level).
- Policy rules: Define which subjects can access which objects and how.
MAC vs DAC:
| Aspect | DAC (Discretionary) | MAC (Mandatory) |
|---|---|---|
| Who sets policy | Resource owner | System administrator |
| Enforcement | Voluntary (user-dependent) | Enforced by kernel |
| Bypass | Root can bypass | Even root must comply |
| Granularity | Per-file permissions | Fine-grained (type enforcement) |
10. Inter-Process Communication in Depth
10.1 Pipes and FIFOs
Anonymous pipes. Unidirectional, byte-stream communication between related processes (parent-child). Created by pipe() system call, returns two file descriptors: pipefd[0] for reading, pipefd[1] for writing.
Named pipes (FIFOs). Special files in the filesystem that allow communication between unrelated processes. Created by mkfifo(). Have a name in the filesystem but no content on disk.
Properties:
| Property | Anonymous Pipe | FIFO |
|---|---|---|
| Direction | Unidirectional | Unidirectional |
| Processes | Related (parent-child) | Any (with filesystem access) |
| Persistence | Process lifetime | Filesystem lifetime |
| Blocking | Reader blocks on empty; writer blocks on full | Same |
| Kernel buffer | 64 KB | 64 KB |
Theorem 10.1. A pipe has a fixed kernel buffer ( one page or 64 KB). If the buffer is full, write() blocks. If the buffer is empty, read() blocks. If all write ends are closed, read() returns 0 (EOF).
10.2 Message Queues
System V message queues and POSIX message queues allow processes to exchange messages with type-based prioritisation.
POSIX message queue API:
| Function | Purpose |
|---|---|
mq_open | Open/create a message queue |
mq_send | Send a message |
mq_receive | Receive a message (highest priority first) |
mq_notify | Register for asynchronous notification |
mq_close | Close a message queue |
mq_unlink | Remove a message queue |
Advantages over pipes:
- Message boundaries are preserved.
- Messages can have priorities.
- Multiple readers and writers.
Disadvantages:
- More complex API.
- Limited maximum message size and queue depth.
10.3 Shared Memory
Shared memory is the fastest IPC mechanism: multiple processes map the same physical memory into their address spaces.
POSIX shared memory:
shm_open("/myshm", O_CREAT | O_RDWR, 0666);ftruncate(shm_fd, size);ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);System V shared memory:
shmget(key, size, IPC_CREAT | 0666);shmat(shmid, NULL, 0);Synchronisation. Shared memory provides NO synchronisation. Processes must use mutexes, semaphores, or other mechanisms to coordinate access.
Theorem 10.2. Shared memory communication has overhead per access (just a memory read/write), compared to but with higher constant factor for pipes and message queues (which involve kernel transitions and data copying).
10.4 Signals
Signals are asynchronous notifications sent to a process or thread.
Standard signals:
| Signal | Number | Default action | Meaning |
|---|---|---|---|
| SIGINT | 2 | Terminate | Interrupt (Ctrl+C) |
| SIGKILL | 9 | Terminate | Cannot be caught or ignored |
| SIGSEGV | 11 | Terminate + core | Segmentation fault |
| SIGTERM | 15 | Terminate | Termination request |
| SIGCHLD | 17 | Ignore | Child process status changed |
| SIGSTOP | 19 | Stop | Cannot be caught or ignored |
| SIGCONT | 18 | Continue | Resume stopped process |
Signal handling issues:
- Reentrancy: Only async-signal-safe functions can be called from signal handlers (e.g.,
write``_exit``sigaction). Functions likemalloc``printfAndstrerrorare NOT async-signal-safe. - Race conditions: A signal may arrive between checking a condition and acting on it.
- Signal masking:
sigprocmask()blocks specific signals during critical sections.
Worked Example: IPC Selection
An application has a producer process and a consumer process that need to exchange structured messages (up to 1 KB each). Which IPC mechanism is most appropriate?
Options:
Pipes: Preserves byte stream but not message boundaries. Requires framing protocol. Unidirectional — need two pipes for bidirectional.
Message queues: Preserves message boundaries, supports priorities. Good fit for structured messages. Kernel copies data twice (sender -> kernel -> receiver).
Shared memory: Fastest (no kernel copies). Requires explicit synchronisation. Must implement message queue logic manually.
Recommendation: If performance is critical and the message format is simple, use shared memory with a mutex and condition variables for synchronisation. If simplicity is more important, use POSIX message queues.
Throughput comparison (approximate, for 1 KB messages):
- Pipe: ~500 MB/s
- Message queue: ~200 MB/s
- Shared memory: ~10 GB/s (limited by memory bandwidth)
11. Advanced Scheduling
11.1 Completely Fair Scheduler (CFS)
The Completely Fair Scheduler (CFS) is the default process scheduler in Linux.
Key ideas:
- Each task maintains a virtual runtime (
vruntime), which is the actual runtime scaled by the task’s priority (niceness). - CFS always picks the task with the smallest
vruntime. - The
vruntimeincrement per tick is: \text{vruntime} += \text{actual}\_time \times \text{weight_0} / \text{weight}Where depends on the nice value.
Target latency. CFS aims to give each task a fair share of CPU time within a “sched period” (target latency, 6 ms). If there are tasks, each gets ms per period.
Red-black tree. CFS stores all runnable tasks in a red-black tree keyed by vruntime. The leftmost node (minimum vruntime) is selected next. Insertion and extraction take .
Theorem 11.1. CFS provides proportional fairness: the ratio of CPU time received by any two tasks approaches the ratio of their weights as time goes to infinity.
11.2 Multi-Processor Scheduling
Load balancing. On SMP (Symmetric Multi-Processing) systems, the scheduler must distribute load across CPUs.
Linux load balancing:
- Periodic load balancing: Every 200 ms, the scheduler checks if any CPU is significantly more loaded than others and migrates tasks.
- Active balancing: When a CPU becomes idle, it steals tasks from other CPUs.
NUMA awareness. On NUMA systems, the scheduler prefers to run a task on the same NUMA node as its memory to avoid remote memory access latency.
11.3 CPU Isolation and Cgroups
cgroups (Control Groups) allow resource allocation and isolation:
| Controller | Resource |
|---|---|
| cpu | CPU time shares |
| cpuacct | CPU accounting |
| memory | Memory usage limit |
| blkio | Block I/O bandwidth |
| devices | Device access control |
| freezer | Freezing/thawing processes |
CPU shares. If cgroup A has 1024 shares and cgroup B has 512 shares, A gets 2/3 of CPU time and B gets 1/3 (when both are active).
Worked Example: CFS Virtual Runtime
Two tasks with default priority (nice = 0, weight = 1024):
Target latency = 6 ms. Each task gets 3 ms per period.
Period 1:
- Task A runs 3 ms.
vruntime_A = 3 × 1024/1024 = 3. - Task B runs 3 ms.
vruntime_B = 3 × 1024/1024 = 3.
Period 2:
- Both have
vruntime = 3. CFS picks the leftmost (arbitrary tiebreak). Say Task A. - Task A runs 3 ms.
vruntime_A = 6. - Task B runs 3 ms.
vruntime_B = 6.
Now change Task A’s nice value to -5 (higher priority, weight = 3121):
Period 3:
- Task B has lower
vruntime(6 vs 6, tie). Say CFS picks B (or A, tie). - Task A runs 3 ms.
vruntime_A = 6 + 3 × 1024/3121 = 6 + 0.984 = 6.984. - Task B runs 3 ms.
vruntime_B = 6 + 3 × 1024/1024 = 9.
After several periods, Task A’s vruntime grows more slowly (because its weight is higher), so CFS picks it more often. The ratio of runtimes approaches weight_A / weight_B = 3121/1024 ≈ 3.05.
13. Memory Mapping and Shared Memory
13.1 mmap System Call
mmap() maps a file or device into memory. The mapped region can be accessed like any memory location.
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);Parameters:
| Parameter | Description |
|---|---|
| addr | Preferred address (NULL for kernel to choose) |
| length | Number of bytes to map |
| prot | Protection: PROT_READ, PROT_WRITE, PROT_EXEC |
| flags | MAP_PRIVATE (copy-on-write) or MAP_SHARED (shared) |
| fd | File descriptor (or -1 for anonymous mapping) |
| offset | Offset in file |
Theorem 13.1. Memory-mapped I/O can be faster than read()/write() for large files because it avoids copying data from kernel space to user space (zero-copy).
Demand paging with mmap. Pages are not read from disk until they are accessed (page fault triggers the read). This is especially beneficial when only a portion of a large file is needed.
13.2 Copy-on-Write Semantics
MAP_PRIVATE. Each process gets a private copy of the mapped region. Writes trigger COW: the page is duplicated, and the process’s page table is updated to point to the new copy.
MAP_SHARED. Writes are visible to all processes that share the mapping and are written back to the underlying file.
Worked Example: mmap for File Processing
Read and modify every 4th byte of a 1 GB file, without loading the entire file into memory.
int fd = open("large_file.dat", O_RDWR);void *map = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
for (size_t i = 0; i < (1UL << 30); i += 4096) { // every page ((char *)map)[i] = ~((char *)map)[i]; // flip bits}
munmap(map, 1UL << 30);close(fd);Only the pages that are actually touched are read from disk (demand paging). For a 1 GB file, this takes 1 GB of virtual address space but only pages of physical memory (where is the number of pages touched).
With read()/write()We would need a buffer, read each page into the buffer, modify it, and write it back. This involves an extra copy (kernel buffer to user buffer).
13.3 Memory-Mapped IPC
Processes can communicate via shared memory mappings:
// Process 1 (creator)int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0666);ftruncate(fd, 4096);int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);shared[0] = 42;
// Process 2 (user)int fd = shm_open("/my_shm", O_RDWR, 0666);int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);printf("%d\n", shared[0]); // prints 4214. Advanced File System Internals
14.1 ext4 File System
ext4 is the default file system on most Linux distributions.
Key features:
| Feature | Description |
|---|---|
| Journaling | Writeback, ordered, or data mode |
| Extents | Replace indirect blocks with extents (start, length) |
| Delayed allocation | Blocks allocated at writeback, not at write() |
| Uninitialized extents | Reserve space without allocating blocks |
| Large file support | Up to 16 TB individual files |
| Subdirectories | Unlimited nesting depth |
Extent-based allocation. Instead of block pointers, ext4 stores extents: (logical block, physical block, length). This reduces metadata overhead for large contiguous files.
On-disk structure:
| Component | Location | Description |
|---|---|---|
| Superblock | Block 0 (or 1024) | File system parameters (size, block size, etc.) |
| Block group descriptors | After superblock | Pointers to block bitmaps, inode bitmaps, inode tables |
| Block bitmap | Per block group | Which blocks are free/used |
| Inode bitmap | Per block group | Which inodes are free/used |
| Inode table | Per block group | Array of inodes |
| Data blocks | Remaining | File data |
14.2 Journaling Modes Compared
| Mode | Metadata | Data | Crash recovery |
|---|---|---|---|
| Writeback | Journaled | Not journaled | Metadata consistent; data may be stale |
| Ordered | Journaled | Written before metadata | No stale data (data written before commit) |
| Journal | Journaled | Journaled | Full consistency |
Performance impact. Ordered mode is the default. Journal mode is safest but slowest (every data write goes to the journal first). Writeback mode is fastest but least safe.
14.3 File System Check (fsck)
fsck verifies and repairs file system consistency:
- Superblock check: Verify parameters are valid.
- Block bitmap check: Find referenced-but-free and free-but-referenced blocks.
- Inode check: Verify link counts, format, and referenced blocks.
- Directory check: Verify
.and..entries, detect orphaned directories. - Link count check: Verify hard link counts match references.
Theorem 14.1. fsck runs in time where is the number of inodes and is the number of blocks.
16. Deadlock Analysis in Depth
16.1 Resource-Allocation Graph
A resource-allocation graph is a directed bipartite graph with vertices for processes (circles) and resources (squares). An edge means process has requested resource . An edge means resource is held by process .
Theorem 16.1. A deadlock exists in a resource-allocation graph if and only if the graph contains a cycle. If each resource type has exactly one instance, this is both necessary and sufficient. If resources have multiple instances, a cycle is necessary but not sufficient.
Proof (single instance). () If there is a deadlock, processes in the deadlock set are waiting for resources held by other processes in the set. Following the wait-for chain must eventually revisit a process (since the set is finite), forming a cycle. () If there is a cycle, each process in the cycle holds a resource needed by the next process. None can proceed, so all are deadlocked.
Worked Example: Resource-Allocation Graph Analysis
Processes: P1, P2, P3. Resources: R1, R2, R3 (each with 1 instance).
Edges: P1 -> R1, R1 -> P2, P2 -> R2, R2 -> P3, P3 -> R3, R3 -> P1.
Graph: P1 -> R1 -> P2 -> R2 -> P3 -> R3 -> P1.
This contains a cycle: P1 - R1 - P2 - R2 - P3 - R3 - P1. Therefore, deadlock exists.
Now add: P3 -> R1 (P3 also requests R1). No change to the deadlock (P3 is already deadlocked).
Now add: P4 -> R3 (P4 requests R3, but R3 is held by P1). P4 is blocked but not in a cycle, so P4 is NOT deadlocked — it will proceed once P1 releases R3.
If R3 had 2 instances and the second instance is free: P4 could get the second instance of R3 immediately.
16.2 Deadlock Prevention vs Avoidance
| Strategy | Approach | Trade-off |
|---|---|---|
| Prevention | Eliminate one of the 4 necessary conditions | May reduce concurrency |
| Avoidance | Dynamically check if granting a request leads to deadlock | Requires advance knowledge of resource needs |
Prevention techniques:
- Eliminate mutual exclusion: Use spooling or non-sharable resource emulation. Not always possible.
- Eliminate hold-and-wait: Require all resources to be acquired at once. Low utilisation.
- Eliminate no preemption: Preempt resources from blocked processes. Complex to implement.
- Eliminate circular wait: Impose a total ordering on resources; acquire in increasing order.
Wait-Die scheme (prevention based on timestamps):
| Situation | Action |
|---|---|
| Older process requests resource held by younger | Wait (older waits for younger) |
| Younger process requests resource held by older | Die (younger aborts and restarts) |
Wound-Wait scheme:
| Situation | Action |
|---|---|
| Older process requests resource held by younger | Wound (preempt younger) |
| Younger process requests resource held by older | Wait (younger waits for older) |
Theorem 16.2. Both wait-die and wound-wait prevent deadlocks and are starvation-free.
Proof (wait-die). A cycle in the wait-for graph would require where each arrow means “waiting for.” In wait-die, a process only waits for older processes. So is older than , is older than …, is older than . This is a contradiction (). Therefore no cycle exists. Starvation-freedom: a process only dies (aborts) if it is younger than the holder. After restarting, it retains its original timestamp, so it becomes “older” relative to new processes. Eventually, it will be the oldest and never die again.
16.3 Banker’s Algorithm Detailed Example
The Banker’s algorithm avoids deadlock by only granting a request if the resulting state is “safe” (there exists a sequence that allows all processes to complete).
Safe state. A state is safe if there exists a sequence such that each process can run to completion with the available resources.
Worked Example: Banker's Algorithm
5 processes, 3 resource types (A, B, C). Total resources: (10, 5, 7).
| Process | Allocation | Maximum | Need |
|---|---|---|---|
| P0 | (0, 1, 0) | (7, 5, 3) | (7, 4, 3) |
| P1 | (2, 0, 0) | (3, 2, 2) | (1, 2, 2) |
| P2 | (3, 0, 2) | (9, 0, 2) | (6, 0, 0) |
| P3 | (2, 1, 1) | (2, 2, 2) | (0, 1, 1) |
| P4 | (0, 0, 2) | (4, 3, 3) | (4, 3, 1) |
Available = Total - sum(Allocation) = (10, 5, 7) - (7, 2, 5) = (3, 3, 2).
Safety check:
- Available = (3, 3, 2). Can any process run? P1 needs (1, 2, 2) (3, 3, 2). Yes. Run P1. After P1 completes: Available += P1’s allocation = (3, 3, 2) + (2, 0, 0) = (5, 3, 2).
- Available = (5, 3, 2). P3 needs (0, 1, 1) (5, 3, 2). Yes. Run P3. Available += (2, 1, 1) = (7, 4, 3).
- Available = (7, 4, 3). P4 needs (4, 3, 1) (7, 4, 3). Yes. Run P4. Available += (0, 0, 2) = (7, 4, 5).
- Available = (7, 4, 5). P0 needs (7, 4, 3) (7, 4, 5). Yes. Run P0. Available += (0, 1, 0) = (7, 5, 5).
- Available = (7, 5, 5). P2 needs (6, 0, 0) (7, 5, 5). Yes. Run P2. Available += (3, 0, 2) = (10, 5, 7).
Safe sequence: P1, P3, P4, P0, P2. The state is safe.
Request from P1 for (1, 0, 2):
Check: Need[P1] = (1, 2, 2). Request (1, 0, 2) Need[P1]? Yes. Request Available (3, 3, 2)? Yes.
Tentative state: Available = (3, 3, 2) - (1, 0, 2) = (2, 3, 0). P1: Allocation = (3, 0, 2), Need = (0, 2, 0).
Safety check:
- Available = (2, 3, 0). P1 needs (0, 2, 0) (2, 3, 0). Yes. Run P1. Available += (3, 0, 2) = (5, 3, 2).
- Available = (5, 3, 2). P3 needs (0, 1, 1) (5, 3, 2). Yes. Run P3. Available += (2, 1, 1) = (7, 4, 3).
- P4: (4, 3, 1) (7, 4, 3). Yes. Available = (7, 4, 5).
- P0: (7, 4, 3) (7, 4, 5). Yes. Available = (7, 5, 5).
- P2: (6, 0, 0) (7, 5, 5). Yes.
Safe sequence: P1, P3, P4, P0, P2. Request granted.
:::caution Common Pitfall The Banker’s algorithm requires knowing the maximum resource needs of each process in advance. In practice, processes often do not know (or cannot express) their maximum needs. This limits the practical applicability of the Banker’s algorithm. Additionally, the algorithm has time complexity per resource request (where is the number of resource types and is the number of processes), which may be prohibitive for large systems.
17. Problem Set
8.1 Process Management (Problems 1—3)
Problem 1. A process calls fork() twice in sequence (not nested). Draw the process tree and determine how many processes exist. What if the parent calls wait() after each fork()?
Problem 2. Three periodic real-time tasks with computation times and periods: , , . Determine if they are schedulable under (a) RMS, and (b) EDF. Draw the schedule for the first 24 time units under each algorithm.
Problem 3. Compare the 1:1, N:1, and M:N thread models for an application with 1000 short-lived threads, each performing one blocking I/O operation. Which model is most appropriate and why?
8.2 Memory Management (Problems 4—7)
Problem 4. A 64-bit system uses 4 KB pages and a four-level page table with 9 bits per level. How many bits of virtual address are supported? What is the maximum page table size for a process that maps the entire address space?
Problem 5. A process has a working set of 8 pages and accesses the following page references: 1, 2, 3, 4, 1, 2, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8. How many page faults occur with LRU using (a) 8 frames, (b) 7 frames, (c) 4 frames?
Problem 6. Translate the virtual address 0x00003ABC4DEF in a 48-bit paging system with 4 KB pages and four-level page tables. Page directory entries: PML4[0] = frame 0x100, PDP[3] = frame 0x200, PD[0xA] = frame 0x300, PT[0xB] = frame 0x4DE]. Give the physical address.
Problem 7. Explain why an inverted page table makes shared memory between processes more difficult to implement. Describe a solution.
8.3 Synchronisation (Problems 8—11)
Problem 8. Implement a bounded buffer (size ) using futexes instead of POSIX mutexes and condition variables. Show the acquire and release code.
Problem 9. Explain how RCU can be used to implement a lock-free linked list traversal while another thread is modifying the list. What happens if a reader is preempted while traversing?
Problem 10. The following code is intended to implement a spinlock on a weakly-ordered architecture. Identify the bug and fix it:
void lock(int *l) { while (*l == 1); // spin *l = 1;}void unlock(int *l) { *l = 0;}Problem 11. Three processes each need to access two resources (A and B) in different orders. Process 1 needs A then B, Process 2 needs B then C, Process 3 needs C then A. Is deadlock possible? If so, how would you prevent it using (a) resource ordering, and (b) the Banker’s algorithm?
8.4 File Systems and I/O (Problems 12—15)
Problem 12. A file system uses 1 KB blocks and 4-byte block pointers. An inode has 10 direct blocks, one single-indirect, one double-indirect, and one triple-indirect block. What is the maximum file size?
Problem 13. Compare journaling writeback mode with ordered mode for the following scenario: a process writes 100 KB of data to a file and then updates the file’s size in the metadata. The system crashes after the metadata is written but before all data blocks reach the disk. What is the state of the file after recovery under each mode?
Problem 14. A disk has 1000 cylinders, rotational speed of 7200 RPM, and 200 sectors per track (512 bytes each). The disk head is at cylinder 200. What is the average seek time (assuming linear seek), average rotational latency, and the total time to read a contiguous 1 MB file starting at cylinder 500?
Problem 15. Explain how ZFS ensures data integrity through copy-on-write, checksumming, and RAID-Z. What happens when a single disk fails in a RAID-Z1 configuration?
Solution to Problem 12
Block size = 1 KB = bytes. Pointer size = 4 bytes.
Pointers per block: .
Direct blocks: bytes.
Single indirect: bytes ( KB).
Double indirect: bytes ( MB).
Triple indirect: bytes ( GB).
Maximum file size: bytes GB.
If you get this wrong, revise: Section 4.
Solution to Problem 10
The bug is that the read (while (*l == 1)) and write (*l = 1) are not atomic and the compiler/hardware may reorder them. Additionally, the read-modify-write is not atomic — two threads could both read 0 and then both write 1.
Fix using GCC atomics:
void lock(int *l) { while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE)) { // spin (l was already 1) }}void unlock(int *l) { __atomic_store_n(l, 0, __ATOMIC_RELEASE);}__atomic_exchange_n atomically reads the old value and writes 1. The __ATOMIC_ACQUIRE on lock ensures subsequent memory operations are not reordered before the lock acquisition. The __ATOMIC_RELEASE on unlock ensures all prior memory operations complete before releasing the lock.
If you get this wrong, revise: Section 3.4.
Common Pitfalls
- Confusing processes and threads. Processes have separate address spaces; threads share the same address space within a process. Fix: Context switching between processes is expensive (address space swap); between threads is cheaper.
- Wrong deadlock condition. Deadlock requires four conditions: mutual exclusion, hold and wait, no preemption, circular wait. Fix: Prevent deadlock by eliminating one condition; use Banker’s algorithm for avoidance.
- Confusing virtual memory and swapping. Virtual memory maps logical addresses to physical; swapping moves entire processes between RAM and disk. Fix: Virtual memory allows partial loading (pages); swapping moves complete processes.
Worked Examples
Example 1: Process scheduling
Problem. Processes P1 (burst 6), P2 (burst 2), P3 (burst 8) arrive in order at time 0. Calculate average waiting time using SJF.
Solution. SJF order: P2 (2), P1 (6), P3 (8). Waiting times: P2 = 0, P1 = 2, P3 = 8. Average = (0 + 2 + 8)/3 = 3.33.
Example 2: Deadlock detection
Problem. Three processes P1, P2, P3 each hold one resource and request a second. Is there a deadlock?
Solution. P1 holds R1, requests R2. P2 holds R2, requests R3. P3 holds R3, requests R1. This forms a cycle (circular wait), so there is a deadlock.
Summary
- Process management: scheduling algorithms (FCFS, SJF, Round Robin), deadlock (conditions, prevention, avoidance, detection).
- Memory management: paging, segmentation, virtual memory, page replacement (LRU, FIFO, optimal).
- File systems: directory structure, allocation methods (contiguous, linked, indexed).
- Synchronisation: semaphores, monitors, critical sections; Peterson’s algorithm for two processes.
:::