Distributed Systems
1. CAP Theorem
1.1 Definition
A distributed system can guarantee at most two of the following three properties simultaneously:
| Property | Definition |
|---|---|
| Consistency | Every read receives the most recent write |
| Availability | Every request receives a non-error response |
| Partition tolerance | System operates despite network partitions |
Since network partitions are inevitable in practice, the real choice is between C and A during a partition.
1.2 CP vs AP Systems
| Type | Consistency | Availability | Partition | Examples |
|---|---|---|---|---|
| CP | Yes | No (during partition) | Yes | HBase, MongoDB, ZooKeeper |
| AP | No (eventual) | Yes | Yes | Cassandra, DynamoDB, Riak |
1.3 PACELC Theorem
An extension of CAP: If there is a Partition, choose Availability or Consistency; Else, choose Latency or Consistency.
2. Consistency Models
2.1 Linearizability (Strong Consistency)
Every operation appears to execute atomically at a single point in time. The result is equivalent to executing operations on a single copy in real-time order.
Cost: Requires coordination on every write → higher latency.
2.2 Sequential Consistency
All operations appear in some total order consistent with program order of each process. Less strict than linearizability (operations can appear to execute at non-real times).
2.3 Causal Consistency
Operations that are causally related must be seen by all processes in the same order. Concurrent (non-causal) operations may be seen in different orders.
Process A: write(x=1) → write(y=2) // write(x) causally precedes write(y)Process B: read(x) must see x=1 before (or same time as) y=22.4 Eventual Consistency
If no new updates are made, eventually all reads return the last written value. No guarantee on how long “eventually” takes.
Variants:
- Read-your-writes: A process always sees its own writes.
- Session consistency: Guarantees read-your-writes within a session.
- Monotonic reads: If a process reads x, subsequent reads of x never return older values.
- Monotonic writes: Writes from a single process are applied in order.
2.5 Comparison
| Model | Guarantees | Latency | Use Case |
|---|---|---|---|
| Linearizable | Real-time order | High | Banking, leader election |
| Sequential | Program order, total order | Medium | Shared memory simulation |
| Causal | Causal order | Low | Collaborative editing |
| Eventual | Converges if no updates | Lowest | Caching, social media feeds |
3. Consensus
3.1 Problem Definition
Consensus requires agreement on a value among processes, satisfying:
- Agreement: All correct processes decide the same value.
- Validity: The decided value was proposed by some process.
- Termination: All correct processes eventually decide.
FLP impossibility (1985): In an asynchronous system with even one failure, deterministic consensus is impossible.
3.2 Paxos
Roles:
- Proposer: Proposes values.
- Acceptor: Votes on proposals.
- Learner: Learns the decided value.
Phase 1 (Prepare/Promise):
PROPOSER: choose proposal number n send PREPARE(n) to majority of acceptors
ACCEPTOR: if n > any previously seen: promise not to accept any proposal < n send PROMISE(n, accepted_value, accepted_num) // (or PROMISE(n, null, null) if no prior acceptance) else: ignore
PROPOSER (collects majority promises): if any promise included accepted_value: propose (n, accepted_value) else: propose (n, proposer"s value)Phase 2 (Accept/Accepted):
PROPOSER: send ACCEPT(n, value) to majority of acceptors
ACCEPTOR: if n >= highest promised: accept (n, value) send ACCEPTED(n, value) else: ignore
LEARNER: upon receiving ACCEPTED from majority: value is decidedProperties: Guaranteed safety (only one value decided) even with multiple proposers. Liveness requires a stable leader.
3.3 Raft
Raft decomposes consensus into subproblems: leader election, log replication, safety.
States: Follower, Candidate, Leader.
Terms: Divided into terms with increasing term numbers. Each term has at most one leader.
Leader Election:
FOLLOWER: reset election timeout timer on heartbeat from leader if timeout expires: become Candidate, increment currentTerm vote for self send RequestVote to all servers if receive votes from majority: become Leader else: remain Candidate, retry after timeout
CANDIDATE: if receive AppendEntries with term >= currentTerm: revert to Follower if election timeout expires: start new electionLog Replication:
LEADER: on client request: append entry to own log send AppendEntries(log_entry, prev_index, prev_term) to all followers when majority of followers acknowledge: commit entry (apply to state machine) notify client
FOLLOWER: on AppendEntries from leader: if log doesn't contain prev_entry at prev_index: reject if term doesn't match: reject append entry to log send ACKSafety: A leader must have the most up-to-date log entry for each index.
Election timeout: Randomized (in most cases 150-300ms) to prevent split votes.
3.4 Comparison
| Property | Paxos | Raft |
|---|---|---|
| Understandability | Hard | Designed for clarity |
| Leader | Not required | Required |
| Membership change | Complex | Joint consensus |
| Implementation | Used in Google Chubby | Used in etcd, CockroachDB |
4. Replication
4.1 Single-Leader Replication
All writes go through the leader. Followers replicate asynchronously.
Leader → Follower1 (replication log)Leader → Follower2 (replication log)Client → Leader (write), any replica (read)Pros: Simple, no write conflicts. Cons: Leader is a bottleneck and single point of failure.
4.2 Multi-Leader Replication
Multiple nodes accept writes. Conflicts resolved by conflict resolution.
Conflict resolution strategies:
- Last-write-wins (LWW): Timestamp-based (may lose data).
- Version vectors: Track causal history, merge divergent versions.
- Application-specific: Custom merge logic.
Use cases: Multi-region deployments, offline-first clients.
4.3 Leaderless Replication
No designated leader. Any node accepts writes. Clients write to multiple nodes and read from multiple, using quorum.
Quorum reads and writes:
where = write quorum, = read quorum, = total replicas.
| Trade-off | ||
|---|---|---|
| 1 | Fastest reads, most writes | |
| Balanced read/write latency | ||
| 1 | Fastest writes, most reads |
4.4 Synchronous vs Asynchronous Replication
| Type | Latency | Durability | Consistency |
|---|---|---|---|
| Synchronous | High | Strong | Strong |
| Asynchronous | Low | Risky | Eventual |
| Semi-sync | Medium | Moderate | Moderate |
5. Partitioning (Sharding)
5.1 Horizontal Partitioning
Divide rows across nodes based on a partition key.
Hash partitioning:
PARTITION(key, N): node = hash(key) % N route to nodeRange partitioning: Assign contiguous ranges of keys to nodes.
Directory-based: A lookup service maps keys to partitions.
5.2 Consistent Hashing
Nodes and keys are placed on a hash ring (0 to ). Each key is assigned to the nearest node clockwise.
CONSISTENT_HASH(key, nodes): hash_key = hash(key) for each node in sorted by hash(node): if hash(node) >= hash_key: return node return first node // wrap aroundVirtual nodes: Each physical node has multiple virtual nodes on the ring, improving balance.
Advantages:
- Adding/removing a node only moves of keys.
- Balanced data distribution.
- Decentralized (no directory server needed).
5.3 Partitioning Challenges
- Skewed distribution: Some partitions receive more data/traffic.
- Cross-partition queries: Joins across partitions are expensive.
- Rebalancing: Moving data when nodes join/leave.
6. Fault Tolerance
6.1 Failure Models
| Failure Type | Behavior |
|---|---|
| Crash failure | Process stops and does not recover |
| Fail-stop | Crash + detectable (other nodes can tell) |
| Byzantine failure | Arbitrary behavior (including malicious) |
| Omission failure | Fails to send/receive messages |
| Timing failure | Responds outside specified timing bounds |
6.2 Crash Fault Tolerance
A system tolerates crash failures with replicas (quorum = ).
6.3 Byzantine Fault Tolerance
A system tolerates Byzantine failures with replicas.
Byzantine Generals Problem: How can loyal generals agree on a battle plan when some generals are traitors sending conflicting messages?
PBFT (Practical Byzantine Fault Tolerance):
PBFT phases for each request: 1. PRE-PREPARE: Leader assigns sequence number, sends to all replicas 2. PREPARE: Each replica sends PREPARE to all others 3. COMMIT: After 2f+1 PREPARE/PRE-PREPARE, send COMMIT to all 4. REPLY: After 2f+1 COMMIT, execute and replyTime: messages per consensus round.
6.4 Failure Detection
Heartbeat-based: Periodic messages. If no heartbeat within timeout, suspect failure.
Phi accrual failure detector: Probabilistic approach based on arrival history of heartbeats. Outputs a suspicion level .
Gossip-based: Nodes exchange membership information. Converges with rounds.
7. Distributed Transactions
7.1 Two-Phase Commit (2PC)
Phase 1 (Prepare/Voting):
COORDINATOR: send PREPARE to all participants
PARTICIPANT: write undo/redo logs to stable storage if can commit: send VOTE_COMMIT else: send VOTE_ABORTPhase 2 (Commit/Abort):
COORDINATOR: if all VOTE_COMMIT: send COMMIT to all else: send ABORT to all
PARTICIPANT: on COMMIT: commit transaction, release locks on ABORT: rollback, release locksProblems:
- Blocking: If coordinator crashes, participants are blocked (holding locks) until recovery.
- Single point of failure: Coordinator failure stalls all participants.
7.2 Three-Phase Commit (3PC)
Adds a pre-commit phase to eliminate blocking:
- CanCommit: Coordinator asks if participants can commit.
- PreCommit: Coordinator tells participants to prepare to commit.
- DoCommit: Coordinator commands commit.
Non-blocking (if no failures occur during pre-commit), but requires timeout assumptions (synchronous rounds).
7.3 Saga Pattern
Break a transaction into a sequence of local transactions with compensating actions.
SAGA(order): T1: reserve_inventory(order) T2: charge_payment(order) T3: ship_order(order)
// Compensating actions (on failure): C3: cancel_shipment(order) C2: refund_payment(order) C1: release_inventory(order)Execution:
- Execute sequentially.
- If fails, execute compensating actions .
Pros: No distributed locks, no blocking. Cons: Eventual consistency (reads may see intermediate state).
7.4 Comparison
| Method | Atomicity | Blocking | Consistency | Complexity |
|---|---|---|---|---|
| 2PC | Strong | Yes | Strong | Medium |
| 3PC | Strong | No* | Strong | High |
| Saga | Eventual | No | Eventual | Low |
*Non-blocking only under synchrony assumptions.
8. Time and Ordering
8.1 Physical Clocks
Clock skew: Difference between clocks on different machines.
Clock drift: Rate at which clocks diverge (typical: 1 part in to ).
NTP (Network Time Protocol): Synchronizes clocks within ~1ms on LANs, ~10-100ms on WANs.
GPS clocks: Provide high accuracy (~100ns) with external receivers.
8.2 Lamport Clocks
Logical clocks that capture happens-before () relationships.
Rules:
- Each process increments its clock before each event.
- When sending a message, include clock value.
- On receiving a message, set clock to .
LAMPORT_SEND(process, msg): process.clock += 1 msg.timestamp = process.clock send(msg)
LAMPORT_RECEIVE(process, msg): process.clock = max(process.clock, msg.timestamp) + 1 deliver(msg)Property: If (happens-before), then .
Limitation: does NOT imply (clocks can be concurrent).
8.3 Vector Clocks
Extend Lamport clocks with a vector where = process ‘s logical clock.
Rules:
- Increment own entry: .
- On send: include full vector.
- On receive: for all .
VECTOR_SEND(process, msg): process.V[process.id] += 1 msg.V = copy(process.V) send(msg)
VECTOR_RECEIVE(process, msg): for j = 1 to n: process.V[j] = max(process.V[j], msg.V[j]) process.V[process.id] += 1 deliver(msg)Comparison of vector clocks:
- If : event happened before .
- If and : events are concurrent.
Space: per event where = number of processes.
8.4 Ordering Comparison
| Method | Causality detection | Concurrent detection | Space per event |
|---|---|---|---|
| Physical time | No | No | Constant |
| Lamport | Yes (one direction) | No | Constant |
| Vector clock | Yes | Yes |
9. Common Pitfalls
Assuming CAP means you must permanently sacrifice C or A. The trade-off is only during partitions. Without partitions, a CP system can be both consistent and available. The choice matters for how the system behaves under failure.
Running Paxos/Raft without quorum. Consensus algorithms require majority quorum to guarantee safety. With nodes, you need at least available for progress.
Using synchronous replication without considering latency. Synchronous replication doubles or triples write latency (waiting for remote acknowledgments). Consider the trade-off carefully for your latency requirements.
Ignoring network partitions in real systems. Network partitions happen (not just node crashes). Design for them: use circuit breakers, timeouts, and graceful degradation.
Choosing wrong partition key. A poor partition key creates hotspots (uneven load). Use a key with uniform distribution and align with query patterns.
Forgetting compensating actions in sagas. Every step in a saga must have a corresponding compensating action that can be safely retried. Idempotency is essential for reliability.
Relying on physical clocks for ordering. Physical clocks drift and have limited precision. Use logical clocks (Lamport, vector) for ordering in distributed algorithms.
Worked Examples
Example 1: Vector Clock Causality
Problem: Three events: A at process P1 (VC: 1,0,0), B at P1 (VC: 2,0,0), C at P2 (VC: 0,1,0) which received message from A. Is A -> C? Solution: Event A has VC (1,0,0). Process P2 receives A’s message before event C, so C’s VC includes A’s clock: C = max(A, local event) = max((1,0,0), (0,1,0)) = (1,1,0). Since VC(C) >= VC(A) component-wise, A happened before C (A -> C). Event B has VC (2,0,0), which does not causally precede C (0 < 1 at P2’s counter). B and C are concurrent.
Example 2: Consensus with Paxos
Problem: A cluster of 5 acceptors receives Prepare messages from a proposer with ballot number 10. Acceptors 1, 3, 5 respond with ballot 10 and no previously accepted value. Acceptors 2, 4 respond with ballot 7 and accepted value V_old. What value is chosen? Solution: The proposer must use the highest-numbered accepted value among the quorum responses. Acceptors 2 and 4 accepted V_old with ballot 7. Since the proposer has a majority quorum ({1,2,3} or {2,3,4} etc.), and at least two acceptors returned V_old, the proposer issues Accept messages with ballot 10 and value V_old. Once a majority accepts, V_old is chosen.
Summary
- CAP theorem: Distributed systems trade consistency for availability during partitions (PACELC extends this).
- Consistency models range from linearizable (strongest) to eventual (weakest), trading latency for guarantees.
- Consensus (Paxos, Raft) ensures agreement despite failures; FLP proves impossibility under asynchrony.
- Replication (single-leader, multi-leader, leaderless) trades simplicity, conflict resolution, and availability.
- Partitioning uses hash or range partitioning; consistent hashing minimizes data movement on topology changes.
- Fault tolerance handles crash failures () and Byzantine failures ().
- Distributed transactions: 2PC (atomic but blocking), saga (eventual, non-blocking).
- Time and ordering: Lamport clocks capture happens-before; vector clocks distinguish concurrent events.
Cross-References
| Topic | Link |
|---|---|
| Databases | View |
| Networking | View |
| Operating Systems | View |