Failover
Summary: The process of automatically promoting a follower to be the new leader when the previous leader becomes unavailable.
Sources: chapter5
Last updated: 2026-04-15
Failover is a critical mechanism for ensuring high availability in single-leader-replication systems. If the leader node crashes or becomes unreachable, another node must be elected to take over its responsibilities (source: chapter5, p. 156).
The Failover Process
- Detection: The system must detect that the leader has failed (often using a timeout mechanism).
- Election: A new leader is chosen (e.g., through a consensus algorithm or by a controller node).
- Reconfiguration: The system is updated to point all writes and read-after-write requests to the new leader (source: chapter5, p. 156).
Risks and Challenges
- Data Loss: If asynchronous replication was used, the new leader might not have all the writes from the old leader before it failed (source: chapter5, p. 157).
- Split Brain: If two nodes both believe they are the leader (e.g., due to a network partition), they might both accept writes, leading to data corruption and conflicts (source: chapter5, p. 157).
- Timeout Sensitivity: If the timeout is too short, a temporary load spike or network glitch could trigger an unnecessary failover. If it’s too long, recovery time increases.