Failover

Summary: The process of automatically promoting a follower to be the new leader when the previous leader becomes unavailable.

Sources: chapter5

Last updated: 2026-04-15


Failover is a critical mechanism for ensuring high availability in systems that use single-leader replication. If the leader node crashes or becomes unreachable, a follower must be promoted to take over its responsibilities (source: chapter5, p. 156).

The Failover Process

  1. Detection: The system must detect that the leader has failed (often using a timeout mechanism).
  2. Election: A new leader is chosen (e.g., through a consensus algorithm or by a controller node).
  3. Reconfiguration: Clients are redirected so that all writes, and any reads that need read-after-write consistency, go to the new leader, and the remaining followers start replicating from it (source: chapter5, p. 156).
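
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production design: the `Node` fields, the heartbeat-based detection, the 30-second default timeout, and the rule "promote the live follower with the most up-to-date log" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (seconds) of the most recent heartbeat seen
    log_position: int      # how far this node has applied the replication log


def leader_failed(leader: Node, now: float, timeout: float) -> bool:
    """Detection: treat the leader as dead if no heartbeat arrived within `timeout`."""
    return now - leader.last_heartbeat > timeout


def elect(followers: list[Node], now: float, timeout: float) -> Node:
    """Election (assumed rule): pick the live follower with the highest log position."""
    alive = [f for f in followers if now - f.last_heartbeat <= timeout]
    if not alive:
        raise RuntimeError("no live follower available for promotion")
    return max(alive, key=lambda f: f.log_position)


def failover(cluster: dict, now: float, timeout: float = 30.0) -> dict:
    """Run detection, election, and reconfiguration; return the cluster config."""
    if not leader_failed(cluster["leader"], now, timeout):
        return cluster  # leader is healthy, nothing to do
    new_leader = elect(cluster["followers"], now, timeout)
    remaining = [f for f in cluster["followers"] if f is not new_leader]
    # Reconfiguration: writes are routed to whichever node is stored under "leader".
    return {"leader": new_leader, "followers": remaining}


cluster = {
    "leader": Node("a", last_heartbeat=0.0, log_position=120),
    "followers": [Node("b", 95.0, 118), Node("c", 99.0, 115)],
}
cluster = failover(cluster, now=100.0)
print(cluster["leader"].name)  # b
```

In this run the promoted follower `b` is at log position 118 while the failed leader had reached 120, so two writes never reached the new leader.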

Risks and Challenges

  • Data Loss: With asynchronous replication, the new leader may not have received every write that the old leader accepted before it failed, so those unreplicated writes can be lost (source: chapter5, p. 157).
  • Split Brain: If two nodes both believe they are the leader (e.g., due to a network partition), they might both accept writes, leading to data corruption and conflicts (source: chapter5, p. 157).
  • Timeout Sensitivity: If the timeout is too short, a temporary load spike or network glitch could trigger an unnecessary failover. If it’s too long, recovery time increases.
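
To make the data-loss risk concrete, the following minimal sketch computes which writes would be missing after a follower is promoted. It assumes, for illustration only, that each log is a simple ordered list and that the follower's log is a prefix of the old leader's log; the function name `lost_writes` and the sample log entries are hypothetical.

```python
def lost_writes(old_leader_log: list[str], new_leader_log: list[str]) -> list[str]:
    """Return writes the old leader accepted that the promoted follower never received.

    Assumes the follower applied the leader's writes in order, so its log
    is a prefix of the leader's log.
    """
    if old_leader_log[:len(new_leader_log)] != new_leader_log:
        raise ValueError("follower log is not a prefix of the leader log")
    return old_leader_log[len(new_leader_log):]


old_leader_log = ["set x=1", "set y=2", "set x=3"]
follower_log = ["set x=1", "set y=2"]  # replication lagged by one write
print(lost_writes(old_leader_log, follower_log))  # ['set x=3']
```

With synchronous replication to at least one follower, the returned list would be empty for that follower, at the cost of higher write latency.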