In Search of an Understandable Consensus Algorithm

The goal of a consensus algorithm is to allow multiple machines to work as a coherent group which can survive the failures from some of its members.
Paxos has been the most common consensus algorithm used around, yet it is quite hard to understand, and hard to implement.

In this paper, the authors introduce Raft, a new consensus algorithm meant to be understandable, as opposed to Paxos.

Replicated State Machines

Consensus algorithms are typically used when setting up replicated state machines.
Several machines will connect with each other to share a common state and replicate it. If one or some machines have a failure, the rest of them will continue operating normally, therefore increasing reliability.

Replicated state machines typically use a replicated log. Each server keeps a log of commands that need to be executed in order.
The consensus algorithm’s role is to keep that log consistent across all machines.

They ensure the following properties:

Safety - they never return an incorrect result, no matter the (non-byzantine) conditions.
Functionality - they are fully functional as long as any majority of the servers are operational.
Timing - they don’t depend on timing to ensure consistency. Clock skew doesn’t cause failure.
Speed - As soon as the fastest half of all servers has responded, the command can complete. Slow servers don’t slow down the cluster.

Understandability

Understandability is achieved by having a constant focus on it. Whenever, when building the protocol’s design, two alternatives were possible, they always focused on the most understandable one.
While this is very objective, they used two techniques to make those decisions.

Problem Decomposition: Wherever possible, they divided problems into separate pieces that could be solved, explained and understood independently.
Reduce the number of states: this allowed making the system more coherent and reduced non-determinism.

Raft

Any Raft cluster has a single leader, and many followers. Ideally, there should be at least 5 servers, so the cluster can tolerate two failures.

New nodes will start with the follower state, listening for log updates from the leader.

Leader Election

If a node doesn’t receive any updates from the leader, they time out, considering there no available one. A new election ensues.

When performing an election, the first node to time out starts a new election, transitions to the candidate state. It then requests votes from each other node in the cluster.
When receiving a vote request, nodes are expected to approve or reject the vote. They can only approve one node as being the leader per election round.
When a majority of nodes have approved the election, the node is appointed leader and can start streaming logs.

Log Replication

Once a leader has been elected, it can begin servicing new log entries and replicating them to other nodes.

Any new log entry goes into an uncommitted log. The leader then transmits it to each node, and wants for their replies.
Once half of the nodes have approved the new log, the log can be approved, and be moved to the committed log.

The request to add new log entries can fail until the event is moved to committed. That allows rejecting new lines if they won’t be committed, while not having to wait for the slowest node to acknowledge the change.

Safety

Raft provides several ways to ensure safety of the logs data across leaders election.

A term is shared across nodes, which is incremented whenever a node creates a new election. Log entries appended, or new elections with the wrong term will be rejected.

Section 5.4 shows how the leader election completeness allows for safety of the stored data.

Snapshots

Over time, the raft log will grow, and it will take more and more time to start new nodes, as they will have many log lines to apply before being available.
For this reason, raft allows applying snapshots.

A snapshot includes 3 information:

The last included index
The last included term
The expected state of the state machine

With the last included index and term, the node can accept any new log entry that is being transmitted.
With the expected state of the state machine, all previous log lines, allowing to build the FSM into its final state can be discarded.

The leader decides when to send a snapshot request, if a new node is connecting, or one is lagging too far behind.
When receiving a snapshot, the node is expected to discard its log, to set the state machine to the provided state and to accept new requests coming later.

Conclusion

This a very interesting paper, and the understandable is definitely not an overstatement.

While the need for implementing a consensus algorithm should be pretty rare (build stateless services instead), doing so with raft would definitely make sense.
And while I’m sure it wouldn’t be trivial, it doesn’t seem like it would be such a big problem to implement though.