Introduction & Outline
Introduction
- Aim
- PAXOS
- RAFT
- PBFT
- Learning Objectives
- Three consensus mechanisms PAXOS, RAFT, and PBFT (typical in communication network).
- Assumed Previous Knowledge
- none
Outline
- PAXOS
- RAFT
- PBFT
PAXOS
PAXOS Main Idea
- It is the most fundamental distributed consensus algorithm which allows consensus over a value under unreliable communications.
- The key idea is that a majority represents the whole — if more than half of processes choose a value, that value is the consensus
- Paxos could work in scenarios
- Messages passing through the network can be delayed, lost, out of order, or duplicated, but not corrupted.
- Fail-stop model: processes may fail by stopping, may restart, but not present Byzantine faults, i.e., processes would not operate in a malicious way.
Nodes in PAXOS
There are three roles that nodes in a system running the Paxos protocol can undertake. A single process may assume all three roles
- Proposer
- Proposes values to be decided. An elected proposer acts as a single leader to propose a new value. Proposers handle client requests.
- Acceptor
- Acceptors evaluate and accept or reject proposals proposed by the proposers according to several rules and conditions.
- Learner
- Learns the decision, that is, the agreed-upon value
Phases in PAXOS
The protocol is composed of two phases, the prepare phase and the accept phase.
- At the end of the prepare phase: A majority of acceptors have promised a specific proposal number.
- At the end of the accept phase: A majority of acceptors have accepted a proposed value, and consensus is reached.
The prepare phase
- The proposer receives a request to reach consensus on a value by a client.
- The proposer sends a message prepare (n) to a majority or all acceptors.
- When an acceptor receives this prepare (n) message, it makes a “promise.”
At this stage, no value is proposed for a decision yet. The majority of acceptors is enough under the assumption that all acceptors in the majority will respond. Here, the n represents the proposal number which must be globally unique and must be greater than any proposal number this proposer has used before. For example, n can be a timestamp in nanoseconds or some other incrementing value. If a timeout occurs, then the proposer will retry with a higher n. In other words, if the proposer is unable to make progress due to a lack of responses from the acceptors, it can retry with a higher proposal number.
If no previous promise has been made by responding to this prepare (n) message, then the acceptor now promises to ignore any request less than the proposal number n. It records n and replies with message promise(n).
If the acceptor has previously promised, that is, already responded to another prepare message with some proposal number lower than n, the acceptor performs the following:
- If the acceptor has not received any accept messages already from a proposer in the accept phase, it stores the higher proposal number n and then sends a promise message to the proposer.
- If the acceptor has received an accept message earlier with some other lower proposal number, it must have already accepted a proposed value from some proposer. This previous full proposal is now sent along with the promise message to the proposer, indicating that the acceptor has already accepted a value.
The accept phase
- The proposer waits until it gets responses from the majority of the acceptors for this proposal n.
- When responses are received, the proposer evaluates what value v should be sent in the accept message. (more explanations will be given later)
- The proposer now sends an accept message – a full proposal of the form accept (n, v) – to the acceptors, where n is the promised proposal number and v is the actual proposed value.
- When an acceptor receives this accept(n, v) message, it replies with accepted(n, v) and sends accepted(n, v) to all learners or ignores the proposal. (more explanations will be given later)
- If a majority of acceptors accept the value v in the proposal, then v becomes the decided value of the protocol i.e., consensus is reached.
- Note: We have used the term majority indicating that a majority of acceptors have responded to or accepted a message. Also note that in order to tolerate f faulty acceptors, at least a set consisting of 2f + 1 acceptors is required.
The learning is considered part of the second phase where learners learn about the decided value from the acceptors. As soon as a proposal is accepted in the accept phase, the acceptor informs the learners.
When responses are received, the proposer evaluates what value v should be sent in the accept message. It performs the following:
- If the proposer received one or more promise messages with full proposals, it chooses the value v in the proposal with the highest proposal number.
- If no promise messages received by the proposer include a full proposal, the proposer can choose any value it wants.
When an acceptor receives this accept(n, v) message, it does the following:
- If the acceptor has promised not to accept this proposal number previously, it will ignore the message.
- Otherwise, if it has responded to the corresponding prepare request with the same n, that is, prepare(n), only then it replies with accepted(n, v) indicating acceptance of the proposal.
- Finally, the acceptor sends accepted(n, v) to all learners.
learn phase
RAFT
Visualization of Raft: http://thesecretlivesofdata.com/raft/
The Figure shows a third phase called the learn phase in a normal run of Paxos, but it is just for visualizing the protocol in a simpler way; learning is in fact part of phase 2, the accept phase.
Main Idea
- RAFT is easy to understand and easy to implement. RAFT stands for Replicated And Fault Tolerant.
- The key idea behind RAFT is to enable state machine replication with a persistent log.
- RAFT allows cluster reconfiguration which enables cluster membership changes without service interruption.
- RAFT allows log compaction to alleviate the issue of consuming too much storage and slow rebuild after node crashes.
- RAFT operates under scenarios:
- No Byzantine failures.
- Unreliable network communication.
- Asynchronous communication and processors.
Nodes in RAFT
three roles
- Leader: receives client requests, manages replication logs, and manages communication with the followers.
- Follower: nodes are passive in nature and only respond to Remote Procedure Call (RPCs). They never initiate any communication.
- Candidate: is a role that is used by a node that is trying to become a leader by requesting votes.
three states
Phases in RAFT
The first is leader election, and the second is log replication.
leader election
a heartbeat mechanism is used to trigger a leader election process.
- All nodes start up as followers.
- Followers will run as followers as long as they keep receiving valid RPCs (remote procedure calls) from a leader or a candidate.
- If a follower does not receive heartbeats from the leader for some time, then an “election timeout” occurs, which indicates that the leader has failed. Now the follower node undertakes the candidate role and attempts to become the leader by starting the election process.
- If a node receives votes from the majority of the nodes, then it becomes the leader, other nodes are followers. If no one wins the elections and election timeout occurs, the election process starts again with a new term.
The figure demonstrated the specific processes of a RAFT leader election
log replication
- The log replication phase of RAFT is straightforward.
- The client sends commands/requests to the leader to be executed by the replicated state machines.
- The leader assigns a term and index to the command so that the command can be uniquely identified in the logs held by nodes. It appends this command to its log.
- When the leader has a new entry in its log, at the same time it sends out the requests to replicate this command via the AppendEntries RPC to the follower nodes.
- When the leader is able to replicate the command to the majority of the follower nodes, that is, acknowledged, the entry is considered committed on the cluster
- Now the leader executes the command in its state machine and returns the result to the client. It also notifies the followers that the entry is committed via the AppendEntries RPC, and the followers execute committed commands in their state machines.
A set of logs from five nodes is shown in this figure.
Whole Process
PBFT
Main Idea
- Practical Byzantine fault tolerance (PBFT), as the name suggests, it is a protocol designed to provide consensus in the presence of Byzantine faults.
- PBFT constitutes three subprotocols called normal operation, view change, and checkpointing.
- Subprotocols
- The normal operation subprotocol refers to a mechanism executed when everything is running normally, and the system is error-free.
- The view change is a subprotocol that runs when a faulty leader node is detected in the system.
- Checkpointing is used to discard the old data from the system.
Nodes in PBFT
to tolerate Byzantine faults, the minimum number of nodes required is in a partially synchronous environment, where n is the number of nodes and f is the number of faulty nodes.
PBFT ensures Byzantine fault tolerance as long as the number of nodes in a system stays .
three roles
- Replica: Every participant in the PBFT protocol.
- Leader: In each round, a leader node, called the primary node, handles the communication with the client.
- Backups: The rest of the nodes except the leader
three phases
These phases run one after another to complete a single protocol run.
- Pre-prepare
- Accepts a request from the client.
- Assigns to it the next sequence number. This sequence number is the order in which the request is going to be executed.
- Broadcasts this information as the pre-prepare message to all backup replicas.
- Prepare
- Accepts the pre-prepare message only if the replica has not accepted any pre-prepare messages for the same view or sequence number before
- Sends the prepare message to all replicas
- Commit
- The replica waits for 2f + 1 prepare messages with the same view, sequence, and request.
- It sends a commit message to all replicas.
- It waits until a 2f + 1 valid commit message arrives and is accepted.
- It executes the received request.
- It sends a reply containing the execution result to the client.