Consensus Protocols: Three-phase Commit

Last time we looked extensively at two-phase commit, a consensus algorithm that has the benefit of low latency but which is offset by fragility in the face of participant machine crashes. In this short note, I’m going to explain how the addition of an extra phase to the protocol can shore things up a bit, at the cost of a greater latency.

The fundamental difficulty with 2PC is that, once the decision to commit has been made by the co-ordinator and communicated to some replicas, the replicas go right ahead and act upon the commit statement without checking to see if every other replica got the message. Then, if a replica that committed crashes along with the co-ordinator, the system has no way of telling what the result of the transaction was (since only the co-ordinator and the replica that got the message know for sure). Since the transaction might already have been committed at the crashed replica, the protocol cannot pessimistically abort – as the transaction might have had side-effects that are impossible to undo. Similarly, the protocol cannot optimistically force the transaction to commit, as the original vote might have been to abort.

This problem is – mostly – circumvented by the addition of an extra phase to 2PC, unsurprisingly giving us a three-phase commit protocol. The idea is very simple. We break the second phase of 2PC – ‘commit’ – into two sub-phases. The first is the ‘prepare to commit’ phase. The co-ordinator sends this message to all replicas when it has received unanimous ‘yes’ votes in the first phase. On receipt of this messages, replicas get into a state where they are able to commit the transaction – by taking necessary locks and so forth – but crucially do not do any work that they cannot later undo. They then reply to the co-ordinator telling it that the ‘prepare to commit’ message was received.

The purpose of this phase is to communicate the result of the vote to every replica so that the state of the protocol can be recovered no matter which replica dies.

The last phase of the protocol does almost exactly the same thing as the original ‘commit or abort’ phase in 2PC. If the co-ordinator receives confirmation of the delivery of the ‘prepare to commit’ message from all replicas, it is then safe to go ahead with committing the transaction. However, if delivery is not confirmed, the co-ordinator cannot guarantee that the protocol state will be recovered should it crash (if you are tolerating a fixed number f of failures, the co-ordinator can go ahead once it has received f+1 confirmations). In this case, the co-ordinator will abort the transaction.

If the co-ordinator should crash at any point, a recovery node can take over the transaction and query the state from any remaining replicas. If a replica that has committed the transaction has crashed, we know that every other replica has received a ‘prepare to commit’ message (otherwise the co-ordinator wouldn’t have moved to the commit phase), and therefore the recovery node will be able to determine that the transaction was able to be committed, and safely shepherd the protocol to its conclusion. If any replica reports to the recovery node that it has not received ‘prepare to commit’, the recovery node will know that the transaction has not been committed at any replica, and will therefore be able either to pessimistically abort or re-run the protocol from the beginning.

So does 3PC fix all our problems? Not quite, but it comes close. In the case of a network partition, the wheels rather come off – imagine that all the replicas that received ‘prepare to commit’ are on one side of the partition, and those that did not are on the other. Then both partitions will continue with recovery nodes that respectively commit or abort the transaction, and when the network merges the system will have an inconsistent state. So 3PC has potentially unsafe runs, as does 2PC, but will always make progress and therefore satisfies its liveness properties. The fact that 3PC will not block on single node failures makes it much more appealing for services where high availability is more important than low latencies.

Next time I talk about consensus, it will be to try and describe Paxos, which is really a generalisation of 2PC and 3PC which has found massive popularity recently in building real-world distributed replicated state machines such as Chubby.

As ever – comments or questions, feel free to get in touch.

17 thoughts on “Consensus Protocols: Three-phase Commit

  1. Hi,

    3PC indeed fails if there are both network and node failures.

    The advantage (liveness) also exists for any practical 2PC that I know of: it is called Heuristics. IMHO the extra overhead for 3PC does not make it interesting – that is why we (at Atomikos) use 2PC instead…

    Of course, I could be wrong and any opinions are welcome.

  2. Hi –

    You’re right that, in practice, 2PC can work well. If you’ve got a synchronous network (i.e. you can be conservative with your timeouts), or you only have fail-stop failures, or you have atomic broadcast, and you don’t have message re-ordering you’re probably going to do quite well with 2PC.

    What worries me with 2PC (and 3PC) is the failure mode. If you have fail-recover then you can get into a situation where transactions can get erroneously committed. You can throw a lot of engineering at the problem to reduce the practical risk, and I concede that’s a completely valid design decision.

    cheers,

    Henry

  3. Hi,

    Re the failure mode: Nancy Lynch et al have shown that it is impossible to reach distributed agreement with both network AND node failures present, so there is nothing one can do against that. However, the pragmatic solution (in our products at least) is to use heuristics and log them – so a human administrator can at least detect those with all the relevant transaction context to resolve them manually.

    I agree it is not perfect, but I think it is the best one can do in 2PC…

    Guy

  4. Hi Guy –

    Yes, FLP shows that distributed consensus is impossible in an async network with a single node failure.

    However, the key is that 2PC sacrifices correctness, or without timeout-recovery of coordinator failure it tolerates only one fault.

    Paxos, which is maybe heavyweight for your purposes, sacrifices liveness but at the same time tolerates n/2 faults. If there is a network partition then the protocol ensures that no transactions will be erroneously committed, and liveness returns when the network reconnects.

    So, in the context of FLP, you have to choose between correctness, validity and liveness. Paxos chooses correctness – 2PC either chooses correctness (but is extremely fragile to failures) or liveness (but failures are a real risk).

    cheers,

    Henry

  5. Consider a scenario where all participants replied with ‘yes’ in the first phase. The coordinator then sent ‘prepare-commit’ to all the participants. Now, all the participants received the ‘prepare-commit’ messages and they proceed to acknowledge it. However, the co-ordinator gets all except one ack(delayed ack message). Hence, it times out one participant and aborts the transaction. It tries to send the abort message to all the participants, but only the timed out guy receives the abort message, then aborts locally and then crashes. The coordinator also crashes.

    Given possibility of this situation, even when all live participants have a pre-commit message with them, the new coordinator cannot either commit or abort ?

  6. @harisankarh agreed the new coordinator cannot know consistent outcome as it cannot know which state the downed node is at. There is a way to slightly paper over the gap which is to order the participants so that the new coordinator can know whether the commit cycle has started; if node N>0 is down then the new coordinator should abort. If node N=0 is down then it is still stuck not knowing if the commit cycle has started.

    In later articles it is pointed out that this is a liveliness problem in the face of failures for this 3PC problem which Paxos solves.

  7. Thank you for this wonderful exposition. Like harisankarh@ I’m not sure I understand how 3PC solves the liveness problem of 2PC. In 2PC, if the coordinater and one node both crash before the beginning of the second phase, the other nodes can’t tell what the decision was (assuming all of them voted to commit). Wouldn’t a similar scenario arise in 3PC if before the beginning of the third phase both the coordinator and one replica crash (assuming all the other nodes received the prepare-to-commit message)? It could be that the crashed replica received an abort message from the coordinator (e.g if some acks were not received by the coordinator) and managed to abort before crashing, or it could be that it received a commit message and managed to commit before crashing. The rest of the system has the same state in both scenarios.

  8. @Erez: I think the key in 3PC is that the addition of the prepare-to-commit (P) phase disambiguates what to do when all surviving nodes are at W state. For example, consider the following states in 2PC: initial (I), waiting (W), abort (A) and commit (C). Also, let there be 5 nodes. Consider the following 2 scenarios.

    Scenario 1 with 2PC) The coordinator moves from I to W and signals the other nodes to move to W. All the replicas move from I to W, except 1 which moves from I to A (aborts). Now, the coordinator and that aborting replica crashes, and we end up with 3 nodes at W state.

    Scenario 2 with 2PC) The coordinator moves from I to W, and tells all the other replicas to move to W. All replicas move to W. Now, the coordinator moves to C, and sends a commit message to replica 1, but at this point it crashes. Replica 1 receives the commit message, but then it crashes. Again, as in scenario 1, we’re left with 3 nodes at W state.

    The point to notice here is that by observing the state of the 3 surviving nodes, which is W, we cannot tell whether the three nodes have to commit or abort, and hence cannot make a decision. Therefore, the three nodes have to block.

    In 3PC, the addition of the P phase helps disambiguate the two scenarios above. For example, let’s revisit the previous two scenarios.

    Scenario 1 with 3PC) Coordinator moves from I to W, and tells all other replicas to move to W. All replicas move from I to W, except 1 replica which aborts. Coordinator and aborting replica crashes. Since the three nodes observe that all of them are in W state, all of them aborts.

    Scenario 2 with 3PC) Coordinator moves from I to W, and tells all other replicas to move to W. All replicas move from I to W. Coordinator moves to P, tells replica 1 to move to P, and then both of them crash. The difference here is that the 3 other replicas can still make a decision to abort, and when the coordinator and replica 1 comes back, it pings the other nodes and notices that they have aborted.

    The key thing to notice in Scenario 2 with 3PC is that when the node crashes after entering P state, and they come back alive, they can still jump to the A state, since they haven’t committed anything to the database.

Comments are closed.