# Consensus Protocols: Three-phase Commit

Posted: November 29th, 2008 | Author: | Filed under: computer science, Distributed systems, Note | Tags: , , , | 9 Comments »

Last time we looked extensively at two-phase commit, a consensus algorithm that has the benefit of low latency but which is offset by fragility in the face of participant machine crashes. In this short note, I’m going to explain how the addition of an extra phase to the protocol can shore things up a bit, at the cost of a greater latency.

The fundamental difficulty with 2PC is that, once the decision to commit has been made by the co-ordinator and communicated to some replicas, the replicas go right ahead and act upon the commit statement without checking to see if every other replica got the message. Then, if a replica that committed crashes along with the co-ordinator, the system has no way of telling what the result of the transaction was (since only the co-ordinator and the replica that got the message know for sure). Since the transaction might already have been committed at the crashed replica, the protocol cannot pessimistically abort – as the transaction might have had side-effects that are impossible to undo. Similarly, the protocol cannot optimistically force the transaction to commit, as the original vote might have been to abort.

This problem is – mostly – circumvented by the addition of an extra phase to 2PC, unsurprisingly giving us a three-phase commit protocol. The idea is very simple. We break the second phase of 2PC – ‘commit’ – into two sub-phases. The first is the ‘prepare to commit’ phase. The co-ordinator sends this message to all replicas when it has received unanimous ‘yes’ votes in the first phase. On receipt of this messages, replicas get into a state where they are able to commit the transaction – by taking necessary locks and so forth – but crucially do not do any work that they cannot later undo. They then reply to the co-ordinator telling it that the ‘prepare to commit’ message was received.

The purpose of this phase is to communicate the result of the vote to every replica so that the state of the protocol can be recovered no matter which replica dies.

The last phase of the protocol does almost exactly the same thing as the original ‘commit or abort’ phase in 2PC. If the co-ordinator receives confirmation of the delivery of the ‘prepare to commit’ message from all replicas, it is then safe to go ahead with committing the transaction. However, if delivery is not confirmed, the co-ordinator cannot guarantee that the protocol state will be recovered should it crash (if you are tolerating a fixed number $f$ of failures, the co-ordinator can go ahead once it has received $f+1$ confirmations). In this case, the co-ordinator will abort the transaction.

If the co-ordinator should crash at any point, a recovery node can take over the transaction and query the state from any remaining replicas. If a replica that has committed the transaction has crashed, we know that every other replica has received a ‘prepare to commit’ message (otherwise the co-ordinator wouldn’t have moved to the commit phase), and therefore the recovery node will be able to determine that the transaction was able to be committed, and safely shepherd the protocol to its conclusion. If any replica reports to the recovery node that it has not received ‘prepare to commit’, the recovery node will know that the transaction has not been committed at any replica, and will therefore be able either to pessimistically abort or re-run the protocol from the beginning.

So does 3PC fix all our problems? Not quite, but it comes close. In the case of a network partition, the wheels rather come off – imagine that all the replicas that received ‘prepare to commit’ are on one side of the partition, and those that did not are on the other. Then both partitions will continue with recovery nodes that respectively commit or abort the transaction, and when the network merges the system will have an inconsistent state. So 3PC has potentially unsafe runs, as does 2PC, but will always make progress and therefore satisfies its liveness properties. The fact that 3PC will not block on single node failures makes it much more appealing for services where high availability is more important than low latencies.

Next time I talk about consensus, it will be to try and describe Paxos, which is really a generalisation of 2PC and 3PC which has found massive popularity recently in building real-world distributed replicated state machines such as Chubby.

As ever – comments or questions, feel free to get in touch.

### 9 Comments on “Consensus Protocols: Three-phase Commit”

1. 1 Consensus with lossy links: Establishing a TCP connection at Paper Trail said at 2:15 pm on January 12th, 2009:

[...] can lead to protocols that are rather different to the *PC variants we’ve looked at in the last couple of posts. We’ve been considering situations like database commit, where many participants agree [...]

2. 2 Consensus Protocols: Paxos at Paper Trail said at 5:03 pm on February 3rd, 2009:

[...] you read my previous article on three-phase commit, you might be wondering why we need another consensus algorithm. 3PC does away with the main [...]

3. 3 Guy Pardon said at 11:42 am on March 3rd, 2009:

Hi,

3PC indeed fails if there are both network and node failures.

The advantage (liveness) also exists for any practical 2PC that I know of: it is called Heuristics. IMHO the extra overhead for 3PC does not make it interesting – that is why we (at Atomikos) use 2PC instead…

Of course, I could be wrong and any opinions are welcome.

4. 4 Henry said at 12:57 pm on March 3rd, 2009:

Hi -

You’re right that, in practice, 2PC can work well. If you’ve got a synchronous network (i.e. you can be conservative with your timeouts), or you only have fail-stop failures, or you have atomic broadcast, and you don’t have message re-ordering you’re probably going to do quite well with 2PC.

What worries me with 2PC (and 3PC) is the failure mode. If you have fail-recover then you can get into a situation where transactions can get erroneously committed. You can throw a lot of engineering at the problem to reduce the practical risk, and I concede that’s a completely valid design decision.

cheers,

Henry

5. 5 Guy Pardon said at 4:16 pm on March 3rd, 2009:

Hi,

Re the failure mode: Nancy Lynch et al have shown that it is impossible to reach distributed agreement with both network AND node failures present, so there is nothing one can do against that. However, the pragmatic solution (in our products at least) is to use heuristics and log them – so a human administrator can at least detect those with all the relevant transaction context to resolve them manually.

I agree it is not perfect, but I think it is the best one can do in 2PC…

Guy

6. 6 Henry said at 4:24 pm on March 3rd, 2009:

Hi Guy -

Yes, FLP shows that distributed consensus is impossible in an async network with a single node failure.

However, the key is that 2PC sacrifices correctness, or without timeout-recovery of coordinator failure it tolerates only one fault.

Paxos, which is maybe heavyweight for your purposes, sacrifices liveness but at the same time tolerates n/2 faults. If there is a network partition then the protocol ensures that no transactions will be erroneously committed, and liveness returns when the network reconnects.

So, in the context of FLP, you have to choose between correctness, validity and liveness. Paxos chooses correctness – 2PC either chooses correctness (but is extremely fragile to failures) or liveness (but failures are a real risk).

cheers,

Henry

7. 7 Paper: Consensus Protocols: Paxos | Unix Stuff said at 12:21 am on March 16th, 2009:

[...] of articles on consensus protocols. We already covered his 2 Phase Commit article and he also has a 3 Phase Commit article showing how to handle 2PC under single node [...]

8. 8 The Paper Trail « hussam's bitstream said at 7:02 pm on June 3rd, 2012:

[...] Consensus Protocols: Three-Phase Commit [...]

9. 9 Paper Trail » Blog Archive » Consensus Protocols: Paxos said at 3:20 pm on April 1st, 2013:

[...] you read my previous article on three-phase commit, you might be wondering why we need another consensus algorithm. 3PC does away with the main [...]