You can’t really read two articles about distributed systems today without someone mentioning the Paxos algorithm. Google use it in Chubby, Yahoo use it, or something a bit like it, in ZooKeeper and it seems that it’s considered the ne plus ultra of consensus algorithms. It also comes with a reputation as being fantastically difficult to understand – a subtle, complex algorithm that is only properly appreciated by a select few.
This is kind of true and not true at the same time. Paxos is an algorithm whose entire behaviour is subtly difficult to grasp. However, the algorithm itself is fairly intuitive, and certainly relatively simple. In this article I’ll describe how basic Paxos operates, with reference to previous articles on two-phase and three-phase commit. I’ve included a bibliography at the end, for those who want plenty more detail.
After a hiatus for the Christmas break, during which I travelled to the States, had a job interview, went to Vegas, became an uncle and got a cold, I’m back on a more regular posting schedule now. And I’ve got lots to post about.
Before I talk about other theoretical consensus protocols such as Paxos, I want to illustrate a consensus protocol running in the wild, and show how different modelling assumptions can lead to protocols that are rather different to the *PC variants we’ve looked at in the last couple of posts. We’ve been considering situations like database commit, where many participants agree en-masse to the result of a transaction. We’ve assumed that all participants may communicate reliably, without fear of packet loss (or if the packets are lost then the situation is the same as if the host that had sent the packet had failed).
The Transmission Control Protocol (TCP) gives us at least some approximation to a reliable link due to the use of sequence numbers and acknowledgements. However before we can use TCP both hosts involved in a point to point communication have to establish a connection: that is, they must both agree that a connection is established. This is a two-party consensus problem. Neither party can rely on reliable transmission, and can instead only use the IP stack and below to negotiate a connection. IP does not give reliable transmission semantics to packets and works only on a best-effort principle. If the network is noisy or prone to outages then packets will be lost. How can we achieve consensus in this scenario?
Those who have been reading this blog as far back as my explanation of FLP impossibility will probably be thinking that this is a trick question. FLP impossibility shows that if there is an unbounded delay in the transmission of a packet (i.e. an asynchronous network model) then consensus is, in general, unsolvable. Lossy links can be regarded as delaying packet delivery infinitely – therefore it seems very likely that consensus is unsolvable with packet loss.
In fact, this is completely true. Consensus with arbitrary packet loss is an unsolvable problem, even in an otherwise synchronous network. In this post I want to demonstrate the short and intuitive proof that this is the case, then show how this impossibility is avoided where possible in TCP connection establishment.
Last time we looked extensively at two-phase commit, a consensus algorithm that has the benefit of low latency but which is offset by fragility in the face of participant machine crashes. In this short note, I’m going to explain how the addition of an extra phase to the protocol can shore things up a bit, at the cost of a greater latency.
For the next few articles here, I’m going to write about one of the most fundamental concepts in distributed computing – of equal importance to the theory and practice communities. The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something – it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network.
For example, some possible uses of consensus are:
- deciding whether or not to commit a transaction to a database
- synchronising clocks by agreeing on the current time
- agreeing to move to the next stage of a distributed algorithm (this is the famous replicated state machine approach)
- electing a leader node to coordinate some higher-level protocol
Such a simple-sounding problem has surprisingly been at the core particularly of theoretical distributed systems research for over twenty years. How come? As I see it, the answers are threefold.
Although GFS provides Google with reliable, scalable distributed file storage, it does not provide any facility for structuring the data contained in the files beyond a hierarchical directory structure and meaningful file names. It’s well known that more expressive solutions are required for large data sets. Google’s terabytes upon terabytes of data that they retrieve from web crawlers, amongst many other sources, need organising, so that client applications can quickly perform lookups and updates at a finer granularity than the file level.
So they built BigTable, wrote it up, and published it in OSDI 2006. The paper is here, and my walkthrough follows.