The CAP theorem gets another airing.
I think the article makes a point worth making again, and makes it fairly well – that CAP is really about P=> ~(C & A). A couple of things I want to call out though, after a rollicking discussion on Hacker News.
“For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. You and I do not work with these types of systems because they don’t exist.”
This is a bit strong, at least theoretically. Actually all you need to not require partition-tolerance is to guarantee that your particular kryptonite failure pattern never occurs. Many protocols are robust to a dropped message here or there. A quorum system requires a fairly dramatic failure (one node completely partitioned) before one side of the partition has to occur. In practice, of course, these failures happen more often than we would like, which is why we worry about fault-tolerant properties of distributed algorithms.
Therefore the paragraph on failure probabilities is less powerful. It’s not always a problem if a single failure occurs, and therefore you shouldn’t immediately worry about sacrificing availability or consistency as soon as one node starts running slowly. CAP only establishes the existence of a failure pattern that torpedoes any distributed implementation of an atomic object, not its high probability.
Barbara Liskov has just been announced as the recipient of the 2008 Turing Award, which is one of the most important prizes in computer science, and can be thought of as our field’s equivalent to the various Nobel Prizes. Professor Liskov is a worthy recipient of the award, even if judged alone by her citation which lists a number of the important contributions she has made to operating systems, programming languages and distributed systems.
Professor Liskov seems to be particularly well known for the Liskov substitution principle which says that some property of a supertype ought to hold of its subtypes. I’m not in any position to speak as to the importance of this contribution. However, her more recent work has been regarding the tolerance of Byzantine failures in distributed systems, which is much more close to my heart.
The only work of Liskov’s that I am really familiar with is the late 90s work on Practical Byzantine Fault Tolerance with Miguel Castro and is first published in this OSDI ’99 paper. I’m not going to do a full review, but the topic sits so nicely with my recent focus on consensus protocols that it makes sense to briefly discuss its importance.
It’s one thing to wax lyrical about an algorithm or protocol having simply read the paper it appeared in. It’s another to have actually taken the time to build an implementation. There are many slips twixt hand and mouth, and the little details that you’ve abstracted away at the point of reading come back to bite you hard at the point of writing.
I’m a big fan of building things to understand them – this blog is essentially an expression of that idea, as the act of constructing an explanation of something helps me understand it better. Still, I felt that in order to be properly useful, this blog probably needed more code.
So when, yesterday, it was suggested I back up my previous post on Paxos with a toy implementation I had plenty of motivation to pick up the gauntlet. However, I’m super-pressed for time at the moment while I write my PhD thesis, so I gave myself a deadline of a few hours, just to keep it interesting.
A few hours later, I’d written this from-scratch implementation of Paxos. There’s enough interesting stuff in it, I think, to warrant this post on how it works. Hopefully some of you will find it useful, and something you can use as a springboard to your own implementations. You can run an example by simply invoking python toy_paxos.py.
You can’t really read two articles about distributed systems today without someone mentioning the Paxos algorithm. Google use it in Chubby, Yahoo use it, or something a bit like it, in ZooKeeper and it seems that it’s considered the ne plus ultra of consensus algorithms. It also comes with a reputation as being fantastically difficult to understand – a subtle, complex algorithm that is only properly appreciated by a select few.
This is kind of true and not true at the same time. Paxos is an algorithm whose entire behaviour is subtly difficult to grasp. However, the algorithm itself is fairly intuitive, and certainly relatively simple. In this article I’ll describe how basic Paxos operates, with reference to previous articles on two-phase and three-phase commit. I’ve included a bibliography at the end, for those who want plenty more detail.
After a hiatus for the Christmas break, during which I travelled to the States, had a job interview, went to Vegas, became an uncle and got a cold, I’m back on a more regular posting schedule now. And I’ve got lots to post about.
Before I talk about other theoretical consensus protocols such as Paxos, I want to illustrate a consensus protocol running in the wild, and show how different modelling assumptions can lead to protocols that are rather different to the *PC variants we’ve looked at in the last couple of posts. We’ve been considering situations like database commit, where many participants agree en-masse to the result of a transaction. We’ve assumed that all participants may communicate reliably, without fear of packet loss (or if the packets are lost then the situation is the same as if the host that had sent the packet had failed).
The Transmission Control Protocol (TCP) gives us at least some approximation to a reliable link due to the use of sequence numbers and acknowledgements. However before we can use TCP both hosts involved in a point to point communication have to establish a connection: that is, they must both agree that a connection is established. This is a two-party consensus problem. Neither party can rely on reliable transmission, and can instead only use the IP stack and below to negotiate a connection. IP does not give reliable transmission semantics to packets and works only on a best-effort principle. If the network is noisy or prone to outages then packets will be lost. How can we achieve consensus in this scenario?
Those who have been reading this blog as far back as my explanation of FLP impossibility will probably be thinking that this is a trick question. FLP impossibility shows that if there is an unbounded delay in the transmission of a packet (i.e. an asynchronous network model) then consensus is, in general, unsolvable. Lossy links can be regarded as delaying packet delivery infinitely – therefore it seems very likely that consensus is unsolvable with packet loss.
In fact, this is completely true. Consensus with arbitrary packet loss is an unsolvable problem, even in an otherwise synchronous network. In this post I want to demonstrate the short and intuitive proof that this is the case, then show how this impossibility is avoided where possible in TCP connection establishment.