The CAP theorem gets another airing.
I think the article makes a point worth making again, and makes it fairly well – that CAP is really about P=> ~(C & A). A couple of things I want to call out though, after a rollicking discussion on Hacker News.
“For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. You and I do not work with these types of systems because they don’t exist.”
This is a bit strong, at least theoretically. Actually all you need to not require partition-tolerance is to guarantee that your particular kryptonite failure pattern never occurs. Many protocols are robust to a dropped message here or there. A quorum system requires a fairly dramatic failure (one node completely partitioned) before one side of the partition has to occur. In practice, of course, these failures happen more often than we would like, which is why we worry about fault-tolerant properties of distributed algorithms.
Therefore the paragraph on failure probabilities is less powerful. It’s not always a problem if a single failure occurs, and therefore you shouldn’t immediately worry about sacrificing availability or consistency as soon as one node starts running slowly. CAP only establishes the existence of a failure pattern that torpedoes any distributed implementation of an atomic object, not its high probability.
Over on the Cloudera blog I’ve written an article that should be of interest to readers of this blog.
I’m no great fan of the ubiquity of the CAP theorem – it’s a solid impossibility result which appeals to the theorist in me, but it doesn’t capture every fundamental tension in a distributed system. For example: we make our systems distributed across more than one machine usually for reasons of performance and to eliminate a single point of failure. Neither of these motivations are captured verbatim by the CAP theorem. There’s more to designing distributed systems!
In this, I agree with Stonebraker; it’s the erroneous representation of ‘partition tolerance’ that I found very strange. I’ve been a good deal more forceful in private about this than I have in public 🙂
This is a really great article. Sean Quinlan talks very openly and critically about the design of the Google File System given ten years of use (ten years!).
What’s interesting is that the general sentiment seems to be that the concessions that GFS made for performance and simplicity (single master, loose consistency model) have turned out to probably be net bad decisions, although they probably weren’t at the time.
There are scaling issues with GFS – the well known many-small-files problem that also plagues HDFS, and a similar huge-files problem. Both increase the amount of metadata that a master has to maintain, and both therefore degrade performance due to some linear costs on metadata operations.
A replacement for GFS is in the works – a complete re-do, not just incremental updates to the design. This will involve, at least, a rethink of metadata storage and clearly a more fault-tolerant and scalable design. Multiple appenders, also a bugbear of HDFS, may be serialised through a single writer process, which would help the consistency issues.
“Our user base has definitely migrated from being a MapReduce-based world to more of an interactive world that relies on things such as BigTable. Gmail is an obvious example of that. Videos aren’t quite as bad where GFS is concerned because you get to stream data, meaning you can buffer. Still, trying to build an interactive database on top of a file system that was designed from the start to support more batch-oriented operations has certainly proved to be a pain point.”
One of those great articles where every paragraph is rich with insight. Worth reading, probably twice.
Barbara Liskov has just been announced as the recipient of the 2008 Turing Award, which is one of the most important prizes in computer science, and can be thought of as our field’s equivalent to the various Nobel Prizes. Professor Liskov is a worthy recipient of the award, even if judged alone by her citation which lists a number of the important contributions she has made to operating systems, programming languages and distributed systems.
Professor Liskov seems to be particularly well known for the Liskov substitution principle which says that some property of a supertype ought to hold of its subtypes. I’m not in any position to speak as to the importance of this contribution. However, her more recent work has been regarding the tolerance of Byzantine failures in distributed systems, which is much more close to my heart.
The only work of Liskov’s that I am really familiar with is the late 90s work on Practical Byzantine Fault Tolerance with Miguel Castro and is first published in this OSDI ’99 paper. I’m not going to do a full review, but the topic sits so nicely with my recent focus on consensus protocols that it makes sense to briefly discuss its importance.
This is one of my favourite papers from OSDI ’08 (yes, still doing a few reviews, trying to get to five or so before SOSP…). FlightPath is a system developed by some folks mainly at UT Austin for peer-to-peer streaming in dynamic networks. This is a reasonably challenging problem in itself, although one that’s seen a good deal of work before. However, the really cool thing about this paper is that they treat participants in the network as potentially rational agents. Since Lamport’s seminal work on the Byzantine generals problem, it’s been standard practice to assign one of two behaviour modes to members of distributed systems: either you’re alturistic, which means that you do exactly what the protocol tells you to do, no matter what the cost to yourself, or Byzantine, which means that you do whatever you like, again no matter what the cost to yourself.
It was realised recently that this is a false dichotomy: there’s a whole class of behaviour that’s not captured by these two extremes. Rational agents participate in a protocol as long as it is worth their while to do so. At its most simple, this means that rational agents will not incur a cost unless they expect to recoup a benefit that is worth equal to or more than the original cost to them. This gave rise to the Byzantine-Alturistic-Rational (BAR) model, due to the same UTA group, which can be used to more realistically model the performance of peer-to-peer protocols.