Consistency and availability in Amazon's Dynamo

There is a continuing and welcome trend amongst large, modern technology companies like Google, Yahoo and Amazon to publish details of their systems at academic conferences. One of the problems that university researchers have is making a convincing case that their ideas would work well in the real world: no matter how carefully the assumptions are chosen, there is no substitute for field testing, and the infrastructure, workloads and data needed to do that effectively just aren’t available in academia. Companies, however, have infrastructure to burn and genuine use-cases with genuine users. Using their experience and data to discover what does and doesn’t work, and what is and is not really important, provides an invaluable feedback loop for researchers.

More than that, large systems are built from a set of independent ideas, and most academic papers leave the construction of a practical real-world system as an exercise for the reader. Synthesising a set of disparate techniques often throws up gotchas that no single paper directly addresses. Companies with businesses to run have a much greater incentive to build a robust system that works.

At the 2007 Symposium on Operating Systems Principles (SOSP), Amazon presented a paper about one of their real-world systems: “Dynamo: Amazon’s Highly Available Key-value Store”. It wound up winning, I think, the audience prize for best paper. In this post, I was planning to describe Dynamo ‘inside-out’, based on a close reading of the paper for a reading group. However, trying to lucidly explain a dense 12-page paper leads to many more than 12 pages of explanation. So instead, I want to focus on the one aspect of Dynamo that I think is the most interesting.


Dijkstra award 2008 goes to 'Sparse Partitions'

Not sure why I didn’t post this when I first found out – the 2008 Dijkstra Prize for an outstanding and influential paper in distributed systems has been awarded to David Peleg and Baruch Awerbuch for their 1990 FOCS paper Sparse Partitions (an on-line copy is available from MIT’s ad-hoc algorithms course here).

The citation does a much better job than I could of explaining the paper’s relevance. The general idea is that the authors show there are efficient ways of constructing clustered representations of graphs that stay within a small factor of the original in terms of route lengths. Further, they show that this can be done in a distributed manner. This has lots (and lots) of potential applications – the typical example is compact routing, where nodes keep routing-table entries per cluster rather than per node, trading slightly longer routes for much smaller tables.
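To give a feel for the kind of object the paper is talking about, here is a toy sketch of my own (not Awerbuch and Peleg’s construction, and the `growth` parameter is made up for illustration) of ball-growing clustering in Python: clusters are grown outwards from seed nodes one BFS layer at a time, stopping once a layer no longer enlarges the cluster much, which keeps cluster radii small while leaving relatively few edges crossing between clusters.

```python
# Toy ball-growing clustering of an undirected graph. Illustrative only:
# this is not the Sparse Partitions algorithm, just the flavour of the
# trade-off between cluster radius and the number of inter-cluster edges.

def grow_clusters(adj, growth=2.0):
    """adj: dict mapping each node to the set of its neighbours."""
    unassigned = set(adj)
    clusters = []
    while unassigned:
        seed = next(iter(unassigned))
        ball, frontier = {seed}, {seed}
        while True:
            # Unassigned nodes exactly one hop outside the current ball.
            layer = {v for u in frontier for v in adj[u]
                     if v in unassigned and v not in ball}
            # Stop once adding a layer no longer multiplies the ball's size
            # by the growth factor; this keeps the cluster's radius small.
            if not layer or len(ball) + len(layer) < growth * len(ball):
                break
            ball |= layer
            frontier = layer
        clusters.append(ball)
        unassigned -= ball
    return clusters

# Example: a 6-node cycle decomposes into a few low-radius clusters.
cycle = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(grow_clusters(cycle))
```

The real result, of course, comes with provable bounds on the trade-off and a distributed construction; the sketch above only conveys the shape of the idea.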

I’ve got the paper queued up on my list of walkthroughs to write, so expect a better explanation than the one above soon.

A Brief Tour of FLP Impossibility

One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch and Paterson. Their short paper ‘Impossibility of Distributed Consensus with One Faulty Process’, which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.

This particular result, known as the ‘FLP result’, settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus – that is, getting a distributed network of processors to agree on a common value – was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.
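As a toy illustration of that informal argument (a sketch of my own, not anything from the paper; the `Process` and `run_round` names are invented for this example), here is what one synchronous step looks like in Python: every correct process must reply within the step, so any process that is still silent at the end of the step can safely be declared crashed.

```python
# Minimal sketch of failure detection in a synchronous round. Every correct
# process replies within the step; silence therefore implies a crash.
# Purely illustrative: the names and structure are invented for this example.

class Process:
    def __init__(self, pid, crashed=False):
        self.pid = pid
        self.crashed = crashed

    def respond(self, value):
        # A crashed process never sends anything.
        return None if self.crashed else value

def run_round(processes, proposal):
    """One synchronous step: collect replies and suspect silent processes."""
    replies, suspected = {}, set()
    for p in processes:
        msg = p.respond(proposal)
        if msg is None:
            # Sound in a synchronous system; in an asynchronous one the
            # process might merely be slow, which is exactly the problem.
            suspected.add(p.pid)
        else:
            replies[p.pid] = msg
    return replies, suspected

procs = [Process(0), Process(1, crashed=True), Process(2)]
replies, suspected = run_round(procs, proposal=42)
print(replies)    # {0: 42, 2: 42}
print(suspected)  # {1}
```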

This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it’s not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where even a single processor might crash, there is no deterministic distributed algorithm that is guaranteed to solve the consensus problem.

In this post, I want to give a tour of the proof itself because, although it is quite subtle, it is short and profound. I’ll start by introducing consensus, and then, after describing some notation and assumptions, I’ll work through the two main lemmas in the paper.

If you want to follow along at home (highly, highly recommended) a copy of the paper is available here.
