Paper Trail

Computer Systems, Distributed Algorithms and Databases

How consistent is eventual consistency?

This page, from the ‘PBS’ team at Berkeley’s AMPLab is quite interesting. It allows you to tweak the parameters of a Dynamo-style system, then by running a series of Monte Carlo simulations gives an estimate of the likelihood of staleness of reads after writes.

Since the Dynamo paper appeared and really popularised eventual consistency, the debate has focused on a fairly binary treatment of its merits. Either you can’t afford to be wrong, ever, or it’s ok to have your reads be stale for a potentially unbounded amount of time. In fact, the suitability of eventual consistency is dependent partly on the distribution of stale reads; that is the speed of quiescence of a system immediately after a write. If the probability of a ever seeing a stale read due to consistency delays can be reduced to smaller than the probability of every machine in the network simultaneously catching fire, we can probably make use of eventual consistency.

Looking at many designed systems (where there is little more than conventional wisdom on how to choose R and W), it’s clear that an analytical model relating system parameters to distributions of behaviour is sorely needed. PBS is a good step in that direction. It would be good to see the work extended to handle a treatment of failure distributions (although a good failure model is hard to find!). The reply latencies of write and read replicas are modelled exponentially distributed CDFs, but in reality there’s a more significant probability of the reply latency becoming infinite. Once that distribution is correctly modelled, PBS should be able to run simulations against it with no change.

A great use for this tool would be to enter some operational parameters, such as the required consistency probability, max number of nodes, availability requirements and maximum request latency, and have PBS suggest some points in the system design space that would meet these requirements with high probability. As the size of the R / W quora get larger, the variance on the request latencies gets larger, but the resilience to failures increases as does the likelihood of fresh reads. For full credit, PBS could additionally model a write / read protocol (i.e. 2-phase commit) which has different consistency properties. As Daniel Abadi discusses, when things are running well different consistency guarantees trade off between latency and the strength of consistency.

Nice work PBS team!

A little bit of recruiting

Today was a pretty good day at Cloudera, although it was not much unlike any other. Today I was:

  • checking a leader election protocol for correctness
  • cleaning up some code in an open source project
  • designing a specialised messaging system
  • working with one of our outstanding interns on the very cool project he’s taking on over the summer
  • cheerleading as Cloudera pushed another open-source project out of the nest.

We have more interesting work than we can possibly do. We need great engineers to come solve some really fascinating problems. I wanted to take advantage of the fact that this blog is currently getting a lot of traffic to put this message in front of a lot of eyeballs: you should really consider coming to work at Cloudera.

We’re hiring pretty aggressively in several engineering areas, but two in particular that I can say a lot about:

Distributed systems engineers. Engineering jobs where you are doing what we would call ‘systems’ work are few and far between. They exist, but they’re not common. One thing I have found, speaking to engineers at other companies, that a job that encourages specialisation in distributed systems is even less easily found, especially at an exciting company with plenty of really smart people. At Cloudera we need people to work on a variety of large distributed systems – execution frameworks, distributed filesystems, data ingest pipelines, schedulers, coordination systems, query processing and more. The problems you would work on are fundamental and challenging, and the people you would get to work with know the systems inside and out. The need to scale is real – we have customers who have thousands of machines in their clusters, and we write software to run across every single core.

Applications engineers. Think managing one of those clusters is easy? It isn’t. The monitoring and lifecycle management challenge posed by Apache Hadoop and other systems can be enormously complex. Each cluster produces a vast amount of data, and your job as an application engineer is to extract the right signals from that data and bring it to operations engineers through clear, meaningful visualisations so that they can make sense of all that information. We’re looking for full-stack developers, as happy writing a server process to aggregate and chomp through log files as they are building a reusable heatmap control to properly visualise a multi-dimensional distribution.

The perks at Cloudera are good – we have offices in San Francisco and Palo Alto, we have communal lunches brought in every day, there’s a subsidised gym membership, you can buy books and we’re pretty light on procedural overhead. But if you’re a good fit, you’ll probably be attracted because the problems are real, interesting and hard, and the people you’ll work with to solve them are smart, knowledgable and a whole load of fun.

If you’re interested, drop me a line or send me your resume at henry AT cloudera DOT com, and I’ll happily tell you anything you want to know about this place.

STM: Not (much more than) a research toy?

It’s a sign of how down-trodden the Software Transactional Memory (STM) effort must have become that the article (sorry, ACM subscription required) published in a recent CACM might have been just as correctly called “STM: Not as bad as the worst possible case”. The authors present a series of experiments that demonstrate that highly concurrent STM code beats sequential, single threaded code. You’d hope that this had long ago become a given, but what this demonstrates is only hey, STM allows some parallelism. And this weak lower bound got a whole article.

Another conclusion from the article is that STM performs best when there is little contention for transactions between threads. Again, that should really be a given – all reasonable concurrency primitives have high throughput when there is little contention but high parallelism. (A lot of work has gone into making this a very fast case (since it is the most common) for locking, see e.g. biased locking schemes in the Hotspot JVM).

Bryan Cantrill (previously of Fishworks, now of Joyent) rips on transactional memory more eloquently than I ever could. STM is a declarative solution to thread safety, which I like, but no more declarative really than synchronised blocks – and Cantrill points out the elephant in the room that the CACM article seemed to ignore: doing IO inside transactions is hugely problematic (because how precisely do you roll back a network packet?).

A recent paper at SOSP 2009 called Operating System Transactions attacked this problem, although not from the viewpoint of STM, but to provide atomicity and isolation for situations where bugs arise from the separation between reads, and writes that depend on that read (Time Of Check To Time Of Use – TOCTTOU). Perhaps there’s an overlap between this paper and STM approaches, but it’s not clear whether the workloads inside an operating system’s system call layer are general enough to map onto typical user-space STM work.

The Theorem That Will Not Go Away

The CAP theorem gets another airing.

I think the article makes a point worth making again, and makes it fairly well – that CAP is really about P=> ~(C & A). A couple of things I want to call out though, after a rollicking discussion on Hacker News.

“For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. You and I do not work with these types of systems because they don’t exist.”

This is a bit strong, at least theoretically. Actually all you need to not require partition-tolerance is to guarantee that your particular kryptonite failure pattern never occurs. Many protocols are robust to a dropped message here or there. A quorum system requires a fairly dramatic failure (one node completely partitioned) before one side of the partition has to occur. In practice, of course, these failures happen more often than we would like, which is why we worry about fault-tolerant properties of distributed algorithms.

Therefore the paragraph on failure probabilities is less powerful. It’s not always a problem if a single failure occurs, and therefore you shouldn’t immediately worry about sacrificing availability or consistency as soon as one node starts running slowly. CAP only establishes the existence of a failure pattern that torpedoes any distributed implementation of an atomic object, not its high probability.

CAP confusion: Problems with Partition Tolerance

Over on the Cloudera blog I’ve written an article that should be of interest to readers of this blog.

I’m no great fan of the ubiquity of the CAP theorem – it’s a solid impossibility result which appeals to the theorist in me, but it doesn’t capture every fundamental tension in a distributed system. For example: we make our systems distributed across more than one machine usually for reasons of performance and to eliminate a single point of failure. Neither of these motivations are captured verbatim by the CAP theorem. There’s more to designing distributed systems!

In this, I agree with Stonebraker; it’s the erroneous representation of ‘partition tolerance’ that I found very strange. I’ve been a good deal more forceful in private about this than I have in public :)

Apache ZooKeeper is looking for Google Summer of Code applicants

Students! Over at Apache ZooKeeper we’re looking for great students with a strong interest in distributed systems to work with us over the summer as part of Google’s Summer of Code, 2010.

Summer of Code is a great program – providing stipends to students and more importantly connecting them with mentors in open source projects. ZooKeeper has a number of interesting projects to get started on.

ZooKeeper is a distributed coordination platform on which you can build the distributed equivalents of many traditional concurrent primitives like locks, queues and barriers. It’s heavily used in the real world – Yahoo! use it extensively, and many other major companies rely on it. The other committers and I are actively looking to increase participation in the project – there is loads of really interesting work left to do. If consensus protocols, distributed systems, scalability, fault-tolerance and performance are your thing, this is certainly the project for you.

If you have any questions at all, drop me a line at henry at apache dot org.

Images are back

Thanks to those of you who bugged me about putting images back up. They should all (except one that’s causing me a bit of a headache) be back now, after I discovered copies of all in an dusty corner of my hard drive.

Real post with actual content coming in the next week or so!

“Well, I’m back.”

Astute readers may have noticed that this blog was unavailable for the past three-and-a-half months. The short reason for this was that the computer on which Paper Trail was hosted was on a boat in the Atlantic and it is surprisingly hard to get a good internet connection out there, let alone a power supply to the shipping crate it was in.

The long reason was that I have now moved across the ocean, from Cambridge to San Francisco, where I am living in order that the commute to Cloudera be a little shorter than 24 hours. My possessions finally arrived on Thursday – over three months since we shipped them – and we are finally getting everything in order, including getting this blog back online. I’m now hosted at Bluehost, and have transferred all the old posts over to the new WordPress installation. I don’t yet have access to the images, so unfortunately those wil have to wait, but most other things are in order. Please excuse the dust as I find out which links are broken.

The old address – – will still work, but the correct link is now our very own domain: Hopefully we will start showing up in Google again when the crawlers do their thing. Please update your rss readers – those of you who are still following and haven’t deleted my feed in disgust. Does anyone even still use rss readers anymore?

I am in the process of deciding what to write about for the next few posts. I still have the remainder of the theory of computation posts to write, which would be fun to do but is a bit of a departure from the systems focus. I never really fully explained Byzantine Fault Tolerance – at least, I never got as far as describing Zyzzyva and other modern systems. At the same time, some interesting stuff in systems research has happened since the blog went quiet – Google released the Go programming language which is intriguing for writing user-space systems software. SOSP 2009 happened, with some very cool papers which I really want to write about. And I’ve been busy myself – I was recently made a committer on the Apache ZooKeeper project, which is a distributed coordination system written by some engineers and researchers at Yahoo!, and is very cool. My largest contribution was a patch for ‘observers’ – which are listeners, in Paxos terminology – which help maintain the read performance of the cluster as the number of clients scales.

So, lots going on, plenty to write about, and some exciting possibilities coming down the queue. Good to be back.

GFS Retrospective in ACM Queue

This is a really great article. Sean Quinlan talks very openly and critically about the design of the Google File System given ten years of use (ten years!).

What’s interesting is that the general sentiment seems to be that the concessions that GFS made for performance and simplicity (single master, loose consistency model) have turned out to probably be net bad decisions, although they probably weren’t at the time.

There are scaling issues with GFS – the well known many-small-files problem that also plagues HDFS, and a similar huge-files problem. Both increase the amount of metadata that a master has to maintain, and both therefore degrade performance due to some linear costs on metadata operations.

A replacement for GFS is in the works – a complete re-do, not just incremental updates to the design. This will involve, at least, a rethink of metadata storage and clearly a more fault-tolerant and scalable design. Multiple appenders, also a bugbear of HDFS, may be serialised through a single writer process, which would help the consistency issues.

“Our user base has definitely migrated from being a MapReduce-based world to more of an interactive world that relies on things such as BigTable. Gmail is an obvious example of that. Videos aren’t quite as bad where GFS is concerned because you get to stream data, meaning you can buffer. Still, trying to build an interactive database on top of a file system that was designed from the start to support more batch-oriented operations has certainly proved to be a pain point.”

One of those great articles where every paragraph is rich with insight. Worth reading, probably twice.

SOSP 2009 Program Available

The accepted papers for SOSP 2009 are here. As ever, some excellent looking papers. If you search for the titles you can often turn up drafts or even the submitted versions.

The best looking sessions to me are ‘scalability’ and ‘clusters’, but there’s at least one great looking title in every session. I’ll start posting some reviews once I find some bandwidth (and have finished the computation theory series – next one on its way).

Congratulations to all accepted authors!