Raft is a new-ish consensus implementation whose great benefit, to my mind it, is its applicability for real systems. We briefly discussed it internally at Cloudera, and I thought I’d share what I contributed, below. There’s an underlying theme here regarding the role of distributed systems research in practitioners’ daily work, and how the act of building a distributed system has not yet been sufficiently well commoditised to render a familiarity with the original research unnecessary. I think I’d argue that bridging that gap further is necessary: no matter how much fun it is to read all these papers, it shouldn’t be a pre-requisite to being successful in implementing a distributed system. I have more to write on this.
“The trouble with Paxos is that it’s ‘only’ a consensus algorithm; a theoretical achievement but not one necessarily suited to building practical systems. Remember that the demonstration that a correct, message-optimal protocol even existed was the main contribution. To that end, a lot of practical considerations were left by the wayside. Leader election is an exercise for the reader (since Paxos is robust to bad implementations where there are several leaders, it doesn’t matter what election scheme is used). Paxos is not concerned with ‘logs’ at all; that it can be used to build replicated-state machines with durable logs is a corollary, not the main theorem.
Raft fills in a ton of these gaps, and more power to them for doing so. The leader election algorithm is set in stone. There are additional constraints to ensure that updates are seen and processed in hole-free order (Paxos doesn’t guarantee this), which is exactly what you want from a distributed log. Raft also specifies a view-change algorithm, which Paxos does not, but VS replication does. The huge effort required to get ZOOKEEPER-107 committed shows how hard this is to retrofit onto an existing system.
So: there’s a tendency to conflate ‘distributed replicated with strong consistency properties’ with ‘consensus algorithm’. Consensus shows you can agree on a single value, multi-Paxos shows you can agree on a bunch of them, but neither give you a complete system for a replicated log which is actually what most of our distributed systems want to interact with.”
I wrote an FAQ on The CAP Theorem. The aim is to definitively settle some of the common misconceptions around CAP so as to help prevent its invocation in useless places. If someone says they got around CAP, refer them to the FAQ. It should be a pretty simple introduction to the theorem as well. I think that CAP itself is a pretty uninteresting result, but it does at least shine a light on tradeoffs implicit in distributed systems. I have a couple of residual thoughts about failures rather than partitions that I might write up at some point in the future, but otherwise I hope the FAQ helps move the conversation on.
Impala and aggregation trees
I also wrote a quick answer on Quora about Impala’s execution trees, and how deeper trees help do aggregation more effectively. There’s a lot more of interest to write about planning, partitioning and scheduling queries.
Not so HotOS
Matt Welsh talks about what he wishes systems researchers would work on. I partially agree: there are few papers in the HotOS 2013 program that pass the “if this were true, a lot of our assumptions would be wrong” test. I don’t think that IO is an unworthy topic however, but the main problem with improving IO interfaces is inertia rather than want for better ideas. I hope that SSDs and / or flash-as-RAM will be forcing functions here. Otherwise his big topics are all worthy research challenges, but I have always seen HotOS as a venue for new ideas, not new topics – despite its name. If it really were intended to be a venue for the fashionable areas of study in systems it would have much less potential value.
Building distributed systems
Andy Gross (Basho VP Eng) gave a closing keynote calling for more reusable primitives in distributed systems. I couldn’t agree more (and have been gently complaining about this for years to anyone who would listen), although I think doing this right is not straightforward and requires a lot of careful thought about RPC mechanisms, libraries vs. services and much more. I have some thoughts on this that I want to wrap up in a blog post sometime soon.
Highly available transactions
Peter Bailis, a Berkeley PhD candidate, has some solid work on Highly Available Transactions, based on exploring the space of consistency models available to databases and finding that there are demonstrably useful consistency guarantees that can be made when availability is kept high. This is the right way to move on from all the CAP arguments – given that we are stuck with certain impossibility results, let’s map out the vast terrain that we can conquer.
You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage – which is a simple idea – can drastically improve performance for certain workloads.
Caveat: This is a personal, general research summary post, and as usual doesn’t neccessarily reflect our thinking at Cloudera about columnar storage.
Disks are still the major bottleneck in query execution over large datasets. Even a machine with twelve disks running in parallel (for an aggregate bandwidth of north of 1GB/s) can’t keep all the cores busy; running a query against memory-cached data can get tens of GB/s of throughput. IO bandwidth matters. Therefore, the best thing an engineer can do to improve the performance of disk-based query engines (like RDBMs and Impala) usually is to improve the performance of reading bytes from disk. This can mean decreasing the latency (for small queries where the time to find the data to read might dominate), but most usually this means improving the effective throughput of reads from disk.
The traditional way to improve disk bandwidth has been to wait, and allow disks to get faster. However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can’t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.
The other way to improve disk performance is to maximise the ratio of ‘useful’ bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem. Enter columnar storage, a principle for file format design that aims to do exactly that for query engines that deal with record-based data.
If you have a strong background in either databases or distributed systems, and fancy working on such an exciting technology, send me a note!
It’s great to finally be able to say something about what I’ve been working at Cloudera for nearly a year. At StrataConf / Hadoop World in New York a couple of weeks ago we announced Cloudera Impala. Impala is a distributed query execution engine that understands a subset of SQL, and critically runs over HDFS and HBase as storage managers. It’s very similar in functionality to Apache Hive, but it is much, much, much (anecdotally up to 100x) faster.
There’s one particular aspect of the Paxos protocol that gives readers of this blog – and for some time, me! – some difficulty. This short post tries to clear up some confusion on a part of the protocol that is poorly explained in pretty much every major description.