Paper Trail

Computer Systems, Distributed Algorithms and Databases

Distributed systems theory for the distributed systems engineer

Gwen Shapira, SA superstar and now full-time engineer at Cloudera, asked a question on Twitter that got me thinking.

My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I’ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep, usually complex, and require both serious study, and usually significant experience to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?

And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap lead me to another interesting question:

What distributed systems theory should a distributed systems engineer know?

A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer; what I consider ‘table stakes’ for distributed systems engineers competent enough to design a new system. Let me know what you think I missed!

Read the rest of this entry »

The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google

Note: this is a personal blog post, and doesn’t reflect the views of my employers at Cloudera

Map-Reduce is on its way out. But we shouldn’t measure its importance in the number of bytes it crunches, but the fundamental shift in data processing architectures it helped popularise.

This morning, at their I/O Conference, Google revealed that they’re not using Map-Reduce to process data internally at all any more.

We shouldn’t be surprised. The writing has been on the wall for Map-Reduce for some time. The truth is that Map-Reduce as a processing paradigm continues to be severely restrictive, and is no more than a subset of richer processing systems.

Read the rest of this entry »

Paper notes: MemC3, a better Memcached

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing

Fan and Andersen, NSDI 2013

The big idea: This is a paper about choosing your data structures and algorithms carefully. By paying careful attention to the workload and functional requirements, the authors reimplement memcached to achieve a) better concurrency and b) better space efficiency. Specifically, they introduce a variant of cuckoo hashing that is highly amenable to concurrent workloads, and integrate the venerable CLOCK cache eviction algorithm with the hash table for space-efficient approximate LRU.

Read the rest of this entry »

Paper notes: Anti-Caching

Anti-Caching: A New Approach to Database Management System Architecture

DeBrabant et. al., VLDB 2013

The big idea: Traditional databases typically rely on the OS page cache to bring hot tuples into memory and keep them there. This suffers from a number of problems:

  • No control over granularity of caching or eviction (so keeping a tuple in memory might keep all the tuples in its page as well, even though there’s not necessarily a usage correlation between them)
  • No control over when fetches are performed (fetches are typically slow, and transactions may hold onto locks or latches while the access is being made)
  • Duplication of resources – tuples can occupy both disk blocks and memory pages.

Instead, this paper proposes a DB-controlled mechanism for tuple caching and eviction called anti-caching. The idea is that the DB chooses exactly what to evict and when. The ‘anti’ aspect arises when you consider that the disk is now the place to store recently unused tuples, not the source of ground truth for the entire database. The disk, in fact, can’t easily store the tuples that are in memory because, as we shall see, the anti-caching mechanism may choose to write tuples into arbitrary blocks upon eviction, which will not correspond to the pages that are already on disk, giving rise to a complex rewrite / compaction problem on eviction. The benefits are realised partly in IO: only tuples that are cold are evicted (rather than those that are unluckily sitting in the same page), and fetches and evictions may be batched.

Read the rest of this entry »

Paper notes: Stream Processing at Google with Millwheel

MillWheel: Fault-Tolerant Stream Processing at Internet Scale

Akidau et. al., VLDB 2013

The big idea: Streaming computations at scale are nothing new. Millwheel is a standard DAG stream processor, but one that runs at ‘Google’ scale. This paper really answers the following questions: what guarantees should be made about delivery and fault-tolerance to support most common use cases cheaply? What optimisations become available if you choose these guarantees carefully?

Read the rest of this entry »

Paper notes: DB2 with BLU Acceleration

DB2 with BLU Acceleration: So Much More than Just a Column Store

Raman et. al., VLDB 2013

The big idea: IBM’s venerable DB2 technology was based on traditional row-based technology. By moving to a columnar execution engine, and crucially then by taking full advantage of the optimisations that columnar formats allow, the ‘BLU Acceleration’ project was able to improve read-mostly BI workloads by a 10 to 50 times speed-up.

Read the rest of this entry »

Étale cohomology

The second in an extremely irregular series of posts made on behalf of my father, who has spent much of his retirement so far doing very hard mathematics. What is attached here is the essay he wrote for the Part III of the Cambridge Mathematical Tripos, a one year taught course. The subject is the Étale cohomology.

Says my Dad: “I am afraid that I have been lured away from the translation of SGA 4.5 for some time by the attraction of working on Wolfgang Krull’s report on “Idealtheorie” from 1935 (again I am not aware of an English version anywhere) which is yet another important classic. However during a year at Cambridge I did write an essay as a very basic introduction to Étale Cohomology which was based on the first part of SGA 4.5. So with the usual imprecation of caveat lector, here it is as a temporising partial substitute should any other beginner be interested.”

Here’s part of the introduction:

This essay has been written as part of the one year Certificate of Advanced Study in Mathematics (CASM) course at Cambridge University which coincides with Part III of the Mathematical Tripos. The starting point is, of necessity, roughly that reached in the lectures which in this particular year did not include much in the way of schemes and sheaves, nor, in the case of the author, much in the way of algebraic number theory. Thus the frontiers of the subject can safely rest undisturbed by the contents of this essay. Rather it has been written with a reader in mind corresponding roughly to the author at the start of the enterprise. That is someone who is interested to find out what all the fuss was with the French algebraic geometers in the 1960s but is in need of some fairly elementary background to map out the abstractions involved and with any luck to avoid drowning in the “rising sea”.

And here’s the essay itself!

ByteArrayOutputStream is really, really slow sometimes in JDK6

TLDR: Yesterday I mentioned on Twitter that I’d found a bad performance problem when writing to a large ByteArrayOutputStream in Java. After some digging, it appears to be the case that there’s a bad bug in JDK6 that doesn’t affect correctness, but does cause performance to nosedive when a ByteArrayOutputStream gets large. This post explains why.

Read the rest of this entry »

On Raft, briefly

Raft is a new-ish consensus implementation whose great benefit, to my mind it, is its applicability for real systems. We briefly discussed it internally at Cloudera, and I thought I’d share what I contributed, below. There’s an underlying theme here regarding the role of distributed systems research in practitioners’ daily work, and how the act of building a distributed system has not yet been sufficiently well commoditised to render a familiarity with the original research unnecessary. I think I’d argue that bridging that gap further is necessary: no matter how much fun it is to read all these papers, it shouldn’t be a pre-requisite to being successful in implementing a distributed system. I have more to write on this.

“The trouble with Paxos is that it’s ‘only’ a consensus algorithm; a theoretical achievement but not one necessarily suited to building practical systems. Remember that the demonstration that a correct, message-optimal protocol even existed was the main contribution. To that end, a lot of practical considerations were left by the wayside. Leader election is an exercise for the reader (since Paxos is robust to bad implementations where there are several leaders, it doesn’t matter what election scheme is used). Paxos is not concerned with ‘logs’ at all; that it can be used to build replicated-state machines with durable logs is a corollary, not the main theorem. Raft fills in a ton of these gaps, and more power to them for doing so. The leader election algorithm is set in stone. There are additional constraints to ensure that updates are seen and processed in hole-free order (Paxos doesn’t guarantee this), which is exactly what you want from a distributed log. Raft also specifies a view-change algorithm, which Paxos does not, but VS replication does. The huge effort required to get ZOOKEEPER-107 committed shows how hard this is to retrofit onto an existing system. So: there’s a tendency to conflate ‘distributed replicated with strong consistency properties’ with ‘consensus algorithm’. Consensus shows you can agree on a single value, multi-Paxos shows you can agree on a bunch of them, but neither give you a complete system for a replicated log which is actually what most of our distributed systems want to interact with.”

Some miscellanea


I wrote an FAQ on The CAP Theorem. The aim is to definitively settle some of the common misconceptions around CAP so as to help prevent its invocation in useless places. If someone says they got around CAP, refer them to the FAQ. It should be a pretty simple introduction to the theorem as well. I think that CAP itself is a pretty uninteresting result, but it does at least shine a light on tradeoffs implicit in distributed systems. I have a couple of residual thoughts about failures rather than partitions that I might write up at some point in the future, but otherwise I hope the FAQ helps move the conversation on.

Impala and aggregation trees

I also wrote a quick answer on Quora about Impala’s execution trees, and how deeper trees help do aggregation more effectively. There’s a lot more of interest to write about planning, partitioning and scheduling queries.

Not so HotOS

Matt Welsh talks about what he wishes systems researchers would work on. I partially agree: there are few papers in the HotOS 2013 program that pass the “if this were true, a lot of our assumptions would be wrong” test. I don’t think that IO is an unworthy topic however, but the main problem with improving IO interfaces is inertia rather than want for better ideas. I hope that SSDs and / or flash-as-RAM will be forcing functions here. Otherwise his big topics are all worthy research challenges, but I have always seen HotOS as a venue for new ideas, not new topics – despite its name. If it really were intended to be a venue for the fashionable areas of study in systems it would have much less potential value.

Building distributed systems

Andy Gross (Basho VP Eng) gave a closing keynote calling for more reusable primitives in distributed systems. I couldn’t agree more (and have been gently complaining about this for years to anyone who would listen), although I think doing this right is not straightforward and requires a lot of careful thought about RPC mechanisms, libraries vs. services and much more. I have some thoughts on this that I want to wrap up in a blog post sometime soon.

Highly available transactions

Peter Bailis, a Berkeley PhD candidate, has some solid work on Highly Available Transactions, based on exploring the space of consistency models available to databases and finding that there are demonstrably useful consistency guarantees that can be made when availability is kept high. This is the right way to move on from all the CAP arguments – given that we are stuck with certain impossibility results, let’s map out the vast terrain that we can conquer.