Paper Trail

Computer Systems, Distributed Algorithms and Databases

Make any algorithm lock-free with this one crazy trick

Lock-free algorithms often operate by having several versions of a data structure in use at one time. The general pattern is that you can prepare an update to a data structure, and then use a machine primitive to atomically install the update by changing a pointer. This means that all subsequent readers will follow the pointer to its new location – for example, to a new node in a linked-list – but this pattern can’t do anything about readers that have already followed the old pointer value, and are traversing the previous version of the data structure.
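To make that concrete, here is a minimal C11 sketch (mine, not from any of the posts discussed here) of the publish step: a new node is prepared privately and then installed as the list head with an atomic compare-and-swap, while readers that already loaded the old head simply keep traversing the old version.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct node {
        int value;
        struct node *next;
    };

    /* Shared list head; readers load it and traverse without locks. */
    static _Atomic(struct node *) head;

    /* Prepare the update privately, then atomically install it by changing
     * the head pointer. Readers that already followed the old value are
     * untouched: they continue traversing the previous version. */
    void push(int value)
    {
        struct node *n = malloc(sizeof(*n));
        n->value = value;

        struct node *old = atomic_load(&head);
        do {
            n->next = old;
        } while (!atomic_compare_exchange_weak(&head, &old, n));
    }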

Those readers will see a correct, linearizable version of the data structure, so this pattern doesn’t present a correctness concern. Instead, the problem is garbage collection: who retires the old version of the data structure, to free the memory it’s taking now that it’s unreachable? To put it another way: how do you tell when all possible readers have finished reading an old version?

Of course, there are many techniques for solving this reclamation problem. See this paper for a survey, and this paper for a recent improvement over epoch-based reclamation. RCU, which is an API for enabling single-writer, multi-reader concurrency in the Linux kernel, has an elegant way of solving the problem.

Every reader in RCU marks a critical section by calling rcu_read_lock() / rcu_read_unlock(). Writers typically take responsibility for memory reclamation, which means they have to wait for all in-flight critical sections to complete. The way this is done is, conceptually, really, really simple: during a critical section, a thread may not block or be pre-empted by the scheduler. So as soon as a thread yields its CPU, it’s guaranteed to be out of its critical section.
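Concretely, a read-side critical section looks something like the sketch below, which is modelled on the kernel documentation’s examples; struct foo and gbl_foo are illustrative names, not anything from a real subsystem.

    /* A reader-visible structure published via an RCU-protected pointer.
     * The rcu_head member is used later for deferred reclamation. */
    struct foo {
        int a;
        struct rcu_head rcu;
    };

    static struct foo __rcu *gbl_foo;

    int read_a(void)
    {
        int a;

        rcu_read_lock();                  /* enter critical section: no
                                             blocking, no pre-emption    */
        a = rcu_dereference(gbl_foo)->a;  /* safely follow the pointer   */
        rcu_read_unlock();                /* leave critical section      */
        return a;
    }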

This gives RCU a really simple scheme to check for a grace period after a write has finished: try to run a thread on every CPU in the system. Once the thread runs, a context switch has happened, which means any previous critical sections must have completed, and so any previous version of the data structure can be reclaimed. The readers don’t have to do anything to signal that they are finished with their critical section: it’s implicit in them starting to accept context switches again!
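On the updater’s side all of that is hidden behind a single call. A sketch, again along the lines of the kernel documentation and using the illustrative struct foo from above (writers are assumed to be serialised by some external lock):

    void update_a(int new_a)
    {
        struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
        struct foo *old_fp = rcu_dereference_protected(gbl_foo, 1);

        *new_fp = *old_fp;                   /* copy...                    */
        new_fp->a = new_a;                   /* ...update...               */
        rcu_assign_pointer(gbl_foo, new_fp); /* ...publish the new version */

        synchronize_rcu();                   /* wait for a grace period    */
        kfree(old_fp);                       /* no reader can still see it */
    }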

In reality, this is not quite what RCU does (but the idea is the same; see this terrific series). Instead, it takes advantage of kernel context-switch counters and waits for them to increase:

“In practice Linux implements synchronize_rcu by waiting for all CPUs in the system to pass through a context switch, instead of scheduling a thread on each CPU. This design optimizes the Linux RCU implementation for low-cost RCU critical sections, but at the cost of delaying synchronize_rcu callers longer than necessary. In principle, a writer waiting for a particular reader need only wait for that reader to complete an RCU critical section. The reader, however, must communicate to the writer that the RCU critical section is complete. The Linux RCU implementation essentially batches reader-to-writer communication by waiting for context switches. When possible, writers can use an asynchronous version of synchronize_rcu, call_rcu, that will asynchronously invoke a specified callback after all CPUs have passed through at least one context switch.”

From RCU Usage In the Linux Kernel: One Decade Later.
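The asynchronous variant mentioned at the end of the quote looks like this in the same sketch: instead of blocking in synchronize_rcu, the updater registers a callback that runs once the grace period has elapsed.

    static void free_old_foo(struct rcu_head *head)
    {
        kfree(container_of(head, struct foo, rcu));
    }

    void update_a_async(int new_a)
    {
        struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
        struct foo *old_fp = rcu_dereference_protected(gbl_foo, 1);

        *new_fp = *old_fp;
        new_fp->a = new_a;
        rcu_assign_pointer(gbl_foo, new_fp);

        /* Deferred reclamation: free_old_foo runs after every CPU has
         * passed through at least one context switch. */
        call_rcu(&old_fp->rcu, free_old_foo);
    }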

The most elegant thing about vanilla RCU is that the system is lock-free by definition, not by design – it has nothing to do with the semantics of RCU’s primitives, and everything to do with the fact that being in the kernel allows you to enforce a sympathetic system model. If threads can’t block or otherwise be prevented from making progress, any (non-pathological) algorithm must, by definition, always make progress! Even a concurrency scheme that nominally used spinlocks to protect critical sections would be lock-free, because every thread would exit its critical section in bounded time – the other threads would all be serialised behind this lock, but there would be progress.

(There are other flavours of RCU that don’t restrict critical sections in this way; instead, they allow critical sections to be pre-empted).

Distributed systems theory for the distributed systems engineer

Gwen Shapira, SA superstar and now full-time engineer at Cloudera, asked a question on Twitter that got me thinking.

My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I’ve come to think that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are deep and complex, and they require serious study, and usually significant experience, to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?

And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap led me to another interesting question:

What distributed systems theory should a distributed systems engineer know?

A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts applicable to my everyday job as a distributed systems engineer; ‘table stakes’ for distributed systems engineers competent enough to design a new system. Let me know what you think I missed!


The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google

Note: this is a personal blog post, and doesn’t reflect the views of my employer, Cloudera.

Map-Reduce is on its way out. But we shouldn’t measure its importance by the number of bytes it crunches; we should measure it by the fundamental shift in data processing architectures it helped popularise.

This morning, at their I/O Conference, Google revealed that they’re not using Map-Reduce to process data internally at all any more.

We shouldn’t be surprised. The writing has been on the wall for Map-Reduce for some time. The truth is that Map-Reduce as a processing paradigm continues to be severely restrictive, and is no more than a subset of richer processing systems.


Paper notes: MemC3, a better Memcached


MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing


Fan et al., NSDI 2013

The big idea: This is a paper about choosing your data structures and algorithms carefully. By paying careful attention to the workload and functional requirements, the authors reimplement memcached to achieve a) better concurrency and b) better space efficiency. Specifically, they introduce a variant of cuckoo hashing that is highly amenable to concurrent workloads, and integrate the venerable CLOCK cache eviction algorithm with the hash table for space-efficient approximate LRU.
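For a flavour of the ‘smarter hashing’ half, here is a minimal sketch of plain two-table cuckoo hashing in C. The names are mine, and the real MemC3 design goes further (4-way set-associative buckets, key tags, and optimistic versioning so readers never take locks), but the two-candidate-slots idea is the same.

    #include <stddef.h>
    #include <stdint.h>

    #define NSLOTS    4096   /* slots per table                            */
    #define MAX_KICKS 500    /* give up (and resize) after this many moves */

    struct slot { uint64_t key; uint64_t val; int used; };
    static struct slot table[2][NSLOTS];

    static size_t h1(uint64_t k) { return (k * 0x9E3779B97F4A7C15ULL) % NSLOTS; }
    static size_t h2(uint64_t k) { return (k ^ (k >> 29)) % NSLOTS; }

    /* Each key lives in one of exactly two candidate slots, so a lookup
     * probes at most two locations. */
    int cuckoo_lookup(uint64_t key, uint64_t *val)
    {
        struct slot *s0 = &table[0][h1(key)];
        struct slot *s1 = &table[1][h2(key)];
        if (s0->used && s0->key == key) { *val = s0->val; return 1; }
        if (s1->used && s1->key == key) { *val = s1->val; return 1; }
        return 0;
    }

    /* Insert (assuming the key isn't already present) by displacing
     * whatever occupies the candidate slot into its alternate table,
     * repeating until a free slot turns up. */
    int cuckoo_insert(uint64_t key, uint64_t val)
    {
        int which = 0;
        for (int i = 0; i < MAX_KICKS; i++) {
            size_t idx = which ? h2(key) : h1(key);
            struct slot *s = &table[which][idx];
            if (!s->used) { s->key = key; s->val = val; s->used = 1; return 1; }

            /* kick the occupant out and re-insert it into its other table */
            uint64_t tmp_key = s->key, tmp_val = s->val;
            s->key = key; s->val = val;
            key = tmp_key; val = tmp_val;
            which = !which;
        }
        return 0;   /* insertion failed; the table needs resizing */
    }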


Paper notes: Anti-Caching


Anti-Caching: A New Approach to Database Management System Architecture


DeBrabant et al., VLDB 2013

The big idea: Traditional databases typically rely on the OS page cache to bring hot tuples into memory and keep them there. This suffers from a number of problems:

  • No control over granularity of caching or eviction (so keeping a tuple in memory might keep all the tuples in its page as well, even though there’s not necessarily a usage correlation between them)
  • No control over when fetches are performed (fetches are typically slow, and transactions may hold onto locks or latches while the access is being made)
  • Duplication of resources – tuples can occupy both disk blocks and memory pages.

Instead, this paper proposes a DB-controlled mechanism for tuple caching and eviction called anti-caching. The idea is that the DB chooses exactly what to evict and when. The ‘anti’ aspect arises when you consider that the disk is now the place to store recently unused tuples, not the source of ground truth for the entire database. The disk, in fact, can’t easily store the tuples that are in memory because, as we shall see, the anti-caching mechanism may choose to write tuples into arbitrary blocks upon eviction, which will not correspond to the pages that are already on disk, giving rise to a complex rewrite / compaction problem on eviction. The benefits are realised partly in IO: only tuples that are cold are evicted (rather than those that are unluckily sitting in the same page), and fetches and evictions may be batched.
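A rough sketch of the eviction path as I read the paper; all names and the fixed-size tuple layout below are invented for illustration. The point is simply that the database, not the OS, picks the cold tuples and writes them out in a batch to a freshly created anti-cache block, leaving a small stub behind in memory.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct tuple {
        uint64_t id;
        uint64_t last_access;   /* LRU metadata tracked by the DB itself     */
        int      evicted;       /* if set, the data lives in a block on disk */
        char     data[64];
    };

    /* Write the chosen cold tuples, batched, into one anti-cache block and
     * replace their in-memory contents with a stub. */
    void evict_cold_tuples(struct tuple **cold, size_t n, uint64_t block_id)
    {
        char path[64];
        snprintf(path, sizeof(path), "anticache-%llu.blk",
                 (unsigned long long)block_id);
        FILE *block = fopen(path, "wb");
        if (!block)
            return;

        for (size_t i = 0; i < n; i++) {
            fwrite(cold[i]->data, sizeof(cold[i]->data), 1, block);
            cold[i]->evicted = 1;                           /* stub remains */
            memset(cold[i]->data, 0, sizeof(cold[i]->data));
        }
        fclose(block);
    }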


Paper notes: Stream Processing at Google with MillWheel


MillWheel: Fault-Tolerant Stream Processing at Internet Scale


Akidau et al., VLDB 2013

The big idea: Streaming computations at scale are nothing new. MillWheel is a standard DAG stream processor, but one that runs at ‘Google’ scale. This paper really answers the following questions: what guarantees should be made about delivery and fault-tolerance to support most common use cases cheaply? What optimisations become available if you choose these guarantees carefully?
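As a caricature of the delivery guarantee the paper settles on, exactly-once processing built from record-ID deduplication plus an atomic checkpoint of state and pending productions, here is a toy sketch. Nothing below is MillWheel’s API; the in-memory array merely stands in for the replicated storage the real system uses.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct record { uint64_t id; const char *key; const char *value; };

    /* Toy stand-in for per-key persistent state: in the real system the
     * dedup set, user state and pending productions are checkpointed
     * atomically to replicated storage before the sender is acked. */
    #define MAX_SEEN 1024
    static uint64_t seen_ids[MAX_SEEN];
    static size_t   n_seen;

    static bool seen_before(uint64_t id)
    {
        for (size_t i = 0; i < n_seen; i++)
            if (seen_ids[i] == id)
                return true;
        return false;
    }

    /* Exactly-once, in caricature: drop duplicate retries, run the user's
     * computation, persist (state + productions + record id) as one atomic
     * checkpoint, ack upstream, then emit downstream. A crash before the
     * checkpoint just causes an upstream retry; a crash after it makes the
     * retry a recognisable duplicate. */
    void deliver(const struct record *r)
    {
        if (seen_before(r->id)) {
            printf("re-ack duplicate %llu\n", (unsigned long long)r->id);
            return;
        }
        /* ... user code runs here, updating state and producing records ... */
        if (n_seen < MAX_SEEN)
            seen_ids[n_seen++] = r->id;   /* stands in for the checkpoint */
        printf("ack %llu and send productions downstream\n",
               (unsigned long long)r->id);
    }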


Paper notes: DB2 with BLU Acceleration


DB2 with BLU Acceleration: So Much More than Just a Column Store

Raman et al., VLDB 2013

The big idea: IBM’s venerable DB2 was built on a traditional row-based engine. By moving to a columnar execution engine, and crucially by then taking full advantage of the optimisations that columnar formats allow, the ‘BLU Acceleration’ project was able to speed up read-mostly BI workloads by a factor of 10 to 50.
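A toy example of the kind of optimisation that becomes available once data is columnar; this is generic column-store technique rather than BLU’s actual machinery. With a dictionary-encoded column, a predicate is evaluated against small fixed-width codes in a tight, cache- and SIMD-friendly loop over a packed array, with no row reassembly and no string comparisons; the dictionary is only consulted for output.

    #include <stdint.h>
    #include <stdio.h>

    /* A dictionary-encoded column: each row stores a small code; the
     * dictionary maps codes back to the wider original values. */
    #define NROWS 8
    static const char   *dict[]       = { "DE", "FR", "UK", "US" };
    static const uint8_t codes[NROWS] = { 3, 1, 3, 0, 2, 3, 1, 3 };

    int main(void)
    {
        /* "country = 'US'" becomes a comparison against a single code. */
        const uint8_t us = 3;
        int matches = 0;
        for (int row = 0; row < NROWS; row++)
            if (codes[row] == us)
                matches++;

        printf("%d rows match country = '%s'\n", matches, dict[us]);
        return 0;
    }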


Étale cohomology

The second in an extremely irregular series of posts made on behalf of my father, who has spent much of his retirement so far doing very hard mathematics. What is attached here is the essay he wrote for Part III of the Cambridge Mathematical Tripos, a one-year taught course. The subject is Étale cohomology.

Says my Dad: “I am afraid that I have been lured away from the translation of SGA 4.5 for some time by the attraction of working on Wolfgang Krull’s report on “Idealtheorie” from 1935 (again I am not aware of an English version anywhere) which is yet another important classic. However during a year at Cambridge I did write an essay as a very basic introduction to Étale Cohomology which was based on the first part of SGA 4.5. So with the usual imprecation of caveat lector, here it is as a temporising partial substitute should any other beginner be interested.”

Here’s part of the introduction:

This essay has been written as part of the one year Certificate of Advanced Study in Mathematics (CASM) course at Cambridge University which coincides with Part III of the Mathematical Tripos. The starting point is, of necessity, roughly that reached in the lectures which in this particular year did not include much in the way of schemes and sheaves, nor, in the case of the author, much in the way of algebraic number theory.
Thus the frontiers of the subject can safely rest undisturbed by the contents of this essay. Rather it has been written with a reader in mind corresponding roughly to the author at the start of the enterprise. That is someone who is interested to find out what all the fuss was with the French algebraic geometers in the 1960s but is in need of some fairly elementary background to map out the abstractions involved and with any luck to avoid drowning in the “rising sea”.

And here’s the essay itself!

ByteArrayOutputStream is really, really slow sometimes in JDK6

TLDR: Yesterday I mentioned on Twitter that I’d found a bad performance problem when writing to a large ByteArrayOutputStream in Java. After some digging, it appears that there’s a bug in JDK6 that doesn’t affect correctness, but does cause performance to nosedive when a ByteArrayOutputStream gets large. This post explains why.


On Raft, briefly

Raft is a new-ish consensus implementation whose great benefit, to my mind, is its applicability to real systems. We briefly discussed it internally at Cloudera, and I thought I’d share what I contributed, below. There’s an underlying theme here regarding the role of distributed systems research in practitioners’ daily work, and how the act of building a distributed system has not yet been sufficiently well commoditised to render a familiarity with the original research unnecessary. I think I’d argue that bridging that gap further is necessary: no matter how much fun it is to read all these papers, it shouldn’t be a pre-requisite to being successful in implementing a distributed system. I have more to write on this.

“The trouble with Paxos is that it’s ‘only’ a consensus algorithm; a theoretical achievement but not one necessarily suited to building practical systems. Remember that the demonstration that a correct, message-optimal protocol even existed was the main contribution. To that end, a lot of practical considerations were left by the wayside. Leader election is an exercise for the reader (since Paxos is robust to bad implementations where there are several leaders, it doesn’t matter what election scheme is used). Paxos is not concerned with ‘logs’ at all; that it can be used to build replicated-state machines with durable logs is a corollary, not the main theorem.

Raft fills in a ton of these gaps, and more power to them for doing so. The leader election algorithm is set in stone. There are additional constraints to ensure that updates are seen and processed in hole-free order (Paxos doesn’t guarantee this), which is exactly what you want from a distributed log. Raft also specifies a view-change algorithm, which Paxos does not, but VS replication does. The huge effort required to get ZOOKEEPER-107 committed shows how hard this is to retrofit onto an existing system.

So: there’s a tendency to conflate ‘distributed replicated with strong consistency properties’ with ‘consensus algorithm’. Consensus shows you can agree on a single value, multi-Paxos shows you can agree on a bunch of them, but neither give you a complete system for a replicated log which is actually what most of our distributed systems want to interact with.”
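To make the ‘hole-free order’ point concrete, here is a small sketch of the consistency check a Raft follower applies in AppendEntries (types and names are mine, not from the paper or any particular implementation): new entries are accepted only if the follower already holds the leader’s preceding entry, so a replica’s log can never contain gaps.

    #include <stdbool.h>
    #include <stdint.h>

    struct entry { uint64_t term; /* plus the command itself */ };

    struct raft_log {
        struct entry entries[1024];  /* 1-indexed, as in the Raft paper */
        uint64_t     last_index;     /* highest index currently stored  */
    };

    /* Follower-side check from AppendEntries: accept new entries only if
     * we already hold the entry the leader says immediately precedes them,
     * with a matching term. A rejection makes the leader back up and
     * resend earlier entries, so replicated logs stay dense prefixes. */
    bool append_entries_ok(const struct raft_log *log,
                           uint64_t prev_log_index, uint64_t prev_log_term)
    {
        if (prev_log_index == 0)
            return true;                  /* appending from the start      */
        if (prev_log_index > log->last_index)
            return false;                 /* accepting would leave a hole  */
        return log->entries[prev_log_index].term == prev_log_term;
    }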