Make any algorithm lock-free with this one crazy trick

Lock-free algorithms often operate by having several versions of a data structure in use at one time. The general pattern is that you can prepare an update to a data structure, and then use a machine primitive to atomically install the update by changing a pointer. This means that all subsequent readers will follow the pointer to its new location – for example, to a new node in a linked-list – but this pattern can’t do anything about readers that have already followed the old pointer value, and are traversing the previous version of the data structure.

Continue reading

The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google

Note: this is a personal blog post, and doesn’t reflect the views of my employers at Cloudera

Map-Reduce is on its way out. But we shouldn’t measure its importance in the number of bytes it crunches, but the fundamental shift in data processing architectures it helped popularise.

This morning, at their I/O Conference, Google revealed that they’re not using Map-Reduce to process data internally at all any more.

We shouldn’t be surprised. The writing has been on the wall for Map-Reduce for some time. The truth is that Map-Reduce as a processing paradigm continues to be severely restrictive, and is no more than a subset of richer processing systems.

Continue reading

On Raft, briefly

Raft is a new-ish consensus implementation whose great benefit, to my mind it, is its applicability for real systems. We briefly discussed it internally at Cloudera, and I thought I’d share what I contributed, below. There’s an underlying theme here regarding the role of distributed systems research in practitioners’ daily work, and how the act of building a distributed system has not yet been sufficiently well commoditised to render a familiarity with the original research unnecessary. I think I’d argue that bridging that gap further is necessary: no matter how much fun it is to read all these papers, it shouldn’t be a pre-requisite to being successful in implementing a distributed system. I have more to write on this.

“The trouble with Paxos is that it’s ‘only’ a consensus algorithm; a theoretical achievement but not one necessarily suited to building practical systems. Remember that the demonstration that a correct, message-optimal protocol even existed was the main contribution. To that end, a lot of practical considerations were left by the wayside. Leader election is an exercise for the reader (since Paxos is robust to bad implementations where there are several leaders, it doesn’t matter what election scheme is used). Paxos is not concerned with ‘logs’ at all; that it can be used to build replicated-state machines with durable logs is a corollary, not the main theorem.

Raft fills in a ton of these gaps, and more power to them for doing so. The leader election algorithm is set in stone. There are additional constraints to ensure that updates are seen and processed in hole-free order (Paxos doesn’t guarantee this), which is exactly what you want from a distributed log. Raft also specifies a view-change algorithm, which Paxos does not, but VS replication does. The huge effort required to get ZOOKEEPER-107 committed shows how hard this is to retrofit onto an existing system.

So: there’s a tendency to conflate ‘distributed replicated with strong consistency properties’ with ‘consensus algorithm’. Consensus shows you can agree on a single value, multi-Paxos shows you can agree on a bunch of them, but neither give you a complete system for a replicated log which is actually what most of our distributed systems want to interact with.”


  • Reasoning about Knowledge
  • Toward a Cloud Computing Research Agenda (2009) –

    “One of the LADIS attendees commented at some point that Byzantine Consensus could be used to improve Chubby, making it tolerant of faults that could disrupt it as currently implemented. But for our keynote speakers, enhancing Chubby to tolerate such faults turns out to be of purely academic interest.”

  • Low-level data structures

    The llds general working thesis is: for large memory applications, virtual memory layers can hurt application performance due to increased memory latency when dealing with large data structures. Specifically, data page tables/directories within the kernel and increased DRAM requests can be avoided to boost application memory access.

  • High-Performance Concurrency Control for Main-Memory Databases (via High Scalability) – MVCC is interesting and elegant, and also underpins some datastores with persistence, like HBase. I like this paper as the best survey.