Paper Trail

Computer Systems, Distributed Algorithms and Databases

Distributed systems theory for the distributed systems engineer

Gwen Shapira, SA superstar and now full-time engineer at Cloudera, asked a question on Twitter that got me thinking. My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would […]

The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google

Note: this is a personal blog post, and doesn’t reflect the views of my employers at Cloudera Map-Reduce is on its way out. But we shouldn’t measure its importance in the number of bytes it crunches, but the fundamental shift in data processing architectures it helped popularise. This morning, at their I/O Conference, Google revealed […]

Paper notes: MemC3, a better Memcached

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing Fan and Andersen, NSDI 2013 The big idea: This is a paper about choosing your data structures and algorithms carefully. By paying careful attention to the workload and functional requirements, the authors reimplement memcached to achieve a) better concurrency and b) better space efficiency. […]

Paper notes: Anti-Caching

Anti-Caching: A New Approach to Database Management System Architecture DeBrabant et. al., VLDB 2013 The big idea: Traditional databases typically rely on the OS page cache to bring hot tuples into memory and keep them there. This suffers from a number of problems: No control over granularity of caching or eviction (so keeping a tuple […]

Paper notes: Stream Processing at Google with Millwheel

MillWheel: Fault-Tolerant Stream Processing at Internet Scale Akidau et. al., VLDB 2013 The big idea: Streaming computations at scale are nothing new. Millwheel is a standard DAG stream processor, but one that runs at ‘Google’ scale. This paper really answers the following questions: what guarantees should be made about delivery and fault-tolerance to support most […]

√Čtale cohomology

The second in an extremely irregular series of posts made on behalf of my father, who has spent much of his retirement so far doing very hard mathematics. What is attached here is the essay he wrote for the Part III of the Cambridge Mathematical Tripos, a one year taught course. The subject is the […]

ByteArrayOutputStream is really, really slow sometimes in JDK6

TLDR: Yesterday I mentioned on Twitter that I’d found a bad performance problem when writing to a large ByteArrayOutputStream in Java. After some digging, it appears to be the case that there’s a bad bug in JDK6 that doesn’t affect correctness, but does cause performance to nosedive when a ByteArrayOutputStream gets large. This post explains […]

On Raft, briefly

Raft is a new-ish consensus implementation whose great benefit, to my mind it, is its applicability for real systems. We briefly discussed it internally at Cloudera, and I thought I’d share what I contributed, below. There’s an underlying theme here regarding the role of distributed systems research in practitioners’ daily work, and how the act […]

Some miscellanea

CAP FAQ I wrote an FAQ on The CAP Theorem. The aim is to definitively settle some of the common misconceptions around CAP so as to help prevent its invocation in useless places. If someone says they got around CAP, refer them to the FAQ. It should be a pretty simple introduction to the theorem […]

Columnar Storage

You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and […]