MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Akidau et. al., VLDB 2013
The big idea: Streaming computations at scale are nothing new. Millwheel is a standard DAG stream processor, but one that runs at ‘Google’ scale. This paper really answers the following questions: what guarantees should be made about delivery and fault-tolerance to support most common use cases cheaply? What optimisations become available if you choose these guarantees carefully?
DB2 with BLU Acceleration: So Much More than Just a Column Store
Raman et. al., VLDB 2013
The big idea: IBM’s venerable DB2 technology was based on traditional row-based technology. By moving to a columnar execution engine, and crucially then by taking full advantage of the optimisations that columnar formats allow, the ‘BLU Acceleration’ project was able to improve read-mostly BI workloads by a 10 to 50 times speed-up.
The second in an extremely irregular series of posts made on behalf of my father, who has spent much of his retirement so far doing very hard mathematics. What is attached here is the essay he wrote for the Part III of the Cambridge Mathematical Tripos, a one year taught course. The subject is the Étale cohomology.
Says my Dad: “I am afraid that I have been lured away from the translation of SGA 4.5 for some time by the attraction of working on Wolfgang Krull’s report on “Idealtheorie” from 1935 (again I am not aware of an English version anywhere) which is yet another important classic. However during a year at Cambridge I did write an essay as a very basic introduction to Étale Cohomology which was based on the first part of SGA 4.5. So with the usual imprecation of caveat lector, here it is as a temporising partial substitute should any other beginner be interested.”
Here’s part of the introduction:
This essay has been written as part of the one year Certificate of Advanced Study in Mathematics (CASM) course at Cambridge University which coincides with Part III of the Mathematical Tripos. The starting point is, of necessity, roughly that reached in the lectures which in this particular year did not include much in the way of schemes and sheaves, nor, in the case of the author, much in the way of algebraic number theory.
Thus the frontiers of the subject can safely rest undisturbed by the contents of this essay. Rather it has been written with a reader in mind corresponding roughly to the author at the start of the enterprise. That is someone who is interested to find out what all the fuss was with the French algebraic geometers in the 1960s but is in need of some fairly elementary background to map out the abstractions involved and with any luck to avoid drowning in the “rising sea”.
And here’s the essay itself!
TLDR: Yesterday I mentioned on Twitter that I’d found a bad performance problem when writing to a large
ByteArrayOutputStream in Java. After some digging, it appears to be the case that there’s a bad bug in JDK6 that doesn’t affect correctness, but does cause performance to nosedive when a
ByteArrayOutputStream gets large. This post explains why.
Raft is a new-ish consensus implementation whose great benefit, to my mind it, is its applicability for real systems. We briefly discussed it internally at Cloudera, and I thought I’d share what I contributed, below. There’s an underlying theme here regarding the role of distributed systems research in practitioners’ daily work, and how the act of building a distributed system has not yet been sufficiently well commoditised to render a familiarity with the original research unnecessary. I think I’d argue that bridging that gap further is necessary: no matter how much fun it is to read all these papers, it shouldn’t be a pre-requisite to being successful in implementing a distributed system. I have more to write on this.
“The trouble with Paxos is that it’s ‘only’ a consensus algorithm; a theoretical achievement but not one necessarily suited to building practical systems. Remember that the demonstration that a correct, message-optimal protocol even existed was the main contribution. To that end, a lot of practical considerations were left by the wayside. Leader election is an exercise for the reader (since Paxos is robust to bad implementations where there are several leaders, it doesn’t matter what election scheme is used). Paxos is not concerned with ‘logs’ at all; that it can be used to build replicated-state machines with durable logs is a corollary, not the main theorem.
Raft fills in a ton of these gaps, and more power to them for doing so. The leader election algorithm is set in stone. There are additional constraints to ensure that updates are seen and processed in hole-free order (Paxos doesn’t guarantee this), which is exactly what you want from a distributed log. Raft also specifies a view-change algorithm, which Paxos does not, but VS replication does. The huge effort required to get ZOOKEEPER-107 committed shows how hard this is to retrofit onto an existing system.
So: there’s a tendency to conflate ‘distributed replicated with strong consistency properties’ with ‘consensus algorithm’. Consensus shows you can agree on a single value, multi-Paxos shows you can agree on a bunch of them, but neither give you a complete system for a replicated log which is actually what most of our distributed systems want to interact with.”