Some miscellanea

Posted: May 19th, 2013 | Author: | Filed under: Distributed systems, link, Note | No Comments »

CAP FAQ

I wrote an FAQ on The CAP Theorem. The aim is to definitively settle some of the common misconceptions around CAP so as to help prevent its invocation in useless places. If someone says they got around CAP, refer them to the FAQ. It should be a pretty simple introduction to the theorem as well. I think that CAP itself is a pretty uninteresting result, but it does at least shine a light on tradeoffs implicit in distributed systems. I have a couple of residual thoughts about failures rather than partitions that I might write up at some point in the future, but otherwise I hope the FAQ helps move the conversation on.

Impala and aggregation trees

I also wrote a quick answer on Quora about Impala’s execution trees, and how deeper trees help do aggregation more effectively. There’s a lot more of interest to write about planning, partitioning and scheduling queries.

Not so HotOS

Matt Welsh talks about what he wishes systems researchers would work on. I partially agree: there are few papers in the HotOS 2013 program that pass the “if this were true, a lot of our assumptions would be wrong” test. I don’t think that IO is an unworthy topic however, but the main problem with improving IO interfaces is inertia rather than want for better ideas. I hope that SSDs and / or flash-as-RAM will be forcing functions here. Otherwise his big topics are all worthy research challenges, but I have always seen HotOS as a venue for new ideas, not new topics – despite its name. If it really were intended to be a venue for the fashionable areas of study in systems it would have much less potential value.

Building distributed systems

Andy Gross (Basho VP Eng) gave a closing keynote calling for more reusable primitives in distributed systems. I couldn’t agree more (and have been gently complaining about this for years to anyone who would listen), although I think doing this right is not straightforward and requires a lot of careful thought about RPC mechanisms, libraries vs. services and much more. I have some thoughts on this that I want to wrap up in a blog post sometime soon.

Highly available transactions

Peter Bailis, a Berkeley PhD candidate, has some solid work on Highly Available Transactions, based on exploring the space of consistency models available to databases and finding that there are demonstrably useful consistency guarantees that can be made when availability is kept high. This is the right way to move on from all the CAP arguments – given that we are stuck with certain impossibility results, let’s map out the vast terrain that we can conquer.


Columnar Storage

Posted: January 30th, 2013 | Author: | Filed under: Databases | 4 Comments »

You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage – which is a simple idea – can drastically improve performance for certain workloads.

Caveat: This is a personal, general research summary post, and as usual doesn’t neccessarily reflect our thinking at Cloudera about columnar storage.

Disks are still the major bottleneck in query execution over large datasets. Even a machine with twelve disks running in parallel (for an aggregate bandwidth of north of 1GB/s) can’t keep all the cores busy; running a query against memory-cached data can get tens of GB/s of throughput. IO bandwidth matters. Therefore, the best thing an engineer can do to improve the performance of disk-based query engines (like RDBMs and Impala) usually is to improve the performance of reading bytes from disk. This can mean decreasing the latency (for small queries where the time to find the data to read might dominate), but most usually this means improving the effective throughput of reads from disk.

The traditional way to improve disk bandwidth has been to wait, and allow disks to get faster. However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can’t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.

The other way to improve disk performance is to maximise the ratio of ‘useful’ bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem. Enter columnar storage, a principle for file format design that aims to do exactly that for query engines that deal with record-based data.

Read the rest of this entry »


Cloudera Impala

Posted: November 4th, 2012 | Author: | Filed under: Cloudera Impala, Distributed systems | 1 Comment »

If you have a strong background in either databases or distributed systems, and fancy working on such an exciting technology, send me a note!

It’s great to finally be able to say something about what I’ve been working at Cloudera for nearly a year. At StrataConf / Hadoop World in New York a couple of weeks ago we announced Cloudera Impala. Impala is a distributed query execution engine that understands a subset of SQL, and critically runs over HDFS and HBase as storage managers. It’s very similar in functionality to Apache Hive, but it is much, much, much (anecdotally up to 100x) faster.

Why do I think this is important? (and bear in mind I’m speaking for myself, not Cloudera here). The way I view what we do at Cloudera is that we’re really participating in a revolution in data storage. The great contribution of the Google papers was the application of commodity hardware to enterprise problems. Storing your data reliably on commodity servers with relatively cheap hard disks is enabled by the extreme cheapness of storage density (which means that replicating three times for fault-tolerance is suddenly a viable design choice, for example), and this completely changes the cost / benefit equation for data storage. You can have a lot more of it, and you can have it a lot cheaper. This is where the whole ‘big data’ idea arguably springs from – now you can store more data than ever before. What do you do with it?

Well, you analyse it. But that’s easier said than done. Traditional data analytics platforms are extremely powerful, but rely on an integrated approach – you don’t buy a couple of servers and throw a data warehouse that you downloaded on them, you buy a fully integrated solution that includes hardware and software. The hardware is tuned to the requirements of the software, and the data layout is carefully managed to get great performance on supported queries. This is a very serious platform archetype, and one that I believe will continue to be important to the market, but it’s fundamentally not configured to process data at the scale that the storage substrate now enables, because you would have to move data into the processing platform to do so, which is a very costly proposition. So typically what you do is identify an important subset of your data that requires the full analytics power to your integrated solution, and move that and only that. But that blunts the advantage of keeping all of that data in the first place – if you can only process some of it, why are you keeping all of it?

This is where Apache Hadoop originally came in. Hadoop shines for heavy-lifting batch-processing jobs that have high tolerance for latency but need to touch most bytes in your dataset. Where it doesn’t shine is for interactive analyses, where a user might be refining the analytic questions they ask iteratively. It’s useless to wait for 30 minutes just to find that you probably should have grouped by a different columns. Similarly, users of popular BI tools will want to construct reports that they can interact with. Impala is the tool for these use cases; by allowing for relatively low-latency queries over the great unwashed masses of your data you get the value of the data coupled with some of the processing power of the integrated engines. We’ve brought some of the execution engine magic of massively-parallel distributed databases to the incredibly flexible storage engine model of HDFS and HBase.

Technically, of course, this is a complete blast to work on – a large-scale distributed system that does highly non-trivial stuff at each node. Bliss. Impala’s been the result of the great effort of a small team. That means I’ve written code in lots of different parts of the system. Our INSERT support is mostly me, as is our DDL support (we don’t do CREATE TABLE yet, but we do SHOW / DESCRIBE etc.). I also wrote the failure-detection framework that runs on both the state-store (to evict Impala backends that have failed from the current view of cluster membership) and on each Impala backend (to detect when the state-store has failed). In fact there are a ton of things that I’ve been involved with. Some of the most exciting parts of the code base are the query execution engine itself, which compiles query fragments into a highly optimised program fragment via LLVM, although the planner is also a very complex bit of software.

The source code is available here. At this point, I should acknowledge that the Github repo doesn’t build out of the box, and we haven’t yet provided instructions on how to do so. There’s no great conspiracy behind this, just the annoying consequence of the number of hours in the day being finite. We can’t post the repo exactly as we have it internally at Cloudera for a couple of reasons: we rely on internal infrastructure for some of the build steps which can’t be easily replicated externally, and some of our test suites are customer-confidential. We were rushing (as ever) to get a release out the door, and we made the decision to postpone sorting out the build for the public repo until after the launch. Since that’s… now, I’ve been spending a little time figuring out what we can do. Build systems are not my expertise, and Impala’s is a little capricious due to the extent that we mix C++ (for the execution engine) and Java code (for the planning and metastore interaction). I hope to have something that can build – without tests – by the end of next week, that is by roughly November 9th.

Otherwise we’re working on our roadmap for moving from the current beta to GA, and I have a long list of bugs and bits of polish that I’m looking forward to getting a few minutes to work on. We have some pretty exciting plans coming up, and it’s going to be great to be able to talk about them a little more publicly.

From the perspective of this blog, this means I’ve been reading a ton of interesting database papers (I know distributed systems, but I’m only just getting up to speed with all the database literature). That means a good opportunity to restart some of the paper reviews, although I’m even more likely to say something stupid on a less firm technical footing!

There’s been a whole bunch of buzz about Impala, which has been very gratifying.

  • These questions at Quora about how Impala compares to similar systems (including my answer about Apache Drill)
  • Curt Monash has a writeup (although he does make it sound like no query will return in under one second, which isn’t the case – query start-up latency is very important to us since that’s an area we can get a huge gain over Hive’s Hadoop implementation; in the case of a simple SELECT 1 the query should return in ~100ms under ideal conditions).
  • Marcel Kornacker is the tech lead for Impala – really, my technical boss – and a very interesting guy. He was Joe Hellerstein’s first graduate student (I think), and recently worked at Google on F1. Wired did a write-up on him which is equal parts awesome and hilarious. Marcel’s bread is fantastic though.
  • Wired also did a pretty decent piece on Impala itself.
  • There’s a good introductory blog post by Justin and Marcel on the new Cloudera blog site
  • Impalas make you cool, according to rap music.

On some subtleties of Paxos

Posted: November 3rd, 2012 | Author: | Filed under: Distributed systems | 1 Comment »

There’s one particular aspect of the Paxos protocol that gives readers of this blog – and for some time, me! – some difficulty. This short post tries to clear up some confusion on a part of the protocol that is poorly explained in pretty much every major description.

This is the observation that causes problems: two different nodes can validly accept different proposals for the same Paxos instance. That means that we could have a situation where node A has accepted a proposal P=(S, V), and node E has accepted a proposal P'=(S', V'). We can get there very easily by having two competing proposers that interleave their prepare phases, and crash after sending an accept to one node each. On the face of it, this is a very concerning situation – how can two nodes both believe a different value has been accepted? Doesn’t this violate one of the consensus guarantees of uniformity?

The answer lies in the fact that the nodes doing the accepting are not (necessarily) the nodes that ‘learn’ about the agreed value. Paxos calls for a distinguished category of nodes, called ‘learners’, which hear about acceptance events from the nodes doing the accepting. We call the latter nodes ‘acceptors’, and say that learners ‘commit’ to a value once they can be sure it’s never going to change for a given Paxos instance.

But when does a learner know that a value has really been accepted? It can’t just go on the first acceptance that it receives (since as we have shown, two different acceptors can have accepted different values and may race to send their values to the learners). Instead, a learner must wait for a majority of acceptors to return the same proposal. Once N/2 + 1 acceptors are in agreement, the learner can commit the value to its log, or whatever is required once consensus is reached. The rest of this post shows why this is both necessary and sufficient.

If no majority of acceptors have accepted the same value, it’s trivial to see why a learner cannot commit to a value sent by any acceptor, for the same race-based argument made earlier. A more interesting case is the following: suppose that a majority of acceptors have accepted a proposal with the same value, but with different sequence numbers (i.e. proposed by a different proposer). Can a learner commit to that value once it has learnt about all the acceptances? In the following section we show that it can.

Conditions for learner commit

Theorem: Let a majority of acceptors have accepted some proposal with value V, and let P = (S,V) be the proposal with the largest sequence number S amongst all those acceptors. Then there is no proposal P' = (S', V') that is accepted at any node with S'>S and V' \neq V.

Proof: We proceed in two steps. The first shows that when P is accepted, there is no proposal already accepted at any node with a later sequence number. Assume that this is false, and some node has accepted P'. Then a majority of acceptors must have promised to accept only proposals with sequence number S'' > S'. Since S < S', P cannot be accepted by a majority of acceptors, contradicting the assumption.

Second, we prove by a very similar argument that once P has been accepted, no node will accept a proposal like P'. Again, assume this is false. Then a majority of acceptors must have sent a promise of (S', V') in order for the proposer of P' to be sending accept messages. If so, then either that same majority should have ignored the accept message for P (since S < S'), or V' = V if the accept of P happened before the proposal of P' (by the first half of this proof we know that there is no proposal with a later sequence number than S already accepted, so the proposer is guaranteed to choose V as the already accepted value with the largest sequence number). In either case there is a contradiction; P has not been accepted or V = V'.

What this theorem shows is that once a value has been accepted by a majority of acceptors, no proposal can change it. The sequence number might change (consider what happens if a new proposer comes along and runs another proposal over the same instance - the sequence number will increase at some acceptors, but the proposer must choose the majority value for its accept message). But since the majority-accepted value will never change, the learners can commit a value when they hear it from a majority of acceptors.

Fault tolerance

Now it's instructive to think about what this means for Paxos' fault-tolerance guarantees. Imagine that a proposal was accepted at a minimal (i.e. N/2+1 nodes) majority of acceptors before the proposer crashed. In order for the value to be committed by a learner, every one of those acceptors must successfully send its accepted value on to the learners. So if a single node in that majority fails, that Paxos instance will not terminate for all learners. That appears to be not as fault-tolerant as we were promised.

There are several ways to interpret this fact. The first is that Paxos only guarantees that it will continue to be correct, and live, with up to N/2 + 1 failures; and failing to reach agreement for a single proposal does not contravene these properties. It's also true that if the proposer dies before sending any accept messages, that proposal will also never complete. However, another proposer can always come along and finish that instance of the protocol; it's this that is no longer true if a majority of acceptors fail.

The second interpretation is that it makes sense for acceptors to also act as learners, so that they can update their values for a given Paxos instance once they realise that consensus is complete. It's often true that learners and acceptors are the same thing in a real Paxos deployment, and the aim is usually to have as many acceptors up-to-date as possible.

So that's a short look at how the distribution of accepted proposals can evolve during Paxos, and how the protocol guarantees that eventually the cluster will converge on a value that will never change.


Links

Posted: August 6th, 2012 | Author: | Filed under: Uncategorized | No Comments »
  • Reasoning about Knowledge
  • Toward a Cloud Computing Research Agenda (2009) –
    “One of the LADIS attendees commented at some point that Byzantine Consensus could be used to improve Chubby, making it tolerant of faults that could disrupt it as currently implemented. But for our keynote speakers, enhancing Chubby to tolerate such faults turns out to be of purely academic interest.”

  • Low-level data structures
    The llds general working thesis is: for large memory applications, virtual memory layers can hurt application performance due to increased memory latency when dealing with large data structures. Specifically, data page tables/directories within the kernel and increased DRAM requests can be avoided to boost application memory access.

  • High-Performance Concurrency Control for Main-Memory Databases (via High Scalability) – MVCC is interesting and elegant, and also underpins some datastores with persistence, like HBase. I like this paper as the best survey.

Something a bit different: translations of classic mathematical texts (!)

Posted: August 4th, 2012 | Author: | Filed under: Uncategorized | No Comments »

During his retirement, my father has been able to spend much time indulging his love of mathematics. This included, amongst other impressive endeavours, attending Cambridge at a more advanced age than average to take (and pass!) the Part III of the Mathematical Tripos, often considered one of the hardest taught courses in maths in the world.

Since then, he has hardly been idle, and has recently been undertaking a translation of a classic work in modern algebra by Dedekind and Weber from its original 100+ pages of German into English.

Having completed this monumental piece of work, it seemed only proper to share it a little more widely so that other students might benefit from his efforts – and that’s where I come in, since I’m the one with the website. So if you have any passing interest in 19th / 20th century modern algebra, I encourage you to check out Noel Robinson’s translation of “Theory of Algebraic Functions of One Variable”, hosted on this site.


EuroSys 2012 blog notes

Posted: April 15th, 2012 | Author: | Filed under: Uncategorized | 1 Comment »

EuroSys 2012 was last week – one of the premier European systems conferences. Over at the Cambridge System Research Group’s blog, various people from the group have written notes on the papers presented. They’re very well-written summaries, and worth checking out for an overview of the research presented.


FLP and CAP aren’t the same thing

Posted: March 25th, 2012 | Author: | Filed under: Distributed systems | 1 Comment »

An interesting question came up on Quora this last week. Roughly speaking, the question asked how, if at all, the FLP theorem and the CAP theorem were related. I’d thought idly about exactly the same question myself before. Both theorems concern the impossibility of solving fairly similar fundamental distributed systems problems in what appear to be fairly similar distributed systems settings. The CAP theorem gets all the airtime, but FLP to me is a more beautiful result. Wouldn’t it be fascinating if both theorems turned out to be equivalent; that is effectively restatements of each other?

Read the rest of this entry »


Should I take a systems reading course?

Posted: March 9th, 2012 | Author: | Filed under: Uncategorized | 1 Comment »

A smart student asked me a couple of days ago whether I thought taking a 2xx-level reading course in operating systems was a good idea. The student, understandably, was unsure whether talking about these systems was as valuable as actually building them, and also whether, since his primary interest is in ‘distributed’ systems, he stood to benefit from a deep understanding of things like virtual memory.

Read the rest of this entry »


I’m talking at Strata Conference 2012

Posted: January 18th, 2012 | Author: | Filed under: Uncategorized | No Comments »

I’ll be giving a talk at this year’s Strata Conference in Santa Clara, on February 29th. My talk is called Monitoring Apache Hadoop – A Big Data Problem?. I’d be lying if I said that every slide was fully realised at this point, but you can read the abstract to see what I’ve committed myself to. The general idea is that building large scale shared-nothing distributed systems is at most half the problem in making them a reality. Managing these systems day-to-day requires the understanding and analysis of a serious amount of data; so there’s a nice cycle here that you might be able to use the data processing systems you’re trying to understand to understand them. I’ll try and tie the whole thing together with a discussion of failure; the thesis being that partial failure in distributed systems is both to blame for the incidents we’re trying to understand, and making understanding them very difficult – I believe this is true in a very fundamental sense, so I’ll make that case and also talk about what is to be done.

(And if I’m not a big enough draw – perish the thought – there are many, many other interesting sessions. In particular, Josh will be talking about Crunch, and Sarah will be giving both introductory and advanced Hadoop classes – both people I work with, and both fantastic speakers!)