Cloudera Impala
If you have a strong background in either databases or distributed systems, and fancy working on such an exciting technology, send me a note!
It’s great to finally be able to say something about what I’ve been working at Cloudera for nearly a year. At StrataConf / Hadoop World in New York a couple of weeks ago we announced Cloudera Impala. Impala is a distributed query execution engine that understands a subset of SQL, and critically runs over HDFS and HBase as storage managers. It’s very similar in functionality to Apache Hive, but it is much, much, much (anecdotally up to 100x) faster.
Why do I think this is important? (and bear in mind I’m speaking for myself, not Cloudera here). The way I view what we do at Cloudera is that we’re really participating in a revolution in data storage. The great contribution of the Google papers was the application of commodity hardware to enterprise problems. Storing your data reliably on commodity servers with relatively cheap hard disks is enabled by the extreme cheapness of storage density (which means that replicating three times for fault-tolerance is suddenly a viable design choice, for example), and this completely changes the cost / benefit equation for data storage. You can have a lot more of it, and you can have it a lot cheaper. This is where the whole ‘big data’ idea arguably springs from - now you can store more data than ever before. What do you do with it?
Well, you analyse it. But that’s easier said than done. Traditional data analytics platforms are extremely powerful, but rely on an integrated approach - you don’t buy a couple of servers and throw a data warehouse that you downloaded on them, you buy a fully integrated solution that includes hardware and software. The hardware is tuned to the requirements of the software, and the data layout is carefully managed to get great performance on supported queries. This is a very serious platform archetype, and one that I believe will continue to be important to the market, but it’s fundamentally not configured to process data at the scale that the storage substrate now enables, because you would have to move data into the processing platform to do so, which is a very costly proposition. So typically what you do is identify an important subset of your data that requires the full analytics power to your integrated solution, and move that and only that. But that blunts the advantage of keeping all of that data in the first place - if you can only process some of it, why are you keeping all of it?
This is where Apache Hadoop originally came in. Hadoop shines for heavy-lifting batch-processing jobs that have high tolerance for latency but need to touch most bytes in your dataset. Where it doesn’t shine is for interactive analyses, where a user might be refining the analytic questions they ask iteratively. It’s useless to wait for 30 minutes just to find that you probably should have grouped by a different columns. Similarly, users of popular BI tools will want to construct reports that they can interact with. Impala is the tool for these use cases; by allowing for relatively low-latency queries over the great unwashed masses of your data you get the value of the data coupled with some of the processing power of the integrated engines. We’ve brought some of the execution engine magic of massively-parallel distributed databases to the incredibly flexible storage engine model of HDFS and HBase.
Technically, of course, this is a complete blast to work on - a large-scale distributed system that does highly non-trivial stuff at each node. Bliss. Impala’s been the result of the great effort of a small team. That means I’ve written code in lots of different parts of the system. Our INSERT support is mostly me, as is our DDL support (we don’t do CREATE TABLE yet, but we do SHOW / DESCRIBE etc.). I also wrote the failure-detection framework that runs on both the state-store (to evict Impala backends that have failed from the current view of cluster membership) and on each Impala backend (to detect when the state-store has failed). In fact there are a ton of things that I’ve been involved with. Some of the most exciting parts of the code base are the query execution engine itself, which compiles query fragments into a highly optimised program fragment via LLVM, although the planner is also a very complex bit of software.
The source code is available here. At this point, I should acknowledge that the Github repo doesn’t build out of the box, and we haven’t yet provided instructions on how to do so. There’s no great conspiracy behind this, just the annoying consequence of the number of hours in the day being finite. We can’t post the repo exactly as we have it internally at Cloudera for a couple of reasons: we rely on internal infrastructure for some of the build steps which can’t be easily replicated externally, and some of our test suites are customer-confidential. We were rushing (as ever) to get a release out the door, and we made the decision to postpone sorting out the build for the public repo until after the launch. Since that’s… now, I’ve been spending a little time figuring out what we can do. Build systems are not my expertise, and Impala’s is a little capricious due to the extent that we mix C++ (for the execution engine) and Java code (for the planning and metastore interaction). I hope to have something that can build - without tests - by the end of next week, that is by roughly November 9th.
Otherwise we’re working on our roadmap for moving from the current beta to GA, and I have a long list of bugs and bits of polish that I’m looking forward to getting a few minutes to work on. We have some pretty exciting plans coming up, and it’s going to be great to be able to talk about them a little more publicly.
From the perspective of this blog, this means I’ve been reading a ton of interesting database papers (I know distributed systems, but I’m only just getting up to speed with all the database literature). That means a good opportunity to restart some of the paper reviews, although I’m even more likely to say something stupid on a less firm technical footing!
There’s been a whole bunch of buzz about Impala, which has been very gratifying.
- These questions at Quora about how Impala compares to similar systems (including my answer about Apache Drill)
- Curt Monash has a writeup (although he does make it sound like no query will return in under one second, which isn’t the case - query start-up latency is very important to us since that’s an area we can get a huge gain over Hive’s Hadoop implementation; in the case of a simple SELECT 1 the query should return in ~100ms under ideal conditions).
- Marcel Kornacker is the tech lead for Impala - really, my technical boss - and a very interesting guy. He was Joe Hellerstein’s first graduate student (I think), and recently worked at Google on F1. Wired did a write-up on him which is equal parts awesome and hilarious. Marcel’s bread is fantastic though.
- Wired also did a pretty decent piece on Impala itself.
- There’s a good introductory blog post by Justin and Marcel on the new Cloudera blog site
- Impalas make you cool, according to rap music.