Posted: April 15th, 2012 | Author: Henry | Filed under: Uncategorized | 1 Comment »
EuroSys 2012 was last week – one of the premier European systems conferences. Over at the Cambridge System Research Group’s blog, various people from the group have written notes on the papers presented. They’re very well-written summaries, and worth checking out for an overview of the research presented.
Posted: March 25th, 2012 | Author: Henry | Filed under: Distributed systems | No Comments »
An interesting question came up on Quora this last week. Roughly speaking, the question asked how, if at all, the FLP theorem and the CAP theorem were related. I’d thought idly about exactly the same question myself before. Both theorems concern the impossibility of solving fairly similar fundamental distributed systems problems in what appear to be fairly similar distributed systems settings. The CAP theorem gets all the airtime, but FLP to me is a more beautiful result. Wouldn’t it be fascinating if both theorems turned out to be equivalent; that is effectively restatements of each other?
Read the rest of this entry »
Posted: March 9th, 2012 | Author: Henry | Filed under: Uncategorized | No Comments »
A smart student asked me a couple of days ago whether I thought taking a 2xx-level reading course in operating systems was a good idea. The student, understandably, was unsure whether talking about these systems was as valuable as actually building them, and also whether, since his primary interest is in ‘distributed’ systems, he stood to benefit from a deep understanding of things like virtual memory.
Read the rest of this entry »
Posted: January 18th, 2012 | Author: Henry | Filed under: Uncategorized | No Comments »
I’ll be giving a talk at this year’s Strata Conference in Santa Clara, on February 29th. My talk is called Monitoring Apache Hadoop – A Big Data Problem?. I’d be lying if I said that every slide was fully realised at this point, but you can read the abstract to see what I’ve committed myself to. The general idea is that building large scale shared-nothing distributed systems is at most half the problem in making them a reality. Managing these systems day-to-day requires the understanding and analysis of a serious amount of data; so there’s a nice cycle here that you might be able to use the data processing systems you’re trying to understand to understand them. I’ll try and tie the whole thing together with a discussion of failure; the thesis being that partial failure in distributed systems is both to blame for the incidents we’re trying to understand, and making understanding them very difficult – I believe this is true in a very fundamental sense, so I’ll make that case and also talk about what is to be done.
(And if I’m not a big enough draw – perish the thought – there are many, many other interesting sessions. In particular, Josh will be talking about Crunch, and Sarah will be giving both introductory and advanced Hadoop classes – both people I work with, and both fantastic speakers!)
Posted: January 4th, 2012 | Author: Henry | Filed under: Uncategorized | 1 Comment »
This page, from the ‘PBS’ team at Berkeley’s AMPLab is quite interesting. It allows you to tweak the parameters of a Dynamo-style system, then by running a series of Monte Carlo simulations gives an estimate of the likelihood of staleness of reads after writes.
Since the Dynamo paper appeared and really popularised eventual consistency, the debate has focused on a fairly binary treatment of its merits. Either you can’t afford to be wrong, ever, or it’s ok to have your reads be stale for a potentially unbounded amount of time. In fact, the suitability of eventual consistency is dependent partly on the distribution of stale reads; that is the speed of quiescence of a system immediately after a write. If the probability of a ever seeing a stale read due to consistency delays can be reduced to smaller than the probability of every machine in the network simultaneously catching fire, we can probably make use of eventual consistency.
Looking at many designed systems (where there is little more than conventional wisdom on how to choose R and W), it’s clear that an analytical model relating system parameters to distributions of behaviour is sorely needed. PBS is a good step in that direction. It would be good to see the work extended to handle a treatment of failure distributions (although a good failure model is hard to find!). The reply latencies of write and read replicas are modelled exponentially distributed CDFs, but in reality there’s a more significant probability of the reply latency becoming infinite. Once that distribution is correctly modelled, PBS should be able to run simulations against it with no change.
A great use for this tool would be to enter some operational parameters, such as the required consistency probability, max number of nodes, availability requirements and maximum request latency, and have PBS suggest some points in the system design space that would meet these requirements with high probability. As the size of the R / W quora get larger, the variance on the request latencies gets larger, but the resilience to failures increases as does the likelihood of fresh reads. For full credit, PBS could additionally model a write / read protocol (i.e. 2-phase commit) which has different consistency properties. As Daniel Abadi discusses, when things are running well different consistency guarantees trade off between latency and the strength of consistency.
Nice work PBS team!
Posted: June 13th, 2011 | Author: Henry | Filed under: cloudera | No Comments »
Today was a pretty good day at Cloudera, although it was not much unlike any other. Today I was:
- checking a leader election protocol for correctness
- cleaning up some code in an open source project
- designing a specialised messaging system
- working with one of our outstanding interns on the very cool project he’s taking on over the summer
- cheerleading as Cloudera pushed another open-source project out of the nest.
We have more interesting work than we can possibly do. We need great engineers to come solve some really fascinating problems. I wanted to take advantage of the fact that this blog is currently getting a lot of traffic to put this message in front of a lot of eyeballs: you should really consider coming to work at Cloudera.
We’re hiring pretty aggressively in several engineering areas, but two in particular that I can say a lot about:
Distributed systems engineers. Engineering jobs where you are doing what we would call ‘systems’ work are few and far between. They exist, but they’re not common. One thing I have found, speaking to engineers at other companies, that a job that encourages specialisation in distributed systems is even less easily found, especially at an exciting company with plenty of really smart people. At Cloudera we need people to work on a variety of large distributed systems – execution frameworks, distributed filesystems, data ingest pipelines, schedulers, coordination systems, query processing and more. The problems you would work on are fundamental and challenging, and the people you would get to work with know the systems inside and out. The need to scale is real – we have customers who have thousands of machines in their clusters, and we write software to run across every single core.
Applications engineers. Think managing one of those clusters is easy? It isn’t. The monitoring and lifecycle management challenge posed by Apache Hadoop and other systems can be enormously complex. Each cluster produces a vast amount of data, and your job as an application engineer is to extract the right signals from that data and bring it to operations engineers through clear, meaningful visualisations so that they can make sense of all that information. We’re looking for full-stack developers, as happy writing a server process to aggregate and chomp through log files as they are building a reusable heatmap control to properly visualise a multi-dimensional distribution.
The perks at Cloudera are good – we have offices in San Francisco and Palo Alto, we have communal lunches brought in every day, there’s a subsidised gym membership, you can buy books and we’re pretty light on procedural overhead. But if you’re a good fit, you’ll probably be attracted because the problems are real, interesting and hard, and the people you’ll work with to solve them are smart, knowledgable and a whole load of fun.
If you’re interested, drop me a line or send me your resume at henry AT cloudera DOT com, and I’ll happily tell you anything you want to know about this place.
Posted: April 21st, 2011 | Author: Henry | Filed under: Operating systems | No Comments »
It’s a sign of how down-trodden the Software Transactional Memory (STM) effort must have become that the article (sorry, ACM subscription required) published in a recent CACM might have been just as correctly called “STM: Not as bad as the worst possible case”. The authors present a series of experiments that demonstrate that highly concurrent STM code beats sequential, single threaded code. You’d hope that this had long ago become a given, but what this demonstrates is only hey, STM allows some parallelism. And this weak lower bound got a whole article.
Another conclusion from the article is that STM performs best when there is little contention for transactions between threads. Again, that should really be a given – all reasonable concurrency primitives have high throughput when there is little contention but high parallelism. (A lot of work has gone into making this a very fast case (since it is the most common) for locking, see e.g. biased locking schemes in the Hotspot JVM).
Bryan Cantrill (previously of Fishworks, now of Joyent) rips on transactional memory more eloquently than I ever could. STM is a declarative solution to thread safety, which I like, but no more declarative really than synchronised blocks – and Cantrill points out the elephant in the room that the CACM article seemed to ignore: doing IO inside transactions is hugely problematic (because how precisely do you roll back a network packet?).
A recent paper at SOSP 2009 called Operating System Transactions attacked this problem, although not from the viewpoint of STM, but to provide atomicity and isolation for situations where bugs arise from the separation between reads, and writes that depend on that read (Time Of Check To Time Of Use – TOCTTOU). Perhaps there’s an overlap between this paper and STM approaches, but it’s not clear whether the workloads inside an operating system’s system call layer are general enough to map onto typical user-space STM work.
Posted: October 7th, 2010 | Author: Henry | Filed under: computer science, Distributed systems | Tags: cap, consensus | 1 Comment »
The CAP theorem gets another airing.
I think the article makes a point worth making again, and makes it fairly well – that CAP is really about P=> ~(C & A). A couple of things I want to call out though, after a rollicking discussion on Hacker News.
“For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. You and I do not work with these types of systems because they don’t exist.”
This is a bit strong, at least theoretically. Actually all you need to not require partition-tolerance is to guarantee that your particular kryptonite failure pattern never occurs. Many protocols are robust to a dropped message here or there. A quorum system requires a fairly dramatic failure (one node completely partitioned) before one side of the partition has to occur. In practice, of course, these failures happen more often than we would like, which is why we worry about fault-tolerant properties of distributed algorithms.
Therefore the paragraph on failure probabilities is less powerful. It’s not always a problem if a single failure occurs, and therefore you shouldn’t immediately worry about sacrificing availability or consistency as soon as one node starts running slowly. CAP only establishes the existence of a failure pattern that torpedoes any distributed implementation of an atomic object, not its high probability.
Posted: April 27th, 2010 | Author: Henry | Filed under: Distributed systems, link | 2 Comments »
Over on the Cloudera blog I’ve written an article that should be of interest to readers of this blog.
I’m no great fan of the ubiquity of the CAP theorem – it’s a solid impossibility result which appeals to the theorist in me, but it doesn’t capture every fundamental tension in a distributed system. For example: we make our systems distributed across more than one machine usually for reasons of performance and to eliminate a single point of failure. Neither of these motivations are captured verbatim by the CAP theorem. There’s more to designing distributed systems!
In this, I agree with Stonebraker; it’s the erroneous representation of ‘partition tolerance’ that I found very strange. I’ve been a good deal more forceful in private about this than I have in public
Posted: March 24th, 2010 | Author: Henry | Filed under: Uncategorized | 1 Comment »
Students! Over at Apache ZooKeeper we’re looking for great students with a strong interest in distributed systems to work with us over the summer as part of Google’s Summer of Code, 2010.
Summer of Code is a great program – providing stipends to students and more importantly connecting them with mentors in open source projects. ZooKeeper has a number of interesting projects to get started on.
ZooKeeper is a distributed coordination platform on which you can build the distributed equivalents of many traditional concurrent primitives like locks, queues and barriers. It’s heavily used in the real world – Yahoo! use it extensively, and many other major companies rely on it. The other committers and I are actively looking to increase participation in the project – there is loads of really interesting work left to do. If consensus protocols, distributed systems, scalability, fault-tolerance and performance are your thing, this is certainly the project for you.
If you have any questions at all, drop me a line at henry at apache dot org.