Tuesday Links, 27th January 2009

Web highlights discovered in the last week or so:

Diagrams, and the state of the union

Due to popular request, I’ve started retrospectively adding some diagrams to articles that really need them. First to get the treatment has been two-phase commit – by far the most popular article on this blog. The Dynamo article will be the next up, then 3PC and maybe the GFS and BigTable entries.

I’m using an old version of OmniGraffle, which came installed on my Powerbook G4. I recently replaced the power supply board in the G4 (which involved some hair-raising open Mac surgery) and am delighting in having all these great applications at my fingertips again. OmniGraffle makes diagrams for things like this so very easy. The effort expended in producing a diagram is far less than that for writing 1000 words, so if the old adage is true this is a very efficient way of producing content.

Although Real Work is consuming a lot of my time at the moment, I’ve been pretty good at finding time here and there to write for this blog. I’ve got two outward-facing goals here. The first is to make available some clear explanations for basic distributed systems theory and practice. I think there’s a niche for good work here – I don’t think any textbooks I have read adequately treat practice in a sufficiently theoretical way, and the theory textbooks can be too abstract to be accessible. Therefore I’ve been writing articles like the tour of FLP impossibility, the aforementioned introduction to two-phase and three-phase commit and the discussion of consensus in the context of lossy links. Continuing in this vein, I have plans to talk about Paxos, failure detectors, distributed spanner construction and some more simple, fundamental distributed algorithms such as leadership election.

My second ‘public-facing’ goal is to survey some of the more interesting (and occasionally less interesting) systems research, with a particular emphasis on real systems that exist and work. Hence the GFS, BigTable and PNUTS articles, and the recent series on OSDI papers. Part of my day job is being familiar with OSDI, NSDI, SOSP, HotOS, Mobi* etc. conferences and workshops, and by writing the articles I get the chance to consolidate my understanding, which is highly useful.

I’d be very interested to hear, by mail or by comment, if there are any particular topics that you’d like me to cover. I suspect the imminent article on Paxos will be popular (executive summary: it’s not that hard, especially if you already understand 3PC), but otherwise it’s hard to gauge what people are looking forward to reading on this blog. I even would enjoy picking up writing about algorithms that I abortively started to do. So help me out!

OSDI '08: Corey, an operating system for many cores

Just before Christmas, the systems community held one of its premier conferences – Operating Systems Design and Implementation (OSDI ’08). This biannual conference showcases some of the best research in operating systems, networks, distributed systems and software technology from the past couple of years.

Although I wasn’t lucky enough to go, I did grab a copy of the proceedings and had a read through a bunch of the papers that interested me. I plan to post summaries of a few to this blog. I see people ask repeatedly on various forums (fora?) “what’s new in computer science?”. No-one seems to give a satisfactory answer, for a number of reasons. Hopefully I can redress some of the balance here, at least in the systems world.

Without further ado, I’ll get stuck in to one of the OSDI papers: Corey: an operating system for many cores by Boyd-Wickizer et al from a combination of MIT, Fudan University, MSR Asia and Xi’an Jiaotong University (12 authors!). Download the paper and play along at home, as usual.
Continue reading

Consensus with lossy links: Establishing a TCP connection

After a hiatus for the Christmas break, during which I travelled to the States, had a job interview, went to Vegas, became an uncle and got a cold, I’m back on a more regular posting schedule now. And I’ve got lots to post about.

Before I talk about other theoretical consensus protocols such as Paxos, I want to illustrate a consensus protocol running in the wild, and show how different modelling assumptions can lead to protocols that are rather different to the *PC variants we’ve looked at in the last couple of posts. We’ve been considering situations like database commit, where many participants agree en-masse to the result of a transaction. We’ve assumed that all participants may communicate reliably, without fear of packet loss (or if the packets are lost then the situation is the same as if the host that had sent the packet had failed).

The Transmission Control Protocol (TCP) gives us at least some approximation to a reliable link due to the use of sequence numbers and acknowledgements. However before we can use TCP both hosts involved in a point to point communication have to establish a connection: that is, they must both agree that a connection is established. This is a two-party consensus problem. Neither party can rely on reliable transmission, and can instead only use the IP stack and below to negotiate a connection. IP does not give reliable transmission semantics to packets and works only on a best-effort principle. If the network is noisy or prone to outages then packets will be lost. How can we achieve consensus in this scenario?

Those who have been reading this blog as far back as my explanation of FLP impossibility will probably be thinking that this is a trick question. FLP impossibility shows that if there is an unbounded delay in the transmission of a packet (i.e. an asynchronous network model) then consensus is, in general, unsolvable. Lossy links can be regarded as delaying packet delivery infinitely – therefore it seems very likely that consensus is unsolvable with packet loss.

In fact, this is completely true. Consensus with arbitrary packet loss is an unsolvable problem, even in an otherwise synchronous network. In this post I want to demonstrate the short and intuitive proof that this is the case, then show how this impossibility is avoided where possible in TCP connection establishment.

Continue reading