GFS Retrospective in ACM Queue

This is a really great article. Sean Quinlan talks very openly and critically about the design of the Google File System given ten years of use (ten years!).

What’s interesting is that the general sentiment seems to be that the concessions that GFS made for performance and simplicity (single master, loose consistency model) have turned out to probably be net bad decisions, although they probably weren’t at the time.

There are scaling issues with GFS – the well known many-small-files problem that also plagues HDFS, and a similar huge-files problem. Both increase the amount of metadata that a master has to maintain, and both therefore degrade performance due to some linear costs on metadata operations.

A replacement for GFS is in the works – a complete re-do, not just incremental updates to the design. This will involve, at least, a rethink of metadata storage and clearly a more fault-tolerant and scalable design. Multiple appenders, also a bugbear of HDFS, may be serialised through a single writer process, which would help the consistency issues.

“Our user base has definitely migrated from being a MapReduce-based world to more of an interactive world that relies on things such as BigTable. Gmail is an obvious example of that. Videos aren’t quite as bad where GFS is concerned because you get to stream data, meaning you can buffer. Still, trying to build an interactive database on top of a file system that was designed from the start to support more batch-oriented operations has certainly proved to be a pain point.”

One of those great articles where every paragraph is rich with insight. Worth reading, probably twice.

Miss me?

I’ve been away from this blog for far longer than I hoped. I’ve got plenty of content queued up though, so in the next few days I should start to get my postings back in order. Rumours of my death are greatly exaggerated; of my gainful employment less so. That, clearly, has been taking up some time, but I’m getting into my groove now.

A small conciliatory nugget that regular readers of this blog might find interesting: a trademark Robinson monster post on Cloudera’s blog describing how to build a simple distributed queue with Apache ZooKeeper.

More on ZooKeeper and our usual distributed systems programming to follow.

Barbara Liskov's Turing Award, and Byzantine Fault Tolerance

Barbara Liskov has just been announced as the recipient of the 2008 Turing Award, which is one of the most important prizes in computer science, and can be thought of as our field’s equivalent to the various Nobel Prizes. Professor Liskov is a worthy recipient of the award, even if judged alone by her citation which lists a number of the important contributions she has made to operating systems, programming languages and distributed systems.

Professor Liskov seems to be particularly well known for the Liskov substitution principle which says that some property of a supertype ought to hold of its subtypes. I’m not in any position to speak as to the importance of this contribution. However, her more recent work has been regarding the tolerance of Byzantine failures in distributed systems, which is much more close to my heart.

The only work of Liskov’s that I am really familiar with is the late 90s work on Practical Byzantine Fault Tolerance with Miguel Castro and is first published in this OSDI ’99 paper. I’m not going to do a full review, but the topic sits so nicely with my recent focus on consensus protocols that it makes sense to briefly discuss its importance.
Continue reading

OSDI '08: FlightPath: Obedience vs. Choice in Cooperative Services

This is one of my favourite papers from OSDI ’08 (yes, still doing a few reviews, trying to get to five or so before SOSP…). FlightPath is a system developed by some folks mainly at UT Austin for peer-to-peer streaming in dynamic networks. This is a reasonably challenging problem in itself, although one that’s seen a good deal of work before. However, the really cool thing about this paper is that they treat participants in the network as potentially rational agents. Since Lamport’s seminal work on the Byzantine generals problem, it’s been standard practice to assign one of two behaviour modes to members of distributed systems: either you’re alturistic, which means that you do exactly what the protocol tells you to do, no matter what the cost to yourself, or Byzantine, which means that you do whatever you like, again no matter what the cost to yourself.

It was realised recently that this is a false dichotomy: there’s a whole class of behaviour that’s not captured by these two extremes. Rational agents participate in a protocol as long as it is worth their while to do so. At its most simple, this means that rational agents will not incur a cost unless they expect to recoup a benefit that is worth equal to or more than the original cost to them. This gave rise to the Byzantine-Alturistic-Rational (BAR) model, due to the same UTA group, which can be used to more realistically model the performance of peer-to-peer protocols.
Continue reading

Consensus Protocols: A Paxos Implementation

It’s one thing to wax lyrical about an algorithm or protocol having simply read the paper it appeared in. It’s another to have actually taken the time to build an implementation. There are many slips twixt hand and mouth, and the little details that you’ve abstracted away at the point of reading come back to bite you hard at the point of writing.

I’m a big fan of building things to understand them – this blog is essentially an expression of that idea, as the act of constructing an explanation of something helps me understand it better. Still, I felt that in order to be properly useful, this blog probably needed more code.

So when, yesterday, it was suggested I back up my previous post on Paxos with a toy implementation I had plenty of motivation to pick up the gauntlet. However, I’m super-pressed for time at the moment while I write my PhD thesis, so I gave myself a deadline of a few hours, just to keep it interesting.

A few hours later, I’d written this from-scratch implementation of Paxos. There’s enough interesting stuff in it, I think, to warrant this post on how it works. Hopefully some of you will find it useful, and something you can use as a springboard to your own implementations. You can run an example by simply invoking python toy_paxos.py.

Continue reading