Last time we looked extensively at two-phase commit, a consensus algorithm that has the benefit of low latency but which is offset by fragility in the face of participant machine crashes. In this short note, I’m going to explain how the addition of an extra phase to the protocol can shore things up a bit, at the cost of a greater latency.
For the next few articles here, I’m going to write about one of the most fundamental concepts in distributed computing – of equal importance to the theory and practice communities. The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something – it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network.
For example, some possible uses of consensus are:
- deciding whether or not to commit a transaction to a database
- synchronising clocks by agreeing on the current time
- agreeing to move to the next stage of a distributed algorithm (this is the famous replicated state machine approach)
- electing a leader node to coordinate some higher-level protocol
Such a simple-sounding problem has surprisingly been at the core particularly of theoretical distributed systems research for over twenty years. How come? As I see it, the answers are threefold.
This blog post is an excellent survey of the last thirty years of research into consensus problems.
One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch and Patterson. Their short paper ‘Impossibility of Distributed Consensus with One Faulty Process’, which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.
This particular result, known as the ‘FLP result’, settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus – that is, getting a distributed network of processors to agree on a common value – was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.
This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it’s not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem.
In this post, I want to give a tour of the proof itself because, although it is quite subtle, it is short and profound. I’ll start by introducing consensus, and then after describing some notation and assumptions I’ll work through the main two lemmas in the paper.
If you want to follow along at home (highly, highly recommended) a copy of the paper is available here.