One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch and Patterson. Their short paper ‘Impossibility of Distributed Consensus with One Faulty Process’, which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.
This particular result, known as the ‘FLP result’, settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus – that is, getting a distributed network of processors to agree on a common value – was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.
This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it’s not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem.
In this post, I want to give a tour of the proof itself because, although it is quite subtle, it is short and profound. I’ll start by introducing consensus, and then after describing some notation and assumptions I’ll work through the main two lemmas in the paper.
If you want to follow along at home (highly, highly recommended) a copy of the paper is available here.
The problem of consensus is genuinely fundamental to distributed systems research. Getting distributed processors to agree on a value has many, many applications. For example, the problem of deciding whether to commit a transaction to a database could be decided by a consensus algorithm between a majority of replicas. You can think of consensus as ‘voting’ for a value.
Several papers in the literature set the problem in the context of generals at different camps outside the enemy castle deciding whether or not to attack. A consensus algorithm that didn’t work would perhaps cause one general to attack while all the others stayed back – leaving the first general in trouble.
There are several variations on the consensus problem that differ in ‘strength’ – a solution to a stronger problem typically will solve weaker problems at the same time. A strong form of consensus is as follows:
Given a set of processors, each with an initial value:
- All non-faulty processes eventually decide on a value
- All processes that decide do so on the same value
- The value that has been decided must have proposed by some process
These three properties are referred to as termination, agreement and validity. Any algorithm that has these three properties can be said to solve the consensus problem.
Termination and agreement are fairly self explanatory – note that we explicitly do not require any particular behaviour from faulty nodes, which is just as well: they’re faulty! Validity is slightly more subtle, although it seems reasonable. The idea behind validity is that we want to exclude trivial solutions that just decide ‘No’ whatever the initial set of values is. Such an algorithm would satisfy termination and agreement, but would be completely vacuous, and no use to use at all.
The FLP proof actually concerns a slightly weaker form of consensus – for termination it is enough only that some non-faulty process decides. Note that it’s not enough to have one distinguished process that always decides its own value, as that process might fail, and another will have to take its place for even weak termination to be satisfied.
Clearly a solution to strong consensus will be a perfectly good solution for weak consensus as well, so by ruling out the possibility of the latter the FLP result similarly precludes a solution to the former.
For simplicity, the authors consider that the only values possible are 0 and 1. Every argument in the paper generalises to more values.
You can’t reasonably talk about a distributed algorithm without talking about the context it is designed to execute in and the assumptions it makes about the underlying distributed system. This set of assumptions is known as the system model and its choice has a profound effect on what algorithms are appropriate. Indeed, this paper is concerned entirely with an impossibility result on one particular class of system models.
There is a huge variety in system models, ranging from those based solidly on real-world experience to more theoretical models. Most share some basic properties in common: a network of distributed processors which are vertices in a connected graph where the edges are communication links along which messages may be sent. The variation comes in how each processor views the passage of time, whether links are reliable and how – if at all – processors are allowed to fail.
The FLP result is based on the asynchronous model, which is actually a class of models which exhibit certain properties of timing. The main characteristic of asynchronous models is that there is no upper bound on the amount of time processors may take to receive, process and respond to an incoming message. Therefore it is impossible to tell if a processor has failed, or is simply taking a long time to do its processing. The asynchronous model is a weak one, but not completely physically unrealistic. We have all encountered web servers that seem to take an arbitrarily long time to serve us a page. Now that mobile ad-hoc networks are becoming more and more pervasive, we see that devices in those networks may power down during processing to save battery, only to reappear later and continue as though nothing had happened. This introduces an arbitrary delay which fits the asynchronous model.
Other models are arguably more appropriate for networks like the Internet, but the asynchronous model is very general without being completely unrealistic.
The communication links between processors are assumed to be reliable. It was well known that given arbitrarily unreliable links no solution for consensus could be found even in a synchronous model; therefore by assuming reliable links and yet still proving consensus impossible in an asynchronous system the FLP result is made stronger. TCP gives reliable enough message delivery that this is a realistic assumption.
Processors are allowed to fail according to the fail-stop model – this simply means that processors that fail do so by ceasing to work correctly. There are more general failure models, such as Byzantine failures where processors fail by deviating arbitrarily from the algorithm they are executing. Again, by strengthening the model – by making strong assumptions about what is possible – the authors make their result much more general. The argument is as follows: if a given problem is proven unsolvable in a very restricted environment where only a few things can go wrong, it will certainly be unsolvable in an environment where yet more things can go wrong.
In order to formally model these properties, some notation needs to be introduced. In section 2 of the paper the system model is carefully defined. There are processors which communicate by sending messages. A message is a pair where is the processor the message is intended for, and is the contents of the message. Messages are stored in an abstract data structure called the message buffer which is a multiset – simply a set where more than one of any element is allowed – which supports two operations. The first, send(p,m) simply places the message in the message buffer. The second, receive(p) either returns a message for processor (and removes it from the message buffer) or the special value , which does nothing.
The semantics of receive aren’t completely intuitive (but make perfect sense!). If there are no messages in the message system addressed to then receive(p) will return . However if there are messages for then receive(p) will return one of them at random or . This captures the idea that messages may be delivered non-deterministically and that they may be received in any order, as well as being arbitrarily delayed as per the asynchronous model. The only requirement is that calling receive(p) infinitely many times will ensure that all messages in the message buffer are eventually delivered to . This means that messages may be arbitrarily delayed, but not completely lost.
A configuration is defined as the internal state of all of the processors – the current step in the algorithm that they are executing and the contents of their memory – together with the contents of the message buffer. The system moves from one configuration to the next by a step which consists of a processor performing receive(p) and moving to another internal state. Therefore, the only way that the system state may evolve is by some processor receiving a message, or null, from the message buffer. Each step is therefore uniquely defined by the message that is received (possibly ) and the process that received it. That pair is called an event in the paper, so configurations move from one to another through events.
Since the receive operation is non-deterministic, there are many different possible executions for a given initial state (which for consensus is simply differentiated by the set of starting values that each processor has). In order to show that an algorithm solves consensus, we have to show that it satisfies validity, agreement and termination for every possible execution. The set of executions includes the possibility that processors are delayed arbitrarily, which we model by having them receive for an arbitrary number of operations until they get their message. A particular execution, defined by a possibly infinite sequence of events from a starting configuration is called a schedule and the sequence of steps taken to realise the schedule is a run.
Non-faulty processes take infinitely many steps in a run (presumably eventually just receiving once the algorithm has finished its work) – otherwise a process is considered faulty. An admissible run is one where at most one process is faulty (capturing the failure requirements of the system model) and every message is eventually delivered (this means that every processor eventually gets chosen to receive infinitely many times). We say that a run is deciding provided that some process eventually decides according to the properties of consensus, and that a consensus protocol is totally correct if every admissible run is a deciding run.
With all those heavy definitions out of the way, the authors present their main theorem which is that no totally correct consensus algorithm exists. The idea behind it is to show that there is some admissible run – i.e. one with only one processor failure and eventual delivery of every message – that is not a deciding run – i.e. in which no processor eventually decides. This is enough to prove the general impossibility of consensus – it’s enough to show that there’s just one initial configuration in which a given protocol will not work because starting in that configuration can never be ruled out, and the result is a protocol which runs for ever (because no processor decides).
As the paper says, ‘the basic idea is to show circumstances under which the protocol remains forever indecisive’. There are two steps to this, presented as two separate lemmas. The first demonstrates the existence of an initial condition which the second exploits. Things are tied together thereafter.
The First Lemma
(This is lemma 2 in the paper)
The point of the first lemma is to show that there is some initial configuration in which the decision is not predetermined, but in fact arrived as a result of the sequence of steps taken and the occurrence of any failure. For example, say that this initial configuration was two processors whose initial value was ’0′ and one whose value was ’1′. The authors show that what is decided from this configuration depends on the order in which messages are received and whether any of the processors fail, not just what the initial values of the processors are. This is due to the inherent non-determinism of the asynchronous system model.
The way this is done is quite neat. Suppose that the opposite was true – that all initial configurations have predetermined executions. There must be some initial configuration which results in a ’0′ being decided and some for which ’1′ is decided, by validity (which tells us that all results must be possible). Each configuration is uniquely determined by the set of initial values in the processors. We can order these values in a chain where two configurations are next to each other if they differ by only one value, so the only difference between two adjacent configurations is the starting value of one processor.
At some point along this chain, a ’0′-deciding initial configuration must be next to a ’1′-deciding one, and therefore they must differ by only one initial value. Call the ’0′-deciding configuration and the ’1′-deciding configuration . Call the processor whose initial value is different in both . Now, from there must be a run that decides ’0′ even if fails initially – because we allow one processor to fail. Therefore neither sends nor receives any messages, so its initial value cannot be observed by the rest of the processors, one of whom must eventually decide ’0′.
The crucial observation is that this run can also be made from . Because the configurations of and are exactly the same, except for the value at which can’t contribute anything to the run because has failed, any sequence of steps that may take can also be taken by . But if they take exactly the same sequence of steps, then the eventual decision taken must be the same for both configurations. This contradicts our assumption that the result of the consensus algorithm is predetermined only by the initial configuration in and , and that both configurations decide differently. So either one (or possibly both) of these configurations can potentially decide ’0′ or ’1′ and the eventual result depends on the pattern of failures and message deliveries.
We call the configurations that may lead to either decision value bivalent, and configurations that will only result in one value 0-valent or 1-valent. I’ll use these terms for the rest of this article.
The Second Lemma
(This is lemma 3 in the paper)
The second main lemma constitutes most of the body of the proof, although you wouldn’t know it to read it the first time. The statement of the lemma says, informally, that if a protocol starts from a bivalent configuration and there is a message that is applicable to that configuration, then the set of configurations reachable through any sequence of messages where is applied last contains a bivalent configuration.
Even more informally, but more intuitively, it says that if you delay a message that is pending any amount from one event to arbitrarily many, there will be one configuration in which you receive that message and end up in a bivalent state. This is crucial to the entire proof: you can probably see that guaranteeing that there is a path from a bivalent (where you see ‘bivalent’, read ‘undecided’ for clarity) configuration to another bivalent configuration by delaying a message long enough raises the possibility that an infinite loop will ensue and the protocol will remain forever undecided.
The proof of this lemma proceeds by contradiction. We need to set up some terminology as in the paper. Let be a bivalent configuration and let be some event that is applicable to . Let be the set of configurations reachable from without applying , and let be the set of configurations resulting from applying to configurations in .
Assume that contains no bivalent configurations. The proof shows that this leads to a contradiction.
First, we show that contains both 0- and 1-valent configurations if it contains no bivalent configurations. Let for be an -valent configuration reachable from . Both and exist because is bivalent. For , if it is in (i.e. it has not received ) let be the result of receiving at . If it is not in then it has received ; in this case let be a configuration in from which is reachable (i.e. the configuration that preceded at the point that was received).
That confusing definition is quite a simple idea: we are fixing an in that either comes before or after depending on whether was received in the run to .
Now we can see that must be 0-valent. Either it comes before , in which case it too must be 0-valent, because it cannot be bivalent (by assumption) and if it were 1-valent the protocol would ‘change its mind’ between and which contradicts the definition of -valent.
If comes after , by the same argument it must be 0-valent. Therefore contains a 0-valent state.
We can define relative to similarly, to prove that contains a 1-valent configuration. Therefore we’ve shown that must contain both 0-valent and 1-valent configuration if it contains no bivalent configurations.
We use this fact to set up the rest of the lemma’s proof. By a similar argument to the ‘chain’ one used in the proof of the previous lemma, we can see that there must be two configurations and in where is the result of receiving some message in , such that applying to gives an -valent configuration. Call these configurations and .
and are called ‘neighbours’ because they are separated by a single message delivery. Now, consider the message that separates them, . Consider the two alternatives for the destination of , that is .
- If – i.e. the message that separates them is for a different processor than – then we can get to by receiving at . If the destination of a set of messages is different then we can receive them in any order to get to the same configuration, because all the processors see the same order of events, just at different times. (This is called the commutativity of schedules in the paper and is proved by its own lemma. I have left the proof out of this – already proof heavy – article). But is 0-valent, and is 1-valent. So we should not be able to reach from , as again this would imply the protocol had changed its mind. Therefore this is a contradiction.
- Instead consider . Now consider a finite deciding run from (the configuration such that receiving results in a 0-valent configuration), in which takes no steps (such a run must exist to cope with the possibility of failing). Let be the configuration that results at the end of that deciding run. We can also apply that run to either or to get two new configurations and , and by the commutativity of schedules we can also receive at to get (same set of messages, different delivery order), and followed by at to get . (Remember that and are both available for receipt because takes no steps in the deciding run). But this implies that is bivalent, which is a contradiction since the run to is deciding.
So, by contradiction, we have shown that must contain a bivalent configuration.
Bringing it all together
The final step is to show that any deciding run also allows the construction of an infinite non-deciding run. This is done by piecing together sections of runs in such a way as to avoid ever making a decision, that is to enter a univalent configuration.
Start from a bivalent initial configuration (which exists as a result of the first lemma). To make the run admissible, place the processors in the system in a queue and have them receive messages in queue order, being placed at the back of the queue when they are done. This ensures that every message is eventually delivered.
Let be the earliest message to the first processor in the queue; possibly a null message. Then by the second lemma we can reach a bivalent configuration reachable from where is the last message received. Similarly, we can reach another bivalent configuration from by the same argument. And this may continue for ever. This run is admissible, but never reaches a decision, and it follows that the protocol is not totally correct.
Phew! That was a lot of technical proof to wade through. The meat of the proof is in the two lemmas. The first showed that there must be some initial configuration of the system in which the outcome of consensus is not already determined – you can’t just have a protocol that queries the state of the network and then indexes a massive look-up table, as failures or message delays are guaranteed to scupper that plan. This is the lemma where the possibility of failure is key: if there were no failures then the network state *would* be enough to determine consensus (because if there are no failures then message delay can be recognised for what it is).
The second lemma showed that, if you start in a bivalent configuration then it is always possible to reach another bivalent configuration by delaying a pending message. This can continue forever – even though all messages are eventually delivered – and so no protocol can guarantee that a univalent state might ever be reached.
What does this mean practically? The chance of entering that infinite undeciding loop can be made arbitrarily small – it’s one thing for nondecision to be possible, another thing entirely for it to be guaranteed. Still, there is much to be said for keeping this possibility in mind when your system livelocks during consensus and you don’t know why. Also, in practice, real systems aren’t always truly asynchronous.
This paper spawned a huge amount of subsequent work in the theoretical distributed systems community. Failure detectors were introduced by Toureg et. al. to characterise just how much the asynchronous model needed to be strengthened to allow for correct consensus. Randomised algorithms exist that practically make the probability of nondecision 0 if run for long enough. A lot of proofs by reduction were published that showed the equivalence of some problem to consensus, and therefore its impossibility. Different system models were postulated to circumvent the FLP problem, including partial synchrony which has had a lot of academic success.
The paper won the 2001 Dijkstra prize for the most influential paper in distributed computing. All three authors, Michael Fischer, Nancy Lynch and Mike Paterson are still very active and extremely eminent researchers in the field. Nancy Lynch, in particular, seems to have been involved in almost every important piece of work in distributed algorithms.