<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper Trail</title>
	<atom:link href="http://the-paper-trail.org/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://the-paper-trail.org/blog</link>
	<description>Wading through academic treacle</description>
	<lastBuildDate>Tue, 27 Apr 2010 17:44:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>CAP confusion: Problems with Partition Tolerance</title>
		<link>http://the-paper-trail.org/blog/?p=290</link>
		<comments>http://the-paper-trail.org/blog/?p=290#comments</comments>
		<pubDate>Tue, 27 Apr 2010 17:44:29 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Distributed systems]]></category>
		<category><![CDATA[link]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=290</guid>
		<description><![CDATA[Over on the Cloudera blog I&#8217;ve written an article that should be of interest to readers of this blog. I&#8217;m no great fan of the ubiquity of the CAP theorem &#8211; it&#8217;s a solid impossibility result which appeals to the theorist in me, but it doesn&#8217;t capture every fundamental tension in a distributed system. For [...]]]></description>
			<content:encoded><![CDATA[<p>Over on the <a href="http://www.cloudera.com/blog">Cloudera blog</a> I&#8217;ve written an <a href="http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/">article</a> that should be of interest to readers of this blog. </p>
<p>I&#8217;m no great fan of the ubiquity of the CAP theorem &#8211; it&#8217;s a solid impossibility result which appeals to the theorist in me, but it doesn&#8217;t capture every fundamental tension in a distributed system. For example:  we make our systems distributed across more than one machine usually for reasons of performance and to eliminate a single point of failure. Neither of these motivations are captured verbatim by the CAP theorem. There&#8217;s more to designing distributed systems!</p>
<p>In this, I agree with Stonebraker; it&#8217;s the erroneous representation of &#8216;partition tolerance&#8217; that I found very strange. I&#8217;ve been a good deal more forceful in private about this than I have in public <img src='http://the-paper-trail.org/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=290</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache ZooKeeper is looking for Google Summer of Code applicants</title>
		<link>http://the-paper-trail.org/blog/?p=286</link>
		<comments>http://the-paper-trail.org/blog/?p=286#comments</comments>
		<pubDate>Wed, 24 Mar 2010 17:18:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=286</guid>
		<description><![CDATA[Students! Over at Apache ZooKeeper we&#8217;re looking for great students with a strong interest in distributed systems to work with us over the summer as part of Google&#8217;s Summer of Code, 2010. Summer of Code is a great program &#8211; providing stipends to students and more importantly connecting them with mentors in open source projects. [...]]]></description>
			<content:encoded><![CDATA[<p>Students! Over at <a href="http://hadoop.apache.org/zookeeper/">Apache ZooKeeper</a> we&#8217;re looking for great students with a strong interest in distributed systems to work with us over the summer as part of <a href="http://code.google.com/soc/">Google&#8217;s Summer of Code, 2010</a>. </p>
<p>Summer of Code is a great program &#8211; providing stipends to students and more importantly connecting them with mentors in open source projects. ZooKeeper has a number of <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&#038;pid=12310801&#038;customfield_12310260=gsoc">interesting projects</a> to get started on.</p>
<p>ZooKeeper is a distributed coordination platform on which you can build the distributed equivalents of many traditional concurrent primitives like locks, queues and barriers. It&#8217;s heavily used in the real world &#8211; Yahoo! use it extensively, and many other major companies rely on it.  The other committers and I are actively looking to increase participation in the project &#8211; there is <em>loads</em> of really interesting work left to do. If consensus protocols, distributed systems, scalability, fault-tolerance and performance are your thing, this is certainly the project for you.</p>
<p>If you have any questions at all, drop me a line at henry at apache dot org. </p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=286</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Images are back</title>
		<link>http://the-paper-trail.org/blog/?p=284</link>
		<comments>http://the-paper-trail.org/blog/?p=284#comments</comments>
		<pubDate>Fri, 29 Jan 2010 22:42:45 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Note]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=284</guid>
		<description><![CDATA[Thanks to those of you who bugged me about putting images back up. They should all (except one that&#8217;s causing me a bit of a headache) be back now, after I discovered copies of all in an dusty corner of my hard drive. Real post with actual content coming in the next week or so!]]></description>
			<content:encoded><![CDATA[<p>Thanks to those of you who bugged me about putting images back up. They should all (except one that&#8217;s causing me a bit of a headache) be back now, after I discovered copies of all in an dusty corner of my hard drive.</p>
<p>Real post with actual content coming in the next week or so!</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=284</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>&#8220;Well, I&#8217;m back.&#8221;</title>
		<link>http://the-paper-trail.org/blog/?p=257</link>
		<comments>http://the-paper-trail.org/blog/?p=257#comments</comments>
		<pubDate>Mon, 07 Dec 2009 06:35:42 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Note]]></category>
		<category><![CDATA[meta]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=257</guid>
		<description><![CDATA[Astute readers may have noticed that this blog was unavailable for the past three-and-a-half months. The short reason for this was that the computer on which Paper Trail was hosted was on a boat in the Atlantic and it is surprisingly hard to get a good internet connection out there, let alone a power supply [...]]]></description>
			<content:encoded><![CDATA[<p>Astute readers may have noticed that this blog was unavailable for the past three-and-a-half months. The short reason for this was that the computer on which Paper Trail was hosted was on a boat in the Atlantic and it is surprisingly hard to get a good internet connection out there, let alone a power supply to the shipping crate it was in.</p>
<p>The long reason was that I have now moved across the ocean, from Cambridge to San Francisco, where I am living in order that the commute to <a href="http://www.cloudera.com/">Cloudera</a> be a little shorter than 24 hours. My possessions finally arrived on Thursday &#8211; over three months since we shipped them &#8211; and we are finally getting everything in order, including getting this blog back online. I&#8217;m now hosted at <a href="http://www.bluehost.com">Bluehost</a>, and have transferred all the old posts over to the new WordPress installation. I don&#8217;t yet have access to the images, so unfortunately those wil have to wait, but most other things are in order. Please excuse the dust as I find out which links are broken. </p>
<p>The old address &#8211; http://hnr.dnsalias.net/wordpress &#8211; will still work, but the correct link is now our very own domain: <a href="http://the-paper-trail.org/">http://the-paper-trail.org/</a>. Hopefully we will start showing up in Google again when the crawlers do their thing. Please update your rss readers &#8211; those of you who are still following and haven&#8217;t deleted my feed in disgust. Does anyone even still use rss readers anymore?</p>
<p>I am in the process of deciding what to write about for the next few posts. I still have the remainder of the theory of computation posts to write, which would be fun to do but is a bit of a departure from the systems focus. I never really fully explained Byzantine Fault Tolerance &#8211; at least, I never got as far as describing Zyzzyva and other modern systems. At the same time, some interesting stuff in systems research has happened since the blog went quiet &#8211; Google released the <a href="http:/golang.org">Go</a> programming language which is intriguing for writing user-space systems software. <a href="http://www.sigops.org/sosp/sosp09/">SOSP 2009</a> happened, with some very cool papers which I really want to write about. And I&#8217;ve been busy myself &#8211; I was recently made a committer on the <a href="http://hadoop.apache.org/zookeeper">Apache ZooKeeper</a> project, which is a distributed coordination system written by some engineers and researchers at Yahoo!, and is very cool. My largest contribution was a <a href="http://issues.apache.org/jira/browse/ZOOKEEPER-368">patch</a> for &#8216;observers&#8217; &#8211; which are listeners, in Paxos terminology &#8211; which help maintain the read performance of the cluster as the number of clients scales. </p>
<p>So, lots going on, plenty to write about, and some exciting possibilities coming down the queue. Good to be back.</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=257</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>GFS Retrospective in ACM Queue</title>
		<link>http://the-paper-trail.org/blog/?p=256</link>
		<comments>http://the-paper-trail.org/blog/?p=256#comments</comments>
		<pubDate>Wed, 12 Aug 2009 20:01:11 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Distributed systems]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[link]]></category>
		<category><![CDATA[filesystems]]></category>
		<category><![CDATA[gfs]]></category>
		<category><![CDATA[google]]></category>

		<guid isPermaLink="false">http://hnr.dnsalias.net/wordpress/?p=256</guid>
		<description><![CDATA[This is a really great article. Sean Quinlan talks very openly and critically about the design of the Google File System given ten years of use (ten years!). What&#8217;s interesting is that the general sentiment seems to be that the concessions that GFS made for performance and simplicity (single master, loose consistency model) have turned [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://queue.acm.org/detail.cfm?id=1594206">This</a> is a really great article. Sean Quinlan talks very openly and critically about the design of the Google File System given ten years of use (ten years!).</p>
<p>What&#8217;s interesting is that the general sentiment seems to be that the concessions that GFS made for performance and simplicity (single master, loose consistency model) have turned out to probably be net bad decisions, although they probably weren&#8217;t at the time.</p>
<p>There are scaling issues with GFS &#8211; the well known <a href="http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/">many-small-files problem</a> that also plagues HDFS, and a similar huge-files problem. Both increase the amount of metadata that a master has to maintain, and both therefore degrade performance due to some linear costs on metadata operations.</p>
<p>A replacement for GFS is in the works &#8211; a complete re-do, not just incremental updates to the design. This will involve, at least, a rethink of metadata storage and clearly a more fault-tolerant and scalable design. Multiple appenders, also a bugbear of HDFS, may be serialised through a single writer process, which would help the consistency issues.</p>
<blockquote><p>&#8220;Our user base has definitely migrated from being a MapReduce-based world to more of an interactive world that relies on things such as BigTable. Gmail is an obvious example of that. Videos aren&#8217;t quite as bad where GFS is concerned because you get to stream data, meaning you can buffer. Still, trying to build an interactive database on top of a file system that was designed from the start to support more batch-oriented operations has certainly proved to be a pain point.&#8221;</p></blockquote>
<p>One of those great articles where every paragraph is rich with insight. Worth reading, probably twice.</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=256</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SOSP 2009 Program Available</title>
		<link>http://the-paper-trail.org/blog/?p=254</link>
		<comments>http://the-paper-trail.org/blog/?p=254#comments</comments>
		<pubDate>Mon, 29 Jun 2009 13:49:17 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[Note]]></category>
		<category><![CDATA[sosp]]></category>

		<guid isPermaLink="false">http://hnr.dnsalias.net/wordpress/2009/06/sosp-2009-program-available/</guid>
		<description><![CDATA[The accepted papers for SOSP 2009 are here. As ever, some excellent looking papers. If you search for the titles you can often turn up drafts or even the submitted versions. The best looking sessions to me are &#8216;scalability&#8217; and &#8216;clusters&#8217;, but there&#8217;s at least one great looking title in every session. I&#8217;ll start posting [...]]]></description>
			<content:encoded><![CDATA[<p>The accepted papers for SOSP 2009 are <a href="http://sigops.org/sosp/sosp09/program.html">here</a>. As ever, some excellent looking papers. If you search for the titles you can often turn up drafts or even the submitted versions.</p>
<p>The best looking sessions to me are &#8216;scalability&#8217; and &#8216;clusters&#8217;, but there&#8217;s at least one great looking title in every session. I&#8217;ll start posting some reviews once I find some bandwidth (and have finished the computation theory series &#8211; next one on its way).</p>
<p>Congratulations to all accepted authors!</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=254</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Introduction to Computation Theory &#8211; Part One</title>
		<link>http://the-paper-trail.org/blog/?p=221</link>
		<comments>http://the-paper-trail.org/blog/?p=221#comments</comments>
		<pubDate>Sun, 07 Jun 2009 15:16:17 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Computation Theory]]></category>
		<category><![CDATA[computer science]]></category>

		<guid isPermaLink="false">http://hnr.dnsalias.net/wordpress/?p=221</guid>
		<description><![CDATA[(This article represents a brief divergence from our usual focus on distributed systems and algorithms. Normal service will be resumed shortly.) One of the courses I do some teaching for is the dryly named &#8216;Theory of Computation&#8217;. Although such a course name is unlikely to pique a lot of interest, in actual fact it is [...]]]></description>
			<content:encoded><![CDATA[<p>(This article represents a brief divergence from our usual focus on distributed systems and algorithms. Normal service will be resumed shortly.)</p>
<p>One of the courses I do some teaching for is the dryly named &#8216;Theory of Computation&#8217;. Although such a course name is unlikely to pique a lot of interest, in actual fact it is probably the most fundamental and important course that you could take about computer science. Computation theory allows us to ask, and answer, questions about the limits of computers and the programs they execute. Is every problem solvable by a computer program? What do we even mean by a &#8216;problem&#8217;?</p>
<p>It turns out that there are profound and beautiful truths to be discovered when considering these questions. In a short series of articles, I&#8217;m going to give a brief introduction to the theory of computability, aimed at those who have never studied Turing machines, Cantor diagonalisation or Godel incompleteness. An appreciation for some of these ideas and results should give a perspective on what computers are, and are not, capable of.</p>
<p>This first article presents a bit of background material on the nature of infinity, which is right at the heart of computation theory. In the next article I&#8217;ll tie this in with the idea of computability and show how these ideas elegantly allow us to establish limits on what we can do with computers &#8211; even in theory, let alone practice.<br />
<span id="more-221"></span></p>
<h2>Different Types of Infinity</h2>
<p>A very important and fundamental question to computation theory is this: &#8220;how many unique computer programs are there?&#8221;</p>
<p>Although we&#8217;re not in a position to be precise about the details of such a question, we can at least give some thought to the kind of answers it might have. Are there just a finite number of computations &#8211; different executions with different results that a theoretical computer could perform?</p>
<p>It&#8217;s clear that this is unlikely. For example, we can conceive of a program that prints out a unique integer, and we can think of infinitely many programs that each print out a different integer. So it seems that there should be infinitely many programs, or computations. But let&#8217;s not be too hasty throwing that word &#8216;infinite&#8217; around. What do we understand by infinity?</p>
<p>Let&#8217;s get one thing straight right away. We can&#8217;t say that &#8216;infinity&#8217; is a quantity, or a number. We can&#8217;t add one to infinity. Infinity minus infinity is <em>meaningless</em>, not zero. Rather, infinity is a <em>predicate</em> on sets of things. If I describe to you a set, if we say that it does not contain a finite number of things then we say that the set has the <em>property</em> of being infinite. An infinite set is one that doesn&#8217;t have a number that we can call its size.</p>
<p>Bear with me a moment while I get pedantic. What does &#8216;size&#8217; mean? What&#8217;s the size of a set? Obviously, the size of a set is the number of items in it. But how do we arrive at that number? We count the items, one by one. So what we do is give every item in the set a label in order &#8217;1&#8242;, &#8217;2&#8242;, &#8217;3&#8242;&#8230; and so on until all the elements in the set have a label. The value of the last label that we assigned is what we call the size, or <em>cardinality</em> of a set. It&#8217;s the number of labels we need to uniquely label every element in the set. If a set is infinite, there is no finite number of labels that we can use to uniquely label every element. No matter what number of labels we use, there will always be elements (in fact, infinitely many of them) left over at the end that didn&#8217;t get a label. <em>This</em> is what infinite means. If we take a finite number of things out of an infinite set the set will never be empty, no matter how many we take out.</p>
<p>An example of an infinite set is the set of all words that are some number of As followed by the same number of Bs. For example, AB, AABB, AAABBB are all in the set. There are infinitely many words like this. How do we prove that a set is infinite? One way to do it is to take a set that we already know is infinite and show that for every element in our known set there is exactly one element in our candidate infinite set. This is similar to our labelling technique for finite sets &#8211; we take an infinite set and show that we can use it to label our candidate set. Since there are infinitely many labels from our infinite set, and we have shown that we need all of them to label our candidate set, we can conclude that our candidate set must itself be infinite, otherwise we would only need a finite number of labels to label it. By coming up with a labelling that uses every element of the infinite set, no matter what it is, we show that no finite labelling can exist.</p>
<p>So how do we apply this technique to our A&#8230;B&#8230; set? Well, we take a known infinite set, in our case the natural numbers <img src='http://s.wordpress.com/latex.php?latex=N%3D%7B1%2C2%2C3%2C%26%238230%3B%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='N={1,2,3,&#8230;}' title='N={1,2,3,&#8230;}' class='latex' /> and find a labelling. Here the labelling is easy to find. Let&#8217;s write every element of <img src='http://s.wordpress.com/latex.php?latex=A&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A' title='A' class='latex' /> as <img src='http://s.wordpress.com/latex.php?latex=A%5EnB%5En&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A^nB^n' title='A^nB^n' class='latex' /> where <img src='http://s.wordpress.com/latex.php?latex=A%5En&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A^n' title='A^n' class='latex' /> means &#8216;repeat <img src='http://s.wordpress.com/latex.php?latex=A&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A' title='A' class='latex' /> <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n' title='n' class='latex' /> times&#8217;. Then there is exactly one word in <img src='http://s.wordpress.com/latex.php?latex=A&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A' title='A' class='latex' /> for every positive integer <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n' title='n' class='latex' /> (negative values of <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n' title='n' class='latex' /> don&#8217;t make a lot of sense). This makes it very easy to come up with a labelling from <img src='http://s.wordpress.com/latex.php?latex=N&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='N' title='N' class='latex' /> to <img src='http://s.wordpress.com/latex.php?latex=A&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A' title='A' class='latex' />. A word <img src='http://s.wordpress.com/latex.php?latex=A%5EnB%5En&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A^nB^n' title='A^nB^n' class='latex' /> in <img src='http://s.wordpress.com/latex.php?latex=A&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A' title='A' class='latex' /> has a label <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n' title='n' class='latex' /> in <img src='http://s.wordpress.com/latex.php?latex=N&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='N' title='N' class='latex' />. For every natural number, there is exactly one word. Therefore <img src='http://s.wordpress.com/latex.php?latex=A&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A' title='A' class='latex' /> must be infinite!</p>
<p>This technique is the most useful when it comes to reasoning about the cardinality of various sets, and is hugely important when talking about computability. Computer scientists will call the act of finding a labelling <em>establishing a bijection between <img src='http://s.wordpress.com/latex.php?latex=A&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='A' title='A' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=N&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='N' title='N' class='latex' /></em>. A bijection is simply a precise word for a relationship that pairs elements of two sets together such that every element of each set has a unique partner in the other.</p>
<p>Now, we have taken as given that the set of integers is infinite, and no-one would argue with us. Is that the end of the story? Do all infinite sets have a bijection with the integers?</p>
<p>It turns out that the answer is no, not all infinite sets can be placed in correspondence with the integers. This result is extremely famous, and due to <a href="http://en.wikipedia.org/wiki/Georg_Cantor">Georg Cantor</a>, a man who spent much of his mathematical career looking into infinity and drove himself mad as a consequence.</p>
<h2>&#8216;Smaller&#8217;, But Still The Same Size</h2>
<p>Let&#8217;s consider two different sets and challenge our intuitions about their size as compared to the integers. Firstly, consider the even integers. Is the set of even integers infinite? We would think so, because no matter what finite collection of them we can put together, we can always think of one that we&#8217;ve left out. This is a giveaway property of an infinite set. But at the same time, it seems like there should be fewer even integers than all the integers combined &#8211; intuitively we would expect there to be twice as many odd and even integers as just the even ones. But &#8216;twice as many&#8217; is a murky concept when dealing with infinity, which as we discussed earlier doesn&#8217;t yield to normal arithmetic.</p>
<p>Indeed, an attempt to find a bijection between even integers and the whole set of integers turns up a correspondence pretty quickly. Recall that we can write every even integer in the form <img src='http://s.wordpress.com/latex.php?latex=2x&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='2x' title='2x' class='latex' /> for all <img src='http://s.wordpress.com/latex.php?latex=x%5Cin%20%5Cmathbb%7BN%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='x\in \mathbb{N}' title='x\in \mathbb{N}' class='latex' />. This trivially gives us our bijection &#8211; label every even integer with the integer you get by dividing it by two. Conversely, every integer has a corresponding even integer that is the result of multiplying the integer by two. So 1 maps to 2, 7 maps to 14 and 316 maps to 632.</p>
<p>So it turns out that, completely counter-intuitively, there are as many integers as there are <em>both even and odd integers</em>! Infinity does not always behave as we expect. In particular it&#8217;s not sufficient to show that one set is a proper subset of another in order to demonstrate their different sizes when both sets are infinite.</p>
<h2>Bigger Than Infinite?</h2>
<p>I said earlier that there <em>are</em> infinite sets that are not of the same size as the integers. We&#8217;ve tried to find a set that was smaller, but had no luck. Let&#8217;s consider an apparently larger set and see where we get.</p>
<p>The <em>real numbers</em> are a good candidate. The real numbers are fairly complex to formally define, but we know that they include all the integers, as well as all the numbers with decimal expansions between the integers. Indeed, some real numbers have expansions that are infinitely long &#8211; a fact that will prove very important when we talk about the limits of Turing machines.</p>
<p>Is there a bijection between real numbers and integers? This is a tricky question, and required a genuine bit of ingenuity to resolve. The problem is that when a bijection is not obviously forthcoming (it&#8217;s hard to write each real number in terms of a single integer) we have to look in the other direction and prove that there is no such bijection. This is a good deal harder &#8211; a positive proof just requires us to come up with a single exemplar, whereas a negative proof has to be valid for every single possible bijection!</p>
<p>Cantor&#8217;s attack on the negative proof went as follows. Imagine there is such a bijection. Then we could conceive of a list with all the real numbers in order, as labelled by their integer counterparts. So at the top would be real number 1, followed by real number 2 and so on. Obviously, the list would be infinite so we couldn&#8217;t write it down. That&#8217;s no problem, we just have to imagine it.</p>
<p>Note at this point how the list doesn&#8217;t depend at all on the nature of the bijection itself &#8211; no matter what the bijection is, we can construct this list. This generality is crucial for easily dismissing the entire set of bijections at once.</p>
<p>The next step is the neat one. If this list was a bijection, it would contain every single real number. So if we could come up with some real number that wasn&#8217;t in the list, we would show that it was not a bijection. Cantor&#8217;s great insight was to see that there is a way to construct such a real number by showing that it could not be equal to any real number in the list.</p>
<p>Cantor imagined the list of real numbers as a table, with the rows being real numbers, and the columns being the individual digits of each real number. So the 3rd column would contain the 3rd digits of every real number. Therefore we can identify the <img src='http://s.wordpress.com/latex.php?latex=j%5E%7Bth%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='j^{th}' title='j^{th}' class='latex' /> digit from the <img src='http://s.wordpress.com/latex.php?latex=i%5E%7Bth%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='i^{th}' title='i^{th}' class='latex' /> real number by talking about the digit <img src='http://s.wordpress.com/latex.php?latex=d_%7Bij%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='d_{ij}' title='d_{ij}' class='latex' />.</p>
<p>To keep the presentation straightforward, we&#8217;re going to consider only the real numbers between 0 and 1 &#8211; if there are more of even these kinds of number than real numbers, we&#8217;ll still have established the result.</p>
<table cellpadding=5>
<tr>
<td></td>
<td></td>
<td>Digit 1</td>
<td>Digit 2</td>
<td>Digit 3</td>
<td>Digit 4</td>
<td>&#8230;</td>
</tr>
<tr>
<td><img src='http://s.wordpress.com/latex.php?latex=r_1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_1' title='r_1' class='latex' /></td>
<td align='center'>0.</td>
<td align='center'>7</td>
<td align='center'>3</td>
<td align='center'>9</td>
<td align='center'>4</td>
<td align='center'>&#8230;</td>
</tr>
<tr>
<td><img src='http://s.wordpress.com/latex.php?latex=r_2&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_2' title='r_2' class='latex' /></td>
<td align='center'>0.</td>
<td align='center'>9</td>
<td align='center'>3</td>
<td align='center'>2</td>
<td align='center'>8</td>
<td align='center'>&#8230;</td>
</tr>
<tr>
<td><img src='http://s.wordpress.com/latex.php?latex=r_3&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_3' title='r_3' class='latex' /></td>
<td align='center'>0.</td>
<td align='center'>4</td>
<td align='center'>2</td>
<td align='center'>9</td>
<td align='center'>8</td>
<td align='center'>&#8230;</td>
</tr>
<tr>
<td>&#8230;</td>
</tr>
</table>
<p>This table is the first few rows of an example bijection.</p>
<p>Now the key observation here is that, for a real number <img src='http://s.wordpress.com/latex.php?latex=r_p&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_p' title='r_p' class='latex' /> to be different from <img src='http://s.wordpress.com/latex.php?latex=r_q&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_q' title='r_q' class='latex' /> it is enough for <img src='http://s.wordpress.com/latex.php?latex=r_p&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_p' title='r_p' class='latex' /> to differ from <img src='http://s.wordpress.com/latex.php?latex=r_q&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_q' title='r_q' class='latex' /> at just one digit. That is, as long as there exists a <img src='http://s.wordpress.com/latex.php?latex=k&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='k' title='k' class='latex' /> for which <img src='http://s.wordpress.com/latex.php?latex=d_%7Bpk%7D%20%5Cneq%20d_%7Bqk%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='d_{pk} \neq d_{qk}' title='d_{pk} \neq d_{qk}' class='latex' /> then <img src='http://s.wordpress.com/latex.php?latex=r_p%20%5Cneq%20r_q&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_p \neq r_q' title='r_p \neq r_q' class='latex' />.</p>
<p>So to find a real number that is not in the list, we just have to find a way to construct one that is different from every real number in the list at at least one digit.</p>
<p>Obviously it&#8217;s no good just changing a single digit &#8211; say we set the 3rd digit of our real number to 7 &#8211; because it&#8217;s guaranteed that there are some real numbers whose 3rd digit is 7 and we don&#8217;t know whether or not the rest of our real number matches them.</p>
<p>However, just as there are an infinite number of real numbers in our list, there are infinitely many columns in our table &#8211; so infinitely many digits to play with. So if we just make sure that each digit is different from its counterpart in one real number each, that will be sufficient because there are enough digits to go around. We will construct a real number <img src='http://s.wordpress.com/latex.php?latex=C%3Dc_1c_2c_3%5Chdots&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='C=c_1c_2c_3\hdots' title='C=c_1c_2c_3\hdots' class='latex' /> so that each digit <img src='http://s.wordpress.com/latex.php?latex=c_k&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='c_k' title='c_k' class='latex' /> of <img src='http://s.wordpress.com/latex.php?latex=C&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='C' title='C' class='latex' /> is different from digit <img src='http://s.wordpress.com/latex.php?latex=d_%7Bik%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='d_{ik}' title='d_{ik}' class='latex' /> of the <img src='http://s.wordpress.com/latex.php?latex=i%5E%7Bth%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='i^{th}' title='i^{th}' class='latex' /> real number in the list, and that <img src='http://s.wordpress.com/latex.php?latex=i&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='i' title='i' class='latex' /> is different for all <img src='http://s.wordpress.com/latex.php?latex=c_k&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='c_k' title='c_k' class='latex' />.</p>
<p>There are two challenges here. The first is to come up with a method that lets us always pick a digit that is different. The second is to come up with a mapping &#8211; in fact, another bijection &#8211; between the columns and the rows so that every digit in our constructed real number is responsible for being different from exactly one real number in the table.</p>
<p>Both challenges are easily resolved. If the digit we wish to be different from is <img src='http://s.wordpress.com/latex.php?latex=x&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='x' title='x' class='latex' />, then it&#8217;s simple to construct <img src='http://s.wordpress.com/latex.php?latex=y%20%5Cneq%20x&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='y \neq x' title='y \neq x' class='latex' /> by adding one to <img src='http://s.wordpress.com/latex.php?latex=x&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='x' title='x' class='latex' /> (and if <img src='http://s.wordpress.com/latex.php?latex=x&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='x' title='x' class='latex' /> is 9, wrapping <img src='http://s.wordpress.com/latex.php?latex=y&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='y' title='y' class='latex' /> around to 0).</p>
<p>We can also neatly deal with the second challenge. In our constructed real number <img src='http://s.wordpress.com/latex.php?latex=C&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='C' title='C' class='latex' />, we&#8217;ll give digit <img src='http://s.wordpress.com/latex.php?latex=c_1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='c_1' title='c_1' class='latex' /> the responsibility of being different from the 1st real number in the list, at digit <img src='http://s.wordpress.com/latex.php?latex=d_%7B11%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='d_{11}' title='d_{11}' class='latex' />. Similarly for <img src='http://s.wordpress.com/latex.php?latex=c_2%20%5Cneq%20d_%7B22%7D&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='c_2 \neq d_{22}' title='c_2 \neq d_{22}' class='latex' />. And so on. Every digit of <img src='http://s.wordpress.com/latex.php?latex=C&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='C' title='C' class='latex' /> causes it to be different from a unique real number in the list. By combining all these digits together, we get a new, infinitely long real number that is different from <em>every single listed real number</em>.</p>
<table cellpadding=5>
<tr>
<td></td>
<td></td>
<td>Digit 1</td>
<td>Digit 2</td>
<td>Digit 3</td>
<td>Digit 4</td>
<td>&#8230;</td>
</tr>
<tr>
<td><img src='http://s.wordpress.com/latex.php?latex=r_1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_1' title='r_1' class='latex' /></td>
<td align='center'>0.</td>
<td align='center' bgcolor='lightgreen'>7</td>
<td align='center'>3</td>
<td align='center'>9</td>
<td align='center'>4</td>
<td align='center'>&#8230;</td>
</tr>
<tr>
<td><img src='http://s.wordpress.com/latex.php?latex=r_2&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_2' title='r_2' class='latex' /></td>
<td align='center'>0.</td>
<td align='center'>9</td>
<td align='center' bgcolor='lightgreen'>3</td>
<td align='center'>2</td>
<td align='center'>8</td>
<td align='center'>&#8230;</td>
</tr>
<tr>
<td><img src='http://s.wordpress.com/latex.php?latex=r_3&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r_3' title='r_3' class='latex' /></td>
<td align='center'>0.</td>
<td align='center'>4</td>
<td align='center'>2</td>
<td align='center' bgcolor='lightgreen'>9</td>
<td align='center'>8</td>
<td align='center'>&#8230;</td>
</tr>
<tr>
<td>&#8230;</td>
</tr>
</table>
<p>In the case of our example bijection, we would have the beginnings of <img src='http://s.wordpress.com/latex.php?latex=C&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='C' title='C' class='latex' /> as 0.840&#8230; &#8211; different from 0.7394&#8230; at digit 1, different from 0.9328&#8230; at digit 2 and different from 0.4298&#8230; at digit 3. Different from every real number in the table!</p>
<p>This concludes the proof. What we have shown is that there is no bijection between the real numbers and the integers. No matter what the bijection, we can come up with some real number that&#8217;s not been included. Therefore we can conclude that there <em>are</em> more real numbers than integers.</p>
<p>The shape that the &#8216;differing digits&#8217; make in the table of real numbers moves along the diagonal. This is why this is called &#8216;Cantor&#8217;s diagonal proof&#8217;. In actual fact, the diagonal pattern is just one of infinitely many bijections we could have come up with. It is the simplest, and the easiest to understand.</p>
<h2>Two Kinds of Infinity</h2>
<p>So this leads us to the strange idea that there are two different kinds of infinity. The first kind of infinite sets, like the integers, are called <em>countably infinite</em> to emphasise the idea that they can be placed in one-to-one correspondence with the integers, an action that we cal &#8216;counting&#8217;. The second kind of infinite sets that we have discovered, like the real numbers, are called <em>uncountably infinite</em> by way of contrast. Uncountably infinite sets are unimaginably more vast than countable ones. Between any two real numbers, there are uncountably infinite intermediate real numbers &#8211; more than there are integers. There are more real numbers between 0 and 1 than there are integers.</p>
<p>This contrast will play a significant role in the development of a theory of computation. We will be able to identify the number of possible computations with a countably infinite set, and therefore describe, using our established language of infinity, computations that our best efforts cannot perform, not even with all the resources in the world.</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=221</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Miss me?</title>
		<link>http://the-paper-trail.org/blog/?p=230</link>
		<comments>http://the-paper-trail.org/blog/?p=230#comments</comments>
		<pubDate>Thu, 28 May 2009 23:02:43 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Note]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[Distributed systems]]></category>
		<category><![CDATA[zookeeper]]></category>

		<guid isPermaLink="false">http://hnr.dnsalias.net/wordpress/?p=230</guid>
		<description><![CDATA[I&#8217;ve been away from this blog for far longer than I hoped. I&#8217;ve got plenty of content queued up though, so in the next few days I should start to get my postings back in order. Rumours of my death are greatly exaggerated; of my gainful employment less so. That, clearly, has been taking up [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been away from this blog for far longer than I hoped. I&#8217;ve got plenty of content queued up though, so in the next few days I should start to get my postings back in order. Rumours of my death are greatly exaggerated; of my <a href="http://www.cloudera.com">gainful employment</a> less so. That, clearly, has been taking up some time, but I&#8217;m getting into my groove now.</p>
<p>A small conciliatory nugget that regular readers of this blog might find interesting: a trademark Robinson monster post on Cloudera&#8217;s blog describing how to build a <a href="http://bit.ly/8JzdC">simple distributed queue with Apache ZooKeeper</a>.</p>
<p>More on ZooKeeper and our usual distributed systems programming to follow.</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=230</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>In Which I Prove Employable</title>
		<link>http://the-paper-trail.org/blog/?p=223</link>
		<comments>http://the-paper-trail.org/blog/?p=223#comments</comments>
		<pubDate>Thu, 09 Apr 2009 07:51:54 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Note]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[job]]></category>

		<guid isPermaLink="false">http://hnr.dnsalias.net/wordpress/?p=223</guid>
		<description><![CDATA[Although I try and keep personal information to a relative minimum on this blog, here&#8217;s some news that&#8217;s relevant. Recently I accepted an offer to start work at Cloudera, a young company in the San Francisco area. Initially I&#8217;ll be working from the UK, with a view to a permanent move out to California when [...]]]></description>
			<content:encoded><![CDATA[<p>Although I try and keep personal information to a relative minimum on this blog, here&#8217;s some news that&#8217;s relevant. Recently I accepted an offer to start work at <a href="http://www.cloudera.com">Cloudera</a>, a young company in the San Francisco area. Initially I&#8217;ll be working from the UK, with a view to a permanent move out to California when timing and visas allow.</p>
<p><a href="http://www.cloudera.com">Hadoop is Cloudera&#8217;s business</a>. Hadoop is an open-source implementation of Google&#8217;s MapReduce  Cloudera provides support for Hadoop, and <a href="http://www.cloudera.com/distribution">their own fully supported distribution</a> of the Hadoop toolset. <a href="http://wiki.apache.org/hadoop/">Hadoop</a> allows very large scale distributed and parallel processing of huge data sets in a fault-tolerant and efficient manner. Data sets are getting bigger all the time, and there&#8217;s a mismatch now between the desire and the ability of companies to handle and process all those data. Cloud computing, at least in the form of dynamic provision of computing resources, and distributed processing technologies such as Hadoop are helping to make this problem tractable. Cloudera provides the expertise and the experience to help businesses make best use of this seriously powerful tech.</p>
<p>I&#8217;m joining, as you might expect, as a distributed systems engineer. There are plenty of interesting problems to attack, both in Hadoop and the ecosystem of technologies that support and extend it, such as <a href="http://hadoop.apache.org/core/docs/current/hdfs_design.html">HDFS</a>, <a href="http://hadoop.apache.org/hbase/">Hbase</a>, <a href="http://wiki.apache.org/pig/">Pig</a>, <a href="http://wiki.apache.org/hadoop/Hive">Hive</a> and <a href="http://wiki.apache.org/hadoop/ZooKeeper">ZooKeeper</a>. Cloudera already has a killer <a href="http://www.cloudera.com/about">team</a> (see some of the press from the <a href="http://bits.blogs.nytimes.com/2009/03/16/bottling-the-magic-behind-google-and-facebook/">New York Times</a>), and I&#8217;m really looking forward to being a part of it.</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=223</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Barbara Liskov&#039;s Turing Award, and Byzantine Fault Tolerance</title>
		<link>http://the-paper-trail.org/blog/?p=211</link>
		<comments>http://the-paper-trail.org/blog/?p=211#comments</comments>
		<pubDate>Mon, 30 Mar 2009 14:53:08 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Distributed systems]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[byzantine fault tolerance]]></category>
		<category><![CDATA[consensus]]></category>

		<guid isPermaLink="false">http://hnr.dnsalias.net/wordpress/?p=211</guid>
		<description><![CDATA[Barbara Liskov has just been announced as the recipient of the 2008 Turing Award, which is one of the most important prizes in computer science, and can be thought of as our field&#8217;s equivalent to the various Nobel Prizes. Professor Liskov is a worthy recipient of the award, even if judged alone by her citation [...]]]></description>
			<content:encoded><![CDATA[<p><a href='www.pmg.csail.mit.edu/~liskov/'>Barbara Liskov</a> has just been announced as the recipient of the <a href="http://www.acm.org/press-room/news-releases/turing-award-08/">2008 Turing Award</a>, which is one of the most important prizes in computer science, and can be thought of as our field&#8217;s equivalent to the various Nobel Prizes. Professor Liskov is a worthy recipient of the award, even if judged alone by her <a href='http://awards.acm.org/citation.cfm?id=1108679&#038;srt=all&#038;aw=140&#038;ao=AMTURING'>citation</a> which lists a number of the important contributions she has made to operating systems, programming languages and distributed systems.</p>
<p>Professor Liskov seems to be particularly well known for the <a href='http://en.wikipedia.org/wiki/Liskov_substitution_principle'>Liskov substitution principle</a> which says that some property of a supertype ought to hold of its subtypes. I&#8217;m not in any position to speak as to the importance of this contribution. However, her more recent work has been regarding the tolerance of Byzantine failures in distributed systems, which is much more close to my heart.</p>
<p>The only work of Liskov&#8217;s that I am really familiar with is the late 90s work on <a href='http://www.pmg.lcs.mit.edu/~castro/osdi99_html/osdi99.html'>Practical Byzantine Fault Tolerance</a> with <a href='http://research.microsoft.com/en-us/um/people/mcastro/'>Miguel Castro</a> and is first published in <a href='http://www.pmg.lcs.mit.edu/papers/osdi99.pdf'>this OSDI &#8217;99 paper</a>. I&#8217;m not going to do a full review, but the topic sits so nicely with my recent focus on consensus protocols that it makes sense to briefly discuss its importance.<br />
<span id="more-211"></span></p>
<h2>Towards Byzantium</h2>
<p>Remember that in the articles on <a href='http://hnr.dnsalias.net/wordpress/?p=173'>Paxos</a>, we were able to deal with two kinds of failure: fail-stop and fail-recover. Although these two failure modes capture a lot of the possible failures in a distributed system there is a yet more general failure mode which captures all possible kinds of errors. <i>Byzantine failures</i> occur when a participant in a protocol deviates arbitrarily from the specified protocol, and thus cannot be relied upon to do anything useful. Worse than that, Byzantine nodes may often look like they are non-faulty up until a crucial point where they might, for example, lie about the result of a transaction, or send two different responses to identical requests, or simply fail to respond. Although Byzantine failures are often equated with malicious attacks, where an attacker has gained control of a node and is controlling its behaviour, in fact many Byzantine failures occur due to programming errors in the implementation of a protocol (Amazon were susceptible to a Byzantine failure recently when a bit-flip caused a gossipped message to be incorrectly propogated, <a href="http://status.aws.amazon.com/s3-20080720.html">taking down S3 for hours</a>).</p>
<p>As we&#8217;ve talked about elsewhere on this blog, consensus is a hugely important primitive for distributed algorithms. When Byzantine failures come into the picture consensus takes even more of the centre stage. Every action that a replica is instructed to perform has to be validated by every fault-free replica because there&#8217;s a strong possibility that the coordinator is itself faulty and has told different nodes to do different things. So nodes must run a consensus protocol to agree on what they were asked to do, let alone the result of doing it.</p>
<p>Of course, standard consensus algorithms won&#8217;t do as they themselves are not Byzantine fault tolerant. So building Byzantine fault tolerant systems is often reduced to building a Byzantine fault tolerant consensus primitive, and using that to build replicated state machines in the usual style.</p>
<p>When we talk about BFT consensus we often distinguish a <em>co-ordinator</em> node responsible for initiating the consensus protocol and proposing the initial value, which is usually received from an external client. The other nodes are replicas, as normal. There&#8217;s nothing particularly special about the co-ordinator, as it still functions like a replica for the rest of the protocol. However, we can show that if the co-ordinator is definitely not faulty then we can execute a more relaxed protocol; more on this later.</p>
<p>BFT consensus has similar requirements to standard consensus: validity, which is that only a proposed value can be decided upon, termination and agreement. Agreement says that all correct nodes must agree upon the same value &#8211; this is the same as for standard consensus, but the definition of faulty here is widened to include Byzantine faults. Validity also requires a small relaxation &#8211; if the co-ordinator is faulty then we allow correct nodes to agree upon some default &#8216;no-op&#8217; value. This is because if the co-ordinator is faulty it&#8217;s impossible to tell what the original request from the client was, and so the safe action is to do nothing. If the co-ordinator is non-faulty, then validity is required as normal. This of course implies that correct replicas must be able to detect a faulty co-ordinator.</p>
<p>How might faults appear in an instance of Byzantine consensus? There are the standard failure modes that we know about &#8211; messages might take a long time to be processed, due to a fail-recover fault, or replicas might crash. More perniciously, replicas might falsify messages to other replicas. The co-ordinator might tell half the replicas that value A has been proposed, and half the replicas that value B was. Faulty replicas in either half might then lie to other replicas about the value that it received from the co-ordinator. The task of a Byzantine fault-tolerant consensus algorithm is to figure out if the co-ordinator is non-faulty, and if so what values it sent to the good replicas. This is a difficult problem &#8211; how, with no a priori notion of trust, do you tell who is lying?</p>
<p><a href="http://research.microsoft.com/en-us/um/people/lamport/">Leslie Lamport</a>, along with Shostak and Pease, wrote <a href="http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#byz">the seminal paper on the subject of BFT</a>. Indeed, they introduced the metaphor of the &#8216;Byzantine Generals&#8217;, deciding whether to attack or retreat from their distributed camp sites in the presence of potential traitors, and therefore misinformation.</p>
<p>Their paper set out two important results. The first is the most well known: that to tolerate completely arbitrary Byzantine failures in a network where Byzantine replicas may lie not only about their own state, but what they have heard from other replicas, at least <img src='http://s.wordpress.com/latex.php?latex=3f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='3f+1' title='3f+1' class='latex' /> nodes total are required to tolerate <img src='http://s.wordpress.com/latex.php?latex=f&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='f' title='f' class='latex' /> faults.</p>
<p>This is an extremely famous result, and one that is quoted time and time again in the literature &#8211; often without a full understanding of why it holds and in what conditions. The proof itself will be the subject of a later post, but the intuition comes from considering the <img src='http://s.wordpress.com/latex.php?latex=2f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='2f+1' title='2f+1' class='latex' /> case when <img src='http://s.wordpress.com/latex.php?latex=f%3D1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='f=1' title='f=1' class='latex' />. In order to reach agreement with three nodes, it can be seen very easily that every replica must send every other replica a message containing the proposal it got from the co-ordinator. If only the co-ordinator could be faulty then the replicas can figure this out very quickly as, when they compare notes, they will see that both received different proposals. However, if one of the replicas could be faulty then the good replica will still see the same message from the faulty replica: both values that they claim to have received from the co-ordinator are different. Since the good replica can&#8217;t distinguish this situation from when the co-ordinator is faulty, it can&#8217;t decide who to believe and therefore can&#8217;t decide consistently. The paper then goes on to show that any purported solution which uses fewer than <img src='http://s.wordpress.com/latex.php?latex=3f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='3f+1' title='3f+1' class='latex' /> replicas could be used to solve the three-replica, one fault case. Therefore, any correct solution must involve <img src='http://s.wordpress.com/latex.php?latex=3f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='3f+1' title='3f+1' class='latex' /> replicas or more. The paper gives one such solution.</p>
<p>The second result is that, if nodes may sign their messages to other nodes in a non-forgeable way so that Byzantine nodes cannot lie about what they have heard, there is an algorithm for Byzantine consensus in a synchronous network that tolerates <img src='http://s.wordpress.com/latex.php?latex=f%20%3D%20n-2&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='f = n-2' title='f = n-2' class='latex' /> failures. This is dramatically better than the <img src='http://s.wordpress.com/latex.php?latex=3f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='3f+1' title='3f+1' class='latex' /> bound, but comes at the cost of an expensive protocol in terms of the number of rounds needed to execute.</p>
<p>It&#8217;s difficult to overestimate how influential this paper has been. All subsequent BFT papers that I&#8217;ve read (quite a few!) have characterised their solutions similarly, in terms of the fraction of faulty replicas that their algorithm will tolerate. The idea of having replicas sign their messages, so that other replicas could not lie about what they had heard is highly practical given the advent of public-key cryptography, and clearly greatly strengthens the system.</p>
<p>But Lamport&#8217;s paper only dealt with synchronous signed-message networks. What about asynchronous ones? Treatment of these had to wait until a few years later for a paper by Bracha and Toueg. They showed that, even if replicas could not forge responses from other replicas, there was still a fundamental <img src='http://s.wordpress.com/latex.php?latex=3f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='3f+1' title='3f+1' class='latex' /> lower bound on the number of replicas. The proof was by construction of an arrangement of replicas such that correct replicas couldn&#8217;t decide between good and bad responses under two different, but identical from the perspective of correct replicas, executions.</p>
<p>The intuition behind this result is that each correct replica must hear from <img src='http://s.wordpress.com/latex.php?latex=2f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='2f+1' title='2f+1' class='latex' /> other replicas in the worst case, so that the <img src='http://s.wordpress.com/latex.php?latex=f&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='f' title='f' class='latex' /> faulty replicas that may be lying about the proposed value that they heard can be outvoted. So the number of replicas must be at least <img src='http://s.wordpress.com/latex.php?latex=r%20%5Cgeq%202f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='r \geq 2f+1' title='r \geq 2f+1' class='latex' />.</p>
<p>At the same time, a replica can only wait to hear from at most <img src='http://s.wordpress.com/latex.php?latex=n-f&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n-f' title='n-f' class='latex' /> replicas (where <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n' title='n' class='latex' /> is the total number of replicas). Why? Because in an asynchronous system it&#8217;s possible either that a) <img src='http://s.wordpress.com/latex.php?latex=f&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='f' title='f' class='latex' /> replicas are faulty and have failed to reply or b) <img src='http://s.wordpress.com/latex.php?latex=f&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='f' title='f' class='latex' /> replicas are <em>not faulty</em>, but <em>haven&#8217;t replied yet</em>. A correct replica can&#8217;t distinguish between these cases, and so can&#8217;t wait for an unbounded amount of time to get all <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n' title='n' class='latex' /> replies, which might never be forthcoming. If the network were synchronous, a good replica could detect that the <img src='http://s.wordpress.com/latex.php?latex=f&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='f' title='f' class='latex' /> replicas that had failed to reply must be faulty, which means that the replies it had already received were all from the other correct replicas. In an asynchronous network, this is not possible.</p>
<p>Therefore we have that <img src='http://s.wordpress.com/latex.php?latex=n%20-f%20%5Cgeq%20r%20%5Cgeq%202f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n -f \geq r \geq 2f+1' title='n -f \geq r \geq 2f+1' class='latex' />, or that <img src='http://s.wordpress.com/latex.php?latex=n%20%5Cgeq%203f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='n \geq 3f+1' title='n \geq 3f+1' class='latex' />.</p>
<h2>Practical Byzantine Fault Tolerance</h2>
<p>Miguel Castro and Professor Liskov had a 1999 OSDI paper called <a href="">Practical Byzantine Fault Tolerance</a>. Their contribution was to develop a Byzantine fault tolerant consensus protocol that was both efficient and applicable to realistic scenarios.</p>
<p>Liskov and Castro were the first to propose a correct algorithm that worked efficiently in asynchronous networks, and that realised the <img src='http://s.wordpress.com/latex.php?latex=3f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='3f+1' title='3f+1' class='latex' /> lower bound. <a href="http://hnr.dnsalias.net/wordpress/?p=49">FLP impossibility</a> tells us that consensus is in general impossible to solve in asynchronous networks with even just fail-stop failures, rather than the more general Byzantine failures, so Liskov and Castro&#8217;s PBFT took a similar approach to <a href="http://hnr.dnsalias.net/wordpress/?p=173">Paxos</a> and guaranteed liveness only when the network was synchronous. During periods of asynchrony PBFT may not terminate, but will never violate its safety properties (which are, as ever, validity and agreement).</p>
<p>PBFT&#8217;s other advantages were its high performance &#8211; owing in part to the use of the more efficient <a href="http://en.wikipedia.org/wiki/Message_authentication_code">Message Authentication Codes</a> rather than public-key cryptography for fast message signing &#8211; and its ability to tolerate an unbounded number of faults over the lifetime of execution, as long as no more than <img src='http://s.wordpress.com/latex.php?latex=f&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='f' title='f' class='latex' /> were concurrent. PBFT provides the familiar state-machine abstract interface, with a primary through which requests go that proposes a sequence number for every operation. The correct replicas agree (via Byzantine consensus) on the sequence number, and then commit the requests in sequence order.</p>
<p>The protocol itself is reasonably simple, but I&#8217;m not going to describe it in detail here as that would require a lot of groundwork to prove some basic results and show why it&#8217;s correct: this will be the subject of a future series of posts, as there&#8217;s a lot of <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=72404">interesting work in BFT still being done</a>. The basic idea of PBFT is to have the primary broadcast a request to all replicas, which then retransmit what they have heard to every other replica. If all replicas agree on the same operation, then the primary is currently correct, and the replicas broadcast a commit message to each other, much like a standard three-phase commit protocol. In fact, replicas will proceed once they have got <img src='http://s.wordpress.com/latex.php?latex=2f%2B1&#038;bg=000000&#038;fg=ffffff&#038;s=0' alt='2f+1' title='2f+1' class='latex' /> identical replies in the first phase, which is a <a href="">Byzantine quorum</a>, any two of which will be guaranteed to contain at least one correct replica in common. The idea is to ensure that the primary can&#8217;t have two conflicting sequence numbers for a single request accepted (since this would require a correct replica to accept both, which it will not do).</p>
<p>If, however, the primary turns out to be faulty, then a <em>view-change</em> protocol has to be executed, which causes all replicas to stop taking requests from the primary, and elect a new one in its stead. Getting this protocol right involves taking checkpoints of mutually agreed upon histories, since a view-change might occur during a request, then agreeing on the new leader and restarting any pending requests. All the while making sure that faulty replicas can&#8217;t hijack the process.</p>
<p>If you&#8217;re interested, the best paper is not the OSDI one, but the one in <a href="http://research.microsoft.com/en-us/um/people/mcastro/publications/p398-castro-bft-tocs.pdf">Transactions on Computer Systems</a>, which is a longer paper and an easier read.</p>
<p>PBFT has since been update by a variety of different work, but the main structure of the protocol has yet to be fundamentally improved upon. There is still some distance to go before Byzantine fault tolerance techniques habitually make their way into production systems &#8211; there&#8217;s still increased complexity, cost and performance issues to worry about &#8211; but when they do, PBFT will doubtless have been a tremendously important step in the right direction.</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/?feed=rss2&amp;p=211</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
