<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper Trail</title>
	<atom:link href="http://the-paper-trail.org/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://the-paper-trail.org/blog</link>
	<description>Wading through academic treacle</description>
	<lastBuildDate>Tue, 15 May 2012 08:50:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>EuroSys 2012 blog notes</title>
		<link>http://the-paper-trail.org/blog/eurosys-2012-blog-notes/</link>
		<comments>http://the-paper-trail.org/blog/eurosys-2012-blog-notes/#comments</comments>
		<pubDate>Mon, 16 Apr 2012 01:20:33 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=374</guid>
		<description><![CDATA[EuroSys 2012 was last week &#8211; one of the premier European systems conferences. Over at the Cambridge System Research Group&#8217;s blog, various people from the group have written notes on the papers presented. They&#8217;re very well-written summaries, and worth checking out for an overview of the research presented. Day 1 Day 2 Day 3]]></description>
			<content:encoded><![CDATA[<p>EuroSys 2012 was last week &#8211; one of the premier European systems conferences. Over at the Cambridge System Research Group&#8217;s <a href="http://www.syslog.cl.cam.ac.uk/">blog</a>, various people from the group have written notes on the papers presented. They&#8217;re very well-written summaries, and worth checking out for an overview of the research presented.</p>
<ul>
<li><a href="http://www.syslog.cl.cam.ac.uk/2012/04/11/liveblog-eurosys-2012-day-1/">Day 1</a></li>
<li><a href="http://www.syslog.cl.cam.ac.uk/2012/04/12/liveblog-eurosys-2012-day-2/">Day 2</a></li>
<li><a href="http://www.syslog.cl.cam.ac.uk/2012/04/13/liveblog-eurosys-2012-day-3/">Day 3</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/eurosys-2012-blog-notes/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>FLP and CAP aren&#8217;t the same thing</title>
		<link>http://the-paper-trail.org/blog/flp-and-cap-arent-the-same-thing/</link>
		<comments>http://the-paper-trail.org/blog/flp-and-cap-arent-the-same-thing/#comments</comments>
		<pubDate>Mon, 26 Mar 2012 03:55:34 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Distributed systems]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=347</guid>
		<description><![CDATA[An interesting question came up on Quora this last week. Roughly speaking, the question asked how, if at all, the FLP theorem and the CAP theorem were related. I&#8217;d thought idly about exactly the same question myself before. Both theorems concern the impossibility of solving fairly similar fundamental distributed systems problems in what appear to [...]]]></description>
			<content:encoded><![CDATA[<p>An <a href="http://www.quora.com/Distributed-Systems/Are-the-FLP-impossibility-result-and-Brewers-CAP-theorem-basically-equivalent">interesting question</a> came up on <a href="http://www.quora.com">Quora</a> this last week. Roughly speaking, the question asked how, if at all, the <a href="http://the-paper-trail.org/blog/?p=49">FLP</a> theorem and the <a href="http://the-paper-trail.org/blog/?p=290">CAP theorem</a> were related. I&#8217;d thought idly about exactly the same question myself before. Both theorems concern the impossibility of solving fairly similar fundamental distributed systems problems in what appear to be fairly similar distributed systems settings. The CAP theorem gets all the airtime, but FLP to me is a more beautiful result. Wouldn&#8217;t it be fascinating if both theorems turned out to be equivalent; that is effectively restatements of each other?</p>
<p><span id="more-347"></span></p>
<blockquote>
<h4>What the two theorems mean</h4>
<p>
The <strong>FLP theorem</strong> states that in an asynchronous network where messages may be delayed but not lost, there is no consensus algorithm that is guaranteed to terminate in every execution for all starting conditions, if at least one node may fail-stop.</p>
<p>The <strong>CAP theorem</strong> states that in an asynchronous network where messages may be lost, it is impossible to implement a sequentially consistent atomic read / write register that responds eventually to every request under every pattern of message loss.</p>
</blockquote>
<p>Without ever having tackled the problem formally, I had speculated that they might be equivalent, based on a few observations. First, consensus and serialisable atomic objects are very closely related. Both involve causing a set of nodes to come to some agreement about shared state. In the case of consensus, it&#8217;s the value proposed; in the case of atomic objects, it&#8217;s the order and identity of any operations performed. Secondly, the failure modes in both theorems appeared similar. CAP deals with &#8216;partitions&#8217;, which is the non-delivery of a subset of messages, and FLP deals with &#8216;faulty nodes&#8217;, which are nodes that only take a finite number of steps in any execution, steps which include message receipt. It seemed likely that each failure model could be used to describe the other. Thirdly, both problems deal with safety and liveness properties in distributed systems in the context of failures.</p>
<p>I wrote an answer up quickly to the question, and formulated a proof sketch typical for these kinds of questions: if two problems are equivalent, then a solution to one is a solution to the other, and vice versa. My answer was, like so many informal arguments of this nature, concise, convenient, convincing and wrong.</p>
<h3>The Importance Of Context</h3>
<p><a href="http://www.cs.cornell.edu/people/egs/">Emin Gun Sirer</a> pointed out what was wrong with my argument. One direction of the equivalence still looks good &#8211; a solution to CAP could be used to formulate a solution to the FLP problem. The problem is in the other direction &#8211; with my assertion that an FLP solution could be used to solve CAP. The distinction arises when we consider how each theorem treats nodes that aren&#8217;t receiving the messages that are being sent to them. In FLP, such nodes are failed, and exempt from having to achieve consensus. In CAP, such nodes are only partitioned. Here&#8217;s the difference: a CAP solution requires that any live node be able to correctly serve requests, <em>even if it has not received any messages</em>. So a partitioned node in FLP does <em>not</em> have to achieve consensus, since it is considered failed, but the same node in CAP must &#8211; somehow &#8211; keep up with the activity of the rest of the system.</p>
<p>To reinforce this point, let&#8217;s take a look at how <a href="http://dl.acm.org/citation.cfm?id=564601">Gilbert and Lynch</a> proved their formalisation of the CAP theorem. They show a simple, intuitive result. If there is a permanent partition between two disjoint subsets of nodes in the system, call them <img src='http://s.wordpress.com/latex.php?latex=G_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_1' title='G_1' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=G_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_2' title='G_2' class='latex' />, then a write to <img src='http://s.wordpress.com/latex.php?latex=G_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_2' title='G_2' class='latex' /> can never cause any different execution to occur in <img src='http://s.wordpress.com/latex.php?latex=G_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_1' title='G_1' class='latex' /> (because the only way that <img src='http://s.wordpress.com/latex.php?latex=G_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_2' title='G_2' class='latex' /> can influence <img src='http://s.wordpress.com/latex.php?latex=G_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_1' title='G_1' class='latex' /> is to send a message). But <img src='http://s.wordpress.com/latex.php?latex=G_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_1' title='G_1' class='latex' /> might have to respond to a read request <strong>after</strong> the initial write to <img src='http://s.wordpress.com/latex.php?latex=G_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_2' title='G_2' class='latex' />. To be sequentially consistent, <img src='http://s.wordpress.com/latex.php?latex=G_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G_1' title='G_1' class='latex' /> must return the value of the write. It can&#8217;t, because it can never learn of the write. CAP does not identify partition with failure. FLP does, at least in the single node case.</p>
<p>This blows my argument out of the water, because I was relying on equal treatment of nodes that were completely partitioned from all others in both settings. But that&#8217;s not the case. So it is <em>not</em> obvious that a solution to FLP can provide a solution to CAP.</p>
<h3>Proving Yourself Wrong</h3>
<p>It would be unsatisfying only to establish that my original argument was flawed, because doing so doesn&#8217;t actually settle the original question: can CAP and FLP be considered equivalent? Instead of trying to prove the positive result, armed with a bit more clarity about the difference between the two theorems, let&#8217;s try and prove the negative; in particular that a solution to the FLP theorem will <em>not</em> provide a solution to the CAP theorem.</p>
<p>A brief aside about the validity of this proof technique: you might reasonably be uncomfortable with me reasoning about &#8216;positive&#8217; solutions to the CAP or FLP theorem, since respected researchers have already established in peer reviewed articles that no solution to either exists. However, it&#8217;s still reasonable to effectively fantasise about what would be true <em>if</em> positive solutions existed, and indeed it&#8217;s fundamental to posing questions about equivalence. To appeal to a more commonly considered problem &#8211; computer scientists spend a lot of time considering what would happen if a positive solution to <img src='http://s.wordpress.com/latex.php?latex=P%20%3F%3D%20NP&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P ?= NP' title='P ?= NP' class='latex' /> was found, even though it&#8217;s very possible (and usually considered likely) that no positive solution exists.</p>
<p>So let&#8217;s pretend we have this magical black box, which will provide a solution to the FLP problem. That is, if every node in the system runs the algorithm in this box, then non-trivial consensus will always be reached in finite time at all non-faulty nodes, even where there is a single faulty node in the system, under all initial conditions and all network behaviours. Could we use this algorithm to solve CAP?</p>
<p>To prove otherwise, we&#8217;re going to use a very similar argument to the one Gilbert and Lynch made, which is a proof by contradiction &#8211; we&#8217;re going to assume that the algorithm can solve CAP, and then show an example of where it could not.</p>
<p>In the following, I&#8217;m going to use &#8216;achieves consensus&#8217; as the target problem to solve. To do that, I need to quickly argue that any solution to consensus (ignoring the possibility of failures) can be used to implement an atomic sequentially consistent object. This is well established in the literature &#8211; the construction is called a <em>distributed finite state machine</em>. It works by having any node (i.e. one that receives a request) propose a value (i.e. the next write operation to execute, and its result). A round of consensus is performed, where the nodes either agree to the &#8216;next operation&#8217; proposal, or vote it down. If they vote it down, the proposer tries again. There are solutions (such as <a href="http://the-paper-trail.org/blog/?p=173">Paxos</a>) which ensure that every valid proposal is eventually accepted. An object implemented in this way is necessarily sequentially consistent, since all requests are <em>ordered</em> by the process of consensus deciding the next operation to execute. Read requests may be served locally without performing consensus, since all nodes know about the &#8216;most recent&#8217; write operation.</p>
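<p>The construction described above can be sketched in a few lines. This is only an illustration under heavy assumptions: everything runs in one process, there are no failures, and <tt>decide</tt> is a hypothetical stand-in for a real consensus round (it simply picks the proposal from the lowest-numbered node). The names <tt>Replica</tt>, <tt>decide</tt> and <tt>run_slot</tt> are mine, not taken from any paper or library.</p>

```python
from dataclasses import dataclass, field


def decide(proposals):
    """Hypothetical stand-in for one round of consensus.

    `proposals` maps node id -> proposed operation. A real protocol such as
    Paxos would guarantee that every node learns the same winner; here we
    fake that guarantee by deterministically picking the lowest node id.
    """
    return proposals[min(proposals)]


@dataclass
class Replica:
    value: object = None                      # current register contents
    log: list = field(default_factory=list)   # agreed operations, in order

    def apply(self, op):
        # op is a ("write", value) pair; applying it updates the register.
        self.log.append(op)
        self.value = op[1]

    def read(self):
        # Served locally, without consensus, as in the paragraph above.
        return self.value


def run_slot(replicas, proposals):
    """Agree on the next operation and apply it at every replica."""
    chosen = decide(proposals)
    for replica in replicas:
        replica.apply(chosen)
    return chosen


replicas = [Replica() for _ in range(3)]
# Two clients issue competing writes at nodes 0 and 2.
pending = {0: ("write", "X"), 2: ("write", "Y")}
while pending:
    chosen = run_slot(replicas, pending)
    # The losing proposer retries in the next slot.
    pending = {node: op for node, op in pending.items() if op != chosen}
```

<p>After the loop every replica holds the same log, write X followed by write Y, so a local read at any replica returns the same value. The ordering comes entirely from the sequence of consensus decisions, which is the point of the construction.</p>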
<p>Let&#8217;s set up our network as follows. There will be one distinguished node, <img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' />, which is the &#8216;failing&#8217; node. It does not receive any messages from the rest of the system, and therefore takes a finite number of steps and then stops. We&#8217;ll call the rest of the network <img src='http://s.wordpress.com/latex.php?latex=G%20%3D%20N%20-%20%5C%7BF%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G = N - \{F\}' title='G = N - \{F\}' class='latex' />. In order to be a solution to CAP, <img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' /> must itself <em>decide</em> on a value, since it may receive a read request to which it must respond in finite time. We&#8217;ll show that even with a solution to FLP, there must be some execution where <img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' /> does not know the right answer.</p>
<p>(Note &#8211; this is an extremely simple argument to make informally, but I&#8217;m hoping to show how one might think about this and more difficult problems more formally).</p>
<p>Imagine that a <tt>write(<img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' />)</tt> operation is initiated by a client of some node in <img src='http://s.wordpress.com/latex.php?latex=G&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G' title='G' class='latex' />, followed by a read issued at <img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' />, all the time using the FLP algorithm we were given. Then, in order to be a correct CAP solution, F must respond to the read with the value <img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' />. To do that, it takes some finite series of steps <img src='http://s.wordpress.com/latex.php?latex=S%3DS_0%5Crightarrow%20S_1%5Crightarrow%20S_2%5Cdots%5Crightarrow%20S_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S=S_0\rightarrow S_1\rightarrow S_2\dots\rightarrow S_n' title='S=S_0\rightarrow S_1\rightarrow S_2\dots\rightarrow S_n' class='latex' />, and then takes no further action, because it is receiving no messages to spur it on.</p>
<p>Now imagine that instead, with the same exact initial conditions, a write(<img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' />) operation is initiated, again by a client of some node in <img src='http://s.wordpress.com/latex.php?latex=G&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G' title='G' class='latex' />. Again, to be correct, <img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' /> must respond to a subsequent read with the value <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' />. To do so, it takes another series of steps <img src='http://s.wordpress.com/latex.php?latex=T%3DT_0%5Crightarrow%20T_1%5Crightarrow%20T_2%5Cdots%5Crightarrow%20T_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='T=T_0\rightarrow T_1\rightarrow T_2\dots\rightarrow T_n' title='T=T_0\rightarrow T_1\rightarrow T_2\dots\rightarrow T_n' class='latex' />. However, <img src='http://s.wordpress.com/latex.php?latex=T&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='T' title='T' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> <em>must be exactly the same sequence</em>. Why? Because the behaviour of a node is governed by the algorithm it is executing, the initial conditions, and the messages it receives. The first two are exactly the same in both executions, and since no messages are being delivered, so is the third.
Therefore, the value that <img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' /> has decided upon by the final step of both <img src='http://s.wordpress.com/latex.php?latex=T&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='T' title='T' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> must be the same value. But for correctness, <img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' /> is required to return <em>different</em> values in each execution. This is a contradiction, hence the FLP-solving algorithm cannot be a solution to CAP, which was our initial assumption.</p>
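<p>The determinism argument above can be made concrete with a toy model. This is a sketch, not a formalisation: <tt>toy_algorithm</tt> is a hypothetical deterministic step function that I have made up for illustration, and the only inputs to a node are its algorithm, its initial state and its inbox, exactly the three things named in the argument.</p>

```python
def run_node(algorithm, initial_state, inbox):
    """Run a deterministic node: its final state depends only on the
    algorithm, the initial state and the messages it receives."""
    state = initial_state
    for message in inbox:
        state = algorithm(state, message)
    return state


def toy_algorithm(state, message):
    # Hypothetical step function: adopt the value carried by a write message.
    kind, value = message
    return value if kind == "write" else state


def execution(written_value, partitioned):
    """What F would answer to a read after `written_value` is written in G."""
    # When F is partitioned, the write never reaches it: its inbox is empty.
    inbox = [] if partitioned else [("write", written_value)]
    return run_node(toy_algorithm, None, inbox)


answer_after_x = execution("X", partitioned=True)
answer_after_y = execution("Y", partitioned=True)
```

<p>With the partition in place the two executions are indistinguishable to F, so <tt>answer_after_x</tt> and <tt>answer_after_y</tt> are necessarily equal, while correctness demands that they differ. Without the partition the same toy node answers correctly in both executions. Swapping in a cleverer algorithm does not help, because nothing in F&#8217;s inputs distinguishes the two runs.</p>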
<p>At this point you might be concerned &#8211; doesn&#8217;t FLP guarantee that all nodes agree on the same value? That is, if we had a magical FLP-solver, wouldn&#8217;t that guarantee that <img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' /> saw the same value as every other node in <img src='http://s.wordpress.com/latex.php?latex=G&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G' title='G' class='latex' />? The key observation here is that, from the perspective of FLP, <em><img src='http://s.wordpress.com/latex.php?latex=F&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='F' title='F' class='latex' /> has failed</em>. From the perspective of CAP, it has not, and must continue to participate correctly in certain activities of the system. So the algorithm is correctly solving consensus as defined by FLP, but the requirements of CAP are too strong.</p>
<p>If consensus cannot be achieved even with an FLP solution in CAP, we cannot use it to construct any kind of solution to CAP.</p>
<h3>What We Have Learnt</h3>
<p>This result (if I&#8217;ve made no mistakes) is arguably more interesting than if the original equivalence assertion were true. To be an interesting impossibility result in distributed systems, it&#8217;s usually true that you want to place the fewest restrictions possible on the environment in which you&#8217;re establishing that result. The idea is that you give your result <em>every chance</em> to be wrong, by allowing the environment few restrictions in the tricks it can pull to overcome the obstacles you&#8217;re constructing. Then, if your result <em>still</em> turns out to be true, you know you have proved something really quite strong. (The way &#8216;strong&#8217; and &#8216;weak&#8217; are often used can be confusing &#8211; <em>weak</em> assumptions lead to <em>strong</em> impossibility results and vice versa).</p>
<p>So FLP, with its strictly weaker restrictions &#8211; all messages are eventually delivered, faulty nodes don&#8217;t have to achieve consensus &#8211; is by this definition a stronger result than CAP, which allows messages to be lost forever and forces partitioned nodes to participate in the system. It&#8217;s therefore much more surprising, as the authors of the <a href="http://dl.acm.org/citation.cfm?id=214121">original paper</a> themselves remark.</p>
<p>Conversely, CAP appears much more humdrum. Is it really surprising to anyone that it&#8217;s impossible to maintain sequentially consistent state in a distributed system if nodes cannot talk to each other? It&#8217;s not too far away from discussing what&#8217;s possible in a network that can experience total node failure &#8211; nothing at all, which is why all papers disregard it as a consideration. The utility of CAP comes from telling systems designers that they must be prepared to work in a world where availability or consistency are compromised, but is rather cataclysmic about the situations in which they might be threatened.</p>
<p>It would be more interesting to relax one or two assumptions &#8211; what if partitions are only temporary? What if we allow the client to participate (i.e. a client that has seen a write is allowed to convey that fact to a replica that it issues a read from)? What if partitioned nodes were excluded from participation? If we can show CAP to hold in these circumstances, it greatly restricts our ability to design correct, available systems that experience much weaker failures. <em>That</em> would be interesting.</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/flp-and-cap-arent-the-same-thing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Should I take a systems reading course?</title>
		<link>http://the-paper-trail.org/blog/should-i-take-a-systems-reading-course/</link>
		<comments>http://the-paper-trail.org/blog/should-i-take-a-systems-reading-course/#comments</comments>
		<pubDate>Sat, 10 Mar 2012 02:05:13 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=342</guid>
		<description><![CDATA[A smart student asked me a couple of days ago whether I thought taking a 2xx-level reading course in operating systems was a good idea. The student, understandably, was unsure whether talking about these systems was as valuable as actually building them, and also whether, since his primary interest is in &#8216;distributed&#8217; systems, he stood [...]]]></description>
			<content:encoded><![CDATA[<p>A smart student asked me a couple of days ago whether I thought taking a 2xx-level reading course in operating systems was a good idea. The student, understandably, was unsure whether talking about these systems was as valuable as actually building them, and also whether, since his primary interest is in &#8216;distributed&#8217; systems, he stood to benefit from a deep understanding of things like virtual memory. </p>
<p><span id="more-342"></span></p>
<p>I figured, since I get a bunch of e-mail from students through this site, that it might be worth sharing my answer:</p>
<blockquote><p>Take it, if you&#8217;re serious about distributed systems, and here&#8217;s why:</p>
<p>Systems &#8216;thinking&#8217; is terribly important. One way to learn that well is to read papers and to discuss them. This hones the ability to critically think about computer systems, about the value of certain lines of work and about the meaning (or lack of) of performance results. In the absence of a formalised way of judging systems work (i.e. through a proof) you need to develop your own judgment and taste. This is a very good way to do so.</p>
<p>Not only that, the topics in the course are practically very relevant. Distributed systems aren&#8217;t any different from other systems topics in the sense that they interact with the network, disk and CPU in much the same way. To give you an example, distributed resource management (e.g. Mesos or YARN), which is an incredibly relevant &#8216;distributed&#8217; systems topic right now, needs to understand local scheduling policies, the relative characteristics of disk and memory and the interaction between CPU and the memory hierarchy to begin to be effective.</p>
<p>Thinking of distributed systems as an abstraction over a collection of nodes works best only when you are thinking about distributed algorithms, but it doesn&#8217;t work if you&#8217;re an engineer. If you&#8217;re mathematically minded, and you&#8217;re not that keen on actually implementing what you come up with instead of proving something about it, then this might not be the course for you (although I&#8217;d *still* suggest you consider it strongly).</p>
<p>But if you like to build things, and you like to *really* understand how they work, you&#8217;re going to need to know all of this stuff and more.
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/should-i-take-a-systems-reading-course/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I&#8217;m talking at Strata Conference 2012</title>
		<link>http://the-paper-trail.org/blog/im-talking-at-strata-conference-2012/</link>
		<comments>http://the-paper-trail.org/blog/im-talking-at-strata-conference-2012/#comments</comments>
		<pubDate>Thu, 19 Jan 2012 05:37:45 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=337</guid>
		<description><![CDATA[I&#8217;ll be giving a talk at this year&#8217;s Strata Conference in Santa Clara, on February 29th. My talk is called Monitoring Apache Hadoop &#8211; A Big Data Problem?. I&#8217;d be lying if I said that every slide was fully realised at this point, but you can read the abstract to see what I&#8217;ve committed myself [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ll be giving a talk at this year&#8217;s <a href="http://strataconf.com/strata2012">Strata Conference</a> in Santa Clara, on February 29th. My talk is called <a href="http://strataconf.com/strata2012/public/schedule/detail/22623">Monitoring Apache Hadoop &#8211; A Big Data Problem?</a>. I&#8217;d be lying if I said that every slide was fully realised at this point, but you can read the abstract to see what I&#8217;ve committed myself to. The general idea is that building large scale shared-nothing distributed systems is at most half the problem in making them a reality. Managing these systems day-to-day requires the understanding and analysis of a serious amount of data; so there&#8217;s a nice cycle here that you might be able to use the data processing systems you&#8217;re trying to understand to understand them. I&#8217;ll try and tie the whole thing together with a discussion of failure; the thesis being that partial failure in distributed systems is both to blame for the incidents we&#8217;re trying to understand, and making understanding them very difficult &#8211; I believe this is true in a very fundamental sense, so I&#8217;ll make that case and also talk about what is to be done. </p>
<p>(And if I&#8217;m not a big enough draw &#8211; perish the thought &#8211; there are many, many other interesting sessions. In particular, <a href="http://strataconf.com/strata2012/public/schedule/detail/22651">Josh will be talking</a> about Crunch, and <a href="http://strataconf.com/strata2012/public/schedule/detail/22360">Sarah</a> will be giving both introductory and advanced Hadoop classes &#8211; both people I work with, and both fantastic speakers!)</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/im-talking-at-strata-conference-2012/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How consistent is eventual consistency?</title>
		<link>http://the-paper-trail.org/blog/how-consistent-is-eventual-consistency/</link>
		<comments>http://the-paper-trail.org/blog/how-consistent-is-eventual-consistency/#comments</comments>
		<pubDate>Wed, 04 Jan 2012 23:22:05 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=334</guid>
		<description><![CDATA[This page, from the &#8216;PBS&#8217; team at Berkeley&#8217;s AMPLab is quite interesting. It allows you to tweak the parameters of a Dynamo-style system, then by running a series of Monte Carlo simulations gives an estimate of the likelihood of staleness of reads after writes. Since the Dynamo paper appeared and really popularised eventual consistency, the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.eecs.berkeley.edu/~pbailis/projects/pbs/" title="PBS">This page</a>, from the &#8216;PBS&#8217; team at Berkeley&#8217;s <a href="http://amplab.cs.berkeley.edu/">AMPLab</a> is quite interesting. It allows you to tweak the parameters of a <a href="http://the-paper-trail.org/blog/?p=51" title="Dynamo">Dynamo</a>-style system, then by running a series of Monte Carlo simulations gives an estimate of the likelihood of staleness of reads after writes. </p>
<p>Since the Dynamo paper appeared and really popularised eventual consistency, the debate has focused on a fairly binary treatment of its merits. Either you can&#8217;t afford to be wrong, ever, or it&#8217;s ok to have your reads be stale for a potentially unbounded amount of time. In fact, the suitability of eventual consistency is dependent partly on the <i>distribution</i> of stale reads; that is the speed of quiescence of a system immediately after a write. If the probability of ever seeing a stale read due to consistency delays can be reduced to smaller than the probability of every machine in the network simultaneously catching fire, we can probably make use of eventual consistency.</p>
<p>Looking at many designed systems (where there is little more than conventional wisdom on how to choose R and W), it&#8217;s clear that an analytical model relating system parameters to distributions of behaviour is sorely needed. PBS is a good step in that direction. It would be good to see the work extended to handle a treatment of failure distributions (although a good failure model is hard to find!). The reply latencies of write and read replicas are modelled as exponentially distributed, but in reality there&#8217;s a more significant probability of the reply latency becoming infinite. Once that distribution is correctly modelled, PBS should be able to run simulations against it with no change.
</p>
<p>
A great use for this tool would be to enter some operational parameters, such as the required consistency probability, max number of nodes, availability requirements and maximum request latency, and have PBS suggest some points in the system design space that would meet these requirements with high probability. As the R / W quora get larger, the variance of the request latencies gets larger, but the resilience to failures increases, as does the likelihood of fresh reads. For full credit, PBS could additionally model a write / read protocol (e.g. 2-phase commit) which has different consistency properties. As <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html" title="Abadi's PACELC">Daniel Abadi</a> discusses, when things are running well, different consistency guarantees trade off between latency and the strength of consistency.
</p>
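<p>To give a flavour of the kind of simulation discussed above, here is a minimal Monte Carlo sketch. It is my own simplification and not the PBS team&#8217;s code or model: every one-way message delay is drawn independently from an exponential distribution, there are no failures, and the function name and parameters (<tt>n</tt>, <tt>r</tt>, <tt>w</tt>, <tt>t</tt>, <tt>mean_latency</tt>) are assumptions made for illustration.</p>

```python
import random


def stale_read_probability(n, r, w, t, trials=20000, mean_latency=1.0, seed=0):
    """Estimate the chance that a read started `t` time units after a write
    commits returns stale data, for n replicas and quorum sizes r and w.

    Simplifying assumptions (mine): all one-way delays are independent and
    exponentially distributed with the given mean, and nothing ever fails.
    """
    rng = random.Random(seed)

    def delay():
        return rng.expovariate(1.0 / mean_latency)

    stale = 0
    for _ in range(trials):
        # The write reaches each replica, and each replica acks the coordinator.
        write_arrival = [delay() for _ in range(n)]
        ack_arrival = [arrival + delay() for arrival in write_arrival]
        # The write commits once the w-th ack has come back.
        commit_time = sorted(ack_arrival)[w - 1]

        # The read is sent to every replica t time units after the commit.
        read_start = commit_time + t
        read_arrival = [read_start + delay() for _ in range(n)]
        response_arrival = [arrival + delay() for arrival in read_arrival]
        # The coordinator answers using the first r responses it receives.
        first_r = sorted(range(n), key=lambda i: response_arrival[i])[:r]

        # The read is stale if every one of those replicas was asked
        # before the write had reached it.
        if all(write_arrival[i] > read_arrival[i] for i in first_r):
            stale += 1
    return stale / trials


strict = stale_read_probability(n=3, r=2, w=2, t=0.0)
partial_now = stale_read_probability(n=3, r=1, w=1, t=0.0)
partial_later = stale_read_probability(n=3, r=1, w=1, t=5.0)
```

<p>In this model, overlapping quorums (r + w greater than n) never produce a stale read, because at least one of the first r responders must have acknowledged the write before it committed. With r = w = 1 stale reads do occur, and they become rarer as the read is delayed. Replacing <tt>delay</tt> with a distribution that sometimes returns infinity would be one way to start modelling the failures mentioned earlier.</p>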
<p>Nice work PBS team!</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/how-consistent-is-eventual-consistency/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A little bit of recruiting</title>
		<link>http://the-paper-trail.org/blog/a-little-bit-of-recruiting/</link>
		<comments>http://the-paper-trail.org/blog/a-little-bit-of-recruiting/#comments</comments>
		<pubDate>Tue, 14 Jun 2011 05:42:56 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[cloudera]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=321</guid>
		<description><![CDATA[Today was a pretty good day at Cloudera, although it was not much unlike any other. Today I was: checking a leader election protocol for correctness cleaning up some code in an open source project designing a specialised messaging system working with one of our outstanding interns on the very cool project he&#8217;s taking on [...]]]></description>
			<content:encoded><![CDATA[<p>Today was a pretty good day at <a href="http://www.cloudera.com">Cloudera</a>, although it was not much unlike any other. Today I was:</p>
<ul>
<li>checking a leader election protocol for correctness</li>
<li>cleaning up some code in an open source project</li>
<li>designing a specialised messaging system</li>
<li>working with one of our outstanding interns on the very cool project he&#8217;s taking on over the summer</li>
<li>cheerleading as Cloudera pushed <a href="http://wiki.apache.org/incubator/BigtopProposal">another open-source project</a> out of the nest.</li>
</ul>
<p>We have more interesting work than we can possibly do. We need great engineers to come solve some really fascinating problems. I wanted to take advantage of the fact that this blog is currently getting a <strong>lot</strong> of traffic to put this message in front of a lot of eyeballs: <em>you should really consider coming to work at Cloudera</em>.</p>
<p>We&#8217;re hiring pretty aggressively in several engineering areas, but two in particular that I can say a lot about:</p>
<p><strong>Distributed systems engineers</strong>. Engineering jobs where you are doing what we would call &#8216;systems&#8217; work are few and far between. They exist, but they&#8217;re not common. One thing I have found, speaking to engineers at other companies, is that a job that encourages specialisation in <em>distributed</em> systems is even harder to find, especially at an exciting company with plenty of really smart people. At Cloudera we need people to work on a variety of large distributed systems &#8211; execution frameworks, distributed filesystems, data ingest pipelines, schedulers, coordination systems, query processing and more. The problems you would work on are fundamental and challenging, and the people you would get to work with know the systems inside and out. The need to scale is real &#8211; we have customers who have thousands of machines in their clusters, and we write software to run across every single core. </p>
<p><strong>Applications engineers</strong>. Think managing one of those clusters is easy? It isn&#8217;t. The monitoring and lifecycle management challenge posed by Apache Hadoop and other systems can be enormously complex. Each cluster produces a vast amount of data, and your job as an applications engineer is to extract the right signals from that data and bring them to operations engineers through clear, meaningful visualisations so that they can make sense of all that information. We&#8217;re looking for full-stack developers, as happy writing a server process to aggregate and chomp through log files as they are building a reusable heatmap control to properly visualise a multi-dimensional distribution. </p>
<p>The perks at Cloudera are good &#8211; we have offices in San Francisco and Palo Alto, we have communal lunches brought in every day, there&#8217;s a subsidised gym membership, you can buy books and we&#8217;re pretty light on procedural overhead. But if you&#8217;re a good fit, you&#8217;ll probably be attracted because the problems are real, interesting and hard, and the people you&#8217;ll work with to solve them are smart, knowledgeable and a whole load of fun. </p>
<p>If you&#8217;re interested, drop me a line or send me your resume at henry AT cloudera DOT com, and I&#8217;ll happily tell you anything you want to know about this place. </p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/a-little-bit-of-recruiting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>STM: Not (much more than) a research toy?</title>
		<link>http://the-paper-trail.org/blog/stm-not-much-more-than-a-research-toy/</link>
		<comments>http://the-paper-trail.org/blog/stm-not-much-more-than-a-research-toy/#comments</comments>
		<pubDate>Thu, 21 Apr 2011 20:02:38 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Operating systems]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=311</guid>
		<description><![CDATA[It&#8217;s a sign of how down-trodden the Software Transactional Memory (STM) effort must have become that the article (sorry, ACM subscription required) published in a recent CACM might have been just as correctly called &#8220;STM: Not as bad as the worst possible case&#8221;. The authors present a series of experiments that demonstrate that highly concurrent [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s a sign of how down-trodden the Software Transactional Memory (STM) effort must have become that the <a href="http://cacm.acm.org/magazines/2011/4/106585-why-stm-can-be-more-than-a-research-toy/fulltext">article</a> (sorry, ACM subscription required) published in a recent CACM might have been just as correctly called &#8220;STM: Not as bad as the worst possible case&#8221;. The authors present a series of experiments that demonstrate that highly concurrent STM code beats <em>sequential, single threaded code</em>. You&#8217;d hope that this had long ago become a given, but what this demonstrates is only hey, STM allows <em>some</em> parallelism. And this weak lower bound got a whole article.</p>
<p>Another conclusion from the article is that STM performs best when transactions in different threads rarely contend. Again, that should really be a given &#8211; all reasonable concurrency primitives have high throughput when there is little contention but high parallelism. (A lot of work has gone into making this, the most common case, very fast for locking; see e.g. <a href="http://blogs.sun.com/dave/entry/biased_locking_in_hotspot">biased locking schemes</a> in the HotSpot JVM.)</p>
<p>Bryan Cantrill (previously of Fishworks, now of Joyent) <a href="http://blogs.sun.com/bmc/entry/concurrency_s_shysters">rips on transactional memory</a> more eloquently than I ever could. STM is a declarative solution to thread safety, which I like, but no more declarative really than synchronised blocks &#8211; and Cantrill points out the elephant in the room that the CACM article seemed to ignore: doing IO inside transactions is hugely problematic (because how precisely do you roll back a network packet?).</p>
<p>A recent paper at SOSP 2009 called <a href="http://www.sigops.org/sosp/sosp09/papers/porter-sosp09.pdf">Operating System Transactions</a> attacked this problem, not from the viewpoint of STM but to provide atomicity and isolation in situations where bugs arise from the gap between a read and the writes that depend on it (Time Of Check To Time Of Use &#8211; TOCTTOU). Perhaps there&#8217;s an overlap between this paper and STM approaches, but it&#8217;s not clear whether the workloads inside an operating system&#8217;s system call layer are general enough to map onto typical user-space STM work.</p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/stm-not-much-more-than-a-research-toy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Theorem That Will Not Go Away</title>
		<link>http://the-paper-trail.org/blog/the-theorem-that-will-not-go-away/</link>
		<comments>http://the-paper-trail.org/blog/the-theorem-that-will-not-go-away/#comments</comments>
		<pubDate>Fri, 08 Oct 2010 06:28:55 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[computer science]]></category>
		<category><![CDATA[Distributed systems]]></category>
		<category><![CDATA[cap]]></category>
		<category><![CDATA[consensus]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=297</guid>
		<description><![CDATA[The CAP theorem gets another airing. I think the article makes a point worth making again, and makes it fairly well &#8211; that CAP is really about P=> ~(C &#038; A). A couple of things I want to call out though, after a rollicking discussion on Hacker News. &#8220;For a distributed (i.e., multi-node) system to [...]]]></description>
			<content:encoded><![CDATA[<p>The CAP theorem gets <a href="http://codahale.com/you-cant-sacrifice-partition-tolerance/">another airing</a>.</p>
<p>I think the article makes a point worth <a href="http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance">making again</a>, and makes it fairly well &#8211; that CAP is really about P => ~(C &#038; A). A couple of things I want to call out though, after a rollicking <a href="http://news.ycombinator.com/item?id=1768312">discussion</a> on <a href="http://news.ycombinator.com">Hacker News</a>.</p>
<blockquote><p>&#8220;For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. You and I do not work with these types of systems because they don’t exist.&#8221;</p></blockquote>
<p>This is a bit strong, at least theoretically. Actually all you need to not require partition-tolerance is to guarantee that your particular kryptonite failure pattern never occurs. Many protocols are robust to a dropped message here or there. A quorum system requires a fairly dramatic failure (one node completely partitioned) before one side of the partition has to give up availability or consistency. In practice, of course, these failures happen more often than we would like, which is why we worry about fault-tolerant properties of distributed algorithms. </p>
<p>The article&#8217;s paragraph on failure probabilities is therefore less powerful than it first appears. It&#8217;s not always a problem if a single failure occurs, and therefore you shouldn&#8217;t immediately worry about sacrificing availability or consistency as soon as one node starts running slowly. CAP only establishes the <i>existence</i> of a failure pattern that torpedoes any distributed implementation of an atomic object, not that such a pattern is highly probable. </p>
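<p>The quorum point above can be made concrete with a small Python sketch. It assumes a simple majority-quorum rule, and the helper name <code>sides_with_quorum</code> is purely illustrative, not taken from any real system. Given the sizes of the groups a partition splits the cluster into, it reports which groups can still assemble a strict majority.</p>

```python
def sides_with_quorum(partition_sizes):
    """Return the indexes of partition sides that still hold a strict majority."""
    total = sum(partition_sizes)
    majority = total // 2 + 1
    return [i for i, size in enumerate(partition_sizes) if size >= majority]


if __name__ == "__main__":
    # Five nodes with one isolated: the four-node side keeps a majority.
    print(sides_with_quorum([4, 1]))
    # Four nodes split evenly: neither side holds a majority.
    print(sides_with_quorum([2, 2]))
```

<p>In the first case the larger side carries on as if nothing had happened and only the isolated node loses out. The even split is the sort of kryptonite pattern that forces a real choice between availability and consistency.</p>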
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/the-theorem-that-will-not-go-away/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>CAP confusion: Problems with Partition Tolerance</title>
		<link>http://the-paper-trail.org/blog/cap-confusion-problems-with-partition-tolerance/</link>
		<comments>http://the-paper-trail.org/blog/cap-confusion-problems-with-partition-tolerance/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 17:44:29 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Distributed systems]]></category>
		<category><![CDATA[link]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=290</guid>
		<description><![CDATA[Over on the Cloudera blog I&#8217;ve written an article that should be of interest to readers of this blog. I&#8217;m no great fan of the ubiquity of the CAP theorem &#8211; it&#8217;s a solid impossibility result which appeals to the theorist in me, but it doesn&#8217;t capture every fundamental tension in a distributed system. For [...]]]></description>
			<content:encoded><![CDATA[<p>Over on the <a href="http://www.cloudera.com/blog">Cloudera blog</a> I&#8217;ve written an <a href="http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/">article</a> that should be of interest to readers of this blog. </p>
<p>I&#8217;m no great fan of the ubiquity of the CAP theorem &#8211; it&#8217;s a solid impossibility result which appeals to the theorist in me, but it doesn&#8217;t capture every fundamental tension in a distributed system. For example: we make our systems distributed across more than one machine usually for reasons of performance and to eliminate a single point of failure. Neither of these motivations is captured verbatim by the CAP theorem. There&#8217;s more to designing distributed systems!</p>
<p>In this, I agree with Stonebraker; it&#8217;s the erroneous representation of &#8216;partition tolerance&#8217; that I found very strange. I&#8217;ve been a good deal more forceful in private about this than I have been in public <img src='http://the-paper-trail.org/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/cap-confusion-problems-with-partition-tolerance/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Apache ZooKeeper is looking for Google Summer of Code applicants</title>
		<link>http://the-paper-trail.org/blog/apache-zookeeper-is-looking-for-google-summer-of-code-applicants/</link>
		<comments>http://the-paper-trail.org/blog/apache-zookeeper-is-looking-for-google-summer-of-code-applicants/#comments</comments>
		<pubDate>Wed, 24 Mar 2010 17:18:02 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://the-paper-trail.org/blog/?p=286</guid>
		<description><![CDATA[Students! Over at Apache ZooKeeper we&#8217;re looking for great students with a strong interest in distributed systems to work with us over the summer as part of Google&#8217;s Summer of Code, 2010. Summer of Code is a great program &#8211; providing stipends to students and more importantly connecting them with mentors in open source projects. [...]]]></description>
			<content:encoded><![CDATA[<p>Students! Over at <a href="http://hadoop.apache.org/zookeeper/">Apache ZooKeeper</a> we&#8217;re looking for great students with a strong interest in distributed systems to work with us over the summer as part of <a href="http://code.google.com/soc/">Google&#8217;s Summer of Code, 2010</a>. </p>
<p>Summer of Code is a great program &#8211; providing stipends to students and more importantly connecting them with mentors in open source projects. ZooKeeper has a number of <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&#038;pid=12310801&#038;customfield_12310260=gsoc">interesting projects</a> to get started on.</p>
<p>ZooKeeper is a distributed coordination platform on which you can build the distributed equivalents of many traditional concurrent primitives like locks, queues and barriers. It&#8217;s heavily used in the real world &#8211; Yahoo! use it extensively, and many other major companies rely on it.  The other committers and I are actively looking to increase participation in the project &#8211; there is <em>loads</em> of really interesting work left to do. If consensus protocols, distributed systems, scalability, fault-tolerance and performance are your thing, this is certainly the project for you.</p>
<p>If you have any questions at all, drop me a line at henry at apache dot org. </p>
]]></content:encoded>
			<wfw:commentRss>http://the-paper-trail.org/blog/apache-zookeeper-is-looking-for-google-summer-of-code-applicants/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

