Paper Trail | Computer Systems, Distributed Algorithms and Databases | https://the-paper-trail.org/blog

Exactly-once or not, atomic broadcast is still impossible in Kafka – or anywhere
https://the-paper-trail.org/blog/exactly-not-atomic-broadcast-still-impossible-kafka/ | Fri, 28 Jul 2017

Intro

Update: Jay responded on Twitter, which you can read here.

I read an article recently by Jay Kreps about a feature for delivering messages ‘exactly-once’ within the Kafka framework. Everyone’s excited, and for good reason. But there’s been a bit of a side story about what exactly ‘exactly-once’ means, and what Kafka can actually do.

In the article, Jay identifies the safety and liveness properties of atomic broadcast as a pretty good definition for the set of properties that Kafka is going after with their new exactly-once feature, and then starts to address claims by naysayers that atomic broadcast is impossible.

For this note, I’m not going to address whether or not exactly-once is an implementation of atomic broadcast. I also believe that exactly-once is a powerful feature that’s been impressively realised by Confluent and the Kafka community; nothing here is a criticism of that effort or the feature itself. But the article makes some claims about impossibility that are, at best, a bit shaky – and, well, impossibility’s kind of my jam. Jay posted his article with a tweet saying he couldn’t ‘resist a good argument’. I’m responding in that spirit.

In particular, the article makes the claim that atomic broadcast is ‘solvable’ (and later that consensus is as well…), which is wrong. What follows is why, and why that matters.

I have since left the pub. So let’s begin.

You can’t solve Atomic Broadcast

From the article:

“So is it possible to solve Atomic Broadcast? The short answer is yes”

The long answer is no. It’s proved by Chandra and Toueg in their utterly fantastic paper about weak failure detectors (of which more later). In fact, the article follows up with a reference that fairly directly contradicts this, and it’s kind of worth quoting:

“From a formal point-of-view, most practical systems are asynchronous because it is not possible to assume that there is an upper bound on communication delays. In spite of this, why do many practitioners still claim that their algorithm can solve agreement problems in real systems? This is because many papers do not formally address the liveness issue of the algorithm, or regard it as sufficient to consider some informal level of synchrony, captured by the assumption that “most messages are likely to reach their destination within a known delay δ” … This does not put a bound on the occurrence of timing failures, but puts a probabilistic restriction on the occurrence of such failures. However, formally this is not enough to establish correctness.” [emphasis mine]

A solution to the atomic broadcast problem satisfies the stated safety and liveness properties in all executions – i.e. no matter what delays are applied to message delivery, or what non-determinism there is in the order of receipt. If we’re talking about theory – and we are, in this article – then ‘solve’ has this very strong meaning.

And because atomic broadcast and consensus are, from a certain perspective, exactly the same problem (Chandra and Toueg, again!) we can take all the knowledge that’s been accrued regarding consensus and apply it to atomic broadcast. Specifically: it’s impossible to solve atomic broadcast in an asynchronous system in which even one process may fail-stop.

Yes but what if…

There are some real mitigations to the FLP impossibility result. Some are mentioned in the article. None completely undermine the basic impossibility result.

Randomization (i.e. having access to a sequence of random coin-flips, locally) is a powerful tool. But randomization doesn’t solve consensus; it just makes a non-terminating execution extremely unlikely, and that doesn’t come with an upper bound on the length of any particular execution. See this introductory wiki page.

Local timers with bounded drift and skew would help, since they prevent the system from being asynchronous at all – but in practice timer drift and skew are not bounded. See these two ACM Queue articles for details.

As for failure detectors, it’s true that augmenting your system model with a failure detector can make consensus solvable. But you have to keep in mind that the FLP result is a *fixed point*. It doesn’t bend or yield to any attempt to disprove it by construction. Instead, if you think that you have solved atomic broadcast you have either done something impossible or you have changed the problem definition. Failure detectors that would make consensus solvable are, therefore, themselves impossible to implement in a purely asynchronous system. Failure detectors are interesting *because* consensus is impossible – it’s surprising just how weak the properties of these failure detectors have to be to allow consensus to be solved, and therefore how hard building a failure detector actually is.

I understand the tedium of pedantry regarding abstruse theoretical matters, but it’s not productive to throw out precision with the bathwater.

Tinkerbell consensus

The final set of claims from that initial section on impossibility is the toughest to support:

“So if you don’t believe that consensus is possible, then you also don’t believe Kafka is possible, in which case you needn’t worry too much about the possibility of exactly-once support from Kafka!”

Hoo boy.

Here the article proposes “Tinkerbell Consensus”, which is roughly “if you believe in it hard enough, consensus can be solved”. Alas, we’ve been clapping our hands now for a really long time, and the fairy’s still dead.

I don’t believe that consensus can be solved. And yet, I use ZooKeeper daily. That’s because ZooKeeper doesn’t give a damn what I believe, but trundles along implementing a partial solution to consensus (where availability might be compromised) – happily agnostic of my opinions on the matter.

If Kafka purports to implement atomic broadcast, it too will fail to do so correctly in some execution. That’s an important property of any implementation, and one that – ideally – you would want to acknowledge and document, without suggesting that the system will do anything other than work correctly in the vast majority of executions.

Conclusion

If I were to rewrite the article, I’d structure it thus: “exactly-once looks like atomic broadcast. Atomic broadcast is impossible. Here’s how exactly-once might fail, and here’s why we think you shouldn’t be worried about it.” That’s a harder argument for users to swallow, perhaps, but it would have the benefit of not causing my impossibility spider-sense to tingle.

Make any algorithm lock-free with this one crazy trick
https://the-paper-trail.org/blog/make-any-algorithm-lock-free-with-this-one-crazy-trick/ | Thu, 26 May 2016

Lock-free algorithms often operate by having several versions of a data structure in use at one time. The general pattern is that you can prepare an update to a data structure, and then use a machine primitive to atomically install the update by changing a pointer. This means that all subsequent readers will follow the pointer to its new location – for example, to a new node in a linked-list – but this pattern can’t do anything about readers that have already followed the old pointer value, and are traversing the previous version of the data structure.

Those readers will see a correct, linearizable version of the data structure, so this pattern doesn’t present a correctness concern. Instead, the problem is garbage collection: who retires the old version of the data structure, to free the memory it’s taking now that it’s unreachable? To put it another way: how do you tell when all possible readers have finished reading an old version?

Of course, there are many techniques for solving this reclamation problem. See this paper for a survey, and this paper for a recent improvement over epoch-based reclamation. RCU, which is an API for enabling single-writer, multi-reader concurrency in the Linux kernel, has an elegant way of solving the problem.

Every reader in RCU marks a critical section by calling rcu_read_lock() / rcu_read_unlock(). Writers typically take responsibility for memory reclamation, which means they have to wait for all in-flight critical sections to complete. The way this is done is, conceptually, really really simple: during a critical section, a thread may not block, or be pre-empted by the scheduler. So as soon as a thread yields its CPU, it’s guaranteed to be out of its critical section.

This gives RCU a really simple scheme to check for a grace period after a write has finished: try to run a thread on every CPU in the system. Once the thread runs, a context switch has happened which means any previous critical sections must have completed, and so any previous version of the data structure can be reclaimed. The readers don’t have to do anything to signal that they are finished with their critical section: it’s implicit in them starting to accept context switches again!
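
To make that grace-period idea concrete, here is a toy, user-space sketch in Java of the closely related quiescent-state-based scheme: each reader announces quiescent states (the analogue of a context switch), and a writer waits until every reader has passed through one. All the names here are invented for illustration; this is not the Linux RCU API, which the quote below describes.

import java.util.concurrent.atomic.AtomicLongArray;

// Toy sketch of grace-period detection (quiescent-state-based reclamation).
// Assumes every reader passes through quiescent states regularly, just as
// every CPU in the kernel eventually context-switches.
class GracePeriod {
    private final AtomicLongArray quiescentCount; // one counter per reader

    GracePeriod(int readers) {
        quiescentCount = new AtomicLongArray(readers);
    }

    // Called by reader `id` at points where it holds no references to any
    // version of the shared structure (i.e. outside any critical section).
    void quiescentState(int id) {
        quiescentCount.incrementAndGet(id);
    }

    // Called by a writer *after* installing the new version: once every
    // reader has hit a quiescent state, no reader can still be traversing
    // the old version, so it is safe to reclaim.
    void synchronize() throws InterruptedException {
        int n = quiescentCount.length();
        long[] seen = new long[n];
        for (int i = 0; i < n; i++) seen[i] = quiescentCount.get(i);
        for (int i = 0; i < n; i++) {
            while (quiescentCount.get(i) == seen[i]) Thread.sleep(1);
        }
    }
}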

In reality, this is not quite what RCU does (but the idea is the same, see this terrific series). Instead, it takes advantage of kernel context-switch counters and waits for them to increase:

“In practice Linux implements synchronize_rcu by waiting for all CPUs in the system to pass through a context switch, instead of scheduling a thread on each CPU. This design optimizes the Linux RCU implementation for low-cost RCU critical sections, but at the cost of delaying synchronize_rcu callers longer than necessary. In principle, a writer waiting for a particular reader need only wait for that reader to complete an RCU critical section. The reader, however, must communicate to the writer that the RCU critical section is complete. The Linux RCU implementation essentially batches reader-to-writer communication by waiting for context switches. When possible, writers can use an asynchronous version of synchronize_rcu, call_rcu, that will asynchronously invoke a specified callback after all CPUs have passed through at least one context switch.”

From RCU Usage In the Linux Kernel: One Decade Later.

The most elegant thing about vanilla RCU is that the system is lock-free by definition not by design – it has nothing to do with the semantics of RCU’s primitives, and everything to do with the fact that being in the kernel allows you to enforce a sympathetic system model. If threads can’t block or otherwise be prevented from making progress, any (non-pathological) algorithm must, by definition, always make progress! Even a concurrency scheme that nominally used spinlocks to protect critical sections would be lock-free, because every thread would exit their critical section in bounded time – the other threads would all be serialised behind this lock, but there would be progress.

(There are other flavours of RCU that don’t restrict critical sections in this way, because they allow critical sections to be pre-empted.)

Distributed systems theory for the distributed systems engineer
https://the-paper-trail.org/blog/distributed-systems-theory-for-the-distributed-systems-engineer/ | Sun, 10 Aug 2014

Gwen Shapira, SA superstar and now full-time engineer at Cloudera, asked a question on Twitter that got me thinking.

My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I’ve come to think that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep and complex, and require both serious study and significant experience to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?

And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap led me to another interesting question:

What distributed systems theory should a distributed systems engineer know?

A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer; what I consider ‘table stakes’ for distributed systems engineers competent enough to design a new system. Let me know what you think I missed!

First steps

These four readings do a pretty good job of explaining what is challenging about building distributed systems. Collectively they outline a set of abstract but technical difficulties that the distributed systems engineer has to overcome, and set the stage for the more detailed investigation in later sections.

Distributed Systems for Fun and Profit is a short book which tries to cover some of the basic issues in distributed systems including the role of time and different strategies for replication.

Notes on distributed systems for young bloods – not theory, but a good practical counterbalance to keep the rest of your reading grounded.

A Note on Distributed Systems – a classic paper on why you can’t just pretend all remote interactions are like local objects.

The fallacies of distributed computing – 8 fallacies of distributed computing that set the stage for the kinds of things system designers forget.

Failure and Time

Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:

  1. processes may fail
  2. there is no good way to tell that they have done so

There is a very deep relationship between what, if anything, processes share about their knowledge of time, what failure scenarios are possible to detect, and what algorithms and primitives may be correctly implemented. Most of the time, we assume that two different nodes have absolutely no shared knowledge of what time it is, or how quickly time passes.

You should know:

* The (partial) hierarchy of failure modes: crash stop -> omission -> Byzantine. You should understand that what is possible under the most severe failure mode (Byzantine) must be possible under the weaker ones, and what is impossible under the weaker failure modes must be impossible under the more severe ones.

* How you decide whether an event happened before another event in the absence of any shared clock. This means Lamport clocks and their generalisation to Vector clocks, but also see the Dynamo paper. A small vector clock sketch follows this list.

* How big an impact the possibility of even a single failure can actually have on our ability to implement correct distributed systems (see my notes on the FLP result below).

* Different models of time: synchronous, partially synchronous and asynchronous (links coming, when I find a good reference).
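
As promised above, here is a minimal vector clock sketch in Java. The names are illustrative rather than taken from any particular paper: an event a happened-before an event b exactly when a's clock is component-wise less than or equal to b's, and strictly less in at least one component; if neither direction holds, the two events are concurrent.

// Minimal vector clock sketch: one counter per process.
class VectorClock {
    final int[] clock;
    final int self;   // index of the owning process

    VectorClock(int numProcesses, int self) {
        this.clock = new int[numProcesses];
        this.self = self;
    }

    // Local event: tick our own component.
    void localEvent() { clock[self]++; }

    // Message receipt: take the component-wise max with the sender's
    // timestamp, then tick our own component.
    void onReceive(int[] senderClock) {
        for (int i = 0; i < clock.length; i++) {
            clock[i] = Math.max(clock[i], senderClock[i]);
        }
        clock[self]++;
    }

    // a happened-before b iff a <= b component-wise and a != b.
    // If neither happenedBefore(a, b) nor happenedBefore(b, a), they are concurrent.
    static boolean happenedBefore(int[] a, int[] b) {
        boolean strictlySmaller = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlySmaller = true;
        }
        return strictlySmaller;
    }
}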

The basic tension of fault tolerance

A system that tolerates some faults without degrading must be able to act as though those faults had not occurred. This means usually that parts of the system must do work redundantly, but doing more work than is absolutely necessary typically carries a cost both in performance and resource consumption. This is the basic tension of adding fault tolerance to a system.

You should know:

* The quorum technique for ensuring single-copy serialisability. See Skeen’s original paper, but perhaps better is Wikipedia’s entry. A minimal quorum arithmetic sketch follows this list.

* About 2-phase-commit, 3-phase-commit and Paxos, and why they have different fault-tolerance properties.

* How eventual consistency, and other techniques, seek to avoid this tension at the cost of weaker guarantees about system behaviour. The Dynamo paper is a great place to start, but also Pat Helland’s classic Life Beyond Transactions is a must-read.
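
And the promised quorum sketch: a back-of-the-envelope check, with invented names, of the intersection conditions that make read and write quorums behave like a single copy.

// Quorum intersection conditions for N replicas, read quorum R, write quorum W.
public class QuorumCheck {
    // Every read overlaps the most recent write iff R + W > N.
    static boolean readSeesLatestWrite(int n, int r, int w) { return r + w > n; }

    // Any two writes overlap (so conflicting writes can be detected and ordered) iff 2W > N.
    static boolean writesOverlap(int n, int w) { return 2 * w > n; }

    public static void main(String[] args) {
        System.out.println(readSeesLatestWrite(3, 2, 2)); // true: N=3, R=2, W=2
        System.out.println(writesOverlap(3, 2));          // true
        System.out.println(readSeesLatestWrite(3, 1, 1)); // false: stale reads possible
    }
}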

Basic primitives

There are few agreed-upon basic building blocks in distributed systems, but more are beginning to emerge. You should know what the following problems are, and where to find a solution for them:

* Leader election (e.g. the Bully algorithm)

* Consistent snapshotting (e.g. this classic paper from Chandy and Lamport)

* Consensus (see the blog posts on 2PC and Paxos above)

* Distributed state machine replication (Wikipedia is ok, Lampson’s paper is canonical but dry).

Fundamental Results

Some facts just need to be internalised. There are more than this, naturally, but here’s a flavour:

  • You can’t implement consistent storage and respond to all requests if you might drop messages between processes. This is the CAP theorem.
  • Consensus is impossible to implement in such a way that it both a) is always correct and b) always terminates if even one machine might fail in an asynchronous system with crash-stop failures (the FLP result). The first slides – before the proof gets going – of my Papers We Love SF talk do a reasonable job of explaining the result, I hope. Suggestion: there’s no real need to understand the proof.
  • Consensus is impossible to solve in fewer than 2 rounds of messages in general.

Real systems

The most important exercise to repeat is to read descriptions of new, real systems, and to critique their design decisions. Do this over and over again. Some suggestions:

Google:

GFS, Spanner, F1, Chubby, BigTable, MillWheel, Omega, Dapper. Paxos Made Live, The Tail At Scale.

Not Google:

Dryad, Cassandra, Ceph, RAMCloud, HyperDex, PNUTS

Postscript

If you tame all the concepts and techniques on this list, I’d like to talk to you about engineering positions working with the menagerie of distributed systems we curate at Cloudera.

The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google
https://the-paper-trail.org/blog/the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/ | Thu, 26 Jun 2014

Note: this is a personal blog post, and doesn’t reflect the views of my employers at Cloudera

Map-Reduce is on its way out. But we shouldn’t measure its importance by the number of bytes it crunches, but by the fundamental shift in data processing architectures it helped popularise.

This morning, at their I/O Conference, Google revealed that they’re not using Map-Reduce to process data internally at all any more.

We shouldn’t be surprised. The writing has been on the wall for Map-Reduce for some time. The truth is that Map-Reduce as a processing paradigm continues to be severely restrictive, and is no more than a subset of richer processing systems.

It has been known for decades that generalised dataflow engines adequately capture the map-reduce model as a fairly trivial special case. However, there was real doubt over whether such engines could be efficiently implemented on large-scale cluster computers. But ever since Dryad, in 2007 (at least), it was clear to me that Map-Reduce’s days were numbered. Indeed, it’s a bit of a surprise to me that it lasted this long.

Map-Reduce has served a great purpose, though: many, many companies, research labs and individuals are successfully bringing Map-Reduce to bear on problems to which it is suited: brute-force processing with an optional aggregation. But more important in the longer term, to my mind, is the way that Map-Reduce provided the justification for re-evaluating the ways in which large-scale data processing platforms are built (and purchased!).

If we are in a data revolution right now, the computational advance that made it possible was not the ‘discovery’ of Map-Reduce, but instead the realisation that these computing systems can and should be built from relatively cheap, shared-nothing machines (and the real contribution from Google in this area was arguably GFS, not Map-Reduce).

The advantages of this new architecture are enormous and well understood: storage and compute become incrementally scalable, heterogeneous workloads are better supported and the faults that more commonly arise when you have ‘cheaper’ commodity components are at the same time much easier to absorb (paradoxically, it’s easier to build robust, fault-tolerant systems from unreliable components). Of course lower cost is a huge advantage as well, and is the one that has established vendors stuck between the rock of having to cannibalise their own hardware margins, and the hard place of being outmaneuvered by the new technology.

In the public domain, Hadoop would not have had any success without Map-Reduce to sell it. Until the open-source community developed the maturity to build successful replacements, commodity distributed computing needed an app – not a ‘killer’ app, necessarily, but some new approach that made some of the theoretical promise real. Buying into Map-Reduce meant buying into the platform.

Now we are much closer to delivering much more fully on the software promise. MPP database concepts, far from being completely incompatible with large shared-nothing deployments, are becoming more and more applicable as we develop a better understanding of the way to integrate distributed and local execution models. I remember sitting in a reading group at Microsoft Research in Cambridge as we discussed whether joins could ever be efficient in cluster computing. The answer has turned out to be yes, and the techniques were already known at the time. Transactions are similarly thought to be in the ‘can never work’ camp, but Spanner has shown that there’s progress to be made in that area. Perhaps OLTP will never move wholesale to cluster computing, but data in those clusters need not be read-only.

As these more general frameworks improve, they subsume Map-Reduce and make its shortcomings more evident. Map-Reduce has never been an easy paradigm to write new programs for, if only because the mapping between your problem and the rigid two-phase topology is rarely obvious. Languages can only mask that impedance mismatch to a certain extent. Map-Reduce, as implemented, typically has substantial overhead attributable both to its inherent ‘batchness’, and the need to have a barrier between the map and reduce phases. It’s a relief to offer end-users a better alternative.

So it’s no surprise to hear that Google have retired Map-Reduce. It will also be no surprise to me when, eventually, Hadoop does the same, and the elephant is finally given its dotage.

Paper notes: MemC3, a better Memcached
https://the-paper-trail.org/blog/paper-notes-memc3-a-better-memcached/ | Wed, 18 Jun 2014

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing


Fan and Andersen, NSDI 2013

The big idea: This is a paper about choosing your data structures and algorithms carefully. By paying careful attention to the workload and functional requirements, the authors reimplement memcached to achieve a) better concurrency and b) better space efficiency. Specifically, they introduce a variant of cuckoo hashing that is highly amenable to concurrent workloads, and integrate the venerable CLOCK cache eviction algorithm with the hash table for space-efficient approximate LRU.

Optimistic cuckoo hashing: Cuckoo hashing has some problems under high concurrency. Since each displacement requires accessing two buckets, careful locking is required to make sure that updates don’t conflict or, worse, deadlock, since you need to lock the entire path through the hash-table on each insert. Without this locking there’s a risk of false negatives due to missing a key move that’s in flight (note that keys can’t be moved atomically since they’re multi-byte strings).

The idea in this paper is to serialise on a single writer (thereby avoiding deadlocks), and to search forward for a valid ‘cuckoo path’ (a series of displacements), followed by tracing that path backwards and moving keys one at a time, effectively moving the hole rather than the key. Therefore the key is never transiently absent from the table; in fact it might briefly be in two places at once. Writer / reader conflicts might happen when a key gets moved out of the slot in which the reader looks for it. The answer is to lock optimistically using version numbers (striped across many keys to stay cache- and space-efficient, nice idea) and an odd-even scheme. When a version is odd, the writer is moving the key and readers should retry their read from the start; when it is even, the key is correct as long as the version does not change during the read of the key (if it changes, they need to re-read the key).
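
As a rough illustration of that odd-even versioning scheme (the names and structure below are mine, not MemC3's code), a reader retries until it observes the same even version before and after its read:

import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Supplier;

// Sketch of optimistic, version-striped reads. A writer makes the relevant
// version odd while it moves a key, and even again afterwards; readers retry
// whenever they see an odd version or a version change across the read.
class OptimisticVersions {
    private final AtomicLongArray versions;  // striped across many keys

    OptimisticVersions(int stripes) { versions = new AtomicLongArray(stripes); }

    private int stripe(int keyHash) { return Math.floorMod(keyHash, versions.length()); }

    // Writer-side: bracket a key displacement.
    void beginMove(int keyHash) { versions.incrementAndGet(stripe(keyHash)); } // now odd
    void endMove(int keyHash)   { versions.incrementAndGet(stripe(keyHash)); } // even again

    // Reader-side: readSlot examines both candidate buckets for the key.
    byte[] read(int keyHash, Supplier<byte[]> readSlot) {
        while (true) {
            long before = versions.get(stripe(keyHash));
            if ((before & 1) == 1) continue;          // a move is in progress: retry
            byte[] value = readSlot.get();
            long after = versions.get(stripe(keyHash));
            if (after == before) return value;        // no move overlapped the read
        }
    }
}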

For more details, see David Andersen’s blog. See also the slides from NSDI.

CLOCK-based LRU: Memcached spends 18 bytes for each key (!) on LRU maintenance: forward and backward pointers and a 2-byte reference counter. But why the need for strict LRU? Instead use CLOCK: 1 bit of recency that is set to 1 on every Update(). A ‘hand’ pointer moves around a circular buffer of entries. If the entry under the hand pointer has a recency value of 0, it is evicted; otherwise its recency value is set to 0 and the next item in the buffer is interrogated.
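
A minimal sketch of CLOCK itself, with illustrative names rather than MemC3's actual in-table layout:

// One recency bit per entry in a circular buffer, plus a hand that sweeps
// the buffer clearing bits until it finds an entry whose bit is already 0.
class ClockEviction {
    private final boolean[] recency;
    private int hand = 0;

    ClockEviction(int capacity) { recency = new boolean[capacity]; }

    // Called on every Get()/Update() of entry i.
    void touch(int i) { recency[i] = true; }

    // Returns the index of the entry to evict (approximately the LRU one).
    int evict() {
        while (true) {
            if (!recency[hand]) {
                int victim = hand;
                hand = (hand + 1) % recency.length;
                return victim;
            }
            recency[hand] = false;                 // second chance
            hand = (hand + 1) % recency.length;
        }
    }
}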

The versioning scheme described above is used to coordinate when an item is being evicted (but may be concurrently read). If the eviction process decides to evict an entry, its version is incremented to indicate that it is being modified (i.e. to an odd value). Then the key is removed.

Notes:

  • A single writer obviously limits throughput for write-heavy workloads; that’s not the target of the paper.
  • The general lesson is to pay attention to data structures; memcached wastes a surprising amount of space per key.
  • The evaluation shows great improvements from using a more efficient key comparator in the single-node, no-concurrency case.
  • Also, using a 1-byte hash of the key as a ‘tag’ provides an early-out for non-matching keys, at the cost of a dependent memory read if the keys match (although you’d have that cost anyhow, since the keys are too large to be cache-efficient in a cuckoo hash bucket).

Paper notes: Anti-Caching
https://the-paper-trail.org/blog/paper-notes-anti-caching/ | Fri, 06 Jun 2014

Anti-Caching: A New Approach to Database Management System Architecture


DeBrabant et al., VLDB 2013

The big idea: Traditional databases typically rely on the OS page cache to bring hot tuples into memory and keep them there. This suffers from a number of problems:

  • No control over granularity of caching or eviction (so keeping a tuple in memory might keep all the tuples in its page as well, even though there’s not necessarily a usage correlation between them)
  • No control over when fetches are performed (fetches are typically slow, and transactions may hold onto locks or latches while the access is being made)
  • Duplication of resources – tuples can occupy both disk blocks and memory pages.

Instead, this paper proposes a DB-controlled mechanism for tuple caching and eviction called anti-caching. The idea is that the DB chooses exactly what to evict and when. The ‘anti’ aspect arises when you consider that the disk is now the place to store recently unused tuples, not the source of ground truth for the entire database. The disk, in fact, can’t easily store the tuples that are in memory because, as we shall see, the anti-caching mechanism may choose to write tuples into arbitrary blocks upon eviction, which will not correspond to the pages that are already on disk, giving rise to a complex rewrite / compaction problem on eviction. The benefits are realised partly in IO: only tuples that are cold are evicted (rather than those that are unluckily sitting in the same page), and fetches and evictions may be batched.

The basic mechanism is simple. Tuples are stored in an LRU chain. When memory pressure exceeds a threshold, some proportion of tuples are evicted to disk in a fixed-size block. Only the tuples that are ‘cold’ are evicted, by packing them together in a new block. When a transaction accesses an evicted tuple (the DBMS tracks the location of every tuple in memory), all tuples that might be accessed by the transaction are identified at once in what’s called a pre-pass phase. After pre-pass completes, the transaction is aborted (to let other transactions proceed) while the tuples are retrieved from disk en masse. This gives a better disk access pattern than virtual memory paging, since page faults are typically served sequentially, whereas tuple blocks can be retrieved in parallel. Presumably this process repeats if the set of tuples a transaction requires changes between abort and restart, in the hope that eventually the set of tuples in memory and the set accessed by the transaction converge.
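
To make the eviction half concrete, here is a toy sketch of packing cold tuples from an LRU chain into a fixed-size block. The names are invented and this is not H-Store's actual interface.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Toy sketch of anti-caching eviction: cold tuples are popped off the LRU
// chain and packed into one fixed-size block destined for disk; the caller
// records each evicted tuple's new (block, offset) location in memory.
class AntiCache<T> {
    private final ArrayDeque<T> lruChain = new ArrayDeque<>(); // head = coldest
    private final long blockSizeBytes;

    AntiCache(long blockSizeBytes) { this.blockSizeBytes = blockSizeBytes; }

    // Move a tuple to the hot end on access. (O(n) here for brevity; a real
    // implementation uses an intrusive doubly-linked chain.)
    void touch(T tuple) {
        lruChain.remove(tuple);
        lruChain.addLast(tuple);
    }

    // Pack the coldest tuples into a block until it is full.
    List<T> evictBlock(ToLongFunction<T> sizeOf) {
        List<T> block = new ArrayList<>();
        long used = 0;
        while (!lruChain.isEmpty()) {
            long size = sizeOf.applyAsLong(lruChain.peekFirst());
            if (used + size > blockSizeBytes) break;
            block.add(lruChain.pollFirst());
            used += size;
        }
        return block;
    }
}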

Since tuples are read into memory in a block, two obvious strategies are possible for merging them into memory. The first merges all tuples from a block, even those that aren’t required (those that aren’t required could conceivably be immediately evicted, leading to an oscillation problem). The second merges only those tuples that are accessed, but leads to a compaction problem for those tuples left behind; the paper compacts blocks lazily during merge.

Since the overall approach has some efficiency problems to solve, some optimisations are proposed:

  • Only sample a proportion of transactions to update the LRU chain. Hot tuples still have a relatively higher probability of being at the front of the LRU chain.
  • Some tables may be marked as unevictable (small / dimension tables?) and do not participate in the LRU chain
  • LRU chains are per-partition, not global across the whole system

Future work: expand anti-caching to handle larger-than-main-memory queries, use query optimisation to avoid fetching unnecessary tuples (e.g. those covered by an index), or store only a projection of a tuple in memory, better block compaction and reorganisation schemes.

Also see: H-Store’s documentation

Paper notes: Stream Processing at Google with Millwheel
https://the-paper-trail.org/blog/paper-notes-stream-processing-at-google-with-millwheel/ | Wed, 04 Jun 2014

MillWheel: Fault-Tolerant Stream Processing at Internet Scale


Akidau et al., VLDB 2013

The big idea: Streaming computations at scale are nothing new. Millwheel is a standard DAG stream processor, but one that runs at ‘Google’ scale. This paper really answers the following questions: what guarantees should be made about delivery and fault-tolerance to support most common use cases cheaply? What optimisations become available if you choose these guarantees carefully?

Notable features:

  • A tight bound on the timestamp of all events still in the system, called the low watermark. This substitutes for in-order delivery by allowing processors to know when a time period has been fully processed (nice observation: you don’t necessarily care about receiving events in order, but you do care about knowing whether you’ve seen all the events before a certain time).
  • Persistent storage available at all nodes in the stream graph
  • Exactly once delivery semantics.
  • Triggerable computations called timers that are executed in node context when either a low watermark or wall clock time is observed. This can be modelled as a separate source of null events that allows periodic roll-up to be done.

Each node in the graph receives input events, optionally computes some aggregate, and also optionally emits one or more output events as a result. In order to scale and do load balancing, each input record must have a key (the key extraction function is user defined and can change between nodes); events with identical keys are sent to the same node for processing.

The low watermark at A is defined as min(oldest event still in A, low watermark amongst all streams sending to A). Watermarks are tracked centrally. Note that the monotonically increasing property of watermarks requires that events enter the system in time order, and therefore we can’t track arbitrary functions as watermarks.
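
In code, the recurrence looks roughly like this (an illustrative sketch, not MillWheel's implementation):

import java.util.Collection;

// Low watermark of computation A = min(oldest pending event at A,
//                                      low watermarks of all streams sending to A).
// Timestamps here are event-time values, e.g. epoch milliseconds.
final class LowWatermark {
    static long of(long oldestPendingEventAtA, Collection<Long> upstreamLowWatermarks) {
        long lw = oldestPendingEventAtA;
        for (long upstream : upstreamLowWatermarks) {
            lw = Math.min(lw, upstream);
        }
        return lw; // A has (modulo late data) seen every event with timestamp < lw
    }
}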

State is persisted centrally. To ensure atomicity, only one bulk write is permitted per event. To avoid zombie writers (where work has been moved elsewhere through failure detection or through load balancing), every writer has a lease or sequencer that ensures only they may write. The storage system allows for this to be atomically checked at the same time as a write (i.e. conditional atomic writes).

Emitted records are checkpointed before delivery so that if an acknowledgment is not received the record can be re-sent (duplicates are discarded by MillWheel at the recipient). The checkpoints allow fault-tolerance: if a processor crashes and is restarted somewhere else any intermediate computations can be recovered. When a delivery is ACKed the checkpoints can be garbage collected. The Checkpoint->Delivery->ACK->GC sequence is called a strong production. When a processor restarts, unacked productions are resent. The recipient de-duplicates.
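
The strong production sequence, roughly, with invented interfaces rather than MillWheel's API:

// Sketch of a "strong production": checkpoint before delivery, resend until
// acknowledged (the recipient de-duplicates by record id), then garbage-collect
// the checkpoint once the ACK arrives.
interface CheckpointStore { void put(long recordId, byte[] record); void delete(long recordId); }
interface Downstream { boolean sendAndAwaitAck(long recordId, byte[] record); }

final class StrongProduction {
    static void produce(long recordId, byte[] record, CheckpointStore store, Downstream out) {
        store.put(recordId, record);                    // 1. checkpoint before delivery
        while (!out.sendAndAwaitAck(recordId, record)) {
            // 2. resend until the recipient ACKs; duplicates are discarded downstream
        }
        store.delete(recordId);                         // 3. GC the checkpoint after the ACK
    }
}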

In some cases, Millwheel users can optimise by allowing events to be sent before the checkpoint is committed to persistent storage – this is called a weak production. Weak productions are usually possible if the processing of an event is idempotent with respect to the persistent storage and event production: that is, once you commit to sending an event you always produce the same event (so aggregates over time don’t necessarily work), and / or receipt of those events multiple times doesn’t cause inconsistencies.

Paper notes: DB2 with BLU Acceleration
https://the-paper-trail.org/blog/paper-notes-db2-with-blu-acceleration/ | Thu, 15 May 2014

DB2 with BLU Acceleration: So Much More than Just a Column Store

Raman et al., VLDB 2013

The big idea: IBM’s venerable DB2 technology was based on a traditional row-oriented engine. By moving to a columnar execution engine, and crucially then by taking full advantage of the optimisations that columnar formats allow, the ‘BLU Acceleration’ project was able to speed up read-mostly BI workloads by 10 to 50 times.

BLU is a single-node system which still uses the DB2 planner (although changed to generate plans for new columnar operators). This paper describes the basic workings of the execution engine itself. Of interest is the fact that the row-based execution engine can continue to co-exist with BLU (so existing tables don’t have to be converted, pretty important for long-time DB2 customers).

Not much is said about the overall execution model; presumably it is a traditional Volcano-style architecture with batches of column values passed between operators. Neither is much said about resource management: BLU is heavily multi-threaded, but the budgeting mechanism for threads assigned to any given query is not included.

On-disk format:

Every column may have multiple encoding schemes associated with it (for example, short-code-word dictionaries for frequent data, larger-width code-words for infrequent values; the key is that code-words are constant-size within a scheme). The on-disk layout is hierarchical:

  • Columns are grouped into column groups; all values from the same column group are stored together.
  • Column groups are stored in pages, and a page may contain one or more regions. A single region has a constant compression scheme for all columns stored within it.
  • Within a region, individual columns are stored in fixed-width data banks.
  • Each page also contains a tuple map, a bitset identifying to which region each tuple in the page belongs. If there are two regions in a page, the tuple map has one bit per tuple, and so on. Each page contains a contiguous set of ‘tuplets’ (the projection of a tuple onto a column group), so the tuple map is dense.
  • All column groups are ordered by the same ‘tuple sequence number’ (TSN). The TSN-to-page mapping is contained in a B+-tree.

Nullable columns are handled with an extra 1-bit nullable column.
Each column has a synopsis table associated with it which allows for page-skipping, and is stored in the same format as regular table data.

Scans:

There are two variants of on-disk column access. The first, ‘LEAF’, evaluates predicates over columns in a column group. It does so region by region, producing for each region a bitmap with a 1 for every tuplet which passed evaluation. The bitmaps are then interleaved using some bit-twiddling magic to produce a single bitmap in TSN order. It’s not clear if the only output from LEAF is the validity bitmap. Predicate evaluation can be done over coded data, and using SIMD-efficient algorithms.

The second column access operator, ‘LCOL’, loads columns either in coded or uncompressed format. Coded values can be fed directly into joins and aggregations, so it can often pay not to decompress. LCOL respects a validity bitmap as a parameter, and produces output which contains only valid tuplets.

Joins:

Joins follow a typical build-probe, two-phase pattern. Both phases are heavily multi-threaded. In the build phase, join keys are partitioned. Each thread partitions a separate subset of the scanned rows (presumably contiguous by TSN). Then each thread builds a hash-table from one partition. Partitions are sized according to memory (in order to keep a partition in a single level of the memory hierarchy).

Joins are performed on encoded data, but since the join keys may be encoded differently in the outer and the inner, the inner is re-encoded to the outer’s coding scheme. At the same time, Bloom filters are built on the inner, and later pushed down to the outer scan. Interestingly, the opposite is also possible: if the inner is very large, the outer is scanned once to compute a Bloom filter used to filter the inner. This extra scan is paid for by not having to spill to disk when the inner hash table gets large. Sometimes spilling is unavoidable; in that case either some partitions of the inner and outer are spilled or, depending on a cost model, only the inner is spilled. This has the benefit of not requiring early materialisation of non-join columns from the outer.

Aggregations:

Aggregations are also two-phase. Each thread creates its own local aggregation hash-table based on a partition of the input. In the second phase, those partitions are merged: each thread takes a partition of the output hash table and scans every local hash-table produced in the first phase. In this way, every thread is writing to a local data-structure without contention in both phases. Therefore there must be a barrier between both phases, to avoid a thread updating a local hash table while it’s being read in phase 2.
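
A rough sketch of that two-phase, contention-free pattern, using a simple per-group count; the code is illustrative rather than BLU's:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Phase 1 builds one local hash table per thread over that thread's slice of
// the input; phase 2 assigns each thread a partition of the group keys and
// merges that partition out of every local table. A barrier separates the
// phases so no local table is read while it is still being built.
final class TwoPhaseCount {
    // Phase 1: each thread calls this on its own slice of the input.
    static Map<String, Long> localAggregate(List<String> slice) {
        Map<String, Long> local = new HashMap<>();
        for (String key : slice) local.merge(key, 1L, Long::sum);
        return local;
    }

    // Phase 2: thread `p` of `numPartitions` merges only the keys it owns,
    // so no two threads ever write to the same output map.
    static Map<String, Long> mergePartition(List<Map<String, Long>> locals,
                                            int p, int numPartitions) {
        Map<String, Long> out = new HashMap<>();
        for (Map<String, Long> local : locals) {
            for (Map.Entry<String, Long> e : local.entrySet()) {
                if (Math.floorMod(e.getKey().hashCode(), numPartitions) == p) {
                    out.merge(e.getKey(), e.getValue(), Long::sum);
                }
            }
        }
        return out;
    }
}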

If a local hash table gets too large, a thread can create ‘overflow buckets’ which contain new groups. Once an overflow bucket gets full, it gets published to a global list of overflow buckets for a given partition. Threads may reorganise their local hash table and corresponding overflow buckets by moving less-frequent groups to the overflow buckets. A recurring theme is the constant gathering of statistics to adapt operator behaviour on the fly.

Étale cohomology
https://the-paper-trail.org/blog/etale-cohomology/ | Wed, 05 Mar 2014

The second in an extremely irregular series of posts made on behalf of my father, who has spent much of his retirement so far doing very hard mathematics. What is attached here is the essay he wrote for Part III of the Cambridge Mathematical Tripos, a one-year taught course. The subject is étale cohomology.

Says my Dad: “I am afraid that I have been lured away from the translation of SGA 4.5 for some time by the attraction of working on Wolfgang Krull’s report on “Idealtheorie” from 1935 (again I am not aware of an English version anywhere) which is yet another important classic. However during a year at Cambridge I did write an essay as a very basic introduction to Étale Cohomology which was based on the first part of SGA 4.5. So with the usual imprecation of caveat lector, here it is as a temporising partial substitute should any other beginner be interested.”

Here’s part of the introduction:

This essay has been written as part of the one year Certificate of Advanced Study in Mathematics (CASM) course at Cambridge University which coincides with Part III of the Mathematical Tripos. The starting point is, of necessity, roughly that reached in the lectures which in this particular year did not include much in the way of schemes and sheaves, nor, in the case of the author, much in the way of algebraic number theory.
Thus the frontiers of the subject can safely rest undisturbed by the contents of this essay. Rather it has been written with a reader in mind corresponding roughly to the author at the start of the enterprise. That is someone who is interested to find out what all the fuss was with the French algebraic geometers in the 1960s but is in need of some fairly elementary background to map out the abstractions involved and with any luck to avoid drowning in the “rising sea”.

And here’s the essay itself!

ByteArrayOutputStream is really, really slow sometimes in JDK6
https://the-paper-trail.org/blog/535/ | Fri, 10 Jan 2014

TLDR: Yesterday I mentioned on Twitter that I’d found a bad performance problem when writing to a large ByteArrayOutputStream in Java. After some digging, it appears to be the case that there’s a bad bug in JDK6 that doesn’t affect correctness, but does cause performance to nosedive when a ByteArrayOutputStream gets large. This post explains why.

Two of Impala’s server processes have both C++ and Java components (for reasons both historic and pragmatic). We often need to pass data structures from C++ to Java and vice versa, and mapping the C++ representation onto a Java one via JNI is too painful to contemplate. So instead we take advantage of the fact that Thrift is very good at generating equivalent data structures in different languages, and make every parameter to methods on the JNI boundary a serialised Thrift structure. That is, it’s a byte array that Thrift on both sides knows how to convert into a Thrift structure. So we pass byte arrays back and forth, and use Thrift to convert them to language-readable data structures. This works pretty well. (To see exactly how, start by reading frontend.cc and JniFrontend.java). We pay an extra copy or two, plus the CPU overhead of the serialisation, but the benefits in terms of usability and maintainability of the interface vastly outweigh some pretty nominal performance hits.

If the performance hit isn’t nominal, however, we have a problem. And this is what we observed earlier this week: one of the JNI methods was trying to pass a huge data structure back from Java to C++. Doing so was taking a long time – on the order of minutes. What was particularly of interest was that the performance dropped off a cliff: a data structure half the size was happily serialising in about 500ms. So we have a non-linear relationship between the size of the input and the cost of serialising it. We can’t really absorb that cost, so we had to understand the problem.

So how did we get there? Thrift’s Java serialisation implementation works by having a TSerializer object, which contains a ByteArrayOutputStream, call write() on a Thrift structure with its ByteArrayOutputStream as an argument. The Thrift structure then walks its members and writes object headers and then serialised data for each field in turn. The result is lots of small write() calls to the ByteArrayOutputStream.

The first thing was to connect a profiler (YourKit, but honestly repeated SIGQUIT to get the stack traces would have worked). During the long serialisation period, almost all the time was spent inside java.util.Arrays.copyOf, inside a method to write a byte[] to a ByteArrayOutputStream. Progress was being made – the item being written to the ByteArrayOutputStream was changing – but it was taking an unreasonably long time to write each field.

A ByteArrayOutputStream is not necessarily initialised with any estimate of the ultimate size of the byte array it wraps. So it needs a mechanism to resize when more space is required. The source for ByteArrayOutputStream.write(byte[], int, int) in JDK6 shows the (very standard) strategy it uses.

public synchronized void write(byte b[], int off, int len) {
    if ((off < 0) || (off > b.length) || (len < 0) ||
        ((off + len) > b.length) || ((off + len) < 0)) {
        throw new IndexOutOfBoundsException();
    } else if (len == 0) {
        return;
    }
    int newcount = count + len;
    if (newcount > buf.length) {
        buf = Arrays.copyOf(buf, Math.max(buf.length << 1, newcount));
    }
    System.arraycopy(b, off, buf, count, len);
    count = newcount;
}

The first six lines just deal with parameter validation; they can be ignored from here on. Lines 8-9 are interesting: we compute the new size of the array after the write completes, and then, if that size is larger than the current size, we need to do something to compensate.

Line 10 is where that compensation happens. Arrays.copyOf() creates a new array containing all the bytes from the original array, but with a larger size. The size of the new array is the maximum of twice the current length (buf.length << 1) and the requested size of the array after the write completes (this is so that a large write that more than doubles the current size of the array can be accommodated). Performing this copy is expensive, but since the size of the array should grow exponentially, frequent copies are hopefully unlikely. C++'s std::vector does the same thing.

After that (lines 12-13) we copy in the argument, and update the tracked number of bytes in the array.

My working hypothesis was that copyOf() was being called on every write() (since that matched up with what the profiler was telling us). The source code tells us the only way that can happen is if newcount is always larger than buf.length. This leads to two possibilities: newcount is getting large quickly, or buf.length is getting large slowly. The former seems unlikely - Thrift serialisation works by writing many small byte arrays - so to support my hypothesis, buf.length had to be growing slowly so that the copyOf() branch was being taken much more frequently than we expected.

A session with JDB (a terrible, terrible debugger) confirmed this. During the slow serialisation period, the size of the array increased on every write only by the amount required to contain the write in progress. On every write of say 2 bytes, the array size would increase by exactly those 2 bytes and a copy would be taken. The array itself was about 1GB in size, so the copy was really expensive.

This leads us to the bug. The size of the array is determined by Math.max(buf.length << 1, newcount). Ordinarily, buf.length << 1 returns double buf.length, which would always be much larger than newcount for a 2 byte write. Why was it not?

The problem is that for all integers larger than Integer.MAX_VALUE / 2, shifting left by one place causes overflow, setting the sign bit. The result is a _negative_ integer, which is always less than newcount. So for all byte arrays of 1073741824 bytes (i.e. one GB) or larger, any write will cause the array to resize, and only to exactly the size required.

You could argue that this is by design for the following reason: the maximum size of any array in Java is Integer.MAX_VALUE (minus a few bytes for preamble). Any array larger than Integer.MAX_VALUE / 2 bytes long would become larger than that limit when doubling in size. However, the source for ByteArrayOutputStream.write() could handle this case by setting the new length to Integer.MAX_VALUE if buf.length > Integer.MAX_VALUE / 2, to give the array the maximum chance to grow with few copies.
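
For illustration, a growth computation along the lines suggested above might look like this (a sketch, not the JDK's actual code):

// Sketch of an overflow-safe growth policy for a copy-on-grow buffer.
// Doubling is done in long arithmetic and clamped to the maximum array size,
// so an array past the 1GB mark can still grow with few copies.
final class Growth {
    static int newCapacity(int currentLength, int requiredCapacity) {
        long doubled = (long) currentLength << 1;              // no int overflow here
        long grown = Math.max(doubled, (long) requiredCapacity);
        // Clamp to a safe maximum array size (the VM reserves a few bytes of
        // header; Integer.MAX_VALUE - 8 is a commonly used bound).
        long max = Integer.MAX_VALUE - 8;
        if (requiredCapacity > max) throw new OutOfMemoryError("Required array size too large");
        return (int) Math.min(grown, max);
    }
}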

The true fix is for us to cut down the size of the object we want to marshal, or to come up with some less expensive way of doing so (we could use a different TSerializer implementation, for example). Still, it's an unfortunate degradation in a fairly commonly used class, even if there are other, better ways of achieving the same thing.

Postscript

In fact, JDK7 'fixed' the issue by correctly dealing with overflow, but if the resulting doubled array length is larger than Integer.MAX_VALUE, an exception is thrown. You can check by running this code on both JDK6 and JDK7:

import java.io.ByteArrayOutputStream;

public class TestByteArray {

  // One 1MB chunk, written repeatedly so the stream's backing array keeps growing.
  static byte[] chunk = new byte[1024 * 1024];

  public static void main(String[] args) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    int numChunks = 2 * 1024 * 1024;
    for (int i = 0; i < numChunks; ++i) {
      long start = System.currentTimeMillis();
      baos.write(chunk, 0, chunk.length);
      long end = System.currentTimeMillis();
      // Use long arithmetic for the running total so the printed value doesn't overflow past 2GB.
      System.out.println("Chunk " + i + " of " + numChunks + " took: "
          + ((end - start) / 1000.0) + "s, total written: " + ((long) i * chunk.length) + " bytes");
    }
  }
}

On JDK6:

...
Chunk 1015 of 2097152 took: 0.0010s, total written: 1064304640 bytes
Chunk 1016 of 2097152 took: 0.0s, total written: 1065353216 bytes
Chunk 1017 of 2097152 took: 0.0010s, total written: 1066401792 bytes
Chunk 1018 of 2097152 took: 0.0s, total written: 1067450368 bytes
Chunk 1019 of 2097152 took: 0.0s, total written: 1068498944 bytes
Chunk 1020 of 2097152 took: 0.0010s, total written: 1069547520 bytes
Chunk 1021 of 2097152 took: 0.0s, total written: 1070596096 bytes
Chunk 1022 of 2097152 took: 0.0s, total written: 1071644672 bytes
Chunk 1023 of 2097152 took: 0.0010s, total written: 1072693248 bytes
Chunk 1024 of 2097152 took: 1.163s, total written: 1073741824 bytes <-- >1s per write!
Chunk 1025 of 2097152 took: 0.979s, total written: 1074790400 bytes
Chunk 1026 of 2097152 took: 0.948s, total written: 1075838976 bytes
Chunk 1027 of 2097152 took: 1.053s, total written: 1076887552 bytes
Chunk 1028 of 2097152 took: 1.033s, total written: 1077936128 bytes
Chunk 1029 of 2097152 took: 1.123s, total written: 1078984704 bytes
Chunk 1030 of 2097152 took: 0.723s, total written: 1080033280 bytes
Chunk 1031 of 2097152 took: 0.603s, total written: 1081081856 bytes
...

On JDK7:

...
Chunk 1015 of 2097152 took: 0.0s, total written: 1064304640 bytes
Chunk 1016 of 2097152 took: 0.0s, total written: 1065353216 bytes
Chunk 1017 of 2097152 took: 0.0s, total written: 1066401792 bytes
Chunk 1018 of 2097152 took: 0.0s, total written: 1067450368 bytes
Chunk 1019 of 2097152 took: 0.0s, total written: 1068498944 bytes
Chunk 1020 of 2097152 took: 0.0s, total written: 1069547520 bytes
Chunk 1021 of 2097152 took: 0.001s, total written: 1070596096 bytes
Chunk 1022 of 2097152 took: 0.0s, total written: 1071644672 bytes
Chunk 1023 of 2097152 took: 0.001s, total written: 1072693248 bytes
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at TestByteArray.main(TestByteArray.java:11)
