Reservoir Sampling

Right, time to get this blog back on track. I want to talk about a useful technique that’s both highly practical and a regular fixture in interview questions.

Consider this problem: How can we efficiently select k items at random from a set S of size n > k, where n is unknown in advance, such that each member of S has an equal probability of being selected?

At first glance, this problem looks like a strange mix of trivial and impossible. If we don’t know n, how can we know how to weight our selection probabilities? And, assuming S is finite, how can we not know n if we’re able (as we must be) to visit every element in S?

To address the second point first: we can find n by iterating over all the elements in S. However, this adds a pass over the entire set, which might be expensive. Consider selecting rows at random from a large database table: we don’t want to bring the whole table into memory just to count its rows. Linked lists are another good motivating example. Say we would like to select k elements from a linked list of length n. Even if we do loop over the list to count its elements, the simple approach of picking k random indices and seeking to each one will still be slow, because random access in a linked list is not a constant-time operation: each of the k lookups is an O(n) walk from the head, giving O(kn) time on average. By using the following technique, called ‘reservoir sampling’, we can bring this down to Θ(n).

We can keep our selection of k elements updated on-line as we scan through our data structure, and we can do it very simply.
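
To make the idea concrete, here is a minimal sketch in Python of the classic single-pass scheme (often called Algorithm R). The function name reservoir_sample is my own choice for illustration, not a library routine:

```python
import random

def reservoir_sample(stream, k):
    """Select k items uniformly at random from an iterable of
    unknown length, in a single pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen reservoir slot.
            j = random.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

A quick sanity check: running reservoir_sample(range(1000), 10) many times should pick every value about equally often, and the loop never needs to know the length of the stream up front; the whole scan is a single Θ(n) traversal, which is exactly what we wanted for the linked-list case.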
