# Reservoir Sampling

Right, time to get this blog back on track. I want to talk about a useful technique that’s highly practical and crops up regularly in interview scenarios.

Consider this problem: how can we efficiently select $k$ items uniformly at random from a set $S$ of size $n > k$, when $n$ is not known in advance? Each member of $S$ should have an equal probability of being selected.

At first glance, this problem looks like a strange mix of trivial and impossible. If we don’t know $n$, how can we know how to weight our selection probabilities? And, assuming $S$ is finite, how can we not know $n$ if we’re able – as we must be – to visit every element in $S$?

To address the second point first: we can find $n$ by iterating over all elements in $S$. However, this adds a pass over the entire set that might be expensive. Consider selecting rows at random from a large database table – we don’t want to bring the whole table into memory just to count its rows. Linked lists are another good motivating example: say we would like to select $k$ elements from a linked list of length $n$. Even if we do loop over the list to count its elements, the simple approach of choosing $k$ random indices and fetching each one will still be slow, because random access in a linked list is not a constant-time operation: each lookup costs $O(n)$ on average, so selecting all $k$ elements takes $O(kn)$ time. By using the following technique, called ‘reservoir sampling’, we can bring this down to $\Theta(n)$.
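To make that cost concrete, here’s a minimal sketch of the naive approach in Python. The linked list representation is hypothetical – I’m assuming nodes with `value` and `next` attributes – but the shape of the cost is the point:

```python
import random

class Node:
    """A minimal singly linked list node (hypothetical, for illustration)."""
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def naive_sample(head, n, k):
    """Pick k distinct random indices up front, then fetch each element
    by walking from the head: each fetch is O(n), so O(kn) in total."""
    chosen = []
    for i in random.sample(range(n), k):
        node = head
        for _ in range(i):      # no constant-time indexing: follow i links
            node = node.next
        chosen.append(node.value)
    return chosen
```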

We can keep our selection of $k$ elements updated on-line as we scan through our data structure, and we can do it very simply.
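As a preview of just how simple, here’s a minimal Python sketch of the classic one-pass formulation, usually known as Algorithm R:

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from an iterable of unknown length
    in a single pass (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randrange(i + 1)  # uniform over the i+1 items seen so far
            if j < k:                    # happens with probability k/(i+1)
                reservoir[j] = item
    return reservoir
```

Because the stream can be any iterable, something like `reservoir_sample(open('table.csv'), k)` picks $k$ uniformly random lines from a file in one pass, without counting them first.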
