Reservoir Sampling

Right, time to get this blog back on track. I want to talk about a useful technique that’s both highly practical and crops up in interview scenarios regularly.

Consider this problem: How can we efficiently select $k$ items at random from a set $S$ of size $n > k$, where $n$ is unknown? Each member of $S$ should have an equal probability of being selected.

At first glance, this problem looks like a strange mix of trivial and impossible. If we don’t know $n$, how can we know how to weight our selection probabilities? And, assuming $S$ is finite, how can we not know $n$ if we’re able - as we must be - to visit every element in $S$?

To address the second point first: we can find $n$ by iterating over all elements in $S$. However, this adds a pass over the entire set that might be expensive. Consider selecting rows at random from a large database table: we don’t want to bring the whole table into memory just to count the rows. Linked lists are another good motivating example - say we would like to select $k$ elements from a linked list of length $n$. Even if we do loop over the list to count its elements, the simple approach of then choosing $k$ elements at random will still be slow, because random access in a linked list is not a constant-time operation; in general it takes $O(kn)$ time on average. By using the following technique, called ‘reservoir sampling’, we can bring this down to $\Theta(n)$.

We can keep our selection of $k$ elements updated on-line as we scan through our data structure, and we can do it very simply.

Assume we make one pass over $S$. As we encounter a new element, we need to decide whether to include it in the sample or not. Consider the $i^{th}$ element. We assume that we have already selected $k$ elements uniformly at random from the $i-1$ elements we have already encountered; we’ll call that set of $k$ the reservoir. If we can update the reservoir so that it correctly contains $k$ elements, each selected with probability $k/i$, then we can show that when $i = n$ we’ll have a correct sample.

With what probability should we put the $i^{th}$ element in the reservoir? This is simple: we select it with probability $k/i$. But this ‘overflows’ the reservoir, so we have to choose an element to remove to make space. Each of the $i-1$ elements seen so far is in the reservoir with probability $k/(i-1)$ by our assumption, so intuition tells us that we should just pick one of the $k$ current members uniformly at random and evict it.
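As a concrete sketch of this update step in Python (the function name and signature are mine; the post doesn’t give an implementation), assuming the reservoir already holds $k$ uniformly chosen elements from the first $i-1$:

```python
import random

def update_reservoir(reservoir, element, i):
    """Reservoir-sampling update for the i-th element (1-indexed).

    Assumes `reservoir` already holds k elements chosen uniformly
    at random from the first i-1 elements of the stream.
    """
    k = len(reservoir)
    # Include the new element with probability k/i...
    if random.random() < k / i:
        # ...and evict a uniformly chosen current member to make room.
        reservoir[random.randrange(k)] = element
```

Each call leaves the reservoir at size $k$; the argument below shows why the membership probabilities stay equal after the update.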

Does that work? Let’s look at the probabilities. The probability that the $i^{th}$ element is in the reservoir is $k/i$, which is correct. The probability that a given element currently in the reservoir remains there is $1 - \frac{k}{i} + \frac{k}{i}\cdot\frac{k-1}{k} = \frac{i-1}{i}$ (either the new element is rejected, or it is accepted and evicts one of the other $k-1$ slots). For each of the $i-1$ elements already considered there is, by assumption, a $\frac{k}{i-1}$ probability that it is in the reservoir already. Multiplying the two probabilities to give the total probability that any of the $i-1$ elements will be in the reservoir after considering the $i^{th}$ element gives:

$$P(X_{ji} = 1) = \frac{k}{i-1}\cdot\frac{i-1}{i} = \frac{k}{i}$$

where $X_{ji}$, $1 \leq j < i$, is a binary random variable which is 1 if element $j$ is selected to be in the reservoir after considering the first $i$ elements.

While $i \le k$ the probability of selecting element $i$ is 1, and since the reservoir is not yet full there’s no point removing elements. This gives us the ‘base case’ of our inductive proof: it’s easy to show that this method gives us a full reservoir with element selection probability $k/i$ when $i = k$ (every element so far is selected with probability $k/k = 1$). For all $i > k$, our inductive step above allows us to update the reservoir, keeping all the selection probabilities equal to $k/i$.
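Putting the base case and the inductive step together, a minimal Python sketch of the whole algorithm might look like this (the name `reservoir_sample` is mine, not from the post):

```python
import random

def reservoir_sample(stream, k):
    """Return k items drawn uniformly at random from an iterable of
    unknown length, in a single pass."""
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if i <= k:
            # Base case: the first k elements fill the reservoir
            # (each is selected with probability k/k = 1).
            reservoir.append(element)
        elif random.random() < k / i:
            # Inductive step: include element i with probability k/i,
            # evicting a uniformly chosen current member.
            reservoir[random.randrange(k)] = element
    return reservoir
```

Because the input is consumed as a plain iterator, this works just as well for a database cursor or a linked-list traversal as for an in-memory list: one pass, $\Theta(n)$ time, $O(k)$ extra space.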

This is incredibly simple to implement, and very fast: because the reservoir has a fixed size we can implement it as an array and not worry about high costs for insertion or deletion. At most two random numbers are needed per element (one to decide inclusion, one to pick the eviction slot), making this a very efficient solution and much quicker than random access in a linearly accessible data structure.
