This is an essay that I wrote for a competition in College. The competition was for the ‘best explanation of a principle of scientific interest to the general public’. I twisted the definition somewhat and wrote a very light – and rather flawed – introduction to Turing’s Entschiedungsproblem work in the context of explaining what a computer scientist really spends her days thinking about. These are both subjects that are close to my heart: on one hand the tragic life of the brilliant Turing as arguable father of computer science, and on the other the correction of now pervasive assumptions about what computer science is and can be. I don’t think I did either justice, but, happily, the judges seemed to like what I had to say and awarded me first prize.
Despite the prize being a substantial cash sum, it wouldn’t surprise me if there had been only a few entries – but you can only beat what’s in front of you 🙂
Right, time to get this blog back on track. I want to talk about a useful technique that’s both highly practical and crops up in interview scenarios regularly.
Consider this problem: How can we efficiently randomly select items from a set of size , where is unknown? Each member of should have an equal probability of being selected.
At first glance, this problem looks a strange mix of trivial and impossible. If we don’t know , how can we know how to weight our selection probabilities? And, assuming is finite, how can we not know if we’re able – as we must be – to visit every element in ?
To address the second point first: we can find by iterating over all elements in . However, this adds a pass over the entire set that might be expensive. Consider selecting rows at random from a large database table. We don’t want to bring the whole table into memory just to count the number of rows. Linked lists are another good motivating example – say we would like to select elements from a linked list of length . Even if we do loop over the list to count its elements, employing the simple approach of choosing elements at random will still be slow because random access in linked lists is not a constant time operation. In general we will take on average time. By using the following technique, called ‘reservoir sampling’, we can bring this down to .
We can keep our selection of elements updated on-line as we scan through our data structure, and we can do it very simply.