Reservoir Sampling is an algorithm for sampling elements from a stream of data. Imagine you are given a really large stream of data elements, for example:

  • Queries on DuckDuckGo searches in June
  • Products bought at Sainsbury’s during the Christmas season
  • Names in the white pages guide.

Let me put in these easy words imagine the following “dating” game show. The contestant, a bachelorette, is seated at a table with an empty chair. The host introduces the first suitor; the bachelorette has to invite him to sit with her and be her current “date”. Next, the host introduces the second suitor. Now the girl gets the choice of whether she will keep her current “date” or replace him with the new suitor. She can use a variety of means to make her decision, such as asking questions or making the two suitors compete in some way. After that, the host introduces the third suitor, and again the girl can choose to either keep or replace her current “date.” After showing n suitors this way, the game show ends, and the girl goes on a real date with the suitor she kept at the end, the “winner” of the show.

Imagine one contestant who simply flips a coin to decide whether or not to swap her current “date”. Is this “fair” to the suitors, i.e., is the probability distribution of the winner uniform over all the suitors? The answer is no because it is much more likely for the last few suitors to win than the first few suitors. The very first suitor is most unfortunate of all since if he wants to go on a date with the girl, he has to survive n-1 coin flips. The last suitor has the best chances — he only needs to win a single coin flip

If you like video animations and your good in making videos you can sell your videos using this link if you don’t like money forget it, my friend.

import random

def reservoir_sampling(iterator, k):
result = []
n = 0
for item in iterator:
n = n + 1
if len(result) < k:
print(result)
result.append(item)
else:
j = int(random.random() * n)
if j < k:
result[j] = item
print('else:', result)

return result


if __name__ == "__main__":
stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
k = 5
print(reservoir_sampling(stream, k))

Cheers,
Zaid Alissa Almaliki