Problem

Given a stream of elements too large to store in memory, pick a random element from the stream with uniform probability.

Solution

This is a classic problem known as Reservoir Sampling. The goal is to randomly select an element from a large stream of data with uniform probability, even when the entire stream cannot be stored in memory.

Introduction to Problem

Reservoir sampling, often referred to as Algorithm R as described by Jeffrey Vitter in Random Sampling with a Reservoir, is a widely used technique in data processing. It allows for the random selection of k samples from a set S containing n items, where n is very large or unknown. Each of the chosen k items forms a “reservoir,” ensuring that every item is selected with an equal probability of 1/n.

Reservoir sampling encompasses a variety of randomized algorithms.

Formal Description

Given a data stream of unknown size n, randomly select k items (“reservoir”) with equal probability. Start by saving the first k items in an array of size k. For each subsequent item j (where j > k), generate a random integer M between 1 and j. If M is smaller than k, replace the item at index M in the reservoir with the item at index j.

In simpler terms:

  • Choose k entries from n numbers. Make sure each number is selected with the probability of k/n

Example

Note: For a better understanding, there is an interesting post here that uses an example of how girls fairly choose a date from a group of bachelors.

Bad Solution

A straightforward method is to make an array called reservoir[] of maximum size k. Randomly pick items from the stream[0..n-1] one at a time. If an item has not been selected before, add it to reservoir[]. Checking if an item has been selected involves searching in reservoir[]. The time complexity is O(k^2), which can be inefficient if k is large. This method is also not suitable for streaming input.

Algorithm

Choose k entries from n numbers. Make sure each number is selected with the probability of k/n. Here are the steps we can follow:

  1. Initialize:
    • Create an array reservoir of size k and fill it with the first k items from the input stream.
  2. Process Remaining Items:
    • For each subsequent item (from k+1 to n):
      • Pick a random index from 0 to the current index.
      • If the picked index is within the range of the reservoir (i.e., 0 to k-1), replace the item at that index in the reservoir with the current item.

Code

Java

import java.util.Random;
import java.util.ArrayList;


public class ReservoirSampling {

	public static int[] reservoirSampling(int[] stream, int k) {
		Random random = new Random();
		int[] reservoir = new int[k];

		// Initialize the reservoir with the first k elements
		for (int i = 0; i < k; i++) {
			reservoir[i] = stream[i];
		}

		// Process the rest of the stream
		for (int i = k; i < stream.length; i++) {
			// Pick a random index from 0 to i
			int j = random.nextInt(i + 1);

			// Replace element in reservoir with the current element
			if (j < k) {
				reservoir[j] = stream[i];
			}
		}

		return reservoir;
	}

	public static void main(String[] args) {
		int[] stream = new int[100];

		for (int i = 0; i < 100; i++) {
			stream[i] = i + 1;
		}

		int k = 10; // Number of entries to select
		int[] sampledEntries = reservoirSampling(stream, k);

		System.out.print("Sampled Entries: ");

		for (int entry: sampledEntries) {
			System.out.print(entry + " ");
		}

		System.out.println();
	}
}

Python

import random


def reservoir_sampling(stream, k):
    # Initialize reservoir with the first k elements
    reservoir = []
    for i in range(k):
        reservoir.append(stream[i])

    # Process the rest of the stream
    for i in range(k, len(stream)):
        # Pick a random index from 0 to i
        j = random.randint(0, i)

        # Replace element in reservoir with the current element
        # with probability k/i
        if j < k:
            reservoir[j] = stream[i]

    return reservoir


# Example usage
stream = list(range(1, 101))  # A stream of numbers from 1 to 100
k = 10  # Number of entries to select

sampled_entries = reservoir_sampling(stream, k)
print("Sampled Entries:", sampled_entries)

Implement Solution

So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function to randomly select k numbers where 1 <= k <= n. Let the input array be stream[].

Method 1- Bad Solution

A simple solution is to create an array reservoir[] of maximum size k. One by one randomly select an item from stream[0..n-1]. If the selected item is not previously selected, then put it in reservoir[]. To check if an item is previously selected or not, we need to search the item in reservoir[]. The time complexity of this algorithm will be O(k^2). This can be costly if k is big. Also, this is not efficient if the input is in the form of a stream.

Method 2 - Using Reservoir Sampling

It can be solved in O(n) time. The solution also suits well for input in the form of stream. The idea is similar to this post. Following are the steps.

1) Create an array reservoir[0..k-1] and copy first k items of stream[] to it. 2) Now one by one consider all items from _(k+1)_th item to _n_th item. …a) Generate a random number from 0 to i where i is index of current item in stream[]. Let the generated random number is j. …b) If j is in range 0 to k-1, replace reservoir[j] with arr[i]

Code

static int[] selectKItems(int stream[], int k) {
	int n = stream.length;
	int i; // index for elements in stream[]

	// reservoir[] is the output array. Initialize it
	// with first k elements from stream[]
	int reservoir[] = new int[k];
	for (i = 0; i<k; i++)
		reservoir[i] = stream[i];

	Random r = new Random();

	// Iterate from the (k+1)th element to nth element
	for (; i<n; i++) {
		// Pick a random index from 0 to i.
		int j = r.nextInt(i + 1);

		// If the randomly  picked index is smaller than
		// k, then replace the element present at the
		// index with new element from stream
		if (j<k) {
			reservoir[j] = stream[i];
		}
	}

	return reservoir;
}

Time Complexity: O(n)

Problems

Fisher-Yates Shuffle vs Reservoir Sampling Linked List Random Node Problem Random Pick Index Problem Random Flip Matrix Problem Random Point in Non-overlapping Rectangles Problem