Problem
Given a stream of elements too large to store in memory, pick a random element from the stream with uniform probability.
Solution
This is a classic problem known as Reservoir Sampling. The goal is to randomly select an element from a large stream of data with uniform probability, even when the entire stream cannot be stored in memory.
Introduction to Problem
Reservoir sampling, often referred to as Algorithm R as described by Jeffrey Vitter in Random Sampling with a Reservoir, is a widely used technique in data processing. It allows for the random selection of k samples from a set S containing n items, where n is very large or unknown. Each of the chosen k items forms a “reservoir,” ensuring that every item is selected with an equal probability of 1/n.
Reservoir sampling encompasses a variety of randomized algorithms.
Formal Description
Given a data stream of unknown size n
, randomly select k
items (“reservoir”) with equal probability. Start by saving the first k
items in an array of size k
. For each subsequent item j
(where j > k
), generate a random integer M between 1 and j
. If M
is smaller than k
, replace the item at index M
in the reservoir with the item at index j
.
In simpler terms:
- Choose
k
entries fromn
numbers. Make sure each number is selected with the probability ofk/n
Example
Note: For a better understanding, there is an interesting post here that uses an example of how girls fairly choose a date from a group of bachelors.
Bad Solution
A straightforward method is to make an array called reservoir[]
of maximum size k
. Randomly pick items from the stream[0..n-1]
one at a time. If an item has not been selected before, add it to reservoir[]
. Checking if an item has been selected involves searching in reservoir[]
. The time complexity is O(k^2)
, which can be inefficient if k
is large. This method is also not suitable for streaming input.
Algorithm
Choose k
entries from n
numbers. Make sure each number is selected with the probability of k/n
. Here are the steps we can follow:
- Initialize:
- Create an array
reservoir
of sizek
and fill it with the firstk
items from the input stream.
- Create an array
- Process Remaining Items:
- For each subsequent item (from
k+1
ton
):- Pick a random index from 0 to the current index.
- If the picked index is within the range of the reservoir (i.e., 0 to
k-1
), replace the item at that index in the reservoir with the current item.
- For each subsequent item (from
Code
Java
import java.util.Random;
import java.util.ArrayList;
public class ReservoirSampling {
public static int[] reservoirSampling(int[] stream, int k) {
Random random = new Random();
int[] reservoir = new int[k];
// Initialize the reservoir with the first k elements
for (int i = 0; i < k; i++) {
reservoir[i] = stream[i];
}
// Process the rest of the stream
for (int i = k; i < stream.length; i++) {
// Pick a random index from 0 to i
int j = random.nextInt(i + 1);
// Replace element in reservoir with the current element
if (j < k) {
reservoir[j] = stream[i];
}
}
return reservoir;
}
public static void main(String[] args) {
int[] stream = new int[100];
for (int i = 0; i < 100; i++) {
stream[i] = i + 1;
}
int k = 10; // Number of entries to select
int[] sampledEntries = reservoirSampling(stream, k);
System.out.print("Sampled Entries: ");
for (int entry: sampledEntries) {
System.out.print(entry + " ");
}
System.out.println();
}
}
Python
import random
def reservoir_sampling(stream, k):
# Initialize reservoir with the first k elements
reservoir = []
for i in range(k):
reservoir.append(stream[i])
# Process the rest of the stream
for i in range(k, len(stream)):
# Pick a random index from 0 to i
j = random.randint(0, i)
# Replace element in reservoir with the current element
# with probability k/i
if j < k:
reservoir[j] = stream[i]
return reservoir
# Example usage
stream = list(range(1, 101)) # A stream of numbers from 1 to 100
k = 10 # Number of entries to select
sampled_entries = reservoir_sampling(stream, k)
print("Sampled Entries:", sampled_entries)
Implement Solution
So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function to randomly select k numbers where 1 <= k <= n. Let the input array be stream[].
Method 1- Bad Solution
A simple solution is to create an array reservoir[] of maximum size k. One by one randomly select an item from stream[0..n-1]. If the selected item is not previously selected, then put it in reservoir[]. To check if an item is previously selected or not, we need to search the item in reservoir[]. The time complexity of this algorithm will be O(k^2). This can be costly if k is big. Also, this is not efficient if the input is in the form of a stream.
Method 2 - Using Reservoir Sampling
It can be solved in O(n) time. The solution also suits well for input in the form of stream. The idea is similar to this post. Following are the steps.
1) Create an array reservoir[0..k-1] and copy first k items of stream[] to it. 2) Now one by one consider all items from _(k+1)_th item to _n_th item. …a) Generate a random number from 0 to i where i is index of current item in stream[]. Let the generated random number is j. …b) If j is in range 0 to k-1, replace reservoir[j] with arr[i]
Code
static int[] selectKItems(int stream[], int k) {
int n = stream.length;
int i; // index for elements in stream[]
// reservoir[] is the output array. Initialize it
// with first k elements from stream[]
int reservoir[] = new int[k];
for (i = 0; i<k; i++)
reservoir[i] = stream[i];
Random r = new Random();
// Iterate from the (k+1)th element to nth element
for (; i<n; i++) {
// Pick a random index from 0 to i.
int j = r.nextInt(i + 1);
// If the randomly picked index is smaller than
// k, then replace the element present at the
// index with new element from stream
if (j<k) {
reservoir[j] = stream[i];
}
}
return reservoir;
}
Time Complexity: O(n)
Problems
Fisher-Yates Shuffle vs Reservoir Sampling Linked List Random Node Random Pick Index Random Flip Matrix Random Point in Non-overlapping Rectangles