Problem
You are given a list of (website, user) pairs that represent users visiting websites. Come up with a program that identifies the top k pairs of websites with the greatest similarity.
For example, suppose k = 1, and the list of visit tuples shows that websites a and e share almost all of their users, while every other pair of websites overlaps far less.
Then a reasonable similarity metric would most likely conclude that a and e are the most similar, so your program should return [('a', 'e')].
Solution
Method 1 - Jaccard Similarity with Hash Sets
Intuition
Website similarity can be measured using Jaccard similarity, which calculates the ratio of shared users to total unique users between two websites. We need to group users by website, then compute pairwise similarities and find the top k pairs.
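Concretely, for two websites with user sets A and B, the Jaccard similarity is J(A, B) = |A ∩ B| / |A ∪ B|; it equals 1 when the two sites have identical audiences and 0 when they share no users at all.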
Approach
- Group users by website using a hash map where the key is the website and the value is the set of users who visited it
- Generate all possible pairs of websites
- For each pair, calculate the Jaccard similarity: |intersection| / |union|
- Sort all pairs by similarity score in descending order
- Return the top k pairs
- Handle edge cases: empty input, and k larger than the number of available pairs
Code
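A minimal Python sketch of this approach could look like the following; the function name top_k_similar_pairs and the use of itertools.combinations are illustrative choices, not a required interface.

```python
from collections import defaultdict
from itertools import combinations


def top_k_similar_pairs(visits, k):
    # Group users by website: site -> set of users who visited it.
    users_by_site = defaultdict(set)
    for site, user in visits:
        users_by_site[site].add(user)

    # Score every pair of websites with Jaccard similarity = |A ∩ B| / |A ∪ B|.
    scored = []
    for site_a, site_b in combinations(sorted(users_by_site), 2):
        users_a, users_b = users_by_site[site_a], users_by_site[site_b]
        similarity = len(users_a & users_b) / len(users_a | users_b)
        scored.append((similarity, (site_a, site_b)))

    # Sort by similarity in descending order and keep the top k pairs.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:k]]
```

Empty input and k larger than the number of available pairs are both handled naturally: the pair loop simply produces nothing, and the final slice returns fewer than k results. On the example from the problem statement, top_k_similar_pairs(visits, 1) would return [('a', 'e')].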
Complexity
- ⏰ Time complexity:
O(V + W^2 * U)
, where V is the number of visits, W is the number of websites, and U is the average number of users per website. We process all visits once, then compare all pairs of websites - 🧺 Space complexity:
O(V + W^2)
, for storing the user sets and similarity results
Method 2 - Optimized with Early Termination
Intuition
We can optimize by using a heap to maintain only the top k similarities instead of computing and storing all pairs. This is especially useful when k is much smaller than the total number of website pairs.
Approach
- Group users by website as before
- Use a min-heap of size k to track the top k similarities seen so far
- For each pair of websites, calculate the Jaccard similarity
- If the heap holds fewer than k entries, push the pair onto the heap
- Otherwise, if the similarity exceeds the minimum in the heap, replace that minimum
- Return all pairs remaining in the heap
Code
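A sketch of the heap-based variant, assuming Python's heapq for the bounded min-heap; again, the function name is an illustrative choice.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def top_k_similar_pairs_heap(visits, k):
    if k <= 0:
        return []

    # Group users by website, as in Method 1.
    users_by_site = defaultdict(set)
    for site, user in visits:
        users_by_site[site].add(user)

    # Min-heap of (similarity, pair); the root is the weakest of the current
    # top-k candidates, so it is the one to evict when a better pair appears.
    heap = []
    for site_a, site_b in combinations(sorted(users_by_site), 2):
        users_a, users_b = users_by_site[site_a], users_by_site[site_b]
        similarity = len(users_a & users_b) / len(users_a | users_b)
        if len(heap) < k:
            heapq.heappush(heap, (similarity, (site_a, site_b)))
        elif similarity > heap[0][0]:
            heapq.heapreplace(heap, (similarity, (site_a, site_b)))

    # Report the surviving pairs from most to least similar.
    return [pair for _, pair in sorted(heap, reverse=True)]
```

heapreplace pops the current minimum and pushes the new entry in a single O(log k) operation, so the heap never grows past k entries.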
Complexity
- ⏰ Time complexity: O(V + W^2 * U + W^2 * log k), where the additional log k factor comes from heap operations, but this is better than Method 1's full sort when k ≪ W^2.
- 🧺 Space complexity: O(V + k), significantly better space usage when k is small compared to the total number of pairs.