Problem

Given a document (or stream) of words. Find the top k most frequent words in the document (or stream).

OR Given an array of strings words and an integer k, return the k most frequent strings.

Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.

Examples

Example 1:

Input:
words = ["i","love","leetcode","i","love","coding"], k = 2
Output:
 ["i","love"]
Explanation: "i" and "love" are the two most frequent words.
Note that "i" comes before "love" due to a lower alphabetical order.

Example 2:

Input:
words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 4
Output:
 ["the","is","sunny","day"]
Explanation: "the", "is", "sunny" and "day" are the four most frequent words, with the number of occurrence being 4, 3, 2 and 1 respectively.

Solution

Method 1 - Basic Solution Using Hashtable

Here is the basic solution:

  1. One quick solution would be to create a pair object with the word and its frequency. This takes O(n) time.
  2. then sort the pair array with respect to the frequency of the pair. This takes O(nlgn).
  3. Now, take the first k pairs from the sorted array of pairs. This takes O(k) time.

Complexity

the total time is O(n+nlg(n)+K) = O(n lgn).

Method 2 - Using MaxHeap (storing All the elements)

Here is the solution:

  1. build a heap of (word, word-frequency) pair with “word-frequency” as key. It takes O(n) time to build a heap;
  2. extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

Complexity

  • Time Complexity - O(n+k*lg(n)
  • Space Complexity - O(n)

Method 3 - Using MinHeap with K Elements- O(nlgk) Solution with O(n) Space

But we can improve this solution. Note that we are only concern about the top k elements. Sorting the array means we are sorting all the n elements which is unnecessary as we are only concerned for first k. Any idea popping in? Yeah, I am sure you have the same feeling that we could use a Min Heap of size k to keep top k most frequent words. That’s right. We also need to use a hashmap to keep frequency of each word.

  1. Calculated frequency of all the words in a hashmap from the word to its frequency.
  2. Start adding pair object of word and its frequency into a min heap where we use the frequency as the key for the min heap.
  3. If the heap is full then remove the minimum element (top) form the heap and add add the new word-frequency pair only if the frequency of this word has frequency greater than the top word in the heap.
  4. Once we scanned all the words in the map and the heap is properly updated then the elements contained in the min heap are the top k most frequents.

This code is similar to Kth Largest Element in an Array.

Code

Java

Below is a simple implementation of the above idea using Java Priority Queue or creating our own heap.

Minheap with WordFreq implementing Comparable
public String[] topKWords(String[] words, int k) {
	final class WordFreq implements Comparable<WordFreq> {
		String word;
		int freq;

		public WordFreq(final String w, final int c) {
			word = w;
			freq = c;
		}

		@Override
		public int compareTo(final WordFreq other) {
			return Integer.compare(this.freq, other.freq);
		}
	}

	final Map<String, Integer> frequencyMap = new HashMap<String, Integer>();
	for (final String word: words) {
		frequencyMap.put(word, frequencyMap.getOrDefault(word, 0) + 1);
	}

	// Build the topK heap
	final PriorityQueue<WordFreq> minHeap = new PriorityQueue<WordFreq> (k);	
	for (final java.util.Map.Entry<String, Integer> entry: frequencyMap.entrySet()) {
		if (minHeap.size()<k) {
			minHeap.add(new WordFreq(entry.getKey(), entry.getValue()));
		} else if (entry.getValue() > topKHeap.peek().freq) {
			minHeap.remove();
			minHeap.add(new WordFreq(entry.getKey(), entry.getValue()));
		}
	}

	LinkedList<String> ans = new LinkedList<>(); // nOt using arraylist as it doesnt provide addFirst, otherwise we have to do collection reverse

	while (!minHeap.isEmpty()) {
		ans.addFirst(minHeap.poll().word);
	}

	return ans;
}
Minheap with WordFreq adding Comparator in PriorityQueue
public List<String> topKFrequent(String[] words, int k) {
	Map<String, Integer> map = new HashMap<>();
	for (String w : words) {
		map.put(w, map.getOrDefault(w, 0) + 1);
	}
	Queue<Pair> minHeap = new PriorityQueue<>((a, b) -> (a.freq == b.freq) ? b.word.compareTo(a.word) : Integer.compare(a.freq, b.freq));

	for (String w : map.keySet()) {
		int f = map.get(w);
		minHeap.offer(new Pair(w, f));
		if (minHeap.size() > k) {
			minHeap.poll();
		}
	}
	LinkedList<String> result = new LinkedList<>(); // nOt using arraylist as it doesnt provide addFirst, otherwise we have to do collection reverse

	while (!minHeap.isEmpty()) {
		result.addFirst(minHeap.poll().word);
	}
	return result;
}

static class WordFreq {
	String word;
	int freq;

	Pair(String word, int freq) {
		this.word = word;
		this.freq = freq;
	}
}

Complexity

  • Time Complexity - O(nlgk) time
  • Space Complexity - O(n) space.

Method 4 - Use Bucketsort

We can apply bucketsort on the frequency.