Word Frequency From File

Problem

Write a bash script to calculate the frequency of each word in a text file words.txt.

For simplicity sake, you may assume:

  • words.txt contains only lowercase characters and space ' ' characters.
  • Each word must consist of lowercase characters only.
  • Words are separated by one or more whitespace characters.

Examples

Example:

Assume that words.txt has the following content:

the day is sunny the the
the sunny is is

Your script should output the following, sorted by descending frequency:

the 4
is 3
sunny 2
day 1

Note:

  • Don’t worry about handling ties, it is guaranteed that each word’s frequency count is unique.
  • Could you write it in one-line using Unix pipes?

Solution

Method 1 - Using Unix Pipes

Code

cat words.txt | tr -s ' ' '\n' | sort | uniq --count | sort -r | awk '{print $2 " " $1}'

Dry Run

Lets see the command 1 by 1.

cat words.txt

Outputs the content in the file in the standard output

  ~ cat words.txt
the day is sunny the the
the sunny is is
tr -s ' ' '\n'

tr -s uses for truncating the input as per given command followed by it. In our case, we are interested in truncating each whitespace( ’ ‘) and replace it with newline(’\n’) as shown below:

  ~ cat words.txt | tr -s ' ' '\n'
the
day
is
sunny
the
the
the
sunny
is
is
sort

This sort the input in ascending order so that uniq can find duplicate words adjacently (order does not matter for uniq) as shown below:

  ~ cat words.txt | tr -s ' ' '\n' | sort
day
is
is
is
sunny
sunny
the
the
the
the
uniq --count

This command provides word frequency as “count word” format. Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output). Note: ‘uniq’ does not detect repeated lines unless they are adjacent.

  ~ cat words.txt | tr -s ' ' '\n' | sort | uniq --count
      1 day
      3 is
      2 sunny
      4 the
sort -r

sort -r sorts the input in descending order.

  ~ cat words.txt | tr -s ' ' '\n' | sort | uniq --count | sort -r
      4 the
      3 is
      2 sunny
      1 day
awk '{print 2""2 " "2""1}

awk formats the input given for each line. In our example, we want the second column (2) appears first and the first column appears first and the first column appears second separated by whitespace(" “)

  ~ cat words.txt | tr -s ' ' '\n' | sort | uniq --count | sort -r | awk '{print $2 " " $1}'
the 4
is 3
sunny 2
day 1