Problem

Table: Files

+-------------+---------+
| Column Name | Type    |
+-- ----------+---------+
| file_name   | varchar |
| content     | text    |
+-------------+---------+
file_name is the column with unique values of this table.
Each row contains file_name and the content of that file.

Write a solution to find the number of files that have at least one occurrence of the words ’ bull’ and ’ bear’ as a standalone word , respectively, disregarding any instances where it appears without space on either side (e.g. ‘bullet’, ‘bears’, ‘bull.’, or ‘bear’ at the beginning or end of a sentence will not be considered)

Return the word ‘bull’ and ‘bear’ along with the corresponding number of occurrences in any order.

The result format is in the following example.

Examples

Example 1:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Input:  
Files table:
+------------+----------------------------------------------------------------------------------+
| file_name  | content                                                                         | 
+------------+----------------------------------------------------------------------------------+
| draft1.txt | The stock exchange predicts a bull market which would make many investors happy. | 
| draft2.txt | The stock exchange predicts a bull market which would make many investors happy, |
|            | but analysts warn of possibility of too much optimism and that in fact we are    |
|            | awaiting a bear market.                                                          | 
| draft3.txt | The stock exchange predicts a bull market which would make many investors happy, |
|            | but analysts warn of possibility of too much optimism and that in fact we are    |
|            | awaiting a bear market. As always predicting the future market is an uncertain   |
|            | game and all investors should follow their instincts and best practices.         | 
+------------+----------------------------------------------------------------------------------+
Output:  
+------+-------+
| word | count |  
+------+-------+
| bull | 3     | 
| bear | 2     | 
+------+-------+
Explanation:  
- The word "bull" appears 1 time in "draft1.txt", 1 time in "draft2.txt", and 1 time in "draft3.txt". Therefore, the total number of occurrences for the word "bull" is 3.
- The word "bear" appears 1 time in "draft2.txt", and 1 time in "draft3.txt". Therefore, the total number of occurrences for the word "bear" is 2.

Solution

Method 1 – SQL Regex and String Matching

Intuition

We need to count files where the word ‘bull’ or ‘bear’ appears as a standalone word (i.e., surrounded by spaces or at the start/end of the content). We use SQL regular expressions or string functions to match these patterns.

Approach

  1. For ‘bull’, check if the content matches the pattern for a standalone ‘bull’ (using word boundaries or manual checks for start/end/space).
  2. For ‘bear’, do the same.
  3. Count the number of files for each word.
  4. Return the result as required.

Code

1
2
3
4
5
6
7
SELECT 'bull' AS word, COUNT(*) AS count
FROM Files
WHERE content REGEXP '(^|[^a-zA-Z])bull([^a-zA-Z]|$)'
UNION ALL
SELECT 'bear' AS word, COUNT(*) AS count
FROM Files
WHERE content REGEXP '(^|[^a-zA-Z])bear([^a-zA-Z]|$)';
1
2
3
4
5
6
7
SELECT 'bull' AS word, COUNT(*) AS count
FROM Files
WHERE content ~* '\m(bull)\M'
UNION ALL
SELECT 'bear' AS word, COUNT(*) AS count
FROM Files
WHERE content ~* '\m(bear)\M';
1
2
3
4
5
6
7
8
9
import pandas as pd
import re

def count_occurrences(files: pd.DataFrame) -> pd.DataFrame:
    def has_word(text, word):
        return bool(re.search(rf'(?<![a-zA-Z]){word}(?![a-zA-Z])', text))
    bull_count = files['content'].apply(lambda x: has_word(x, 'bull')).sum()
    bear_count = files['content'].apply(lambda x: has_word(x, 'bear')).sum()
    return pd.DataFrame({'word': ['bull', 'bear'], 'count': [bull_count, bear_count]})

Complexity

  • ⏰ Time complexity: O(n*m), where n is the number of files and m is the average content length, since we scan each file’s content for the pattern.
  • 🧺 Space complexity: O(1) (SQL) or O(n) (pandas), for storing intermediate results.