Problem

Table: Actions

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| user_id       | int     |
| post_id       | int     |
| action_date   | date    |
| action        | enum    |
| extra         | varchar |
+---------------+---------+
This table may have duplicate rows.
The action column is an ENUM (category) type of ('view', 'like', 'reaction', 'comment', 'report', 'share').
The extra column has optional information about the action, such as a reason for the report or a type of reaction.

Table: Removals

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| post_id       | int     |
| remove_date   | date    |
+---------------+---------+
post_id is the primary key (column with unique values) of this table.
Each row in this table indicates that some post was removed due to being reported or as a result of an admin review.

Write a solution to find the average daily percentage of posts that got removed after being reported as spam, rounded to 2 decimal places.

The result format is in the following example.

Examples

Example 1:

Input: 
Actions table:
+---------+---------+-------------+--------+--------+
| user_id | post_id | action_date | action | extra  |
+---------+---------+-------------+--------+--------+
| 1       | 1       | 2019-07-01  | view   | null   |
| 1       | 1       | 2019-07-01  | like   | null   |
| 1       | 1       | 2019-07-01  | share  | null   |
| 2       | 2       | 2019-07-04  | view   | null   |
| 2       | 2       | 2019-07-04  | report | spam   |
| 3       | 4       | 2019-07-04  | view   | null   |
| 3       | 4       | 2019-07-04  | report | spam   |
| 4       | 3       | 2019-07-02  | view   | null   |
| 4       | 3       | 2019-07-02  | report | spam   |
| 5       | 2       | 2019-07-03  | view   | null   |
| 5       | 2       | 2019-07-03  | report | racism |
| 5       | 5       | 2019-07-03  | view   | null   |
| 5       | 5       | 2019-07-03  | report | racism |
+---------+---------+-------------+--------+--------+
Removals table:
+---------+-------------+
| post_id | remove_date |
+---------+-------------+
| 2       | 2019-07-20  |
| 3       | 2019-07-18  |
+---------+-------------+
Output: 
+-----------------------+
| average_daily_percent |
+-----------------------+
| 75.00                 |
+-----------------------+
Explanation: 
The percentage for 2019-07-04 is 50% because only one of the two posts reported as spam was removed.
The percentage for 2019-07-02 is 100% because one post was reported as spam and it was removed.
The other days had no spam reports, so the average is (50 + 100) / 2 = 75%.
Note that the output is only one number and that we do not care about the remove dates.

Solution

Method 1 - Average Daily Percent of Spam-Reported Posts Removed (SQL & Pandas)

Intuition

We need to find, for each day, the percentage of posts reported as spam that were eventually removed, and then average these daily percentages. Only days with at least one spam report are considered. The removal date is irrelevant; we only care whether the post was ever removed.
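A short plain-Python check on the counts from Example 1 may help show why the daily percentages are averaged rather than pooled into one ratio. The per-day counts are copied by hand from the example, and the variable names are illustrative only.

```python
# Spam-reported posts per day in Example 1, as (reported, removed) counts.
daily_counts = {
    "2019-07-02": (1, 1),  # post 3 was reported as spam and removed
    "2019-07-04": (2, 1),  # posts 2 and 4 were reported, only post 2 was removed
}

# What the problem asks for: the mean of the per-day percentages.
daily_percents = [
    removed * 100 / reported for reported, removed in daily_counts.values()
]
average_daily_percent = round(sum(daily_percents) / len(daily_percents), 2)

# A tempting but different quantity: a single ratio pooled over all days.
total_reported = sum(reported for reported, _ in daily_counts.values())
total_removed = sum(removed for _, removed in daily_counts.values())
pooled_percent = round(total_removed * 100 / total_reported, 2)

print(average_daily_percent)  # 75.0
print(pooled_percent)         # 66.67
```

The two numbers differ because each day carries equal weight in the daily average, regardless of how many posts were reported on it.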

Approach

  1. For each day, find the set of unique posts reported as spam (action = 'report', extra = 'spam').
  2. For each day, count how many of those posts appear in the Removals table.
  3. For each day, compute the percentage: (removed / reported) * 100.
  4. Average these daily percentages and round to 2 decimal places.
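The four steps above can be sketched without any database or library, which makes the role of deduplication explicit. This is a minimal illustration, not the solution code itself: the function name, the tuple layout of the rows, and the choice to return None when there are no spam reports are all assumptions made for the sketch.

```python
from collections import defaultdict


def average_daily_percent(actions, removed_post_ids):
    """actions: iterable of (user_id, post_id, action_date, action, extra) tuples.
    removed_post_ids: set of post_id values present in Removals.
    """
    # Step 1: unique spam-reported posts per day (a set absorbs duplicate rows).
    reported_by_day = defaultdict(set)
    for _user_id, post_id, action_date, action, extra in actions:
        if action == "report" and extra == "spam":
            reported_by_day[action_date].add(post_id)

    # Steps 2 and 3: per-day percentage of reported posts that were ever removed.
    percents = [
        len(posts & removed_post_ids) * 100 / len(posts)
        for posts in reported_by_day.values()
    ]

    # Step 4: average over the days that had at least one spam report.
    # Returning None for "no such days" is an assumption of this sketch.
    return round(sum(percents) / len(percents), 2) if percents else None


# The report rows from Example 1 (view/like/share rows are filtered out anyway).
example_actions = [
    (2, 2, "2019-07-04", "report", "spam"),
    (3, 4, "2019-07-04", "report", "spam"),
    (4, 3, "2019-07-02", "report", "spam"),
    (5, 2, "2019-07-03", "report", "racism"),
    (5, 5, "2019-07-03", "report", "racism"),
]
print(average_daily_percent(example_actions, {2, 3}))  # 75.0
```

Because step 1 collects post ids into a set per day, feeding the same rows twice gives the same answer, which matches the note that Actions may contain duplicate rows.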

Code

-- Variant without casts; runs on engines with CTE support, such as MySQL 8.0+.
WITH spam_reports AS (
  SELECT action_date, post_id
  FROM Actions
  WHERE action = 'report' AND extra = 'spam'
  GROUP BY action_date, post_id
),
daily_stats AS (
  SELECT
    action_date,
    COUNT(*) AS reported,
    SUM(CASE WHEN r.post_id IS NOT NULL THEN 1 ELSE 0 END) AS removed
  FROM spam_reports s
  LEFT JOIN Removals r ON s.post_id = r.post_id
  GROUP BY action_date
)
SELECT ROUND(AVG(removed * 100.0 / reported), 2) AS average_daily_percent
FROM daily_stats;
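One way to sanity-check the query above against Example 1 is to run it on an in-memory SQLite database through Python's standard sqlite3 module. SQLite is a stand-in here, not the target engine: the column types in the CREATE TABLE statements are assumptions (SQLite has no ENUM or DATE type), and SQLite returns the rounded value as the float 75.0 rather than a string formatted as 75.00.

```python
import sqlite3

# The cast-free query from above, unchanged.
QUERY = """
WITH spam_reports AS (
  SELECT action_date, post_id
  FROM Actions
  WHERE action = 'report' AND extra = 'spam'
  GROUP BY action_date, post_id
),
daily_stats AS (
  SELECT
    action_date,
    COUNT(*) AS reported,
    SUM(CASE WHEN r.post_id IS NOT NULL THEN 1 ELSE 0 END) AS removed
  FROM spam_reports s
  LEFT JOIN Removals r ON s.post_id = r.post_id
  GROUP BY action_date
)
SELECT ROUND(AVG(removed * 100.0 / reported), 2) AS average_daily_percent
FROM daily_stats;
"""

conn = sqlite3.connect(":memory:")
# Assumed SQLite schema: TEXT stands in for the ENUM and DATE columns.
conn.executescript("""
CREATE TABLE Actions (user_id INT, post_id INT, action_date TEXT, action TEXT, extra TEXT);
CREATE TABLE Removals (post_id INT PRIMARY KEY, remove_date TEXT);
""")

# Rows copied from Example 1; None is stored as SQL NULL.
conn.executemany("INSERT INTO Actions VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "2019-07-01", "view", None),
    (1, 1, "2019-07-01", "like", None),
    (1, 1, "2019-07-01", "share", None),
    (2, 2, "2019-07-04", "view", None),
    (2, 2, "2019-07-04", "report", "spam"),
    (3, 4, "2019-07-04", "view", None),
    (3, 4, "2019-07-04", "report", "spam"),
    (4, 3, "2019-07-02", "view", None),
    (4, 3, "2019-07-02", "report", "spam"),
    (5, 2, "2019-07-03", "view", None),
    (5, 2, "2019-07-03", "report", "racism"),
    (5, 5, "2019-07-03", "view", None),
    (5, 5, "2019-07-03", "report", "racism"),
])
conn.executemany("INSERT INTO Removals VALUES (?, ?)", [
    (2, "2019-07-20"),
    (3, "2019-07-18"),
])

(result,) = conn.execute(QUERY).fetchone()
print(result)  # 75.0
conn.close()
```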
-- PostgreSQL variant: the ::numeric cast ensures ROUND(value, 2) receives a numeric argument.
WITH spam_reports AS (
  SELECT action_date, post_id
  FROM Actions
  WHERE action = 'report' AND extra = 'spam'
  GROUP BY action_date, post_id
),
daily_stats AS (
  SELECT
    action_date,
    COUNT(*) AS reported,
    SUM(CASE WHEN r.post_id IS NOT NULL THEN 1 ELSE 0 END) AS removed
  FROM spam_reports s
  LEFT JOIN Removals r ON s.post_id = r.post_id
  GROUP BY action_date
)
SELECT ROUND(AVG(removed * 100.0 / reported)::numeric, 2) AS average_daily_percent
FROM daily_stats;
import pandas as pd

# actions and removals are pandas DataFrames
spam = actions[(actions['action'] == 'report') & (actions['extra'] == 'spam')]
spam = spam.drop_duplicates(['action_date', 'post_id'])
merged = spam.merge(removals[['post_id']], on='post_id', how='left', indicator='removed')
merged['removed'] = (merged['removed'] == 'both').astype(int)
daily = merged.groupby('action_date').agg(reported=('post_id', 'count'), removed=('removed', 'sum')).reset_index()
daily['percent'] = daily['removed'] * 100 / daily['reported']
average_daily_percent = round(daily['percent'].mean(), 2)
# To output as a DataFrame:
result = pd.DataFrame({'average_daily_percent': [average_daily_percent]})

Complexity

  • ⏰ Time complexity: O(N + M) on average, where N is the number of rows in Actions and M is the number of rows in Removals, assuming hash-based grouping and joining. A sort-based plan adds a logarithmic factor.
  • 🧺 Space complexity: O(D + P), where D is the number of days with spam reports and P is the number of unique (action_date, post_id) pairs reported as spam. A hash join may additionally hold the M post_id values from Removals.