Report Contiguous Dates
Problem
Table: Failed
+--------------+---------+
| Column Name | Type |
+--------------+---------+
| fail_date | date |
+--------------+---------+
fail_date is the primary key (column with unique values) for this table.
This table contains the days of failed tasks.
Table: Succeeded
+--------------+---------+
| Column Name | Type |
+--------------+---------+
| success_date | date |
+--------------+---------+
success_date is the primary key (column with unique values) for this table.
This table contains the days of succeeded tasks.
A system is running one task every day. Every task is independent of the previous tasks. The tasks can fail or succeed.
Write a solution to report the period_state for each continuous interval of days in the period from 2019-01-01 to 2019-12-31.
period_state is 'failed'__ if tasks in this interval failed or
'succeeded' if tasks in this interval succeeded. Interval of days are retrieved as start_date and end_date.
Return the result table ordered by start_date.
The result format is in the following example.
Examples
Example 1:
Input:
Failed table:
+-------------------+
| fail_date |
+-------------------+
| 2018-12-28 |
| 2018-12-29 |
| 2019-01-04 |
| 2019-01-05 |
+-------------------+
Succeeded table:
+-------------------+
| success_date |
+-------------------+
| 2018-12-30 |
| 2018-12-31 |
| 2019-01-01 |
| 2019-01-02 |
| 2019-01-03 |
| 2019-01-06 |
+-------------------+
Output:
+--------------+--------------+--------------+
| period_state | start_date | end_date |
+--------------+--------------+--------------+
| succeeded | 2019-01-01 | 2019-01-03 |
| failed | 2019-01-04 | 2019-01-05 |
| succeeded | 2019-01-06 | 2019-01-06 |
+--------------+--------------+--------------+
Explanation:
The report ignored the system state in 2018 as we care about the system in the period 2019-01-01 to 2019-12-31.
From 2019-01-01 to 2019-01-03 all tasks succeeded and the system state was "succeeded".
From 2019-01-04 to 2019-01-05 all tasks failed and the system state was "failed".
From 2019-01-06 to 2019-01-06 all tasks succeeded and the system state was "succeeded".
Solution
Method 1 - Window Functions and Grouping (SQL & Pandas)
Intuition
We need to report contiguous intervals of days where the system was in the same state (failed or succeeded) for each day in 2019. This is a classic "islands and gaps" problem, which can be solved by assigning a group number to each contiguous block of the same state.
Approach
- Generate all dates from 2019-01-01 to 2019-12-31.
- For each date, determine if it is in the Failed or Succeeded table (it must be in one or the other).
- Assign a state ('failed' or 'succeeded') to each date.
- Use a window function or difference of row_number and date to group contiguous blocks of the same state.
- For each group, report the state, start_date, and end_date.
Code
MySQL
WITH RECURSIVE dates AS (
SELECT DATE('2019-01-01') AS d
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates WHERE d < '2019-12-31'
),
status AS (
SELECT d AS day,
CASE WHEN f.fail_date IS NOT NULL THEN 'failed' ELSE 'succeeded' END AS period_state
FROM dates d
LEFT JOIN Failed f ON d.d = f.fail_date
)
SELECT period_state, MIN(day) AS start_date, MAX(day) AS end_date
FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY day) - ROW_NUMBER() OVER (PARTITION BY period_state ORDER BY day) AS grp
FROM status
) t
GROUP BY period_state, grp
ORDER BY start_date;
PostgreSQL
WITH RECURSIVE dates AS (
SELECT DATE '2019-01-01' AS d
UNION ALL
SELECT d + INTERVAL '1 day' FROM dates WHERE d < DATE '2019-12-31'
),
status AS (
SELECT d AS day,
CASE WHEN f.fail_date IS NOT NULL THEN 'failed' ELSE 'succeeded' END AS period_state
FROM dates d
LEFT JOIN Failed f ON d.d = f.fail_date
)
SELECT period_state, MIN(day) AS start_date, MAX(day) AS end_date
FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY day) - ROW_NUMBER() OVER (PARTITION BY period_state ORDER BY day) AS grp
FROM status
) t
GROUP BY period_state, grp
ORDER BY start_date;
Python (pandas)
import pandas as pd
from datetime import datetime
# failed and succeeded are DataFrames with 'fail_date' and 'success_date' columns
all_dates = pd.date_range('2019-01-01', '2019-12-31')
df = pd.DataFrame({'date': all_dates})
df['period_state'] = 'succeeded'
df.loc[df['date'].isin(failed['fail_date']), 'period_state'] = 'failed'
# Find contiguous blocks
df['grp'] = (df['period_state'] != df['period_state'].shift()).cumsum()
result = df.groupby(['period_state', 'grp']).agg(start_date=('date', 'first'), end_date=('date', 'last')).reset_index(drop=True)
result = result[['period_state', 'start_date', 'end_date']]
Complexity
- ⏰ Time complexity: O(D), where D is the number of days in 2019 (365).
- 🧺 Space complexity: O(D), for storing the date and state for each day.