Problem

Table: Activity

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| player_id    | int     |
| device_id    | int     |
| event_date   | date    |
| games_played | int     |
+--------------+---------+

(player_id, event_date) is the primary key (combination of columns with unique values) of this table. This table shows the activity of players of some games. Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on someday using some device.

Write a solution to report the fraction of players that logged in again on the day after the day they first logged in, rounded to 2 decimal places. In other words, you need to count the number of players that logged in for at least two consecutive days starting from their first login date, then divide that number by the total number of players.

The result format is in the following example.

Examples

Example 1:

Input: Activity table:

+-----------+-----------+------------+--------------+
| player_id | device_id | event_date | games_played |
+-----------+-----------+------------+--------------+
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-03-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |
+-----------+-----------+------------+--------------+

Output:

+-----------+
| fraction  |
+-----------+
| 0.33      |
+-----------+

Explanation: Only the player with id 1 logged back in after the first day he had logged in so the answer is 1/3 = 0.33

Solution

Method 1 - Using common table expressions

To solve this problem in SQL, we need to piece together a few steps: find the first login date for each player, determine if they logged in the next day, and calculate the fraction of such players over the total number of players. Here’s how you can do it: Here is the approach:

  1. FirstLogins CTE:
    • This Common Table Expression (CTE) retrieves the first login date for each player using MIN(event_date) and groups the results by player_id.
  2. LoggedInNextDay CTE:
    • This CTE selects the player_id of players who have a login record exactly one day after their first login date. It joins the FirstLogins CTE with the Activity table and checks if event_date is one day after the first_login_date.
  3. Main Query:
    • The main query counts the unique player_ids in the LoggedInNextDay CTE to determine the number of players who logged in on consecutive days.
    • It also counts the unique player_ids in the FirstLogins CTE to get the total number of players.
    • The final selection rounds the fraction to two decimal places using ROUND.

Code

SQL
 WITH FirstLogins AS (
     SELECT 
         player_id, 
         MIN(event_date) AS first_login_date
     FROM 
         Activity
     GROUP BY 
         player_id
 ),
 LoggedInNextDay AS (
     SELECT DISTINCT 
         fl.player_id
     FROM 
         FirstLogins fl
     JOIN 
         Activity a ON fl.player_id = a.player_id 
                    AND a.event_date = DATE_ADD(fl.first_login_date, INTERVAL 1 DAY)
 )
 SELECT 
     ROUND(COUNT(DISTINCT ln.player_id) / COUNT(DISTINCT fl.player_id), 2) AS fraction
 FROM 
     FirstLogins fl
 LEFT JOIN 
     LoggedInNextDay ln ON fl.player_id = ln.player_id;

Complexity

  • ⏰ Time complexity: O(n log n)
    • FirstLogins CTE: Calculating the minimum date for each player involves a full scan and aggregation on the event_date, leading to O(n log n), where n is the number of records in the Activity table.
    • LoggedInNextDay CTE: This involves another scan and join operation, resulting in O(n log n) due to the date addition and comparison for each record.
    • Main Query: Aggregating and joining large datasets typically has a combined complexity of O(n log n) with efficient indexing.
  • 🧺 Space complexity: O(n)
    • Intermediate Tables and Results: The CTEs and final result set might need additional space proportional to the number of unique player_ids, so O(n) when considering intermediate results.