Leetcode 196 - Delete Duplicate Emails

Problem

Table: Person

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| email       | varchar |
+-------------+---------+

id is the primary key column for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.

Write an SQL query to delete all the duplicate emails, keeping only one unique email with the smallest id. Note that you are supposed to write a DELETE statement and not a SELECT one.

After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.

Examples

Example 1:

Input: Person table:

+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |
+----+------------------+

Output:

+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
+----+------------------+

Explanation: john@example.com is repeated two times. We keep the row with the smallest id = 1.

Solution

Method 1 - Where Not in Min IDs

Code

SQL
DELETE FROM Person
WHERE id NOT IN (SELECT MIN(id) as id FROM Person GROUP BY email)

The statement above does not work in MySQL; it fails with the following error:

You can't specify target table 'Person' for update in FROM clause

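MySQL does not allow a DELETE or UPDATE to select from its own target table in a subquery. The workaround is to wrap the subquery in a derived table and give it an alias, so that MySQL evaluates it as a separate intermediate result: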

DELETE FROM Person 
WHERE id NOT IN (
    SELECT * FROM (
        SELECT MIN(id)
        FROM Person
        GROUP BY email) as minIds);

Surprisingly, this performed well. The inner query groups the rows by email and selects the smallest id in each group; the DELETE then removes every row whose id is not in that result. The extra SELECT * FROM (SELECT ...) wrapper is there only because MySQL cannot delete from a table that the same statement reads in a subquery, so the subquery has to be nested inside an aliased derived table.
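
To check the logic on Example 1, here is a minimal sketch that runs the same DELETE against an in-memory SQLite database through Python's sqlite3 module. SQLite is an assumption made for convenience: it does not raise the MySQL error, so this only confirms which rows survive, not the MySQL behaviour. The function name delete_duplicates_not_in_min is hypothetical.

```python
import sqlite3

def delete_duplicates_not_in_min(rows):
    """Run the Method 1 DELETE on an in-memory SQLite copy of Person."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO Person (id, email) VALUES (?, ?)", rows)
    # Same statement as in the text, including the derived-table wrapper
    # that MySQL needs (SQLite accepts it with or without the wrapper).
    conn.execute(
        """
        DELETE FROM Person
        WHERE id NOT IN (
            SELECT * FROM (
                SELECT MIN(id)
                FROM Person
                GROUP BY email) AS minIds)
        """
    )
    remaining = conn.execute(
        "SELECT id, email FROM Person ORDER BY id"
    ).fetchall()
    conn.close()
    return remaining

# Input table from Example 1.
example = [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")]
print(delete_duplicates_not_in_min(example))
```

On the Example 1 input this prints the rows with id 1 and id 2, matching the expected output table.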

Method 2 - Using Self Join

Code

SQL
DELETE p FROM Person p
JOIN Person q ON p.Email = q.Email AND p.Id > q.Id;

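Here a row p is deleted whenever the join finds another row q with the same email and a smaller id, so only the smallest id for each email survives. The same delete can also be written with a comma join and a WHERE clause: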

DELETE p FROM Person p,
    Person q
WHERE
    p.Email = q.Email AND p.Id > q.Id;
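
To see which rows the self join selects for deletion, the sketch below runs the same join condition as a SELECT in an in-memory SQLite database. This is an assumption-laden illustration: SQLite does not accept the MySQL DELETE p FROM ... JOIN form, so the join is only used here to list the ids of the p rows that would be removed. The function name rows_matched_by_self_join and the extra row with id 4 are hypothetical additions to Example 1.

```python
import sqlite3

def rows_matched_by_self_join(rows):
    """Return ids of the p rows that pair with a smaller-id duplicate q."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO Person (id, email) VALUES (?, ?)", rows)
    # Same join condition as the DELETE in the text. DISTINCT is needed
    # because one p row can pair with several smaller-id q rows.
    matched = conn.execute(
        """
        SELECT DISTINCT p.id
        FROM Person p
        JOIN Person q ON p.email = q.email AND p.id > q.id
        ORDER BY p.id
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in matched]

# Example 1 plus one hypothetical extra duplicate (id 4).
rows = [(1, "john@example.com"), (2, "bob@example.com"),
        (3, "john@example.com"), (4, "john@example.com")]
print(rows_matched_by_self_join(rows))
```

For this input the matched ids are 3 and 4: both have a john@example.com row with a smaller id, while ids 1 and 2 have no smaller-id partner and are kept.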
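
Pandas
import pandas as pd

def delete_duplicate_emails(person: pd.DataFrame) -> None:
    # Sort by id so the first row kept for each email has the smallest id.
    person.sort_values(by='id', inplace=True)
    person.drop_duplicates(subset=['email'], keep='first', inplace=True)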