Leetcode 196 - Delete Duplicate Emails
Problem
Table: Person
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| id | int |
| email | varchar |
+-------------+---------+
id is the primary key column for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.
Write an SQL query to delete all the duplicate emails, keeping only one unique email with the smallest id. Note that you are supposed to write a DELETE statement and not a SELECT one.
After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.
Examples
Example 1:
Input: Person table:
+----+------------------+
| id | email |
+----+------------------+
| 1 | john@example.com |
| 2 | bob@example.com |
| 3 | john@example.com |
+----+------------------+
Output:
+----+------------------+
| id | email |
+----+------------------+
| 1 | john@example.com |
| 2 | bob@example.com |
+----+------------------+
Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.
Solution
Method 1 - Where Not in Min IDs
Code
SQL
DELETE FROM Person
WHERE id NOT IN (SELECT MIN(id) as id FROM Person GROUP BY email)
The code above does not work in MySQL; it fails with the following error:
You can't specify target table 'Person' for update in FROM clause
MySQL does not allow a DELETE or UPDATE statement to select from its own target table in a subquery. The workaround is to wrap the subquery in a derived table and give it an alias:
DELETE FROM Person
WHERE id NOT IN (
SELECT * FROM (
SELECT MIN(id)
FROM Person
GROUP BY email) as minIds);
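Surprisingly, this performed well. The inner query groups the rows by email and selects the smallest id in each group. The DELETE then removes every row whose id is not in that result. The extra SELECT * FROM (SELECT ...) wrapper is needed because MySQL does not let a statement modify a table that its own subquery reads from directly. Wrapping the grouped subquery in an aliased derived table makes MySQL compute it into a temporary result first, so the DELETE compares against that result instead of reading Person while modifying it.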
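To check the logic against Example 1 without a MySQL server, here is a minimal sketch that runs the same statement through Python's built-in sqlite3 module. The in-memory database and the variable names are assumptions made for illustration. SQLite does not have MySQL's target-table restriction, so this only verifies which rows survive and does not reproduce the error.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the MySQL Person table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Same statement as Method 1: delete every row whose id is not
# the smallest id of its email group.
conn.execute(
    """
    DELETE FROM Person
    WHERE id NOT IN (
        SELECT * FROM (
            SELECT MIN(id)
            FROM Person
            GROUP BY email) AS minIds)
    """
)

remaining = conn.execute("SELECT id, email FROM Person ORDER BY id").fetchall()
print(remaining)
conn.close()
```

Only the rows with id 1 and 2 should remain, matching the expected output of Example 1.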
Method 2 - Using Self Join
Code
SQL
DELETE p FROM Person p
JOIN Person q ON p.Email = q.Email AND p.Id > q.Id;
The same join can also be written with the older comma syntax:
DELETE p FROM Person p,
Person q
WHERE
p.Email = q.Email AND p.Id > q.Id;
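The self join pairs each row p with any row q that has the same email and a smaller id, so every p that finds a partner is a duplicate. The sketch below illustrates this with Python's sqlite3 module, under the assumption that an in-memory SQLite table stands in for the MySQL one. SQLite has no multi-table DELETE, so the delete step swaps in a correlated EXISTS subquery that expresses the same condition. It is not the MySQL statement itself.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the MySQL Person table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# The rows that the self join selects for deletion: each p that has
# a partner q with the same email and a smaller id.
doomed = conn.execute(
    """
    SELECT p.id
    FROM Person p
    JOIN Person q ON p.email = q.email AND p.id > q.id
    """
).fetchall()

# SQLite substitute for "DELETE p FROM Person p JOIN Person q ...":
# delete a row when a row with the same email and a smaller id exists.
conn.execute(
    """
    DELETE FROM Person
    WHERE EXISTS (
        SELECT 1
        FROM Person q
        WHERE q.email = Person.email AND q.id < Person.id)
    """
)

remaining = conn.execute("SELECT id, email FROM Person ORDER BY id").fetchall()
print(doomed, remaining)
conn.close()
```

For the Example 1 data, the join selects only the row with id 3, and the rows with id 1 and 2 remain after the delete.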
Pandas
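import pandas as pd

def delete_duplicate_emails(person: pd.DataFrame) -> None:
    # Sort by id so the smallest id for each email comes first.
    person.sort_values(by='id', inplace=True)
    # Keep the first (smallest-id) row per email; person is modified in place.
    person.drop_duplicates(subset=['email'], inplace=True)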
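As a quick check, the sketch below runs the function on the Example 1 rows, deliberately listed out of id order to show why the sort comes first. The sample DataFrame is an assumption made for illustration. The function is repeated so the snippet runs on its own.

```python
import pandas as pd

def delete_duplicate_emails(person: pd.DataFrame) -> None:
    # Sort by id so the smallest id for each email comes first.
    person.sort_values(by='id', inplace=True)
    # Keep the first (smallest-id) row per email; person is modified in place.
    person.drop_duplicates(subset=['email'], inplace=True)

# Example 1 data with the duplicate (id 3) placed first on purpose.
person = pd.DataFrame({
    'id': [3, 1, 2],
    'email': ['john@example.com', 'john@example.com', 'bob@example.com'],
})

delete_duplicate_emails(person)
print(person)
```

Without the sort, drop_duplicates would keep the first occurrence in row order, which here would be id 3 rather than id 1.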