Find Candidates for Data Scientist Position II
Problem
Table: Candidates
# +--------------+---------+
# | Column Name | Type |
# +--------------+---------+
# | candidate_id | int |
# | skill | varchar |
# | proficiency | int |
# +--------------+---------+
# (candidate_id, skill) is the unique key for this table.
# Each row includes candidate_id, skill, and proficiency level (1-5).
Table: Projects
+--------------+---------+
| Column Name | Type |
+--------------+---------+
| project_id | int |
| skill | varchar |
| importance | int |
+--------------+---------+
(project_id, skill) is the primary key for this table.
Each row includes project_id, required skill, and its importance (1-5) for the project.
Leetcode is staffing for multiple data science projects. Write a solution to find the best candidate foreach project based on the following criteria:
- Candidates must have all the skills required for a project.
- Calculate a score for each candidate-project pair as follows:
- Start with
100points - Add
10points for each skill where proficiency > importance - Subtract
5points for each skill where proficiency < importance - If the candidate's skill proficiency equal to the project's skill importance, the score remains unchanged
- Start with
Include only the top candidate (highest score) for each project. If there's a
tie , choose the candidate with the lower candidate_id. If there is
no suitable candidate for a project, do not return that project.
Return a result table ordered by project_id in ascending order.
The result format is in the following example.
Examples
Example 1:
Input:
`Candidates` table:
+--------------+-----------+-------------+
| candidate_id | skill | proficiency |
+--------------+-----------+-------------+
| 101 | Python | 5 |
| 101 | Tableau | 3 |
| 101 | PostgreSQL| 4 |
| 101 | TensorFlow| 2 |
| 102 | Python | 4 |
| 102 | Tableau | 5 |
| 102 | PostgreSQL| 4 |
| 102 | R | 4 |
| 103 | Python | 3 |
| 103 | Tableau | 5 |
| 103 | PostgreSQL| 5 |
| 103 | Spark | 4 |
+--------------+-----------+-------------+
`Projects` table:
+-------------+-----------+------------+
| project_id | skill | importance |
+-------------+-----------+------------+
| 501 | Python | 4 |
| 501 | Tableau | 3 |
| 501 | PostgreSQL| 5 |
| 502 | Python | 3 |
| 502 | Tableau | 4 |
| 502 | R | 2 |
+-------------+-----------+------------+
Output:
+-------------+--------------+-------+
| project_id | candidate_id | score |
+-------------+--------------+-------+
| 501 | 101 | 105 |
| 502 | 102 | 130 |
+-------------+--------------+-------+
Explanation:
* For Project 501, Candidate 101 has the highest score of 105. All other candidates have the same score but Candidate 101 has the lowest candidate_id among them.
* For Project 502, Candidate 102 has the highest score of 130.
The output table is ordered by project_id in ascending order.
Solution
Method 1 – Skill Matching and Scoring
Intuition
The key idea is to match each candidate to projects only if they possess all required skills. For each candidate-project pair, we calculate a score based on the difference between the candidate's proficiency and the project's skill importance. We then select the candidate with the highest score for each project, breaking ties by candidate_id.
Approach
- For each project, list all required skills.
- For each candidate, check if they have all the required skills for a project.
- For each matching skill, calculate the score:
- Start with 100 points.
- For each skill:
- Add 10 if proficiency > importance.
- Subtract 5 if proficiency < importance.
- No change if equal.
- For each project, select the candidate with the highest score (lowest candidate_id in case of tie).
- Return the results ordered by project_id.
Code
MySQL
WITH ProjectSkills AS (
SELECT project_id, COUNT(*) AS skill_count
FROM Projects
GROUP BY project_id
),
CandidateSkills AS (
SELECT p.project_id, c.candidate_id,
SUM(
CASE
WHEN c.proficiency > p.importance THEN 10
WHEN c.proficiency < p.importance THEN -5
ELSE 0
END
) + 100 AS score,
COUNT(*) AS matched_skills
FROM Projects p
JOIN Candidates c
ON p.skill = c.skill
GROUP BY p.project_id, c.candidate_id
)
SELECT cs.project_id, cs.candidate_id, cs.score
FROM CandidateSkills cs
JOIN ProjectSkills ps
ON cs.project_id = ps.project_id
WHERE cs.matched_skills = ps.skill_count
-- Only candidates with all required skills
AND cs.score = (
SELECT MAX(score)
FROM CandidateSkills cs2
WHERE cs2.project_id = cs.project_id
AND cs2.matched_skills = ps.skill_count
)
ORDER BY cs.project_id, cs.candidate_id
;
PostgreSQL
WITH ProjectSkills AS (
SELECT project_id, COUNT(*) AS skill_count
FROM Projects
GROUP BY project_id
),
CandidateSkills AS (
SELECT p.project_id, c.candidate_id,
SUM(
CASE
WHEN c.proficiency > p.importance THEN 10
WHEN c.proficiency < p.importance THEN -5
ELSE 0
END
) + 100 AS score,
COUNT(*) AS matched_skills
FROM Projects p
JOIN Candidates c
ON p.skill = c.skill
GROUP BY p.project_id, c.candidate_id
)
SELECT cs.project_id, cs.candidate_id, cs.score
FROM CandidateSkills cs
JOIN ProjectSkills ps
ON cs.project_id = ps.project_id
WHERE cs.matched_skills = ps.skill_count
AND cs.score = (
SELECT MAX(score)
FROM CandidateSkills cs2
WHERE cs2.project_id = cs.project_id
AND cs2.matched_skills = ps.skill_count
)
ORDER BY cs.project_id, cs.candidate_id
;
Python (pandas)
class Solution:
def find_best_candidates(
self,
candidates: 'pd.DataFrame',
projects: 'pd.DataFrame'
) -> 'pd.DataFrame':
ans = []
for pid, pgroup in projects.groupby('project_id'):
required = pgroup[['skill', 'importance']].set_index('skill')
best_score = float('-inf')
best_cid = None
for cid, cgroup in candidates.groupby('candidate_id'):
cskills = cgroup.set_index('skill')['proficiency']
if not required.index.isin(cskills.index).all():
continue
score = 100
for skill, imp in required['importance'].items():
prof = cskills[skill]
if prof > imp:
score += 10
elif prof < imp:
score -= 5
if (score > best_score) or (score == best_score and (best_cid is None or cid < best_cid)):
best_score = score
best_cid = cid
if best_cid is not None:
ans.append({'project_id': pid, 'candidate_id': best_cid, 'score': best_score})
return pd.DataFrame(ans).sort_values(['project_id', 'candidate_id'])
Complexity
- ⏰ Time complexity:
O(P * C * S)Where P = number of projects, C = number of candidates, S = average number of skills per project. For each project, we check all candidates and their skills. - 🧺 Space complexity:
O(P + C)For storing intermediate groupings and the result.