DNA Pattern Recognition
MediumUpdated: Sep 29, 2025
Practice on:
Problem
Table: Samples
+----------------+---------+
| Column Name | Type |
+----------------+---------+
| sample_id | int |
| dna_sequence | varchar |
| species | varchar |
+----------------+---------+
sample_id is the unique key for this table.
Each row contains a DNA sequence represented as a string of characters (A, T, G, C) and the species it was collected from.
Biologists are studying basic patterns in DNA sequences. Write a solution to identify sample_id with the following patterns:
- Sequences that start with ATG (a common start codon)
- Sequences that end with either TAA , TAG , or TGA (stop codons)
- Sequences containing the motif ATAT (a simple repeated pattern)
- Sequences that have at least
3consecutive G (like GGG or GGGG)
Return _the result table ordered by _sample_id inascending order.
The result format is in the following example.
Examples
Example 1:
Input:
Samples table:
+-----------+------------------+-----------+
| sample_id | dna_sequence | species |
+-----------+------------------+-----------+
| 1 | ATGCTAGCTAGCTAA | Human |
| 2 | GGGTCAATCATC | Human |
| 3 | ATATATCGTAGCTA | Human |
| 4 | ATGGGGTCATCATAA | Mouse |
| 5 | TCAGTCAGTCAG | Mouse |
| 6 | ATATCGCGCTAG | Zebrafish |
| 7 | CGTATGCGTCGTA | Zebrafish |
+-----------+------------------+-----------+
Output:
+-----------+------------------+-------------+-------------+------------+------------+------------+
| sample_id | dna_sequence | species | has_start | has_stop | has_atat | has_ggg |
+-----------+------------------+-------------+-------------+------------+------------+------------+
| 1 | ATGCTAGCTAGCTAA | Human | 1 | 1 | 0 | 0 |
| 2 | GGGTCAATCATC | Human | 0 | 0 | 0 | 1 |
| 3 | ATATATCGTAGCTA | Human | 0 | 0 | 1 | 0 |
| 4 | ATGGGGTCATCATAA | Mouse | 1 | 1 | 0 | 1 |
| 5 | TCAGTCAGTCAG | Mouse | 0 | 0 | 0 | 0 |
| 6 | ATATCGCGCTAG | Zebrafish | 0 | 1 | 1 | 0 |
| 7 | CGTATGCGTCGTA | Zebrafish | 0 | 0 | 0 | 0 |
+-----------+------------------+-------------+-------------+------------+------------+------------+
Explanation:
* Sample 1 (ATGCTAGCTAGCTAA):
* Starts with ATG (has_start = 1)
* Ends with TAA (has_stop = 1)
* Does not contain ATAT (has_atat = 0)
* Does not contain at least 3 consecutive 'G's (has_ggg = 0)
* Sample 2 (GGGTCAATCATC):
* Does not start with ATG (has_start = 0)
* Does not end with TAA, TAG, or TGA (has_stop = 0)
* Does not contain ATAT (has_atat = 0)
* Contains GGG (has_ggg = 1)
* Sample 3 (ATATATCGTAGCTA):
* Does not start with ATG (has_start = 0)
* Does not end with TAA, TAG, or TGA (has_stop = 0)
* Contains ATAT (has_atat = 1)
* Does not contain at least 3 consecutive 'G's (has_ggg = 0)
* Sample 4 (ATGGGGTCATCATAA):
* Starts with ATG (has_start = 1)
* Ends with TAA (has_stop = 1)
* Does not contain ATAT (has_atat = 0)
* Contains GGGG (has_ggg = 1)
* Sample 5 (TCAGTCAGTCAG):
* Does not match any patterns (all fields = 0)
* Sample 6 (ATATCGCGCTAG):
* Does not start with ATG (has_start = 0)
* Ends with TAG (has_stop = 1)
* Starts with ATAT (has_atat = 1)
* Does not contain at least 3 consecutive 'G's (has_ggg = 0)
* Sample 7 (CGTATGCGTCGTA):
* Does not start with ATG (has_start = 0)
* Does not end with TAA, "TAG", or "TGA" (has_stop = 0)
* Does not contain ATAT (has_atat = 0)
* Does not contain at least 3 consecutive 'G's (has_ggg = 0)
**Note:**
* The result is ordered by sample_id in ascending order
* For each pattern, 1 indicates the pattern is present and 0 indicates it is not present
Example 2:
Solution
Method 1 – SQL Pattern Matching
Intuition
We can use SQL string functions and pattern matching to check for the required DNA patterns:
- Use
LIKE 'ATG%'to check if the sequence starts with 'ATG'. - Use
LIKE '%TAA' OR LIKE '%TAG' OR LIKE '%TGA'to check for stop codons at the end. - Use
LIKE '%ATAT%'to check for the motif 'ATAT'. - Use a regular expression to check for at least 3 consecutive 'G's (e.g.,
REGEXP 'GGG').
Approach
- For each row in the
Samplestable, check:- If
dna_sequencestarts with 'ATG'. - If
dna_sequenceends with 'TAA', 'TAG', or 'TGA'. - If
dna_sequencecontains 'ATAT'. - If
dna_sequencecontains at least three consecutive 'G's.
- If
- Output the required columns and computed flags, ordered by
sample_idascending.
Code
MySQL
SELECT
sample_id,
dna_sequence,
species,
(dna_sequence LIKE 'ATG%') AS has_start,
(dna_sequence LIKE '%TAA' OR dna_sequence LIKE '%TAG' OR dna_sequence LIKE '%TGA') AS has_stop,
(dna_sequence LIKE '%ATAT%') AS has_atat,
(dna_sequence REGEXP 'GGG') AS has_ggg
FROM Samples
ORDER BY sample_id ASC;
PostgreSQL
SELECT
sample_id,
dna_sequence,
species,
(dna_sequence LIKE 'ATG%')::int AS has_start,
(dna_sequence LIKE '%TAA' OR dna_sequence LIKE '%TAG' OR dna_sequence LIKE '%TGA')::int AS has_stop,
(dna_sequence LIKE '%ATAT%')::int AS has_atat,
(dna_sequence ~ 'GGG')::int AS has_ggg
FROM Samples
ORDER BY sample_id ASC;
Python (pandas)
import pandas as pd
import re
def dna_pattern_recognition(samples: pd.DataFrame) -> pd.DataFrame:
def has_start(s: str) -> int:
return int(s.startswith('ATG'))
def has_stop(s: str) -> int:
return int(s.endswith('TAA') or s.endswith('TAG') or s.endswith('TGA'))
def has_atat(s: str) -> int:
return int('ATAT' in s)
def has_ggg(s: str) -> int:
return int(bool(re.search(r'G{3,}', s)))
df = samples.copy()
df['has_start'] = df['dna_sequence'].apply(has_start)
df['has_stop'] = df['dna_sequence'].apply(has_stop)
df['has_atat'] = df['dna_sequence'].apply(has_atat)
df['has_ggg'] = df['dna_sequence'].apply(has_ggg)
return df.sort_values('sample_id')[['sample_id', 'dna_sequence', 'species', 'has_start', 'has_stop', 'has_atat', 'has_ggg']]
Complexity
- ⏰ Time complexity:
O(n), where n is the number of samples. Each pattern check is O(1) per row. - 🧺 Space complexity:
O(n), for the output table.