Problem

Table: Samples

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| sample_id      | int     |
| dna_sequence   | varchar |
| species        | varchar |
+----------------+---------+
sample_id is the unique key for this table.
Each row contains a DNA sequence represented as a string of characters (A, T, G, C) and the species it was collected from.

Biologists are studying basic patterns in DNA sequences. Write a solution to identify sample_id with the following patterns:

  • Sequences that start with ATG (a common start codon)
  • Sequences that end with either TAA , TAG , or TGA (stop codons)
  • Sequences containing the motif ATAT (a simple repeated pattern)
  • Sequences that have at least 3 consecutive G (like GGG or GGGG)

Return _the result table ordered by _sample_id inascending order.

The result format is in the following example.

Example 1:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
Input:
Samples table:
+-----------+------------------+-----------+
| sample_id | dna_sequence     | species   |
+-----------+------------------+-----------+
| 1         | ATGCTAGCTAGCTAA  | Human     |
| 2         | GGGTCAATCATC     | Human     |
| 3         | ATATATCGTAGCTA   | Human     |
| 4         | ATGGGGTCATCATAA  | Mouse     |
| 5         | TCAGTCAGTCAG     | Mouse     |
| 6         | ATATCGCGCTAG     | Zebrafish |
| 7         | CGTATGCGTCGTA    | Zebrafish |
+-----------+------------------+-----------+
Output:
+-----------+------------------+-------------+-------------+------------+------------+------------+
| sample_id | dna_sequence     | species     | has_start   | has_stop   | has_atat   | has_ggg    |
+-----------+------------------+-------------+-------------+------------+------------+------------+
| 1         | ATGCTAGCTAGCTAA  | Human       | 1           | 1          | 0          | 0          |
| 2         | GGGTCAATCATC     | Human       | 0           | 0          | 0          | 1          |
| 3         | ATATATCGTAGCTA   | Human       | 0           | 0          | 1          | 0          |
| 4         | ATGGGGTCATCATAA  | Mouse       | 1           | 1          | 0          | 1          |
| 5         | TCAGTCAGTCAG     | Mouse       | 0           | 0          | 0          | 0          |
| 6         | ATATCGCGCTAG     | Zebrafish   | 0           | 1          | 1          | 0          |
| 7         | CGTATGCGTCGTA    | Zebrafish   | 0           | 0          | 0          | 0          |
+-----------+------------------+-------------+-------------+------------+------------+------------+
Explanation:
* Sample 1 (ATGCTAGCTAGCTAA): 
* Starts with ATG (has_start = 1)
* Ends with TAA (has_stop = 1)
* Does not contain ATAT (has_atat = 0)
* Does not contain at least 3 consecutive 'G's (has_ggg = 0)
* Sample 2 (GGGTCAATCATC): 
* Does not start with ATG (has_start = 0)
* Does not end with TAA, TAG, or TGA (has_stop = 0)
* Does not contain ATAT (has_atat = 0)
* Contains GGG (has_ggg = 1)
* Sample 3 (ATATATCGTAGCTA): 
* Does not start with ATG (has_start = 0)
* Does not end with TAA, TAG, or TGA (has_stop = 0)
* Contains ATAT (has_atat = 1)
* Does not contain at least 3 consecutive 'G's (has_ggg = 0)
* Sample 4 (ATGGGGTCATCATAA): 
* Starts with ATG (has_start = 1)
* Ends with TAA (has_stop = 1)
* Does not contain ATAT (has_atat = 0)
* Contains GGGG (has_ggg = 1)
* Sample 5 (TCAGTCAGTCAG): 
* Does not match any patterns (all fields = 0)
* Sample 6 (ATATCGCGCTAG): 
* Does not start with ATG (has_start = 0)
* Ends with TAG (has_stop = 1)
* Starts with ATAT (has_atat = 1)
* Does not contain at least 3 consecutive 'G's (has_ggg = 0)
* Sample 7 (CGTATGCGTCGTA): 
* Does not start with ATG (has_start = 0)
* Does not end with TAA, "TAG", or "TGA" (has_stop = 0)
* Does not contain ATAT (has_atat = 0)
* Does not contain at least 3 consecutive 'G's (has_ggg = 0)
**Note:**
* The result is ordered by sample_id in ascending order
* For each pattern, 1 indicates the pattern is present and 0 indicates it is not present

Example 2:

Solution

Method 1 – SQL Pattern Matching

Intuition

We can use SQL string functions and pattern matching to check for the required DNA patterns:

  • Use LIKE 'ATG%' to check if the sequence starts with ‘ATG’.
  • Use LIKE '%TAA' OR LIKE '%TAG' OR LIKE '%TGA' to check for stop codons at the end.
  • Use LIKE '%ATAT%' to check for the motif ‘ATAT’.
  • Use a regular expression to check for at least 3 consecutive ‘G’s (e.g., REGEXP 'GGG').

Approach

  1. For each row in the Samples table, check:
    • If dna_sequence starts with ‘ATG’.
    • If dna_sequence ends with ‘TAA’, ‘TAG’, or ‘TGA’.
    • If dna_sequence contains ‘ATAT’.
    • If dna_sequence contains at least three consecutive ‘G’s.
  2. Output the required columns and computed flags, ordered by sample_id ascending.

Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT
  sample_id,
  dna_sequence,
  species,
  (dna_sequence LIKE 'ATG%') AS has_start,
  (dna_sequence LIKE '%TAA' OR dna_sequence LIKE '%TAG' OR dna_sequence LIKE '%TGA') AS has_stop,
  (dna_sequence LIKE '%ATAT%') AS has_atat,
  (dna_sequence REGEXP 'GGG') AS has_ggg
FROM Samples
ORDER BY sample_id ASC;
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT
  sample_id,
  dna_sequence,
  species,
  (dna_sequence LIKE 'ATG%')::int AS has_start,
  (dna_sequence LIKE '%TAA' OR dna_sequence LIKE '%TAG' OR dna_sequence LIKE '%TGA')::int AS has_stop,
  (dna_sequence LIKE '%ATAT%')::int AS has_atat,
  (dna_sequence ~ 'GGG')::int AS has_ggg
FROM Samples
ORDER BY sample_id ASC;
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import pandas as pd
import re

def dna_pattern_recognition(samples: pd.DataFrame) -> pd.DataFrame:
    def has_start(s: str) -> int:
        return int(s.startswith('ATG'))
    def has_stop(s: str) -> int:
        return int(s.endswith('TAA') or s.endswith('TAG') or s.endswith('TGA'))
    def has_atat(s: str) -> int:
        return int('ATAT' in s)
    def has_ggg(s: str) -> int:
        return int(bool(re.search(r'G{3,}', s)))
    df = samples.copy()
    df['has_start'] = df['dna_sequence'].apply(has_start)
    df['has_stop'] = df['dna_sequence'].apply(has_stop)
    df['has_atat'] = df['dna_sequence'].apply(has_atat)
    df['has_ggg'] = df['dna_sequence'].apply(has_ggg)
    return df.sort_values('sample_id')[['sample_id', 'dna_sequence', 'species', 'has_start', 'has_stop', 'has_atat', 'has_ggg']]

Complexity

  • ⏰ Time complexity: O(n), where n is the number of samples. Each pattern check is O(1) per row.
  • 🧺 Space complexity: O(n), for the output table.