Repeated DNA Sequences
Problem
The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
- For example,
"ACGAATTCCG"is a DNA sequence.
When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Examples
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Solution
Method 1 – Brute Force Using HashSet
Intuition
We want to find all 10-letter-long substrings that appear more than once in the DNA string. The simplest way is to check every substring of length 10, and use a set to track which ones we've seen before. If we see a substring again, we add it to the result.
Approach
- Iterate over all substrings of length 10 in the string.
- Use a set to record substrings we've seen.
- If a substring is seen again, add it to the result set.
- Return all substrings that appeared more than once.
Code
C++
#include <string>
#include <vector>
#include <unordered_set>
using namespace std;
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
unordered_set<string> seen, repeated;
for (int i = 0; i + 9 < s.size(); ++i) {
string sub = s.substr(i, 10);
if (!seen.insert(sub).second) repeated.insert(sub);
}
return vector<string>(repeated.begin(), repeated.end());
}
};
Go
func findRepeatedDnaSequences(s string) []string {
seen := make(map[string]bool)
repeated := make(map[string]bool)
for i := 0; i+9 < len(s); i++ {
sub := s[i : i+10]
if seen[sub] {
repeated[sub] = true
} else {
seen[sub] = true
}
}
res := []string{}
for k := range repeated {
res = append(res, k)
}
return res
}
Java
import java.util.*;
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
Set<String> seen = new HashSet<>();
Set<String> repeated = new HashSet<>();
for (int i = 0; i + 9 < s.length(); i++) {
String sub = s.substring(i, i + 10);
if (!seen.add(sub)) repeated.add(sub);
}
return new ArrayList<>(repeated);
}
}
Kotlin
class Solution {
fun findRepeatedDnaSequences(s: String): List<String> {
val seen = mutableSetOf<String>()
val repeated = mutableSetOf<String>()
for (i in 0..s.length - 10) {
val sub = s.substring(i, i + 10)
if (!seen.add(sub)) repeated.add(sub)
}
return repeated.toList()
}
}
Python
class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
seen = set()
repeated = set()
for i in range(len(s) - 9):
sub = s[i:i+10]
if sub in seen:
repeated.add(sub)
else:
seen.add(sub)
return list(repeated)
Rust
use std::collections::HashSet;
impl Solution {
pub fn find_repeated_dna_sequences(s: String) -> Vec<String> {
let mut seen = HashSet::new();
let mut repeated = HashSet::new();
let bytes = s.as_bytes();
for i in 0..=s.len().saturating_sub(10) {
let sub = &bytes[i..i+10];
if !seen.insert(sub) {
repeated.insert(sub);
}
}
repeated.into_iter().map(|b| String::from_utf8(b.to_vec()).unwrap()).collect()
}
}
TypeScript
function findRepeatedDnaSequences(s: string): string[] {
const seen = new Set<string>();
const repeated = new Set<string>();
for (let i = 0; i + 9 < s.length; i++) {
const sub = s.substring(i, i + 10);
if (seen.has(sub)) {
repeated.add(sub);
} else {
seen.add(sub);
}
}
return Array.from(repeated);
}
Complexity
- ⏰ Time complexity: O(n * L), where n is the length of s and L = 10 (substring extraction is O(L)).
- 🧺 Space complexity: O(n * L) for the sets.
substring is a not so good method to use in interview, because s.substring() is implemented in O(s.length()). This code then becomes O(n ^ 2)
Method 2 - Using Hashset but with StringBuilder
Method 2 – HashSet with StringBuilder (or Mutable String)
Intuition
Using substring() repeatedly creates new immutable strings, which is inefficient. Instead, we can use a mutable string (like StringBuilder in Java, or similar in other languages) to extract substrings more efficiently, reducing memory allocations.
Approach
- Use a mutable string or efficient substring extraction to avoid repeated allocations.
- Use a set to track seen substrings and another for repeated ones.
- Iterate over all substrings of length 10, adding to sets as in Method 1.
Code
C++
#include <string>
#include <vector>
#include <unordered_set>
using namespace std;
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
unordered_set<string> seen, repeated;
string curr;
for (int i = 0; i + 9 < s.size(); ++i) {
curr = s.substr(i, 10); // substr is efficient in C++
if (!seen.insert(curr).second) repeated.insert(curr);
}
return vector<string>(repeated.begin(), repeated.end());
}
};
Go
func findRepeatedDnaSequences(s string) []string {
seen := make(map[string]bool)
repeated := make(map[string]bool)
for i := 0; i+9 < len(s); i++ {
curr := s[i : i+10]
if seen[curr] {
repeated[curr] = true
} else {
seen[curr] = true
}
}
res := []string{}
for k := range repeated {
res = append(res, k)
}
return res
}
Java
import java.util.*;
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
if (s == null || s.length() < 10) return Collections.emptyList();
Set<String> seen = new HashSet<>();
Set<String> result = new HashSet<>();
StringBuilder builder = new StringBuilder(s);
for (int i = 0; i <= s.length() - 10; i++) {
String curr = builder.substring(i, i + 10);
if (!seen.add(curr)) result.add(curr);
}
return new ArrayList<>(result);
}
}
Kotlin
class Solution {
fun findRepeatedDnaSequences(s: String): List<String> {
if (s.length < 10) return emptyList()
val seen = mutableSetOf<String>()
val result = mutableSetOf<String>()
val builder = StringBuilder(s)
for (i in 0..s.length - 10) {
val curr = builder.substring(i, i + 10)
if (!seen.add(curr)) result.add(curr)
}
return result.toList()
}
}
Python
class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
if len(s) < 10:
return []
seen = set()
result = set()
for i in range(len(s) - 9):
curr = s[i:i+10]
if curr in seen:
result.add(curr)
else:
seen.add(curr)
return list(result)
Rust
use std::collections::HashSet;
impl Solution {
pub fn find_repeated_dna_sequences(s: String) -> Vec<String> {
if s.len() < 10 { return vec![]; }
let mut seen = HashSet::new();
let mut result = HashSet::new();
let bytes = s.as_bytes();
for i in 0..=s.len() - 10 {
let curr = &bytes[i..i+10];
if !seen.insert(curr) {
result.insert(curr);
}
}
result.into_iter().map(|b| String::from_utf8(b.to_vec()).unwrap()).collect()
}
}
TypeScript
function findRepeatedDnaSequences(s: string): string[] {
if (s.length < 10) return [];
const seen = new Set<string>();
const result = new Set<string>();
for (let i = 0; i + 9 < s.length; i++) {
const curr = s.substring(i, i + 10);
if (seen.has(curr)) {
result.add(curr);
} else {
seen.add(curr);
}
}
return Array.from(result);
}
Complexity
- ⏰ Time complexity: O(n * L), where n is the length of s and L = 10 (substring extraction is O(L)).
- 🧺 Space complexity: O(n * L) for the sets.
Method 3 - Using Hashset and Bit Representation
Dry Run (Step-by-step Bit Manipulation)
Let's walk through how the rolling hash encodes a DNA substring using bit manipulation:
Suppose our substring is "CG" (for simplicity, but the same logic applies for 10 letters):
-
Start with
v = 0(our hash value). -
For each character, shift
vleft by 2 bits to make space for the new character, then OR with the 2-bit encoding of the character:-
For 'C':
v <<= 2→v = 00v |= 01(since 'C' = 01) →v = 01
-
For 'G':
v <<= 2→v = 0100v |= 10(since 'G' = 10) →v = 0110
-
-
Continue this for all 10 characters. After 10 steps,
vis a 20-bit integer representing the substring.
When sliding the window to the next substring:
-
To remove the leftmost (oldest) character and add a new one:
- Shift
vleft by 2 bits:v <<= 2 - OR with the new character's bits:
v |= map[s.charAt(i+9)] - To keep only the last 20 bits (drop the oldest character), AND with a mask:
v &= (1 << 20) - 1
- Shift
-
If you want to explicitly clear the leftmost 2 bits (oldest character):
v &= ~(3 << 18)- Here,
3 << 18is0b110000...(18 zeros), so~(3 << 18)is0b001111...(first 2 bits zeroed)
This way, you efficiently encode and compare each 10-letter substring using only integer operations.
Method 3 – HashSet and Bit Manipulation (Rolling Hash)
Intuition
Each DNA character can be encoded in 2 bits: A=00, C=01, G=10, T=11. Thus, a 10-letter sequence can be represented as a 20-bit integer. We can use a rolling hash to efficiently encode and compare substrings, using sets to track which hashes we've seen and which are repeated.
Approach
- Map each character to 2 bits.
- For the first 10 characters, build the initial hash value.
- For each subsequent character, update the hash by shifting and adding the new character, keeping only the last 20 bits.
- Use sets to track seen and repeated hashes.
- If a hash is seen again, add the substring to the result.
Code
C++
#include <string>
#include <vector>
#include <unordered_set>
#include <unordered_map>
using namespace std;
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
if (s.size() < 10) return {};
unordered_map<char, int> enc = {{'A',0},{'C',1},{'G',2},{'T',3}};
unordered_set<int> seen, repeated;
vector<string> res;
int hash = 0, mask = (1 << 20) - 1;
for (int i = 0; i < s.size(); ++i) {
hash = ((hash << 2) | enc[s[i]]) & mask;
if (i >= 9) {
if (!seen.insert(hash).second && repeated.insert(hash).second)
res.push_back(s.substr(i-9, 10));
}
}
return res;
}
};
Go
func findRepeatedDnaSequences(s string) []string {
if len(s) < 10 {
return nil
}
enc := map[byte]int{'A':0,'C':1,'G':2,'T':3}
seen := make(map[int]bool)
repeated := make(map[int]bool)
res := []string{}
hash, mask := 0, (1<<20)-1
for i := 0; i < len(s); i++ {
hash = ((hash << 2) | enc[s[i]]) & mask
if i >= 9 {
if seen[hash] {
if !repeated[hash] {
res = append(res, s[i-9:i+1])
repeated[hash] = true
}
} else {
seen[hash] = true
}
}
}
return res
}
Java
import java.util.*;
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
if (s.length() < 10) return Collections.emptyList();
Map<Character, Integer> enc = Map.of('A',0,'C',1,'G',2,'T',3);
Set<Integer> seen = new HashSet<>();
Set<Integer> repeated = new HashSet<>();
List<String> res = new ArrayList<>();
int hash = 0, mask = (1 << 20) - 1;
for (int i = 0; i < s.length(); i++) {
hash = ((hash << 2) | enc.get(s.charAt(i))) & mask;
if (i >= 9) {
if (!seen.add(hash) && repeated.add(hash))
res.add(s.substring(i-9, i+1));
}
}
return res;
}
}
Kotlin
class Solution {
fun findRepeatedDnaSequences(s: String): List<String> {
if (s.length < 10) return emptyList()
val enc = mapOf('A' to 0, 'C' to 1, 'G' to 2, 'T' to 3)
val seen = mutableSetOf<Int>()
val repeated = mutableSetOf<Int>()
val res = mutableListOf<String>()
var hash = 0
val mask = (1 shl 20) - 1
for (i in s.indices) {
hash = ((hash shl 2) or enc[s[i]]!!) and mask
if (i >= 9) {
if (!seen.add(hash) && repeated.add(hash))
res.add(s.substring(i-9, i+1))
}
}
return res
}
}
Python
class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
if len(s) < 10:
return []
enc = {'A':0,'C':1,'G':2,'T':3}
seen = set()
repeated = set()
res = []
hash = 0
mask = (1 << 20) - 1
for i, c in enumerate(s):
hash = ((hash << 2) | enc[c]) & mask
if i >= 9:
if hash in seen:
if hash not in repeated:
res.append(s[i-9:i+1])
repeated.add(hash)
else:
seen.add(hash)
return res
Rust
use std::collections::HashSet;
impl Solution {
pub fn find_repeated_dna_sequences(s: String) -> Vec<String> {
if s.len() < 10 { return vec![]; }
let enc = |c: u8| match c {
b'A' => 0,
b'C' => 1,
b'G' => 2,
b'T' => 3,
_ => 0,
};
let mut seen = HashSet::new();
let mut repeated = HashSet::new();
let bytes = s.as_bytes();
let mut hash = 0u32;
let mask = (1 << 20) - 1;
let mut res = Vec::new();
for i in 0..bytes.len() {
hash = ((hash << 2) | enc(bytes[i])) & mask;
if i >= 9 {
if !seen.insert(hash) && repeated.insert(hash) {
res.push(s[i-9..=i].to_string());
}
}
}
res
}
}
TypeScript
function findRepeatedDnaSequences(s: string): string[] {
if (s.length < 10) return [];
const enc: Record<string, number> = {A:0, C:1, G:2, T:3};
const seen = new Set<number>();
const repeated = new Set<number>();
const res: string[] = [];
let hash = 0, mask = (1 << 20) - 1;
for (let i = 0; i < s.length; i++) {
hash = ((hash << 2) | enc[s[i]]) & mask;
if (i >= 9) {
if (seen.has(hash)) {
if (!repeated.has(hash)) {
res.push(s.substring(i-9, i+1));
repeated.add(hash);
}
} else {
seen.add(hash);
}
}
}
return res;
}
Dry Run Example
Suppose s = "ACGTACGTAC". Let's encode the first 10 characters:
| Index | Char | Bits | Hash (binary) |
|---|---|---|---|
| 0 | A | 00 | 00 |
| 1 | C | 01 | 0001 |
| 2 | G | 10 | 000110 |
| 3 | T | 11 | 00011011 |
| ... | ... | ... | ... |
At each step, shift the hash left by 2 bits and OR with the new character's bits. After 10 characters, the hash represents the 10-letter substring. As you slide the window, drop the leftmost 2 bits and add the new character's bits, always keeping only the last 20 bits.
Bitwise Example (Java/Python-style pseudocode)
Suppose we want to encode the substring "ACGTACGTAC":
int hash = 0;
for (int i = 0; i < 10; i++) {
hash <<= 2; // shift left by 2 bits
hash |= enc[s.charAt(i)]; // add new character's bits
// For example, after 'A' (00): hash = 0
// After 'C' (01): hash = 0b0001
// After 'G' (10): hash = 0b000110
// After 'T' (11): hash = 0b00011011
// ... and so on
}
When sliding the window, to remove the leftmost 2 bits and add a new character:
// Remove the leftmost 2 bits (keep only last 20 bits)
int mask = (1 << 20) - 1; // 20 bits set to 1: 0b11111111111111111111
hash = ((hash << 2) | enc[s.charAt(i)]) & mask;
Why (3 << 18)?
To clear the leftmost 2 bits of a 20-bit integer:
int val = ...; // some 20-bit value
val &= ~(3 << 18); // 3 = 0b11, so (3 << 18) = 0b110000... (18 zeros)
// ~(3 << 18) = 0b00111111111111111111 (first 2 bits zeroed)
// This operation keeps only the last 18 bits
This is useful if you want to manually remove the oldest character from the rolling hash.
Summary
- Shift left by 2 bits to make space for the new character
- OR with the new character's 2-bit encoding
- AND with mask (0b111...111, 20 bits) to keep only the last 20 bits
- Use sets to track which hashes have been seen and which are repeated
If a hash is seen again, the substring is repeated.
Complexity
- ⏰ Time complexity: O(n), where n is the length of s. Each character is processed once, and set operations are O(1).
- 🧺 Space complexity: O(n) for the sets.