The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
For example, "ACGAATTCCG" is a DNA sequence.
When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
We want to find all 10-letter-long substrings that appear more than once in the DNA string. The simplest way is to check every substring of length 10, and use a set to track which ones we’ve seen before. If we see a substring again, we add it to the result.
import java.util.*;
classSolution {
public List<String>findRepeatedDnaSequences(String s) {
Set<String> seen =new HashSet<>();
Set<String> repeated =new HashSet<>();
for (int i = 0; i + 9 < s.length(); i++) {
String sub = s.substring(i, i + 10);
if (!seen.add(sub)) repeated.add(sub);
}
returnnew ArrayList<>(repeated);
}
}
1
2
3
4
5
6
7
8
9
10
11
classSolution {
funfindRepeatedDnaSequences(s: String): List<String> {
val seen = mutableSetOf<String>()
val repeated = mutableSetOf<String>()
for (i in0..s.length - 10) {
val sub = s.substring(i, i + 10)
if (!seen.add(sub)) repeated.add(sub)
}
return repeated.toList()
}
}
1
2
3
4
5
6
7
8
9
10
11
classSolution:
deffindRepeatedDnaSequences(self, s: str) -> list[str]:
seen = set()
repeated = set()
for i in range(len(s) -9):
sub = s[i:i+10]
if sub in seen:
repeated.add(sub)
else:
seen.add(sub)
return list(repeated)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
use std::collections::HashSet;
impl Solution {
pubfnfind_repeated_dna_sequences(s: String) -> Vec<String> {
letmut seen = HashSet::new();
letmut repeated = HashSet::new();
let bytes = s.as_bytes();
for i in0..=s.len().saturating_sub(10) {
let sub =&bytes[i..i+10];
if!seen.insert(sub) {
repeated.insert(sub);
}
}
repeated.into_iter().map(|b| String::from_utf8(b.to_vec()).unwrap()).collect()
}
}
Using substring() repeatedly creates new immutable strings, which is inefficient. Instead, we can use a mutable string (like StringBuilder in Java, or similar in other languages) to extract substrings more efficiently, reducing memory allocations.
import java.util.*;
classSolution {
public List<String>findRepeatedDnaSequences(String s) {
if (s ==null|| s.length() < 10) return Collections.emptyList();
Set<String> seen =new HashSet<>();
Set<String> result =new HashSet<>();
StringBuilder builder =new StringBuilder(s);
for (int i = 0; i <= s.length() - 10; i++) {
String curr = builder.substring(i, i + 10);
if (!seen.add(curr)) result.add(curr);
}
returnnew ArrayList<>(result);
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
classSolution {
funfindRepeatedDnaSequences(s: String): List<String> {
if (s.length < 10) return emptyList()
val seen = mutableSetOf<String>()
val result = mutableSetOf<String>()
val builder = StringBuilder(s)
for (i in0..s.length - 10) {
val curr = builder.substring(i, i + 10)
if (!seen.add(curr)) result.add(curr)
}
return result.toList()
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
classSolution:
deffindRepeatedDnaSequences(self, s: str) -> list[str]:
if len(s) <10:
return []
seen = set()
result = set()
for i in range(len(s) -9):
curr = s[i:i+10]
if curr in seen:
result.add(curr)
else:
seen.add(curr)
return list(result)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
use std::collections::HashSet;
impl Solution {
pubfnfind_repeated_dna_sequences(s: String) -> Vec<String> {
if s.len() <10 { returnvec![]; }
letmut seen = HashSet::new();
letmut result = HashSet::new();
let bytes = s.as_bytes();
for i in0..=s.len() -10 {
let curr =&bytes[i..i+10];
if!seen.insert(curr) {
result.insert(curr);
}
}
result.into_iter().map(|b| String::from_utf8(b.to_vec()).unwrap()).collect()
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
functionfindRepeatedDnaSequences(s: string):string[] {
if (s.length<10) return [];
constseen=newSet<string>();
constresult=newSet<string>();
for (leti=0; i+9<s.length; i++) {
constcurr=s.substring(i, i+10);
if (seen.has(curr)) {
result.add(curr);
} else {
seen.add(curr);
}
}
return Array.from(result);
}
Each DNA character can be encoded in 2 bits: A=00, C=01, G=10, T=11. Thus, a 10-letter sequence can be represented as a 20-bit integer. We can use a rolling hash to efficiently encode and compare substrings, using sets to track which hashes we’ve seen and which are repeated.
import java.util.*;
classSolution {
public List<String>findRepeatedDnaSequences(String s) {
if (s.length() < 10) return Collections.emptyList();
Map<Character, Integer> enc = Map.of('A',0,'C',1,'G',2,'T',3);
Set<Integer> seen =new HashSet<>();
Set<Integer> repeated =new HashSet<>();
List<String> res =new ArrayList<>();
int hash = 0, mask = (1 << 20) - 1;
for (int i = 0; i < s.length(); i++) {
hash = ((hash << 2) | enc.get(s.charAt(i))) & mask;
if (i >= 9) {
if (!seen.add(hash) && repeated.add(hash))
res.add(s.substring(i-9, i+1));
}
}
return res;
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
classSolution {
funfindRepeatedDnaSequences(s: String): List<String> {
if (s.length < 10) return emptyList()
val enc = mapOf('A' to 0, 'C' to 1, 'G' to 2, 'T' to 3)
val seen = mutableSetOf<Int>()
val repeated = mutableSetOf<Int>()
val res = mutableListOf<String>()
var hash = 0val mask = (1 shl 20) - 1for (i in s.indices) {
hash = ((hash shl 2) or enc[s[i]]!!) and mask
if (i >=9) {
if (!seen.add(hash) && repeated.add(hash))
res.add(s.substring(i-9, i+1))
}
}
return res
}
}
classSolution:
deffindRepeatedDnaSequences(self, s: str) -> list[str]:
if len(s) <10:
return []
enc = {'A':0,'C':1,'G':2,'T':3}
seen = set()
repeated = set()
res = []
hash =0 mask = (1<<20) -1for i, c in enumerate(s):
hash = ((hash <<2) | enc[c]) & mask
if i >=9:
if hash in seen:
if hash notin repeated:
res.append(s[i-9:i+1])
repeated.add(hash)
else:
seen.add(hash)
return res
Suppose s = “ACGTACGTAC”. Let’s encode the first 10 characters:
Index
Char
Bits
Hash (binary)
0
A
00
00
1
C
01
0001
2
G
10
000110
3
T
11
00011011
…
…
…
…
At each step, shift the hash left by 2 bits and OR with the new character’s bits. After 10 characters, the hash represents the 10-letter substring. As you slide the window, drop the leftmost 2 bits and add the new character’s bits, always keeping only the last 20 bits.
Suppose we want to encode the substring “ACGTACGTAC”:
1
2
3
4
5
6
7
8
9
10
int hash = 0;
for (int i = 0; i < 10; i++) {
hash <<= 2; // shift left by 2 bits hash |= enc[s.charAt(i)]; // add new character's bits// For example, after 'A' (00): hash = 0// After 'C' (01): hash = 0b0001// After 'G' (10): hash = 0b000110// After 'T' (11): hash = 0b00011011// ... and so on}
When sliding the window, to remove the leftmost 2 bits and add a new character:
1
2
3
// Remove the leftmost 2 bits (keep only last 20 bits)int mask = (1 << 20) - 1; // 20 bits set to 1: 0b11111111111111111111hash = ((hash << 2) | enc[s.charAt(i)]) & mask;
int val = ...; // some 20-bit valueval &=~(3 << 18); // 3 = 0b11, so (3 << 18) = 0b110000... (18 zeros)// ~(3 << 18) = 0b00111111111111111111 (first 2 bits zeroed)// This operation keeps only the last 18 bits
This is useful if you want to manually remove the oldest character from the rolling hash.