Repeated DNA Sequences

Problem

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Examples

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Solution

Method 1 – Brute Force Using HashSet

Intuition

We want to find all 10-letter-long substrings that appear more than once in the DNA string. The simplest way is to check every substring of length 10, and use a set to track which ones we've seen before. If we see a substring again, we add it to the result.

Approach

Iterate over all substrings of length 10 in the string.
Use a set to record substrings we've seen.
If a substring is seen again, add it to the result set.
Return all substrings that appeared more than once.

Code

C++

#include <string>
#include <vector>
#include <unordered_set>
using namespace std;

class Solution {
public:
	vector<string> findRepeatedDnaSequences(string s) {
		unordered_set<string> seen, repeated;
		for (int i = 0; i + 9 < s.size(); ++i) {
			string sub = s.substr(i, 10);
			if (!seen.insert(sub).second) repeated.insert(sub);
		}
		return vector<string>(repeated.begin(), repeated.end());
	}
};

Go

func findRepeatedDnaSequences(s string) []string {
	seen := make(map[string]bool)
	repeated := make(map[string]bool)
	for i := 0; i+9 < len(s); i++ {
		sub := s[i : i+10]
		if seen[sub] {
			repeated[sub] = true
		} else {
			seen[sub] = true
		}
	}
	res := []string{}
	for k := range repeated {
		res = append(res, k)
	}
	return res
}

Java

import java.util.*;

class Solution {
	public List<String> findRepeatedDnaSequences(String s) {
		Set<String> seen = new HashSet<>();
		Set<String> repeated = new HashSet<>();
		for (int i = 0; i + 9 < s.length(); i++) {
			String sub = s.substring(i, i + 10);
			if (!seen.add(sub)) repeated.add(sub);
		}
		return new ArrayList<>(repeated);
	}
}

Kotlin

class Solution {
	fun findRepeatedDnaSequences(s: String): List<String> {
		val seen = mutableSetOf<String>()
		val repeated = mutableSetOf<String>()
		for (i in 0..s.length - 10) {
			val sub = s.substring(i, i + 10)
			if (!seen.add(sub)) repeated.add(sub)
		}
		return repeated.toList()
	}
}

Python

class Solution:
	def findRepeatedDnaSequences(self, s: str) -> list[str]:
		seen = set()
		repeated = set()
		for i in range(len(s) - 9):
			sub = s[i:i+10]
			if sub in seen:
				repeated.add(sub)
			else:
				seen.add(sub)
		return list(repeated)

Rust

use std::collections::HashSet;

impl Solution {
	pub fn find_repeated_dna_sequences(s: String) -> Vec<String> {
		let mut seen = HashSet::new();
		let mut repeated = HashSet::new();
		let bytes = s.as_bytes();
		for i in 0..=s.len().saturating_sub(10) {
			let sub = &bytes[i..i+10];
			if !seen.insert(sub) {
				repeated.insert(sub);
			}
		}
		repeated.into_iter().map(|b| String::from_utf8(b.to_vec()).unwrap()).collect()
	}
}

TypeScript

function findRepeatedDnaSequences(s: string): string[] {
	const seen = new Set<string>();
	const repeated = new Set<string>();
	for (let i = 0; i + 9 < s.length; i++) {
		const sub = s.substring(i, i + 10);
		if (seen.has(sub)) {
			repeated.add(sub);
		} else {
			seen.add(sub);
		}
	}
	return Array.from(repeated);
}

Complexity

⏰ Time complexity: O(n * L), where n is the length of s and L = 10 (substring extraction is O(L)).
🧺 Space complexity: O(n * L) for the sets.

substring is a not so good method to use in interview, because s.substring() is implemented in O(s.length()). This code then becomes O(n ^ 2)

Method 2 - Using Hashset but with StringBuilder

Method 2 – HashSet with StringBuilder (or Mutable String)

Intuition

Using substring() repeatedly creates new immutable strings, which is inefficient. Instead, we can use a mutable string (like StringBuilder in Java, or similar in other languages) to extract substrings more efficiently, reducing memory allocations.

Approach

Use a mutable string or efficient substring extraction to avoid repeated allocations.
Use a set to track seen substrings and another for repeated ones.
Iterate over all substrings of length 10, adding to sets as in Method 1.

Code

C++

#include <string>
#include <vector>
#include <unordered_set>
using namespace std;

class Solution {
public:
	vector<string> findRepeatedDnaSequences(string s) {
		unordered_set<string> seen, repeated;
		string curr;
		for (int i = 0; i + 9 < s.size(); ++i) {
			curr = s.substr(i, 10); // substr is efficient in C++
			if (!seen.insert(curr).second) repeated.insert(curr);
		}
		return vector<string>(repeated.begin(), repeated.end());
	}
};

Go

func findRepeatedDnaSequences(s string) []string {
	seen := make(map[string]bool)
	repeated := make(map[string]bool)
	for i := 0; i+9 < len(s); i++ {
		curr := s[i : i+10]
		if seen[curr] {
			repeated[curr] = true
		} else {
			seen[curr] = true
		}
	}
	res := []string{}
	for k := range repeated {
		res = append(res, k)
	}
	return res
}

Java

import java.util.*;

class Solution {
	public List<String> findRepeatedDnaSequences(String s) {
		if (s == null || s.length() < 10) return Collections.emptyList();
		Set<String> seen = new HashSet<>();
		Set<String> result = new HashSet<>();
		StringBuilder builder = new StringBuilder(s);
		for (int i = 0; i <= s.length() - 10; i++) {
			String curr = builder.substring(i, i + 10);
			if (!seen.add(curr)) result.add(curr);
		}
		return new ArrayList<>(result);
	}
}

Kotlin

class Solution {
	fun findRepeatedDnaSequences(s: String): List<String> {
		if (s.length < 10) return emptyList()
		val seen = mutableSetOf<String>()
		val result = mutableSetOf<String>()
		val builder = StringBuilder(s)
		for (i in 0..s.length - 10) {
			val curr = builder.substring(i, i + 10)
			if (!seen.add(curr)) result.add(curr)
		}
		return result.toList()
	}
}

Python

class Solution:
	def findRepeatedDnaSequences(self, s: str) -> list[str]:
		if len(s) < 10:
			return []
		seen = set()
		result = set()
		for i in range(len(s) - 9):
			curr = s[i:i+10]
			if curr in seen:
				result.add(curr)
			else:
				seen.add(curr)
		return list(result)

Rust

use std::collections::HashSet;

impl Solution {
	pub fn find_repeated_dna_sequences(s: String) -> Vec<String> {
		if s.len() < 10 { return vec![]; }
		let mut seen = HashSet::new();
		let mut result = HashSet::new();
		let bytes = s.as_bytes();
		for i in 0..=s.len() - 10 {
			let curr = &bytes[i..i+10];
			if !seen.insert(curr) {
				result.insert(curr);
			}
		}
		result.into_iter().map(|b| String::from_utf8(b.to_vec()).unwrap()).collect()
	}
}

TypeScript

function findRepeatedDnaSequences(s: string): string[] {
	if (s.length < 10) return [];
	const seen = new Set<string>();
	const result = new Set<string>();
	for (let i = 0; i + 9 < s.length; i++) {
		const curr = s.substring(i, i + 10);
		if (seen.has(curr)) {
			result.add(curr);
		} else {
			seen.add(curr);
		}
	}
	return Array.from(result);
}

Complexity

⏰ Time complexity: O(n * L), where n is the length of s and L = 10 (substring extraction is O(L)).
🧺 Space complexity: O(n * L) for the sets.

Method 3 - Using Hashset and Bit Representation

Dry Run (Step-by-step Bit Manipulation)

Let's walk through how the rolling hash encodes a DNA substring using bit manipulation:

Suppose our substring is "CG" (for simplicity, but the same logic applies for 10 letters):

Start with v = 0 (our hash value).
For each character, shift v left by 2 bits to make space for the new character, then OR with the 2-bit encoding of the character:
- For 'C':
  - v <<= 2 → v = 00
  - v |= 01 (since 'C' = 01) → v = 01
- For 'G':
  - v <<= 2 → v = 0100
  - v |= 10 (since 'G' = 10) → v = 0110
Continue this for all 10 characters. After 10 steps, v is a 20-bit integer representing the substring.

When sliding the window to the next substring:

To remove the leftmost (oldest) character and add a new one:
- Shift v left by 2 bits: v <<= 2
- OR with the new character's bits: v |= map[s.charAt(i+9)]
- To keep only the last 20 bits (drop the oldest character), AND with a mask: v &= (1 << 20) - 1
If you want to explicitly clear the leftmost 2 bits (oldest character):
- v &= ~(3 << 18)
- Here, 3 << 18 is 0b110000... (18 zeros), so ~(3 << 18) is 0b001111... (first 2 bits zeroed)

This way, you efficiently encode and compare each 10-letter substring using only integer operations.

Method 3 – HashSet and Bit Manipulation (Rolling Hash)

Intuition

Each DNA character can be encoded in 2 bits: A=00, C=01, G=10, T=11. Thus, a 10-letter sequence can be represented as a 20-bit integer. We can use a rolling hash to efficiently encode and compare substrings, using sets to track which hashes we've seen and which are repeated.

Approach

Map each character to 2 bits.
For the first 10 characters, build the initial hash value.
For each subsequent character, update the hash by shifting and adding the new character, keeping only the last 20 bits.
Use sets to track seen and repeated hashes.
If a hash is seen again, add the substring to the result.

Code

C++

#include <string>
#include <vector>
#include <unordered_set>
#include <unordered_map>
using namespace std;

class Solution {
public:
	vector<string> findRepeatedDnaSequences(string s) {
		if (s.size() < 10) return {};
		unordered_map<char, int> enc = {{'A',0},{'C',1},{'G',2},{'T',3}};
		unordered_set<int> seen, repeated;
		vector<string> res;
		int hash = 0, mask = (1 << 20) - 1;
		for (int i = 0; i < s.size(); ++i) {
			hash = ((hash << 2) | enc[s[i]]) & mask;
			if (i >= 9) {
				if (!seen.insert(hash).second && repeated.insert(hash).second)
					res.push_back(s.substr(i-9, 10));
			}
		}
		return res;
	}
};

Go

func findRepeatedDnaSequences(s string) []string {
	if len(s) < 10 {
		return nil
	}
	enc := map[byte]int{'A':0,'C':1,'G':2,'T':3}
	seen := make(map[int]bool)
	repeated := make(map[int]bool)
	res := []string{}
	hash, mask := 0, (1<<20)-1
	for i := 0; i < len(s); i++ {
		hash = ((hash << 2) | enc[s[i]]) & mask
		if i >= 9 {
			if seen[hash] {
				if !repeated[hash] {
					res = append(res, s[i-9:i+1])
					repeated[hash] = true
				}
			} else {
				seen[hash] = true
			}
		}
	}
	return res
}

Java

import java.util.*;

class Solution {
	public List<String> findRepeatedDnaSequences(String s) {
		if (s.length() < 10) return Collections.emptyList();
		Map<Character, Integer> enc = Map.of('A',0,'C',1,'G',2,'T',3);
		Set<Integer> seen = new HashSet<>();
		Set<Integer> repeated = new HashSet<>();
		List<String> res = new ArrayList<>();
		int hash = 0, mask = (1 << 20) - 1;
		for (int i = 0; i < s.length(); i++) {
			hash = ((hash << 2) | enc.get(s.charAt(i))) & mask;
			if (i >= 9) {
				if (!seen.add(hash) && repeated.add(hash))
					res.add(s.substring(i-9, i+1));
			}
		}
		return res;
	}
}

Kotlin

class Solution {
	fun findRepeatedDnaSequences(s: String): List<String> {
		if (s.length < 10) return emptyList()
		val enc = mapOf('A' to 0, 'C' to 1, 'G' to 2, 'T' to 3)
		val seen = mutableSetOf<Int>()
		val repeated = mutableSetOf<Int>()
		val res = mutableListOf<String>()
		var hash = 0
		val mask = (1 shl 20) - 1
		for (i in s.indices) {
			hash = ((hash shl 2) or enc[s[i]]!!) and mask
			if (i >= 9) {
				if (!seen.add(hash) && repeated.add(hash))
					res.add(s.substring(i-9, i+1))
			}
		}
		return res
	}
}

Python

class Solution:
	def findRepeatedDnaSequences(self, s: str) -> list[str]:
		if len(s) < 10:
			return []
		enc = {'A':0,'C':1,'G':2,'T':3}
		seen = set()
		repeated = set()
		res = []
		hash = 0
		mask = (1 << 20) - 1
		for i, c in enumerate(s):
			hash = ((hash << 2) | enc[c]) & mask
			if i >= 9:
				if hash in seen:
					if hash not in repeated:
						res.append(s[i-9:i+1])
						repeated.add(hash)
				else:
					seen.add(hash)
		return res

Rust

use std::collections::HashSet;

impl Solution {
	pub fn find_repeated_dna_sequences(s: String) -> Vec<String> {
		if s.len() < 10 { return vec![]; }
		let enc = |c: u8| match c {
			b'A' => 0,
			b'C' => 1,
			b'G' => 2,
			b'T' => 3,
			_ => 0,
		};
		let mut seen = HashSet::new();
		let mut repeated = HashSet::new();
		let bytes = s.as_bytes();
		let mut hash = 0u32;
		let mask = (1 << 20) - 1;
		let mut res = Vec::new();
		for i in 0..bytes.len() {
			hash = ((hash << 2) | enc(bytes[i])) & mask;
			if i >= 9 {
				if !seen.insert(hash) && repeated.insert(hash) {
					res.push(s[i-9..=i].to_string());
				}
			}
		}
		res
	}
}

TypeScript

function findRepeatedDnaSequences(s: string): string[] {
	if (s.length < 10) return [];
	const enc: Record<string, number> = {A:0, C:1, G:2, T:3};
	const seen = new Set<number>();
	const repeated = new Set<number>();
	const res: string[] = [];
	let hash = 0, mask = (1 << 20) - 1;
	for (let i = 0; i < s.length; i++) {
		hash = ((hash << 2) | enc[s[i]]) & mask;
		if (i >= 9) {
			if (seen.has(hash)) {
				if (!repeated.has(hash)) {
					res.push(s.substring(i-9, i+1));
					repeated.add(hash);
				}
			} else {
				seen.add(hash);
			}
		}
	}
	return res;
}

Dry Run Example

Suppose s = "ACGTACGTAC". Let's encode the first 10 characters:

Index	Char	Bits	Hash (binary)
0	A	00	00
1	C	01	0001
2	G	10	000110
3	T	11	00011011
...	...	...	...

At each step, shift the hash left by 2 bits and OR with the new character's bits. After 10 characters, the hash represents the 10-letter substring. As you slide the window, drop the leftmost 2 bits and add the new character's bits, always keeping only the last 20 bits.

Bitwise Example (Java/Python-style pseudocode)

Suppose we want to encode the substring "ACGTACGTAC":

int hash = 0;
for (int i = 0; i < 10; i++) {
	hash <<= 2; // shift left by 2 bits
	hash |= enc[s.charAt(i)]; // add new character's bits
	// For example, after 'A' (00): hash = 0
	// After 'C' (01): hash = 0b0001
	// After 'G' (10): hash = 0b000110
	// After 'T' (11): hash = 0b00011011
	// ... and so on
}

When sliding the window, to remove the leftmost 2 bits and add a new character:

// Remove the leftmost 2 bits (keep only last 20 bits)
int mask = (1 << 20) - 1; // 20 bits set to 1: 0b11111111111111111111
hash = ((hash << 2) | enc[s.charAt(i)]) & mask;

Why (3 << 18)?

To clear the leftmost 2 bits of a 20-bit integer:

int val = ...; // some 20-bit value
val &= ~(3 << 18); // 3 = 0b11, so (3 << 18) = 0b110000... (18 zeros)
// ~(3 << 18) = 0b00111111111111111111 (first 2 bits zeroed)
// This operation keeps only the last 18 bits

This is useful if you want to manually remove the oldest character from the rolling hash.

Summary

Shift left by 2 bits to make space for the new character
OR with the new character's 2-bit encoding
AND with mask (0b111...111, 20 bits) to keep only the last 20 bits
Use sets to track which hashes have been seen and which are repeated

If a hash is seen again, the substring is repeated.

Complexity

⏰ Time complexity: O(n), where n is the length of s. Each character is processed once, and set operations are O(1).
🧺 Space complexity: O(n) for the sets.

Problem

Examples

Solution

Method 1 – Brute Force Using HashSet

Intuition

Approach

Code

C++

Go

Java

Kotlin

Python

Rust

TypeScript

Complexity

Method 2 - Using Hashset but with StringBuilder

Method 2 – HashSet with StringBuilder (or Mutable String)

Intuition

Approach

Code

C++

Go

Java

Kotlin

Python

Rust

TypeScript

Complexity

Method 3 - Using Hashset and Bit Representation

Dry Run (Step-by-step Bit Manipulation)

Method 3 – HashSet and Bit Manipulation (Rolling Hash)

Intuition

Approach

Code

C++

Go

Java

Kotlin

Python

Rust

TypeScript

Dry Run Example

Bitwise Example (Java/Python-style pseudocode)

Why (3 << 18)?

Summary

Complexity

Comments