Introduction
Have you ever wondered which algorithm helps find a word when you use Ctrl+F and type something? The title might have given it away, but this article delves into the process in detail.
The Morris-Pratt string searching algorithm demonstrates that it’s not necessary to compare each character of the text with the pattern, improving searching efficiency by skipping comparisons. In contrast, the Brute Force OR Naive String Search Algorithm and Rabin-Karp algorithm are slower since they match characters one by one.
The KMP algorithm preprocesses the pattern to allow for shifts by more than one character. Similarly, the Boyer-Moore algorithm preprocesses the pattern and creates arrays for two heuristics, sliding the pattern as much as possible based on these heuristics.
Even though the Morris-Pratt algorithm offers significant improvements over brute force searching, one might wonder if there’s a faster algorithm that skips more comparisons and moves the pattern more efficiently.
The Boyer-Moore algorithm, defined in 1977 by Robert S. Boyer and J Strother Moore, enhances pattern searching performance in a text by leveraging specific observations.
The Boyer-Moore string searching algorithm is more efficient than Morris-Pratt for finding a pattern of length “m” in a text of length “n”.
Algorithm
Basic
- When searching through text t for pattern s:
- If a character mismatch occurs (bad character rule):
- Check if that character is present elsewhere in s:
- If so, align that character in s with the mismatch in t.
- If not, align the start of s with the next character in t and begin the next check.
- Check if that character is present elsewhere in s:
- If characters match but later a mismatch occurs (good suffix rule):
- Search through s for the occurrence of the matched substring:
- If found, align s with the substring.
- If not, align the start of s with the next character in t and begin the next check.
- Search through s for the occurrence of the matched substring:
- If a character mismatch occurs (bad character rule):
The Boyer-Moore algorithm effectively reduces unnecessary comparisons and can move the pattern faster, making it an efficient choice for string searching tasks.
Shift Direction
This algorithm starts by comparing the pattern from the leftmost part of the text and shifts it to the right, as illustrated:
+---------+---+---+---+---+---+---+---+---+---+
| text |
+---------+---+---+---+---+---+---+---+---+---+
+---------+---+---+---+
| pattern |
+---------+---+---+---+
-> shift pattern
Unlike other string searching algorithms though, Boyer-Moore compares the pattern against a possible match from right to left as shown below.
+---------+---+---+---+---+---+---+---+---+---+
| text |
+---------+---+---+---+---+---+---+---+---+---+
+---------+---+---+---+
| pattern |
+---------+---+---+---+
<-shift pattern as per rule
Unlike other algorithms the letters of the pattern are compared from right to left!
The main idea of Boyer-Moore in order to improve the performance are some observations of the pattern. In the terminology of this algorithm they are called good-suffix and bad-character shifts. Let’s see by the following examples what they are standing for.
The Boyer-Moore algorithm is an efficient string-searching algorithm that uses two primary heuristics: the bad character heuristic and the good suffix heuristic. Here, we’ll focus on the good suffix heuristic.
Good Suffix Heuristic
The good suffix heuristic allows the pattern to be shifted more efficiently when a mismatch occurs after part of the pattern has matched a suffix of the text.
How It Works:
- Suffix Identification: When a mismatch occurs after some characters of the pattern have matched a portion of the text, this matched portion is referred to as a “suffix”.
- Shift Determination: The pattern is then shifted such that the next occurrence of this suffix in the pattern aligns with the matched portion of the text. If the suffix does not appear elsewhere in the pattern, the entire pattern is shifted past the mismatched character.
The good suffix heuristic helps avoid unnecessary comparisons by making use of the information from the previous matches.
Example Illustration
Consider the text T
and the pattern P
:
Text (T): x x a b c a b a b a
Pattern (P): a b c a b c y
^
mismatch
Pseudocode for Good Suffix Heuristic
Here is the general approach to applying the good suffix heuristic:
- Identify the Good Suffix:
The part of the pattern that matches the text before the mismatch is
abcab
. When the first mismatch occurs aty
(pattern) vsa
(text), we identifyab
as the good suffix. - Shift the Pattern:
- Find
ab
in the pattern from left to right before the mismatch. - If found, align this occurrence of
ab
in the pattern with the matched part in the text. - If not found, shift the pattern past the mismatch.
- Find
Example of Good Suffix Table Creation
void preprocessGoodSuffix(char[] pattern, int m, int[] goodSuffixTable) {
int lastPrefixPosition = m;
for (int i = m - 1; i >= 0; i--) {
if (isPrefix(pattern, i + 1)) {
lastPrefixPosition = i + 1;
}
goodSuffixTable[m - 1 - i] = lastPrefixPosition + (m - 1 - i);
}
for (int i = 0; i < m - 1; i++) {
int slen = suffixLength(pattern, i);
goodSuffixTable[slen] = m - 1 - i + slen;
}
}
boolean isPrefix(char[] pattern, int p) {
int suffixLength = pattern.length - p;
for (int i = 0; i < suffixLength; i++) {
if (pattern[i] != pattern[p + i]) {
return false;
}
}
return true;
}
int suffixLength(char[] pattern, int p) {
int len = 0;
for (int i = p, j = pattern.length - 1; i >= 0 && pattern[i] == pattern[j]; i--, j--) {
len++;
}
return len;
}
Bad Character Heuristic
The bad character heuristic is used to determine how far the pattern should be shifted when a mismatch occurs based on the presence of mismatched characters within the pattern.
How it works
- Character Comparison: Compare the characters of the pattern from right to left against the text.
- Mismatch Handling: Upon a mismatch, identify the “bad character” in the text that caused the mismatch.
- Shift Determination: Look at the position of the bad character in the pattern. If the bad character exists in the pattern:
- Shift the pattern such that the bad character in the text aligns with the last occurrence of that character in the pattern.
- If the bad character does not exist in the pattern:
- Shift the pattern past the mismatched character.
This technique reduces unnecessary comparisons by skipping portions of the text that cannot match the pattern.
Example Illustration
Consider the text T
and the pattern P
:
Text (T): x x a b c a b a b a
Pattern (P): a b c b a d
^
mismatch at 'b' in text with 'd' in pattern
- Identify the Bad Character:
- The mismatched character in the text is
b
.
- The mismatched character in the text is
- Determine the Shift:
- Check the pattern to see the last occurrence of
b
. - Align this occurrence with the
b
in the text.
- Check the pattern to see the last occurrence of
Bad Character Table Creation
The bad character table indicates the last occurrence of each character in the pattern. It is an array where the index represents the character and the value represents the last position of that character in the pattern.
Example
Let’s create a bad character table for the pattern P = "abcbad"
:
Pattern (P): a b c b a d
Last occurrence:
a -> 4
b -> 3
c -> 2
d -> 5
Constructing the Bad Character Table in Code
int[] buildBadCharacterTable(String pattern) {
final int ALPHABET_SIZE = 256; // Assuming ASCII character set
int[] badCharTable = new int[ALPHABET_SIZE];
int m = pattern.length();
for (int i = 0; i < ALPHABET_SIZE; i++) {
badCharTable[i] = -1; // Initialize all occurrences as -1
}
for (int i = 0; i < m; i++) {
badCharTable[pattern.charAt(i)] = i; // Fill the actual value of last occurrence
}
return badCharTable;
}
Example of Bad Character Heuristic
Consider the text ababcab
and the pattern abc
:
Text (T): a b a b c a b
Pattern (P): a b c
Start comparing from right:
Mismatch at 'a'
Bad character table for pattern "abc":
a -> 0
b -> 1
c -> 2
Mismatch at 'a' in text:
Shift pattern based on last occurrence of 'a'.
Shift pattern to align 'a' in pattern with 'a' in text.
Key Points
- Efficiency: The bad character heuristic improves efficiency by reducing unnecessary comparisons.
- Optimal Shifts: It suggests optimal shifts by aligning the mismatched character in the text with its last occurrence in the pattern.
Maximum of Good-suffix and Bad-Character shifts
Boyer-Moore needs both good-suffix and bad-character shifts in order to speed up searching performance. After a mismatch the maximum of both is considered in order to move the pattern to the right.
Complexity
Boyer-Moore is evidently faster than Morris-Pratt, although its worst-case complexity is O(n + m). It performs exceptionally well in natural language searches.
Code
Java
public class Solution {
public int strStr(String haystack, String needle) {
if (needle.length() == 0) return 0;
if (haystack.length() == 0 || needle.length() > haystack.length()) return -1;
int[] badChar = buildBadCharacterTable(needle);
int[] goodSuffix = buildGoodSuffixTable(needle);
int n = haystack.length();
int m = needle.length();
int s = 0;
while (s <= n - m) {
int j = m - 1;
while (j >= 0 && needle.charAt(j) == haystack.charAt(s + j)) {
j--;
}
if (j < 0) {
return s;
} else {
s += Math.max(goodSuffix[j], j - badChar[haystack.charAt(s + j)]);
}
}
return -1;
}
private int[] buildBadCharacterTable(String pattern) {
final int ALPHABET_SIZE = 256;
int[] badChar = new int[ALPHABET_SIZE];
for (int i = 0; i < ALPHABET_SIZE; i++) {
badChar[i] = -1;
}
for (int i = 0; i < pattern.length(); i++) {
badChar[pattern.charAt(i)] = i;
}
return badChar;
}
private int[] buildGoodSuffixTable(String pattern) {
int m = pattern.length();
int[] suffixes = new int[m];
int[] goodSuffix = new int[m];
preprocessSuffixes(pattern, suffixes);
for (int i = 0; i < m; i++) {
goodSuffix[i] = m;
}
int j = 0;
for (int i = m - 1; i >= 0; i--) {
if (suffixes[i] == i + 1) {
for (; j < m - 1 - i; j++) {
if (goodSuffix[j] == m) {
goodSuffix[j] = m - 1 - i;
}
}
}
}
for (int i = 0; i <= m - 2; i++) {
goodSuffix[m - 1 - suffixes[i]] = m - 1 - i;
}
return goodSuffix;
}
private void preprocessSuffixes(String pattern, int[] suffixes) {
int m = pattern.length();
suffixes[m - 1] = m;
int g = m - 1;
int f = 0;
for (int i = m - 2; i >= 0; i--) {
if (i > g && suffixes[i + m - 1 - f] < i - g) {
suffixes[i] = suffixes[i + m - 1 - f];
} else {
if (i < g) {
g = i;
}
f = i;
while (g >= 0 && pattern.charAt(g) == pattern.charAt(g + m - 1 - f)) {
g--;
}
suffixes[i] = f - g;
}
}
}
}
Python
class Solution:
def strStr(self, haystack: str, needle: str) -> int:
if len(needle) == 0:
return 0
if len(haystack) == 0 or len(needle) > len(haystack):
return -1
badChar = self.buildBadCharacterTable(needle)
goodSuffix = self.buildGoodSuffixTable(needle)
n = len(haystack)
m = len(needle)
s = 0
while s <= n - m:
j = m - 1
while j >= 0 and needle[j] == haystack[s + j]:
j -= 1
if j < 0:
return s
else:
s += max(goodSuffix[j], j - badChar.get(haystack[s + j], -1))
return -1
def buildBadCharacterTable(self, pattern):
badChar = {}
for i in range(len(pattern)):
badChar[pattern[i]] = i
return badChar
def buildGoodSuffixTable(self, pattern):
m = len(pattern)
suffixes = [0] * m
goodSuffix = [0] * m
self.preprocessSuffixes(pattern, suffixes)
for i in range(m):
goodSuffix[i] = m
j = 0
for i in range(m - 1, -1, -1):
if suffixes[i] == i + 1:
for j in range(m - 1 - i):
if goodSuffix[j] == m:
goodSuffix[j] = m - 1 - i
for i in range(m - 1):
goodSuffix[m - 1 - suffixes[i]] = m - 1 - i
return goodSuffix
def preprocessSuffixes(self, pattern, suffixes):
m = len(pattern)
suffixes[m - 1] = m
g = m - 1
f = 0
for i in range(m - 2, -1, -1):
if i > g and suffixes[i + m - 1 - f] < i - g:
suffixes[i] = suffixes[i + m - 1 - f]
else:
if i < g:
g = i
f = i
while g >= 0 and pattern[g] == pattern[g + m - 1 - f]:
g -= 1
suffixes[i] = f - g
Pros and Cons
Pros
The Boyer-Moore algorithm is almost always faster than brute force and KMP. It typically skips many characters without checking them and scans the pattern right-to-left.
Application
Boyer-Moore is one of the most commonly used string searching algorithms in practice. It is widely recognized for its effectiveness, particularly in search and replace operations within text editors.