Introduction

Have you ever wondered which algorithm helps find a word when you use Ctrl+F and type something? The title might have given it away, but this article delves into the process in detail.

The Morris-Pratt string searching algorithm demonstrates that it’s not necessary to compare each character of the text with the pattern, improving searching efficiency by skipping comparisons. In contrast, the Brute Force OR Naive String Search Algorithm and Rabin-Karp algorithm are slower since they match characters one by one.

The KMP algorithm preprocesses the pattern to allow for shifts by more than one character. Similarly, the Boyer-Moore algorithm preprocesses the pattern and creates arrays for two heuristics, sliding the pattern as much as possible based on these heuristics.

Even though the Morris-Pratt algorithm offers significant improvements over brute force searching, one might wonder if there’s a faster algorithm that skips more comparisons and moves the pattern more efficiently.

The Boyer-Moore algorithm, defined in 1977 by Robert S. Boyer and J Strother Moore, enhances pattern searching performance in a text by leveraging specific observations.

The Boyer-Moore string searching algorithm is more efficient than Morris-Pratt for finding a pattern of length “m” in a text of length “n”.

Algorithm

Basic

  • When searching through text t for pattern s:
    • If a character mismatch occurs (bad character rule):
      • Check if that character is present elsewhere in s:
        • If so, align that character in s with the mismatch in t.
        • If not, align the start of s with the next character in t and begin the next check.
    • If characters match but later a mismatch occurs (good suffix rule):
      • Search through s for the occurrence of the matched substring:
        • If found, align s with the substring.
        • If not, align the start of s with the next character in t and begin the next check.

The Boyer-Moore algorithm effectively reduces unnecessary comparisons and can move the pattern faster, making it an efficient choice for string searching tasks.

Shift Direction

This algorithm starts by comparing the pattern from the leftmost part of the text and shifts it to the right, as illustrated:

 +---------+---+---+---+---+---+---+---+---+---+
 | text                                        |
 +---------+---+---+---+---+---+---+---+---+---+
  
 +---------+---+---+---+
 | pattern             |
 +---------+---+---+---+
   -> shift pattern

Unlike other string searching algorithms though, Boyer-Moore compares the pattern against a possible match from right to left as shown below.

 +---------+---+---+---+---+---+---+---+---+---+
 | text                                        |
 +---------+---+---+---+---+---+---+---+---+---+
  
                 +---------+---+---+---+
                  | pattern             |
                  +---------+---+---+---+
                  <-shift pattern as per rule

Unlike other algorithms the letters of the pattern are compared from right to left!

The main idea of Boyer-Moore in order to improve the performance are some observations of the pattern. In the terminology of this algorithm they are called good-suffix and bad-character shifts. Let’s see by the following examples what they are standing for.

The Boyer-Moore algorithm is an efficient string-searching algorithm that uses two primary heuristics: the bad character heuristic and the good suffix heuristic. Here, we’ll focus on the good suffix heuristic.

Good Suffix Heuristic

The good suffix heuristic allows the pattern to be shifted more efficiently when a mismatch occurs after part of the pattern has matched a suffix of the text.

How It Works:

  1. Suffix Identification: When a mismatch occurs after some characters of the pattern have matched a portion of the text, this matched portion is referred to as a “suffix”.
  2. Shift Determination: The pattern is then shifted such that the next occurrence of this suffix in the pattern aligns with the matched portion of the text. If the suffix does not appear elsewhere in the pattern, the entire pattern is shifted past the mismatched character.

The good suffix heuristic helps avoid unnecessary comparisons by making use of the information from the previous matches.

Example Illustration

Consider the text T and the pattern P:

Text (T):        x  x  a  b  c  a  b  a  b  a
Pattern (P):     a  b  c  a  b  c  y
                    ^
                mismatch

Pseudocode for Good Suffix Heuristic

Here is the general approach to applying the good suffix heuristic:

  1. Identify the Good Suffix: The part of the pattern that matches the text before the mismatch is abcab. When the first mismatch occurs at y (pattern) vs a (text), we identify ab as the good suffix.
  2. Shift the Pattern:
    • Find ab in the pattern from left to right before the mismatch.
    • If found, align this occurrence of ab in the pattern with the matched part in the text.
    • If not found, shift the pattern past the mismatch.
Example of Good Suffix Table Creation
void preprocessGoodSuffix(char[] pattern, int m, int[] goodSuffixTable) {
    int lastPrefixPosition = m;
    
    for (int i = m - 1; i >= 0; i--) {
        if (isPrefix(pattern, i + 1)) {
            lastPrefixPosition = i + 1;
        }
        goodSuffixTable[m - 1 - i] = lastPrefixPosition + (m - 1 - i);
    }
    
    for (int i = 0; i < m - 1; i++) {
        int slen = suffixLength(pattern, i);
        goodSuffixTable[slen] = m - 1 - i + slen;
    }
}

boolean isPrefix(char[] pattern, int p) {
    int suffixLength = pattern.length - p;
    for (int i = 0; i < suffixLength; i++) {
        if (pattern[i] != pattern[p + i]) {
            return false;
        }
    }
    return true;
}

int suffixLength(char[] pattern, int p) {
    int len = 0;
    for (int i = p, j = pattern.length - 1; i >= 0 && pattern[i] == pattern[j]; i--, j--) {
        len++;
    }
    return len;
}

Bad Character Heuristic

The bad character heuristic is used to determine how far the pattern should be shifted when a mismatch occurs based on the presence of mismatched characters within the pattern.

How it works

  1. Character Comparison: Compare the characters of the pattern from right to left against the text.
  2. Mismatch Handling: Upon a mismatch, identify the “bad character” in the text that caused the mismatch.
  3. Shift Determination: Look at the position of the bad character in the pattern. If the bad character exists in the pattern:
    • Shift the pattern such that the bad character in the text aligns with the last occurrence of that character in the pattern.
    • If the bad character does not exist in the pattern:
      • Shift the pattern past the mismatched character.

This technique reduces unnecessary comparisons by skipping portions of the text that cannot match the pattern.

Example Illustration

Consider the text T and the pattern P:

Text (T):        x  x  a  b  c  a  b  a  b  a
Pattern (P):     a  b  c  b  a  d
                        ^
                mismatch at 'b' in text with 'd' in pattern
  1. Identify the Bad Character:
    • The mismatched character in the text is b.
  2. Determine the Shift:
    • Check the pattern to see the last occurrence of b.
    • Align this occurrence with the b in the text.

Bad Character Table Creation

The bad character table indicates the last occurrence of each character in the pattern. It is an array where the index represents the character and the value represents the last position of that character in the pattern.

Example

Let’s create a bad character table for the pattern P = "abcbad":

Pattern (P):  a  b  c  b  a  d

Last occurrence:
a -> 4
b -> 3
c -> 2
d -> 5

Constructing the Bad Character Table in Code

int[] buildBadCharacterTable(String pattern) {
    final int ALPHABET_SIZE = 256; // Assuming ASCII character set
    int[] badCharTable = new int[ALPHABET_SIZE];
    int m = pattern.length();

    for (int i = 0; i < ALPHABET_SIZE; i++) {
        badCharTable[i] = -1; // Initialize all occurrences as -1
    }
    for (int i = 0; i < m; i++) {
        badCharTable[pattern.charAt(i)] = i; // Fill the actual value of last occurrence
    }
    return badCharTable;
}

Example of Bad Character Heuristic

Consider the text ababcab and the pattern abc:

Text (T):    a  b  a  b  c  a  b
Pattern (P): a  b  c

Start comparing from right:
Mismatch at 'a'

Bad character table for pattern "abc":
a -> 0
b -> 1
c -> 2

Mismatch at 'a' in text:
Shift pattern based on last occurrence of 'a'.
Shift pattern to align 'a' in pattern with 'a' in text.

Key Points

  • Efficiency: The bad character heuristic improves efficiency by reducing unnecessary comparisons.
  • Optimal Shifts: It suggests optimal shifts by aligning the mismatched character in the text with its last occurrence in the pattern.

Maximum of Good-suffix and Bad-Character shifts

Boyer-Moore needs both good-suffix and bad-character shifts in order to speed up searching performance. After a mismatch the maximum of both is considered in order to move the pattern to the right.

Complexity

Boyer-Moore is evidently faster than Morris-Pratt, although its worst-case complexity is O(n + m). It performs exceptionally well in natural language searches.

Code

Java

public class Solution {
    public int strStr(String haystack, String needle) {
        if (needle.length() == 0) return 0;
        if (haystack.length() == 0 || needle.length() > haystack.length()) return -1;

        int[] badChar = buildBadCharacterTable(needle);
        int[] goodSuffix = buildGoodSuffixTable(needle);

        int n = haystack.length();
        int m = needle.length();
        int s = 0;

        while (s <= n - m) {
            int j = m - 1;

            while (j >= 0 && needle.charAt(j) == haystack.charAt(s + j)) {
                j--;
            }

            if (j < 0) {
                return s;
            } else {
                s += Math.max(goodSuffix[j], j - badChar[haystack.charAt(s + j)]);
            }
        }
        return -1;
    }

    private int[] buildBadCharacterTable(String pattern) {
        final int ALPHABET_SIZE = 256;
        int[] badChar = new int[ALPHABET_SIZE];
        for (int i = 0; i < ALPHABET_SIZE; i++) {
            badChar[i] = -1;
        }
        for (int i = 0; i < pattern.length(); i++) {
            badChar[pattern.charAt(i)] = i;
        }
        return badChar;
    }

    private int[] buildGoodSuffixTable(String pattern) {
        int m = pattern.length();
        int[] suffixes = new int[m];
        int[] goodSuffix = new int[m];

        preprocessSuffixes(pattern, suffixes);

        for (int i = 0; i < m; i++) {
            goodSuffix[i] = m;
        }

        int j = 0;
        for (int i = m - 1; i >= 0; i--) {
            if (suffixes[i] == i + 1) {
                for (; j < m - 1 - i; j++) {
                    if (goodSuffix[j] == m) {
                        goodSuffix[j] = m - 1 - i;
                    }
                }
            }
        }

        for (int i = 0; i <= m - 2; i++) {
            goodSuffix[m - 1 - suffixes[i]] = m - 1 - i;
        }

        return goodSuffix;
    }

    private void preprocessSuffixes(String pattern, int[] suffixes) {
        int m = pattern.length();
        suffixes[m - 1] = m;
        int g = m - 1;
        int f = 0;

        for (int i = m - 2; i >= 0; i--) {
            if (i > g && suffixes[i + m - 1 - f] < i - g) {
                suffixes[i] = suffixes[i + m - 1 - f];
            } else {
                if (i < g) {
                    g = i;
                }
                f = i;
                while (g >= 0 && pattern.charAt(g) == pattern.charAt(g + m - 1 - f)) {
                    g--;
                }
                suffixes[i] = f - g;
            }
        }
    }
}

Python

class Solution:
    def strStr(self, haystack: str, needle: str) -> int:
        if len(needle) == 0:
            return 0
        if len(haystack) == 0 or len(needle) > len(haystack):
            return -1

        badChar = self.buildBadCharacterTable(needle)
        goodSuffix = self.buildGoodSuffixTable(needle)

        n = len(haystack)
        m = len(needle)
        s = 0

        while s <= n - m:
            j = m - 1

            while j >= 0 and needle[j] == haystack[s + j]:
                j -= 1

            if j < 0:
                return s
            else:
                s += max(goodSuffix[j], j - badChar.get(haystack[s + j], -1))

        return -1

    def buildBadCharacterTable(self, pattern):
        badChar = {}
        for i in range(len(pattern)):
            badChar[pattern[i]] = i
        return badChar

    def buildGoodSuffixTable(self, pattern):
        m = len(pattern)
        suffixes = [0] * m
        goodSuffix = [0] * m
        self.preprocessSuffixes(pattern, suffixes)

        for i in range(m):
            goodSuffix[i] = m

        j = 0
        for i in range(m - 1, -1, -1):
            if suffixes[i] == i + 1:
                for j in range(m - 1 - i):
                    if goodSuffix[j] == m:
                        goodSuffix[j] = m - 1 - i

        for i in range(m - 1):
            goodSuffix[m - 1 - suffixes[i]] = m - 1 - i

        return goodSuffix

    def preprocessSuffixes(self, pattern, suffixes):
        m = len(pattern)
        suffixes[m - 1] = m
        g = m - 1
        f = 0

        for i in range(m - 2, -1, -1):
            if i > g and suffixes[i + m - 1 - f] < i - g:
                suffixes[i] = suffixes[i + m - 1 - f]
            else:
                if i < g:
                    g = i
                f = i
                while g >= 0 and pattern[g] == pattern[g + m - 1 - f]:
                    g -= 1
                suffixes[i] = f - g

Pros and Cons

Pros

The Boyer-Moore algorithm is almost always faster than brute force and KMP. It typically skips many characters without checking them and scans the pattern right-to-left.

Application

Boyer-Moore is one of the most commonly used string searching algorithms in practice. It is widely recognized for its effectiveness, particularly in search and replace operations within text editors.