Introduction
This article is part of the series on Find the Index of the First Occurrence in a String.
Limitations of Brute Force
Both brute force string searching and Rabin-Karp string searching are inefficient. To devise a better algorithm, we first need to understand the limitations of these methods.
Brute force string matching is inherently slow, and while the Rabin-Karp algorithm tries to improve on it with hash functions, it still shares the same worst-case time complexity of O(m*n).
We need a different approach. To identify what is wrong with brute force string searching, let’s take a closer look at its principles. In brute force matching, each character of the text is compared to the first character of the pattern. If they match, the comparison moves on to the following characters. On a mismatch, many characters that were already examined have to be re-checked, which is inefficient.
In brute force string matching, a mismatch results in reverting to characters previously compared, causing redundant comparisons. For instance, if there is a mismatch after checking the first four letters, the comparison starts again from the second letter of the text, which is redundant since we know the pattern’s starting character does not appear within the checked range.
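To make the redundancy concrete, here is a minimal sketch of the brute force search described above; the function name and structure are our own illustration, not part of the solution presented later.

def brute_force_search(text, pattern):
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        k = 0
        # Compare the pattern to the text window beginning at `start`
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            return start
        # On a mismatch, the window shifts by only one position, so
        # characters that were already examined are compared again.
    return -1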
To eliminate this redundancy, we need a more efficient method.
Overview
A solution to this problem was presented by James H. Morris and Vaughan Pratt in 1970. Their algorithm avoids many unnecessary comparisons, making it more efficient than brute force string matching.
Key features include:
- Comparisons are done from left to right.
- The preprocessing phase has O(m) space and time complexity.
- The searching phase has O(n + m) time complexity, independent of the alphabet size.
- The algorithm performs at most 2n - 1 comparisons during text scanning.
- The delay (the maximum number of comparisons made for a single text character) is bounded by m.
Let’s examine the details. The main idea is to utilize the information gathered during pattern comparisons to identify potential matches, as shown in the diagram below.
The Morris-Pratt algorithm skips certain comparisons by advancing to the next possible match location. To achieve this, the pattern must be preprocessed to determine these positions. This way, when a mismatch occurs, the algorithm knows precisely where to resume, avoiding redundant comparisons.
Generating the Table of Next Positions
This is the crucial aspect of the Morris-Pratt algorithm, which helps it overcome the limitations of brute force string searching. The following images illustrate this concept:
If the pattern consists of unique letters, upon a mismatch, we compare the next text character with the first pattern character. However, if the pattern contains repeating characters, and a mismatch occurs after a repeating character, the next potential match begins from that repeating character, as shown below:
The “next” table differs when the pattern has repeating characters. Finally, if the pattern contains multiple repeated characters, the “next” table records, for each position, where the comparison should resume.
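As a concrete illustration (an example of ours, consistent with the preprocessing code shown below): for the pattern "abcd", where every letter is unique, the table is [-1, 0, 0, 0, 0], so any mismatch falls back to the start of the pattern. For "abab", where the prefix repeats, the table is [-1, 0, 0, 1, 2], so a mismatch after matching "aba" resumes with "a" already matched instead of starting the comparison over.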
Once we have this table of potential “next” positions, we can start searching the text for our pattern.
Code
Implementing the Morris-Pratt algorithm is straightforward. First, preprocess the pattern, and then execute the search.
Java
public class Solution {
    // Preprocess the pattern and fill in the "next" table
    private void preprocessMorrisPratt(String pattern, int[] nextTable) {
        int i = 0;
        int j = nextTable[0] = -1;
        int len = pattern.length();
        while (i < len) {
            // Fall back through the table until the characters match
            while (j > -1 && pattern.charAt(i) != pattern.charAt(j)) {
                j = nextTable[j];
            }
            nextTable[++i] = ++j;
        }
    }

    // Perform a string search with the Morris-Pratt algorithm
    public int strStr(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int n = haystack.length();
        int m = needle.length();
        int[] nextTable = new int[m + 1];
        // Preprocess the pattern
        preprocessMorrisPratt(needle, nextTable);
        int i = 0; // index into the pattern (needle)
        int j = 0; // index into the text (haystack)
        while (j < n) {
            // On a mismatch, fall back in the pattern using the "next" table
            while (i > -1 && needle.charAt(i) != haystack.charAt(j)) {
                i = nextTable[i];
            }
            i++;
            j++;
            if (i >= m) {
                return j - i; // match found; return its starting index
            }
        }
        return -1; // no occurrence of the pattern
    }
}
Python
class Solution:
    # Preprocess the pattern and return the "next" table
    def preprocess_morris_pratt(self, pattern):
        m = len(pattern)
        next_table = [-1] * (m + 1)
        i, j = 0, -1
        while i < m:
            # Fall back through the table until the characters match
            while j > -1 and pattern[i] != pattern[j]:
                j = next_table[j]
            i += 1
            j += 1
            next_table[i] = j
        return next_table

    # Perform a string search with the Morris-Pratt algorithm
    def strStr(self, haystack: str, needle: str) -> int:
        if not needle:
            return 0
        n, m = len(haystack), len(needle)
        next_table = self.preprocess_morris_pratt(needle)
        i, j = 0, 0  # i indexes the pattern (needle), j indexes the text (haystack)
        while j < n:
            # On a mismatch, fall back in the pattern using the "next" table
            while i > -1 and needle[i] != haystack[j]:
                i = next_table[i]
            i += 1
            j += 1
            if i >= m:
                return j - i  # match found; return its starting index
        return -1  # no occurrence of the pattern
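A minimal usage sketch (the inputs here are our own examples):

solution = Solution()
print(solution.strStr("hello", "ll"))           # 2
print(solution.strStr("aaaaa", "bba"))          # -1
print(solution.strStr("mississippi", "issip"))  # 4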
Complexity
The Morris-Pratt algorithm requires some preprocessing time and space. Preprocessing the pattern takes O(m) time, where m is the length of the pattern, while the search takes O(m + n) time, where n is the length of the text. The advantage is that the preprocessing is done only once per pattern, so the same pattern can then be searched for in multiple texts efficiently.
The following chart illustrates the O(n + m) complexity compared to O(nm) for 5-letter patterns.
Application
Why it’s Cool
- The search complexity is O(m + n), which is faster than brute force and than Rabin-Karp in the worst case.
- It is relatively easy to implement.
Why it isn’t Cool
- It requires additional space and preprocessing time of O(m).
- It can be further optimized (e.g., with Knuth-Morris-Pratt).
Summary
The Morris-Pratt algorithm is valuable as it elegantly improves upon brute force matching. Although faster algorithms like Boyer-Moore exist, Morris-Pratt is quite useful in many cases, making understanding its principles beneficial.