Introduction
This article is part of the series on Find the Index of the First Occurrence in a String.
Limitations of Brute Force
Both brute force string searching and Rabin-Karp string searching are inefficient. To devise a better algorithm, we first need to understand the limitations of these methods.
Brute force string matching is inherently slow, and while the Rabin-Karp algorithm tries to improve on it with hash functions, it still shares the same worst-case time complexity of O(m*n).
We need a different approach. To identify what is wrong with brute force string searching, let’s take a closer look at its principles. In brute force matching, each character of the text is compared to the first character of the pattern. If they match, the comparison moves on to the following characters. On a mismatch, many characters that were already examined have to be re-checked, which is inefficient.
In brute force string matching, a mismatch results in reverting to characters previously compared, causing redundant comparisons. For instance, if there is a mismatch after checking the first four letters, the comparison starts again from the second letter of the text, which is redundant since we know the pattern’s starting character does not appear within the checked range.
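To make the redundancy concrete, here is a minimal sketch of the brute force search described above; the function name and structure are our own illustration, not part of the solution presented later.

def brute_force_search(text, pattern):
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        k = 0
        # Compare the pattern to the text window beginning at `start`
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            return start
        # On a mismatch, the window shifts by only one position, so
        # characters that were already examined are compared again.
    return -1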
To eliminate this redundancy, we need a more efficient method.
Overview
A solution to this problem was presented by James H. Morris and Vaughan Pratt in 1970. Their algorithm avoids many unnecessary comparisons, making it more efficient than brute force string matching.
Key features include:
- Comparisons are done from left to right.
- The preprocessing phase has O(m) space and time complexity.
- The searching phase has O(n + m) time complexity, independent of the alphabet size.
- The algorithm performs at most 2n - 1 comparisons during text scanning.
- The delay (the maximum number of comparisons made for a single text character) is bounded by m.
Let’s examine the details. The main idea is to utilize the information gathered during pattern comparisons to identify potential matches, as shown in the diagram below.
The Morris-Pratt algorithm skips certain comparisons by advancing to the next possible match location. To achieve this, the pattern must be preprocessed to determine these positions. This way, when a mismatch occurs, the algorithm knows precisely where to resume, avoiding redundant comparisons.
Generating the Table of Next Positions
This is the crucial aspect of the Morris-Pratt algorithm, which helps it overcome the limitations of brute force string searching. The following images illustrate this concept:
If the pattern consists of unique letters, upon a mismatch, we compare the next text character with the first pattern character. However, if the pattern contains repeating characters, and a mismatch occurs after a repeating character, the next potential match begins from that repeating character, as shown below:
The “next” table differs when the pattern has repeating characters. Finally, if the pattern contains multiple repeated characters, the “next” table records, for each position, where the comparison should resume.
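As a concrete illustration (an example of ours, consistent with the preprocessing code shown below): for the pattern "abcd", where every letter is unique, the table is [-1, 0, 0, 0, 0], so any mismatch falls back to the start of the pattern. For "abab", where the prefix repeats, the table is [-1, 0, 0, 1, 2], so a mismatch after matching "aba" resumes with "a" already matched instead of starting the comparison over.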
Once we have this table of potential “next” positions, we can start searching the text for our pattern.
Code
Implementing the Morris-Pratt algorithm is straightforward. First, preprocess the pattern, and then execute the search.
Java
public class Solution {
    // Preprocess the pattern and fill in the "next" table
    private void preprocessMorrisPratt(String pattern, int[] nextTable) {
        int i = 0;
        int j = nextTable[0] = -1;
        int len = pattern.length();
        while (i < len) {
            // Fall back through the table until the characters match
            while (j > -1 && pattern.charAt(i) != pattern.charAt(j)) {
                j = nextTable[j];
            }
            nextTable[++i] = ++j;
        }
    }

    // Perform a string search with the Morris-Pratt algorithm
    public int strStr(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int n = haystack.length();
        int m = needle.length();
        int[] nextTable = new int[m + 1];
        // Preprocess the pattern
        preprocessMorrisPratt(needle, nextTable);
        int i = 0; // index into the pattern (needle)
        int j = 0; // index into the text (haystack)
        while (j < n) {
            // On a mismatch, fall back in the pattern using the "next" table
            while (i > -1 && needle.charAt(i) != haystack.charAt(j)) {
                i = nextTable[i];
            }
            i++;
            j++;
            if (i >= m) {
                return j - i; // match found; return its starting index
            }
        }
        return -1; // no occurrence of the pattern
    }
}
Python
class Solution:
    # Preprocess the pattern and return the "next" table
    def preprocess_morris_pratt(self, pattern):
        m = len(pattern)
        next_table = [-1] * (m + 1)
        i, j = 0, -1
        while i < m:
            # Fall back through the table until the characters match
            while j > -1 and pattern[i] != pattern[j]:
                j = next_table[j]
            i += 1
            j += 1
            next_table[i] = j
        return next_table

    # Perform a string search with the Morris-Pratt algorithm
    def strStr(self, haystack: str, needle: str) -> int:
        if not needle:
            return 0
        n, m = len(haystack), len(needle)
        next_table = self.preprocess_morris_pratt(needle)
        i, j = 0, 0  # i indexes the pattern (needle), j indexes the text (haystack)
        while j < n:
            # On a mismatch, fall back in the pattern using the "next" table
            while i > -1 and needle[i] != haystack[j]:
                i = next_table[i]
            i += 1
            j += 1
            if i >= m:
                return j - i  # match found; return its starting index
        return -1  # no occurrence of the pattern
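A minimal usage sketch (the inputs here are our own examples):

solution = Solution()
print(solution.strStr("hello", "ll"))           # 2
print(solution.strStr("aaaaa", "bba"))          # -1
print(solution.strStr("mississippi", "issip"))  # 4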
Complexity
The Morris-Pratt algorithm requires some preprocessing time and space. Preprocessing the pattern takes O(m) time, where m is the length of the pattern, while the search takes O(m + n) time, where n is the length of the text. The advantage is that the preprocessing is done only once per pattern, so the same pattern can then be searched for in multiple texts efficiently.
The following chart illustrates the O(n + m) complexity compared to O(nm) for 5-letter patterns.
Application
Why it’s Cool
- The search complexity is O(m + n), which is faster than brute force and than Rabin-Karp in the worst case.
- It is relatively easy to implement.
Why it isn’t Cool
- It requires additional space and preprocessing time of O(m).
- It can be further optimized (e.g., with Knuth-Morris-Pratt).
Summary
The Morris-Pratt algorithm is valuable as it elegantly improves upon brute force matching. Although faster algorithms like Boyer-Moore exist, Morris-Pratt is quite useful in many cases, making understanding its principles beneficial.