This is part of Find the Index of the First Occurrence in a String.
Introduction
The Knuth-Morris-Pratt (KMP) algorithm is an extension and improvement of the Morris-Pratt algorithm, developed jointly by Donald Knuth, James H. Morris, and Vaughan R. Pratt and published in 1977. It operates in linear time by leveraging the information within the pattern itself to determine where to resume the search after a mismatch.
The key improvement in the KMP algorithm over the Morris-Pratt algorithm is the modification of the prefix function to handle mismatches more efficiently during the search phase.
In addition to the prefix function, the KMP algorithm constructs a shift table that helps to determine how much the pattern should be shifted when a mismatch occurs, minimizing unnecessary comparisons.
The worst case complexity of the Naive algorithm is O(m(n-m+1))
. The time complexity of KMP algorithm is O(n)
in the worst case.
The KMP matching algorithm uses degenerating property (pattern having same sub-patterns appearing more than once in the pattern) of the pattern and improves the worst case complexity to O(n). The basic idea behind KMP’s algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of next window. We take advantage of this information to avoid matching the characters that we know will anyway match.
This is where KMP does optimization over Naive. In this second window, we only compare fourth A of pattern with fourth character of current window of text to decide whether current window matches or not. Since we know first three characters will anyway match, we skipped matching first three characters.
Need of Preprocessing?
An important question arises from above explanation, how to know how many characters to be skipped. To know this, we pre-process pattern and prepare an integer array lps[]
that tells us count of characters to be skipped.
Preprocessing Overview
- KMP algorithm does pre-proceses
pat[]
and constructs an auxiliarylps[]
orprefix[]
of size m (same as size of pattern) which is used to skip characters while matching. - name
lps
indicates longest proper prefix which is also suffix.. A proper prefix is prefix with whole string not allowed.
Example 1
String: ABC
proper prefixes: "",A, AB, ABC
Suffix: "", C, BC, ABC
Example 2
String: ABCDEFABCR
Proper Prefixes: "", A, AB, ABC, ABCD, ABCDE, ABCDEF, ABCDEFA, ABCDEFAB, ABCDEFABC
Suffix: "", R, CR, BCR, ABCR, FABCR, EFABCR, DEFABCR, CDEFABCR, BCDEFABCR
The lps[]
array helps in determining the new positions to start comparisons from, thereby optimizing the search process.
Generating the lps[]
Array
For each sub-pattern pat[0..i]
where i = 0
to m-1
, lps[i]
stores length of the maximum matching proper prefix which is also a suffix of the sub-pattern pat[0..i]
.
lps[i] = the longest proper prefix of `pat[0..i]` which is also a suffix of `pat[0..i]`.
We create the lps[]
array by comparing characters within the pattern:
Procedure GenerateLPSArray(Pattern):
i := 1
j := 0
lps[0] := 0
while i < Pattern.length
if Pattern[i] == Pattern[j]
j := j + 1
lps[i] := j
i := i + 1
else
if j != 0
j := lps[j - 1]
else
lps[i] := 0
i := i + 1
Example from pattern “abcdabca”
+---------+---+---+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+
| lps[] | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 |
+---------+---+---+---+---+---+---+---+---+
Filling in
+---------+---+---+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+
Considering the positions of j
and i
:
j i
+---------+---+---+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+
When Pattern[j]
and Pattern[i]
don’t match, we increment i
. Since j
is 0, we set Pattern[i]
to 0 without checking the previous value. Continuing this process, when i
becomes 4, a match occurs, so we set S[i]
to S[4]
, which is j + 1 = 1
, and increment both j
and i
. The array at this stage is:
j i
+---------+---+---+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+
| S | 0 | 0 | 0 | 0 | 1 | | | |
+---------+---+---+---+---+---+---+---+---+
As Pattern[1]
matches Pattern[5]
, we set S[i]
to S[5]
, which is j + 1 = 2
. Moving forward, j = 3
and i = 7
leads to a mismatch. Since j
is not 0, we set j
to S[j-1]
and compare Pattern[i]
with Pattern[j]
. If they match, S[i]
becomes j + 1
. The final array thus becomes:
+---------+---+---+---+---+---+---+---+---+
| S | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 |
+---------+---+---+---+---+---+---+---+---+
This array indicates the length of the longest suffix which is also a prefix in the substring (from 0 to i
). A nonzero S[i]
value signifies this, and the next comparison will start from S[i]
+ 1 in the pattern.
Finding matches OR Search Algorithm
We are now ready to find occurrences of the pattern in the string.
How to use lps[]
to determine the next positions (or the number of characters to skip):
- Start by comparing
pat[j]
(withj = 0
) with characters of the current text window. - Continue matching characters
txt[i]
andpat[j]
; increment bothi
andj
while they keep matching. - Upon a mismatch:
- Recognize that characters
pat[0..j-1]
matchtxt[i-j+1…i-1]
(note thatj
starts at 0 and increments only on matches). - The value
lps[j-1]
indicates the count of characters inpat[0…j-1]
that are both proper prefixes and suffixes. - Therefore, do not re-match these
lps[j-1]
characters withtxt[i-j…i-1]
since we know they will match. Let’s see an example to understand this concept.
- Recognize that characters
Example
Now let’s do a substring search using the following example:
+---------+---+---+---+---+---+---+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |
+---------+---+---+---+---+---+---+---+---+---+---+---+---+
| Text | a | b | x | a | b | c | a | b | c | a | b | y |
+---------+---+---+---+---+---+---+---+---+---+---+---+---+
+---------+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 |
+---------+---+---+---+---+---+---+
| Pattern | a | b | c | a | b | y |
+---------+---+---+---+---+---+---+
| LPS | 0 | 0 | 0 | 1 | 2 | 0 |
+---------+---+---+---+---+---+---+
In this example, we have a text, a pattern, and a pre-calculated lps[]
array. First, match Text[0]
with Pattern[0]
, which match. Similarly, Text[1]
matches Pattern[1]
. However, Text[2]
does not match Pattern[2]
.
We check lps[1]
, which is 0, indicating no matching suffix-prefix in the substring, so we start at Pattern[0]
again for Text[2]
. Upon reaching Text[3]
, it matches Pattern[0]
, and this continues until Text[8]
matches Pattern[5]
. Upon a mismatch, we refer to lps[4]
, finding the value 2
, meaning a matching prefix that is a suffix (abcab ends in ab). Resume comparison from Pattern[2]
for Text[8]
.
Procedure KMP(Text, Pattern)
GenerateLPSArray(Pattern)
m := Text.length
n := Pattern.length
i := 0
j := 0
while i < m
if Pattern[j] == Text[i]
j := j + 1
i := i + 1
if j == n
Return (i - j)
j := lps[j - 1]
else if i < m and Pattern[j] != Text[i]
if j != 0
j := lps[j - 1]
else
i := i + 1
Return -1
Complexity
Time
Apart from preprocessing (which is O(n)
), the search phase has a time complexity of O(m)
, leading to a total complexity of O(m + n)
.
Space
The algorithm requires O(m)
space for the pattern’s lps[]
array.
Code
Java
public class KMPSearch {
// Compute the LPS (Longest Prefix Suffix) array
public int[] computeLPSArray(String pat) {
int M = pat.length();
int[] lps = new int[M];
int len = 0;
int i = 1;
lps[0] = 0; // lps[0] is always 0
// the loop calculates lps[i] for i = 1 to M-1
while (i < M) {
if (pat.charAt(i) == pat.charAt(len)) {
len++;
lps[i] = len;
i++;
} else { // (pat[i] != pat[len])
if (len != 0) {
len = lps[len - 1];
} else { // if (len == 0)
lps[i] = len;
i++;
}
}
}
return lps;
}
public int findPattern(String haystack, String needle) {
if (haystack == null || needle == null) {
return 0;
}
int h = haystack.length();
int n = needle.length();
if (n > h) {
return -1;
}
if (n == 0) {
return 0;
}
int[] lps = computeLPSArray(needle);
int i = 0; // index for haystack
int j = 0; // index for needle
while (i < h) {
if (needle.charAt(j) == haystack.charAt(i)) {
i++;
j++;
}
if (j == n) {
return i - j; // found match
} else if (i < h && needle.charAt(j) != haystack.charAt(i)) {
if (j != 0) {
j = lps[j - 1];
} else {
i = i + 1;
}
}
}
return -1; // no match found
}
}
Python
Prefix:
def prefix(p):
m = len(p)
pi = [0] * m
j = 0
for i in range(1, m):
while j >= 0 and p[j] != p[i]:
if j - 1 >= 0:
j = pi[j - 1]
else:
j = -1
j += 1
pi[i] = j
return pi
Match:
def find_occurrences(S, p):
matches = []
f = prefix(p)
n, m = len(S), len(p)
j = 0
for i in range(n):
while j >= 0 and S[i] != p[j]:
if j - 1 >= 0:
j = f[j - 1]
else:
j = -1
j += 1
if j == m:
j = f[m - 1]
matches.append(i - m + 1)
return matches