Knuth Morris Pratt KMP Algorithm

This is part of Find the Index of the First Occurrence in a String.

Introduction

The Knuth-Morris-Pratt (KMP) algorithm is an extension and improvement of the Morris-Pratt algorithm, developed jointly by Donald Knuth, James H. Morris, and Vaughan R. Pratt and published in 1977. It operates in linear time by leveraging the information within the pattern itself to determine where to resume the search after a mismatch.

The key improvement in the KMP algorithm over the Morris-Pratt algorithm is the modification of the prefix function to handle mismatches more efficiently during the search phase.

In addition to the prefix function, the KMP algorithm constructs a shift table that helps to determine how much the pattern should be shifted when a mismatch occurs, minimizing unnecessary comparisons.

The worst case complexity of the Naive algorithm is O(m(n-m+1)). The time complexity of KMP algorithm is O(n) in the worst case.

The KMP matching algorithm uses degenerating property (pattern having same sub-patterns appearing more than once in the pattern) of the pattern and improves the worst case complexity to O(n). The basic idea behind KMP’s algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of next window. We take advantage of this information to avoid matching the characters that we know will anyway match.

This is where KMP does optimization over Naive. In this second window, we only compare fourth A of pattern with fourth character of current window of text to decide whether current window matches or not. Since we know first three characters will anyway match, we skipped matching first three characters.

Need of Preprocessing?

An important question arises from above explanation, how to know how many characters to be skipped. To know this, we pre-process pattern and prepare an integer array lps[] that tells us count of characters to be skipped.

Preprocessing Overview

KMP algorithm does pre-proceses pat[] and constructs an auxiliary lps[] or prefix[] of size m (same as size of pattern) which is used to skip characters while matching.
name lps indicates longest proper prefix which is also suffix.. A proper prefix is prefix with whole string not allowed.

Example 1

String: ABC
proper prefixes: "",A, AB, ABC
Suffix: "", C, BC, ABC

Example 2

String: ABCDEFABCR
Proper Prefixes: "", A, AB, ABC, ABCD, ABCDE, ABCDEF, ABCDEFA, ABCDEFAB, ABCDEFABC
Suffix: "", R, CR, BCR, ABCR, FABCR, EFABCR, DEFABCR, CDEFABCR, BCDEFABCR

The lps[] array helps in determining the new positions to start comparisons from, thereby optimizing the search process.

Generating the `lps[]` Array

For each sub-pattern pat[0..i] where i = 0 to m-1, lps[i] stores length of the maximum matching proper prefix which is also a suffix of the sub-pattern pat[0..i].

lps[i] = the longest proper prefix of `pat[0..i]` which is also a suffix of `pat[0..i]`.

We create the lps[] array by comparing characters within the pattern:

Procedure GenerateLPSArray(Pattern):
    i := 1
    j := 0
    lps[0] := 0
    while i < Pattern.length
        if Pattern[i] == Pattern[j]
            j := j + 1
            lps[i] := j
            i := i + 1
        else
            if j != 0
                j := lps[j - 1]
            else
                lps[i] := 0
                i := i + 1

Example from pattern “abcdabca”

+---------+---+---+---+---+---+---+---+---+
| Index   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+
| lps[]   | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 |
+---------+---+---+---+---+---+---+---+---+

Filling in

+---------+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+

Considering the positions of j and i:

 j   i
+---------+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+

When Pattern[j] and Pattern[i] don’t match, we increment i. Since j is 0, we set Pattern[i] to 0 without checking the previous value. Continuing this process, when i becomes 4, a match occurs, so we set S[i] to S[4], which is j + 1 = 1, and increment both j and i. The array at this stage is:

 j               i
+---------+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+
|    S    | 0 | 0 | 0 | 0 | 1 |   |   |   |
+---------+---+---+---+---+---+---+---+---+

As Pattern[1] matches Pattern[5], we set S[i] to S[5], which is j + 1 = 2. Moving forward, j = 3 and i = 7 leads to a mismatch. Since j is not 0, we set j to S[j-1] and compare Pattern[i] with Pattern[j]. If they match, S[i] becomes j + 1. The final array thus becomes:

+---------+---+---+---+---+---+---+---+---+
|    S    | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 |
+---------+---+---+---+---+---+---+---+---+

This array indicates the length of the longest suffix which is also a prefix in the substring (from 0 to i). A nonzero S[i] value signifies this, and the next comparison will start from S[i] + 1 in the pattern.

Finding matches OR Search Algorithm

We are now ready to find occurrences of the pattern in the string.

How to use lps[] to determine the next positions (or the number of characters to skip):

Start by comparing pat[j] (with j = 0) with characters of the current text window.
Continue matching characters txt[i] and pat[j]; increment both i and j while they keep matching.
Upon a mismatch:
- Recognize that characters pat[0..j-1] match txt[i-j+1…i-1] (note that j starts at 0 and increments only on matches).
- The value lps[j-1] indicates the count of characters in pat[0…j-1] that are both proper prefixes and suffixes.
- Therefore, do not re-match these lps[j-1] characters with txt[i-j…i-1] since we know they will match. Let’s see an example to understand this concept.

Example

Now let’s do a substring search using the following example:

+---------+---+---+---+---+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |
+---------+---+---+---+---+---+---+---+---+---+---+---+---+
|   Text  | a | b | x | a | b | c | a | b | c | a | b | y |
+---------+---+---+---+---+---+---+---+---+---+---+---+---+

+---------+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 |
+---------+---+---+---+---+---+---+
| Pattern | a | b | c | a | b | y |
+---------+---+---+---+---+---+---+
|    LPS  | 0 | 0 | 0 | 1 | 2 | 0 |
+---------+---+---+---+---+---+---+

In this example, we have a text, a pattern, and a pre-calculated lps[] array. First, match Text[0] with Pattern[0], which match. Similarly, Text[1] matches Pattern[1]. However, Text[2] does not match Pattern[2].

We check lps[1], which is 0, indicating no matching suffix-prefix in the substring, so we start at Pattern[0] again for Text[2]. Upon reaching Text[3], it matches Pattern[0], and this continues until Text[8] matches Pattern[5]. Upon a mismatch, we refer to lps[4], finding the value 2, meaning a matching prefix that is a suffix (abcab ends in ab). Resume comparison from Pattern[2] for Text[8].

Procedure KMP(Text, Pattern)
    GenerateLPSArray(Pattern)
    m := Text.length
    n := Pattern.length
    i := 0
    j := 0
    while i < m
        if Pattern[j] == Text[i]
            j := j + 1
            i := i + 1
        if j == n
            Return (i - j)
            j := lps[j - 1]
        else if i < m and Pattern[j] != Text[i]
            if j != 0
                j := lps[j - 1]
            else
                i := i + 1
    Return -1

Complexity

Time

Apart from preprocessing (which is O(n)), the search phase has a time complexity of O(m), leading to a total complexity of O(m + n).

Space

The algorithm requires O(m) space for the pattern’s lps[] array.

Code

Java

public class KMPSearch {

    // Compute the LPS (Longest Prefix Suffix) array
    public int[] computeLPSArray(String pat) {
        int M = pat.length();
        int[] lps = new int[M];
        int len = 0;
        int i = 1;

        lps[0] = 0; // lps[0] is always 0

        // the loop calculates lps[i] for i = 1 to M-1
        while (i < M) {
            if (pat.charAt(i) == pat.charAt(len)) {
                len++;
                lps[i] = len;
                i++;
            } else { // (pat[i] != pat[len])
                if (len != 0) {
                    len = lps[len - 1];
                } else { // if (len == 0)
                    lps[i] = len;
                    i++;
                }
            }
        }
        return lps;
    }

    public int findPattern(String haystack, String needle) {
        if (haystack == null || needle == null) {
            return 0;
        }

        int h = haystack.length();
        int n = needle.length();

        if (n > h) {
            return -1;
        }
        if (n == 0) {
            return 0;
        }

        int[] lps = computeLPSArray(needle);
        int i = 0; // index for haystack
        int j = 0; // index for needle

        while (i < h) {
            if (needle.charAt(j) == haystack.charAt(i)) {
                i++;
                j++;
            }

            if (j == n) {
                return i - j; // found match
            } else if (i < h && needle.charAt(j) != haystack.charAt(i)) {
                if (j != 0) {
                    j = lps[j - 1];
                } else {
                    i = i + 1;
                }
            }
        }

        return -1; // no match found
    }
}

Python

Prefix:

def prefix(p):
    m = len(p)
    pi = [0] * m
    j = 0
    for i in range(1, m):
        while j >= 0 and p[j] != p[i]:
            if j - 1 >= 0:
                j = pi[j - 1]
            else:
                j = -1
        j += 1
        pi[i] = j
    return pi

Match:

 def find_occurrences(S, p):
     matches = []
     f = prefix(p)
     n, m = len(S), len(p)
     j = 0
     for i in range(n):
         while j >= 0 and S[i] != p[j]:
             if j - 1 >= 0:
                 j = f[j - 1]
             else:
                 j = -1
         j += 1
         if j == m:
             j = f[m - 1]
             matches.append(i - m + 1)
     return matches

Introduction#

Need of Preprocessing?#

Preprocessing Overview#

Generating the lps[] Array#

Example from pattern “abcdabca”#

Filling in#

Finding matches OR Search Algorithm#

Example#

Complexity#

Time#

Space#

Code#

Java#

Python#