Z Algorithm String Searching

The Z algorithm finds all occurrences of a pattern in a text in linear time. If the length of the text is n and the length of the pattern is m, the total time taken is O(m + n) with linear space complexity. This algorithm is simpler to understand than the KMP algorithm while maintaining the same time and space complexities.

What is Z Array?

For a string str[0..n-1], the Z array is the same length as the string. An element Z[i] of the Z array stores the length of the longest substring starting from str[i] which is also a prefix of str[0..n-1]. The first entry of the Z array is meaningless as the complete string is always a prefix of itself.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


Example:
Index            0   1   2   3   4   5   6   7   8   9  10  11 
Text             a   a   b   c   a   a   b   x   a   a   a   z
Z values         X   1   0   0   3   1   0   0   2   2   1   0 
More Examples:
str  = "aaaaaa"
Z[]  = {X, 5, 4, 3, 2, 1}

str = "aabaacd"
Z[] = {X, 1, 0, 2, 1, 0, 0}

str = "abababab"
Z[] = {X, 0, 6, 0, 4, 0, 2, 0}

How is the Z Array Helpful in Pattern Searching in Linear Time?

The idea is to concatenate the pattern and text, and create a string “P$T”

where P is pattern,
$ is a special character should not be present in pattern and text,
and T is text.

Build the Z array for concatenated string. In Z array, if Z value at any point is equal to pattern length, then pattern is present at that point.

Example:

1
2
3
4


Pattern P = "aab",  Text T = "baabaa"
The concatenated string is = "aab$baabaa"
Z array for the above concatenated string is {X, 1, 0, 0, 0, 3, 1, 0, 2, 1}.
Since the length of the pattern is 3, the value 3 in the Z array indicates the presence of the pattern.

How to construct Z array?

A simple solution involves two nested loops: the outer loop goes to every index, and the inner loop finds the length of the longest prefix matching the substring starting at the current index. This solution has a time complexity of O(n^2).

We can construct the Z array in linear time:

The idea is to maintain an interval [L, R] which is the interval with the maximum R such that [L,R] is a prefix substring (a substring which is also a prefix).

Steps for maintaining this interval are as follows:

If i > R, there is no prefix substring that starts before i and ends after i, so we reset L and R and compute a new [L, R] by comparing str[0..] to str[i..] and get Z[i] = R - L + 1.
If i <= R, let K = i - L. Now, Z[i] >= min(Z[K], R - i + 1) because str[i..] matches with str[K..] for at least R - i + 1 characters (they are in the [L, R] interval which we know is a prefix substring). Two sub-cases arise: a. If Z[K] < R - i + 1, there is no prefix substring starting at str[i] (otherwise Z[K] would be larger), so Z[i] = Z[K] and the interval [L, R] remains the same. b. If Z[K] >= R - i + 1, it is possible to extend the [L, R] interval. Thus, we set L as i and start matching from str[R] onwards to get a new R, then we update the interval [L, R] and calculate Z[i] = R - L + 1.

The algorithm runs in linear time because we never compare characters less than R. With matching, we increase R by one, so there are at most T comparisons. In the mismatch case, mismatches happen only once for each i (because of which R stops), leading to another at most T comparisons, making the overall complexity linear.

Code

Java

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43


public class Solution {
    public int strStr(String haystack, String needle) {
        String concat = needle + "$" + haystack;
        int l = concat.length();
        int[] Z = new int[l];
        getZarr(concat, Z);

        for (int i = 0; i < l; ++i) {
            if (Z[i] == needle.length()) {
                return i - needle.length() - 1;
            }
        }
        return -1;
    }

    private void getZarr(String str, int[] Z) {
        int n = str.length();
        int L = 0, R = 0;

        for (int i = 1; i < n; ++i) {
            if (i > R) {
                L = R = i;
                while (R < n && str.charAt(R) == str.charAt(R - L)) {
                    R++;
                }
                Z[i] = R - L;
                R--;
            } else {
                int k = i - L;
                if (Z[k] < R - i + 1) {
                    Z[i] = Z[k];
                } else {
                    L = i;
                    while (R < n && str.charAt(R) == str.charAt(R - L)) {
                        R++;
                    }
                    Z[i] = R - L;
                    R--;
                }
            }
        }
    }
}

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31


class Solution:
    def strStr(self, haystack: str, needle: str) -> int:
        concat = needle + "$" + haystack
        l = len(concat)
        Z = [0] * l
        self.getZarr(concat, Z)

        for i in range(l):
            if Z[i] == len(needle):
                return i - len(needle) - 1
        return -1

    def getZarr(self, s, Z):
        L, R, K = 0, 0, 0
        for i in range(1, len(s)):
            if i > R:
                L, R = i, i
                while R < len(s) and s[R] == s[R - L]:
                    R += 1
                Z[i] = R - L
                R -= 1
            else:
                K = i - L
                if Z[K] < R - i + 1:
                    Z[i] = Z[K]
                else:
                    L = i
                    while R < len(s) and s[R] == s[R - L]:
                        R += 1
                    Z[i] = R - L
                    R -= 1

What is Z Array?#

How is the Z Array Helpful in Pattern Searching in Linear Time?#

How to construct Z array?#

Code#

Java#

Python#