The Z algorithm finds all occurrences of a pattern in a text in linear time. If the length of the text is n
and the length of the pattern is m
, the total time taken is O(m + n) with linear space complexity. This algorithm is simpler to understand than the KMP algorithm while maintaining the same time and space complexities.
What is Z Array?
For a string str[0..n-1]
, the Z array is the same length as the string. An element Z[i]
of the Z array stores the length of the longest substring starting from str[i]
which is also a prefix of str[0..n-1]
. The first entry of the Z array is meaningless as the complete string is always a prefix of itself.
Example:
Index 0 1 2 3 4 5 6 7 8 9 10 11
Text a a b c a a b x a a a z
Z values X 1 0 0 3 1 0 0 2 2 1 0
More Examples:
str = "aaaaaa"
Z[] = {X, 5, 4, 3, 2, 1}
str = "aabaacd"
Z[] = {X, 1, 0, 2, 1, 0, 0}
str = "abababab"
Z[] = {X, 0, 6, 0, 4, 0, 2, 0}
How is the Z Array Helpful in Pattern Searching in Linear Time?
The idea is to concatenate the pattern and text, and create a string “P$T”
- where P is pattern,
- $ is a special character should not be present in pattern and text,
- and T is text.
Build the Z array for concatenated string. In Z array, if Z value at any point is equal to pattern length, then pattern is present at that point.
Example:
Pattern P = "aab", Text T = "baabaa"
The concatenated string is = "aab$baabaa"
Z array for the above concatenated string is {X, 1, 0, 0, 0, 3, 1, 0, 2, 1}.
Since the length of the pattern is 3, the value 3 in the Z array indicates the presence of the pattern.
How to construct Z array?
A simple solution involves two nested loops: the outer loop goes to every index, and the inner loop finds the length of the longest prefix matching the substring starting at the current index. This solution has a time complexity of O(n^2).
We can construct the Z array in linear time:
The idea is to maintain an interval [L, R] which is the interval with the maximum R such that [L,R] is a prefix substring (a substring which is also a prefix).
Steps for maintaining this interval are as follows:
- If i > R, there is no prefix substring that starts before i and ends after i, so we reset L and R and compute a new
[L, R]
by comparingstr[0..] to str[i..]
and getZ[i] = R - L + 1
. - If
i <= R
, letK = i - L
. Now,Z[i] >= min(Z[K], R - i + 1)
because str[i..] matches withstr[K..]
for at leastR - i + 1
characters (they are in the[L, R]
interval which we know is a prefix substring). Two sub-cases arise: a. IfZ[K] < R - i +
1, there is no prefix substring starting atstr[i]
(otherwiseZ[K]
would be larger), soZ[i] = Z[K]
and the interval[L, R]
remains the same. b. IfZ[K] >= R - i + 1
, it is possible to extend the[L, R]
interval. Thus, we set L as i and start matching fromstr[R]
onwards to get a new R, then we update the interval[L, R]
and calculateZ[i] = R - L + 1
.
The algorithm runs in linear time because we never compare characters less than R. With matching, we increase R by one, so there are at most T
comparisons. In the mismatch case, mismatches happen only once for each i (because of which R stops), leading to another at most T
comparisons, making the overall complexity linear.
Code
Java
public class Solution {
public int strStr(String haystack, String needle) {
String concat = needle + "$" + haystack;
int l = concat.length();
int[] Z = new int[l];
getZarr(concat, Z);
for (int i = 0; i < l; ++i) {
if (Z[i] == needle.length()) {
return i - needle.length() - 1;
}
}
return -1;
}
private void getZarr(String str, int[] Z) {
int n = str.length();
int L = 0, R = 0;
for (int i = 1; i < n; ++i) {
if (i > R) {
L = R = i;
while (R < n && str.charAt(R) == str.charAt(R - L)) {
R++;
}
Z[i] = R - L;
R--;
} else {
int k = i - L;
if (Z[k] < R - i + 1) {
Z[i] = Z[k];
} else {
L = i;
while (R < n && str.charAt(R) == str.charAt(R - L)) {
R++;
}
Z[i] = R - L;
R--;
}
}
}
}
}
Python
class Solution:
def strStr(self, haystack: str, needle: str) -> int:
concat = needle + "$" + haystack
l = len(concat)
Z = [0] * l
self.getZarr(concat, Z)
for i in range(l):
if Z[i] == len(needle):
return i - len(needle) - 1
return -1
def getZarr(self, s, Z):
L, R, K = 0, 0, 0
for i in range(1, len(s)):
if i > R:
L, R = i, i
while R < len(s) and s[R] == s[R - L]:
R += 1
Z[i] = R - L
R -= 1
else:
K = i - L
if Z[K] < R - i + 1:
Z[i] = Z[K]
else:
L = i
while R < len(s) and s[R] == s[R - L]:
R += 1
Z[i] = R - L
R -= 1