Problem
The Z algorithm finds all occurrences of a pattern in a text in linear time. If the length of the text is n and the length of the pattern is m, the total time taken is O(m + n) with linear space complexity. This algorithm is simpler to understand than the KMP algorithm while maintaining the same time and space complexities.
What is Z Array?
For a string str[0..n-1], the Z array is the same length as the string. An element Z[i] of the Z array stores the length of the longest substring starting from str[i] which is also a prefix of str[0..n-1]. The first entry of the Z array is meaningless as the complete string is always a prefix of itself.
| |
How is the Z Array Helpful in Pattern Searching in Linear Time?
The idea is to concatenate the pattern and text, and create a string “P$T”
- where P is pattern,
- $ is a special character should not be present in pattern and text,
- and T is text.
Build the Z array for concatenated string. In Z array, if Z value at any point is equal to pattern length, then pattern is present at that point.
Example:
| |
How to construct Z array?
A simple solution involves two nested loops: the outer loop goes to every index, and the inner loop finds the length of the longest prefix matching the substring starting at the current index. This solution has a time complexity of O(n^2).
We can construct the Z array in linear time:
The idea is to maintain an interval [L, R] which is the interval with the maximum R such that [L,R] is a prefix substring (a substring which is also a prefix).
Steps for maintaining this interval are as follows:
- If i > R, there is no prefix substring that starts before i and ends after i, so we reset L and R and compute a new
[L, R]by comparingstr[0..] to str[i..]and getZ[i] = R - L + 1. - If
i <= R, letK = i - L. Now,Z[i] >= min(Z[K], R - i + 1)because str[i..] matches withstr[K..]for at leastR - i + 1characters (they are in the[L, R]interval which we know is a prefix substring). Two sub-cases arise: a. IfZ[K] < R - i +1, there is no prefix substring starting atstr[i](otherwiseZ[K]would be larger), soZ[i] = Z[K]and the interval[L, R]remains the same. b. IfZ[K] >= R - i + 1, it is possible to extend the[L, R]interval. Thus, we set L as i and start matching fromstr[R]onwards to get a new R, then we update the interval[L, R]and calculateZ[i] = R - L + 1.
The algorithm runs in linear time because we never compare characters less than R. With matching, we increase R by one, so there are at most T comparisons. In the mismatch case, mismatches happen only once for each i (because of which R stops), leading to another at most T comparisons, making the overall complexity linear.
Code
Java
| |
Python
| |