The Z algorithm finds all occurrences of a pattern in a text in linear time. If the length of the text is n
and the length of the pattern is m
, the total time taken is O(m + n) with linear space complexity. This algorithm is simpler to understand than the KMP algorithm while maintaining the same time and space complexities.
What is Z Array?
For a string str[0..n-1]
, the Z array is the same length as the string. An element Z[i]
of the Z array stores the length of the longest substring starting from str[i]
which is also a prefix of str[0..n-1]
. The first entry of the Z array is meaningless as the complete string is always a prefix of itself.
|
|
How is the Z Array Helpful in Pattern Searching in Linear Time?
The idea is to concatenate the pattern and text, and create a string “P$T”
- where P is pattern,
- $ is a special character should not be present in pattern and text,
- and T is text.
Build the Z array for concatenated string. In Z array, if Z value at any point is equal to pattern length, then pattern is present at that point.
Example:
|
|
How to construct Z array?
A simple solution involves two nested loops: the outer loop goes to every index, and the inner loop finds the length of the longest prefix matching the substring starting at the current index. This solution has a time complexity of O(n^2).
We can construct the Z array in linear time:
The idea is to maintain an interval [L, R] which is the interval with the maximum R such that [L,R] is a prefix substring (a substring which is also a prefix).
Steps for maintaining this interval are as follows:
- If i > R, there is no prefix substring that starts before i and ends after i, so we reset L and R and compute a new
[L, R]
by comparingstr[0..] to str[i..]
and getZ[i] = R - L + 1
. - If
i <= R
, letK = i - L
. Now,Z[i] >= min(Z[K], R - i + 1)
because str[i..] matches withstr[K..]
for at leastR - i + 1
characters (they are in the[L, R]
interval which we know is a prefix substring). Two sub-cases arise: a. IfZ[K] < R - i +
1, there is no prefix substring starting atstr[i]
(otherwiseZ[K]
would be larger), soZ[i] = Z[K]
and the interval[L, R]
remains the same. b. IfZ[K] >= R - i + 1
, it is possible to extend the[L, R]
interval. Thus, we set L as i and start matching fromstr[R]
onwards to get a new R, then we update the interval[L, R]
and calculateZ[i] = R - L + 1
.
The algorithm runs in linear time because we never compare characters less than R. With matching, we increase R by one, so there are at most T
comparisons. In the mismatch case, mismatches happen only once for each i (because of which R stops), leading to another at most T
comparisons, making the overall complexity linear.
Code
Java
|
|
Python
|
|