Jaro-Winkler Algorithm for String Similarity
MediumUpdated: Aug 28, 2025
Problem
Given two strings, compute their similarity using the Jaro-Winkler distance metric. The Jaro-Winkler distance is a string similarity measure that rewards common prefixes and penalizes transpositions. It is widely used for record linkage, spelling correction, and fuzzy string matching.
Examples
Example 1
Input:
s1 = "martha"
s2 = "marhta"
Output:
Jaro distance = 0.944
Jaro-Winkler distance = 0.961
Explanation:
The two strings have all characters matching, but two transpositions (t/h and h/t). The Jaro distance is computed as shown below, and the Jaro-Winkler distance further boosts the score due to the common prefix "mar".
Solution
Method 1 – Jaro-Winkler Distance Calculation
Intuition
The Jaro distance measures similarity by considering matching characters and transpositions. The Jaro-Winkler distance extends this by giving extra weight to common prefixes, making it especially useful when similar strings share the same start.
Approach
- Find matching characters between the two strings within a matching window.
- Count the number of transpositions (characters that match but are out of order).
- Compute the Jaro distance using the formula:
- Let m = number of matching characters
- Let t = number of transpositions / 2
- Jaro distance:
dj = (1/3) * (m/|s1| + m/|s2| + (m-t)/m)
- Compute the length l of the common prefix (up to 4 characters).
- Compute the Jaro-Winkler distance:
dw = dj + l * p * (1 - dj)where p is the scaling factor (usually 0.1)
Code
C++
class Solution {
public:
double jaroWinkler(string s1, string s2) {
int n1 = s1.size(), n2 = s2.size();
if (n1 == 0 && n2 == 0) return 1.0;
int match_dist = max(n1, n2) / 2 - 1;
vector<bool> s1_match(n1, false), s2_match(n2, false);
int m = 0;
for (int i = 0; i < n1; ++i) {
int start = max(0, i - match_dist), end = min(n2 - 1, i + match_dist);
for (int j = start; j <= end; ++j) {
if (!s2_match[j] && s1[i] == s2[j]) {
s1_match[i] = s2_match[j] = true;
++m;
break;
}
}
}
if (m == 0) return 0.0;
int t = 0, k = 0;
for (int i = 0; i < n1; ++i) {
if (s1_match[i]) {
while (!s2_match[k]) ++k;
if (s1[i] != s2[k]) ++t;
++k;
}
}
double dj = (m / (double)n1 + m / (double)n2 + (m - t / 2.0) / m) / 3.0;
int l = 0;
while (l < min(4, min(n1, n2)) && s1[l] == s2[l]) ++l;
double p = 0.1;
return dj + l * p * (1 - dj);
}
};
Java
class Solution {
public double jaroWinkler(String s1, String s2) {
int n1 = s1.length(), n2 = s2.length();
if (n1 == 0 && n2 == 0) return 1.0;
int matchDist = Math.max(n1, n2) / 2 - 1;
boolean[] s1Match = new boolean[n1], s2Match = new boolean[n2];
int m = 0;
for (int i = 0; i < n1; ++i) {
int start = Math.max(0, i - matchDist), end = Math.min(n2 - 1, i + matchDist);
for (int j = start; j <= end; ++j) {
if (!s2Match[j] && s1.charAt(i) == s2.charAt(j)) {
s1Match[i] = s2Match[j] = true;
++m;
break;
}
}
}
if (m == 0) return 0.0;
int t = 0, k = 0;
for (int i = 0; i < n1; ++i) {
if (s1Match[i]) {
while (!s2Match[k]) ++k;
if (s1.charAt(i) != s2.charAt(k)) ++t;
++k;
}
}
double dj = (m / (double)n1 + m / (double)n2 + (m - t / 2.0) / m) / 3.0;
int l = 0;
while (l < Math.min(4, Math.min(n1, n2)) && s1.charAt(l) == s2.charAt(l)) ++l;
double p = 0.1;
return dj + l * p * (1 - dj);
}
}
Python
class Solution:
def jaro_winkler(self, s1: str, s2: str) -> float:
n1, n2 = len(s1), len(s2)
if n1 == 0 and n2 == 0:
return 1.0
match_dist = max(n1, n2) // 2 - 1
s1_match = [False] * n1
s2_match = [False] * n2
m = 0
for i in range(n1):
start, end = max(0, i - match_dist), min(n2 - 1, i + match_dist)
for j in range(start, end + 1):
if not s2_match[j] and s1[i] == s2[j]:
s1_match[i] = s2_match[j] = True
m += 1
break
if m == 0:
return 0.0
t = 0
k = 0
for i in range(n1):
if s1_match[i]:
while not s2_match[k]:
k += 1
if s1[i] != s2[k]:
t += 1
k += 1
dj = (m / n1 + m / n2 + (m - t / 2) / m) / 3
l = 0
while l < min(4, n1, n2) and s1[l] == s2[l]:
l += 1
p = 0.1
return dj + l * p * (1 - dj)
Complexity
- ⏰ Time complexity:
O(n + m), where n and m are the lengths of the two strings. Each character is compared within a window, and prefix and transpositions are checked in linear time. - 🧺 Space complexity:
O(n + m), for the match marker arrays.