Jaro-Winkler Algorithm for String Similarity

Problem

Given two strings, compute their similarity using the Jaro-Winkler distance metric. The Jaro-Winkler distance is a string similarity measure that rewards common prefixes and penalizes transpositions. It is widely used for record linkage, spelling correction, and fuzzy string matching.

Examples

Example 1

Input:
s1 = "martha"
s2 = "marhta"
Output:
Jaro distance = 0.944
Jaro-Winkler distance = 0.961
Explanation:
The two strings have all characters matching, but two transpositions (t/h and h/t). The Jaro distance is computed as shown below, and the Jaro-Winkler distance further boosts the score due to the common prefix "mar".

Solution

Method 1 – Jaro-Winkler Distance Calculation

Intuition

The Jaro distance measures similarity by considering matching characters and transpositions. The Jaro-Winkler distance extends this by giving extra weight to common prefixes, making it especially useful when similar strings share the same start.

Approach

Find matching characters between the two strings within a matching window.
Count the number of transpositions (characters that match but are out of order).
Compute the Jaro distance using the formula:
- Let m = number of matching characters
- Let t = number of transpositions / 2
- Jaro distance: dj = (1/3) * (m/|s1| + m/|s2| + (m-t)/m)

d_j = \frac{1}{3}(\frac{m}{s_1} + \frac{m}{s_2} + \frac{m - t}{m})

Compute the length l of the common prefix (up to 4 characters).
Compute the Jaro-Winkler distance:
- dw = dj + l * p * (1 - dj) where p is the scaling factor (usually 0.1)

d_w = d_j + (l*p*(1 - d_j))

Code

C++

class Solution {
public:
  double jaroWinkler(string s1, string s2) {
    int n1 = s1.size(), n2 = s2.size();
    if (n1 == 0 && n2 == 0) return 1.0;
    int match_dist = max(n1, n2) / 2 - 1;
    vector<bool> s1_match(n1, false), s2_match(n2, false);
    int m = 0;
    for (int i = 0; i < n1; ++i) {
      int start = max(0, i - match_dist), end = min(n2 - 1, i + match_dist);
      for (int j = start; j <= end; ++j) {
        if (!s2_match[j] && s1[i] == s2[j]) {
          s1_match[i] = s2_match[j] = true;
          ++m;
          break;
        }
      }
    }
    if (m == 0) return 0.0;
    int t = 0, k = 0;
    for (int i = 0; i < n1; ++i) {
      if (s1_match[i]) {
        while (!s2_match[k]) ++k;
        if (s1[i] != s2[k]) ++t;
        ++k;
      }
    }
    double dj = (m / (double)n1 + m / (double)n2 + (m - t / 2.0) / m) / 3.0;
    int l = 0;
    while (l < min(4, min(n1, n2)) && s1[l] == s2[l]) ++l;
    double p = 0.1;
    return dj + l * p * (1 - dj);
  }
};

Java

class Solution {
  public double jaroWinkler(String s1, String s2) {
    int n1 = s1.length(), n2 = s2.length();
    if (n1 == 0 && n2 == 0) return 1.0;
    int matchDist = Math.max(n1, n2) / 2 - 1;
    boolean[] s1Match = new boolean[n1], s2Match = new boolean[n2];
    int m = 0;
    for (int i = 0; i < n1; ++i) {
      int start = Math.max(0, i - matchDist), end = Math.min(n2 - 1, i + matchDist);
      for (int j = start; j <= end; ++j) {
        if (!s2Match[j] && s1.charAt(i) == s2.charAt(j)) {
          s1Match[i] = s2Match[j] = true;
          ++m;
          break;
        }
      }
    }
    if (m == 0) return 0.0;
    int t = 0, k = 0;
    for (int i = 0; i < n1; ++i) {
      if (s1Match[i]) {
        while (!s2Match[k]) ++k;
        if (s1.charAt(i) != s2.charAt(k)) ++t;
        ++k;
      }
    }
    double dj = (m / (double)n1 + m / (double)n2 + (m - t / 2.0) / m) / 3.0;
    int l = 0;
    while (l < Math.min(4, Math.min(n1, n2)) && s1.charAt(l) == s2.charAt(l)) ++l;
    double p = 0.1;
    return dj + l * p * (1 - dj);
  }
}

Python

class Solution:
    def jaro_winkler(self, s1: str, s2: str) -> float:
        n1, n2 = len(s1), len(s2)
        if n1 == 0 and n2 == 0:
            return 1.0
        match_dist = max(n1, n2) // 2 - 1
        s1_match = [False] * n1
        s2_match = [False] * n2
        m = 0
        for i in range(n1):
            start, end = max(0, i - match_dist), min(n2 - 1, i + match_dist)
            for j in range(start, end + 1):
                if not s2_match[j] and s1[i] == s2[j]:
                    s1_match[i] = s2_match[j] = True
                    m += 1
                    break
        if m == 0:
            return 0.0
        t = 0
        k = 0
        for i in range(n1):
            if s1_match[i]:
                while not s2_match[k]:
                    k += 1
                if s1[i] != s2[k]:
                    t += 1
                k += 1
        dj = (m / n1 + m / n2 + (m - t / 2) / m) / 3
        l = 0
        while l < min(4, n1, n2) and s1[l] == s2[l]:
            l += 1
        p = 0.1
        return dj + l * p * (1 - dj)

Complexity

⏰ Time complexity: O(n + m), where n and m are the lengths of the two strings. Each character is compared within a window, and prefix and transpositions are checked in linear time.
🧺 Space complexity: O(n + m), for the match marker arrays.