Problem

We have already seen the same problem with insert, delete and replace of characters here - Edit Distance Problem.

The edit distance of two strings is the minimum number of changes you need to make to turn the first string into the second. You are only allowed to remove or insert letters. Another common change that isn’t considered here is changing one letter into another, which can be achieved by first deleting the original letter and then inserting the new one.

Examples

Example 1:

Input: s1 = "horse", s2 = "ros"
Output: 3
Explanation: 
- Remove 'h' and 'e' from "horse" to get "ros".
- Total operations: 3 (remove 'h', remove 'e', remove 's')

Example 2:

Input: s1 = "intention", s2 = "execution"
Output: 9
Explanation: 
- Remove 'i', 'n', 't', 'e', 'n', 't', 'i', 'o', and 'n' from "intention" to get "execution".
- Total operations: 9

Solution

Method 1 - DP

The edit distance between two strings is the minimum number of operations required to convert the first string into the second. For this discussion, we only allow insertion and deletion of characters. Note that changing one letter into another can be achieved by a deletion followed by an insertion.

For example, to turn “encourage” into “entourage,” you can remove “c” and insert “t,” resulting in an edit distance of 2.

Consider another example with the words “assent” and “descent.” Here’s one way to convert “assent” into “descent”:

  1. Remove “a” to get “ssent.”
  2. Remove “s” to get “sent.”
  3. Remove “s” to get “ent.”
  4. Add “d” to get “dent.”
  5. Add “e” to get “deent.”
  6. Add “s” to get “desent.”
  7. Add “c” to get “descent.”

This method takes seven steps, making the edit distance at most 7. Determining if this is the minimum number of steps requires a structured approach. For longer strings, it can be challenging to find the optimal solution.

To calculate the edit distance, you can build an edit graph representing all possible changes from the first string to the second. The nodes at the top of the graph represent the characters of the first word, while nodes down the side represent the characters of the second word. Create links between nodes moving to the right and downward. Add diagonal links where characters in both strings match. Each link represents a transformation, such as removing or adding a letter, or keeping a letter unchanged.

Traversing the graph from the top-left to the bottom-right corner represents the series of operations required to convert the first word into the second. To find the least costly path, assign a cost of 1 to horizontal and vertical moves, and a cost of 0 to diagonal moves. This transforms the problem into finding the shortest path through the network.

The technique involves:

  1. Setting the distances for the nodes in the top row to their column numbers and for nodes in the leftmost column to their row numbers.
  2. Iterating over the graph to find the shortest path to each node by considering the nodes above, to the left, and diagonally up-left.

When applying this method to files, comparing character by character may require large and inefficient graphs. For example, files with about 40,000 characters would result in a graph containing 1.6 billion nodes. A more practical approach is to compare lines instead of characters, reducing the graph size significantly.

Code

Java
class Solution {
    public int minDistance(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        
        int[][] dp = new int[n + 1][m + 1];
        
        for (int i = 0; i <= n; i++) dp[i][0] = i;
        for (int j = 0; j <= m; j++) dp[0][j] = j;
        
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1];
                } else {
                    dp[i][j] = 1 + Math.min(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        
        return dp[n][m];
    }
}
Python
class Solution:
    def min_distance(self, s1: str, s2: str) -> int:
        n = len(s1)
        m = len(s2)
        
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        
        for i in range(n + 1):
            dp[i][0] = i
        for j in range(m + 1):
            dp[0][j] = j
        
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if s1[i - 1] == s2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
        
        return dp[n][m]

Complexity

  • ⏰ Time complexity: O(n * m), where n and m are lengths of s1 and s2.
  • 🧺 Space complexity: O(n * m) due to the DP table.