Using Dynamic Programming to Calculate Levenshtein Distance in Java

Levenshtein distance is a metric for the distance between two strings. It is defined by three different types of edits: substitution, insertion, and deletion. The Levenshtein distance between two strings is the minimum number of edits to get from one string to the other.

For example, the edit distance between “abc” and “abcd” is exactly 1 (insertion). Levenshtein distance is the same either way, which means that the distance between “abc” and “abcd” is the same as the distance between “abcd” and “abc”.

The dynamic programming (DP) algorithm for calculating Levenshtein distance is a classical one (it is known as the Wagner-Fischer algorithm). Each DP state is defined by the index (i) in the first string, which will be called s1, and the index (j) in the second string, which will be called s2. So, the DP array, which will be called dp, would have two dimensions. The value at dp[i][j] for any i or j that is less than or equal to the lengths of the string is the edit distance between s1.substring(0, i) and s2.substring(0, j). Note that substring(0, 3) of “abcd” is “abc” (the ending index is excluded). The recursive formulation for the algorithm is that if the i – 1 th and j – 1 th characters of s1 and s2 match, then the dp[i][j] would be dp[i – 1][j – 1]. No increase in value. If the i – 1 th and j – 1 th characters of s1 and s2 do not match, then dp[i][j] would be min(dp[i – 1][j], dp[i][j – 1], and dp[i – 1][j – 1]) + 1. The idea is that if a character in s1 is deleted, then the next character will slide into the position. Then, the min of the three recursive cases would be dp[i][j – 1], because s1 is missing one element so it shouldn’t be subtracted by one when looking for the last state. For insertion, it is as if a character is deleted from s2, so the recursive formulation for insertion is the mirror of the recursive formulation for deletion. The base states (dp[0][0 to s2.length()] and dp[0 to s1.length()][0]) is just increasing values, from 0 to s1.length() or s2.length(). The result is dp[s1.length()][s2.length()].

DP Array for the Levenshtein Distance Between “COMPUTER” and “USER”:

----------------------------------------
      C   O   M   P   U   T   E   R
----------------------------------------
  U   0   1   2   3   4   5   6   7   8
  S   1   1   2   3   4   4   5   6   7
  E   2   2   2   3   4   5   5   6   7
  R   3   3   3   3   4   5   6   5   6
      4   4   4   4   4   5   6   6   5
----------------------------------------

Here is the code for Levenshtein distance:

public static void editDistance(String s1, String s2){
	int[][] dp = new int[s1.length() + 1][s2.length() + 1];

	for(int i = 0; i <= s1.length(); i++){
		dp[i][0] = i;
	}
	for(int i = 0; i <= s2.length(); i++){
		dp[0][i] = i;
	}

	for(int i = 1; i <= s1.length(); i++){
		for(int j = 1; j <= s2.length(); j++){
			if(s1.charAt(i - 1) == s2.charAt(j - 1))
				dp[i][j] = dp[i - 1][j - 1];
			else
				dp[i][j] = Math.min(Math.min(dp[i][j - 1], dp[i - 1][j]), dp[i - 1][j - 1]) + 1;
		}
	}
	System.out.println(dp[s1.length()][s2.length()]);
}

To fuzzy search (search, but allow edits such as substitution, insertion, and deletion) for s2 inside s1, just make the first row all 0. This basically lets the algorithm ignore the edit distance cost for starting the match at a later index. This is known as Seller’s variant of the Wagner-Fischer algorithm.

DP Array When Fuzzy Searching for “USER” in “COMPUTER”:

----------------------------------------
      C   O   M   P   U   T   E   R
----------------------------------------
  U   0   0   0   0   0   0   0   0   0
  S   1   1   1   1   1   0   1   1   1
  E   2   2   2   2   2   1   1   2   2
  R   3   3   3   3   3   2   2   1   2
      4   4   4   4   4   3   3   2   1
----------------------------------------

Here is the code for searching:

public static void search(String s1, String s2){
	int[][] dp = new int[s1.length() + 1][s2.length() + 1];

	for(int i = 0; i <= s1.length(); i++){
		dp[i][0] = 0;
	}
	for(int i = 0; i <= s2.length(); i++){
		dp[0][i] = i;
	}

	for(int i = 1; i <= s1.length(); i++){
		for(int j = 1; j <= s2.length(); j++){
			if(s1.charAt(i - 1) == s2.charAt(j - 1))
				dp[i][j] = dp[i - 1][j - 1];
			else
				dp[i][j] = Math.min(Math.min(dp[i][j - 1], dp[i - 1][j]), dp[i - 1][j - 1]) + 1;
		}
	}

	int min = Integer.MAX_VALUE;
	for(int i = 0; i <= s1.length(); i++){
		min = Math.min(min, dp[i][s2.length()]);
	}

	System.out.println(min);
}

Notice how the algorithm goes through each index in the last row and finds the minimum. The minimum is the minimum edit distance for a match during the search. The index, i, in the for loop that goes through the last row to find the min is the ending positions for each search location. For example, the ending position is 3 when searching for “abc” inside “abcd”. Of course, the algorithm can be optimized to only save two rows instead of a full DP array. Some other optimizations can also be done to increase speed and decrease space usage, but they are beyond the scope of this post. Here is an example of a more optimized Levenshtein distance implementation:

public static void editDistance2(String s1, String s2){
	if(s1.length() > s2.length()){
		String temp = s1;
		s1 = s2;
		s2 = temp;
	}

	int[] curr = new int[s1.length() + 1];
	int[] prev = new int[s1.length() + 1];

	for(int i = 1; i <= s2.length(); i++){
		curr[0] = i;
		for(int j = 1; j <= s1.length(); j++){
			if(s2.charAt(i - 1) == s1.charAt(j - 1))
				curr[j] = i == 1 ? j - 1 : prev[j - 1];
			else
				curr[j] = Math.min(Math.min(i == 1 ? j : prev[j], curr[j - 1]), i == 1 ? j - 1 : prev[j - 1]) + 1;
			if(i < s2.length())
				prev[j - 1] = curr[j - 1];
		}
		if(i < s2.length())
			prev[s1.length()] = curr[s1.length()];
	}

	System.out.println(curr[s1.length()]);
}

GitHub link for full source code: https://github.com/Daniel-Liu-c0deb0t/General-Algorithms/blob/master/EditDistance.java

Thanks for reading!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s