The Levenshtein distance describes the difference between two strings (think diff). It is quite useful to be able to determine this metric, also called the “minimum edit distance,” quickly.

What is the "minimum edit distance?"

First, let's review the problem.

Let's imagine we are comparing 2 strings character by character. As we move along, we transform the "source" string into the "target" string. Each time we change a character in the "source" string, we'll increment the "edit distance."

As we compare each character, there are four possible outcomes as we try to change characters in the source to match characters in the target string:

  1. The characters differ, so we delete the source character (hoping the next source matches the current target)
  2. The characters differ, so we insert a new one in front of the source character (hoping the source character we pushed down is a match to the next target character)
  3. The characters differ, so we substitute the source character with a different character (disregarding next target characters)
  4. The characters are the same, we do nothing

Each operation carries the same weight, except of course the "do nothing" operation which has no weight at all.

You might wonder why we don't always just substitute characters, but imagine we are changing "ab" to "b" — we wouldn't want to substitute 'a' for 'b', then delete 'b' (2 steps) when we could just delete 'a' (1 step).

As an example, let's examine how many steps are required to change the word "cast" to "bat." The "score" here is the Levenshtien distance. We're "scoring" this series of changes because this might not be the only way to change one string into the other.

example: change "cast" to "bat"
c a s t
b a t

c != b

substitute, b for c

a = a

do nothing!

s != t

delete s

t = t

do nothing!

score:0

Let's see it

As you can see, the distance here is 2. But how do we know this is the best score? Could we do better?

We could, for example, completely erase the starting word and insert all of the target word characters. This, however, would require 4 steps to clear out the 4 characters of the source word, "cast," and then 3 more steps to re-insert each character in the target word, "bat." This is obviously not ideal, since we want the minimum edit distance.

As you might imagine the combinatorics involved with solving this problem for larger strings are rather unruly. How can you be sure once you've found a step-by-step process of changing source to target that you've actually got the minimum edit distance?

Are you going to compare thousands of possible step-by-step solutions to eachother? How can we do this efficiently?

dynamic programming

Dynamic programming is the recursive analysis of simpler subproblems. We save time by breaking down the problem into smaller parts, and then we use the solutions of our smaller sub-problems to build a map that helps us efficiently find our way to the answer.

what are the subproblems?

In this case, each subproblem is a comparison of 2 characters at specific indexes in the the two strings. Each operation is a decision that will affect subsequent operations, and each operation has a score based on previous operations.

If you remove a character from source and move onto the next comparison, then that decision will carry into your score for the rest your future explorations.

Luckily, this problem translates nicely to a chart:

c a s t
b a t
bat
c
a
s
t

The intersection of the "b" column and the "c" row represents our comparison of "b" and "c"

how do we show actions in the chart?

Remember our 4 actions:

  1. The characters differ, so we delete the source character
    bat
    +1
    c
    a
    s
    t

    Yellow cell (b,c) = dark blue score + 1
    Examine (b,a) next

  2. The characters differ, so we insert a new character in front of the source character
    bat
    c+1
    a
    s
    t

    Yellow cell (b,c) = dark blue score + 1
    Examine (a,c) next

  3. The characters differ, so we substitute the source character
    bat
    +1
    c
    a
    s
    t

    Yellow cell (b,c) = dark blue score + 1
    Examine (a,a) next

  4. The characters are the same, we do nothing
    bat
    +0
    c
    a
    s
    t

    Yellow cell (b,c) = dark blue score + 0
    Examine (a,a) next

You'll notice there are numerous ways to get a score for a single cell in the table. Obviously, we'll choose the one with the minimum score.

We then end up examining the cell with the light blue background, which is the next character comparison required after our previous action. This cell initially has no score. There are many ways to arrive at this cell, and not every one will have the same cummulative score. So, how do we know we've taked then path with the minimum score?

the previous subproblems

Take the "s" and "t" comparison...

b a s t
b a t
bat
c
a+1+1
s+1?
t

In order to solve for (t,s) we take the minimum of the 3 possible ways to get to (s,t)

  • (a,a)+1
  • (t,a)+1
  • (a,s)+1

We have +1 for every outcome because "s" and "t" do not match. If we had a match in this cell the 3 choices would be the same, but the (a,a) would be a +0 score since we're matching and not substituting.

So, as you can see, our next move is fully dependent on the values of preceding cells. In order to fill out the table, we need to have all preceding cells filled out. We can't just move from one best case to another, we have to systematically fill out the entire chart. So where do we start?

the base cases

bat
0123
c1
a2
s3
t4

For our base cases, we fill out our first row and first column values. Think of these values as starts to the worst case: "delete all, then re-insert all characters" as we move across each axis.

Moving across the chart from the top left to the bottom left would take us through a complete deletion of the source, then moving from bottom left to bottom right would take us to complete re-insertion. That is always the worst solution, but it helps us establish and understand our base case values.

building the chart

After we establish our base cases, there are two ways to construct the rest of the chart: iteratively or recursively.

The iterative solution is the "bottom up." We'll start filling out the chart by filling out each cell, row-by-row. Once we have the bottom-right value, we have our Levenshtein distance.

The recursive solution is the "top down" approach. We start by asking for the bottom-right value, which is our solution. We then recurse into each sub-problem for comparison.

understanding the results

In both approaches, we need to keep track of which action yields the best score. We can then trace backward through the chart to determine the complete sequence of actions to take to transform source to target with a minimum number of steps.

As we trace backward from the bottom-right through the best cells, we move in 1 of 4 ways:

  • up and left, with score change
    -- substitute
  • up and left, without score change
    -- match
  • up
    -- delete
  • left
    -- insert

The distance calculator below shows both the recursive and iterative chart-building approaches, step-by-step. Check the "track recursion" box to watch how we recurse from bottom-right down to our known solutions.

As the charts are being constructed, the best sub-problem score used to calculate each cell's own score flashes yellow. This cell is then marked and used at the end of the building process to help us backtrace the solution.

At the end of the construction process, all of the cells that led to the best solution, and at some point flashed yellow, are highlighted green. These cells represent the sequence of actions that take source to target

Recursive

Iterative

There you have it..

So what does it look like when applied to the strings?

Recursive
Iterative

Let's see it!


The code for these solutions is available on GitHub: recursive, iterative. It's a little ugly, since it has a lot of callbacks for animation.

Credit to Cliff Min for authoring the iterative code.