I’ve learned about the Levenshtein distance in two different courses on computer algorithms. I spent a lot of time writing up a detailed explanation of a common method used to solve the problem with dynamic programming.
So, when I saw a practice problem on TopCoder called ShortPalindromes, I got excited about the similarity to the “minimum edit distance.”
Because the problem statement is owned by TopCoder, I can’t post the whole thing here. However, I do own my solution (available on GitHub), which is a nice variant of the Levenshtein distance solution I posted on my projects page.
I think I’m OK legally to summarize the problem in my own words, in order to frame my solution.
ShortPalindromes problem summary
The idea is to take a string as input, and turn the string into a palindrome if it is not already a palindrome. The output string doesn’t have to be an actual word, but it must be a palindrome.
For example, if I pass in
ab then I should get
aba. The tricky part of the problem spec, however, is that you can insert characters in front of the string, at the end, or at any point in between.
That means that
ab could also be
bab, or for a more complex example,
raecar should become
racecar (the second
c gets inserted not at the front nor the back).
the min-ed-dis similarity
I initially thought I could use a tree to represent the possibilities, and use an A* search to get the best result for the lowest number of insertions. However, this approach would involve a lot of branches at each node, and could get huge for larger strings.
Then I realized that this is basically a “minimum edit distance” problem, where we are trying to make a forward version of the string just like a reversed version of the string.
massaman (my favorite curry dish) for example. Below is the chart we’d generate to turn
massaman (forward input) into
namassam (backward input):
I built this with the calculator at the bottom of my Levenshtein distance project page.
There are a few problems with this chart.
For one, we can’t “delete” or “substitute” any characters from the input string. This means we’ll have to redefine what it means to move through the chart.
Second, it doesn’t make sense to move all the way down to the bottom right cell of the chart. We don’t want to reverse our input, we just want the make the front half match the second half. This means that our solutions actually lie on the minor diagonal of the chart.
First, we need to address the differences in what it means to move from one cell to another.
Here’s what we’ll do… In my previous post I used a demo chart that was changing
bat – here let’s say my input is
tab and I’m trying to get a ShortPalindrome.
our new chart labels
Let’s start by redefining what the column and row labels mean.
Imagine that we have 2 pointers. One points to the front of the input, and the other points to the end.
Initial string solution:
tab (same as input)
As we move from the first row in our chart to the last, think of the front pointer advancing down the string. As we move from the first column in our chart to the last, imagine the back pointer moving toward the front of the string.
our new chart actions
- Insert the character that the back pointer points to in front of the front pointer, and advance the back pointer toward the front.
t a b ↑ ↑
- Insert the character that the front pointer points to after the back pointer, and advance the forward pointer toward the back.
t a b ↑ ↑
It’s a little tricky to see in this example, since we’re only using 3 characters, but this means we allow ourselves to insert characters in between the start and end of the string, because we are moving the pointers inward from the ends.
our new chart motions
Now let’s put these actions into the context of movement through the chart.
- The characters differ, so we insert the character the front pointer points to to the back of the back pointer, and advance the front pointer forward. Since we've advanced the front pointer, we move down in the chart. But the back pointer remains pointing to the same character, even though the solution string is now longer. This means we stay in the same column until we address the back pointer.
b a t +1 t ↓ a b
Yellow cell (b,t) = dark blue score + 1
Examine (b,a) next
- The characters differ, so we insert the character the back pointer points to in front of the front pointer, and advance the back pointer toward the front. Similar to our other option, we stay in the same row since the front pointer is not yet addressed.
b a t t +1 → a b
Yellow cell (b,t) = dark blue score + 1
Examine (a,t) next
- The characters are the same, we do nothing to the string we are modifying, march the front pointer toward the back, and the back pointer toward the front.
b a t +0 t ↘ a b
Yellow cell (b,r) = dark blue score + 0
Examine (a,a) next
Note that there are only 3 options in this chart. In min-ed-dis we have 2 options on the diagonal move. The second option not here is the “substitute” operation. We cannot substitute in this chart, because that wouldn’t help us get any closer to a palindrome.
This also means that the diagonal move is only allowed if we have matching characters.
our new solution
The last difference is in how we read the chart’s solution. In the “minimum edit distance” problem, the solution lies in the bottom right of the chart.
In the case of ShortPaindrome, our solution is found when our front pointer and our back pointer are pointing to the same location. This means that we’ve matched the front half of the string with the second half.
This happens to occur along the minor diagonal of our matrix.
So, once we have filled up a triangular matrix with the scores associated with inserting characters near our pointers, we have a range of solutions along the minor diagonal.
All we have to do is grab the minimum score on the diagonal.
Check out my GitHub repo for an example of the output this program generates, and a working C++ implementation.