Sunday 13 March 2011

How fast can we diff two strings?

On this question I have yet to read of any advance on Esko Ukkonen's 'Algorithms for Approximate String Matching' which appeared in Information and Control, 64, 100-118 in 1985. As per his predecessors, he uses the edit-graph to model the matching process, laying out the B-string across the top and the A-string down the left hand side (see diagram below). Finding the shortest edit script is akin to moving through this graph one square at a time. The x coordinates are just indices into the B-string, and the y-coordinates are expressed as x-i, where i is the index of the relevant diagonal. (The diagonals are numbered 0 at the origin, with increasing negative values on the 'down' side and increasing positive values on the 'right' side.) From each square there are three possible moves: right, corresponding to a deletion from the B-string; down, corresponding to an insertion from the A string; diagonally down and right, corresponding to a match (if the characters at that point are equal) or to an exchange if they are not.

In this diagram we transform 'aback' into 'beak'. First we delete 'a'. Then we match 'b', insert 'e', match 'a', delete 'c' and match 'k'. Note that each move takes us one beyond the square it was performed in.

Progressive refinement of the p-band

The basic idea of Ukkonen's (and Myers) algorithm is that the only positions of interest as we traverse the graph are the most advanced positions on each diagonal. So instead of storing NxM positions (as in the dynamic programming method of Needleman and Wunsch for example) we only need to store N+M diagonal positions – quite a saving. Ukkonen further reduces the search space to a narrow diagonal strip down the centre of the graph – the p-band (p.105f) – which can gradually narrow as we proceed toward the end (p.106). Both ideas have been republished as alleged improvements to his algorithm (e.g. Wu 1989 – cited 88 times). Ukkonen defines p as ⌊1/2(dmn/Δ–|n-m|)⌋. In plain English the worst possible cost is incurred by exchanging all the characters on the 0 diagonal and then deleting all the characters horizontally until you get to M,N. Any path on the graph that is more than half that cost away from the goal has no chance of reaching it more efficiently than that. What is more, we can update p as we get closer to the goal, and we get a better idea of the cost of reaching Dmn.

Based on this idea my program assesses dynamically, for each move in the graph, whether it could result in a shorter path to the goal than the worst result of the currently best diagonal. If not, then it is dropped. My algorithm is also very space-efficient. Rather than allocating an array of size M+N (the lengths of the two strings) to hold the diagonals, I use a linked list to store only the active ones.

Moving through the graph

We construct a linked list of diagonals and number them, starting with the one between the two strings at 0,0 (the origin), numbered 0. The other diagonals along the y-axis are numbered -1,-2,-3 etc and the ones along the x-axis are numbered 1,2,3 etc. Then we assess the even and odd numbered diagonals in the list in alternate passes. The reason is that the computation of the even-diagonals interferes with that of the odd-diagonals. For each diagonal at index i and horizontal position x we compute the maximum x-value that would result from:

an insertion from the i+1 diagonal
a deletion from the i-1 diagonal
an exchange on the i diagonal

Whichever takes us further east along the diagonal (and by implication also further down) will be the best move. All things being equal we favour the exchange move because it promises to be the cheapest overall. After processing the even and odd diagonals we increment the d-value: our record of how much it has cost to transform B into A so far. If we are at one or the other edge of the linked list, and it is necessary, we also create a new diagonal to the left or right. This won't be processed in the current pass because it is odd when we are even and vice versa.

Finding the snakes

A match or 'snake' as Myers called them, is just a run of matching characters in the edit graph, and they give the diagonal method all its speed. But we have to be clear as to when we look for them. The very first square of the edit graph at 0,0 may or may not contain a matching character. If it does, we have to match the snake before checking for an exchange, but we will still have to do a delete and an insertion from 0,0. The reason is that otherwise we might miss an optimal snake starting on the -1 or 1 diagonal. Another position of care arises at the very last square at M-1,N-1. Until we have checked that square for a match as well as for insertion, deletion and exchange we won't have reached the real 'end', which is one character beyond M-1,N-1. For all other squares we just test for the existence of snakes after each edit operation, so we can be certain at the start of the next pass that the characters at the head of each diagonal do not match.

Exchanges vs insertions and deletions

Traditionally diff algorithms don't compute the exchanges. But an exchange move from x,y to x+1,y+1 costs only 1, whereas a deletion+insertion move to the same location costs 2. So exchanges are often much cheaper. The insertion/deletion pair can be recovered from any pair of exchanges. Replacing xxx with yyy is the same as deleting xxx and inserting yyy. Here's the result of my implementation of Ukkonen's algorithm. When you consider how many comparisons the program could have made (the entire graph) as opposed to the ones it did make (the squares marked with 'x'), you can see how efficient it is:

Recording the path

Finding the optimal path through the graph is no use if it doesn't allow us to construct a shortest edit script. One simple way to do this is to associate a path with each diagonal. The path will be implemented as a linked list. Each component of the path has an x,i origin, a length and a type (nothing [for the initial move], inserted, deleted, exchanged or matched). If the type changes, from, say, matched to exchanged, we create a new path-component and link it backwards to the last one. (If we went forwards we would need multi-way branches.) But if the type is unchanged we simply increase the length of the existing one. When we finally get to the end, the backwards-pointing path of the successful diagonal will contain the shortest edit script. To print it we just reverse the order and print it out. Here's the edit script for the above graph:

matched 11 chars at x=0 y=0
exchanged 4 chars at x=11 y=11
matched 25 chars at x=15 y=15
exchanged 3 chars at x=40 y=40
matched 1 chars at x=43 y=43

Downloadables

pdiff prints out an edit script but also generates a trace of moves taken in the edit graph if both strings are smaller than 100 bytes. It's free under the GPLv3 and is written in C++. It can also diff entire files.