| /* |
| * Copyright 2010 Google Inc. |
| * |
| * Licensed under the Apache License, Version 2.0 (the "License"); you may not |
| * use this file except in compliance with the License. You may obtain a copy of |
| * the License at |
| * |
| * http://www.apache.org/licenses/LICENSE-2.0 |
| * |
| * Unless required by applicable law or agreed to in writing, software |
| * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT |
| * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the |
| * License for the specific language governing permissions and limitations under |
| * the License. |
| */ |
| package com.google.gwt.dev.util.editdistance; |
| |
| /** |
| * A modified version of a string edit distance described by Berghel and |
| * Roach that uses only O(d) space and O(n*d) worst-case time, where n is |
| * the pattern string length and d is the edit distance computed. |
| * We achieve the space reduction by keeping only those sub-computations |
| * required to compute edit distance, giving up the ability to |
| * reconstruct the edit path. |
| */ |
| public class ModifiedBerghelRoachEditDistance implements GeneralEditDistance { |
| /* |
| * This is a modification of the original Berghel-Roach edit |
| * distance (based on prior work by Ukkonen) described in |
| * ACM Transactions on Information Systems, Vol. 14, No. 1, |
| * January 1996, pages 94-106. |
| * |
| * I observed that only O(d) prior computations are required |
| * to compute edit distance. Rather than keeping all prior |
| * f(k,p) results in a matrix, we keep only the two "outer edges" |
| * in the triangular computation pattern that will be used in |
| * subsequent rounds. We cannot reconstruct the edit path, |
| * but many applications do not require that; for them, this |
| * modification uses less space (and empirically, slightly |
| * less time). |
| * |
| * First, some history behind the algorithm necessary to understand |
| * Berghel-Roach and our modification... |
| * |
| * The traditional algorithm for edit distance uses dynamic programming, |
| * building a matrix of distances for substrings: |
| * D[i,j] holds the distance for string1[0..i]=>string2[0..j]. |
| * The matrix is initially populated with the trivial values |
| * D[0,j]=j and D[i,0]=i; and then expanded with the rule: |
| * <pre> |
| * D[i,j] = min( D[i-1,j]+1, // insertion |
| * D[i,j-1]+1, // deletion |
| * (D[i-1,j-1] |
| * + (string1[i]==string2[j]) |
| * ? 0 // match |
| * : 1 // substitution ) ) |
| * </pre> |
| * |
| * Ukknonen observed that each diagonal of the matrix must increase |
| * by either 0 or 1 from row to row. If D[i,j] = p, then the |
| * then the matching rule requires that D[i+x,j+x] = p for all |
| * x where string1[i..i+x) matches string2[j..j+j+x). Ukkonen |
| * defined a function f(k,p) as the highest row number in which p |
| * appears on the k-th diagonal (those D[i,j] where k=(i-j), noting |
| * that k may be negative). The final result of the edit |
| * distance is the D[n,m] cell, on the (n-m) diagonal; it is |
| * the value of p for which f(n-m, p) = m. The function f can |
| * also be computed dynamically, according to a simple recursion: |
| * <pre> |
| * f(k,p) { |
| * contains_p = max(f(k-1,p-1), f(k,p-1)+1, f(k+1,p-1)+1) |
| * while (string1[contains_p] == string2[contains_p + k]) |
| * contains_p++; |
| * return contains_p; |
| * } |
| * </pre> |
| * The max() expression finds a row where the k-th diagonal must |
| * contain p by virtue of an edit from the prior, same, or following |
| * diagonal (corresponding to an insert, substitute, or delete); |
| * we need not consider more distant diagonals because row-to-row |
| * and column-to-column changes are at most +/- 1. |
| * |
| * The original Ukkonen algorithm computed f(k,p) roughly as |
| * follows: |
| * <pre> |
| * for (p = 0; ; p++) { |
| * compute f(k,p) for all valid k |
| * if (f(n-m, p) == m) return p; |
| * } |
| * </pre> |
| * |
| * Berghel and Roach observed that many values of f(k,p) are |
| * computed unnecessarily, and reorganized the computation into |
| * a just-in-time sequence. In each iteration, we are primarily |
| * interested in the terminating value f(main,p), where main=(n-m) |
| * is the main diagonal. To compute that we need f(x,p-1) for |
| * three values of x: main-1, main, and main+1. Those depend on |
| * values for p-2, and so forth. We will already have computed |
| * f(main,p-1) in the prior round, and thus f(main-1,p-2) and |
| * f(main+1,p-2), and so forth. The only new values we need to compute |
| * are on the edges: f(main-i,p-i) and f(main+i,p-i). Noting that |
| * f(k,p) is only meaningful when abs(k) is no greater than p, |
| * one of the Berghel-Roach reviewers noted that we can compute |
| * the bounds for i: |
| * <pre> |
| * (main+i &le p-i) implies (i ≤ (p-main)/2) |
| * </pre> |
| * (where main+i is limited on the positive side) and similarly |
| * <pre> |
| * (-(main-i) &le p-i) implies (i ≤ (p+main)/2). |
| * </pre> |
| * (where main-i is limited on the negative side). |
| * |
| * This reduces the computation sequence to |
| * <pre> |
| * for (i = (p-main)/2; i > 0; i--) compute f(main+i,p-i); |
| * for (i = (p+main)/2; i > 0; i--) compute f(main-i,p-i); |
| * if (f(main, p) == m) return p; |
| * </pre> |
| * |
| * The original Berghel-Roach algorithm recorded prior values |
| * of f(k,p) in a matrix, using O(distance^2) space, enabling |
| * reconstruction of the edit path, but if all we want is the |
| * edit *distance*, we only need to keep O(distance) prior computations. |
| * |
| * The requisite prior k-1, k, and k+1 values are conveniently |
| * computed in the current round and the two preceding it. |
| * For example, on the higher-diagonal side, we compute: |
| * <pre> |
| * current[i] = f(main+i, p-i) |
| * </pre> |
| * We keep the two prior rounds of results, where p was one and two |
| * smaller. So, from the preceidng round |
| * <pre> |
| * last[i] = f(main+i, (p-1)-i) |
| * </pre> |
| * and from the prior round, but one position back: |
| * <pre> |
| * prior[i-1] = f(main+(i-1), (p-2)-(i-1)) |
| * </pre> |
| * In the current round, one iteration earlier: |
| * <pre> |
| * current[i+1] = f(main+(i+1), p-(i+1)) |
| * </pre> |
| * Note that the distance in all of these evaluates to p-i-1, |
| * and the diagonals are (main+i) and its neighbors... just |
| * what we need. The lower-diagonal side behaves similarly. |
| * |
| * We need to materialize values that are not computed in prior |
| * rounds, for either of two reasons: <ul> |
| * <li> Initially, we have no prior rounds, so we need to fill |
| * all of the "last" and "prior" values for use in the |
| * first round. The first round uses only on one side |
| * of the main diagonal or the other. |
| * <li> In every other round, we compute one more diagonal than before. |
| * </ul> |
| * In all of these cases, the missing f(k,p) values are for abs(k) > p, |
| * where a real value of f(k,p) is undefined. [The original Berghel-Roach |
| * algorithm prefills its F matrix with these values, but we fill |
| * them as we go, as needed.] We define |
| * <pre> |
| * f(-p-1,p) = p, so that we start diagonal -p with row p, |
| * f(p+1,p) = -1, so that we start diagonal p with row 0. |
| * </pre> |
| * (We also allow f(p+2,p)=f(-p-2,p)=-1, causing those values to |
| * have no effect in the starting row computation.] |
| * |
| * We only expand the set of diagonals visited every other round, |
| * when (p-main) or (p+main) is even. We keep track of even/oddness |
| * to save some arithmetic. The first round is always even, as p=abs(main). |
| * Note that we rename the "f" function to "computeRow" to be Googley. |
| */ |
| |
| private static final int [] EMPTY_INT_ARRAY = new int[0]; |
| |
| /** |
| * Creates an instance for computing edit distance from {@code pattern}. |
| * @param pattern string from which distances are measured |
| * @return an instance for computing distances from the pattern |
| */ |
| public static ModifiedBerghelRoachEditDistance |
| getInstance(CharSequence pattern) { |
| return getInstance(pattern.toString()); |
| } |
| |
| /** |
| * Creates an instance for computing edit distance from {@code pattern}. |
| * @param pattern string from which distances are measured |
| * @return an instance for computing distances from the pattern |
| */ |
| public static ModifiedBerghelRoachEditDistance |
| getInstance(String pattern) { |
| return new ModifiedBerghelRoachEditDistance(pattern.toCharArray()); |
| } |
| |
| /* |
| * The current and two preceding sets of Ukkonen f(k,p) values for diagonals |
| * around the main, computed by the main loop of {@code getDistance}. These |
| * arrays are retained between calls to save allocation costs. (They are all |
| * initialized to a real array so that we can indiscriminately use length |
| * when ensuring/resizing.) |
| */ |
| private int[] currentLeft = EMPTY_INT_ARRAY; |
| |
| private int[] currentRight = EMPTY_INT_ARRAY; |
| |
| private int[] lastLeft = EMPTY_INT_ARRAY; |
| |
| private int[] lastRight = EMPTY_INT_ARRAY; |
| |
| /** |
| * The "pattern" string against which others are compared. |
| */ |
| private final char[] pattern; |
| |
| private int[] priorLeft = EMPTY_INT_ARRAY; |
| |
| private int[] priorRight = EMPTY_INT_ARRAY; |
| |
| private ModifiedBerghelRoachEditDistance(char[] pattern) { |
| this.pattern = pattern; |
| } |
| |
| public ModifiedBerghelRoachEditDistance duplicate() { |
| return new ModifiedBerghelRoachEditDistance(pattern); |
| } |
| |
| public int getDistance(CharSequence targetSequence, int limit) { |
| final int targetLength = targetSequence.length(); |
| |
| /* |
| * Compute the main diagonal number. |
| * The final result lies on this diagonal. |
| */ |
| final int main = pattern.length - targetLength; |
| |
| /* |
| * Compute our initial distance candidate. |
| * The result cannot be less than the difference in |
| * string lengths, so we start there. |
| */ |
| int distance = Math.abs(main); |
| if (distance > limit) { |
| /* More than we wanted. Give up right away */ |
| return Integer.MAX_VALUE; |
| } |
| |
| /* Snag a copy of the second string for efficiency */ |
| final char[] target = new char[targetLength]; |
| for (int i = 0; i < targetLength; i++) { |
| target[i] = targetSequence.charAt(i); |
| } |
| |
| /* |
| * In the main loop below, the current{Right,Left} arrays record results |
| * from the current outer loop pass. The last{Right,Left} and |
| * prior{Right,Left} arrays hold the results from the preceding two passes. |
| * At the end of the outer loop, we shift them around (reusing the prior |
| * array as the current for the next round, to avoid reallocating). |
| * The Right reflects higher-numbered diagonals, Left lower-numbered. |
| */ |
| |
| /* |
| * Fill in "prior" values for the first two passes through |
| * the distance loop. Note that we will execute only one side of |
| * the main diagonal in these passes, so we only need |
| * initialize one side of prior values. |
| */ |
| |
| if (main <= 0) { |
| ensureCapacityRight(distance, false); |
| for (int j = 0; j <= distance; j++) { |
| lastRight[j] = distance - j - 1; /* Make diagonal -k start in row k */ |
| priorRight[j] = -1; |
| } |
| } else { |
| ensureCapacityLeft(distance, false); |
| for (int j = 0; j <= distance; j++) { |
| lastLeft[j] = -1; /* Make diagonal +k start in row 0 */ |
| priorLeft[j] = -1; |
| } |
| } |
| |
| /* |
| * Keep track of even rounds. Only those rounds consider new diagonals, |
| * and thus only they require artificial "last" values below. |
| */ |
| boolean even = true; |
| |
| /* |
| * MAIN LOOP: try each successive possible distance until one succeeds. |
| */ |
| while (true) { |
| /* |
| * Before calling computeRow(main, distance), we need to fill in |
| * missing cache elements. See the high-level description above. |
| */ |
| |
| /* |
| * Higher-numbered diagonals |
| */ |
| |
| int offDiagonal = (distance - main) / 2; |
| ensureCapacityRight(offDiagonal, true); |
| |
| if (even) { |
| /* Higher diagonals start at row 0 */ |
| lastRight[offDiagonal] = -1; |
| } |
| |
| int immediateRight = -1; |
| for (; offDiagonal > 0; offDiagonal--) { |
| currentRight[offDiagonal] = immediateRight = computeRow( |
| (main + offDiagonal), |
| (distance - offDiagonal), |
| pattern, |
| target, |
| priorRight[offDiagonal - 1], |
| lastRight[offDiagonal], |
| immediateRight); |
| } |
| |
| /* |
| * Lower-numbered diagonals |
| */ |
| |
| offDiagonal = (distance + main) / 2; |
| ensureCapacityLeft(offDiagonal, true); |
| |
| if (even) { |
| /* Lower diagonals, fictitious values for f(-x-1,x) = x */ |
| lastLeft[offDiagonal] = (distance - main) / 2 - 1; |
| } |
| |
| int immediateLeft = even ? -1 : (distance - main) / 2; |
| |
| for (; offDiagonal > 0; offDiagonal--) { |
| currentLeft[offDiagonal] = immediateLeft = computeRow( |
| (main - offDiagonal), |
| (distance - offDiagonal), |
| pattern, target, |
| immediateLeft, |
| lastLeft[offDiagonal], |
| priorLeft[offDiagonal - 1]); |
| } |
| |
| /* |
| * We are done if the main diagonal has distance in the last row. |
| */ |
| int mainRow = computeRow(main, distance, pattern, target, |
| immediateLeft, lastLeft[0], immediateRight); |
| |
| if ((mainRow == targetLength) || (++distance > limit) || (distance < 0)) { |
| break; |
| } |
| |
| /* The [0] element goes to both sides. */ |
| currentLeft[0] = currentRight[0] = mainRow; |
| |
| /* Rotate rows around for next round: current=>last=>prior (=>current) */ |
| int[] tmp = priorLeft; |
| priorLeft = lastLeft; |
| lastLeft = currentLeft; |
| currentLeft = priorLeft; |
| |
| tmp = priorRight; |
| priorRight = lastRight; |
| lastRight = currentRight; |
| currentRight = tmp; |
| |
| /* Update evenness, too */ |
| even = !even; |
| } |
| |
| return distance; |
| } |
| |
| /** |
| * Computes the highest row in which the distance {@code p} appears |
| * in diagonal {@code k} of the edit distance computation for |
| * strings {@code a} and {@code b}. The diagonal number is |
| * represented by the difference in the indices for the two strings; |
| * it can range from {@code -b.length()} through {@code a.length()}. |
| * |
| * More precisely, this computes the highest value x such that |
| * <pre> |
| * p = edit-distance(a[0:(x+k)), b[0:x)). |
| * </pre> |
| * |
| * This is the "f" function described by Ukkonen. |
| * |
| * The caller must assure that abs(k) ≤ p, the only values for |
| * which this is well-defined. |
| * |
| * The implementation depends on the cached results of prior |
| * computeRow calls for diagonals k-1, k, and k+1 for distance p-1. |
| * These must be supplied in {@code knownLeft}, {@code knownAbove}, |
| * and {@code knownRight}, respectively. |
| * @param k diagonal number |
| * @param p edit distance |
| * @param a one string to be compared |
| * @param b other string to be compared |
| * @param knownLeft value of {@code computeRow(k-1, p-1, ...)} |
| * @param knownAbove value of {@code computeRow(k, p-1, ...)} |
| * @param knownRight value of {@code computeRow(k+1, p-1, ...)} |
| */ |
| private int computeRow(int k, int p, char[] a, char[] b, |
| int knownLeft, int knownAbove, int knownRight) { |
| assert (Math.abs(k) <= p); |
| assert (p >= 0); |
| |
| /* |
| * Compute our starting point using the recurrance. |
| * That is, find the first row where the desired edit distance |
| * appears in our diagonal. This is at least one past |
| * the highest row for |
| */ |
| int t; |
| if (p == 0) { |
| t = 0; |
| } else { |
| /* |
| * We look at the adjacent diagonals for the next lower edit distance. |
| * We can start in the next row after the prior result from |
| * our own diagonal (the "substitute" case), or the next diagonal |
| * ("delete"), but only the same row as the prior result from |
| * the prior diagonal ("insert"). |
| */ |
| t = Math.max(Math.max(knownAbove, knownRight) + 1, knownLeft); |
| } |
| |
| /* |
| * Look down our diagonal for matches to find the maximum |
| * row with edit-distance p. |
| */ |
| int tmax = Math.min(b.length, (a.length - k)); |
| |
| while ((t < tmax) && b[t] == a[t + k]) { |
| t++; |
| } |
| |
| return t; |
| } |
| |
| /* |
| * Ensures that the Left arrays can be indexed through {@code index}, |
| * inclusively, resizing (and copying) as necessary. |
| */ |
| private void ensureCapacityLeft(int index, boolean copy) { |
| if (currentLeft.length <= index) { |
| index++; |
| priorLeft = resize(priorLeft, index, copy); |
| lastLeft = resize(lastLeft, index, copy); |
| currentLeft = resize(currentLeft, index, false); |
| } |
| } |
| |
| /* |
| * Ensures that the Right arrays can be indexed through {@code index}, |
| * inclusively, resizing (and copying) as necessary. |
| */ |
| private void ensureCapacityRight(int index, boolean copy) { |
| if (currentRight.length <= index) { |
| index++; |
| priorRight = resize(priorRight, index, copy); |
| lastRight = resize(lastRight, index, copy); |
| currentRight = resize(currentRight, index, false); |
| } |
| } |
| |
| /* Resize an array, copying old contents if requested */ |
| private int[] resize(int[] array, int size, boolean copy) { |
| int[] result = new int[size]; |
| if (copy) { |
| System.arraycopy(array, 0, result, 0, array.length); |
| } |
| return result; |
| } |
| } |