home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Frozen Fish 1: Amiga
/
FrozenFish-Apr94.iso
/
bbs
/
alib
/
d1xx
/
d142
/
diff.lha
/
Diff
/
tech.doc
< prev
next >
Wrap
Text File
|
1988-05-22
|
8KB
|
290 lines
Jan 13 13:00 1988 DIFF.MEM Page 1
Diff maintains all information needed to compare the two files in
main memory. This means that very large files (or fairly large
files with many differences) will cause the program to abort with
an "out of space" error. Main memory requirements (in words) are
approximately:
2 * (length of file1 + length of file2) + (3 * number of changes)
The diff algorithm reads each file twice (once to build hash
tables and a second time to check for fortuitous matches), then
reads the differences by seeking randomly within the files. CPU
time requirements include sorting the two hash vectors and
randomly searching memory tables for equivalence classes. For
example, running in Vax compatibility mode, two 1000 line files
with a fair number of differences took about 25 seconds (elapsed
wall clock time) for processing. Most of this time was spent in
the file read routines. This test required slightly more than
6000 words of memory for internal tables.
The diff algorithm was developed by J. W. Hunt and M. D. McIlroy,
using a central algorithm defined by H. S. Stone. The algorithm
was described in:
Hunt, J. W., and McIlroy, M. D.,
An Algorithm for Differential File Comparison,
Computing Science Technical Report #41,
Bell Laboratories, Murray Hill, NJ 07974
The following description is summarized from that document. While
it has been slightly modified to correspond to the program
source, the algorithm is essentially identical.
1. Read the input files, building two vectors containing the line
number (serial) and hash value (hash) of each line. Data for
fileA will be in a vector pointed to by fileA[], while data for
fileB will be pointed to by fileB[]. The lengths (number of
lines) of the files will be represented by lenA and lenB
respectively. [This is slightly different from the published
algorithm.]
2. Note initial and final sequences that have identical hash
values to shorten subsequent processing. Note that the "jackpot"
phase (step 9.) will examine all lines in the file. Next, sort
each file using hash as the primary key and serial as the
secondary key.
3. Construct an array of equivalence classes (member[1..lenB])
where each element contains the line number in fileB and a flag
which is True if the indicated line is the first member of an
equivalence class. (An equivalence class is a set of lines which
all hash to the same value. The lines themselves are not
necessarily identical.)
4. Construct an array, class[1..lenA], where each element, I, is
set to the index of a line, J, in fileB if member[J] is the first
Jan 13 13:00 1988 DIFF.MEM Page 2
element in an equivalence class and the hash code of line[I] in
fileA is the same as the hash code of line[J] in fileB. Class[I]
is set to zero if no such line exists.
If non-zero, class[I] now points in member[] to the beginning of
the class of lines in fileB equivalent to line[I] in fileA.
The next two steps implement the longest common subsequence
algorithm.
5. Structure CANDIDATE { a, b, previous }, where a and b are line
numbers and previous a reference to a candidate, will store
candidate lists as they are constructed.
Vector clist[] stores references to candidates. It is dimensioned
(0..min(lenA, lenB) + 1)
Initialize
clist[0] = CANDIDATE { 0, 0, -1 };
clist[1] = CANDIDATE { A+1, B+1, -1 };
ktop = 1;
clist[1] is a fence beyond the last usefully filled element and
-1 is an out-of-range clist index. Ktop is the index of the
fence. Note, because of the memory allocation used, clist[] is
actually composed of two vectors, clist[] containing the
candidate reference, and klist[] containing pointers to clist.
6. For (A = 1 to lenA) {
I = class[A]; -- the index in member[]: beginning of
-- the class of lines in fileB equivalent
-- to this line in fileA.
if (I is non-zero) {
Merge each member into the candidate list
as discussed below.
}
Unravel the chain of candidates, getting a vector of common
subsequences:
7. Set all elements of match[0..lenA] to zero.
8. clist[ktop-1] points to the subsequence chain head. For each
element in the chain, let A and B be the line number entries.
Then, set
match[A] = B;
The non-zero elements of match[] now pick out a longest common
subsequence chain, possibly including spurious matches due to
hash coincidences. The pairings between the two files are:
if (match[A] is non-zero) {
line A in fileA matches line match[A] in fileB;
}
Jan 13 13:00 1988 DIFF.MEM Page 3
Now, read each line of fileA and fileB to check for jackpots:
9. for (A = 1 to lenA) {
if (match[A] is nonzero) {
if (fileA[A] is not identical to fileB[match[A]])
match[A] = 0; -- Hash congruence
}
}
Ignoring "squish" complications, the merge step may be defined as
follows:
Entry:
clist[] Candidate pointer array
ktop Fence beyond last filled index
A Current index in fileA
member[] Equivalence vector
I The index in member[] of the first element
of the class of lines in fileB that are
equivalent to line[A] in fileA.
1. Let clist[R] be "an r-candidate" and C a reference to the last
candidate found, which will always be an r-candidate. clist[R]
will be updated with this reference once the previous value of
clist[R] is no longer needed. Initialize:
R = 0; C = clist[0];
2. Do steps 3 through 6 repeatedly:
3. set B to the line number in member[I]; search clist[R..ktop]
for an element, clist[S], such that
clist[S-1].b < B and clist[S].b >= B
Note that clist[] is ordered on clist[].b so that binary search
will work. The search algorithm used requires the two "fence"
entries described above.
If such an element is found, perform steps 4. and 5.
4. Extend the longest common subsequence chain:
If (clist[S].b > B) {
clist[R] = C;
R = S;
C = candidate(A, B, clist[S - 1]);
}
5. Extend the number of subsequences, moving the fence:
If (S == ktop) {
clist[ktop + 1] = clist[ktop]
ktop = ktop + 1;
break out of step 2's loop;
}
Jan 13 13:00 1988 DIFF.MEM Page 4
6. I = I + 1;
if (member[I] is the first element in another class) {
break out of step 2's loop;
}
else {
continue at step 2.
}
7. clist[R] = C; exit merge subroutine.
To illustrate vector contents, here is a sample run:
File A:
line 1
line 2
line 3
line 4
line 5 gets deleted
line 6
File B:
line 1
line 1.5 inserted
line 2
line 3 changed
line 4
line 6
(For clarity, the "squish" step is omitted from the following)
On entry to equiv() (after readin and sorting), the file[] vector
is as follows (the first entry in each pair is the line number,
the second is the hash value. Entries are sorted on hash value):
FileA[] (1..lines in fileA):
line hash
3 042400 6 043300 5 050026 1 102201 2 102701 4 103501
FileB[] (1..lines in fileB):
6 043300 2 045600 1 102201 3 102701 5 103501 4 147166
After Equiv has processed file[]:
FileA[] (1..lines in fileA):
line value
3 0 6 1 5 0 1 3 2 4 4 5
Member[] (0..lines in fileB)
0 -6 -2 -1 -3 -5 -4
After unsort() has unwound fileB:
Class[] (1 .. lines in fileA):
3 4 0 5 0 1
Within unravel(), match is built in the following order:
Jan 13 13:00 1988 DIFF.MEM Page 5
match[6] := 6
match[4] := 5
match[2] := 3
match[1] := 1
Match[] (0 .. lines in fileA):
0 1 3 0 5 0 6
Output is as follows:
1a2
> line 1.5 inserted
3c4
< line 3
---
> line 3 changed
5d5
< line 5 gets deleted