home *** CD-ROM | disk | FTP | other *** search
-
-
- Jan 13 13:00 1988 CDIFF.MEM Page 1
-
-
-
- Diff maintains all information needed to compare the two files in
- main memory. This means that very large files (or fairly large
- files with many differences) will cause the program to abort with
- an "out of space" error. Main memory requirements (in words) are
- approximately:
-
- 2 * (length of file1 + length of file2) + (3 * number of changes)
-
- The diff algorithm reads each file twice (once to build hash
- tables and a second time to check for fortuitous matches), then
- reads the differences by seeking randomly within the files. CPU
- time requirements include sorting the two hash vectors and
- randomly searching memory tables for equivalence classes. For
- example, running in Vax compatibility mode, two 1000 line files
- with a fair number of differences took about 25 seconds (elapsed
- wall clock time) for processing. Most of this time was spent in
- the file read routines. This test required slightly more than
- 6000 words of memory for internal tables.
-
- The diff algorithm was developed by J. W. Hunt and M. D. McIlroy,
- using a central algorithm defined by H. S. Stone. The algorithm
- was described in:
-
- Hunt, J. W., and McIlroy, M. D.,
- An Algorithm for Differential File Comparison,
- Computing Science Technical Report #41,
- Bell Laboratories, Murray Hill, NJ 07974
-
- The following description is summarized from that document. While
- it has been slightly modified to correspond to the program
- source, the algorithm is essentially identical.
-
- 1. Read the input files, building two vectors containing the line
- number (serial) and hash value (hash) of each line. Data for
- fileA will be in a vector pointed to by fileA[], while data for
- fileB will be pointed to by fileB[]. The lengths (number of
- lines) of the files will be represented by lenA and lenB
- respectively. [This is slightly different from the published
- algorithm.]
-
- 2. Note initial and final sequences that have identical hash
- values to shorten subsequent processing. Note that the "jackpot"
- phase (step 9.) will examine all lines in the file. Next, sort
- each file using hash as the primary key and serial as the
- secondary key.
-
- 3. Construct an array of equivalence classes (member[1..lenB])
- where each element contains the line number in fileB and a flag
- which is True if the indicated line is the first member of an
- equivalence class. (An equivalence class is a set of lines which
- all hash to the same value. The lines themselves are not
- necessarily identical.)
-
- 4. Construct an array, class[1..lenA], where each element, I, is
- set to the index of a line, J, in fileB if member[J] is the first
-
-
-
-
-
-
-
- Jan 13 13:00 1988 CDIFF.MEM Page 2
-
-
- element in an equivalence class and the hash code of line[I] in
- fileA is the same as the hash code of line[J] in fileB. Class[I]
- is set to zero if no such line exists.
-
- If non-zero, class[I] now points in member[] to the beginning of
- the class of lines in fileB equivalent to line[I] in fileA.
-
- The next two steps implement the longest common subsequence
- algorithm.
-
- 5. Structure CANDIDATE { a, b, previous }, where a and b are line
- numbers and previous a reference to a candidate, will store
- candidate lists as they are constructed.
-
- Vector clist[] stores references to candidates. It is dimensioned
- (0..min(lenA, lenB) + 1)
-
- Initialize
- clist[0] = CANDIDATE { 0, 0, -1 };
- clist[1] = CANDIDATE { A+1, B+1, -1 };
- ktop = 1;
-
- clist[1] is a fence beyond the last usefully filled element and
- -1 is an out-of-range clist index. Ktop is the index of the
- fence. Note, because of the memory allocation used, clist[] is
- actually composed of two vectors, clist[] containing the
- candidate reference, and klist[] containing pointers to clist.
-
- 6. For (A = 1 to lenA) {
- I = class[A]; -- the index in member[]: beginning of
- -- the class of lines in fileB equivalent
- -- to this line in fileA.
- if (I is non-zero) {
- Merge each member into the candidate list
- as discussed below.
- }
-
- Unravel the chain of candidates, getting a vector of common
- subsequences:
-
- 7. Set all elements of match[0..lenA] to zero.
-
- 8. clist[ktop-1] points to the subsequence chain head. For each
- element in the chain, let A and B be the line number entries.
- Then, set
-
- match[A] = B;
-
- The non-zero elements of match[] now pick out a longest common
- subsequence chain, possibly including spurious matches due to
- hash coincidences. The pairings between the two files are:
-
- if (match[A] is non-zero) {
- line A in fileA matches line match[A] in fileB;
- }
-
-
-
-
-
-
-
-
- Jan 13 13:00 1988 CDIFF.MEM Page 3
-
-
- Now, read each line of fileA and fileB to check for jackpots:
-
- 9. for (A = 1 to lenA) {
- if (match[A] is nonzero) {
- if (fileA[A] is not identical to fileB[match[A]])
- match[A] = 0; -- Hash congruence
- }
- }
-
- Ignoring "squish" complications, the merge step may be defined as
- follows:
-
- Entry:
- clist[] Candidate pointer array
- ktop Fence beyond last filled index
- A Current index in fileA
- member[] Equivalence vector
- I The index in member[] of the first element
- of the class of lines in fileB that are
- equivalent to line[A] in fileA.
-
- 1. Let clist[R] be "an r-candidate" and C a reference to the last
- candidate found, which will always be an r-candidate. clist[R]
- will be updated with this reference once the previous value of
- clist[R] is no longer needed. Initialize:
-
- R = 0; C = clist[0];
-
- 2. Do steps 3 through 6 repeatedly:
-
- 3. set B to the line number in member[I]; search clist[R..ktop]
- for an element, clist[S], such that
-
- clist[S-1].b < B and clist[S].b >= B
-
- Note that clist[] is ordered on clist[].b so that binary search
- will work. The search algorithm used requires the two "fence"
- entries described above.
-
- If such an element is found, perform steps 4. and 5.
-
- 4. Extend the longest common subsequence chain:
-
- If (clist[S].b > B) {
- clist[R] = C;
- R = S;
- C = candidate(A, B, clist[S - 1]);
- }
-
- 5. Extend the number of subsequences, moving the fence:
-
- If (S == ktop) {
- clist[ktop + 1] = clist[ktop]
- ktop = ktop + 1;
- break out of step 2's loop;
- }
-
-
-
-
-
-
-
- Jan 13 13:00 1988 CDIFF.MEM Page 4
-
-
-
- 6. I = I + 1;
- if (member[I] is the first element in another class) {
- break out of step 2's loop;
- }
- else {
- continue at step 2.
- }
-
- 7. clist[R] = C; exit merge subroutine.
-
- To illustrate vector contents, here is a sample run:
-
- File A:
- line 1
- line 2
- line 3
- line 4
- line 5 gets deleted
- line 6
-
- File B:
- line 1
- line 1.5 inserted
- line 2
- line 3 changed
- line 4
- line 6
-
- (For clarity, the "squish" step is omitted from the following)
-
- On entry to equiv() (after readin and sorting), the file[] vector
- is as follows (the first entry in each pair is the line number,
- the second is the hash value. Entries are sorted on hash value):
-
- FileA[] (1..lines in fileA):
- line hash
- 3 042400 6 043300 5 050026 1 102201 2 102701 4 103501
- FileB[] (1..lines in fileB):
- 6 043300 2 045600 1 102201 3 102701 5 103501 4 147166
-
-
- After Equiv has processed file[]:
-
- FileA[] (1..lines in fileA):
- line value
- 3 0 6 1 5 0 1 3 2 4 4 5
- Member[] (0..lines in fileB)
- 0 -6 -2 -1 -3 -5 -4
-
- After unsort() has unwound fileB:
-
- Class[] (1 .. lines in fileA):
- 3 4 0 5 0 1
-
- Within unravel(), match is built in the following order:
-
-
-
-
-
-
-
- Jan 13 13:00 1988 CDIFF.MEM Page 5
-
-
-
- match[6] := 6
- match[4] := 5
- match[2] := 3
- match[1] := 1
-
- Match[] (0 .. lines in fileA):
-
- 0 1 3 0 5 0 6
-
- Output is as follows:
-
- 1a2
- > line 1.5 inserted
- 3c4
- < line 3
- ---
- > line 3 changed
- 5d5
- < line 5 gets deleted
-
- ********************************************************************
-
- /*
- * s t r e q . c
- */
-
-
- String Equality Test
- String equality test
-
- Synopsis:
- streq(a, b);
- char *a;
- char *b;
-
- Description:
-
- Return TRUE if the strings are equal.
-
- Bugs
-
- ***************************************************************
-
- /*
- * e r r o r . c
- */
-
-
- Fatal Error Exit
-
- Synopsis:
-
- _error()
-
- error(format, args)
-
-
-
-
-
-
-
- Jan 13 13:00 1988 CDIFF.MEM Page 6
-
-
- char *format;
-
- Documentation:
-
- Fatal error exits. _error() halts, error() prints something
- on stderr and then halts.
-
- Bugs:
-
- THIS DOES NOT WORK ON MANY SYSTEMS DUE TO EXTREMLY NON-PORTABLE CODE.
- Why oh why can't people learn to use varargs properly? This code will
- blow up on OSK. Fortunatly, it isn't used often...
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-