Usenet 1994 October

home *** CD-ROM | disk | FTP | other *** search

/ Usenet 1994 October / usenetsourcesnewsgroupsinfomagicoctober1994disk2.iso / misc / volume2 / pd-diff-v2 / diff.mem < prev next >

Wrap

Text File | 1991-08-07 | 9.1 KB | 397 lines

Jan 13 13:00 1988 CDIFF.MEM Page 1 Diff maintains all information needed to compare the two files in main memory. This means that very large files (or fairly large files with many differences) will cause the program to abort with an "out of space" error. Main memory requirements (in words) are approximately: 2 * (length of file1 + length of file2) + (3 * number of changes) The diff algorithm reads each file twice (once to build hash tables and a second time to check for fortuitous matches), then reads the differences by seeking randomly within the files. CPU time requirements include sorting the two hash vectors and randomly searching memory tables for equivalence classes. For example, running in Vax compatibility mode, two 1000 line files with a fair number of differences took about 25 seconds (elapsed wall clock time) for processing. Most of this time was spent in the file read routines. This test required slightly more than 6000 words of memory for internal tables. The diff algorithm was developed by J. W. Hunt and M. D. McIlroy, using a central algorithm defined by H. S. Stone. The algorithm was described in: Hunt, J. W., and McIlroy, M. D., An Algorithm for Differential File Comparison, Computing Science Technical Report #41, Bell Laboratories, Murray Hill, NJ 07974 The following description is summarized from that document. While it has been slightly modified to correspond to the program source, the algorithm is essentially identical. 1. Read the input files, building two vectors containing the line number (serial) and hash value (hash) of each line. Data for fileA will be in a vector pointed to by fileA[], while data for fileB will be pointed to by fileB[]. The lengths (number of lines) of the files will be represented by lenA and lenB respectively. [This is slightly different from the published algorithm.] 2. Note initial and final sequences that have identical hash values to shorten subsequent processing. Note that the "jackpot" phase (step 9.) will examine all lines in the file. Next, sort each file using hash as the primary key and serial as the secondary key. 3. Construct an array of equivalence classes (member[1..lenB]) where each element contains the line number in fileB and a flag which is True if the indicated line is the first member of an equivalence class. (An equivalence class is a set of lines which all hash to the same value. The lines themselves are not necessarily identical.) 4. Construct an array, class[1..lenA], where each element, I, is set to the index of a line, J, in fileB if member[J] is the first Jan 13 13:00 1988 CDIFF.MEM Page 2 element in an equivalence class and the hash code of line[I] in fileA is the same as the hash code of line[J] in fileB. Class[I] is set to zero if no such line exists. If non-zero, class[I] now points in member[] to the beginning of the class of lines in fileB equivalent to line[I] in fileA. The next two steps implement the longest common subsequence algorithm. 5. Structure CANDIDATE { a, b, previous }, where a and b are line numbers and previous a reference to a candidate, will store candidate lists as they are constructed. Vector clist[] stores references to candidates. It is dimensioned (0..min(lenA, lenB) + 1) Initialize clist[0] = CANDIDATE { 0, 0, -1 }; clist[1] = CANDIDATE { A+1, B+1, -1 }; ktop = 1; clist[1] is a fence beyond the last usefully filled element and -1 is an out-of-range clist index. Ktop is the index of the fence. Note, because of the memory allocation used, clist[] is actually composed of two vectors, clist[] containing the candidate reference, and klist[] containing pointers to clist. 6. For (A = 1 to lenA) { I = class[A]; -- the index in member[]: beginning of -- the class of lines in fileB equivalent -- to this line in fileA. if (I is non-zero) { Merge each member into the candidate list as discussed below. } Unravel the chain of candidates, getting a vector of common subsequences: 7. Set all elements of match[0..lenA] to zero. 8. clist[ktop-1] points to the subsequence chain head. For each element in the chain, let A and B be the line number entries. Then, set match[A] = B; The non-zero elements of match[] now pick out a longest common subsequence chain, possibly including spurious matches due to hash coincidences. The pairings between the two files are: if (match[A] is non-zero) { line A in fileA matches line match[A] in fileB; } Jan 13 13:00 1988 CDIFF.MEM Page 3 Now, read each line of fileA and fileB to check for jackpots: 9. for (A = 1 to lenA) { if (match[A] is nonzero) { if (fileA[A] is not identical to fileB[match[A]]) match[A] = 0; -- Hash congruence } } Ignoring "squish" complications, the merge step may be defined as follows: Entry: clist[] Candidate pointer array ktop Fence beyond last filled index A Current index in fileA member[] Equivalence vector I The index in member[] of the first element of the class of lines in fileB that are equivalent to line[A] in fileA. 1. Let clist[R] be "an r-candidate" and C a reference to the last candidate found, which will always be an r-candidate. clist[R] will be updated with this reference once the previous value of clist[R] is no longer needed. Initialize: R = 0; C = clist[0]; 2. Do steps 3 through 6 repeatedly: 3. set B to the line number in member[I]; search clist[R..ktop] for an element, clist[S], such that clist[S-1].b < B and clist[S].b >= B Note that clist[] is ordered on clist[].b so that binary search will work. The search algorithm used requires the two "fence" entries described above. If such an element is found, perform steps 4. and 5. 4. Extend the longest common subsequence chain: If (clist[S].b > B) { clist[R] = C; R = S; C = candidate(A, B, clist[S - 1]); } 5. Extend the number of subsequences, moving the fence: If (S == ktop) { clist[ktop + 1] = clist[ktop] ktop = ktop + 1; break out of step 2's loop; } Jan 13 13:00 1988 CDIFF.MEM Page 4 6. I = I + 1; if (member[I] is the first element in another class) { break out of step 2's loop; } else { continue at step 2. } 7. clist[R] = C; exit merge subroutine. To illustrate vector contents, here is a sample run: File A: line 1 line 2 line 3 line 4 line 5 gets deleted line 6 File B: line 1 line 1.5 inserted line 2 line 3 changed line 4 line 6 (For clarity, the "squish" step is omitted from the following) On entry to equiv() (after readin and sorting), the file[] vector is as follows (the first entry in each pair is the line number, the second is the hash value. Entries are sorted on hash value): FileA[] (1..lines in fileA): line hash 3 042400 6 043300 5 050026 1 102201 2 102701 4 103501 FileB[] (1..lines in fileB): 6 043300 2 045600 1 102201 3 102701 5 103501 4 147166 After Equiv has processed file[]: FileA[] (1..lines in fileA): line value 3 0 6 1 5 0 1 3 2 4 4 5 Member[] (0..lines in fileB) 0 -6 -2 -1 -3 -5 -4 After unsort() has unwound fileB: Class[] (1 .. lines in fileA): 3 4 0 5 0 1 Within unravel(), match is built in the following order: Jan 13 13:00 1988 CDIFF.MEM Page 5 match[6] := 6 match[4] := 5 match[2] := 3 match[1] := 1 Match[] (0 .. lines in fileA): 0 1 3 0 5 0 6 Output is as follows: 1a2 > line 1.5 inserted 3c4 < line 3 --- > line 3 changed 5d5 < line 5 gets deleted ******************************************************************** /* * s t r e q . c */ String Equality Test String equality test Synopsis: streq(a, b); char *a; char *b; Description: Return TRUE if the strings are equal. Bugs *************************************************************** /* * e r r o r . c */ Fatal Error Exit Synopsis: _error() error(format, args) Jan 13 13:00 1988 CDIFF.MEM Page 6 char *format; Documentation: Fatal error exits. _error() halts, error() prints something on stderr and then halts. Bugs: THIS DOES NOT WORK ON MANY SYSTEMS DUE TO EXTREMLY NON-PORTABLE CODE. Why oh why can't people learn to use varargs properly? This code will blow up on OSK. Fortunatly, it isn't used often...