- Newsgroups: comp.programming
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!rpi!batcomputer!cornell!karr
- From: karr@cs.cornell.edu (David Karr)
- Subject: Re: Locating duplicates
- Message-ID: <1993Jan7.021146.2290@cs.cornell.edu>
- Organization: Cornell Univ. CS Dept, Ithaca NY 14853
- References: <C07z6L.1Aq@world.std.com> <1993Jan5.043456.29130@micor.ocunix.on.ca> <1993Jan5.224141.21639@organpipe.uug.arizona.edu>
- Date: Thu, 7 Jan 1993 02:11:46 GMT
- Lines: 56
-
- In article <1993Jan5.224141.21639@organpipe.uug.arizona.edu> dave@cs.arizona.edu (Dave Schaumann) writes:
- >Hmmm. An even more robust algorithm would scan through the array to find
- >the difference between the largest and smallest elements, and then malloc()
- >an appropriate sized array. Of course, if malloc() returns NULL, you'll
- >have to fall back to some kind of in-place sorting scheme.
-
- How about this:
-
- Assume in the following that all integers are 32 bits. Let v be the
- array of integers in which we have to detect duplicates, and say v
- contains n entries.
- Allocate an array p of n integers and an array h of 512 integers, both
- initialized to all zeros.
-
- Consider the 32-bit integers to be partitioned into 512=2^9 "segments"
- of 2^23 contiguous numbers sharing the same leading 9 bits. Each
- entry in h is the index of the head of a list of entries in v (offset
- by 1 if this is in C, so that index 0 can indicate the empty list).
- The array p gives the "next" index for each item in a list (again
- offset by 1 if necessary).
-
- The first pass over the data will put all the values from v that belong
- to the i-th segment into the list h[i], for each i. This is done by
- scanning linearly once through the array; for each entry in v, examine
- the upper 9 bits and insert that value at the head of the corresponding
- list in h. This takes a small constant time for each of n entries.
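- In C, the first pass and the offset-by-1 list convention might look
- like the following sketch (the helper name `bucket` is mine, not
- from the article; h and p hold index+1 so that 0 can stand for the
- empty list or the end of a list):

```c
#include <stddef.h>
#include <stdint.h>

/* h[i] is index+1 of the head of segment i's list, or 0 if empty. */
size_t h[512];                      /* zero-initialized at file scope */

/* Pass 1: one linear scan, pushing each entry of v onto the list
 * for its leading 9 bits.  p must have n zeroed entries on entry;
 * p[k] becomes index+1 of the next list entry, or 0 at the end. */
void bucket(const uint32_t *v, size_t *p, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        size_t seg = v[k] >> 23;    /* upper 9 bits pick the segment */
        p[k] = h[seg];              /* old head becomes our successor */
        h[seg] = k + 1;             /* this entry is the new head */
    }
}
```

- Note that head insertion means each list ends up in reverse order
- of the entries' positions in v, which is harmless here.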
-
- Now run through the 512 entries of h. For each entry h[i], allocate a
- one-megabyte bit vector f (2^20 bytes = 2^23 bits) initialized to all
- zeros. Then scan the list headed by h[i]: for each value, read its
- lower 23 bits and set the corresponding bit in f, marking the entry
- in v as a duplicate if that bit was already set (we can put a
- distinguished value in the corresponding entry of p, for example,
- because we won't need to scan the list again). Deallocate f when the
- end of the list is reached. This phase takes a constant time to
- process each of the n entries in v, plus the time required to
- examine the 512 pointers in h.
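- Putting both passes together, a minimal C sketch might look like
- this. The function name and the separate dup array (instead of the
- distinguished value in p mentioned above) are my choices, and I skip
- the bit vector entirely for empty lists:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the two-pass scheme above.  dup[k] is set to 1 if v[k]'s
 * value occurs again later in v, else 0 (because the lists are walked
 * in reverse order of position in v, the last occurrence of each
 * value is the one left unflagged).  Returns -1 if an allocation
 * fails, 0 on success. */
int find_duplicates(const uint32_t *v, size_t n, unsigned char *dup)
{
    /* h[i] and p[k] hold index+1 so that 0 can mean empty/end. */
    size_t *p = calloc(n, sizeof *p);
    size_t *h = calloc(512, sizeof *h);
    if (!p || !h) { free(p); free(h); return -1; }

    /* Pass 1: thread each entry onto the list for its top 9 bits. */
    for (size_t k = 0; k < n; k++) {
        size_t seg = v[k] >> 23;
        p[k] = h[seg];
        h[seg] = k + 1;
    }

    /* Pass 2: one 2^23-bit vector per segment, reused serially. */
    for (size_t i = 0; i < 512; i++) {
        unsigned char *f;
        if (h[i] == 0)
            continue;                    /* skip empty lists */
        f = calloc(1u << 20, 1);         /* 2^20 bytes = 2^23 bits */
        if (!f) { free(p); free(h); return -1; }
        for (size_t k = h[i]; k != 0; k = p[k - 1]) {
            uint32_t low = v[k - 1] & ((1u << 23) - 1);
            unsigned char bit = 1u << (low & 7);
            dup[k - 1] = (f[low >> 3] & bit) != 0;
            f[low >> 3] |= bit;
        }
        free(f);
    }
    free(p);
    free(h);
    return 0;
}
```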
-
- And that's the end of the algorithm. It takes K*n + K' time, where K
- is the time to perform the small number of operations required for
- each entry of v in the two phases, and K' represents a small overhead
- for each of the 512 head pointers. That is, the algorithm is O(n),
- with a constant that I think compares reasonably well with algorithms
- that pre-scan v for min and max only and assume you can allocate an
- array large enough to index the entire range.
-
- A shortcoming of course is that the running-time analysis is valid
- only if you can repeatedly allocate large amounts of zeroed-out memory
- in a small (essentially constant) time. But pessimistic assumptions
- about memory initialization are bad for other index-array algorithms
- as well.
-
- Of course any of the constants I used (e.g. 2^20, 512) can be
- adjusted to suit practical considerations, perhaps even at runtime.
-
- -- David Karr (karr@cs.cornell.edu)
-