- Newsgroups: comp.programming
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!rpi!batcomputer!cornell!karr
- From: karr@cs.cornell.edu (David Karr)
- Subject: Re: Locating duplicates
- Message-ID: <1993Jan7.021146.2290@cs.cornell.edu>
- Organization: Cornell Univ. CS Dept, Ithaca NY 14853
- References: <C07z6L.1Aq@world.std.com> <1993Jan5.043456.29130@micor.ocunix.on.ca> <1993Jan5.224141.21639@organpipe.uug.arizona.edu>
- Date: Thu, 7 Jan 1993 02:11:46 GMT
- Lines: 56
-
- In article <1993Jan5.224141.21639@organpipe.uug.arizona.edu> dave@cs.arizona.edu (Dave Schaumann) writes:
- >Hmmm. An even more robust algorithm would scan through the array to find
- >the difference between the largest and smallest elements, and then malloc()
- >an appropriate sized array. Of course, if malloc() returns NULL, you'll
- >have to fall back to some kind of in-place sorting scheme.
-
- How about this:
-
- Assume in the following that all integers are 32 bits. Let v be the
- array of integers in which we have to detect duplicates, and say v
- contains n entries.
- Allocate an array p of n integers and an array h of 512 integers, both
- initialized to all zeros.
-
- Consider the 32-bit integers to be partitioned into 512=2^9 "segments"
- of 2^23 contiguous numbers sharing the same leading 9 bits. Each
- entry in h is the index of the head of a list of entries in v (offset
- by 1 if this is in C, so that index 0 can indicate the empty list).
- The array p gives the "next" index for each item in a list (again
- offset by 1 if necessary).
-
- The first pass over the data will put all the values from v that belong
- to the i-th segment into the list h[i], for each i. This is done by
- scanning linearly once through the array; for each entry in v, examine
- the upper 9 bits and insert that value at the head of the corresponding
- list in h. This takes a small constant time for each of n entries.
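- In C, the first pass and the offset-by-1 list convention might look
- like the following sketch (the helper name `bucket` is mine, not
- from the article; h and p hold index+1 so that 0 can stand for the
- empty list or the end of a list):

```c
#include <stddef.h>
#include <stdint.h>

/* h[i] is index+1 of the head of segment i's list, or 0 if empty. */
size_t h[512];                      /* zero-initialized at file scope */

/* Pass 1: one linear scan, pushing each entry of v onto the list
 * for its leading 9 bits.  p must have n zeroed entries on entry;
 * p[k] becomes index+1 of the next list entry, or 0 at the end. */
void bucket(const uint32_t *v, size_t *p, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        size_t seg = v[k] >> 23;    /* upper 9 bits pick the segment */
        p[k] = h[seg];              /* old head becomes our successor */
        h[seg] = k + 1;             /* this entry is the new head */
    }
}
```

- Note that head insertion means each list ends up in reverse order
- of the entries' positions in v, which is harmless here.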
-
- Now run through the 512 entries of h. For each entry h[i], allocate a
- one-megabyte bit vector f (2^20 bytes = 2^23 bits) initialized to all
- zeros. Then scan the list headed by h[i]: for each value, read its
- lower 23 bits and set the corresponding bit in f, marking the entry
- in v as a duplicate if that bit was already set (we can put a
- distinguished value in the corresponding entry of p, for example,
- because we won't need to scan the list again). Deallocate f when the
- end of the list is reached. This phase takes a constant time to
- process each of the n entries in v, plus the time required to
- examine the 512 pointers in h.
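- Putting both passes together, a minimal C sketch might look like
- this. The function name and the separate dup array (instead of the
- distinguished value in p mentioned above) are my choices, and I skip
- the bit vector entirely for empty lists:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the two-pass scheme above.  dup[k] is set to 1 if v[k]'s
 * value occurs again later in v, else 0 (because the lists are walked
 * in reverse order of position in v, the last occurrence of each
 * value is the one left unflagged).  Returns -1 if an allocation
 * fails, 0 on success. */
int find_duplicates(const uint32_t *v, size_t n, unsigned char *dup)
{
    /* h[i] and p[k] hold index+1 so that 0 can mean empty/end. */
    size_t *p = calloc(n, sizeof *p);
    size_t *h = calloc(512, sizeof *h);
    if (!p || !h) { free(p); free(h); return -1; }

    /* Pass 1: thread each entry onto the list for its top 9 bits. */
    for (size_t k = 0; k < n; k++) {
        size_t seg = v[k] >> 23;
        p[k] = h[seg];
        h[seg] = k + 1;
    }

    /* Pass 2: one 2^23-bit vector per segment, reused serially. */
    for (size_t i = 0; i < 512; i++) {
        unsigned char *f;
        if (h[i] == 0)
            continue;                    /* skip empty lists */
        f = calloc(1u << 20, 1);         /* 2^20 bytes = 2^23 bits */
        if (!f) { free(p); free(h); return -1; }
        for (size_t k = h[i]; k != 0; k = p[k - 1]) {
            uint32_t low = v[k - 1] & ((1u << 23) - 1);
            unsigned char bit = 1u << (low & 7);
            dup[k - 1] = (f[low >> 3] & bit) != 0;
            f[low >> 3] |= bit;
        }
        free(f);
    }
    free(p);
    free(h);
    return 0;
}
```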
-
- And that's the end of the algorithm. It takes K*n + K' time, where K
- is the time to perform the small number of operations required for
- each entry of v in the two phases, and K' represents a small overhead
- for each of the 512 head pointers. That is, the algorithm is O(n),
- with a constant that I think compares reasonably well with algorithms
- that pre-scan v for min and max only and assume you can allocate an
- array large enough to index the entire range.
-
- A shortcoming of course is that the running-time analysis is valid
- only if you can repeatedly allocate large amounts of zeroed-out memory
- in a small (essentially constant) time. But pessimistic assumptions
- about memory initialization are bad for other index-array algorithms
- as well.
-
- Of course any of the constants I used (e.g. 2^20, 512) can be
- adjusted to suit practical considerations, perhaps even at runtime.
-
- -- David Karr (karr@cs.cornell.edu)
-