- Newsgroups: comp.programming
- Path: sparky!uunet!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!rpi!batcomputer!cornell!karr
- From: karr@cs.cornell.edu (David Karr)
- Subject: Re: Locating duplicates
- Message-ID: <1993Jan4.175728.22640@cs.cornell.edu>
- Organization: Cornell Univ. CS Dept, Ithaca NY 14853
- References: <C07FJo.4vq@ux1.cso.uiuc.edu> <1993Jan2.041658.21596@organpipe.uug.arizona.edu> <DOUGM.93Jan2014723@titan.cs.rice.edu>
- Date: Mon, 4 Jan 1993 17:57:28 GMT
- Lines: 51
-
- I posted a somewhat stupid reply to the following last night; hopefully I
- canceled it before it did too much damage. I'll try to be wiser this time:
-
- In article <DOUGM.93Jan2014723@titan.cs.rice.edu> dougm@titan.cs.rice.edu (Doug Moore) writes:
- >But if these are fixed size integers (like C ints), you can radix sort
- >in linear time anyway, which is as fast as you can reasonably expect.
-
- The algorithm that follows is indeed linear time, but look at the
- constant factor:
-
- > while (mask <= max && mask) /* <= 32 times on my machine */
- >[...]
- > for (j = 0; j < n; ++j)
- ...etc.
-
- The program is doing up to 32 passes over a set of loops (of which I
- just showed the first) that always execute 5 lines of code for every
- entry in the array. That's up to 160*n lines of code to be executed,
- depending on the range of values in your array. Is that really
- better than a well-written O(n log n) sorting algorithm?
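- To make that constant concrete, here is a minimal sketch (mine, not
- Doug's -- the function name and the two-buffer partition are my own
- choices) of a one-bit-per-pass LSD radix sort with the same loop
- structure as the quoted code:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One-bit-at-a-time LSD radix sort: one stable partition pass per
   bit, up to 32 passes of O(n) work each on 32-bit machines, which
   is where the large constant factor comes from. */
void radix_sort_bits(unsigned *a, size_t n)
{
    unsigned *zeros = malloc(n * sizeof *zeros);
    unsigned *ones  = malloc(n * sizeof *ones);
    unsigned max = 0;
    if (!zeros || !ones) { free(zeros); free(ones); return; }
    for (size_t j = 0; j < n; ++j)
        if (a[j] > max) max = a[j];
    for (unsigned mask = 1; mask <= max && mask; mask <<= 1) {
        size_t nz = 0, no = 0;
        for (size_t j = 0; j < n; ++j) {     /* one full sweep per bit */
            if (a[j] & mask) ones[no++]  = a[j];
            else             zeros[nz++] = a[j];
        }
        memcpy(a, zeros, nz * sizeof *a);    /* stable recombine */
        memcpy(a + nz, ones, no * sizeof *a);
    }
    free(zeros);
    free(ones);
}
```

- Note the outer loop runs once per significant bit of the largest
- value, and every iteration touches all n entries -- exactly the
- per-pass work the line counts above are measuring.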
-
- In fact what you're doing is changing K1*n*log(n) time to K2*n*log(m)
- where m is the largest integer in the array. For K1 ~= K2, this is
- good if m << n (lots and lots of duplicates) but bad if m >> n (e.g. a
- moderate sized array with widely varying data, few duplicates and lots
- of holes in the "range" it covers).
-
- Personally, for the latter case I like the suggestions to use either a
- flag array indexed by the range of your data or a hashing scheme along
- with an array of flags.  You can reduce the amount of extra memory
- considerably in either case by using bits of a bit vector for flags
- rather than using whole bytes, though this entails a little extra run
- time. Of course you need to consider how long it takes to obtain an
- array initialized to all zeros.  (Depends on the machine
- implementation, I'd think.) If you have to loop through the array to
- initialize it, I think it might more than wipe out any speed advantage.
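- The bit-vector flag scheme might look something like this (a sketch
- under my own naming; note that calloc hands back zeroed storage, so
- whether the initialization loop is "free" is exactly the
- implementation question raised above):

```c
#include <stdint.h>
#include <stdlib.h>

/* Count duplicate occurrences among n values, all < range, using one
   bit of flag storage per possible value.  calloc supplies the
   all-zeros flag array; on some systems that is cheap zero-fill-on-
   demand pages, on others a real clearing loop. */
size_t count_dups(const unsigned *a, size_t n, unsigned range)
{
    uint32_t *flags = calloc((range + 31) / 32, sizeof *flags);
    size_t dups = 0;
    if (!flags) return 0;
    for (size_t j = 0; j < n; ++j) {
        uint32_t *w   = &flags[a[j] >> 5];   /* word holding this flag */
        uint32_t bit  = 1u << (a[j] & 31);   /* bit within the word */
        if (*w & bit) dups++;                /* seen before: duplicate */
        else          *w |= bit;             /* first sighting: mark it */
    }
    free(flags);
    return dups;
}
```

- Using whole bytes instead of the shift-and-mask arithmetic trades
- eight times the memory for slightly faster inner-loop code, which is
- the run-time tradeoff mentioned above.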
-
- When the range of your data is large, e.g. values up to 2^32, the bit
- array ends up being half a gigabyte, a bit large for comfort, so you
- might make a few passes of the radix sort to partition the array into
- "segments" in each of which the first p digits of each number are
- constant. Then walk the array once using the unsorted digits of each
- segment to index a flag array. You only need to keep the array for
- the current segment. I suggest radix sorting from the left rather
- than the right; this makes the resulting order bizarre but who cares,
- all you need is the segmentation and I suspect the indexing arithmetic
- is fastest on the low-order bits.
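- A sketch of that segmented scheme (again my own names; I use a single
- counting-sort pass on the top P bits to stand in for "a few passes of
- the radix sort," and P = 8 just for illustration):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define P    8               /* top bits used for segmentation */
#define SEGS (1u << P)

/* Partition 32-bit values by their top P bits (order within a segment
   is irrelevant), then scan each segment with a flag array indexed by
   the low 32-P bits.  Only one segment's flags are live at a time, so
   flag storage is 2^(32-P) bits instead of 2^32.  Error handling
   trimmed for brevity. */
size_t count_dups_segmented(const uint32_t *a, size_t n)
{
    size_t count[SEGS + 1] = {0}, pos[SEGS], dups = 0;
    size_t nbits = (size_t)1 << (32 - P);
    uint32_t *tmp   = malloc(n * sizeof *tmp);
    uint8_t  *flags = malloc(nbits / 8);
    if (!tmp || !flags) { free(tmp); free(flags); return 0; }

    /* counting sort on the top P bits: count, prefix-sum, scatter */
    for (size_t j = 0; j < n; ++j) count[(a[j] >> (32 - P)) + 1]++;
    for (size_t s = 0; s < SEGS; ++s) count[s + 1] += count[s];
    memcpy(pos, count, SEGS * sizeof *pos);
    for (size_t j = 0; j < n; ++j) tmp[pos[a[j] >> (32 - P)]++] = a[j];

    /* one shared flag bit-vector, cleared per nonempty segment */
    for (size_t s = 0; s < SEGS; ++s) {
        if (count[s] == count[s + 1]) continue;   /* empty: skip clear */
        memset(flags, 0, nbits / 8);
        for (size_t j = count[s]; j < count[s + 1]; ++j) {
            uint32_t low = tmp[j] & ((1u << (32 - P)) - 1);
            if (flags[low >> 3] & (1u << (low & 7))) dups++;
            else flags[low >> 3] |= 1u << (low & 7);
        }
    }
    free(flags);
    free(tmp);
    return dups;
}
```

- With P = 8 the flag array is 2 megabytes rather than half a gigabyte,
- at the cost of reclearing it once per nonempty segment.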
-
- -- David Karr (karr@cs.cornell.edu)
-