- Newsgroups: comp.programming
- Path: sparky!uunet!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!rpi!batcomputer!cornell!karr
- From: karr@cs.cornell.edu (David Karr)
- Subject: Re: Locating duplicates
- Message-ID: <1993Jan4.175728.22640@cs.cornell.edu>
- Organization: Cornell Univ. CS Dept, Ithaca NY 14853
- References: <C07FJo.4vq@ux1.cso.uiuc.edu> <1993Jan2.041658.21596@organpipe.uug.arizona.edu> <DOUGM.93Jan2014723@titan.cs.rice.edu>
- Date: Mon, 4 Jan 1993 17:57:28 GMT
- Lines: 51
-
- I posted a somewhat stupid reply to the following last night; hopefully I
- canceled it before it did too much damage. I'll try to be wiser this time:
-
- In article <DOUGM.93Jan2014723@titan.cs.rice.edu> dougm@titan.cs.rice.edu (Doug Moore) writes:
- >But if these are fixed size integers (like C ints), you can radix sort
- >in linear time anyway, which is as fast as you can reasonably expect.
-
- The algorithm that follows is indeed linear time, but look at the
- constant factor:
-
- > while (mask <= max && mask) /* <= 32 times on my machine */
- >[...]
- > for (j = 0; j < n; ++j)
- ...etc.
-
- The program is doing up to 32 passes over a set of loops (of which I
- just showed the first) that always execute 5 lines of code for every
- entry in the array. That's up to 160*n lines of code to be executed,
- depending on the range of values in your array. Is that really
- better than a well-written O(n log n) sorting algorithm?
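- To make that constant concrete, here is a minimal sketch (mine, not
- Doug's -- the function name and the two-buffer partition are my own
- choices) of a one-bit-per-pass LSD radix sort with the same loop
- structure as the quoted code:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One-bit-at-a-time LSD radix sort: one stable partition pass per
   bit, up to 32 passes of O(n) work each on 32-bit machines, which
   is where the large constant factor comes from. */
void radix_sort_bits(unsigned *a, size_t n)
{
    unsigned *zeros = malloc(n * sizeof *zeros);
    unsigned *ones  = malloc(n * sizeof *ones);
    unsigned max = 0;
    if (!zeros || !ones) { free(zeros); free(ones); return; }
    for (size_t j = 0; j < n; ++j)
        if (a[j] > max) max = a[j];
    for (unsigned mask = 1; mask <= max && mask; mask <<= 1) {
        size_t nz = 0, no = 0;
        for (size_t j = 0; j < n; ++j) {     /* one full sweep per bit */
            if (a[j] & mask) ones[no++]  = a[j];
            else             zeros[nz++] = a[j];
        }
        memcpy(a, zeros, nz * sizeof *a);    /* stable recombine */
        memcpy(a + nz, ones, no * sizeof *a);
    }
    free(zeros);
    free(ones);
}
```

- Note the outer loop runs once per significant bit of the largest
- value, and every iteration touches all n entries -- exactly the
- per-pass work the line counts above are measuring.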
-
- In fact what you're doing is changing K1*n*log(n) time to K2*n*log(m)
- where m is the largest integer in the array. For K1 ~= K2, this is
- good if m << n (lots and lots of duplicates) but bad if m >> n (e.g. a
- moderate sized array with widely varying data, few duplicates and lots
- of holes in the "range" it covers).
-
- Personally, for the latter case I like the suggestions to use either a
- flag array indexed by the range of your data or a hashing scheme along
- with an array of flags.  You can reduce the amount of extra memory
- considerably in either case by using bits of a bit vector for flags
- rather than using whole bytes, though this entails a little extra run
- time. Of course you need to consider how long it takes to obtain an
- array initialized to all zeros.  (Depends on the machine
- implementation, I'd think.) If you have to loop through the array to
- initialize it, I think it might more than wipe out any speed advantage.
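- The bit-vector flag scheme might look something like this (a sketch
- under my own naming; note that calloc hands back zeroed storage, so
- whether the initialization loop is "free" is exactly the
- implementation question raised above):

```c
#include <stdint.h>
#include <stdlib.h>

/* Count duplicate occurrences among n values, all < range, using one
   bit of flag storage per possible value.  calloc supplies the
   all-zeros flag array; on some systems that is cheap zero-fill-on-
   demand pages, on others a real clearing loop. */
size_t count_dups(const unsigned *a, size_t n, unsigned range)
{
    uint32_t *flags = calloc((range + 31) / 32, sizeof *flags);
    size_t dups = 0;
    if (!flags) return 0;
    for (size_t j = 0; j < n; ++j) {
        uint32_t *w   = &flags[a[j] >> 5];   /* word holding this flag */
        uint32_t bit  = 1u << (a[j] & 31);   /* bit within the word */
        if (*w & bit) dups++;                /* seen before: duplicate */
        else          *w |= bit;             /* first sighting: mark it */
    }
    free(flags);
    return dups;
}
```

- Using whole bytes instead of the shift-and-mask arithmetic trades
- eight times the memory for slightly faster inner-loop code, which is
- the run-time tradeoff mentioned above.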
-
- When the range of your data is large, e.g. values up to 2^32, the bit
- array ends up being half a gigabyte, a bit large for comfort, so you
- might make a few passes of the radix sort to partition the array into
- "segments" in each of which the first p digits of each number are
- constant. Then walk the array once using the unsorted digits of each
- segment to index a flag array. You only need to keep the array for
- the current segment. I suggest radix sorting from the left rather
- than the right; this makes the resulting order bizarre but who cares,
- all you need is the segmentation and I suspect the indexing arithmetic
- is fastest on the low-order bits.
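- A sketch of that segmented scheme (again my own names; I use a single
- counting-sort pass on the top P bits to stand in for "a few passes of
- the radix sort," and P = 8 just for illustration):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define P    8               /* top bits used for segmentation */
#define SEGS (1u << P)

/* Partition 32-bit values by their top P bits (order within a segment
   is irrelevant), then scan each segment with a flag array indexed by
   the low 32-P bits.  Only one segment's flags are live at a time, so
   flag storage is 2^(32-P) bits instead of 2^32.  Error handling
   trimmed for brevity. */
size_t count_dups_segmented(const uint32_t *a, size_t n)
{
    size_t count[SEGS + 1] = {0}, pos[SEGS], dups = 0;
    size_t nbits = (size_t)1 << (32 - P);
    uint32_t *tmp   = malloc(n * sizeof *tmp);
    uint8_t  *flags = malloc(nbits / 8);
    if (!tmp || !flags) { free(tmp); free(flags); return 0; }

    /* counting sort on the top P bits: count, prefix-sum, scatter */
    for (size_t j = 0; j < n; ++j) count[(a[j] >> (32 - P)) + 1]++;
    for (size_t s = 0; s < SEGS; ++s) count[s + 1] += count[s];
    memcpy(pos, count, SEGS * sizeof *pos);
    for (size_t j = 0; j < n; ++j) tmp[pos[a[j] >> (32 - P)]++] = a[j];

    /* one shared flag bit-vector, cleared per nonempty segment */
    for (size_t s = 0; s < SEGS; ++s) {
        if (count[s] == count[s + 1]) continue;   /* empty: skip clear */
        memset(flags, 0, nbits / 8);
        for (size_t j = count[s]; j < count[s + 1]; ++j) {
            uint32_t low = tmp[j] & ((1u << (32 - P)) - 1);
            if (flags[low >> 3] & (1u << (low & 7))) dups++;
            else flags[low >> 3] |= 1u << (low & 7);
        }
    }
    free(flags);
    free(tmp);
    return dups;
}
```

- With P = 8 the flag array is 2 megabytes rather than half a gigabyte,
- at the cost of reclearing it once per nonempty segment.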
-
- -- David Karr (karr@cs.cornell.edu)
-