Newsgroups: comp.programming
Path: sparky!uunet!pipex!harlqn.co.uk!harlqn!markt
From: markt@harlqn.co.uk (Mark Tillotson)
Subject: Re: Locating duplicates
In-Reply-To: dave@cs.arizona.edu's message of 8 Jan 93 05:10:51 GMT
Message-ID: <MARKT.93Jan12174830@wundt.harlqn.co.uk>
Lines: 24
Sender: news@harlqn.co.uk (Usenet News Account)
Organization: Harlequin Limited, Cambridge, England
References: <C07FJo.4vq@ux1.cso.uiuc.edu>
	<1993Jan2.041658.21596@organpipe.uug.arizona.edu>
	<1993Jan5.150010.27104@ulysses.att.com>
	<MARKT.93Jan7122729@wundt.harlqn.co.uk>
	<1993Jan8.051051.18788@organpipe.uug.arizona.edu>
Date: Tue, 12 Jan 1993 17:48:30 GMT

dave@cs.arizona.edu (Dave Schaumann) writes (regarding my hashing
method for duplicate detection):
> If you choose the number of buckets at run-time based on the number
> of elements, you might as well just use the "radix sort" solution
> already posted.

But a radix sort needs a number of buckets proportional to the
_range_ of the integers involved, or else it requires multiple passes
over the data: for instance, 100 buckets and input with a range of
1 million means 3 passes (and 3 rounds of bucketing) are required,
since 100^3 = 10^6.  In practice, with 32-bit (or even 64-bit)
integers there isn't going to be a great deal of difference between
hashing and radix sorting.  Generalising the problem (say, to
spotting duplicated strings) will tend to favour hashing.
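
To make the multi-pass point concrete, here is one way the radix
approach might look in C (a sketch only, not code from this thread;
I'm assuming byte-wide digits, i.e. 256 buckets and 4 passes over
32-bit values, and the function names are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    /* One stable counting-sort pass on the byte at `shift'
       (0, 8, 16 or 24). */
    static void radix_pass(unsigned *src, unsigned *dst, size_t n,
                           int shift)
    {
        size_t count[256] = {0}, pos[256];
        size_t i;

        for (i = 0; i < n; i++)             /* histogram this digit  */
            count[(src[i] >> shift) & 0xff]++;
        pos[0] = 0;                         /* prefix sums -> slots  */
        for (i = 1; i < 256; i++)
            pos[i] = pos[i-1] + count[i-1];
        for (i = 0; i < n; i++)             /* stable distribution   */
            dst[pos[(src[i] >> shift) & 0xff]++] = src[i];
    }

    /* Sort a[0..n-1] in 4 passes, then report equal neighbours. */
    void report_duplicates(unsigned *a, size_t n)
    {
        unsigned *tmp = malloc(n * sizeof *tmp);
        size_t i;

        if (tmp == NULL) return;
        radix_pass(a, tmp, n, 0);    /* a -> tmp                 */
        radix_pass(tmp, a, n, 8);    /* tmp -> a                 */
        radix_pass(a, tmp, n, 16);   /* a -> tmp                 */
        radix_pass(tmp, a, n, 24);   /* final result back in a   */
        for (i = 1; i < n; i++)
            if (a[i] == a[i-1])
                printf("duplicate: %u\n", a[i]);
        free(tmp);
    }

With b buckets and a range of R, the number of passes is
ceil(log_b(R)), which is where the 3 passes for 100 buckets over a
range of a million come from.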

However, if you want guaranteed performance, avoid the hash-table
method, because its worst case becomes O(N^2)...
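
(For contrast, a sketch of the chained-hashing detector, again with
illustrative names and cleanup omitted, showing where the worst case
comes from: if every key lands in the same chain, each insertion
scans all the keys seen so far, giving O(N^2) total work.)

    #include <stdio.h>
    #include <stdlib.h>

    struct node { unsigned key; struct node *next; };

    /* Hash into n chains (one bucket per input element).  Expected
       O(N) on well-spread keys; O(N^2) if everything collides. */
    void report_duplicates_hash(unsigned *a, size_t n)
    {
        struct node **table = calloc(n, sizeof *table);
        size_t i;

        if (table == NULL) return;
        for (i = 0; i < n; i++) {
            size_t h = a[i] % n;    /* trivial hash function     */
            struct node *p;
            for (p = table[h]; p != NULL; p = p->next)
                if (p->key == a[i]) {
                    printf("duplicate: %u\n", a[i]);
                    break;
                }
            if (p == NULL) {        /* first sighting: add to chain */
                p = malloc(sizeof *p);
                if (p == NULL) return;
                p->key = a[i];
                p->next = table[h];
                table[h] = p;
            }
        }
    }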

------------------------------------------------------
|\ /| | , M. Tillotson Harlequin Ltd. \
| \/ | /\| |/\ |< markt@uk.co.harlqn Barrington Hall,\
| | \_| | | \ +44 223 872522 Barrington, \
I came, I saw, I core-dumped... Cambridge CB2 5RG \