Newsgroups: comp.programming
Path: sparky!uunet!pipex!harlqn.co.uk!harlqn!markt
From: markt@harlqn.co.uk (Mark Tillotson)
Subject: Re: Locating duplicates
In-Reply-To: dave@cs.arizona.edu's message of 8 Jan 93 05:10:51 GMT
Message-ID: <MARKT.93Jan12174830@wundt.harlqn.co.uk>
Lines: 24
Sender: news@harlqn.co.uk (Usenet News Account)
Organization: Harlequin Limited, Cambridge, England
References: <C07FJo.4vq@ux1.cso.uiuc.edu>
	<1993Jan2.041658.21596@organpipe.uug.arizona.edu>
	<1993Jan5.150010.27104@ulysses.att.com>
	<MARKT.93Jan7122729@wundt.harlqn.co.uk>
	<1993Jan8.051051.18788@organpipe.uug.arizona.edu>
Date: Tue, 12 Jan 1993 17:48:30 GMT

dave@cs.arizona.edu (Dave Schaumann) writes (regarding my hashing
method for duplicate detection):
> If you choose the number of buckets at run-time based on the number
> of elements, you might as well just use the "radix sort" solution
> already posted.

But a radix sort needs a number of buckets proportional to the
_range_ of the integers involved, or else it requires multiple passes
over the data: for instance, 100 buckets and input with a range of
1 million means 3 passes (and 3 rounds of bucketing) are required,
since 100^3 = 10^6.  In practice, with 32-bit (or even 64-bit)
integers there isn't going to be a great deal of difference between
hashing and radix sorting.  Generalising the problem (say, to
spotting duplicated strings) will tend to favour hashing.
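
To make the multi-pass point concrete, here is one way the radix
approach might look in C (a sketch only, not code from this thread;
I'm assuming byte-wide digits, i.e. 256 buckets and 4 passes over
32-bit values, and the function names are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    /* One stable counting-sort pass on the byte at `shift'
       (0, 8, 16 or 24). */
    static void radix_pass(unsigned *src, unsigned *dst, size_t n,
                           int shift)
    {
        size_t count[256] = {0}, pos[256];
        size_t i;

        for (i = 0; i < n; i++)             /* histogram this digit  */
            count[(src[i] >> shift) & 0xff]++;
        pos[0] = 0;                         /* prefix sums -> slots  */
        for (i = 1; i < 256; i++)
            pos[i] = pos[i-1] + count[i-1];
        for (i = 0; i < n; i++)             /* stable distribution   */
            dst[pos[(src[i] >> shift) & 0xff]++] = src[i];
    }

    /* Sort a[0..n-1] in 4 passes, then report equal neighbours. */
    void report_duplicates(unsigned *a, size_t n)
    {
        unsigned *tmp = malloc(n * sizeof *tmp);
        size_t i;

        if (tmp == NULL) return;
        radix_pass(a, tmp, n, 0);    /* a -> tmp                 */
        radix_pass(tmp, a, n, 8);    /* tmp -> a                 */
        radix_pass(a, tmp, n, 16);   /* a -> tmp                 */
        radix_pass(tmp, a, n, 24);   /* final result back in a   */
        for (i = 1; i < n; i++)
            if (a[i] == a[i-1])
                printf("duplicate: %u\n", a[i]);
        free(tmp);
    }

With b buckets and a range of R, the number of passes is
ceil(log_b(R)), which is where the 3 passes for 100 buckets over a
range of a million come from.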

However, if you want guaranteed performance, avoid the hash-table
method, because its worst case becomes O(N^2)...
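
(For contrast, a sketch of the chained-hashing detector, again with
illustrative names and cleanup omitted, showing where the worst case
comes from: if every key lands in the same chain, each insertion
scans all the keys seen so far, giving O(N^2) total work.)

    #include <stdio.h>
    #include <stdlib.h>

    struct node { unsigned key; struct node *next; };

    /* Hash into n chains (one bucket per input element).  Expected
       O(N) on well-spread keys; O(N^2) if everything collides. */
    void report_duplicates_hash(unsigned *a, size_t n)
    {
        struct node **table = calloc(n, sizeof *table);
        size_t i;

        if (table == NULL) return;
        for (i = 0; i < n; i++) {
            size_t h = a[i] % n;    /* trivial hash function     */
            struct node *p;
            for (p = table[h]; p != NULL; p = p->next)
                if (p->key == a[i]) {
                    printf("duplicate: %u\n", a[i]);
                    break;
                }
            if (p == NULL) {        /* first sighting: add to chain */
                p = malloc(sizeof *p);
                if (p == NULL) return;
                p->key = a[i];
                p->next = table[h];
                table[h] = p;
            }
        }
    }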

------------------------------------------------------
|\ /| | , M. Tillotson Harlequin Ltd. \
| \/ | /\| |/\ |< markt@uk.co.harlqn Barrington Hall,\
| | \_| | | \ +44 223 872522 Barrington, \
I came, I saw, I core-dumped... Cambridge CB2 5RG \