NetNews Usenet Archive 1992 #19

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #19 / NN_1992_19.iso / spool / comp / programm / 2574 < prev next >

Wrap

Internet Message Format | 1992-09-07 | 2.5 KB

Path: sparky!uunet!ogicse!uwm.edu!linac!pacific.mps.ohio-state.edu!cis.ohio-state.edu!daisy.learning.cs.cmu.edu!Marc.Ringuette From: Marc.Ringuette@daisy.learning.cs.cmu.edu Newsgroups: comp.programming Subject: Re: How to find vectors with minimum Hamming-distance? Message-ID: <9209042121.AA00144@news.cis.ohio-state.edu> Date: 4 Sep 92 20:56:00 GMT Article-I.D.: news.9209042121.AA00144 Sender: daemon@cis.ohio-state.edu Organization: The Ohio State University Department of Computer and Information Science Lines: 43 rainer@ruble.fml.tuwien.ac.at (Rainer Staringer) writes, > I stumbled over the following problem recently, and I wonder if there > is a clever algorithm for it: I have a set of n m-dimensional vectors > (vector elements are from a finite set, but not necessarily binary), > and I want to find the 2 vectors with the minimum Hamming-distance > (i.e. those which have the most elements in common). Is there any > way you can find them using less than O(n^2) vector comparisons? > I know there is a clever algorithm for the equivalent problem with > Euclidean metric, but I could not find any for the Hamming-metric > variant. Good problem! Some friends and I tried to come up with something better, but couldn't. We were thinking along the lines of geometric algorithms (but couldn't manage to extend a 2-d closest-pair-of-points algorithm) or something based on sorting or divide-and-conquer. I like the other poster's n*log(n)*m! answer -- deliciously crazy and good for minuscule m. ---- If you expect the distribution to be skewed, with a small number of very well matched pairs of vectors, then you might adopt a method with better average-case behavior at the expense of worst-case running time. You could use a hashing technique to see if you might luck out and find two duplicate vectors. With an extra log n factor, you can compare the n^2 candidate vector-pairs in a best-first fashion, comparing elements in each pair only until you have seen more mismatches than the best so far. This means that if b is the number of total mismatches for the best pair, p_ij is the number of elements in the largest prefix of vectors i and j which has fewer than b mismatches, and P is the sum of all p_ij, then the algorithm runs in time O(P log n), where n*b+m <= P <= n^2*m. ---- Other ideas, anyone? If not, how about a lower bound on the worst case running time? [ Marc Ringuette | Cranberry Melon University, Cucumber Science Department ] [ mnr@cs.cmu.edu | 412-268-3728 | ".surivorter erutangis a ma I" ]