home *** CD-ROM | disk | FTP | other *** search
- Path: sparky!uunet!ogicse!uwm.edu!linac!pacific.mps.ohio-state.edu!cis.ohio-state.edu!daisy.learning.cs.cmu.edu!Marc.Ringuette
- From: Marc.Ringuette@daisy.learning.cs.cmu.edu
- Newsgroups: comp.programming
- Subject: Re: How to find vectors with minimum Hamming-distance?
- Message-ID: <9209042121.AA00144@news.cis.ohio-state.edu>
- Date: 4 Sep 92 20:56:00 GMT
- Article-I.D.: news.9209042121.AA00144
- Sender: daemon@cis.ohio-state.edu
- Organization: The Ohio State University Department of Computer and Information Science
- Lines: 43
-
- rainer@ruble.fml.tuwien.ac.at (Rainer Staringer) writes,
- > I stumbled over the following problem recently, and I wonder if there
- > is a clever algorithm for it: I have a set of n m-dimensional vectors
- > (vector elements are from a finite set, but not necessarily binary),
- > and I want to find the 2 vectors with the minimum Hamming-distance
- > (i.e. those which have the most elements in common). Is there any
- > way you can find them using less than O(n^2) vector comparisons?
- > I know there is a clever algorithm for the equivalent problem with
- > Euclidean metric, but I could not find any for the Hamming-metric
- > variant.
-
- Good problem! Some friends and I tried to come up with something
- better, but couldn't. We were thinking along the lines of geometric
- algorithms (but couldn't manage to extend a 2-d closest-pair-of-points
- algorithm) or something based on sorting or divide-and-conquer. I like
- the other poster's n*log(n)*m! answer -- deliciously crazy and good for
- minuscule m.
-
- ----
-
- If you expect the distribution to be skewed, with a small number of
- very well matched pairs of vectors, then you might adopt a method with
- better average-case behavior at the expense of worst-case running time.
-
- You could use a hashing technique to see if you might luck out and find
- two duplicate vectors.
-
- With an extra log n factor, you can compare the n^2 candidate
- vector-pairs in a best-first fashion, comparing elements in each pair
- only until you have seen more mismatches than the best so far. This
- means that if b is the number of total mismatches for the best pair,
- p_ij is the number of elements in the largest prefix of vectors i and j
- which has fewer than b mismatches, and P is the sum of all p_ij, then
- the algorithm runs in time O(P log n), where n*b+m <= P <= n^2*m.
-
- ----
-
- Other ideas, anyone? If not, how about a lower bound on the worst case
- running time?
-
-
- [ Marc Ringuette | Cranberry Melon University, Cucumber Science Department ]
- [ mnr@cs.cmu.edu | 412-268-3728 | ".surivorter erutangis a ma I" ]
-