- Newsgroups: comp.programming
- Path: sparky!uunet!email!news
- From: rainer@ruble.fml.tuwien.ac.at (Rainer Staringer)
- Subject: Re: How to find vectors with minimum Hamming-distance?
- Message-ID: <1992Sep7.071227.11596@email.tuwien.ac.at>
- Sender: news@email.tuwien.ac.at
- Nntp-Posting-Host: ruble.fml.tuwien.ac.at
- Reply-To: rainer@eimoni.tuwien.ac.at
- Organization: Technical University of Vienna
- References: <9209042121.AA00144@news.cis.ohio-state.edu>
- Date: Mon, 7 Sep 1992 07:12:27 GMT
- Lines: 97
-
- In article <9209042121.AA00144@news.cis.ohio-state.edu>
- Marc.Ringuette@daisy.learning.cs.cmu.edu writes:
- > rainer@ruble.fml.tuwien.ac.at (Rainer Staringer) writes,
- > > I stumbled over the following problem recently, and I wonder if there
- > > is a clever algorithm for it: I have a set of n m-dimensional vectors
- > > (vector elements are from a finite set, but not necessarily binary),
- > > and I want to find the 2 vectors with the minimum Hamming-distance
- > > (i.e. those which have the most elements in common). Is there any
- > > way you can find them using less than O(n^2) vector comparisons?
- > > I know there is a clever algorithm for the equivalent problem with
- > > Euclidean metric, but I could not find any for the Hamming-metric
- > > variant.
- >
- > Good problem! Some friends and I tried to come up with something
- > better, but couldn't. We were thinking along the lines of geometric
- > algorithms (but couldn't manage to extend a 2-d closest-pair-of-points
- > algorithm) or something based on sorting or divide-and-conquer. I like
- > the other poster's n*log(n)*m! answer -- deliciously crazy and good for
- > minuscule m.
-
- Taking closest-pair and modifying it for the Hamming-metric was my first
- idea too, but it doesn't work. As for m!*n*log(n) -- m is an order of
- magnitude smaller than n in my situation, but could still range up to
- 10 or 20, so I will not use this method if I can avoid it :-)
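-
- For reference, the plain O(n^2) method I am trying to beat is nothing more
- than the following (a rough C sketch; the flat row-major layout and the
- function name are just made up for illustration):
-
- #include <limits.h>
-
- /* Brute force: compare every pair and keep the closest one.  Runs in
-    O(n^2 * m); vec holds n vectors of m ints, stored row-major.       */
- void min_hamming_pair(int n, int m, const int *vec,
-                       int *best_i, int *best_j)
- {
-     int best = INT_MAX;
-     for (int i = 0; i < n; i++)
-         for (int j = i + 1; j < n; j++) {
-             int d = 0;
-             for (int k = 0; k < m; k++)
-                 if (vec[i * m + k] != vec[j * m + k])
-                     d++;
-             if (d < best) {
-                 best = d;
-                 *best_i = i;
-                 *best_j = j;
-             }
-         }
- }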
-
- > If you expect the distribution to be skewed, with a small number of
- > very well matched pairs of vectors, then you might adopt a method with
- > better average-case behavior at the expense of worst-case running time.
- >
- > You could use a hashing technique to see if you might luck out and find
- > two duplicate vectors.
-
- I don't expect the case with two duplicates to be very common in my special
- application, but this might be a useful optimization for the general case.
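-
- Still, if duplicates were at all likely, a cheap pre-pass along those lines
- could hash every vector and only memcmp the ones that land in the same
- bucket.  A rough sketch (hash function and bucket count picked arbitrarily):
-
- #include <stdlib.h>
- #include <string.h>
-
- #define NBUCKETS 4096                     /* arbitrary */
-
- struct node { int idx; struct node *next; };
-
- /* Returns 1 and sets *i, *j if two identical vectors exist, else 0.
-    vec holds n vectors of m ints, stored row-major.  Memory is leaked
-    for brevity.                                                       */
- int find_duplicate(int n, int m, const int *vec, int *i, int *j)
- {
-     struct node **bucket = calloc(NBUCKETS, sizeof *bucket);
-     for (int a = 0; a < n; a++) {
-         unsigned long h = 0;
-         for (int k = 0; k < m; k++)       /* any decent hash will do */
-             h = h * 31 + (unsigned long)vec[a * m + k];
-         h %= NBUCKETS;
-         for (struct node *p = bucket[h]; p != NULL; p = p->next)
-             if (memcmp(vec + p->idx * m, vec + a * m,
-                        m * sizeof(int)) == 0) {
-                 *i = p->idx;
-                 *j = a;
-                 return 1;
-             }
-         struct node *nn = malloc(sizeof *nn);
-         nn->idx = a;
-         nn->next = bucket[h];
-         bucket[h] = nn;
-     }
-     return 0;
- }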
-
- > With an extra log n factor, you can compare the n^2 candidate
- > vector-pairs in a best-first fashion, comparing elements in each pair
- > only until you have seen more mismatches than the best so far. This
- > means that if b is the number of total mismatches for the best pair,
- > p_ij is the number of elements in the largest prefix of vectors i and j
- > which has fewer than b mismatches, and P is the sum of all p_ij, then
- > the algorithm runs in time O(P log n), where n*b+m <= P <= n^2*m.
- > ----
- >
- > Other ideas, anyone? If not, how about a lower bound on the worst case
- > running time?
-
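- The element-by-element cutoff in that scheme is easy enough to write down;
- something like this bounded distance is what I picture as its inner loop
- (just a sketch of the cutoff, not of the full best-first bookkeeping):
-
- /* Hamming distance with early exit: give up as soon as the number of
-    mismatches reaches "bound" (the best distance seen so far), because
-    such a pair can no longer improve on the current best.             */
- int hamming_bounded(const int *x, const int *y, int m, int bound)
- {
-     int d = 0;
-     for (int k = 0; k < m && d < bound; k++)
-         if (x[k] != y[k])
-             d++;
-     return d;                 /* a result >= bound means "not better" */
- }
-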
- Andy Lowry (lowry@watson.ibm.com) pointed out in email that if a, b, and c
- are three of my m-dimensional vectors, then h (a, c) >= |h (a, b) - h (b, c)|.
- His nifty proof of this claim follows:
-
- > Pf (A. Lowry): For each component i there are three cases:
- >
- > 1. a[i] = b[i] = c[i]. In this case, a[i] = c[i].
- > 2. a[i] <> b[i] <> c[i]. In this case we may or may not have a[i] =
- > c[i].
- > 3. a[i] = b[i] <> c[i] or a[i] <> b[i] = c[i]. In this case, a[i]
- > <> c[i].
- >
- > Let n1(a,b,c), n2(a,b,c), and n3(a,b,c) be the number of instances of
- > case 1, 2, and 3, respectively, in comparing vectors a, b, and c. And
- > let k=h(a,b) and j=h(b,c).
- >
- > The minimum possible value of h(a,c) occurs in situations in which
- > case 3 is minimized and all instances of case 2 do indeed have a[i] =
- > c[i]. And in such an instance, of course, h(a,c) = n3(a,b,c). But
- > n3(a,b,c) can never be less than |h(a,b)-h(b,c)| = |k - j|. For
- > suppose otherwise, and wlog suppose k >= j. And let n3a(a,b,c) and
- > n3c(a,b,c) count the instances of case 3 in which a[i] = b[i] or b[i]
- > = c[i], respectively (so n3(a,b,c) = n3a(a,b,c)+n3c(a,b,c)).
- >
- > We have k = n2(a,b,c)+n3c(a,b,c) and j = n2(a,b,c)+n3a(a,b,c). So
- > 0 <= k-j = n3c(a,b,c)-n3a(a,b,c) <= n3(a,b,c), which contradicts our
- > assumption that n3(a,b,c) < |k-j|. This completes the proof of the
- > claim.
-
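- The claim is also easy to check empirically; a throwaway test along these
- lines (alphabet size and dimension chosen arbitrarily) would exercise it on
- random vectors:
-
- #include <assert.h>
- #include <stdio.h>
- #include <stdlib.h>
-
- static int hamming(const int *x, const int *y, int m)
- {
-     int d = 0;
-     for (int k = 0; k < m; k++)
-         if (x[k] != y[k])
-             d++;
-     return d;
- }
-
- int main(void)
- {
-     enum { M = 20, ALPHABET = 4 };          /* arbitrary test sizes */
-     int a[M], b[M], c[M];
-
-     for (int t = 0; t < 100000; t++) {
-         for (int k = 0; k < M; k++) {
-             a[k] = rand() % ALPHABET;
-             b[k] = rand() % ALPHABET;
-             c[k] = rand() % ALPHABET;
-         }
-         int ab = hamming(a, b, M);
-         int bc = hamming(b, c, M);
-         int ac = hamming(a, c, M);
-         assert(ac >= abs(ab - bc));         /* the claimed lower bound */
-     }
-     puts("no counterexample found");
-     return 0;
- }
-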
- Making use of this fact, you can prune away a lot of the n(n-1)/2 comparisons.
- I imagine a possible algorithm might look like this: Compare the first vector
- x_1 to all others and sort the remaining vectors according to their Hamming
- distances from x_1 (because these are integers between 0 and m, you could even
- use distribution counting and do this in linear time).
-
- Now, because each vector x_i only needs to be compared to those vectors x_j
- with |h (x_i, x_1) - h (x_1, x_j)| < b (the minimum distance found so far),
- you can easily find smaller subdivisions of the set of remaining vectors to
- which the same algorithm can be applied recursively. These subdivisions can
- overlap, though, so you can't do the sorting in place... Hmm, maybe there is
- a simpler way?
-
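- To make this concrete, here is a rough sketch of just the first level of
- that idea (no recursion, and all the names are invented for the example);
- x_1 from above is vector 0 here, and b is called "best":
-
- #include <stdlib.h>
-
- static int hamming(const int *x, const int *y, int m)
- {
-     int d = 0;
-     for (int k = 0; k < m; k++)
-         if (x[k] != y[k])
-             d++;
-     return d;
- }
-
- /* One level of the pruning idea (n >= 2 assumed): vector 0 is the
-    pivot, the others are sorted by their distance to it with
-    distribution counting, and a pair is compared only while its two
-    pivot distances differ by less than the best distance so far,
-    which is valid by the inequality above.                           */
- int min_pair_pruned(int n, int m, const int *vec, int *bi, int *bj)
- {
-     int *d   = malloc(n * sizeof *d);       /* distance to the pivot  */
-     int *ord = malloc(n * sizeof *ord);     /* 1..n-1 sorted by d[]   */
-     int *cnt = calloc(m + 2, sizeof *cnt);
-     int best = m + 1;
-
-     for (int i = 1; i < n; i++) {
-         d[i] = hamming(vec, vec + i * m, m);
-         if (d[i] < best) { best = d[i]; *bi = 0; *bj = i; }
-         cnt[d[i] + 1]++;
-     }
-     for (int v = 1; v <= m + 1; v++)        /* distribution counting  */
-         cnt[v] += cnt[v - 1];
-     for (int i = 1; i < n; i++)
-         ord[cnt[d[i]]++] = i;
-
-     /* d[] is nondecreasing along ord, so each inner scan can stop as
-        soon as the difference in pivot distances reaches "best".      */
-     for (int p = 0; p < n - 1; p++)
-         for (int q = p + 1;
-              q < n - 1 && d[ord[q]] - d[ord[p]] < best; q++) {
-             int h = hamming(vec + ord[p] * m, vec + ord[q] * m, m);
-             if (h < best) { best = h; *bi = ord[p]; *bj = ord[q]; }
-         }
-
-     free(d); free(ord); free(cnt);
-     return best;
- }
-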
- Anybody care to analyse the running time of this? :-)
-
- > [ Marc Ringuette | Cranberry Melon University, Cucumber Science Department ]
- > [ mnr@cs.cmu.edu | 412-268-3728 | ".surivorter erutangis a ma I" ]
-
- --
- Rainer Staringer | rainer@fml.tuwien.ac.at
- Financial Markets Lab, TU Vienna | +43 (1) 58801/8138
-