- Newsgroups: comp.programming
- Path: sparky!uunet!email!news
- From: rainer@ruble.fml.tuwien.ac.at (Rainer Staringer)
- Subject: Re: How to find vectors with minimum Hamming-distance?
- Message-ID: <1992Sep7.071227.11596@email.tuwien.ac.at>
- Sender: news@email.tuwien.ac.at
- Nntp-Posting-Host: ruble.fml.tuwien.ac.at
- Reply-To: rainer@eimoni.tuwien.ac.at
- Organization: Technical University of Vienna
- References: <9209042121.AA00144@news.cis.ohio-state.edu>
- Date: Mon, 7 Sep 1992 07:12:27 GMT
- Lines: 97
-
- In article <9209042121.AA00144@news.cis.ohio-state.edu>
- Marc.Ringuette@daisy.learning.cs.cmu.edu writes:
- > rainer@ruble.fml.tuwien.ac.at (Rainer Staringer) writes,
- > > I stumbled over the following problem recently, and I wonder if there
- > > is a clever algorithm for it: I have a set of n m-dimensional vectors
- > > (vector elements are from a finite set, but not necessarily binary),
- > > and I want to find the 2 vectors with the minimum Hamming-distance
- > > (i.e. those which have the most elements in common). Is there any
- > > way you can find them using less than O(n^2) vector comparisons?
- > > I know there is a clever algorithm for the equivalent problem with
- > > Euclidean metric, but I could not find any for the Hamming-metric
- > > variant.
- >
- > Good problem! Some friends and I tried to come up with something
- > better, but couldn't. We were thinking along the lines of geometric
- > algorithms (but couldn't manage to extend a 2-d closest-pair-of-points
- > algorithm) or something based on sorting or divide-and-conquer. I like
- > the other poster's n*log(n)*m! answer -- deliciously crazy and good for
- > minuscule m.
-
- Taking closest-pair and modifying it for the Hamming-metric was my first
- idea too, but it doesn't work. As for m!*n*log(n) -- m is an order of
- magnitude smaller than n in my situation, but could still range up to
- 10 or 20, so I will not use this method if I can avoid it :-)
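-
- For reference, the plain O(n^2) method I am trying to beat is nothing more
- than the following (a rough C sketch; the flat row-major layout and the
- function name are just made up for illustration):
-
- #include <limits.h>
-
- /* Brute force: compare every pair and keep the closest one.  Runs in
-    O(n^2 * m); vec holds n vectors of m ints, stored row-major.       */
- void min_hamming_pair(int n, int m, const int *vec,
-                       int *best_i, int *best_j)
- {
-     int best = INT_MAX;
-     for (int i = 0; i < n; i++)
-         for (int j = i + 1; j < n; j++) {
-             int d = 0;
-             for (int k = 0; k < m; k++)
-                 if (vec[i * m + k] != vec[j * m + k])
-                     d++;
-             if (d < best) {
-                 best = d;
-                 *best_i = i;
-                 *best_j = j;
-             }
-         }
- }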
-
- > If you expect the distribution to be skewed, with a small number of
- > very well matched pairs of vectors, then you might adopt a method with
- > better average-case behavior at the expense of worst-case running time.
- >
- > You could use a hashing technique to see if you might luck out and find
- > two duplicate vectors.
-
- I don't expect the case with two duplicates to be very common in my special
- application, but this might be a useful optimization for the general case.
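-
- Still, if duplicates were at all likely, a cheap pre-pass along those lines
- could hash every vector and only memcmp the ones that land in the same
- bucket.  A rough sketch (hash function and bucket count picked arbitrarily):
-
- #include <stdlib.h>
- #include <string.h>
-
- #define NBUCKETS 4096                     /* arbitrary */
-
- struct node { int idx; struct node *next; };
-
- /* Returns 1 and sets *i, *j if two identical vectors exist, else 0.
-    vec holds n vectors of m ints, stored row-major.  Memory is leaked
-    for brevity.                                                       */
- int find_duplicate(int n, int m, const int *vec, int *i, int *j)
- {
-     struct node **bucket = calloc(NBUCKETS, sizeof *bucket);
-     for (int a = 0; a < n; a++) {
-         unsigned long h = 0;
-         for (int k = 0; k < m; k++)       /* any decent hash will do */
-             h = h * 31 + (unsigned long)vec[a * m + k];
-         h %= NBUCKETS;
-         for (struct node *p = bucket[h]; p != NULL; p = p->next)
-             if (memcmp(vec + p->idx * m, vec + a * m,
-                        m * sizeof(int)) == 0) {
-                 *i = p->idx;
-                 *j = a;
-                 return 1;
-             }
-         struct node *nn = malloc(sizeof *nn);
-         nn->idx = a;
-         nn->next = bucket[h];
-         bucket[h] = nn;
-     }
-     return 0;
- }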
-
- > With an extra log n factor, you can compare the n^2 candidate
- > vector-pairs in a best-first fashion, comparing elements in each pair
- > only until you have seen more mismatches than the best so far. This
- > means that if b is the number of total mismatches for the best pair,
- > p_ij is the number of elements in the largest prefix of vectors i and j
- > which has fewer than b mismatches, and P is the sum of all p_ij, then
- > the algorithm runs in time O(P log n), where n*b+m <= P <= n^2*m.
- > ----
- >
- > Other ideas, anyone? If not, how about a lower bound on the worst case
- > running time?
-
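- The element-by-element cutoff in that scheme is easy enough to write down;
- something like this bounded distance is what I picture as its inner loop
- (just a sketch of the cutoff, not of the full best-first bookkeeping):
-
- /* Hamming distance with early exit: give up as soon as the number of
-    mismatches reaches "bound" (the best distance seen so far), because
-    such a pair can no longer improve on the current best.             */
- int hamming_bounded(const int *x, const int *y, int m, int bound)
- {
-     int d = 0;
-     for (int k = 0; k < m && d < bound; k++)
-         if (x[k] != y[k])
-             d++;
-     return d;                 /* a result >= bound means "not better" */
- }
-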
- Andy Lowry (lowry@watson.ibm.com) pointed out in email that if a, b, and c
- are three of my m-dimensional vectors, then h (a, c) >= |h (a, b) - h (b, c)|.
- His nifty proof of this claim follows:
-
- > Pf (A. Lowry): For each component i there are three cases:
- >
- > 1. a[i] = b[i] = c[i]. In this case, a[i] = c[i].
- > 2. a[i] <> b[i] <> c[i]. In this case we may or may not have a[i] =
- > c[i].
- > 3. a[i] = b[i] <> c[i] or a[i] <> b[i] = c[i]. In this case, a[i]
- > <> c[i].
- >
- > Let n1(a,b,c), n2(a,b,c), and n3(a,b,c) be the number of instances of
- > case 1, 2, and 3, respectively, in comparing vectors a, b, and c. And
- > let k=h(a,b) and j=h(b,c).
- >
- > The minimum possible value of h(a,c) occurs in situations in which
- > case 3 is minimized and all instances of case 2 do indeed have a[i] =
- > c[i]. And in such an instance, of course, h(a,c) = n3(a,b,c). But
- > n3(a,b,c) can never be less than |h(a,b)-h(b,c)| = |k - j|. For
- > suppose otherwise, and wlog suppose k >= j. And let n3a(a,b,c) and
- > n3c(a,b,c) count the instances of case 3 in which a[i] = b[i] or b[i]
- > = c[i], respectively (so n3(a,b,c) = n3a(a,b,c)+n3c(a,b,c)).
- >
- > We have k = n2(a,b,c)+n3c(a,b,c) and j = n2(a,b,c)+n3a(a,b,c). So
- > 0 <= k-j = n3c(a,b,c)-n3a(a,b,c) <= n3(a,b,c), which contradicts our
- > assumption that n3(a,b,c) < |k-j|. This completes the proof of the
- > claim.
-
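- The claim is also easy to check empirically; a throwaway test along these
- lines (alphabet size and dimension chosen arbitrarily) would exercise it on
- random vectors:
-
- #include <assert.h>
- #include <stdio.h>
- #include <stdlib.h>
-
- static int hamming(const int *x, const int *y, int m)
- {
-     int d = 0;
-     for (int k = 0; k < m; k++)
-         if (x[k] != y[k])
-             d++;
-     return d;
- }
-
- int main(void)
- {
-     enum { M = 20, ALPHABET = 4 };          /* arbitrary test sizes */
-     int a[M], b[M], c[M];
-
-     for (int t = 0; t < 100000; t++) {
-         for (int k = 0; k < M; k++) {
-             a[k] = rand() % ALPHABET;
-             b[k] = rand() % ALPHABET;
-             c[k] = rand() % ALPHABET;
-         }
-         int ab = hamming(a, b, M);
-         int bc = hamming(b, c, M);
-         int ac = hamming(a, c, M);
-         assert(ac >= abs(ab - bc));         /* the claimed lower bound */
-     }
-     puts("no counterexample found");
-     return 0;
- }
-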
- Making use of this fact, you can prune away a lot of the n(n-1)/2 comparisons.
- I imagine a possible algorithm might look like this: Compare the first vector
- x_1 to all others and sort the remaining vectors according to their Hamming
- distances from x_1 (because these are integers between 0 and m, you could even
- use distribution counting and do this in linear time).
-
- Now, because each vector x_i only needs to be compared to those vectors x_j
- with |h (x_i, x_1) - h (x_1, x_j)| < b (the minimum distance found so far),
- you can easily find smaller subdivisions of the set of remaining vectors to
- which the same algorithm can be applied recursively. These subdivisions can
- overlap, though, so you can't do the sorting in place... Hmm, maybe there is
- a simpler way?
-
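- To make this concrete, here is a rough sketch of just the first level of
- that idea (no recursion, and all the names are invented for the example);
- x_1 from above is vector 0 here, and b is called "best":
-
- #include <stdlib.h>
-
- static int hamming(const int *x, const int *y, int m)
- {
-     int d = 0;
-     for (int k = 0; k < m; k++)
-         if (x[k] != y[k])
-             d++;
-     return d;
- }
-
- /* One level of the pruning idea (n >= 2 assumed): vector 0 is the
-    pivot, the others are sorted by their distance to it with
-    distribution counting, and a pair is compared only while its two
-    pivot distances differ by less than the best distance so far,
-    which is valid by the inequality above.                           */
- int min_pair_pruned(int n, int m, const int *vec, int *bi, int *bj)
- {
-     int *d   = malloc(n * sizeof *d);       /* distance to the pivot  */
-     int *ord = malloc(n * sizeof *ord);     /* 1..n-1 sorted by d[]   */
-     int *cnt = calloc(m + 2, sizeof *cnt);
-     int best = m + 1;
-
-     for (int i = 1; i < n; i++) {
-         d[i] = hamming(vec, vec + i * m, m);
-         if (d[i] < best) { best = d[i]; *bi = 0; *bj = i; }
-         cnt[d[i] + 1]++;
-     }
-     for (int v = 1; v <= m + 1; v++)        /* distribution counting  */
-         cnt[v] += cnt[v - 1];
-     for (int i = 1; i < n; i++)
-         ord[cnt[d[i]]++] = i;
-
-     /* d[] is nondecreasing along ord, so each inner scan can stop as
-        soon as the difference in pivot distances reaches "best".      */
-     for (int p = 0; p < n - 1; p++)
-         for (int q = p + 1;
-              q < n - 1 && d[ord[q]] - d[ord[p]] < best; q++) {
-             int h = hamming(vec + ord[p] * m, vec + ord[q] * m, m);
-             if (h < best) { best = h; *bi = ord[p]; *bj = ord[q]; }
-         }
-
-     free(d); free(ord); free(cnt);
-     return best;
- }
-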
- Anybody care to analyse the running time of this? :-)
-
- > [ Marc Ringuette | Cranberry Melon University, Cucumber Science Department ]
- > [ mnr@cs.cmu.edu | 412-268-3728 | ".surivorter erutangis a ma I" ]
-
- --
- Rainer Staringer | rainer@fml.tuwien.ac.at
- Financial Markets Lab, TU Vienna | +43 (1) 58801/8138
-