CLIQUE OPTIMIZATION (CLOPT) 0.7
Last Modified on July 18, 1993
N. Sriram
National University of Singapore
email: swknasri@leonis.nus.sg
The first part of this document sketches the logic used by CLOPT.
The second part describes the operation of the program. The third
section contains portions of CLOPT's output using a dataset based on
the rated similarity among 30 mammals.
CLOPT is a clustering approach that takes proximity data as input and
constructs a hierarchical tree. These trees are parsimonious in that
they have far fewer levels than the n-1 levels produced by hierarchical
clustering scheme (HCS) methods. Simulations also indicate that CLOPT
is robust in recovering structure from data. For more details, refer
to the articles cited at the end of this section.
HCS methods are "bottom-up": they cluster 2 elements, recompute proximity
data for the remaining n-1 objects, merge 2 objects again, recompute
proximities among n-2 objects, and so on until all elements are merged
into 1 group. The various HCS methods differ in the criterion used to
decide which pair of objects are nearest neighbours at each step of the
iteration.
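The bottom-up loop just described can be sketched as follows. This is an illustrative single-linkage sketch, not CLOPT's code; all names are invented for the example.

```python
# Illustrative "bottom-up" HCS loop: merge the nearest pair of clusters,
# then repeat on the reduced set until one group remains. Single linkage
# is used as the nearest-neighbour criterion here; other HCS methods
# differ only in this criterion.

def hcs_merges(sim, n):
    """sim: dict mapping frozenset({i, j}) -> similarity.
    Returns the n-1 merged clusters in order of formation."""
    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # pick the cluster pair with the strongest cross link
        a, b = max(
            ((x, y) for i, x in enumerate(clusters)
                    for y in clusters[i + 1:]),
            key=lambda p: max(sim[frozenset([u, v])]
                              for u in p[0] for v in p[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges

sim = {frozenset([0, 1]): 9, frozenset([0, 2]): 2, frozenset([1, 2]): 3}
merges = hcs_merges(sim, 3)
print(len(merges))  # n-1 = 2 merges, hence 2 levels
```

Note that the loop always produces exactly n-1 levels, which is what makes CLOPT's shallower trees more parsimonious.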
CLOPT's control structure is both "top-down" and "bottom-up". In the
first step, CLOPT divides the n elements into a partition in the middle of
the hierarchy. This partition, called a B ("Basic") level, is chosen to
maximize a least-squares fit index of the partition with the data. One of
several graph partitioning heuristics may be used for this purpose. CLOPT
can then merge back from the B level to the T ("Top") level, creating
intermediate (I) levels in the hierarchy. The same heuristic is also
used in merging, although the program can easily be modified to use
different heuristics for dividing and merging. The first basic level
can then be divided and a similar process carried out. Consider an example
from data that are almost perfectly hierarchical (n = 8 elements):
3
2 2
2 2 4
1 1 1 1
1 1 2 1 5
1 1 1 1 2 2
1 1 1 1 2 2 5
DIVIDE 1 --> 4 CORR=0.87857 Lt = 2.992 (0.498) Rt=0.125 OK
MERGE 4 --> 2 CORR=0.94845 Lt = 1.941 (0.941) Rt=0.250 OK
DIVIDE 4 --> 6 CORR=0.98023 Lt = 4.889 (0.945) Rt=0.250 OK
MERGE 6 --> 5 CORR=0.98715 Lt = 3.943 (0.943) Rt=0.250 OK
DIVIDE All Data values are identical
The first B level has 4 groups; this merges into a 2-group I level. Then
the first B level divides into the second B level with 6 groups, which in
turn merges into an I level with 5 groups. Further division is not carried
out because it does not increase CORR sufficiently. The OK indicates that
the level that was created was accepted by the model because the change
in CORR was greater than BETA, a convergence criterion (here BETA = 0.001).
CLOPT implements a number of clustering heuristics. One such heuristic is
described in the 1990 paper; another, more robust heuristic is described in
the second paper. This program includes a third heuristic and minor
variations of the second and third, so 5 heuristics are available in the
current version. The same control structure is used in all cases.
CLOPT heuristics go through 2 steps. The first step is common to all
CLOPT methods and involves Lt, the link threshold. Lt abstracts a graph
from the data by retaining only links between objects whose similarity
is >= Lt. This graph has 1 or more components depending on the value of
Lt. Typically, the higher the value of Lt, the greater the number of
components, because fewer links survive. Note that graphs at higher Lt
are nested within graphs at lower Lt.
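The Lt step can be illustrated with a short sketch (hypothetical names, not CLOPT's actual code): threshold the similarities at Lt and take the connected components of what remains.

```python
# Hedged sketch of the Lt abstraction: keep links with similarity >= Lt
# and return the connected components of the resulting graph.

def lt_components(sim, n, lt):
    """sim: dict mapping (i, j) with i < j -> similarity."""
    adj = {i: set() for i in range(n)}
    for (i, j), s in sim.items():
        if s >= lt:                       # retain only strong links
            adj[i].add(j)
            adj[j].add(i)
    comps, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        comp, stack = set(), [i]
        while stack:                      # depth-first search
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

sim = {(0, 1): 5, (1, 2): 4, (2, 3): 1, (3, 4): 5}
print(len(lt_components(sim, 5, 4)))  # 2: the weak (2,3) link is cut
print(len(lt_components(sim, 5, 1)))  # 1: lower Lt, fewer components
```

The nesting property is visible here: every component at Lt=4 is contained within a component at Lt=1.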
The second step applies Rt, a resemblance threshold on the Lt graph to
obtain an Rt graph. That is, the Rt graph is nested in the Lt graph.
The definition of Rt varies with the heuristic used. In heuristic 1
(1990 paper), Rt is simply a normalized measure of node connectivity.
Objects with equal or higher connectivity are extracted to form clusters.
Then the Rt graph is "redrawn" without the already selected objects and
the procedure continues until the Rt graph is empty. The premise of
heuristic 1 is that 2 objects in the same component with many links are
likely to belong together in the same group. It is easy to construct
data where heuristic 1 will fail miserably; for example, imagine data
where a component is made of 2 dense clusters connected to each other
by a single bridge link. Despite this seeming vulnerability, the
first heuristic is efficient and works quite well on real data.
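One plausible reading of Heuristic 1's extraction loop is sketched below. The exact rule is given in the 1990 paper; the normalization by maximum degree here is an assumption made purely for illustration.

```python
# Assumption-laden sketch of Heuristic 1's Rt step: score each node by
# normalized connectivity (degree / max degree), extract nodes scoring
# >= Rt as a cluster, redraw the graph without them, and repeat.

def rt_extract(adj, rt):
    """adj: dict node -> set of neighbours in the Lt graph."""
    remaining, groups = set(adj), []
    while remaining:
        deg = {v: len(adj[v] & remaining) for v in remaining}
        top = max(deg.values())
        if top == 0:                      # isolated leftovers -> singletons
            groups.extend({v} for v in sorted(remaining))
            break
        picked = {v for v in remaining if deg[v] / top >= rt}
        groups.append(picked)             # extracted objects form a cluster
        remaining -= picked               # "redraw" without them
    return groups

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
groups = rt_extract(adj, 0.9)
print(groups)
```

On this toy graph the best-connected node comes out first, then the dense pair, then the isolated leftover as a singleton.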
Heuristic 2 uses a definition of Rt that is more resilient, but it pays
a computational price. In Heuristic 2, the average similarity between the
associates of the 2 objects forming each link in the Lt graph is computed.
Associates are directly linked objects in the Lt graph. These averages
are a measure of second-order similarity. For example, if A and B are
linked in the Lt graph we compute the average similarity between 2 object
sets: one includes all associates of A (including A) except B; the other
includes all associates of B (including B) except A. Links < Rt in
this second-order similarity data are eliminated. The components of the
Rt graph are the clusters of the derived partition.
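The second-order similarity for a link (A, B) can be sketched directly from the description above; function and variable names are illustrative, not CLOPT's:

```python
def second_order_sim(sim, adj, a, b):
    """Average raw similarity between (associates of a, incl. a, excl. b)
    and (associates of b, incl. b, excl. a); adj is the Lt graph."""
    set_a = (adj[a] | {a}) - {b}
    set_b = (adj[b] | {b}) - {a}
    pairs = [(i, j) for i in set_a for j in set_b if i != j]
    return sum(sim[frozenset([i, j])] for i, j in pairs) / len(pairs)

# toy data: 4 objects, complete similarity matrix, chain-shaped Lt graph
sim = {frozenset([0, 1]): 8, frozenset([0, 2]): 6, frozenset([0, 3]): 1,
       frozenset([1, 2]): 7, frozenset([1, 3]): 2, frozenset([2, 3]): 9}
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(second_order_sim(sim, adj, 1, 2))  # averages sims between {0,1} and {2,3}
```

The computational price is apparent: each Lt link requires averaging over a cross-product of associate sets.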
In Heuristic 3, second-order similarity is a normalized measure of the
number of associates common to the 2 objects forming a link. Heuristic 3
has yet to be tested and fine-tuned, but it runs a lot faster than
Heuristic 2. Heuristics 4 and 5 are variations on 2 and 3 respectively.
They impose a condition for an Lt link to be retained in addition to the
Rt condition: at least one of the 2 objects forming the Lt link must have
a minimum of MINCONN links, where MINCONN is a specified parameter.
Theoretically, 4 and 5 will work on data that would trick heuristics 2 and
3 respectively, but these variations remain to be tested.
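A sketch of Heuristic 3's measure and the MINCONN condition follows. The document does not specify the normalization, so a Jaccard-style ratio is used here purely as an assumption:

```python
def common_assoc_sim(adj, a, b):
    """Normalized count of associates shared by a and b in the Lt graph.
    The Jaccard-style denominator is an assumption, not CLOPT's rule."""
    na = (adj[a] | {a}) - {b}
    nb = (adj[b] | {b}) - {a}
    return len(na & nb) / len(na | nb)

def minconn_ok(adj, a, b, minconn):
    """Extra retention condition of Heuristics 4 and 5: at least one
    endpoint of the link must have >= minconn associates."""
    return len(adj[a]) >= minconn or len(adj[b]) >= minconn

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(common_assoc_sim(adj, 0, 1))   # 0 and 1 share associate 2
print(minconn_ok(adj, 0, 3, 3))      # neither endpoint has 3 links
```

Counting shared associates needs only set intersections, which is why this runs much faster than averaging raw similarities as Heuristic 2 does.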
In fact, Heuristics 2, 3, 4 and 5 all use a second-order similarity matrix.
Here Rt is used as a threshold on this matrix, much as Lt is used on the
raw data. So the nesting property for Lt also holds for Rt (at fixed
Lt) in these cases. The nesting property of Rt is not guaranteed in
Heuristic 1; however, Heuristic 1 would run faster if this assumption
were made. This modification will be implemented in the future.
All heuristics are called by a search procedure that selectively
examines combinations of Lt and Rt. There are 2 types of search
available: "binary" and "downhill". In binary search, three values
of Lt (high/right, medium/center and low/left) are chosen, the
best solution is set as the new center, and the process continues
until convergence is reached. In downhill search, we start from a
high value of Lt and move downhill in fixed increments until we are
fairly sure that we have observed a local maximum.
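The downhill search over Lt can be sketched as follows. Here fit_fn stands in for the CORR computation, and the parameter names mirror NDOWN/MAXLT/MINLT/STEP; the sketch itself is illustrative, not CLOPT's code:

```python
def downhill_search(fit_fn, max_lt, min_lt, step, ndown):
    """Step Lt down from max_lt; stop ndown steps past the best fit seen."""
    best_lt, best_fit, since_best = None, float("-inf"), 0
    lt = max_lt
    while lt >= min_lt and since_best < ndown:
        fit = fit_fn(lt)
        if fit > best_fit:
            best_lt, best_fit, since_best = lt, fit, 0
        else:
            since_best += 1               # one more step past the maximum
        lt -= step
    return best_lt, best_fit

# toy fit surface peaking at Lt = 0.7 (stand-in for a real CLOPT run)
lt, fit = downhill_search(lambda x: -(x - 0.7) ** 2, 0.95, 0.05, 0.05, 5)
print(round(lt, 2))  # 0.7
```

When the similarities are positively skewed the peak sits near 1, so the search terminates after only a few steps; this is why downhill search is fast on large structured datasets.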
A given run of the program can use one search method throughout, or the
method can be specified interactively for each level of the hierarchy.
Other search parameters specify the range of "promising" Lt values
(normalized in [0, 1]). Generally speaking, the closer Lt is to 1 (the
higher the link threshold), the faster the computations. In many
structured datasets, similarities are positively skewed and optimal
solutions can be obtained when Lt is fairly high. Consequently, for large
datasets "downhill" search tends to be much faster (sometimes by an order
of magnitude) than "binary" search. Note that the values of Rt are always
assigned by a "binary" search criterion; the choice is available (and
useful) only for specifying the method of examining Lt values.
Finally in this version of CLOPT, there is an option for randomly
permuting the data and running the procedure a specified number
of times to obtain a sampling distribution of the fit measure.
One can compare the obtained fit with this distribution. Clustering
methods can capitalize on chance, and fairly high model-data correlations
can be obtained for random data when n is small. The value of the fit
also depends on the distribution of the proximities (positively skewed
similarities favor higher correlations). Of course, it must be pointed
out that the presence of clear clustering structure in the data is also
associated with increased values of the fit between model and data.
However, one would be mistaken to treat the magnitude of the fit as the
sole indication of the "amount" of structure in the data for the reasons
stated above. The position of the obtained fit vis-a-vis the sampling
distribution provides an additional perspective when comparing the
presence of structure in 2 or more datasets.
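The permutation option can be mimicked with a short sketch. Here the full CLOPT fit is replaced by a toy fit function; everything below is illustrative, not CLOPT's code:

```python
import random
import statistics

def permutation_fits(data, fit_fn, nperm, seed=0):
    """Fit values for nperm random relabelings of the n objects."""
    rng = random.Random(seed)
    n = len(data)
    fits = []
    for _ in range(nperm):
        perm = list(range(n))
        rng.shuffle(perm)
        shuffled = [[data[perm[i]][perm[j]] for j in range(n)]
                    for i in range(n)]
        fits.append(fit_fn(shuffled))
    return fits

# two tight pairs (0,1) and (2,3); the toy fit sums the within-pair links
data = [[0, 9, 1, 1], [9, 0, 1, 1], [1, 1, 0, 9], [1, 1, 9, 0]]
fit = lambda m: m[0][1] + m[2][3]
fits = permutation_fits(data, fit, 20)
print(fit(data), statistics.mean(fits))  # original fit vs. permutation mean
```

Comparing the original fit with the mean and SD of the permuted fits gives the kind of standard-deviation comparison shown at the end of the sample output.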
References
Sriram, N. (1990). Clique Optimization: A method to construct parsimonious
ultrametric trees from similarity data. Journal of Classification, 7, 33-52.
Sriram, N., & Lewis, S. (in press). Constructing optimal ultrametrics.
Journal of Classification.
****************
* Running CLOPT*
****************
You can run the executable clopt.exe in one of two ways. The first is
simply to type "clopt" and respond to the queries; a file called
"normal.cfg" will then be written out. If you are not sure about a
response, press Enter to select the default value.
You can also run clopt with the following parameters:
"clopt <NELM> <config_file> <input_file> <output_file>". NELM refers to
the number of elements in the data.
I have compiled the OS/2 32-bit executable using Borland C (single thread,
no PM calls). The source also compiles and runs on DOS. The upper limit
on NELM for the DOS version is around 50. The DOS version also runs slower
than the OS/2 version. I have included the OS/2 and DOS executables in this
distribution; they are named os2clopt.exe and dosclopt.exe.
Use a fixed-width font (like Courier) to print the tree; results will be
less than satisfactory with proportionally spaced fonts (e.g. Times Roman).
On non-PC systems, the extended ASCII characters used for drawing the tree
look weird; however, if suitable characters exist they can be specified in
a header file. E-mail me if you want the source for *nix or other systems.
The memory requirements grow as the square of NELM; I am not sure about
the time complexity, but I guess a practical limit will be reached well
before NELM = 1000. Some heuristics are faster and some take up a little
more space. As noted earlier, using "downhill" search can speed things up
for large datasets. For large datasets, the recommended strategy is to
use downhill search interactively, examining small intervals in [0, 1] for
the values of Lt. Initially, it is also better to use Heuristic 1 or 3
instead of Heuristic 2, as the latter takes more time. However, I have
had no problems applying Heuristic 2 in conjunction with downhill search
to a medium-sized dataset (n=184) using the OS/2 executable.
Below are the contents of a particular configuration file, test.cfg:
2 0 0 0 <HEURISTIC,ISEARCH,ISOTONE,MINCONN>
0 3 1 1 8 <DATYPE,DAFORM,MISSING,LABELS,LLABEL>
1 0 <MODOUT,MOTYPE>
1 60 1 5 1 <TRACE,PAGEW,TREE,MINGAP,BETGAP>
1 1 1 <HOMOG,TYPIC,LEVOUT>
0.01 0.01 0.001 0.0001 -999.999 <DEL_LT,DEL_RT,BETA,VEQUAL,MISSVAL>
1 5 0.950 0.050 0.100 <SEARCH,NDOWN,MAXLT,MINLT,STEP>
0 10 <PERM,NPERM>
Below are the contents of the input file, test.dat:
3
2 2
2 2 4
1 1 1 1
1 1 -999.999 1 5
1 1 1 1 2 2
1 1 1 1 2 2 5
one
two
three
four
five
six
seven
eight
/*Description of parameters in the config file*/
HEURISTIC can be selected from {1, 2, 3, 4, 5}
set ISEARCH=1 to enable interactive search; turn it off with 0
Setting ISOTONE=1 slows things down a bit; in most cases setting ISOTONE=0
will still result in an ultrametric hierarchy, in which the alphas are
monotonic
MINCONN is only relevant for HEURISTICS 4 and 5; choose a small integer,
say from {0, 1, 2, 3}, if you want to experiment. For HEURISTIC 4,
MINCONN=0 makes it equivalent to HEURISTIC 2; HEURISTIC 5 is equivalent to
HEURISTIC 3 when MINCONN=0. MINCONN is a lower bound on the number of
associates an element must have for it to be included in a cluster. This
parameter was included because it helped the heuristics in certain
situations (but is probably of little, if any, consequence in most
datasets).
DATYPE assumes 0=SIMILARITY DATA and 1=DISSIMILARITY DATA
DAFORM assumes 0=FULL MATRIX 1=LOWER DIAGONAL 2=UPPER DIAGONAL
3=LOWER ONLY and 4=UPPER ONLY
if there are missing values in the data MISSING=1 else MISSING=0
MISSVAL is the numerical code for missing values (default is -999.999)
if there are element identifiers in the input file following the data matrix
then LABELS=1 else LABELS=0
LLABEL pertains to the number of characters you want read from the labels
This version can handle spaces (treats them as characters in the label)
and other white space characters.
if you want the model matrix output then MODOUT=1 else MODOUT=0
MOTYPE is analogous to DATYPE. The model matrix is always written out
as a lower-half matrix.
The trace of the program appears on screen; if you want it to be sent
to the output file also then TRACE=1 else TRACE=0
PAGEW is the maximum width of the output, only used to constrain the tree;
Set TREE=1 if you want the standard tree in the output, else TREE=0
MINGAP is the minimum number of columns (default=5) between adjacent levels
BETGAP is the number of rows separating adjacent clusters at the leftmost
part of the tree (use odd numbers like 1 or 3)
If you want a tree with cluster homogeneities set HOMOG=1 else HOMOG=0
For element typicalities at each hierarchy level set TYPIC=1 else TYPIC=0.
If you want the levels in the hierarchy to be represented as indexed
partitions set LEVOUT=1 else LEVOUT=0. These partitions can be useful in
secondary analyses.
DEL_LT is the resolution of the link threshold
DEL_RT is the resolution of the resemblance threshold
BETA is the convergence criterion; it decides when the merging and
dividing processes are halted
VEQUAL: differences less than VEQUAL are considered insignificant when
comparing fits.
Increasing VEQUAL and BETA may speed up CLOPT, but the solution may be
inferior; a similar comment applies to the resolution parameters
Increasing BETA slightly (say from 0.001 to 0.005) will make the model
more parsimonious (with fewer levels in the hierarchy)
SEARCH=1 is binary search; SEARCH=2 is downhill search
NDOWN is only relevant for downhill search; it is the number of downhill
steps past a local maximum after which the search is called off.
MAXLT and MINLT specify the range of Lt search. MAXLT > MINLT and both
are in the range [0, 1]. Used by both binary and downhill search.
STEP is relevant only for downhill search; it specifies the increments
in which Lt is decremented from MAXLT downward. A small increment
(say 0.05) may slow things down but is conservative. Alternatively,
one can use a coarse increment on the first run and a fine increment
on a second run along with a narrower range of MAXLT and MINLT.
set PERM=1 to analyze random permutations of data, else PERM=0
NPERM is the number of random permutations
-----------------------------------------------------------------
Run the program by executing "clopt 8 test.cfg test.dat test.clp"
You can view the outcome in test.clp.
In the output there are some terms that need to be clarified:
In the TRACE, Lt and Rt refer to the thresholds used by the program.
The normalized value of Lt in [0, 1] is given in parentheses. The
normalization is with respect to the range of the data-matrix under
consideration for the divide/merge step (not necessarily the raw data).
OK means that the level is included, KO means it is knocked out.
LEVEL refers to the level in hierarchy, starting from the bottom
NCLUSTER is the number of clusters at a level (including singletons)
WLINKS is the number of within-cluster links at a level
Note: WLINKS will be less than what you would expect if there are
missing values in the data. These are not counted.
AVG WSIM is the average within-cluster similarity at a level
ALPHA is the average within-cluster similarity of links that are newly
formed at the level
If ALPHA is ordered then the tree is ultrametric, but a tree can be
ultrametric even when ALPHA is not ordered. The ALPHAs are used to make
the model matrix, calculate the fit, etc.
The ultrametric constraint implies that, in the model matrix, for any 3
objects the 2 smallest similarities must be equal to each other. For
model distances it implies that the 2 largest distances among 3 objects
must be equal to each other. Thus, this constraint is more severe than
the triangle inequality.
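The three-point condition on similarities can be checked directly; a minimal sketch, with illustrative names:

```python
from itertools import combinations

def is_ultrametric(sim, n):
    """True iff, for every triple of objects, the 2 smallest model
    similarities are equal to each other."""
    for a, b, c in combinations(range(n), 3):
        s = sorted([sim[frozenset([a, b])],
                    sim[frozenset([a, c])],
                    sim[frozenset([b, c])]])
        if s[0] != s[1]:                  # the two smallest must tie
            return False
    return True

ok = {frozenset([0, 1]): 5, frozenset([0, 2]): 2, frozenset([1, 2]): 2}
bad = {frozenset([0, 1]): 5, frozenset([0, 2]): 2, frozenset([1, 2]): 3}
print(is_ultrametric(ok, 3), is_ultrametric(bad, 3))  # True False
```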
PBI_CORR is the point-biserial correlation between each partition (level)
and the data. Partitions are represented by 0's and 1's; 1's stand for
within-cluster links and 0's for between-cluster links. In some sense, the
partition with the maximum PBI_CORR represents the data best (among others
in the tree). Usually, this "basic level" partition is the first partition
derived by CLOPT (exceptions, though infrequent, do occur).
In the homogeneity tree, the values represent average within-cluster
similarities in [1, 1000], normalized with respect to the range of the raw
data. The number of columns between levels is a function of the difference
between the AVG WSIM of adjacent levels. A given tree can be displayed
in many ways; I use the homogeneity of clusters to constrain the display.
Starting from the level on the extreme right, more homogeneous clusters
are placed above less homogeneous clusters; this process is recursive all
the way to the leaves. At the extreme left, objects are sorted by their
typicality at level 1. Singletons are assumed to be maximally homogeneous.
The typicality of an element is its average similarity to the other
members of its cluster. If the homogeneities of clusters were used to
construct the model matrix, the fit to the data would certainly improve;
however, the model might not remain ultrametric. Element typicalities are
normalized to be in the range 0-100 with respect to the range of the raw
data.
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
/ Suggestions for improving CLOPT are welcome. /
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
CLIQUE OPTIMIZATION 0.70
N. Sriram
National University of Singapore
Last Modified: July 18 1993
Please e-mail queries or comments to swknasri@leonis.nus.sg
NELM = 30 HEURISTIC = 1
In_file = henley.dat
Out_file = henley.out
Config_file = hm.cfg
PROGRAM TRACE
DIVIDE 1 --> 10 CORR=0.77142 Lt = 590.430 (0.622) Rt=0.031 OK
MERGE 10 --> 5 CORR=0.82695 Lt = 410.473 (0.624) Rt=0.406 OK
MERGE 5 --> 2 CORR=0.83066 Lt = 306.251 (0.791) Rt=0.562 OK
DIVIDE 10 --> 16 CORR=0.85461 Lt = 716.479 (0.614) Rt=0.562 OK
MERGE 16 --> 12 CORR=0.85825 Lt = 617.632 (0.755) Rt=0.250 OK
MERGE 12 --> 11 CORR=0.85841 Lt = 578.849 (0.943) Rt=0.250 KO
DIVIDE 16 --> 21 CORR=0.86107 Lt = 777.146 (0.512) Rt=0.125 OK
MERGE 21 --> 17 CORR=0.86137 Lt = 726.732 (0.659) Rt=0.250 KO
DIVIDE 21 --> 27 CORR=0.86164 Lt = 850.474 (0.766) Rt=0.750 KO
MAX Data Similarity = 887.000
MIN Data Similarity = 103.000
MEAN Data Similarity = 356.510
The Pearson correlation between data and model = 0.861
The model hierarchy has 7 levels
LEVEL NCLUSTERS WLINKS AVG WSIM ALPHA PBI_CORR
1 21 11 818.909 818.909 0.473
2 16 22 770.136 721.364 0.606
3 12 44 700.364 630.591 0.732
4 10 61 657.557 546.765 0.771
5 5 104 570.875 447.907 0.762
6 2 226 428.810 307.705 0.477
7 1 435 356.510 278.330 0.000
HIERARCHICAL TREE WITH NODE HOMOGENITIES
elephant *───────────────────────────────────────┐
│
camel *───────────────────────────┐ │
│ │
giraffe *───────────────────────────┤ │
│ │
cow *────────────────────┐ │ │
│ │ │
goat *────┐ │ │ │
860──────────────┤ │ │
sheep *────┘ │ │ 515────┐
│ 561──────────┘ │
antelope *────┐ │ │ │
906─────────┐ 635─────┘ │
deer *────┘ │ │ │
│ │ │
donkey *─────────┐ 732───┘ │
│ │ │
horse *────┐ 824────┘ │
882───┘ │
zebra *────┘ │
│
chimp *────┐ │
│ │
monkey *───911─────────────────────────────────┐ │
│ │ │
gorilla *────┘ │ │
│ │
pig *───────────────────────────┐ │ │
│ │ 323
rabbit *───────────────┐ │ │ │
│ │ │ │
beaver *─────────┐ │ │ │ │
816────┤ │ │ │
raccoon *─────────┘ │ │ │ │
│ 636──────────┤ │
mouse *────┐ 717──────────┘ │ │
918───┐ │ │ │
rat *────┘ │ │ │ │
799────┘ │ │
squirrel *────┐ │ 383────┘
878───┘ │
chipmunk *────┘ │
│
bear *───────────────────────────┐ │
│ │
cat *─────────┐ │ │
│ │ │
tiger *────┐ │ │ │
│ 884────────────────┤ │
lion *───954───┘ │ │
│ 579──────────┘
leopard *────┘ │
│
dog *───────────────┐ │
│ │
fox *─────────┐ 744──────────┘
818────┘
wolf *─────────┘
ELEMENT TYPICALITIES AT VARIOUS LEVELS IN THE HIERARCHY
1 2 3 4 5 6 7
elephant 30 20
camel 47 46 26
giraffe 42 42 23
cow 55 51 49 30
goat 86 86 86 64 58 54 38
sheep 86 86 86 59 52 49 34
antelope 90 90 72 64 60 57 36
deer 90 90 70 65 60 56 37
donkey 79 69 65 60 58 35
horse 88 84 76 67 65 62 37
zebra 88 82 75 65 62 59 35
chimp 96 96 96 96 96 36 29
monkey 90 90 90 90 90 37 31
gorilla 86 86 86 86 86 28 23
pig 39 29 31
rabbit 69 69 64 40 35
beaver 81 67 67 63 42 35
raccoon 81 69 69 65 47 38
mouse 91 80 70 70 66 37 29
rat 91 80 71 71 68 40 31
squirrel 87 80 77 77 72 43 35
chipmunk 87 77 75 75 69 42 33
bear 40 28 25
cat 81 81 81 61 48 38
tiger 96 91 91 91 65 34 29
lion 95 90 90 90 66 33 28
leopard 94 90 90 90 65 34 29
dog 70 70 48 41 37
fox 81 73 73 54 43 36
wolf 81 79 79 61 38 31
MODEL-DATA CORRELATIONS FOR RANDOM PERMUTATIONS OF DATA
Permutation 1, Correlation = 0.4671
Permutation 2, Correlation = 0.4842
Permutation 3, Correlation = 0.5499
Permutation 4, Correlation = 0.4878
Permutation 5, Correlation = 0.4959
Permutation 6, Correlation = 0.4691
Permutation 7, Correlation = 0.5181
Permutation 8, Correlation = 0.4861
Permutation 9, Correlation = 0.4673
Permutation 10, Correlation = 0.4566
Permutation 11, Correlation = 0.5075
Permutation 12, Correlation = 0.5359
Permutation 13, Correlation = 0.4349
Permutation 14, Correlation = 0.4819
Permutation 15, Correlation = 0.5346
Permutation 16, Correlation = 0.5119
Permutation 17, Correlation = 0.4971
Permutation 18, Correlation = 0.4854
Permutation 19, Correlation = 0.4458
Permutation 20, Correlation = 0.5041
Mean = 0.4911 SD = 0.0294 MAX = 0.5499
Original fit is 12.602 SDs above the mean of the sampling distribution
Original fit is 10.599 SDs above the maximum of the sampling distribution