CLIQUE OPTIMIZATION (CLOPT) 0.7
Last Modified on July 18, 1993
N. Sriram
National University of Singapore
email: swknasri@leonis.nus.sg
The first part of this document sketches the logic used by CLOPT.
The second part describes the operation of the program. The third
section contains portions of CLOPT's output using a dataset based on
the rated similarity among 30 mammals.
CLOPT is a clustering approach that takes proximity data as input and
constructs a hierarchical tree. These trees are parsimonious in that
they have far fewer levels than the n-1 levels produced by hierarchical
clustering scheme (HCS) methods. Simulations also indicate that CLOPT
is robust in recovering structure from data. For more details, refer
to the articles cited at the end of this section.
HCS methods are "bottom-up": they cluster 2 elements, recompute proximity
data for the remaining n-1 objects, merge 2 objects again, recompute
proximities among n-2 objects, and so on until all elements are merged
into 1 group. The various HCS methods differ in the criterion used to
decide which pair of objects are nearest neighbours at each step of the
iteration.
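The bottom-up loop just described can be sketched as follows. This is an illustrative single-linkage sketch, not CLOPT's code; all names are invented for the example.

```python
# Illustrative "bottom-up" HCS loop: merge the nearest pair of clusters,
# then repeat on the reduced set until one group remains. Single linkage
# is used as the nearest-neighbour criterion here; other HCS methods
# differ only in this criterion.

def hcs_merges(sim, n):
    """sim: dict mapping frozenset({i, j}) -> similarity.
    Returns the n-1 merged clusters in order of formation."""
    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # pick the cluster pair with the strongest cross link
        a, b = max(
            ((x, y) for i, x in enumerate(clusters)
                    for y in clusters[i + 1:]),
            key=lambda p: max(sim[frozenset([u, v])]
                              for u in p[0] for v in p[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges

sim = {frozenset([0, 1]): 9, frozenset([0, 2]): 2, frozenset([1, 2]): 3}
merges = hcs_merges(sim, 3)
print(len(merges))  # n-1 = 2 merges, hence 2 levels
```

Note that the loop always produces exactly n-1 levels, which is what makes CLOPT's shallower trees more parsimonious.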
CLOPT's control structure is both "top-down" and "bottom-up". In the
first step, CLOPT divides the n elements into a partition in the middle of
the hierarchy. This partition, called a B ("Basic") level, is chosen to
maximize a least-squares fit index of the partition with the data. One of
several graph partitioning heuristics may be used for this purpose. CLOPT
can then merge back from the B level to the T ("Top") level, creating
intermediate (I) levels in the hierarchy. The same heuristic is also
used in merging, although the program can easily be modified to use
different heuristics for dividing and merging. The first basic level
can then be divided and a similar process carried out. Consider an example
from data that are almost perfectly hierarchical (n = 8 elements):
3
2 2
2 2 4
1 1 1 1
1 1 2 1 5
1 1 1 1 2 2
1 1 1 1 2 2 5
DIVIDE 1 --> 4 CORR=0.87857 Lt = 2.992 (0.498) Rt=0.125 OK
MERGE 4 --> 2 CORR=0.94845 Lt = 1.941 (0.941) Rt=0.250 OK
DIVIDE 4 --> 6 CORR=0.98023 Lt = 4.889 (0.945) Rt=0.250 OK
MERGE 6 --> 5 CORR=0.98715 Lt = 3.943 (0.943) Rt=0.250 OK
DIVIDE All Data values are identical
The first B level has 4 groups; this merges into a 2-group I level. Then
the first B level divides into the second B level with 6 groups, which in
turn merges into an I level with 5 groups. Further division is not carried
out because it does not increase CORR sufficiently. The OK indicates that
the level that was created was accepted by the model because the change
in CORR was greater than BETA, a convergence criterion (here BETA = 0.001).
CLOPT implements a number of clustering heuristics. One such heuristic is
described in the 1990 paper; another, more robust heuristic is described in
the second paper. This program includes a third heuristic and minor
variations of the second and third, so 5 heuristics are available in the
current version. The same control structure is used in all cases.
CLOPT heuristics go through 2 steps. The first step is common to all
CLOPT methods and involves Lt, the link threshold. Lt abstracts a graph
from the data by retaining only links between objects whose similarity
is >= Lt. This graph has 1 or more components depending on the value of
Lt. Typically, the higher the value of Lt, the greater the number of
components, because fewer links survive. Note that graphs at higher Lt
are nested within graphs at lower Lt.
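The Lt step can be illustrated with a short sketch (hypothetical names, not CLOPT's actual code): threshold the similarities at Lt and take the connected components of what remains.

```python
# Hedged sketch of the Lt abstraction: keep links with similarity >= Lt
# and return the connected components of the resulting graph.

def lt_components(sim, n, lt):
    """sim: dict mapping (i, j) with i < j -> similarity."""
    adj = {i: set() for i in range(n)}
    for (i, j), s in sim.items():
        if s >= lt:                       # retain only strong links
            adj[i].add(j)
            adj[j].add(i)
    comps, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        comp, stack = set(), [i]
        while stack:                      # depth-first search
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

sim = {(0, 1): 5, (1, 2): 4, (2, 3): 1, (3, 4): 5}
print(len(lt_components(sim, 5, 4)))  # 2: the weak (2,3) link is cut
print(len(lt_components(sim, 5, 1)))  # 1: lower Lt, fewer components
```

The nesting property is visible here: every component at Lt=4 is contained within a component at Lt=1.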
The second step applies Rt, a resemblance threshold on the Lt graph to
obtain an Rt graph. That is, the Rt graph is nested in the Lt graph.
The definition of Rt varies with the heuristic used. In heuristic 1
(1990 paper), Rt is simply a normalized measure of node connectivity.
Objects with equal or higher connectivity are extracted to form clusters.
Then the Rt graph is "redrawn" without the already selected objects and
the procedure continues until the Rt graph is empty. The premise of
heuristic 1 is that 2 objects in the same component with many links are
likely to belong together in the same group. It is easy to construct
data where heuristic 1 will fail miserably; for example, imagine data
where a component is made of 2 dense clusters connected to each other
by a single bridge link. Despite this seeming vulnerability, the
first heuristic is efficient and works quite well on real data.
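One plausible reading of Heuristic 1's extraction loop is sketched below. The exact rule is given in the 1990 paper; the normalization by maximum degree here is an assumption made purely for illustration.

```python
# Assumption-laden sketch of Heuristic 1's Rt step: score each node by
# normalized connectivity (degree / max degree), extract nodes scoring
# >= Rt as a cluster, redraw the graph without them, and repeat.

def rt_extract(adj, rt):
    """adj: dict node -> set of neighbours in the Lt graph."""
    remaining, groups = set(adj), []
    while remaining:
        deg = {v: len(adj[v] & remaining) for v in remaining}
        top = max(deg.values())
        if top == 0:                      # isolated leftovers -> singletons
            groups.extend({v} for v in sorted(remaining))
            break
        picked = {v for v in remaining if deg[v] / top >= rt}
        groups.append(picked)             # extracted objects form a cluster
        remaining -= picked               # "redraw" without them
    return groups

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
groups = rt_extract(adj, 0.9)
print(groups)
```

On this toy graph the best-connected node comes out first, then the dense pair, then the isolated leftover as a singleton.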
Heuristic 2 uses a definition of Rt that is more resilient, but it pays
a computational price. In Heuristic 2, the average similarity between the
associates of the 2 objects forming each link in the Lt graph is computed.
Associates are directly linked objects in the Lt graph. These averages
are a measure of second-order similarity. For example, if A and B are
linked in the Lt graph we compute the average similarity between 2 object
sets: one includes all associates of A (including A) except B; the other
includes all associates of B (including B) except A. Links < Rt in
this second-order similarity data are eliminated. The components of the
Rt graph are the clusters of the derived partition.
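The second-order similarity for a link (A, B) can be sketched directly from the description above; function and variable names are illustrative, not CLOPT's:

```python
def second_order_sim(sim, adj, a, b):
    """Average raw similarity between (associates of a, incl. a, excl. b)
    and (associates of b, incl. b, excl. a); adj is the Lt graph."""
    set_a = (adj[a] | {a}) - {b}
    set_b = (adj[b] | {b}) - {a}
    pairs = [(i, j) for i in set_a for j in set_b if i != j]
    return sum(sim[frozenset([i, j])] for i, j in pairs) / len(pairs)

# toy data: 4 objects, complete similarity matrix, chain-shaped Lt graph
sim = {frozenset([0, 1]): 8, frozenset([0, 2]): 6, frozenset([0, 3]): 1,
       frozenset([1, 2]): 7, frozenset([1, 3]): 2, frozenset([2, 3]): 9}
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(second_order_sim(sim, adj, 1, 2))  # averages sims between {0,1} and {2,3}
```

The computational price is apparent: each Lt link requires averaging over a cross-product of associate sets.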
In Heuristic 3, second-order similarity is a normalized measure of the
number of associates common to the 2 objects forming a link. Heuristic 3
has yet to be tested and fine-tuned, but it runs a lot faster than
Heuristic 2. Heuristics 4 and 5 are variations on 2 and 3 respectively.
They impose a condition for an Lt link to be retained in addition to the
Rt condition: at least one of the 2 objects forming the Lt link must have
a minimum of MINCONN links, where MINCONN is a specified parameter.
Theoretically, 4 and 5 will work on data that would trick heuristics 2 and
3 respectively, but these variations remain to be tested.
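A sketch of Heuristic 3's measure and the MINCONN condition follows. The document does not specify the normalization, so a Jaccard-style ratio is used here purely as an assumption:

```python
def common_assoc_sim(adj, a, b):
    """Normalized count of associates shared by a and b in the Lt graph.
    The Jaccard-style denominator is an assumption, not CLOPT's rule."""
    na = (adj[a] | {a}) - {b}
    nb = (adj[b] | {b}) - {a}
    return len(na & nb) / len(na | nb)

def minconn_ok(adj, a, b, minconn):
    """Extra retention condition of Heuristics 4 and 5: at least one
    endpoint of the link must have >= minconn associates."""
    return len(adj[a]) >= minconn or len(adj[b]) >= minconn

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(common_assoc_sim(adj, 0, 1))   # 0 and 1 share associate 2
print(minconn_ok(adj, 0, 3, 3))      # neither endpoint has 3 links
```

Counting shared associates needs only set intersections, which is why this runs much faster than averaging raw similarities as Heuristic 2 does.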
In fact, Heuristics 2, 3, 4 and 5 all use a second-order similarity matrix.
Here Rt is used as a threshold on this matrix, much as Lt is used on the
raw data. So the nesting property for Lt also holds for Rt (at fixed
Lt) in these cases. The nesting property of Rt is not guaranteed in
Heuristic 1; however, Heuristic 1 would run faster if this assumption
were made. This modification will be implemented in the future.
All heuristics are called by a search procedure that selectively
examines combinations of Lt and Rt. There are 2 types of search
available: "binary" and "downhill". In binary search, three values
of Lt (high/right, medium/center and low/left) are chosen, the
best solution is set as the new center, and the process continues
until convergence is reached. In downhill search, we start from a
high value of Lt and move downhill in fixed increments until we are
fairly sure that we have observed a local maximum.
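The downhill search over Lt can be sketched as follows. Here fit_fn stands in for the CORR computation, and the parameter names mirror NDOWN/MAXLT/MINLT/STEP; the sketch itself is illustrative, not CLOPT's code:

```python
def downhill_search(fit_fn, max_lt, min_lt, step, ndown):
    """Step Lt down from max_lt; stop ndown steps past the best fit seen."""
    best_lt, best_fit, since_best = None, float("-inf"), 0
    lt = max_lt
    while lt >= min_lt and since_best < ndown:
        fit = fit_fn(lt)
        if fit > best_fit:
            best_lt, best_fit, since_best = lt, fit, 0
        else:
            since_best += 1               # one more step past the maximum
        lt -= step
    return best_lt, best_fit

# toy fit surface peaking at Lt = 0.7 (stand-in for a real CLOPT run)
lt, fit = downhill_search(lambda x: -(x - 0.7) ** 2, 0.95, 0.05, 0.05, 5)
print(round(lt, 2))  # 0.7
```

When the similarities are positively skewed the peak sits near 1, so the search terminates after only a few steps; this is why downhill search is fast on large structured datasets.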
A given run of the program can use one search method throughout, or the
method can be specified interactively for each level of the hierarchy.
Other search parameters specify the range of "promising" Lt values
(normalized in [0, 1]). Generally speaking, the closer Lt is to 1 (the
higher the link threshold), the faster the computations. In many
structured datasets, similarities are positively skewed and optimal
solutions can be obtained when Lt is fairly high. Consequently, for large
datasets "downhill" search tends to be much faster (sometimes by an order
of magnitude) than "binary" search. Note that the values of Rt are always
assigned by a "binary" search criterion; the choice is available (and
useful) only for specifying the method of examining Lt values.
Finally in this version of CLOPT, there is an option for randomly
permuting the data and running the procedure a specified number
of times to obtain a sampling distribution of the fit measure.
One can compare the obtained fit with this distribution. Clustering
methods can capitalize on chance, and fairly high model-data correlations
can be obtained for random data when n is small. The value of the fit
also depends on the distribution of the proximities (positively skewed
similarities favor higher correlations). Of course, it must be pointed
out that the presence of clear clustering structure in the data is also
associated with increased values of the fit between model and data.
However, one would be mistaken to treat the magnitude of the fit as the
sole indication of the "amount" of structure in the data for the reasons
stated above. The position of the obtained fit vis-a-vis the sampling
distribution provides an additional perspective when comparing the
presence of structure in 2 or more datasets.
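The permutation option can be mimicked with a short sketch. Here the full CLOPT fit is replaced by a toy fit function; everything below is illustrative, not CLOPT's code:

```python
import random
import statistics

def permutation_fits(data, fit_fn, nperm, seed=0):
    """Fit values for nperm random relabelings of the n objects."""
    rng = random.Random(seed)
    n = len(data)
    fits = []
    for _ in range(nperm):
        perm = list(range(n))
        rng.shuffle(perm)
        shuffled = [[data[perm[i]][perm[j]] for j in range(n)]
                    for i in range(n)]
        fits.append(fit_fn(shuffled))
    return fits

# two tight pairs (0,1) and (2,3); the toy fit sums the within-pair links
data = [[0, 9, 1, 1], [9, 0, 1, 1], [1, 1, 0, 9], [1, 1, 9, 0]]
fit = lambda m: m[0][1] + m[2][3]
fits = permutation_fits(data, fit, 20)
print(fit(data), statistics.mean(fits))  # original fit vs. permutation mean
```

Comparing the original fit with the mean and SD of the permuted fits gives the kind of standard-deviation comparison shown at the end of the sample output.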
References
Sriram, N. (1990). Clique Optimization: A method to construct parsimonious
ultrametric trees from similarity data. Journal of Classification, 7, 33-52.
Sriram, N., & Lewis, S. (in press). Constructing optimal ultrametrics.
Journal of Classification.
****************
* Running CLOPT*
****************
You can run the executable clopt.exe in one of two ways. The first is
simply to type "clopt" and respond to the queries; a file called
"normal.cfg" will then be written out. If you are not sure about a
response, press Enter to select the default value.
You can also run clopt with the following parameters:
"clopt <NELM> <config_file> <input_file> <output_file>". NELM refers to
the number of elements in the data.
I have compiled the OS/2 32-bit executable using Borland C (single thread,
no PM calls). The source also compiles and runs on DOS. The upper limit
on NELM for the DOS version is around 50. The DOS version also runs slower
than the OS/2 version. I have included the OS/2 and DOS executables in this
distribution; they are named os2clopt.exe and dosclopt.exe.
Use a fixed-width font (like Courier) to print the tree; results will be
less than satisfactory with proportionally spaced fonts (e.g. Times Roman).
On non-PC systems, the extended ASCII characters used for drawing the tree
look weird; however, if suitable characters exist they can be specified in
a header file. E-mail me if you want the source for *nix or other systems.
The memory requirements grow as the square of NELM; I am not sure about
the time complexity, but I guess a practical limit will be reached well
before NELM = 1000. Some heuristics are faster and some take up a little
more space. As noted earlier, using "downhill" search can speed things up
for large datasets. For large datasets, the recommended strategy is to
use downhill search interactively, examining small intervals in [0, 1] for
the values of Lt. Initially, it is also better to use Heuristic 1 or 3
instead of Heuristic 2, as the latter takes more time. However, I have
had no problems applying Heuristic 2 in conjunction with downhill search
to a medium-sized dataset (n=184) using the OS/2 executable.
Below are the contents of a particular configuration file, test.cfg:
2 0 0 0 <HEURISTIC,ISEARCH,ISOTONE,MINCONN>
0 3 1 1 8 <DATYPE,DAFORM,MISSING,LABELS,LLABEL>
1 0 <MODOUT,MOTYPE>
1 60 1 5 1 <TRACE,PAGEW,TREE,MINGAP,BETGAP>
1 1 1 <HOMOG,TYPIC,LEVOUT>
0.01 0.01 0.001 0.0001 -999.999 <DEL_LT,DEL_RT,BETA,VEQUAL,MISSVAL>
1 5 0.950 0.050 0.100 <SEARCH,NDOWN,MAXLT,MINLT,STEP>
0 10 <PERM,NPERM>
Below are the contents of the input file, test.dat:
3
2 2
2 2 4
1 1 1 1
1 1 -999.999 1 5
1 1 1 1 2 2
1 1 1 1 2 2 5
one
two
three
four
five
six
seven
eight
/*Description of parameters in the config file*/
HEURISTIC can be selected from {1, 2, 3, 4, 5}
set ISEARCH=1 to enable interactive search; turn it off with 0
Setting ISOTONE=1 slows things down a bit; in most cases setting ISOTONE=0
will still result in an ultrametric hierarchy, in which the alphas are
monotonic
MINCONN is only relevant for HEURISTICS 4 and 5; choose a small integer,
say from {0, 1, 2, 3}, if you want to experiment. For HEURISTIC 4,
MINCONN=0 makes it equivalent to HEURISTIC 2; HEURISTIC 5 is equivalent to
HEURISTIC 3 when MINCONN=0. MINCONN is a lower bound on the number of
associates an element must have for it to be included in a cluster. This
parameter was included because it helped the heuristics in certain
situations (but is probably of little, if any, consequence in most
datasets).
DATYPE assumes 0=SIMILARITY DATA and 1=DISSIMILARITY DATA
DAFORM assumes 0=FULL MATRIX 1=LOWER DIAGONAL 2=UPPER DIAGONAL
3=LOWER ONLY and 4=UPPER ONLY
if there are missing values in the data MISSING=1 else MISSING=0
MISSVAL is the numerical code for missing values (default is -999.999)
if there are element identifiers in the input file following the data matrix
then LABELS=1 else LABELS=0
LLABEL pertains to the number of characters you want read from the labels
This version can handle spaces (treats them as characters in the label)
and other white space characters.
if you want the model matrix output then MODOUT=1 else MODOUT=0
MOTYPE is analogous to DATYPE. The model matrix is always written out
as a lower-half matrix.
The trace of the program appears on screen; if you want it to be sent
to the output file also then TRACE=1 else TRACE=0
PAGEW is the maximum width of the output, only used to constrain the tree;
Set TREE=1 if you want the standard tree in the output, else TREE=0
MINGAP is the minimum number of columns (default=5) between adjacent levels
BETGAP is the number of rows separating adjacent clusters at the leftmost
part of the tree (use odd numbers like 1 or 3)
If you want a tree with cluster homogeneities set HOMOG=1 else HOMOG=0
For element typicalities at each hierarchy level set TYPIC=1 else TYPIC=0.
If you want the levels in the hierarchy to be represented as indexed
partitions set LEVOUT=1 else LEVOUT=0. These partitions can be useful in
secondary analyses.
DEL_LT is the resolution of the link threshold
DEL_RT is the resolution of the resemblance threshold
BETA is the convergence criterion; it decides when the merging and
dividing processes are halted
VEQUAL: differences less than VEQUAL are considered insignificant when
comparing fits.
Increasing VEQUAL and BETA may speed up CLOPT, but the solution may be
inferior; a similar comment applies to the resolution parameters
Increasing BETA slightly (say from 0.001 to 0.005) will make the model
more parsimonious (with fewer levels in the hierarchy)
SEARCH=1 is binary search; SEARCH=2 is downhill search
NDOWN is only relevant for downhill search; it is the number of downhill
steps past a local maximum after which the search is called off.
MAXLT and MINLT specify the range of Lt search. MAXLT > MINLT and both
are in the range [0, 1]. Used by both binary and downhill search.
STEP is relevant only for downhill search; it specifies the increments
in which Lt is decremented from MAXLT downward. A small increment
(say 0.05) may slow things down but is conservative. Alternatively,
one can use a coarse increment on the first run and a fine increment
on a second run along with a narrower range of MAXLT and MINLT.
set PERM=1 to analyze random permutations of data, else PERM=0
NPERM is the number of random permutations
-----------------------------------------------------------------
Run the program by executing "clopt 8 test.cfg test.dat test.clp"
You can view the outcome in test.clp.
In the output there are some terms that need to be clarified:
In the TRACE, Lt and Rt refer to the thresholds used by the program.
The normalized value of Lt in [0, 1] is given in parentheses. The
normalization is with respect to the range of the data-matrix under
consideration for the divide/merge step (not necessarily the raw data).
OK means that the level is included, KO means it is knocked out.
LEVEL refers to the level in hierarchy, starting from the bottom
NCLUSTER is the number of clusters at a level (including singletons)
WLINKS is the number of within-cluster links at a level
Note: WLINKS will be less than what you would expect if there are
missing values in the data. These are not counted.
AVG WSIM is the average within-cluster similarity at a level
ALPHA is the average within-cluster similarity of links that are newly
formed at the level
If ALPHA is ordered then the tree is ultrametric, but a tree can be
ultrametric even when ALPHA is not ordered. The ALPHAs are used to make
the model matrix, calculate the fit, etc.
The ultrametric constraint implies that, in the model matrix, for any 3
objects the 2 smallest similarities must be equal to each other. For
model distances it implies that the 2 largest distances among 3 objects
must be equal to each other. Thus, this constraint is more severe than
the triangle inequality.
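The three-point condition on similarities can be checked directly; a minimal sketch, with illustrative names:

```python
from itertools import combinations

def is_ultrametric(sim, n):
    """True iff, for every triple of objects, the 2 smallest model
    similarities are equal to each other."""
    for a, b, c in combinations(range(n), 3):
        s = sorted([sim[frozenset([a, b])],
                    sim[frozenset([a, c])],
                    sim[frozenset([b, c])]])
        if s[0] != s[1]:                  # the two smallest must tie
            return False
    return True

ok = {frozenset([0, 1]): 5, frozenset([0, 2]): 2, frozenset([1, 2]): 2}
bad = {frozenset([0, 1]): 5, frozenset([0, 2]): 2, frozenset([1, 2]): 3}
print(is_ultrametric(ok, 3), is_ultrametric(bad, 3))  # True False
```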
PBI_CORR is the point-biserial correlation between each partition (level)
and the data. Partitions are represented by 0's and 1's; 1's stand for
within-cluster links and 0's for between-cluster links. In some sense, the
partition with the maximum PBI_CORR represents the data best (among others
in the tree). Usually, this "basic level" partition is the first partition
derived by CLOPT (exceptions, though infrequent, do occur).
In the homogeneity tree, the values represent average within-cluster
similarities in [1, 1000], normalized with respect to the range of the raw
data. The number of columns between levels is a function of the difference
between the AVG WSIM of adjacent levels. A given tree can be displayed
in many ways; I use the homogeneity of clusters to constrain the display.
Starting from the level on the extreme right, more homogeneous clusters
are placed above less homogeneous clusters; this process is recursive all
the way to the leaves. At the extreme left, objects are sorted by their
typicality at level 1. Singletons are assumed to be maximally homogeneous.
The typicality of an element is its average similarity to the other
members of its cluster. If the homogeneities of clusters were used to
construct the model matrix, the fit to the data would certainly improve;
however, the model might not remain ultrametric. Element typicalities are
normalized to be in the range 0-100 with respect to the range of the raw
data.
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
/ Suggestions for improving CLOPT are welcome. /
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
CLIQUE OPTIMIZATION 0.70
N. Sriram
National University of Singapore
Last Modified: July 18 1993
Please e-mail queries or comments to swknasri@leonis.nus.sg
NELM = 30 HEURISTIC = 1
In_file = henley.dat
Out_file = henley.out
Config_file = hm.cfg
PROGRAM TRACE
DIVIDE 1 --> 10 CORR=0.77142 Lt = 590.430 (0.622) Rt=0.031 OK
MERGE 10 --> 5 CORR=0.82695 Lt = 410.473 (0.624) Rt=0.406 OK
MERGE 5 --> 2 CORR=0.83066 Lt = 306.251 (0.791) Rt=0.562 OK
DIVIDE 10 --> 16 CORR=0.85461 Lt = 716.479 (0.614) Rt=0.562 OK
MERGE 16 --> 12 CORR=0.85825 Lt = 617.632 (0.755) Rt=0.250 OK
MERGE 12 --> 11 CORR=0.85841 Lt = 578.849 (0.943) Rt=0.250 KO
DIVIDE 16 --> 21 CORR=0.86107 Lt = 777.146 (0.512) Rt=0.125 OK
MERGE 21 --> 17 CORR=0.86137 Lt = 726.732 (0.659) Rt=0.250 KO
DIVIDE 21 --> 27 CORR=0.86164 Lt = 850.474 (0.766) Rt=0.750 KO
MAX Data Similarity = 887.000
MIN Data Similarity = 103.000
MEAN Data Similarity = 356.510
The Pearson correlation between data and model = 0.861
The model hierarchy has 7 levels
LEVEL NCLUSTERS WLINKS AVG WSIM ALPHA PBI_CORR
1 21 11 818.909 818.909 0.473
2 16 22 770.136 721.364 0.606
3 12 44 700.364 630.591 0.732
4 10 61 657.557 546.765 0.771
5 5 104 570.875 447.907 0.762
6 2 226 428.810 307.705 0.477
7 1 435 356.510 278.330 0.000
HIERARCHICAL TREE WITH NODE HOMOGENITIES
elephant *───────────────────────────────────────┐
│
camel *───────────────────────────┐ │
│ │
giraffe *───────────────────────────┤ │
│ │
cow *────────────────────┐ │ │
│ │ │
goat *────┐ │ │ │
860──────────────┤ │ │
sheep *────┘ │ │ 515────┐
│ 561──────────┘ │
antelope *────┐ │ │ │
906─────────┐ 635─────┘ │
deer *────┘ │ │ │
│ │ │
donkey *─────────┐ 732───┘ │
│ │ │
horse *────┐ 824────┘ │
882───┘ │
zebra *────┘ │
│
chimp *────┐ │
│ │
monkey *───911─────────────────────────────────┐ │
│ │ │
gorilla *────┘ │ │
│ │
pig *───────────────────────────┐ │ │
│ │ 323
rabbit *───────────────┐ │ │ │
│ │ │ │
beaver *─────────┐ │ │ │ │
816────┤ │ │ │
raccoon *─────────┘ │ │ │ │
│ 636──────────┤ │
mouse *────┐ 717──────────┘ │ │
918───┐ │ │ │
rat *────┘ │ │ │ │
799────┘ │ │
squirrel *────┐ │ 383────┘
878───┘ │
chipmunk *────┘ │
│
bear *───────────────────────────┐ │
│ │
cat *─────────┐ │ │
│ │ │
tiger *────┐ │ │ │
│ 884────────────────┤ │
lion *───954───┘ │ │
│ 579──────────┘
leopard *────┘ │
│
dog *───────────────┐ │
│ │
fox *─────────┐ 744──────────┘
818────┘
wolf *─────────┘
ELEMENT TYPICALITIES AT VARIOUS LEVELS IN THE HIERARCHY
1 2 3 4 5 6 7
elephant 30 20
camel 47 46 26
giraffe 42 42 23
cow 55 51 49 30
goat 86 86 86 64 58 54 38
sheep 86 86 86 59 52 49 34
antelope 90 90 72 64 60 57 36
deer 90 90 70 65 60 56 37
donkey 79 69 65 60 58 35
horse 88 84 76 67 65 62 37
zebra 88 82 75 65 62 59 35
chimp 96 96 96 96 96 36 29
monkey 90 90 90 90 90 37 31
gorilla 86 86 86 86 86 28 23
pig 39 29 31
rabbit 69 69 64 40 35
beaver 81 67 67 63 42 35
raccoon 81 69 69 65 47 38
mouse 91 80 70 70 66 37 29
rat 91 80 71 71 68 40 31
squirrel 87 80 77 77 72 43 35
chipmunk 87 77 75 75 69 42 33
bear 40 28 25
cat 81 81 81 61 48 38
tiger 96 91 91 91 65 34 29
lion 95 90 90 90 66 33 28
leopard 94 90 90 90 65 34 29
dog 70 70 48 41 37
fox 81 73 73 54 43 36
wolf 81 79 79 61 38 31
MODEL-DATA CORRELATIONS FOR RANDOM PERMUTATIONS OF DATA
Permutation 1, Correlation = 0.4671
Permutation 2, Correlation = 0.4842
Permutation 3, Correlation = 0.5499
Permutation 4, Correlation = 0.4878
Permutation 5, Correlation = 0.4959
Permutation 6, Correlation = 0.4691
Permutation 7, Correlation = 0.5181
Permutation 8, Correlation = 0.4861
Permutation 9, Correlation = 0.4673
Permutation 10, Correlation = 0.4566
Permutation 11, Correlation = 0.5075
Permutation 12, Correlation = 0.5359
Permutation 13, Correlation = 0.4349
Permutation 14, Correlation = 0.4819
Permutation 15, Correlation = 0.5346
Permutation 16, Correlation = 0.5119
Permutation 17, Correlation = 0.4971
Permutation 18, Correlation = 0.4854
Permutation 19, Correlation = 0.4458
Permutation 20, Correlation = 0.5041
Mean = 0.4911 SD = 0.0294 MAX = 0.5499
Original fit is 12.602 SDs above the mean of the sampling distribution
Original fit is 10.599 SDs above the maximum of the sampling distribution