Chapter 6: STATISTICAL COMMANDS
In this chapter the statistical commands available are
described. This manual does not seek to replace a
statistics textbook, so only minimal guidance will be
given as to which tests are appropriate for which data.
The field is complex and controversial, and a user who is
not sure which test to use should consult a textbook or
professional statistician for guidance.
Broadly speaking the tests may be divided according to
whether they deal with parametric, nonparametric or
categorical data. Data which is parametric should be
continuous rather than discrete, and ideally should
follow a normal distribution though different tests are
more or less robust to departures from normality. It
should be like the markings on a ruler in that the
distance between each pair of consecutive numbers is
always equal, with the proviso that to satisfy the
requirement that the data is continuous the "marks"
should be close together. Nonparametric data need not be
so distributed, but the values must be ordinal in the
sense that it is always possible to say that one value is
greater than another. All the nonparametric tests
supplied (Wilcoxon's rank sum, Wilcoxon's signed rank,
Kendall's rank correlation coefficient, the Kruskal-
Wallis test and the Kolmogorov-Smirnov test) work by
first assigning ranks to the values and then comparing
the ranks rather than the values themselves. Categorical
information lacks even this quality of being ordered: one
can say only that one value is different from another,
not that it is greater or less.
Parametric data might include height, weight, blood
pressure or temperature. It is often acceptable to treat
age in years as parametric provided that the total age
range is reasonably large, since it can then approximate
a continuous distribution.
Nonparametric data would include age if it were broken
down by decades, an assessment scale with only five
points, social class, rank score on a measure, number of
children, etc.
Categorical data might include gender, marital status,
ethnic origin, etc.
It is always possible to treat parametric data as if it
were nonparametric, and any data may be treated as
categorical. However, if the data is distributed such
that a parametric test is feasible, this should be used
in preference to a nonparametric one, since the
parametric test will have more power, i.e. the
nonparametric test might produce a spuriously negative
result. If, on the other hand, the data is nonparametric
then a nonparametric test should be used, since otherwise
spuriously positive or negative results may be produced.
Unless there are good reasons to use cut-off points to
divide ordinal data into categories, categorical tests
should not be used on ordinal data, because power will be
lost and spuriously negative results can occur.
50
Statistical commands
No specific test of normality is provided, and the user's
understanding of the nature of the quantity which the
data measures is crucial. However examining the frequency
distribution, skewness and kurtosis may be helpful, and
also note should be taken of how closely together lie the
mean, median and mode. If they are far apart then the
data must be skewed. Sometimes data which is quite non-
normally distributed can be converted to data that more
closely follows a normal distribution by applying a
mathematical transformation. One of these is simply to
take the log of the value. Other suggestions are
described in textbooks.
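As an illustration (not part of EASISTAT), the following Python sketch shows how taking logs can reduce skewness. The skewness coefficient here is the mean cubed deviation divided by the cube of the standard deviation; the data values are invented for the example:

```python
import math

def skewness(xs):
    # Coefficient of skewness: mean cubed deviation / sd**3
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 3 for x in xs) / n / sd ** 3

raw = [1, 2, 4, 8, 16, 32, 64]       # strongly right-skewed
logged = [math.log(x) for x in raw]  # symmetric after the transform

print(skewness(raw))     # clearly positive
print(skewness(logged))  # very close to zero
```

A positive skewness for the raw values and a near-zero skewness for the logged values indicates that the transform has brought the data closer to symmetry.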
The chi-squared test compares data divided into
categories in two different ways. The Wilcoxon rank sum
test compares nonparametric data between two groups
defined categorically, as does the Kolmogorov-Smirnov
test. The Kruskal-Wallis one-way analysis of variance
does the same for more than two groups. The Wilcoxon
signed rank sum test can be used to compare pairs of
measures in two different columns. Kendall's rank
correlation coefficient compares the relationship of two
nonparametric measures. Student's t test compares a
parametric measure in two groups defined categorically,
and the analysis of variance does the same thing with
more than two groups. The standard (Pearson's)
correlation coefficient with linear regression compares
data from two parametric measures. Multiple linear
regression compares data from one parametric measure with
data from several other parametric measures (though for
some purposes this requirement may be relaxed, for
example in discriminant analysis). Principal components
analysis analyses data from several measures, which are
all taken to be parametric. Finally a general purpose
minimisation routine is provided, which can be used to
perform non-linear regression and other optimisation
problems.
6:1. Basics
Format: B[asics] [r[anks]] [g[raphfile]] column [if
condition]
Outputs basic information about the data in a column -
the total of the values in the column and the number of
items, and the mean, mode, median, minimum, maximum,
variance, standard deviation, standard error of the mean,
skewness and kurtosis. Optionally a frequency and rank
table of the values may also be produced. The graphing
option allows a frequency distribution or cumulative
distribution to be displayed in a histogram as described
in the relevant section of the EASIGRAF documentation
(although for some variables the graphing option of the
CHISQ command may be more suitable).
Select command - BASICS C15
Select command - BAS C15 IF ROW<=50
Select command - b ranks c19
Example output:
A - C19
Column total=269.0 Number of items=100
Mean=2.690
Minimum=0.0 Maximum=7.0
Mode=0.0 Median=3.0
Variance=5.314 Population variance=5.368
Standard deviation=2.305 Population standard deviation=2.317
Standard error of mean=0.231
No. Rank % Cum% Value
30 15.5 30.0 30.0 0.000000
8 34.5 8.0 38.0 1.000000
9 43.0 9.0 47.0 2.000000
15 55.0 15.0 62.0 3.000000
14 69.5 14.0 76.0 4.000000
7 80.0 7.0 83.0 5.000000
12 89.5 12.0 95.0 6.000000
5 98.0 5.0 100.0 7.000000
Comments
Further detail about the syntax of the BASICS command is
provided in the general section on command syntax.
If confidence limits have been requested (using the
LIMITS command) then upper and lower confidence limits
for the population mean will also be output.
If there is more than one mode then only the lowest will
be output. If you suspect there may be more than one then
you can either look at the frequency table or can issue
the BASICS command again together with a condition which
excludes the first mode found, e.g. if a mode of 1.5 is
reported:
Select command - b c19 if c19!=1.5
The standard deviation given is the actual standard
deviation of the sample, i.e. the root of the sum of the
squares of the difference between each value and the mean
divided by the number of items. The standard deviation of
the population is obtained by dividing the same sum of
squares by the number of items minus one. It is almost
always this latter figure which should be quoted as "the
standard deviation". It represents an attempt to estimate
what is the standard deviation of the measure in the
whole population from which the sample was drawn, and
seeks to correct for the effects of the limited sample
size. Equivalent remarks apply to the values quoted for
variance and population variance - generally the latter
should be used.
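The two divisors can be checked with Python's standard statistics module (shown purely as an illustration; note that Python's naming runs the other way round from this manual's: statistics.pstdev divides by the number of items, which the manual calls "the standard deviation of the sample", while statistics.stdev divides by the number of items minus one, the manual's "population standard deviation"):

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # example values
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

sd_sample = math.sqrt(ss / n)            # divide by n
sd_population = math.sqrt(ss / (n - 1))  # divide by n - 1

# The statistics module computes the same two quantities:
assert math.isclose(sd_sample, statistics.pstdev(data))
assert math.isclose(sd_population, statistics.stdev(data))
```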
Performing the BASICS command sets the values of the
special variables XMEAN, XTOTAL and XNUMBER, in addition
to the variables VV1, VV2, etc.
6:2. Chisq
Format: C[hisq] [f] [n] [g[raphfile]] [cols rows]
This command sets up a contingency table and performs a
chi-squared test on that table to determine the extent to
which the values in the table depart from those expected
if there is no tendency for categories to be associated
with each other.
There are three options. The F option means that,
provided a two-by-two table is used, Fisher's exact test
will always be performed, regardless of the values in the
table.
EASISTAT automatically performs Fisher's exact test if
there is a total of less than 20 items in the table or if
the expected value for any cell is less than 5, but
specifying F will cause EASISTAT to perform Fisher's test
even if these conditions are not met. The N option means
that instead of composing the contingency table by
applying conditions to the data in EASISTAT's data table,
the user can enter by hand the values that he or she
wants to appear in each cell of the table. Using the
graphing option allows frequency histograms to be
displayed from the contingency table (see the relevant
section in the EASIGRAF documentation). Any, all, or none
of the options may be used at once.
The user must supply the number of columns and rows for
the table. These can optionally be supplied on the
command line, otherwise EASISTAT will request them to be
entered. The user must also supply the conditions to be
used to categorise the values into these rows and
columns, or alternatively (when the N option is used)
must enter the numbers for each cell of the contingency
table.
Example:
Select command - CHISQ
- Chi-squared test -
Input number of columns: 2
Input number of rows: 2
Enter condition for column 1: C15<12
Enter condition for column 2: C15>=12
Enter condition for row A: SEX=1
Enter condition for row B: SEX=2
Output:
Column 1: C15<12
Column 2: C15>=12
Row A: SEX=1
Row B: SEX=2
1 2
A 31.0 31.0% (32.12) 42.0 42.0% (40.88) 73.0 73.0%
B 13.0 13.0% (11.88) 14.0 14.0% (15.12) 27.0 27.0%
44.0 44.0% 56.0 56.0% 100.0
Chi-squared = 0.258, 1 df p = 0.6113
Using Yates' correction: Chi-squared = 0.079, 1 df p = 0.7785
In this example the first column consists of the number
of data rows for which C15 is less than 12, and the
second column the number of rows for which it is greater
than or equal to 12. (In the example data set in the file
EXAMPLE.DAT, C15 contains the GHQ scores.) The rows of
the contingency table contain a count of the number of
rows of the data table for which the value in the column
titled SEX (column 3 in the example data set) is equal to
1 or to 2. The contingency table output shows the
observed number of values falling into each category
followed by the observed percentage and then in brackets
by the expected number for each category. Since this
example was performed with 100 valid data rows, the
observed numbers and percentages are in fact equal. Row
and column totals and percentages are also output.
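The arithmetic behind this table can be sketched in Python (an illustration only, not EASISTAT's own code). Expected counts are row total times column total over the grand total, and the statistics reproduce the figures in the output above:

```python
# Observed 2x2 table from the example:
# rows SEX=1 / SEX=2, columns C15<12 / C15>=12
obs = [[31.0, 42.0],
       [13.0, 14.0]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(r[j] for r in obs) for j in range(2)]
grand = sum(row_tot)

# Expected count for each cell: row total * column total / grand total
exp = [[row_tot[i] * col_tot[j] / grand for j in range(2)]
       for i in range(2)]

chi2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
           for i in range(2) for j in range(2))
yates = sum((abs(obs[i][j] - exp[i][j]) - 0.5) ** 2 / exp[i][j]
            for i in range(2) for j in range(2))

print(round(chi2, 3))   # 0.258, as in the output above
print(round(yates, 3))  # 0.079 with Yates' correction
```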
Example:
Select command - c n f
- Chi-squared test -
Input number of columns: 2
Input number of rows: 2
Enter 2 values for row A (all on one line): 13 7
Enter 2 values for row B (all on one line): 10 6
Output:
1 2
A 13 (12.78) 36.1% 7 ( 7.22) 19.4% 20 55.6%
B 10 (10.22) 27.8% 6 ( 5.78) 16.7% 16 44.4%
23 63.9% 13 36.1% 36
Chi-squared = 0.024, 1 df p = 0.8767
Using Yates' correction: Chi-squared = 0.038, 1 df p = 0.8462
Fisher's exact test, p = 0.5752
When the N option is used the values to go into the table
are entered directly by the user rather than being
counted from the data set. In the example above the user
enters the values 13, 7, 10 and 6 for a two-by-two table.
Since the F option was also specified, Fisher's exact
test is also performed.
Example:
Select command - CH 2 3
- Chi-squared test -
Enter condition for column 1: SEX=1
Enter condition for column 2: SEX=2
Enter condition for row A: CLASS=1
Enter condition for row B: CLASS=2
Enter condition for row C: CLASS>2
Output:
1 2
A 41 (40.88) 41.0% 15 (15.12) 15.0% 56 56.0%
B 24 (25.55) 24.0% 11 ( 9.45) 11.0% 35 35.0%
C 8 ( 6.57) 8.0% 1 ( 2.43) 1.0% 9 9.0%
73 73.0% 27 27.0% 100
Chi-squared = 2.011, 2 df p = 0.3659
Using Yates' correction: Chi-squared = 0.537, 2 df p = 0.7644
Comments
The CHISQ command outputs the observed value in each
cell, the expected value in brackets, the column and row
totals and the percentages each figure represents with
respect to the total number of items. This means that the
command can be used simply to provide a frequency
distribution of the numbers and percentages of certain
observations falling within certain criteria, by setting
up a table with only one column. Here's how we can see
the numbers and percentages falling within different
ranges of GHQ score:
Select command - c
- Chi-squared test -
Input number of columns: 1
Input number of rows: 5
Enter condition for column 1: 1
Enter condition for row A: GHQ<=25
Enter condition for row B: GHQ>25&GHQ<=35
Enter condition for row C: GHQ>35&GHQ<=45
Enter condition for row D: GHQ>45&GHQ<=55
Enter condition for row E: GHQ>55
Output:
1
A 31 (31.00) 35.6% 31 35.6%
B 35 (35.00) 40.2% 35 40.2%
C 11 (11.00) 12.6% 11 12.6%
D 7 ( 7.00) 8.0% 7 8.0%
E 3 ( 3.00) 3.4% 3 3.4%
87 100.0% 87
The expected values and row totals are still calculated,
though obviously they are the same as the observed
values. The use of just one column in this way can be
particularly useful when preparing graphs, especially of
continuous variables. A frequency histogram can be
generated using values gathered into a small number of
groups. If the G option had been specified with the above
example then a histogram with five bars would be
generated representing the five ranges.
6:3. Wilcoxon
Format: W[ilcoxon] [s[igned]] [column]
This command performs Wilcoxon's rank sum test to compare
the values in two groups to say whether the values are
generally higher in one group than those in the other. It
is exactly equivalent to two other commonly used
nonparametric tests, the Mann-Whitney U test and
Kendall's S, so only one of these tests is provided.
Alternatively Wilcoxon's signed rank sum test can be
applied to test whether the values in a column are
generally greater than or less than 0. It would generally
be used to compare two columns in a pairwise manner.
Example:
Select command - w c15
Enter condition for first group: SEX=1
Enter condition for second group: SEX=2
Output:
Wilcoxon's comparison of two groups:
Number (%) Sum of ranks Mean Group
73 (73.0%) T0 3804.0 3686.5 SEX=1
27 (27.0%) T1 1246.0 1363.5 SEX=2
Variance: 16554.807 (Sum-mean)/sd: 0.913
One-tailed p = 0.1805
The rank sum is taken as approximating to a normal
distribution with mean and standard deviation derived as
described by Armitage and Berry. The probability value
given is the one-tailed probability of the rank sum
reaching a value of such magnitude in the given direction
assuming this normal distribution. For low numbers in
each group the user may prefer to refer to a set of
tables quoting the exact probability value for the rank
sum dependent on the numbers in each group.
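The ranking step common to all these tests can be sketched as follows (an illustration in Python, not EASISTAT's code; the data are invented). Tied values receive the mean of the ranks they span, and the rank sum for one group is then just the total of its ranks:

```python
def ranks(values):
    # Assign ranks 1..n, giving tied values the mean of the
    # ranks they would otherwise occupy
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j + 2) / 2.0  # ranks are 1-based
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

group1 = [1.0, 3.0, 3.0, 5.0]
group2 = [2.0, 3.0, 6.0]
r = ranks(group1 + group2)
t1 = sum(r[:len(group1)])  # rank sum for the first group
print(t1)                  # 15.0
```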
If the SIGNED option (which can be abbreviated down to S)
is specified then Wilcoxon's signed rank sum test is used
to determine whether the values in a column are
significantly less than or greater than zero. The usual
application of this would be to first make one column the
difference between two others using the DERIVE command,
and then to perform the signed rank test on it. This
would then be a pairwise test of whether the values in
one column were higher than those in the other.
Example:
Select command - new c3 diff
Select command - derive c3 c1-c2
Select command - w s c3
Output:
Wilcoxon's signed rank sum test using C3 (DIFF)
n' = 25 T+ = 24.0 T- = 301.0
Variance = 1338.5
Standardised normal deviate with continuity correction = 3.772
One-tailed p = 0.0001
Here the DERIVE command is first used to make the values
in c3 equal to the differences between the values in c1
and c2. Then the signed rank test is applied to c3 to
provide a pairwise test of whether the values in c1 are
significantly higher or lower than in c2. A one-tailed
probability for the difference to assume such a magnitude
in the given direction is output.
6:4. Kendall's rank correlation coefficient
Format: K[endall] [column column]
Investigates the relationship between two columns of
nonparametric data using Kendall's rank correlation
coefficient.
Examples:
Select command - KEND GHQ HDA
Select command - k
Enter two columns to compare (one on each line):
c15
c16
If the two columns are not included on the command line
EASISTAT will ask for them.
Output:
Rank correlation of C15 (GHQ) with C16 (HDA)
Kendall's S = P - Q = 3393 - 1164 = 2229
Kendall's tau (correlation coefficient) = 0.450
Variance of S = 111938.7, corrected normal deviate of S = 6.659
One-tailed p = 0.0000
The correlation coefficient is sometimes referred to as
Kendall's tau. Kendall's S is taken as approximating
to a normal distribution with mean and standard deviation
derived as described by Armitage and Berry. The
probability value given is the one-tailed probability of
Kendall's S of such magnitude assuming this normal
distribution. It is the one-tailed probability that such
a correlation could have occurred in the direction found
by chance.
Note: Some other statistics programs sometimes give a
slightly different value for the correlation coefficient.
This is because they take into account ties (two
rows having the same value) before they calculate the
correlation coefficient. The procedure used by EASISTAT
(as recommended by Armitage and Berry) is to take into
account ties only when calculating the significance of
the correlation coefficient. Thus some other programs may
give different values for Kendall's tau, but the eventual
p value calculated should be the same (unless the other
program makes a mistake - at least one gives the wrong
answer).
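The quantities S = P - Q and tau can be computed directly by counting concordant and discordant pairs, as in this Python sketch (an illustration only; the data here have no ties, so the tie question does not arise):

```python
def kendall_s_tau(x, y):
    # Count concordant (P) and discordant (Q) pairs of rows
    n = len(x)
    p = q = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                p += 1      # pair ordered the same way in both columns
            elif prod < 0:
                q += 1      # pair ordered oppositely
    s = p - q
    tau = s / (n * (n - 1) / 2)  # denominator: total number of pairs
    return s, tau

s, tau = kendall_s_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(s, round(tau, 3))  # S = 4, tau = 0.667
```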
6:5. Kolmogorov-Smirnov test
Format: Ko[lmogorov] [g[raphfile]] [column]
This compares a nonparametric measure between two groups
and tests whether the distribution of values between them
is significantly different.
Examples:
Select command - ko g c15
Enter condition for first group: c3=1
Enter condition for second group: c3=2
Select command - ko
Enter column to test: c15
Enter condition for first group: c5<4
Enter condition for second group: c5>=4
If the column supplying the variable is not included on
the command line EASISTAT will ask for it.
Output:
Kolmogorov-Smirnov comparison of two groups using C15 (GHQ)
C5<4
C5>=4
K-S statistic = 0.4302
p = 0.0002
Comments
This test is not described in Armitage and Berry, but it
is sometimes used and it should not be hard to find a
reference to it in a statistics textbook (our
implementation is from Numerical Recipes in C by Press et
al). The function of this test is similar to Wilcoxon's
rank sum test, except that it does not test whether the
values in one group are in general higher or lower than
those in the other, but only whether the distributions
differ. It might therefore be possible to detect that two
distributions with equal medians are significantly
different because of differences in range, skewness or
kurtosis.
What the test does is to compare the cumulative
percentage distributions of the two groups and to measure
the maximum separation between these two distributions (a
value between 0 and 1). The graphing option provides a
graphical representation of this and allows other
measures to be displayed as well, for example the
frequency distributions of the groups (see the relevant
section in the EASIGRAF documentation).
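The maximum-separation idea can be sketched in a few lines of Python (an illustration, not the Numerical Recipes implementation EASISTAT uses): evaluate both empirical cumulative distributions at every observed value and take the largest gap.

```python
def ks_statistic(a, b):
    # Maximum separation between the two empirical
    # cumulative distribution functions (a value in [0, 1])
    points = sorted(set(a) | set(b))
    d = 0.0
    for p in points:
        f_a = sum(1 for v in a if v <= p) / len(a)
        f_b = sum(1 for v in b if v <= p) / len(b)
        d = max(d, abs(f_a - f_b))
    return d

print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```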
6:6. Ttest
Format: Tt[est] [column]
Format: Tt[est] [p[aired] [column column]]
Student's t test is used to determine whether the values
in one group are significantly higher than those in
another group. If the option PAIRED (which may be
abbreviated down to P) follows the command then a paired
t test will be performed, otherwise an unpaired test. For
the unpaired test the values all lie in one column, and
two logical expressions must be entered to specify the
conditions which define the two groups. For the paired
test two columns are compared and the measures in each
row are taken to be paired.
Example:
Select command - tt
Enter column to test: GHQ
Enter condition for first group: SEX=1
Enter condition for second group: SEX=2
Output:
Studying C15 (GHQ)
Mean 18.84 if SEX=1
Mean 15.30 if SEX=2
Unpaired t test, 98 degrees of freedom: t = 0.927
Two-tailed p = 0.3560 (Assumes equal variances)
Comparison of means, standardised normal deviate: d = 0.977
Two-tailed p = 0.3283 (Does not assume equal variances)
Two tests are performed. One assumes that although the
means may differ between the groups, the variances do
not. This compares the means of the groups according to a
t statistic and outputs a two-tailed probability value
for the difference between the two means to be as large
as it is by chance. The second test does not make the
assumption of equal variances. It takes the difference
between the means to approximate a normal distribution
and quotes the two-tailed probability for the difference
to be as large as it is.
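The first of the two tests, the pooled-variance t statistic, can be sketched in Python (illustrative only, with invented data; the p value lookup against the t distribution is omitted):

```python
import math

def unpaired_t(a, b):
    # Pooled-variance t statistic: assumes the two groups
    # share a common variance
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    t = (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2  # t statistic and degrees of freedom

t, df = unpaired_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(round(t, 3), df)  # -2.0 with 8 degrees of freedom
```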
If the PAIRED option is selected then the values in two
columns are compared in a pairwise fashion, to see if the
mean difference between pairs is significantly less than
or greater than zero.
Example:
Select command - TTEST P C16 C17
Output:
Comparing C16 (HDA) and C17 (HDD)
Paired t test, 99 degrees of freedom: t = 0.068
Two-tailed p = 0.9462
Each row contains a pair of values, one in each of the
columns specified. The mean of the differences between
these values, divided by its standard error, forms the t
statistic, and the probability value quoted is the two-
tailed value for a difference as large as this to occur
by chance.
If confidence limits have been requested (using the
LIMITS command) then upper and lower confidence limits
for the true mean difference between the groups or pairs
will also be output.
6:7. Linear regression and correlation coefficient
Format: R[egress] [g[raphfile]] [column column [column]]
Calculates the correlation coefficient between two
measures in different columns, and calculates the linear
regression line ("least squares fit") for the second
column on the first.
Examples:
Select command - REG C15 C16
Select command - r
Enter two columns to compare (one on each line):
c15
c16
Output:
Linear regression using C15 (GHQ) and C16 (HDA)
Regression of c16 on c15: C16 = 4.788 + 0.208 * C15
Correlation coefficient r = 0.725
SE(b)= 0.020 Significance: t = 10.416, 98 df p = 0.0000
The correlation coefficient, r, is output (this is
sometimes referred to as Pearson's correlation
coefficient). The standard error of the gradient of the
line SE(b) is output, and this can be taken to be
distributed as a t statistic allowing the calculation of
confidence limits. It is also used to calculate the
significance of the results - the probability value
quoted is a two-tailed value for a correlation of such
magnitude to occur by chance.
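The least-squares calculation itself can be sketched in Python (an illustration with invented data, not EASISTAT's code; the SE(b) and significance steps are omitted):

```python
import math

def linreg(x, y):
    # Least-squares fit of y on x: intercept a, gradient b,
    # and Pearson's correlation coefficient r
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    a = my - b * mx
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

a, b, r = linreg([1, 2, 3, 4], [2, 3, 5, 6])
print(round(a, 2), round(b, 2), round(r, 2))  # 0.5 1.4 0.99
```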
If confidence limits have been requested (using the
LIMITS command) then upper and lower confidence limits
for the true correlation coefficient and gradient will
also be output. The graphing option plots one variable
against the other and allows the regression lines to be
displayed (see the relevant section in the EASIGRAF
documentation).
If a third column name is given then it will be filled
with the values which would be predicted from the
regression equation with the coefficients arrived at.
These are the values which the dependent variable would
take if it was completely determined by the independent
variable according to the regression equation.
Example:
Select command - REG C15 C16 C17
This gives just the same result as entering:
Select command - REG C15 C16
Select command - DERIVE C17 4.788 + 0.208 * C15
The linear regression equation is automatically applied
to column 15 and the results entered into column 17.
6:8. Anova
Format: A[nova] [N or g[raphfile]] [column]
The one-way analysis of variance is equivalent to an
unpaired t test except that the comparison is performed
between more than two groups. It measures whether there
is a tendency for the groups of values to have different
means, or whether they might all be drawn from the same
population. The values lie in one column and the groups
are defined by conditions. If the option N (for
nonparametric) is chosen then the Kruskal-Wallis one-way
analysis of variance by ranks test is performed instead.
Example:
Select command - A
- One-way analysis of variance -
Enter column for dependent variable
c15
Input number of groups: 4
Enter condition for group A: c5=1
Enter condition for group B: c5=2
Enter condition for group C: c5=3
Enter condition for group D: c5>3
Output:
One-way analysis of variance with C15 (GHQ) as dependent variable
Group A: C5=1
Group B: C5=2
Group C: C5=3
Group D: C5>3
Between pairs of groups: t tests (96 df)
A B C
t p t p t p
B -1.111 0.2692
C -0.400 0.6897 1.456 0.1487
D -1.940 0.0553 -1.353 0.1793 -4.231 0.0001
Overall significance: F = 6.463 3,96 df, p = 0.0005
The analysis of variance outputs an F ratio which gives
the overall significance representing the probability
that all the group means could have varied so much by
chance. It also computes a t statistic and two-tailed
probability value for the difference between the means
for each pair of groups. This latter differs from
performing an ordinary unpaired t test between the two
groups only in that the whole sample is used to provide
an estimate of the overall variance of the measure,
rather than only relying on the values in the pair of
groups under consideration.
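The F ratio is the between-groups mean square over the within-groups mean square, as in this Python sketch (an illustration with invented data; the p value lookup is omitted):

```python
def anova_f(groups):
    # One-way analysis of variance:
    # F = between-group mean square / within-group mean square
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                    for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

f, df1, df2 = anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(f, 2), df1, df2)  # 27.0 with 2 and 6 df
```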
The graphing option plots the values from each group in a
vertical scatter plot and allows the group means to be
displayed (see the relevant section in the EASIGRAF
documentation).
Example:
Select command - A N
- One-way analysis of variance -
Enter column for dependent variable
c15
Input number of groups: 5
Enter condition for group A: c5=1
Enter condition for group B: c5=2
Enter condition for group C: c5=3
Enter condition for group D: c5=4
Enter condition for group E: c5=5
Output:
Kruskal-Wallis test with C15 (GHQ) as dependent variable
Group A: Number = 3 Mean rank = 35.00 C5=1
Group B: Number = 12 Mean rank = 52.88 C5=2
Group C: Number = 41 Mean rank = 38.71 C5=3
Group D: Number = 25 Mean rank = 59.66 C5=4
Group E: Number = 19 Mean rank = 64.84 C5=5
Between pairs of groups comparisons of mean ranks
(Two-tailed, corrected for multiple comparisons)
A B C D
Ru-Rv p Ru-Rv p Ru-Rv p Ru-Rv p
B 17.88 NS
C 3.71 NS -14.17 NS
D 24.66 NS 6.78 NS 20.95 0.0887
E 29.84 NS 11.97 NS 26.13 0.0235 5.18 NS
Overall significance: KW (corrected for ties) = 14.880, 4 df
p = 0.0050
For the Kruskal-Wallis test the column and groups are
selected in the same way as for the parametric analysis
of variance, and the output reports the overall
differences between the group ranks and between pairs of
groups comparisons as described in Nonparametric
Statistics for the Behavioural Sciences by Siegel.
6:9. Multiple regression
Format: M[ultiple] [column [column]]
This test measures how well one dependent variable (in
one column) is predicted by a number of independent
variables in other columns.
Example:
Select command - m
- Multiple linear regression -
Enter column for dependent variable
c27
Input number of independent variables: 4
Input 4 columns (one on each line):
c19
c20
c21
c22
Output:
Multiple linear regression with C27 (SEV) as dependent
variable
Regression equation:
C27 = 0.146
+ 0.093 * C19 SE(b) = 0.055
+ 0.154 * C20 SE(b) = 0.075
+ 0.029 * C21 SE(b) = 0.079
+ 0.140 * C22 SE(b) = 0.065
Variance ratio F = (73.844/4)/0.717 = 25.747 df = 4,95
p = 0.0000
Multiple correlation coefficient R = 0.721
Significance of each measure (95 degrees of freedom):
C19: t = 1.694 p = 0.0935
C20: t = 2.050 p = 0.0431
C21: t = 0.370 p = 0.7122
C22: t = 2.156 p = 0.0336
This test outputs a multiple correlation coefficient and
the best-fitting linear regression equation using all the
independent variables. The coefficients for each variable
are given and their standard errors. These are used to
produce a t statistic and two-tailed significance for the
independent correlation of each variable with the
dependent variable. Note that this will vary according to
which other variables are included in the analysis. An
overall two-tailed probability derived from an F ratio of
variances is also given, representing the probability of
such a large multiple correlation coefficient occurring
by chance.
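The coefficients themselves come from solving the normal equations, which can be sketched in Python (an illustration with invented data constructed so that the dependent variable is exactly 1 + 2*x1 + 3*x2; the standard errors, F ratio and t statistics are omitted):

```python
def solve(a, b):
    # Gauss-Jordan elimination with partial pivoting
    # for a small linear system a * x = b
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [m[r][k] - factor * m[col][k] for k in range(n + 1)]
    return [m[i][n] / m[i][i] for i in range(n)]

def multiple_regression(xs, y):
    # Least-squares coefficients b0, b1, ... from the
    # normal equations X'X b = X'y
    rows = [[1.0] + list(r) for r in xs]  # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

xs = [(0, 0), (1, 0), (2, 1), (0, 1), (1, 2)]
y = [1, 3, 8, 4, 9]             # exactly 1 + 2*x1 + 3*x2
coefs = multiple_regression(xs, y)
print([round(c, 6) for c in coefs])  # [1.0, 2.0, 3.0]
```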
If a second column name is given after the first, then it
will be filled with the values which would be predicted
from the regression equation with the coefficients
arrived at. These are the values which the dependent
variable would take if it was completely determined by
the independent variables according to the regression
equation.
Example:
Select command - m c27 c28
Input number of independent variables: 4
Input 4 columns (one on each line):
c19
c20
c21
c22
In this case column 28 will be filled with the predicted values.
6:10. Principal component analysis
Format: Co[mponent] [g[raphfile]] [number of variables]
Performs principal component analysis on a number of variables.
Example:
Select command - co
- Principal component analysis -
Enter number of columns to analyse: 4
Input 4 columns (one on each line):
c19
c20
c21
c22
Input lower limit of contribution to variance to include
component into main table (0 for all, 1 for none):
0.05
Output:
Including largest 4 components into table
Principal component analysis
Contribution to overall variance:
Co1 Co2 Co3 Co4
0.7284 0.1586 0.0603 0.0528
Correlations between components and variables:
Co1 Co2 Co3 Co4
C19 A -0.6979 0.6932 -0.1751 -0.0421
C20 B -0.9193 -0.0405 0.3085 -0.2409
C21 C -0.9095 -0.0443 0.1250 0.3941
C22 D -0.8617 -0.4021 -0.3001 -0.0758
Principal components are derived (there is no facility to
rotate them). The contribution of each to the overall
variance is output, as is the correlation matrix between
them and the original variables.
All components contributing more than a certain fraction
of the overall variance are incorporated into the main
data table as new columns at the right-hand edge of the
table. They are titled Co1, Co2, etc. If the critical
fraction requested is 0 then all of the components will
be so incorporated, if it is 1 then none of them will be.
The original variables are not normalised before the
analysis (i.e. they are not altered to have unit
variance). The user may do this beforehand if he or she
wishes; otherwise variables with a large variance will
make a proportionately large contribution to the
analysis.
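Such a normalisation can be sketched in Python (an illustration only; here the n-divisor standard deviation is used, which is one possible choice of scaling, and the data values are invented):

```python
import statistics

def standardise(column):
    # Rescale a column to zero mean and unit variance,
    # so each variable contributes comparably to the analysis
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)  # n-divisor standard deviation
    return [(v - mean) / sd for v in column]

z = standardise([2.0, 4.0, 6.0, 8.0])
print(statistics.mean(z), statistics.pstdev(z))  # mean ~0, sd ~1
```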
The graphing option has nothing to do with principal
components analysis and is just a way of selecting
multiple columns to be output to a graph file so that
they can subsequently be plotted against each other (see
the relevant section in the EASIGRAF documentation).
6:11. Minimise
Format: Mi[nimise] [expression]
This is a general purpose minimisation function which
allows, for example, non-linear optimisations to be
performed. To use it you enter an arithmetic
expression which includes at least one of the general
purpose variables (V1, V2, etc.) and then the names of
those variables within the expression which are to be
altered to minimise the value of the expression over all
the data rows. Usually the aim will be to find the best
fit of the expression to the values in one column and in
this case the expression will automatically be converted
into the expression for the least-squares fit to that
column. If the original function is to be minimised
instead, then enter NONE (which can be abbreviated down
to N) instead of a column name.
For example, to use the MINIMISE command to perform
multiple linear regression with column HDD as the
dependent variable and columns A, B, C, and D as
independent variables:
Select command - min V1 + V2*A + V3*B + V4*C + V5*D
Enter column to fit to or NONE to minimise function, and
optional second column for best predicted fit: HDD
Enter list of variables to iterate (all on one line):
V1 V2 V3 V4 V5
Output:
Sigma: ((V1+V2*A+V3*B+V4*C+V5*D) - HDD)POW2
- function minimised after 9 iterations.
Final value: 866.398
v1 = 1.82441
v2 = 0.346348
v3 = 0.265328
v4 = 0.678139
v5 = 0.24945
The output will show you that the following function is
in fact the one which is minimised:
Sigma: ((V1 + V2*A + V3*B + V4*C + V5*D)-HDD)pow2
This is the function for the least sum of squares
difference between the supplied function and the column
to fit to. The final value of this sum of squares is
output, together with the best-fitting values for the
variables which have been altered from their starting
values. In this case the variables are the coefficients
of the linear regression equation.
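The same fit can be reproduced outside the program by
solving the least-squares problem directly. A sketch in
Python with invented columns and coefficients standing in
for A to D and HDD (this is an illustration, not part of
EASISTAT):

```python
import numpy as np

# Invented columns A, B, C, D and a dependent column built from
# known coefficients, standing in for the HDD example
rng = np.random.default_rng(0)
cols = rng.random((100, 4))
true_coef = np.array([0.35, 0.27, 0.68, 0.25])
hdd = 1.8 + cols @ true_coef

# Design matrix: a column of ones plays the role of the intercept V1
X = np.column_stack([np.ones(len(cols)), cols])
beta, *_ = np.linalg.lstsq(X, hdd, rcond=None)
# beta[0] is the intercept (V1); beta[1:] are V2..V5
```

With noise-free data the fitted coefficients recover the
values used to generate the column.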
General minimisation is slower, less accurate and less
informative than the linear commands supplied (i.e. the
simple and multiple linear regression commands), so it is
best to convert your function to a linear form whenever
possible (which it often is). It is up to you to make
sure that the function has a minimum, and to set
appropriate starting values for the variables so that the
global minimum is found if there is more than one local
minimum.
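For example, a function of the form y = a*exp(b*x) is
non-linear in a and b, but taking logarithms gives
ln y = ln a + b*x, which an ordinary linear fit can
handle. A sketch with invented noise-free data (assuming
y is strictly positive):

```python
import numpy as np

# Invented data from y = 3 * exp(0.7 * x)
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 5.0, 100)
y = 3.0 * np.exp(0.7 * x)

# Fit ln(y) = ln(a) + b*x with a linear least-squares fit;
# polyfit returns the slope first, then the intercept
b, ln_a = np.polyfit(x, np.log(y), 1)
a = np.exp(ln_a)
```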
If no column is to be fitted to then the function itself
is minimised, for example:
Select command - min 6*V1pow2+4*V1-123
Enter column to fit to or NONE to minimise function, and
optional second column for best predicted fit: NONE
Enter list of variables to iterate (all on one line):
V1
Sigma: 6*V1POW2+4*V1-123
- function minimised after 1 iterations.
Final value: -12366.7
V1 = -0.33458
This finds the value of V1 for which the supplied
quadratic equation has a minimum, which in this case is
-0.335. However note that even when the expression being
minimised contains no column references, it is still
evaluated once for every data row and the function value
is the total over all the rows (this example was run with
100 data rows, so the final value is 100 times what would
be expected). This makes sense, because one might want to
minimise a function such as 6*V1POW2+4*V1-C2, but it
means that the minimisation will be unnecessarily slow if
no values from the data table are actually needed. In
such a case all the data rows can be temporarily excluded
by issuing the command:
Select command - NARROW 0
The condition is always false so this makes all the data
rows invalid. When there are no valid data rows the
expression is calculated just once, rather than once for
every data row.
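The behaviour of this quadratic example can be imitated
with a simple shrinking-step search. A Python sketch (a
toy illustration, not EASISTAT's actual algorithm),
assuming the same 100 data rows as the example:

```python
def f(v1):
    # The per-row expression from the example above
    return 6 * v1 ** 2 + 4 * v1 - 123

def minimise(f, x, step=0.5, tol=1e-8):
    # Move x up or down while that reduces f; when neither direction
    # helps, halve the step until it falls below tol
    fx = f(x)
    while step > tol:
        moved = False
        for trial in (x + step, x - step):
            if f(trial) < fx:
                x, fx, moved = trial, f(trial), True
        if not moved:
            step /= 2
    return x, fx

v1, fmin = minimise(f, 0.0)
n_rows = 100           # the example was run with 100 data rows
total = n_rows * fmin  # MINIMISE reports the value summed over rows
```

The search settles at V1 = -1/3, and the reported total is
the per-row minimum multiplied by the number of rows, as
described above.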
If a second column name is given after the first, then it
will be filled with the values which would be predicted
from the function with the coefficients arrived at. These
are the values which the dependent variable would take if
the best-fitting function applied exactly.
Here is a final example where column 3 is fitted to a
non-linear function of columns 1 and 2. The results
predicted from the function are then written into column
4:
Select command - min V1*C1*exp(C2powV2)
Enter column to fit to or NONE to minimise function, and
optional second column for best predicted fit: c3 c4
Enter list of variables to iterate (all on one line):
v1 v2
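The iterative search that MINIMISE performs can be
imitated with a toy shrinking-step coordinate search. A
Python sketch with invented rows generated from known
values V1 = 2.0 and V2 = 1.5 (this illustrates the idea
only; it is not the program's actual algorithm):

```python
import math
import random

# Invented rows (C1, C2, C3), with C3 built from V1 = 2.0, V2 = 1.5
random.seed(0)
rows = []
for _ in range(100):
    c1 = random.uniform(1.0, 2.0)
    c2 = random.uniform(0.2, 0.8)
    rows.append((c1, c2, 2.0 * c1 * math.exp(c2 ** 1.5)))

def sse(v1, v2):
    # Sum over rows of ((V1*C1*exp(C2**V2)) - C3)**2
    return sum((v1 * c1 * math.exp(c2 ** v2) - c3) ** 2
               for c1, c2, c3 in rows)

def minimise(f, x, step=0.5, tol=1e-8):
    # Shrinking-step coordinate search: try moving each variable up or
    # down; when no move improves the sum, halve the step
    fx = f(*x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                trial = list(x)
                trial[i] += d
                if f(*trial) < fx:
                    x, fx, improved = trial, f(*trial), True
        if not improved:
            step /= 2
    return x, fx

(v1, v2), final = minimise(sse, [1.0, 1.0])
```

Because the data were generated with no noise, the sum of
squares at the true values is zero, and the search reduces
the sum well below its value at the starting point.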
Note
The iterative process stops when one step fails to reduce
the absolute value of the function by one ten-thousandth
of its current value. This should be appropriate for most
applications, particularly least-squares fitting as
opposed to general minimisation. It means that if the
function has a value of 2 then the last step reduces it
by less than 0.0002, but if a function of the same shape
has a value of 200000 then the last reduction may be as
large as 20. If you want higher accuracy you will have to
add a constant to the function which brings its absolute
value close to zero (in the latter example one would add
-200000) and start the minimisation process again. Note
that if the function is being evaluated over a number of
data rows then the constant to add must first be divided
by the number of rows.
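The figures quoted above follow from simple arithmetic; a
brief sketch (the one-in-ten-thousand factor and the 100
rows are taken from the examples in this section):

```python
rel_tol = 1e-4  # one ten-thousandth, as described above

# Smallest reduction still treated as progress at two function values
small = 2.0 * rel_tol        # about 0.0002
large = 200000.0 * rel_tol   # about 20

# Over 100 data rows, subtracting 200000 from the reported total
# corresponds to adding -200000 / 100 to the per-row function
per_row_constant = -200000.0 / 100
```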