Chapter 6: STATISTICAL COMMANDS
In this chapter the statistical commands available are
described. This manual does not seek to replace a
statistics textbook, so only minimal guidance will be
given as to which tests are appropriate for which data.
The field is complex and controversial, and a user who is
not sure which test to use should consult a textbook or
professional statistician for guidance.
Broadly speaking the tests may be divided according to
whether they deal with parametric, nonparametric or
categorical data. Data which is parametric should be
continuous rather than discrete, and ideally should
follow a normal distribution though different tests are
more or less robust to departures from normality. It
should be like the markings on a ruler in that the
distance between each pair of consecutive numbers is
always equal, with the proviso that to satisfy the
requirement that the data is continuous the "marks"
should be close together. Nonparametric data need not be
so distributed, but the values must be ordinal in the
sense that it is always possible to say that one value is
greater than another. All the nonparametric tests
supplied (Wilcoxon's rank sum, Wilcoxon's signed rank,
Kendall's rank correlation coefficient, the Kruskal-
Wallis test and the Kolmogorov-Smirnov test) work by
first assigning ranks to the values and then comparing
the ranks rather than the values themselves. Categorical
information lacks even this quality of being ordered: one
can say only that one value is different from another,
not that it is greater or less.
Parametric data might include height, weight, blood
pressure or temperature. It is often acceptable to treat
age in years as parametric provided that the total age
range is reasonably large, since it can then approximate
a continuous distribution.
Nonparametric data would include age if it were broken
down by decades, an assessment scale with only five
points, social class, rank score on a measure, number of
children, etc.
Categorical data might include gender, marital status,
ethnic origin, etc.
It is always possible to treat parametric data as if it
were nonparametric, and any data may be treated as
categorical. However, if the data is distributed such
that a parametric test is feasible, this should be used
in preference to a nonparametric one, since the
parametric test will have more power, i.e. the
nonparametric test might produce a spuriously negative
result. If, on the other hand, the data is nonparametric
then a nonparametric test should be used, since otherwise
spuriously positive or negative results may be produced.
Unless there are good reasons to use cut-off points to
divide ordinal data into categories, categorical tests
should not be used on ordinal data, because power will be
lost and spuriously negative results can occur.
50
Statistical commands
No specific test of normality is provided, and the user's
understanding of the nature of the quantity which the
data measures is crucial. However examining the frequency
distribution, skewness and kurtosis may be helpful, and
also note should be taken of how closely together lie the
mean, median and mode. If they are far apart then the
data must be skewed. Sometimes data which is quite non-
normally distributed can be converted to data that more
closely follows a normal distribution by applying a
mathematical transformation. One of these is simply to
take the log of the value. Other suggestions are
described in textbooks.
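As an illustration (not part of EASISTAT), the following Python sketch shows how taking logs can reduce skewness. The skewness coefficient here is the mean cubed deviation divided by the cube of the standard deviation; the data values are invented for the example:

```python
import math

def skewness(xs):
    # Coefficient of skewness: mean cubed deviation / sd**3
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 3 for x in xs) / n / sd ** 3

raw = [1, 2, 4, 8, 16, 32, 64]       # strongly right-skewed
logged = [math.log(x) for x in raw]  # symmetric after the transform

print(skewness(raw))     # clearly positive
print(skewness(logged))  # very close to zero
```

A positive skewness for the raw values and a near-zero skewness for the logged values indicates that the transform has brought the data closer to symmetry.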
The chi-squared test compares data divided into
categories in two different ways. The Wilcoxon rank sum
test compares nonparametric data between two groups
defined categorically, as does the Kolmogorov-Smirnov
test. The Kruskal-Wallis one-way analysis of variance
does the same for more than two groups. The Wilcoxon
signed rank sum test can be used to compare pairs of
measures in two different columns. Kendall's rank
correlation coefficient compares the relationship of two
nonparametric measures. Student's t test compares a
parametric measure in two groups defined categorically,
and the analysis of variance does the same thing with
more than two groups. The standard (Pearson's)
correlation coefficient with linear regression compares
data from two parametric measures. Multiple linear
regression compares data from one parametric measure with
data from several other parametric measures (though for
some purposes this requirement may be relaxed, for
example in discriminant analysis). Principal components
analysis analyses data from several measures, which are
all taken to be parametric. Finally a general purpose
minimisation routine is provided, which can be used to
perform non-linear regression and other optimisation
problems.
6:1. Basics
Format: B[asics] [r[anks]] [g[raphfile]] column [if
condition]
Outputs basic information about the data in a column -
the total of the values in the column and the number of
items, and the mean, mode, median, minimum, maximum,
variance, standard deviation, standard error of the mean,
skewness and kurtosis. Optionally a frequency and rank
table of the values may also be produced. The graphing
option allows a frequency distribution or cumulative
distribution to be displayed in a histogram as described
in the relevant section of the EASIGRAF documentation
(although for some variables the graphing option of the
CHISQ command may be more suitable).
Select command - BASICS C15
Select command - BAS C15 IF ROW<=50
Select command - b ranks c19
Example output:
A - C19
Column total=269.0 Number of items=100
Mean=2.690
Minimum=0.0 Maximum=7.0
Mode=0.0 Median=3.0
Variance=5.314 Population variance=5.368
Standard deviation=2.305 Population standard deviation=2.317
Standard error of mean=0.231
No. Rank % Cum% Value
30 15.5 30.0 30.0 0.000000
8 34.5 8.0 38.0 1.000000
9 43.0 9.0 47.0 2.000000
15 55.0 15.0 62.0 3.000000
14 69.5 14.0 76.0 4.000000
7 80.0 7.0 83.0 5.000000
12 89.5 12.0 95.0 6.000000
5 98.0 5.0 100.0 7.000000
Comments
Further detail about the syntax of the BASICS command is
provided in the general section on command syntax.
If confidence limits have been requested (using the
LIMITS command) then upper and lower confidence limits
for the population mean will also be output.
If there is more than one mode then only the lowest will
be output. If you suspect there may be more than one then
you can either look at the frequency table or can issue
the BASICS command again together with a condition which
excludes the first mode found, e.g. if a mode of 1.5 is
reported:
Select command - b c19 if c19!=1.5
The standard deviation given is the actual standard
deviation of the sample, i.e. the root of the sum of the
squares of the difference between each value and the mean
divided by the number of items. The standard deviation of
the population is obtained by dividing the same sum of
squares by the number of items minus one. It is almost
always this latter figure which should be quoted as "the
standard deviation". It represents an attempt to estimate
what is the standard deviation of the measure in the
whole population from which the sample was drawn, and
seeks to correct for the effects of the limited sample
size. Equivalent remarks apply to the values quoted for
variance and population variance - generally the latter
should be used.
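The two divisors can be checked with Python's standard statistics module (shown purely as an illustration; note that Python's naming runs the other way round from this manual's: statistics.pstdev divides by the number of items, which the manual calls "the standard deviation of the sample", while statistics.stdev divides by the number of items minus one, the manual's "population standard deviation"):

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # example values
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

sd_sample = math.sqrt(ss / n)            # divide by n
sd_population = math.sqrt(ss / (n - 1))  # divide by n - 1

# The statistics module computes the same two quantities:
assert math.isclose(sd_sample, statistics.pstdev(data))
assert math.isclose(sd_population, statistics.stdev(data))
```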
Performing the BASICS command sets the values of the
special variables XMEAN, XTOTAL and XNUMBER, in addition
to the variables VV1, VV2, etc.
6:2. Chisq
Format: C[hisq] [f] [n] [g[raphfile]] [cols rows]
This command sets up a contingency table and performs a
chi-squared test on that table to determine the extent to
which the values in the table depart from those expected
if there is no tendency for categories to be associated
with each other.
There are three options. The F option means that,
provided a two-by-two table is used, Fisher's exact test
will always be performed, regardless of the values in the
table.
EASISTAT automatically performs Fisher's exact test if
there is a total of less than 20 items in the table or if
the expected value for any cell is less than 5, but
specifying F will cause EASISTAT to perform Fisher's test
even if these conditions are not met. The N option means
that instead of composing the contingency table by
applying conditions to the data in EASISTAT's data table,
the user can enter by hand the values that he or she
wants to appear in each cell of the table. Using the
graphing option allows frequency histograms to be
displayed from the contingency table (see the relevant
section in the EASIGRAF documentation). Any, all, or none
of the options may be used at once.
The user must supply the number of columns and rows for
the table. These can optionally be supplied on the
command line, otherwise EASISTAT will request them to be
entered. The user must also supply the conditions to be
used to categorise the values into these rows and
columns, or alternatively (when the N option is used)
must enter the numbers for each cell of the contingency
table.
Example:
Select command - CHISQ
- Chi-squared test -
Input number of columns: 2
Input number of rows: 2
Enter condition for column 1: C15<12
Enter condition for column 2: C15>=12
Enter condition for row A: SEX=1
Enter condition for row B: SEX=2
Output:
Column 1: C15<12
Column 2: C15>=12
Row A: SEX=1
Row B: SEX=2
1 2
A 31.0 31.0% (32.12) 42.0 42.0% (40.88) 73.0 73.0%
B 13.0 13.0% (11.88) 14.0 14.0% (15.12) 27.0 27.0%
44.0 44.0% 56.0 56.0% 100.0
Chi-squared = 0.258, 1 df p = 0.6113
Using Yates' correction: Chi-squared = 0.079, 1 df p = 0.7785
In this example the first column consists of the number
of data rows for which C15 is less than 12, and the
second column the number of rows for which it is greater
than or equal to 12. (In the example data set in the file
EXAMPLE.DAT, C15 contains the GHQ scores.) The rows of
the contingency table contain a count of the number of
rows of the data table for which the value in the column
titled SEX (column 3 in the example data set) is equal to
1 or to 2. The contingency table output shows the
observed number of values falling into each category
followed by the observed percentage and then in brackets
by the expected number for each category. Since this
example was performed with 100 valid data rows, the
observed numbers and percentages are in fact equal. Row
and column totals and percentages are also output.
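The arithmetic behind this table can be sketched in Python (an illustration only, not EASISTAT's own code). Expected counts are row total times column total over the grand total, and the statistics reproduce the figures in the output above:

```python
# Observed 2x2 table from the example:
# rows SEX=1 / SEX=2, columns C15<12 / C15>=12
obs = [[31.0, 42.0],
       [13.0, 14.0]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(r[j] for r in obs) for j in range(2)]
grand = sum(row_tot)

# Expected count for each cell: row total * column total / grand total
exp = [[row_tot[i] * col_tot[j] / grand for j in range(2)]
       for i in range(2)]

chi2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
           for i in range(2) for j in range(2))
yates = sum((abs(obs[i][j] - exp[i][j]) - 0.5) ** 2 / exp[i][j]
            for i in range(2) for j in range(2))

print(round(chi2, 3))   # 0.258, as in the output above
print(round(yates, 3))  # 0.079 with Yates' correction
```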
Example:
Select command - c n f
- Chi-squared test -
Input number of columns: 2
Input number of rows: 2
Enter 2 values for row A (all on one line): 13 7
Enter 2 values for row B (all on one line): 10 6
Output:
1 2
A 13 (12.78) 36.1% 7 ( 7.22) 19.4% 20 55.6%
B 10 (10.22) 27.8% 6 ( 5.78) 16.7% 16 44.4%
23 63.9% 13 36.1% 36
Chi-squared = 0.024, 1 df p = 0.8767
Using Yates' correction: Chi-squared = 0.038, 1 df p = 0.8462
Fisher's exact test, p = 0.5752
When the N option is used the values to go into the table
are entered directly by the user rather than being
counted from the data set. In the example above the user
enters the values 13, 7, 10 and 6 for a two-by-two table.
Since the F option was also specified, Fisher's exact
test is also performed.
Example:
Select command - CH 2 3
- Chi-squared test -
Enter condition for column 1: SEX=1
Enter condition for column 2: SEX=2
Enter condition for row A: CLASS=1
Enter condition for row B: CLASS=2
Enter condition for row C: CLASS>2
Output:
1 2
A 41 (40.88) 41.0% 15 (15.12) 15.0% 56 56.0%
B 24 (25.55) 24.0% 11 ( 9.45) 11.0% 35 35.0%
C 8 ( 6.57) 8.0% 1 ( 2.43) 1.0% 9 9.0%
73 73.0% 27 27.0% 100
Chi-squared = 2.011, 2 df p = 0.3659
Using Yates' correction: Chi-squared = 0.537, 2 df p = 0.7644
Comments
The CHISQ command outputs the observed value in each
cell, the expected value in brackets, the column and row
totals and the percentages each figure represents with
respect to the total number of items. This means that the
command can be used simply to provide a frequency
distribution of the numbers and percentages of certain
observations falling within certain criteria, by setting
up a table with only one column. Here's how we can see
the numbers and percentages falling within different
ranges of GHQ score:
Select command - c
- Chi-squared test -
Input number of columns: 1
Input number of rows: 5
Enter condition for column 1: 1
Enter condition for row A: GHQ<=25
Enter condition for row B: GHQ>25&GHQ<=35
Enter condition for row C: GHQ>35&GHQ<=45
Enter condition for row D: GHQ>45&GHQ<=55
Enter condition for row E: GHQ>55
Output:
1
A 31 (31.00) 35.6% 31 35.6%
B 35 (35.00) 40.2% 35 40.2%
C 11 (11.00) 12.6% 11 12.6%
D 7 ( 7.00) 8.0% 7 8.0%
E 3 ( 3.00) 3.4% 3 3.4%
87 100.0% 87
The expected values and row totals are still calculated,
though obviously they are the same as the observed
values. The use of just one column in this way can be
particularly useful when preparing graphs, especially of
continuous variables. A frequency histogram can be
generated using values gathered into a small number of
groups. If the G option had been specified with the above
example then a histogram with five bars would be
generated representing the five ranges.
6:3. Wilcoxon
Format: W[ilcoxon] [s[igned]] [column]
This command performs Wilcoxon's rank sum test to compare
the values in two groups to say whether the values are
generally higher in one group than those in the other. It
is exactly equivalent to two other commonly used
nonparametric tests, the Mann-Whitney U test and
Kendall's S, so only one of these tests is provided.
Alternatively Wilcoxon's signed rank sum test can be
applied to test whether the values in a column are
generally greater than or less than 0. It would generally
be used to compare two columns in a pairwise manner.
Example:
Select command - w c15
Enter condition for first group: SEX=1
Enter condition for second group: SEX=2
Output:
Wilcoxon's comparison of two groups:
Number (%) Sum of ranks Mean Group
73 (73.0%) T0 3804.0 3686.5 SEX=1
27 (27.0%) T1 1246.0 1363.5 SEX=2
Variance: 16554.807 (Sum-mean)/sd: 0.913
One-tailed p = 0.1805
The rank sum is taken as approximating to a normal
distribution with mean and standard deviation derived as
described by Armitage and Berry. The probability value
given is the one-tailed probability of the rank sum
reaching a value of such magnitude in the given direction
assuming this normal distribution. For low numbers in
each group the user may prefer to refer to a set of
tables quoting the exact probability value for the rank
sum dependent on the numbers in each group.
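The ranking step common to all these tests can be sketched as follows (an illustration in Python, not EASISTAT's code; the data are invented). Tied values receive the mean of the ranks they span, and the rank sum for one group is then just the total of its ranks:

```python
def ranks(values):
    # Assign ranks 1..n, giving tied values the mean of the
    # ranks they would otherwise occupy
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j + 2) / 2.0  # ranks are 1-based
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

group1 = [1.0, 3.0, 3.0, 5.0]
group2 = [2.0, 3.0, 6.0]
r = ranks(group1 + group2)
t1 = sum(r[:len(group1)])  # rank sum for the first group
print(t1)                  # 15.0
```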
If the SIGNED option (which can be abbreviated down to S)
is specified then Wilcoxon's signed rank sum test is used
to determine whether the values in a column are
significantly less than or greater than zero. The usual
application of this would be to first make one column the
difference between two others using the DERIVE command,
and then to perform the signed rank test on it. This
would then be a pairwise test of whether the values in
one column were higher than those in the other.
Example:
Select command - new c3 diff
Select command - derive c3 c1-c2
Select command - w s c3
Output:
Wilcoxon's signed rank sum test using C3 (DIFF)
n' = 25 T+ = 24.0 T- = 301.0
Variance = 1338.5
Standardised normal deviate with continuity correction = 3.772
One-tailed p = 0.0001
Here the DERIVE command is first used to make the values
in c3 equal to the differences between the values in c1
and c2. Then the signed rank test is applied to c3 to
provide a pairwise test of whether the values in c1 are
significantly higher or lower than in c2. A one-tailed
probability for the difference to assume such a magnitude
in the given direction is output.
6:4. Kendall's rank correlation coefficient
Format: K[endall] [column column]
Investigates the relationship between two columns of
nonparametric data using Kendall's rank correlation
coefficient.
Examples:
Select command - KEND GHQ HDA
Select command - k
Enter two columns to compare (one on each line):
c15
c16
If the two columns are not included on the command line
EASISTAT will ask for them.
Output:
Rank correlation of C15 (GHQ) with C16 (HDA)
Kendall's S = P - Q = 3393 - 1164 = 2229
Kendall's tau (correlation coefficient) = 0.450
Variance of S = 111938.7, corrected normal deviate of S = 6.659
One-tailed p = 0.0000
The correlation coefficient is sometimes referred to as
Kendall's tau. Kendall's S is taken as approximating
to a normal distribution with mean and standard deviation
derived as described by Armitage and Berry. The
probability value given is the one-tailed probability of
Kendall's S of such magnitude assuming this normal
distribution. It is the one-tailed probability that such
a correlation could have occurred in the direction found
by chance.
Note: Some other statistics programs sometimes give a
slightly different value for the correlation coefficient.
This is because they take into account ties (two
rows having the same value) before they calculate the
correlation coefficient. The procedure used by EASISTAT
(as recommended by Armitage and Berry) is to take into
account ties only when calculating the significance of
the correlation coefficient. Thus some other programs may
give different values for Kendall's tau, but the eventual
p value calculated should be the same (unless the other
program makes a mistake - at least one gives the wrong
answer).
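The quantities S = P - Q and tau can be computed directly by counting concordant and discordant pairs, as in this Python sketch (an illustration only; the data here have no ties, so the tie question does not arise):

```python
def kendall_s_tau(x, y):
    # Count concordant (P) and discordant (Q) pairs of rows
    n = len(x)
    p = q = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                p += 1      # pair ordered the same way in both columns
            elif prod < 0:
                q += 1      # pair ordered oppositely
    s = p - q
    tau = s / (n * (n - 1) / 2)  # denominator: total number of pairs
    return s, tau

s, tau = kendall_s_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(s, round(tau, 3))  # S = 4, tau = 0.667
```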
6:5. Kolmogorov-Smirnov test
Format: Ko[lmogorov] [g[raphfile]] [column]
This compares a nonparametric measure between two groups
and tests whether the distribution of values between them
is significantly different.
Examples:
Select command - ko g c15
Enter condition for first group: c3=1
Enter condition for second group: c3=2
Select command - ko
Enter column to test: c15
Enter condition for first group: c5<4
Enter condition for second group: c5>=4
If the column supplying the variable is not included on
the command line EASISTAT will ask for it.
Output:
Kolmogorov-Smirnov comparison of two groups using C15 (GHQ)
C5<4
C5>=4
K-S statistic = 0.4302
p = 0.0002
Comments
This test is not described in Armitage and Berry, but it
is sometimes used and it should not be hard to find a
reference to it in a statistics textbook (our
implementation is from Numerical Recipes in C by Press et
al). The function of this test is similar to Wilcoxon's
rank sum test, except that it does not test whether the
values in one group are in general higher or lower than
those in the other, but only whether the distributions
differ. It might therefore be possible to detect that two
distributions with equal medians are significantly
different because of differences in range, skewness or
kurtosis.
What the test does is to compare the cumulative
percentage distributions of the two groups and to measure
the maximum separation between these two distributions (a
value between 0 and 1). The graphing option provides a
graphical representation of this and allows other
measures to be displayed as well, for example the
frequency distributions of the groups (see the relevant
section in the EASIGRAF documentation).
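The maximum-separation idea can be sketched in a few lines of Python (an illustration, not the Numerical Recipes implementation EASISTAT uses): evaluate both empirical cumulative distributions at every observed value and take the largest gap.

```python
def ks_statistic(a, b):
    # Maximum separation between the two empirical
    # cumulative distribution functions (a value in [0, 1])
    points = sorted(set(a) | set(b))
    d = 0.0
    for p in points:
        f_a = sum(1 for v in a if v <= p) / len(a)
        f_b = sum(1 for v in b if v <= p) / len(b)
        d = max(d, abs(f_a - f_b))
    return d

print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```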
6:6. Ttest
Format: Tt[est] [column]
Format: Tt[est] [p[aired] [column column]]
Student's t test is used to determine whether the values
in one group are significantly higher than those in
another group. If the option PAIRED (which may be
abbreviated down to P) follows the command then a paired
t test will be performed, otherwise an unpaired test. For
the unpaired test the values all lie in one column, and
two logical expressions must be entered to specify the
conditions which define the two groups. For the paired
test two columns are compared and the measures in each
row are taken to be paired.
Example:
Select command - tt
Enter column to test: GHQ
Enter condition for first group: SEX=1
Enter condition for second group: SEX=2
Output:
Studying C15 (GHQ)
Mean 18.84 if SEX=1
Mean 15.30 if SEX=2
Unpaired t test, 98 degrees of freedom: t = 0.927
Two-tailed p = 0.3560 (Assumes equal variances)
Comparison of means, standardised normal deviate: d = 0.977
Two-tailed p = 0.3283 (Does not assume equal variances)
Two tests are performed. One assumes that although the
means may differ between the groups, the variances do
not. This compares the means of the groups according to a
t statistic and outputs a two-tailed probability value
for the difference between the two means to be as large
as it is by chance. The second test does not make the
assumption of equal variances. It takes the difference
between the means to approximate a normal distribution
and quotes the two-tailed probability for the difference
to be as large as it is.
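The first of the two tests, the pooled-variance t statistic, can be sketched in Python (illustrative only, with invented data; the p value lookup against the t distribution is omitted):

```python
import math

def unpaired_t(a, b):
    # Pooled-variance t statistic: assumes the two groups
    # share a common variance
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    t = (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2  # t statistic and degrees of freedom

t, df = unpaired_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(round(t, 3), df)  # -2.0 with 8 degrees of freedom
```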
If the PAIRED option is selected then the values in two
columns are compared in a pairwise fashion, to see if the
mean difference between pairs is significantly less than
or greater than zero.
Example:
Select command - TTEST P C16 C17
Output:
Comparing C16 (HDA) and C17 (HDD)
Paired t test, 99 degrees of freedom: t = 0.068
Two-tailed p = 0.9462
Each row contains a pair of values, one in each of the
columns specified. The mean of the differences between
these values, divided by its standard error, forms the t
statistic, and the probability value quoted is the two-
tailed value for a difference as large as this to occur
by chance.
If confidence limits have been requested (using the
LIMITS command) then upper and lower confidence limits
for the true mean difference between the groups or pairs
will also be output.
6:7. Linear regression and correlation coefficient
Format: R[egress] [g[raphfile]] [column column [column]]
Calculates the correlation coefficient between two
measures in different columns, and calculates the linear
regression line ("least squares fit") for the second
column on the first.
Examples:
Select command - REG C15 C16
Select command - r
Enter two columns to compare (one on each line):
c15
c16
Output:
Linear regression using C15 (GHQ) and C16 (HDA)
Regression of c16 on c15: C16 = 4.788 + 0.208 * C15
Correlation coefficient r = 0.725
SE(b)= 0.020 Significance: t = 10.416, 98 df p = 0.0000
The correlation coefficient, r, is output (this is
sometimes referred to as Pearson's correlation
coefficient). The standard error of the gradient of the
line SE(b) is output, and this can be taken to be
distributed as a t statistic allowing the calculation of
confidence limits. It is also used to calculate the
significance of the results - the probability value
quoted is a two-tailed value for a correlation of such
magnitude to occur by chance.
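The least-squares calculation itself can be sketched in Python (an illustration with invented data, not EASISTAT's code; the SE(b) and significance steps are omitted):

```python
import math

def linreg(x, y):
    # Least-squares fit of y on x: intercept a, gradient b,
    # and Pearson's correlation coefficient r
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    a = my - b * mx
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

a, b, r = linreg([1, 2, 3, 4], [2, 3, 5, 6])
print(round(a, 2), round(b, 2), round(r, 2))  # 0.5 1.4 0.99
```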
If confidence limits have been requested (using the
LIMITS command) then upper and lower confidence limits
for the true correlation coefficient and gradient will
also be output. The graphing option plots one variable
against the other and allows the regression lines to be
displayed (see the relevant section in the EASIGRAF
documentation).
If a third column name is given then it will be filled
with the values which would be predicted from the
regression equation with the coefficients arrived at.
These are the values which the dependent variable would
take if it was completely determined by the independent
variable according to the regression equation.
Example:
Select command - REG C15 C16 C17
This gives just the same result as entering:
Select command - REG C15 C16
Select command - DERIVE C17 4.788 + 0.208 * C15
The linear regression equation is automatically applied
to column 15 and the results entered into column 17.
6:8. Anova
Format: A[nova] [N or g[raphfile]] [column]
The one-way analysis of variance is equivalent to an
unpaired t test except that the comparison is performed
between more than two groups. It measures whether there
is a tendency for the groups of values to have different
means, or whether they might all be drawn from the same
population. The values lie in one column and the groups
are defined by conditions. If the option N (for
nonparametric) is chosen then the Kruskal-Wallis one-way
analysis of variance by ranks test is performed instead.
Example:
Select command - A
- One-way analysis of variance -
Enter column for dependent variable
c15
Input number of groups: 4
Enter condition for group A: c5=1
Enter condition for group B: c5=2
Enter condition for group C: c5=3
Enter condition for group D: c5>3
Output:
One-way analysis of variance with C15 (GHQ) as dependent variable
Group A: C5=1
Group B: C5=2
Group C: C5=3
Group D: C5>3
Between pairs of groups: t tests (96 df)
A B C
t p t p t p
B -1.111 0.2692
C -0.400 0.6897 1.456 0.1487
D -1.940 0.0553 -1.353 0.1793 -4.231 0.0001
Overall significance: F = 6.463 3,96 df, p = 0.0005
The analysis of variance outputs an F ratio which gives
the overall significance representing the probability
that all the group means could have varied so much by
chance. It also computes a t statistic and two-tailed
probability value for the difference between the means
for each pair of groups. This latter differs from
performing an ordinary unpaired t test between the two
groups only in that the whole sample is used to provide
an estimate of the overall variance of the measure,
rather than only relying on the values in the pair of
groups under consideration.
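The F ratio is the between-groups mean square over the within-groups mean square, as in this Python sketch (an illustration with invented data; the p value lookup is omitted):

```python
def anova_f(groups):
    # One-way analysis of variance:
    # F = between-group mean square / within-group mean square
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                    for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

f, df1, df2 = anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(f, 2), df1, df2)  # 27.0 with 2 and 6 df
```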
The graphing option plots the values from each group in a
vertical scatter plot and allows the group means to be
displayed (see the relevant section in the EASIGRAF
documentation).
Example:
Select command - A N
- One-way analysis of variance -
Enter column for dependent variable
c15
Input number of groups: 5
Enter condition for group A: c5=1
Enter condition for group B: c5=2
Enter condition for group C: c5=3
Enter condition for group D: c5=4
Enter condition for group E: c5=5
Output:
Kruskal-Wallis test with C15 (GHQ) as dependent variable
Group A: Number = 3 Mean rank = 35.00 C5=1
Group B: Number = 12 Mean rank = 52.88 C5=2
Group C: Number = 41 Mean rank = 38.71 C5=3
Group D: Number = 25 Mean rank = 59.66 C5=4
Group E: Number = 19 Mean rank = 64.84 C5=5
Between pairs of groups comparisons of mean ranks
(Two-tailed, corrected for multiple comparisons)
A B C D
Ru-Rv p Ru-Rv p Ru-Rv p Ru-Rv p
B 17.88 NS
C 3.71 NS -14.17 NS
D 24.66 NS 6.78 NS 20.95 0.0887
E 29.84 NS 11.97 NS 26.13 0.0235 5.18 NS
Overall significance: KW (corrected for ties) = 14.880, 4 df
p = 0.0050
For the Kruskal-Wallis test the column and groups are
selected in the same way as for the parametric analysis
of variance, and the output reports the overall
differences between the group ranks and between pairs of
groups comparisons as described in Nonparametric
Statistics for the Behavioural Sciences by Siegel.
6:9. Multiple regression
Format: M[ultiple] [column [column]]
This test measures how well one dependent variable (in
one column) is predicted by a number of independent
variables in other columns.
Example:
Select command - m
- Multiple linear regression -
Enter column for dependent variable
c27
Input number of independent variables: 4
Input 4 columns (one on each line):
c19
c20
c21
c22
Output:
Multiple linear regression with C27 (SEV) as dependent
variable
Regression equation:
C27 = 0.146
+ 0.093 * C19 SE(b) = 0.055
+ 0.154 * C20 SE(b) = 0.075
+ 0.029 * C21 SE(b) = 0.079
+ 0.140 * C22 SE(b) = 0.065
Variance ratio F = (73.844/4)/0.717 = 25.747 df = 4,95
p = 0.0000
Multiple correlation coefficient R = 0.721
Significance of each measure (95 degrees of freedom):
C19: t = 1.694 p = 0.0935
C20: t = 2.050 p = 0.0431
C21: t = 0.370 p = 0.7122
C22: t = 2.156 p = 0.0336
This test outputs a multiple correlation coefficient and
the best-fitting linear regression equation using all the
independent variables. The coefficients for each variable
are given and their standard errors. These are used to
produce a t statistic and two-tailed significance for the
independent correlation of each variable with the
dependent variable. Note that this will vary according to
which other variables are included in the analysis. An
overall two-tailed probability derived from an F ratio of
variances is also given, representing the probability of
such a large multiple correlation coefficient occurring
by chance.
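The coefficients themselves come from solving the normal equations, which can be sketched in Python (an illustration with invented data constructed so that the dependent variable is exactly 1 + 2*x1 + 3*x2; the standard errors, F ratio and t statistics are omitted):

```python
def solve(a, b):
    # Gauss-Jordan elimination with partial pivoting
    # for a small linear system a * x = b
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [m[r][k] - factor * m[col][k] for k in range(n + 1)]
    return [m[i][n] / m[i][i] for i in range(n)]

def multiple_regression(xs, y):
    # Least-squares coefficients b0, b1, ... from the
    # normal equations X'X b = X'y
    rows = [[1.0] + list(r) for r in xs]  # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

xs = [(0, 0), (1, 0), (2, 1), (0, 1), (1, 2)]
y = [1, 3, 8, 4, 9]             # exactly 1 + 2*x1 + 3*x2
coefs = multiple_regression(xs, y)
print([round(c, 6) for c in coefs])  # [1.0, 2.0, 3.0]
```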
If a second column name is given after the first, then it
will be filled with the values which would be predicted
from the regression equation with the coefficients
arrived at. These are the values which the dependent
variable would take if it was completely determined by
the independent variables according to the regression
equation.
Example:
Select command - m c27 c28
Input number of independent variables: 4
Input 4 columns (one on each line):
c19
c20
c21
c22
In this case column 28 will be filled with the predicted values.
6:10. Principal component analysis
Format: Co[mponent] [g[raphfile]] [number of variables]
Performs principal component analysis on a number of variables.
Example:
Select command - co
- Principal component analysis -
Enter number of columns to analyse: 4
Input 4 columns (one on each line):
c19
c20
c21
c22
Input lower limit of contribution to variance to include
component into main table (0 for all, 1 for none):
0.05
Output:
Including largest 4 components into table
Principal component analysis
Contribution to overall variance:
Co1 Co2 Co3 Co4
0.7284 0.1586 0.0603 0.0528
Correlations between components and variables:
Co1 Co2 Co3 Co4
C19 A -0.6979 0.6932 -0.1751 -0.0421
C20 B -0.9193 -0.0405 0.3085 -0.2409
C21 C -0.9095 -0.0443 0.1250 0.3941
C22 D -0.8617 -0.4021 -0.3001 -0.0758
Principal components are derived (there is no facility to
rotate them). The contribution of each to the overall
variance is output, as is the correlation matrix between
them and the original variables.
All components contributing more than a certain fraction
of the overall variance are incorporated into the main
data table as new columns at the right-hand edge of the
table. They are titled Co1, Co2, etc. If the critical
fraction requested is 0 then all of the components will
be so incorporated, if it is 1 then none of them will be.
The original variables are not normalised before the
analysis (i.e. they are not altered to have unit
variance). The user may do this beforehand if he or she
wishes; otherwise variables with a large variance will
make a proportionately large contribution to the
analysis.
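Such a normalisation can be sketched in Python (an illustration only; here the n-divisor standard deviation is used, which is one possible choice of scaling, and the data values are invented):

```python
import statistics

def standardise(column):
    # Rescale a column to zero mean and unit variance,
    # so each variable contributes comparably to the analysis
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)  # n-divisor standard deviation
    return [(v - mean) / sd for v in column]

z = standardise([2.0, 4.0, 6.0, 8.0])
print(statistics.mean(z), statistics.pstdev(z))  # mean ~0, sd ~1
```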
The graphing option has nothing to do with principal
components analysis and is just a way of selecting
multiple columns to be output to a graph file so that
they can subsequently be plotted against each other (see
the relevant section in the EASIGRAF documentation).
6:11. Minimise
Format: Mi[nimise] [expression]
This is a general purpose minimisation function which
allows, for example, non-linear optimisations to be
performed. To use it you enter an arithmetic
expression which includes at least one of the general
purpose variables (V1, V2, etc.) and then the names of
those variables within the expression which are to be
altered to minimise the value of the expression over all
the data rows. Usually the aim will be to find the best
fit of the expression to the values in one column and in
this case the expression will automatically be converted
into the expression for the least-squares fit to that
column. If the original function is to be minimised
instead, then enter NONE (which can be abbreviated down
to N) instead of a column name.
For example, to use the MINIMISE command to perform
multiple linear regression with column HDD as the
dependent variable and columns A, B, C, and D as
independent variables:
Select command - min V1 + V2*A + V3*B + V4*C + V5*D
Enter column to fit to or NONE to minimise function, and
optional second column for best predicted fit: HDD
Enter list of variables to iterate (all on one line):
V1 V2 V3 V4 V5
Output:
Sigma: ((V1+V2*A+V3*B+V4*C+V5*D) - HDD)POW2
- function minimised after 9 iterations.
Final value: 866.398
v1 = 1.82441
v2 = 0.346348
v3 = 0.265328
v4 = 0.678139
v5 = 0.24945
The output will show you that the following function is
in fact the one which is minimised:
Sigma: ((V1 + V2*A + V3*B + V4*C + V5*D)-HDD)pow2
This is the function for the least sum of squares
difference between the supplied function and the column
to fit to. The final value of this sum of squares is
output, together with the best-fitting values for the
variables which have been altered from their starting
values. In this case the variables are the coefficients
of the linear regression equation.
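The same fit can be reproduced outside the program by
solving the least-squares problem directly. A sketch in
Python with invented columns and coefficients standing in
for A to D and HDD (this is an illustration, not part of
EASISTAT):

```python
import numpy as np

# Invented columns A, B, C, D and a dependent column built from
# known coefficients, standing in for the HDD example
rng = np.random.default_rng(0)
cols = rng.random((100, 4))
true_coef = np.array([0.35, 0.27, 0.68, 0.25])
hdd = 1.8 + cols @ true_coef

# Design matrix: a column of ones plays the role of the intercept V1
X = np.column_stack([np.ones(len(cols)), cols])
beta, *_ = np.linalg.lstsq(X, hdd, rcond=None)
# beta[0] is the intercept (V1); beta[1:] are V2..V5
```

With noise-free data the fitted coefficients recover the
values used to generate the column.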
General minimisation is slower, less accurate and less
informative than the linear commands supplied (i.e. the
simple and multiple linear regression commands), so it is
best to convert your function to a linear form whenever
possible (which it often is). It is up to you to make
sure that the function has a minimum, and to set
appropriate starting values for the variables so that the
global minimum is found if there is more than one local
minimum.
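For example, a function of the form y = a*exp(b*x) is
non-linear in a and b, but taking logarithms gives
ln y = ln a + b*x, which an ordinary linear fit can
handle. A sketch with invented noise-free data (assuming
y is strictly positive):

```python
import numpy as np

# Invented data from y = 3 * exp(0.7 * x)
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 5.0, 100)
y = 3.0 * np.exp(0.7 * x)

# Fit ln(y) = ln(a) + b*x with a linear least-squares fit;
# polyfit returns the slope first, then the intercept
b, ln_a = np.polyfit(x, np.log(y), 1)
a = np.exp(ln_a)
```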
If no column is to be fitted to then the function itself
is minimised, for example:
Select command - min 6*V1pow2+4*V1-123
Enter column to fit to or NONE to minimise function, and
optional second column for best predicted fit: NONE
Enter list of variables to iterate (all on one line):
V1
Sigma: 6*V1POW2+4*V1-123
- function minimised after 1 iterations.
Final value: -12366.7
V1 = -0.33458
This finds the value of V1 for which the supplied
quadratic equation has a minimum, which in this case is
-0.335. However note that even when the expression being
minimised contains no column references, it is still
evaluated once for every data row and the function value
is the total over all the rows (this example was run with
100 data rows, so the final value is 100 times what would
be expected). This makes sense, because one might want to
minimise a function such as 6*V1POW2+4*V1-C2, but it
means that the minimisation will be unnecessarily slow if
no values from the data table are actually needed. In
such a case all the data rows can be temporarily excluded
by issuing the command:
Select command - NARROW 0
The condition is always false so this makes all the data
rows invalid. When there are no valid data rows the
expression is calculated just once, rather than once for
every data row.
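The behaviour of this quadratic example can be imitated
with a simple shrinking-step search. A Python sketch (a
toy illustration, not EASISTAT's actual algorithm),
assuming the same 100 data rows as the example:

```python
def f(v1):
    # The per-row expression from the example above
    return 6 * v1 ** 2 + 4 * v1 - 123

def minimise(f, x, step=0.5, tol=1e-8):
    # Move x up or down while that reduces f; when neither direction
    # helps, halve the step until it falls below tol
    fx = f(x)
    while step > tol:
        moved = False
        for trial in (x + step, x - step):
            if f(trial) < fx:
                x, fx, moved = trial, f(trial), True
        if not moved:
            step /= 2
    return x, fx

v1, fmin = minimise(f, 0.0)
n_rows = 100           # the example was run with 100 data rows
total = n_rows * fmin  # MINIMISE reports the value summed over rows
```

The search settles at V1 = -1/3, and the reported total is
the per-row minimum multiplied by the number of rows, as
described above.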
If a second column name is given after the first, then it
will be filled with the values which would be predicted
from the function with the coefficients arrived at. These
are the values which the dependent variable would take if
the best-fitting function applied exactly.
Here is a final example where column 3 is fitted to a
non-linear function of columns 1 and 2. The results
predicted from the function are then written into column
4:
Select command - min V1*C1*exp(C2powV2)
Enter column to fit to or NONE to minimise function, and
optional second column for best predicted fit: c3 c4
Enter list of variables to iterate (all on one line):
v1 v2
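The iterative search that MINIMISE performs can be
imitated with a toy shrinking-step coordinate search. A
Python sketch with invented rows generated from known
values V1 = 2.0 and V2 = 1.5 (this illustrates the idea
only; it is not the program's actual algorithm):

```python
import math
import random

# Invented rows (C1, C2, C3), with C3 built from V1 = 2.0, V2 = 1.5
random.seed(0)
rows = []
for _ in range(100):
    c1 = random.uniform(1.0, 2.0)
    c2 = random.uniform(0.2, 0.8)
    rows.append((c1, c2, 2.0 * c1 * math.exp(c2 ** 1.5)))

def sse(v1, v2):
    # Sum over rows of ((V1*C1*exp(C2**V2)) - C3)**2
    return sum((v1 * c1 * math.exp(c2 ** v2) - c3) ** 2
               for c1, c2, c3 in rows)

def minimise(f, x, step=0.5, tol=1e-8):
    # Shrinking-step coordinate search: try moving each variable up or
    # down; when no move improves the sum, halve the step
    fx = f(*x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                trial = list(x)
                trial[i] += d
                if f(*trial) < fx:
                    x, fx, improved = trial, f(*trial), True
        if not improved:
            step /= 2
    return x, fx

(v1, v2), final = minimise(sse, [1.0, 1.0])
```

Because the data were generated with no noise, the sum of
squares at the true values is zero, and the search reduces
the sum well below its value at the starting point.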
Note
The iterative process stops when one step fails to reduce
the absolute value of the function by one ten-thousandth
of its current value. This should be appropriate for most
applications, particularly least-squares fitting as
opposed to general minimisation. It means that if the
function has a value of 2 then the last step reduces it
by less than 0.0002, but if a function of the same shape
has a value of 200000 then the last reduction may be as
large as 20. If you want higher accuracy you will have to
add a constant to the function which brings its absolute
value close to zero (in the latter example one would add
-200000) and start the minimisation process again. Note
that if the function is being evaluated over a number of
data rows then the constant to add must first be divided
by the number of rows.
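The figures quoted above follow from simple arithmetic; a
brief sketch (the one-in-ten-thousand factor and the 100
rows are taken from the examples in this section):

```python
rel_tol = 1e-4  # one ten-thousandth, as described above

# Smallest reduction still treated as progress at two function values
small = 2.0 * rel_tol        # about 0.0002
large = 200000.0 * rel_tol   # about 20

# Over 100 data rows, subtracting 200000 from the reported total
# corresponds to adding -200000 / 100 to the per-row function
per_row_constant = -200000.0 / 100
```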