1993-05-23
|
33KB
|
643 lines
ksstat 2.10
Joseph C. Hudson
4903 Algonquin
Clarkston, MI 48348
Introduction
ksstat performs exploratory regression, crosstabs and Lilliefors'
test for normality, and produces summary statistics, histograms
and multiple scatter plots.
ksstat uses the 80x87 chip if it is present, and emulates it if
it is not present. All computation is done with 64 bit reals.
I do not offer a warranty or guarantee of any kind for this
program. I've tried hard to make the output correct, but using
it with new data sets and different machines may reveal errors
I'm not aware of. Follow the advice of Gerard E. Dallal
(Statistical Microcomputing - Like It Is, American Statistician,
V42 N3 Aug 1988): assume that this program does everything wrong
until you put it through its paces with difficult input and
conclude otherwise. Above all, enjoy. If you care to send me a
brief report about what you like and don't like about this
program, it would be very much appreciated.
ksstat is copyright (C) 1990-93 Joseph C. Hudson 4903 Algonquin
Clarkston MI 48348. All rights are reserved.
Files
Before running ksstat, you need to have a data file and,
optionally, a codebook file. See descriptions below. When
running ksstat, the first thing you should do is to select
<get files> from the main menu and give data, codebook and output
file names. The extensions .dat, .cod and .out will be added if
you type in names without extensions. If you want no extension,
follow the name with a period.
ksstat page 2
Most output is sent to the output file. You can specify the
printer as the output file, but it is really a good idea to use
a disk file for output. You can view the output file from within
the program.
A couple of the modules use special output files in addition to
the general one. Lilfor uses separate files for graphs, since
they may have ^z's imbedded. Cfit uses a separate file to store
its regressions.
In the line of the menu beginning "graphics output to", you can
toggle between a number of choices. Initially, graphics output
is to screen only. You may select screen, text and/or hpgl
graphics output. Text graphics is crude, but works. The best
option for permanent graphics output is to use hpgl files. If
you select this option, a separate .Hxx file will be created
for each graph you request. These files contain hpgl language
commands. The files can be printed by PRINTGL, WordPerfect,
DrawPerfect and many other programs. This is the highest
resolution output ks programs offer.
The data file
The data file is an ascii file with the data in rows (cases) and
columns (variables). When you create a data file, use blanks or
tabs to separate numbers. Each row should have the same number of
entries as all other rows. Use nothing but numbers; no alpha
stuff. The numbers don't need to be lined up in columns.
If you have missing values, use whatever numeric missing value
code is convenient to represent them in the data. The codebook
file, described below, will tell ksstat about missing values.
Missing values are excluded from analysis by all routines,
listwise for kscfit, kstplot and kslilfor, pairwise for ksxtab,
kshist and kssumst.
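The difference between listwise and pairwise exclusion can be
sketched in a few lines of Python. This is an illustration only,
not ksstat's own code: the function names, the zero-based column
indices and the missing value code -9 are all assumptions. For a
routine that looks at one variable at a time, pairwise exclusion
amounts to filtering each column on its own, which is what the
sketch shows.

```python
def listwise(rows, cols, missing):
    """Keep only the rows in which every selected column is nonmissing."""
    return [
        [row[c] for c in cols]
        for row in rows
        if all(row[c] not in missing.get(c, set()) for c in cols)
    ]


def pairwise(rows, cols, missing):
    """Filter each selected column separately, keeping all of its nonmissing values."""
    return {
        c: [row[c] for row in rows if row[c] not in missing.get(c, set())]
        for c in cols
    }


# Hypothetical data: three cases, two variables, -9 marks a missing value.
rows = [[1.0, 10.0], [2.0, -9.0], [-9.0, 30.0]]
missing = {0: {-9.0}, 1: {-9.0}}  # zero-based column index -> missing codes

complete_cases = listwise(rows, [0, 1], missing)
per_column = pairwise(rows, [0, 1], missing)
```

Listwise exclusion leaves one complete case here, while pairwise
exclusion keeps two values in each column.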
Each row of data should end in a cr-lf sequence. This is what the
majority of text editors automatically append to a line when you
hit the return or enter key. If you create a data file with a
word processor, make sure the file is a plain ascii file, without
any strange word-processor stuff floating around. A small editor,
like sled or E!, is perfectly adequate to use.
You can have as much data as your computer memory can handle. I
have run ksstat using a data set with 811 rows and 30 columns
with no problem on an 8088 machine with 640k and an 8087. 811
columns and 30 rows would also work fine. Out of programming
laziness, a couple of minor routines will not work with columns
numbered above 1000. Without an 8087, the program will run
slower, but no less precisely.
The codebook file
A codebook file consists of as many lines as you need with the
following three pieces of information in each line:
column number    column name    missing value code
The column number identifies which data column is being referred
to. Column 1 is the leftmost column. The column name is used by
all ksstat routines. ksstat makes up a name for any columns not
named in the codebook file. The missing value code identifies
missing data. Data in the referenced column that equals the
missing value code is treated as missing by all ksstat routines.
Lines in the codebook file need not be in any particular order,
and column numbers can be repeated, so you can have more than one
missing value code for a column. See kssample.cod for an example.
In this file, column 1 has a name but no missing value code.
Column 2 has 3 missing value codes.
There is one restriction: if a missing value code is given, a
name must be given also. Otherwise, ksstat will interpret the
missing value code as a name. Names can be up to 9 characters
long. Anything longer is truncated.
In .cod files, everything on a line after a semicolon is ignored,
so you can add comment lines or comments on data lines after the
data by prefacing the comment with a semicolon.
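The codebook rules above can be summarized in a short Python
sketch. It is not how ksstat itself reads the file. It assumes
that fields are separated by blanks or tabs, that names contain no
blanks, and that the last name given for a column wins. The sample
text is made up and is not the contents of kssample.cod.

```python
def parse_codebook(text):
    """Return {column number: (name, [missing value codes])} from .cod text."""
    book = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # ignore everything after a semicolon
        if not line:
            continue
        fields = line.split()
        col = int(fields[0])
        name, codes = book.get(col, (None, []))
        if len(fields) > 1:
            name = fields[1][:9]  # names longer than 9 characters are truncated
        if len(fields) > 2:
            codes = codes + [float(fields[2])]  # repeated lines add more codes
        book[col] = (name, codes)
    return book


sample = """\
; hypothetical codebook, not the contents of kssample.cod
1 id
2 Q9a 9      ; 9 = no answer
2 Q9a 99
2 Q9a -1
3 averyveryLongName 0
"""
book = parse_codebook(sample)
```

Here column 1 gets a name and no codes, column 2 collects three
codes from three lines, and the long name for column 3 is cut to
9 characters.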
Printer codes output by ksstat are for Epson compatible 9 pin
printers. If your printer is incompatible with these commands,
you could send output to a file and then edit the file to change
the imbedded commands. This is not easily done.
If you are really desperate, I'll be glad to change the printer
commands for you as long as I don't have to spend any money to do
it. Send me a disk, stamped mailer and a list of the changes you
need. Be very specific, since I have access to very few printer
manuals, and I don't consider the ones I do have good bedtime
reading. The printer codes currently used are:
resetPrinter := #27#64;
formFeed     := #12;
elite        := #27#77;
lpi6         := #27#51#36;     {set linefeed to 12/72 in}
lpi9         := #27#51#24;     {set linefeed to 8/72 in}
lpi12        := #27#51#18;     {set linefeed to 6/72 in}
setCondensed := #27#15;        {set to 17 chars/in}
canCondensed := #18;           {cancel 17 chars/in}
setOneWay    := #27#85#01;     {set unidirectional}
graphLine    := #27#76#208#2;  {set for 8 lines of 720
                                dots @ 120 dots per in.}
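If you need to work out replacement codes for another printer, it
helps to see what the numbers in these sequences mean. The Python
sketch below assumes the usual Epson 9 pin conventions: ESC "3" n
sets the line spacing to n/216 inch, and ESC "L" n1 n2 announces
n1 + 256 * n2 columns of graphics data. The helper names are mine.
Check your own printer manual before relying on either assumption.

```python
from fractions import Fraction

ESC = 27

# The same byte values as the Pascal constants above.
lpi6 = bytes([ESC, 51, 36])
lpi9 = bytes([ESC, 51, 24])
lpi12 = bytes([ESC, 51, 18])
graph_line = bytes([ESC, 76, 208, 2])


def line_spacing_inches(seq):
    """Line spacing set by ESC '3' n, assuming units of 1/216 inch."""
    return Fraction(seq[2], 216)


def graphics_columns(seq):
    """Column count announced by ESC 'L' n1 n2 (low byte first)."""
    return seq[2] + 256 * seq[3]


spacings = [line_spacing_inches(s) for s in (lpi6, lpi9, lpi12)]
columns = graphics_columns(graph_line)
```

Under these assumptions the three spacings come out to 1/6, 1/9
and 1/12 inch, and the graphics line is 720 dots wide, which
agrees with the comments in the list.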
Notes on Individual Modules
kscfit - exploratory regression
kscfit tries to fit 726 different curves to paired data. The
curves used are of the form
f(y) = b0 + b1 * x1 + b2 * x2
where x1 and x2 are transformations of the original predictor
variable and f(y) is a transformation of the original dependent
variable.
The index of a regression indicates which transformations were
used in that regression. The index is a 3 character string: the
first character indicates the form used for y, the second
character the form used for x1 and the third character the form
used for x2. The forms and their character codes are
0 - the variable does not          6 - v * ln(v)
    appear in the equation         7 - ln(v) / v
1 - v  (the variable itself)       8 - ln(v / (1 - v))
2 - v²                             9 - ln(-ln(1 - v))
3 - 1 / v                          A - √v
4 - 1 / v²                         B - 1 / √v
5 - ln(v)
Legal values for the three characters are 1 to B for y, 1 to B
for x1 and anything prior to x1 for x2, including 0. This gives
726 possible equations, not all of which can actually be fit for
a given data set. (Homework: what is the maximum number that can
be fit to any xcolumn - ycolumn pair?)
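The count of 726 can be checked by listing every legal index. The
Python sketch below follows the rule stated above; the names FORMS
and all_indices are mine and do not come from the program.

```python
FORMS = "0123456789AB"  # character codes in the order listed above


def all_indices():
    """Enumerate every legal 3 character regression index."""
    indices = []
    for y in FORMS[1:]:                                # y may be 1 to B
        for pos, x1 in enumerate(FORMS[1:], start=1):  # x1 may be 1 to B
            for x2 in FORMS[:pos]:                     # any code before x1, including 0
                indices.append(y + x1 + x2)
    return indices


indices = all_indices()
```

There are 11 choices for y and 1 + 2 + ... + 11 = 66 legal
(x1, x2) pairs, so len(indices) is 11 * 66 = 726.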
Examples:
the index 662 indicates that the regression is
y * ln(y) = b0 + b1 * [x * ln(x)] + b2 * x²
110 is the index of the simple linear regression of y on x.
950 is the index of a two parameter Weibull fit to the data,
presuming x is failure time and y is cumulative percent failed.
If y is cumulative probability and x is any random variable, 9nn
will be a legitimate fit of a probability model. 8nn is a
logistic fit. Suggestions for additional forms would be
appreciated. The program can handle any number of them.
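To see why 950 is a Weibull fit, start from the two parameter
Weibull cdf, F(x) = 1 - exp(-(x / scale)^shape). Taking logs twice
gives ln(-ln(1 - F)) = shape * ln(x) - shape * ln(scale), which is
form 9 for y against form 5 for x, with b1 = shape and
b0 = -shape * ln(scale). The Python sketch below recovers the
parameters from made-up, noise-free data. It uses plain least
squares and my own function names, and it assumes y is given as a
fraction between 0 and 1, not as a percent. It is not kscfit's
code.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for ys = b0 + b1 * xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def fit_950(times, cum_fraction):
    """Index 950: ln(-ln(1 - y)) = b0 + b1 * ln(x)."""
    tx = [math.log(t) for t in times]
    ty = [math.log(-math.log(1.0 - p)) for p in cum_fraction]
    b0, b1 = fit_line(tx, ty)
    shape = b1                   # Weibull shape parameter
    scale = math.exp(-b0 / b1)   # Weibull scale parameter
    return shape, scale


# Synthetic data from a Weibull with shape 1.5 and scale 100.
times = [20.0, 50.0, 80.0, 120.0, 200.0]
cum = [1.0 - math.exp(-((t / 100.0) ** 1.5)) for t in times]
shape, scale = fit_950(times, cum)
```

With real failure data the points will not fall exactly on a
line, and the residuals in transformed units are the ones to
examine.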
Running kscfit
When you move the cursor to <exploratory regression> in the main menu
and hit return, the kscfit screen appears:
┌──────────────────────────────────────────────────────────────┐
│ kscfit exploratory regr │
│ │
│ data file: rows │
│ codebook file: cols │
│ output file: of data │
│ regr file: │
│ │
│ ind var (x): dep var (y): │
│ data column: data column: │
│ │
│ run regs brief summary expanded sumry get reg file │
│ view forms view output disk directory save reg file │
│ │
│ crindex: ntosave: forr: oort: │
│ │
│ detail stats hist resid comp orig y conf int y │
│ plot regr add resid add fit y add conf lim │
│ plot resid tabl resid view data plot conf lim │
└──────────────────────────────────────────────────────────────┘
The cursor will be at the data column: prompt for the independent
variable. Enter a column number. The cursor will move to the
dependent variable data column: prompt. Enter another column
number. The cursor will then move to the <run regs> prompt. Hit
the enter key and as many of the 726 possible regressions as
can be run will be. The number of regressions actually run will
appear after the ntosave: prompt. 110 will be the current
regression index.
If the number run is large, move the cursor to <brief summary>
and hit enter. The regression indices and their adjusted
coefficients of determination will be sent to the output file.
Move to <view output> and hit enter to see the results. To see
more information use <expanded summary>. This option outputs one
line per regression, so output can be voluminous. You can cut
this down by changing ntosave to a smaller number. Only the best
ntosave regressions will be included in the summary.
If you want a more complete report on a single regression, put
that regression's index in crindex and then move to the <detail
stats> prompt. Hit enter. To see the results, use <view output>.
All of the menu choices in the last three rows except <view
data> are for a specific regression, and are always performed
for the regression whose index is crindex. <plot regr> plots
the regression line along with the original data. <plot resid>
plots the residuals. <tabl resid> shows the residuals in a
table. <hist resid> shows them in a histogram. <add resid> adds
the residuals to the data matrix, not to the data on disk. To get
the residuals or anything else added to the data on disk, select
the SD option from the main ksstat menu.
<add fit y> adds the predicted y values to the data, and <add
conf lim> adds upper and lower confidence limits for either y or
the mean of y to the data. You can choose the percent confidence
in a submenu that will appear when you select <add conf lim>. By
adding the fitted y values and confidence limits to the data, you
can plot y, fitted y, and both confidence limit columns on the
same graph to get a nice graphical summary of the fit. The PL
option on the main ksstat menu can be used to do this. So can
the <plot conf lim> option on this menu.
The <conf int y> option allows you to compute confidence
intervals for y or the mean of y for any value of x, whether or
not the x is part of the data set.
When you start kscfit, you will see <oort: o> on the lower right.
oort stands for o or t which stands for original units or
transformed units. These refer to the units y is reported in
under any option that computes either y values or residual
values. Any regression with an index starting with anything but 1
has y transformed. The transformed units are probably not
physically meaningful, so if you are doing computations for use in
your application, original unit results will be most useful. If,
however, you are diagnosing regression results, looking at
residuals, possible outliers, etc, the transformed units are
appropriate, since the usual statistical properties of residuals
(0 mean, asymptotic normality, etc) apply to the residuals in
transformed units and <<not>> to the residuals in original units.
So select o or t as needed.
<forr: r> of the original menu allows you to toggle between
adjusted coefficient of determination (r) and the f statistic as
the measure used to rank the regressions from best to worst. This
affects the two summary options and the <save reg file> option.
With <save reg file>, you can save the results of the best
ntosave regressions in a special file called the reg file. The
program can later read this file, allowing you to continue
investigating the regression results without first regenerating
them from the original data.
Be sure to name the reg file on the top of the menu before trying
to save the regressions. The reg file is an ascii file, but don't
fool with it: kscfit might barf trying to read a messed up file
later. Regression coefficients, standard errors etc. are saved in
the reg file to umpteen significant figures, so if you need more
sig figs than kscfit normally prints out, you can find them there.
As you've probably guessed, the <get reg file> option does just
that. If you retrieve regressions from a reg file and the data
file name saved in the reg file is the same as the current data
file name, the data is not read. If the data on disk and the data
in memory are the same, as kscfit assumes, no problem. If the two
data sets are different, however, then strange things could
happen. The column names may not match, missing values will be
messed up and so on. The best bet is to use different file names
for different data sets, avoiding this problem.
One statistics note: you may observe high correlations between x1
and x2 for some models. The natural inclination here is to remove
either x1 or x2. Often, this is a mistake. The high correlation
in this situation is neither troublesome nor particularly
undesirable, since x1 and x2 are just two components of a
transformation of the (single) predictor variable.
kslilfor - Lilliefors' normality test
kslilfor performs the Lilliefors test for normality. Typing li at
the main menu brings up the Lilliefors menu:
┌──────────────────────────────────────────────────────────────┐
│ ksstat Lilliefors test 2/4/91 1:50 247K free mem │
│ │
│ data file: kssample.dat 811 rows │
│ codebook file: kssample.cod 7 cols │
│ output file: kssample.out of data │
│ lilgraph file: kssample.L01 │
│ │
│ variable : 5 Q11 sig level: 0.050 │
│ │
│ There are 301 nonmissing values and 510 missing values. │
│ The maximum distance between the sample and normal cdfs of │
│ 0.2442 occurs at z = -0.3840, data = 2.0000000. │
│ The critical distance is 0.0518. Normality is rejected. │
│ │
│ sample skewness is 1.70342. │
│ z for skewness test is 6.300, one tailed p value 0.0000 │
│ sample kurtosis is 6.51401. │
│ z for kurtosis test is 5.415, one tailed p value 0.0000 │
└──────────────────────────────────────────────────────────────┘
Enter the variable, the significance level and, if not previously
done, the lilgraph file name. The program
displays the "Working..." message, possibly for what seems to be
a long time, and then displays a message asking if the above
information is ok. At this point, hit return to actually perform
the test. The work done by the program prior to this message is
preliminary set up, not the test itself. During the actual
performance of the test, the bottom of the menu will be filled
with the information shown in the sample above. The skewness and
kurtosis tests are those discussed by D'Agostino, Belanger and
D'Agostino in the American Statistician, November 1990.
If screen and/or hpgl graphics output is selected, graphs are
produced as usual, either on screen or as a hpgl file or both. If
text graphics output is selected, the results are not sent to the
output file. Instead, a special lilliefors graph file is used.
This is because text based output is too crude to be useful in
this application. If the printer is selected as the lilliefors
graph file, output is sent there. If a disk file is selected,
graphics information is sent to the file. One disk file is used
for each graph. The files have the name you supply and extensions
numbered sequentially starting with .L01. The files are
set up so that copying them to the printer with a dos command
of the form
copy kssample.L01 prn /b
will produce the graph, with a copy of the menu as a header.
The printer commands used are for an Epson, so the graphs may not
print properly on non-Epson compatible printers. You will still
have a record of the results, though, since the menu with the
summary information is written to the output file.
The screen and/or hpgl options are probably the easiest ways to
get output.
The file kssample.L01 is included with this document for you to
practice with.
There must be at least 4 nonmissing data values to do the test.
The Lilliefors test is an attractive alternative to the
Chi-Square test usually used to test for normality. The test is
simple to use. If the sample CDF falls outside the Lilliefors
bounds for the selected significance level, the hypothesis that
the data is normally distributed is rejected. Try the program
with a few sample data sets to get the idea. A number of recent
statistics books discuss this test, e.g. Iman and Conover, "A
Modern Approach to Statistics", Wiley; Conover, "Practical
Nonparametric Statistics", Wiley; and Milton and Arnold,
"Probability and Statistics in the Engineering and Computing
Sciences", McGraw Hill. Conover gives the table of quantiles
originally used in this program. The quantiles used in this version
come from Dallal and Wilkinson, "An Analytic Approximation to the
Distribution of Lilliefors's Test Statistic for Normality" in the
November 1986 American Statistician (Vol 40 No 4).
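The distance reported on the Lilliefors screen can be sketched in
Python as follows. This is my own minimal version, not the code
in kslilfor. It assumes the mean and the n - 1 standard deviation
are estimated from the sample, and it leaves out the critical
distance, which must still come from the published quantiles.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lilliefors_distance(data):
    """Largest gap between the sample CDF and the fitted normal CDF."""
    xs = sorted(data)
    n = len(xs)
    if n < 4:
        raise ValueError("need at least 4 nonmissing values")
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    if sd == 0.0:
        raise ValueError("all values are equal")
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf((x - mean) / sd)
        # The sample CDF steps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


d = lilliefors_distance([1.0, 2.0, 3.0, 4.0, 5.0])
```

Normality is rejected when this distance exceeds the critical
distance for the chosen significance level and sample size.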
ksxtab - crosstabs
ksxtab does crosstabulations of up to 7 columns of data. The
first two columns form the rows and columns of a table. Any
additional columns are used as breakdown variables to create
multiple tables.
ksxtab, as all ksstat routines, considers each row of data as a
single case. It does not accept summary tables as input. It
does produce summary tables as output.
The data used to form the tables is usually integer valued,
though ksxtab is perfectly happy with any data at all. You may
not be happy, though, with the voluminous, nearly empty tables
that could appear as output if, say, a breakdown variable is
continuous with many distinct values.
Statistics are reported for individual two dimensional tables. I
haven't really given these much thought, since I use ksxtab to
get information for summary reports and to just take a look at
data. For hypothesis testing, I tend to use logistic models
and/or bootstrapping. Bootstrapping may eventually appear in
ksstat. Right now it's experimental.
If you specify just one column, either as the row variable or as
the column variable, ksxtab will produce a frequency table for
that variable.
ksmplot - scatter plots
ksmplot produces scatter plots in any or all of the three
graphics output options. Up to six variables can be plotted on
the vertical axis against one on the horizontal axis. Different
plot symbols are used for each dependent variable. The symbols
are shown in the ksmplot menu.
To use ksmplot, you must specify the variable on the horizontal
axis and as many vertical axis variables as you wish, up to 6. You
may use points, lines or points and lines for any plot. The
symbols used to plot each variable are fixed, but you can use
any of those available. Just skip down the menu on the left side
until you are in the line with the plot symbol you want to use,
then specify the variable in that line.
If more than one but fewer than 10 points appear at the same plot
position, the number of points plotted there is shown instead of
a plot symbol. If 10 or more points appear at a plot position, a
type of grid is printed.
You can specify the plotting limits if you wish, or you can leave
them as missing and the program will figure them out.
If you supply a title, it will be printed at the top of the
graph.
kshist - histograms
kshist produces histograms in two different ways, with fixed
width cells or with variable width cells in screen or hpgl
output. Text output histograms are still another variety that
uses a few ideas from stem and leaf plots. In text plot histograms,
the histogram is plotted using the first two digits of
the standardized data values. A histogram of column 3 of
kssample.dat looks like this:
histogram of Q9a. each digit represents 4 observations.
mean: 2.99566474 st dev: 1.48946469 missing: 119 nonmissing: 692
midpoint freq stem
-2.590 0 -3|
-1.845 0 -3|
-1.100 0 -2|
-0.356 0 -2|
0.389 0 -1|
1.134 63 -1|333333333333333c
1.879 204 -0|66666666666666666666666...6666666666666
2.623 0 -0|
3.368 283 0|00000000000000000000000...000000000000000000c
4.113 81 0|66666666666666666666a
4.857 0 1|
5.602 0 1|
6.347 0 2|
7.092 60 2|666666666666666
7.836 0 3|
8.581 0 3|
4.0
The ... above do not appear in the actual histogram. I just put
them in and eliminated some digits so the histogram would fit
here. The histogram is designed to print in condensed mode on
Epson printers, so that 120 columns can be used.
The stem is the first digit of the studentized sample value,
(sample value - mean) / sample std dev, and the digits that make
up the histogram are the first digits after the decimal point of
the studentized values truncated, not rounded. An entry in the
histogram with a stem of 2 and a digit of 6 represents one or
more sample values at or above 2.60 and below 2.70 standard
deviations above the mean. The number of values represented by
each digit, the mean and standard deviation are shown in the
header.
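The stem and digit rule can be written out in a few lines of
Python. The function name is mine, and the sketch ignores the
special handling of studentized values of 4.0 or more in absolute
value. The check values use the mean and standard deviation from
the sample histogram header.

```python
def stem_and_digit(value, mean, sd):
    """Stem and histogram digit of a studentized value, truncated, not rounded."""
    z = (value - mean) / sd
    tenths = int(abs(z) * 10)        # truncate to one decimal place
    sign = "-" if z < 0 else ""
    return sign + str(tenths // 10), tenths % 10


MEAN, SD = 2.99566474, 1.48946469    # from the Q9a header in the example
pairs = [stem_and_digit(v, MEAN, SD) for v in (1.0, 2.0, 3.0, 4.0, 7.0)]
```

The survey answers 1, 2, 3, 4 and 7 map to -1|3, -0|6, 0|0, 0|6
and 2|6, which matches the digits printed in the example.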
Each stem appears twice, once for the digits 0 to 4 and once for
the digits 5 through 9. If any studentized values are 4.0 or
larger in absolute value, they are printed individually above
(<= -4.0) or below (>= 4.0) the histogram itself. In the example
above, there is one value at 4.0.
The cell midpoints are shown in the histogram. Each cell is half
a standard deviation wide, so the cell boundaries are always the
midpoint ± (std dev) / 4.
The letters a, b, c,... that may appear at the right end of a
histogram element represent data values not numerous enough to
merit a digit of their own, a = 1, b = 2 etc. In the sample, the
cell with a count of 81 has 20 6's and an a. Since each 6
represents 4 data values, 20 * 4 + 1 = 81. The data graphed here
is multiple choice survey data, and so is pretty homogeneous.
With more diverse data, there could be left over values with
different individual digits. These are collected and represented
with asterisks at the right, before the letters. For example, if
there were a cell in the histogram above whose data to be plotted
had the digits
2 2 2 2 2   3 3 3 3 3 3 3 3 3 3   4 4 4
\       /   \                 /   \   /
   5 2s            10 3s           3 4s
the histogram cell would appear as |233*b
each 2 and 3 representing 4 data values, and the 6 odd data
values, a 2, two 3s and the three 4s collected into the *b at the
end.
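Here is my reading of that encoding as a Python sketch. It is
inferred from the examples in this document and is not taken from
the program source: in particular, it assumes each asterisk
stands for the same number of values as a printed digit. The
names encode_cell and per_symbol are mine.

```python
from collections import Counter


def encode_cell(digits, per_symbol):
    """Build the text for one histogram cell from its digits."""
    counts = Counter(digits)
    body = ""
    leftover = 0
    for d in sorted(counts):
        full, rest = divmod(counts[d], per_symbol)
        body += str(d) * full       # one printed digit per per_symbol values
        leftover += rest            # values too few to earn a digit
    stars, odd = divmod(leftover, per_symbol)
    body += "*" * stars             # assumed: one asterisk per per_symbol leftovers
    if odd:
        body += chr(ord("a") + odd - 1)   # a = 1, b = 2, ...
    return body


cell = encode_cell([2] * 5 + [3] * 10 + [4] * 3, 4)
```

The example above produces 233*b, and a cell of 81 sixes produces
twenty 6's followed by an a, as in the sample histogram.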
If you specify one or more breakdown variables in the histogram
menu, you will get one histogram for each combination of values
of the breakdown variables. I'll be waiting for the first irate
letter telling me that the user didn't know that 4 breakdown
variables with 6, 8, 4 and 7 levels respectively would produce
1344 histograms and lock his computer up for 3 hours. They will,
but control - C is activated in the program and can be used to
break out of such situations. (I hope - I've never tried to
produce 1344 histograms of one data column and don't plan to.)
Fixed width histograms are pretty much standard histograms. You
may specify the number of cells to use, or let the program pick a
reasonable number. To let the program do it, specify 0 as the
number of cells (the default).
Variable width histograms take a long time to compute, sometimes
a really long time. The total area of all rectangles is 1, and
the widths are chosen to maximize the interval covered along the
x axis. The algorithm to determine widths is very complicated,
which is why it takes so long to compute. The results are often
not satisfactory visually, since there can be very tall, very
skinny rectangles generated. These tend to be overwhelmed by the
limited resolution and size of the display. If you are interested
in more information about these, write.
summary statistics
This module produces the usual summary statistics for the data
columns you choose to include, or for all data columns except
those you choose to exclude. This is the point of the <incl or
excl?> part of the menu. To get a summary of all variables,
simply type i after <incl or excl?> and then ALL in the next
line. To get a summary of all columns except columns 2, 3 and 4,
respond with e and 2 3 4 in these places. The output looks
like this:
Summary Statistics
variable:            id        Q11        Q12       Q37a
data col:             1          5          6          7
mean          406.00000    2.55482    1.77409    2.48089
std dev       234.25983    1.44493    0.41888    0.74075
skewness        0.00000    1.70342   -1.31085    1.07777
kurtosis        1.80000    6.51401    2.71832   14.13046
coef of var      57.70%     56.56%     23.61%     29.86%
min             1.00000    1.00000    1.00000    1.00000
max           811.00000    9.00000    2.00000    9.00000
range         810.00000    8.00000    1.00000    8.00000
missing               0        510        510          0
nonmissing          811        301        301        811
As with kshist and ksxtab, you can specify breakdown variables to
get summaries of subsets of the data.
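If you want to check the summary output by hand, the Python
sketch below reproduces the id column shown above, which holds
the values 1 to 811. The formulas are assumptions chosen because
they match that column: the standard deviation uses n - 1, and
skewness and kurtosis are the moment ratios m3 / m2^1.5 and
m4 / m2^2, so a normal distribution has kurtosis 3. I have not
verified them against the other columns.

```python
import math


def summary(values):
    """Summary statistics for a list of nonmissing values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2 * n / (n - 1))      # sample standard deviation
    return {
        "mean": mean,
        "std dev": sd,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,         # 3 for a normal distribution
        "coef of var": 100.0 * sd / mean, # in percent
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


stats = summary(list(range(1, 812)))
```

Missing values must be removed before calling summary, as ksstat
does when it reads the codebook.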
For references, see ks.doc.
Have fun.