Monster Media 1993 #2

Text File | 1993-05-23 | 33KB | 643 lines

ksstat 2.10

Joseph C. Hudson
4903 Algonquin
Clarkston, MI 48348

Introduction

ksstat performs exploratory regression, crosstabs and the Lilliefors test for normality, and produces summary statistics, histograms and multiple scatter plots. ksstat uses the 80x87 chip if it is present, and emulates it if it is not. All computation is done with 64 bit reals.

I do not offer a warranty or guarantee of any kind for this program. I've tried hard to make the output correct, but using it with new data sets and different machines may reveal errors I'm not aware of. Follow the advice of Gerard E. Dallal (Statistical Microcomputing - Like It Is, American Statistician, V42 N3 Aug 1988): assume that this program does everything wrong until you put it through its paces with difficult input and conclude otherwise.

Above all, enjoy. If you care to send me a brief report about what you like and don't like about this program, it would be very much appreciated.

ksstat is copyright (C) 1990-93 Joseph C. Hudson 4903 Algonquin Clarkston MI 48348. All rights are reserved.

Files

Before running ksstat, you need to have a data file and, optionally, a codebook file. See descriptions below.

When running ksstat, the first thing you should do is select <get files> from the main menu and give data, codebook and output file names. The extensions .dat, .cod and .out will be added if you type in names without extensions. If you want no extension, follow the name with a period.

Most output is sent to the output file. You can specify the printer as the output file, but it is really a good idea to use a disk file for output. You can view the output file from within the program. A couple of the modules use special output files in addition to the general one. Lilfor uses separate files for graphs, since they may have ^Z's embedded. Cfit uses a separate file to store its regressions.

In the line of the menu beginning "graphics output to", you can toggle between a number of choices.
Initially, graphics output is to screen only. You may select screen, text and/or hpgl graphics output. Text graphics is crude, but works. The best option for permanent graphics output is to use hpgl files. If you select this option, a separate .Hxx file will be created for each graph you request. These files contain hpgl language commands. The files can be printed by PRINTGL, WordPerfect, DrawPerfect and many other programs. This is the highest resolution output ks programs offer.

The data file

The data file is an ascii file with the data in rows (cases) and columns (variables). When you create a data file, use blanks or tabs to separate numbers. Each row should have the same number of entries as all other rows. Use nothing but numbers; no alpha stuff. The numbers don't need to be lined up in columns.

If you have missing values, use whatever numeric missing value code is convenient to represent them in the data. The codebook file, described below, will tell ksstat about missing values. Missing values are excluded from analysis by all routines, listwise for kscfit, kstplot and kslilfor, pairwise for ksxtab, kshist and kssumst.

Each row of data should end in a cr-lf sequence. This is what the majority of text editors automatically append to a line when you hit the return or enter key. If you create a data file with a word processor, make sure the file is a plain ascii file, without any strange word-processor stuff floating around. A small editor, like sled or E!, is perfectly adequate to use.

You can have as much data as your computer memory can handle. I have run ksstat using a data set with 811 rows and 30 columns with no problem on an 8088 machine with 640k and an 8087. 811 columns and 30 rows would also work fine. Out of programming laziness, a couple of minor routines will not work with columns numbered above 1000. Without an 8087, the program will run slower, but no less precisely.
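To check a data file against the rules above before handing it to ksstat, something like the following Python sketch can help. It is not ksstat's own code: the function name read_data_rows is hypothetical, and the sketch assumes only what is stated above (numbers separated by blanks or tabs, one case per row, the same number of entries in every row). Skipping empty lines is an assumption, since ksstat's handling of them is not documented here.

```python
def read_data_rows(text):
    """Parse whitespace-separated numeric rows (hypothetical helper, not part of ksstat)."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # blanks and tabs both act as separators
        if not fields:
            continue  # assumption: empty lines are ignored
        try:
            row = [float(field) for field in fields]
        except ValueError:
            raise ValueError(f"line {line_number}: non-numeric entry")
        if rows and len(row) != len(rows[0]):
            raise ValueError(
                f"line {line_number}: expected {len(rows[0])} entries, found {len(row)}"
            )
        rows.append(row)
    return rows


# Two cases, three variables; -9 stands in for a missing value code.
sample = "1 2.5\t-9\r\n2   3.5 4\r\n"
data = read_data_rows(sample)
```

A ragged row or a stray letter raises a ValueError naming the offending line, which is usually enough to find the problem in an editor.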
The codebook file

A codebook file consists of as many lines as you need with the following three pieces of information in each line:

    column number    column name    missing value code

The column number identifies which data column is being referred to. Column 1 is the leftmost column.

The column name is used by all ksstat routines. ksstat makes up a name for any columns not named in the codebook file.

The missing value code identifies missing data. Data in the referenced column whose value equals the missing value code is treated as missing by all ksstat routines.

Lines in the codebook file need not be in any particular order, and column numbers can be repeated, so you can have more than one missing value code for a column. See kssample.cod for an example. In this file, column 1 has a name but no missing value code. Column 2 has 3 missing value codes.

There is one restriction: if a missing value code is given, a name must be given also. Otherwise, ksstat will interpret the missing value code as a name. Names can be up to 9 characters long. Anything longer is truncated.

In .cod files, everything on a line after a semicolon is ignored, so you can add comment lines or comments on data lines after the data by prefacing the comment with a semicolon.

Printer codes output by ksstat are for Epson compatible 9 pin printers. If your printer is incompatible with these commands, you could send output to a file and then edit the file to change the embedded commands. This is not easily done. If you are really desperate, I'll be glad to change the printer commands for you as long as I don't have to spend any money to do it. Send me a disk, stamped mailer and a list of the changes you need. Be very specific, since I have access to very few printer manuals, and I don't consider the ones I do have good bedtime reading.
The printer codes currently used are:

    resetPrinter := #27#64;
    formFeed     := #12;
    elite        := #27#77;
    lpi6         := #27#51#36;   {set linefeed to 12/72 in}
    lpi9         := #27#51#24;   {set linefeed to 8/72 in}
    lpi12        := #27#51#18;   {set linefeed to 6/72 in}
    setCondensed := #27#15;      {set to 17 chars/in}
    canCondensed := #18;         {cancel 17 chars/in}
    setOneWay    := #27#85#01;   {set unidirectional}
    graphLine    := #27#76#208#2 {set for 8 lines of 720 dots @ 120 dots per in.}

Notes on Individual Modules

kscfit - exploratory regression

kscfit tries to fit 726 different curves to paired data. The curves used are of the form

    f(y) = b0 + b1 * x1 + b2 * x2

where x1 and x2 are transformations of the original predictor variable and f(y) is a transformation of the original dependent variable.

The index of a regression indicates which transformations were used in that regression. The index is a 3 character string: the first character indicates the form used for y, the second character the form used for x1 and the third character the form used for x2. The forms and their character codes are

    0 - the variable does not appear in the equation
    1 - v (the variable itself)
    2 - v²
    3 - 1 / v
    4 - 1 / v²
    5 - ln(v)
    6 - v * ln(v)
    7 - ln(v) / v
    8 - ln(v / (1 - v))
    9 - ln(-ln(1 - v))
    A - √v
    B - 1 / √v

Legal values for the three characters are 1 to B for y, 1 to B for x1 and anything prior to x1 for x2, including 0. This gives 726 possible equations, not all of which can actually be fit for a given data set. (Homework: what is the maximum number that can be fit to any xcolumn - ycolumn pair?)

Examples: the index 662 indicates that the regression is

    y * ln(y) = b0 + b1 * [x * ln(x)] + b2 * x²

110 is the index of the simple linear regression of y on x.

950 is the index of a two parameter Weibull fit to the data, presuming x is failure time and y is cumulative percent failed.

If y is cumulative probability and x is any random variable, 9nn will be a legitimate fit of a probability model. 8nn is a logistic fit.
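The indexing scheme can be made concrete with a short Python sketch. It is an illustration only, not kscfit's code: the names FORMS, CODES, all_indices and transform are hypothetical, and "anything prior to x1" is read here as "any code earlier in the list 0, 1, ..., 9, A, B", an assumption that does reproduce the count of 726.

```python
import math

# Form codes from the list above, as functions of a variable v.
# Code 0 (variable absent) is handled separately in transform().
FORMS = {
    "1": lambda v: v,
    "2": lambda v: v ** 2,
    "3": lambda v: 1 / v,
    "4": lambda v: 1 / v ** 2,
    "5": lambda v: math.log(v),
    "6": lambda v: v * math.log(v),
    "7": lambda v: math.log(v) / v,
    "8": lambda v: math.log(v / (1 - v)),
    "9": lambda v: math.log(-math.log(1 - v)),
    "A": lambda v: math.sqrt(v),
    "B": lambda v: 1 / math.sqrt(v),
}
CODES = "0123456789AB"  # assumed ordering behind "anything prior to x1"


def all_indices():
    """List every legal index: y in 1..B, x1 in 1..B, x2 before x1 (0 allowed)."""
    result = []
    for y_code in CODES[1:]:
        for position, x1_code in enumerate(CODES[1:], start=1):
            for x2_code in CODES[:position]:
                result.append(y_code + x1_code + x2_code)
    return result


def transform(index, x, y):
    """Return (f(y), x1, x2) for one data pair; x2 is None when its code is 0."""
    fy = FORMS[index[0]](y)
    x1 = FORMS[index[1]](x)
    x2 = None if index[2] == "0" else FORMS[index[2]](x)
    return fy, x1, x2


indices = all_indices()
print(len(indices))  # 726

# Index 950, the Weibull form: ln(-ln(1 - y)) against ln(x).
fy, x1, x2 = transform("950", x=100.0, y=0.5)
```

A form that cannot be evaluated for some data value (for example ln(v) with v <= 0) raises an error in this sketch, which is one reason not every index can be fit to every data set.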
Suggestions for additional forms would be appreciated. The program can handle any number of them.

Running kscfit

When you move the cursor to <exploratory regression> in the main menu and hit return, the kscfit screen appears:

┌──────────────────────────────────────────────────────────────┐
│ kscfit                exploratory regr                       │
│                                                              │
│ data file:                                         rows      │
│ codebook file:                                     cols      │
│ output file:                                       of data   │
│ regr file:                                                   │
│                                                              │
│ ind var (x):                   dep var (y):                  │
│ data column:                   data column:                  │
│                                                              │
│ run regs      brief summary   expanded sumry   get reg file  │
│ view forms    view output     disk directory   save reg file │
│                                                              │
│ crindex:        ntosave:        forr:        oort:           │
│                                                              │
│ detail stats    hist resid    comp orig y    conf int y      │
│ plot regr       add resid     add fit y      add conf lim    │
│ plot resid      tabl resid    view data      plot conf lim   │
└──────────────────────────────────────────────────────────────┘

The cursor will be at the data column: prompt for the independent variable. Enter a column number. The cursor will move to the dependent variable data column: prompt. Enter another column number. The cursor will then move to the <run regs> prompt. Hit the enter key and as many of the 726 possible regressions as can be run will be. The number of regressions actually run will appear after the ntosave: prompt. 110 will be the current regression index.

If the number run is large, move the cursor to <brief summary> and hit enter. The regression indices and their adjusted coefficients of determination will be sent to the output file. Move to <view output> and hit enter to see the results.

To see more information use <expanded summary>. This option outputs one line per regression, so output can be voluminous. You can cut this down by changing ntosave to a smaller number. Only the best ntosave regressions will be included in the summary.

If you want a more complete report on a single regression, put that regression's index in crindex and then move to the <detail stats> prompt. Hit enter. To see the results, use <view output>.
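The ranking behind the summaries can be sketched in Python as follows. This is a minimal illustration, not kscfit's implementation: the function names solve and fit are hypothetical, the fit is ordinary least squares through the normal equations, and the adjusted coefficient of determination is taken to be the textbook 1 - (1 - R²)(n - 1)/(n - p), with p counting the intercept. Whether kscfit uses exactly this formula is an assumption.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting.

    A singular system ends in ZeroDivisionError; the sketch does not guard against it."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def fit(predictors, fy):
    """Least squares fit of fy on an intercept plus the given predictor columns.

    Returns (coefficients, adjusted R squared)."""
    n = len(fy)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(design[0])  # number of coefficients, intercept included
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * v for row, v in zip(design, fy)) for j in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    mean = sum(fy) / n
    ss_res = sum((v - f) ** 2 for v, f in zip(fy, fitted))
    ss_tot = sum((v - mean) ** 2 for v in fy)
    r_squared = 1 - ss_res / ss_tot
    return coef, 1 - (1 - r_squared) * (n - 1) / (n - p)


# Made-up data with y = x squared, fitted under indices 110 and 120.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
results = {
    "110": fit([x], y),
    "120": fit([[v ** 2 for v in x]], y),
}
ranked = sorted(results, key=lambda index: results[index][1], reverse=True)
ntosave = 1
best = ranked[:ntosave]  # only the best ntosave indices are kept
```

With this data the index 120 ranks first, since y is exactly x², and 110 follows with a lower adjusted value.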
All of the menu choices in the last three rows except <view data> are for a specific regression, and are always performed for the regression whose index is crindex.

<plot regr> plots the regression line along with the original data. <plot resid> plots the residuals. <tabl resid> shows the residuals in a table. <hist resid> shows them in a histogram. <add resid> adds the residuals to the data matrix, not to the data on disk. To get the residuals or anything else added to the data on disk, select the SD option from the main ksstat menu. <add fit y> adds the predicted y values to the data, and <add conf lim> adds upper and lower confidence limits for either y or the mean of y to the data. You can choose the percent confidence in a submenu that will appear when you select <add conf lim>.

By adding the fitted y values and confidence limits to the data, you can plot y, fitted y, and both confidence limit columns on the same graph to get a nice graphical summary of the fit. The PL option on the main ksstat menu can be used to do this. So can the <plot conf lim> option on this menu.

The <conf int y> option allows you to compute confidence intervals for y or the mean of y for any value of x, whether or not the x is part of the data set.

When you start kscfit, you will see <oort: o> on the lower right. oort stands for o or t, which stands for original units or transformed units. These refer to the units y is reported in under any option that computes either y values or residual values. Any regression with an index starting with anything but 1 has y transformed. The transformed units are probably not physically meaningful, so if you are doing computations for use in your application, original unit results will be most useful.
If, however, you are diagnosing regression results, looking at residuals, possible outliers, etc, the transformed units are appropriate, since the usual statistical properties of residuals (0 mean, asymptotic normality, etc) apply to the residuals in transformed units and <<not>> to the residuals in original units. So select o or t as needed.

<forr: r> of the original menu allows you to toggle between adjusted coefficient of determination (r) and the f statistic as the measure used to rank the regressions from best to worst. This affects the two summary options and the <save reg file> option.

With <save reg file>, you can save the results of the best ntosave regressions in a special file called the reg file. The program can later read this file, allowing you to continue investigating the regression results without first regenerating them from the original data. Be sure to name the reg file on the top of the menu before trying to save the regressions.

The reg file is an ascii file, but don't fool with it: kscfit might barf trying to read a messed up file later. Regression coefficients, standard errors etc. are saved in the reg file to umpteen significant figures, so if you need more sig figs than kscfit normally prints out, you can find them there.

As you've probably guessed, the <get reg file> option does just that. If you retrieve regressions from a reg file and the data file name saved in the reg file is the same as the current data file name, the data is not read. If the data on disk and the data in memory are the same, as kscfit assumes, no problem. If the two data sets are different, however, then strange things could happen. The column names may not match, missing values will be messed up and so on. The best bet is to use different file names for different data sets, avoiding this problem.

One statistics note: you may observe high correlations between x1 and x2 for some models. The natural inclination here is to remove either x1 or x2.
Often, this is a mistake. The high correlation in this situation is neither troublesome nor particularly undesirable, since x1 and x2 are just two components of a transformation of the (single) predictor variable.

kslilfor - Lilliefors' normality test

kslilfor performs the Lilliefors test for normality. Typing li at the main menu brings up the Lilliefors menu:

┌──────────────────────────────────────────────────────────────┐
│ ksstat    Lilliefors test    2/4/91  1:50    247K free mem   │
│                                                              │
│ data file:     kssample.dat                    811 rows      │
│ codebook file: kssample.cod                      7 cols      │
│ output file:   kssample.out                        of data   │
│ lilgraph file: kssample.L01                                  │
│                                                              │
│ variable : 5  Q11                    sig level: 0.050        │
│                                                              │
│ There are 301 nonmissing values and 510 missing values.      │
│ The maximum distance between the sample and normal cdfs of   │
│ 0.2442 occurs at z = -0.3840, data = 2.0000000.              │
│ The critical distance is 0.0518. Normality is rejected.      │
│                                                              │
│ sample skewness is 1.70342.                                  │
│ z for skewness test is 6.300, one tailed p value 0.0000      │
│ sample kurtosis is 6.51401.                                  │
│ z for kurtosis test is 5.415, one tailed p value 0.0000      │
└──────────────────────────────────────────────────────────────┘

Enter the variable, the significance level and, if not previously done, the lilgraph file name to start the test. The program displays the "Working..." message, possibly for what seems to be a long time, and then displays a message asking if the above information is ok. At this point, hit return to actually perform the test. The work done by the program prior to this message is preliminary set up, not the test itself.

During the actual performance of the test, the bottom of the menu will be filled with the information shown in the sample above. The skewness and kurtosis tests are those discussed by d'Agostino, Belanger and d'Agostino in the American Statistician, November 1990.

If screen and/or hpgl graphics output is selected, graphs are produced as usual, either on screen or as a hpgl file or both.
If text graphics output is selected, the results are not sent to the output file. Instead, a special lilliefors graph file is used. This is because text based output is too crude to be useful in this application. If the printer is selected as the lilliefors graph file, output is sent there. If a disk file is selected, graphics information is sent to the file. One disk file is used for each graph. The files have the name you supply and extensions numbered sequentially starting with .L01. The files are set up so that copying them to the printer with a dos command of the form

    copy kssample.L01 prn /b

will produce the graph, with a copy of the menu as a header. The printer commands used are for an Epson, so the graphs may not print properly on non-Epson compatible printers. You will still have a record of the results, though, since the menu with the summary information is written to the output file. The screen and/or hpgl options are probably the easiest ways to get output.

The file kssample.L01 is included with this document for you to practice with.

There must be at least 4 nonmissing data values to do the test.

The Lilliefors test is an attractive alternative to the Chi-Square test usually used to test for normality. The test is simple to use. If the sample CDF falls outside the Lilliefors bounds for the selected significance level, the hypothesis that the data is normally distributed is rejected. Try the program with a few sample data sets to get the idea.

A number of recent statistics books discuss this test, e.g. Iman and Conover, "A Modern Approach to Statistics", Wiley; Conover, "Practical Nonparametric Statistics", Wiley; and Milton and Arnold, "Probability and Statistics in the Engineering and Computing Sciences", McGraw Hill. Conover gives the table of quantiles originally used in this program.
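The distance at the heart of the test, the largest gap between the sample cdf and a normal cdf fitted with the sample mean and standard deviation, can be sketched in Python. This is an illustration only, not kslilfor's code: the function name lilliefors_distance is hypothetical, and using the n - 1 standard deviation is an assumption.

```python
import math


def lilliefors_distance(values):
    """Return (distance, z, value) at the largest gap between sample and normal cdfs."""
    data = sorted(values)
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 nonmissing values")
    mean = sum(data) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
    if sd == 0:
        raise ValueError("all values are equal")
    best = (0.0, 0.0, data[0])
    for i, v in enumerate(data):
        z = (v - mean) / sd
        normal_cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
        # The sample cdf steps from i/n up to (i + 1)/n here; check both sides of the step.
        gap = max(abs((i + 1) / n - normal_cdf), abs(normal_cdf - i / n))
        if gap > best[0]:
            best = (gap, z, v)
    return best


distance, z, value = lilliefors_distance([1.0, 2.0, 3.0, 4.0, 5.0])
```

The sketch stops at the distance itself. Deciding whether normality is rejected still requires the critical distance for the chosen significance level and sample size, which is not computed here.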
The quantiles used in this version come from Dallal and Wilkinson, "An Analytic Approximation to the Distribution of Lilliefors's Test Statistic for Normality" in the November 1986 American Statistician (Vol 40 No 4).

ksxtab - crosstabs

ksxtab does crosstabulations of up to 7 columns of data. The first two columns form the rows and columns of a table. Any additional columns are used as breakdown variables to create multiple tables.

ksxtab, as all ksstat routines, considers each row of data as a single case. It does not accept summary tables as input. It does produce summary tables as output.

The data used to form the tables is usually integer valued, though ksxtab is perfectly happy with any data at all. You may not be happy, though, with the voluminous, nearly empty tables that could appear as output if, say, a breakdown variable is continuous with many distinct values.

Statistics are reported for individual two dimensional tables. I haven't really given these much thought, since I use ksxtab to get information for summary reports and to just take a look at data. For hypothesis testing, I tend to use logistic models and/or bootstrapping. Bootstrapping may eventually appear in ksstat. Right now it's experimental.

If you specify just one column, either as the row variable or as the column variable, ksxtab will produce a frequency table for that variable.

ksmplot - scatter plots

ksmplot produces scatter plots in any or all of the three graphics output options. Up to six variables can be plotted on the vertical axis against one on the horizontal axis. Different plot symbols are used for each dependent variable. The symbols are shown in the ksmplot menu.

To use ksmplot, you must specify the variable on the horizontal axis and as many vertical axis variables as you wish, up to 6. You may use points, lines or points and lines for any plot. The symbols used to plot each variable are fixed, but you can use any of those available.
Just skip down the menu on the left side until you are in the line with the plot symbol you want to use, then specify the variable in that line.

If more than one but fewer than 10 points appear at the same plot position, the number of points plotted there is shown instead of a plot symbol. If 10 or more points appear at a plot position, a type of grid is printed.

You can specify the plotting limits if you wish, or you can leave them as missing and the program will figure them out. If you supply a title, it will be printed at the top of the graph.

kshist - histograms

kshist produces histograms in two different ways, with fixed width cells or with variable width cells in screen or hpgl output. Text output histograms are still another variety that uses a few ideas from stem and leaf plots. In text plot histograms, the histogram is plotted using the first two digits of the standardized data values. A histogram of column 3 of kssample.dat looks like this:

histogram of Q9a. each digit represents 4 observations.
mean: 2.99566474  st dev: 1.48946469  missing: 119  nonmissing: 692

midpoint  freq  stem
  -2.590     0   -3|
  -1.845     0   -3|
  -1.100     0   -2|
  -0.356     0   -2|
   0.389     0   -1|
   1.134    63   -1|333333333333333c
   1.879   204   -0|66666666666666666666666...6666666666666
   2.623     0   -0|
   3.368   283    0|00000000000000000000000...000000000000000000c
   4.113    81    0|66666666666666666666a
   4.857     0    1|
   5.602     0    1|
   6.347     0    2|
   7.092    60    2|666666666666666
   7.836     0    3|
   8.581     0    3|
4.0

The ... above do not appear in the actual histogram. I just put them in and eliminated some digits so the histogram would fit here. The histogram is designed to print in condensed mode on Epson printers, so that 120 columns can be used.

The stem is the first digit of the studentized sample value, (sample value - mean) / sample std dev, and the digits that make up the histogram are the first digits after the decimal point of the studentized values truncated, not rounded.
An entry in the histogram with a stem of 2 and a digit of 6 represents one or more sample values at or above 2.60 and below 2.70 standard deviations above the mean. The number of values represented by each digit, the mean and standard deviation are shown in the header. Each stem appears twice, once for the digits 0 to 4 and once for the digits 5 through 9.

If any studentized values are 4.0 or larger in absolute value, they are printed individually above (<= -4.0) or below (>= 4.0) the histogram itself. In the example above, there is one value at 4.0.

The cell midpoints are shown in the histogram. Each cell is half a standard deviation wide, so the cell boundaries are always the midpoint ± (std dev) / 4.

The letters a, b, c,... that may appear at the right end of a histogram element represent data values not numerous enough to merit a digit of their own, a = 1, b = 2 etc. In the sample, the cell with a count of 81 has 20 6's and an a. Since each 6 represents 4 data values, 20 * 4 + 1 = 81.

The data graphed here is multiple choice survey data, and so is pretty homogeneous. With more diverse data, there could be left over values with different individual digits. These are collected and represented with asterisks at the right, before the letters. For example, if there were a cell in the histogram above whose data to be plotted had the digits

    2 2 2 2 2   3 3 3 3 3 3 3 3 3 3   4 4 4
    \       /   \                 /   \   /
      5 2s             10 3s          3 4s

the histogram cell would appear as

    |233*b

each 2 and 3 representing 4 data values, and the 6 odd data values, a 2, two 3s and the three 4s collected into the *b at the end.

If you specify one or more breakdown variables in the histogram menu, you will get one histogram for each combination of values of the breakdown variables. I'll be waiting for the first irate letter telling me that the user didn't know that 4 breakdown variables with 6, 8, 4 and 7 levels respectively would produce 1344 histograms and lock his computer up for 3 hours.
They will, but Control-C is activated in the program and can be used to break out of such situations. (I hope - I've never tried to produce 1344 histograms of one data column and don't plan to.)

Fixed width histograms are pretty much standard histograms. You may specify the number of cells to use, or let the program pick a reasonable number. To let the program do it, specify 0 as the number of cells (the default).

Variable width histograms take a long time to compute, sometimes a really long time. The total area of all rectangles is 1, and the widths are chosen to maximize the interval covered along the x axis. The algorithm to determine widths is very complicated, which is why it takes so long to compute. The results are often not satisfactory visually, since there can be very tall, very skinny rectangles generated. These tend to be overwhelmed by the limited resolution and size of the display. If you are interested in more information about these, write.

summary statistics

This module produces the usual summary statistics for the data columns you choose to include, or for all data columns except those you choose to exclude. This is the point of the <incl or excl?> part of the menu. To get a summary of all variables, simply type i after <incl or excl?> and then ALL in the next line. To get a summary of all columns except columns 2, 3 and 4, respond with e and 2 3 4 in these places. The output looks like this:

Summary Statistics

variable:             id        Q11        Q12       Q37a
data col:              1          5          6          7
mean           406.00000    2.55482    1.77409    2.48089
std dev        234.25983    1.44493    0.41888    0.74075
skewness         0.00000    1.70342   -1.31085    1.07777
kurtosis         1.80000    6.51401    2.71832   14.13046
coef of var       57.70%     56.56%     23.61%     29.86%
min              1.00000    1.00000    1.00000    1.00000
max            811.00000    9.00000    2.00000    9.00000
range          810.00000    8.00000    1.00000    8.00000
missing                0        510        510          0
nonmissing           811        301        301        811

As with kshist and ksxtab, you can specify breakdown variables to get summaries of subsets of the data.
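For readers who want to cross-check a column by hand, here is a Python sketch of the statistics in the table. It is not the module's own code: the function name summarize is hypothetical, and the definitions are assumptions (n - 1 standard deviation, moment-based skewness m3 / m2^1.5 and kurtosis m4 / m2^2). They do agree with the id column above if that column holds 1 through 811, as its min, max and count suggest.

```python
import math


def summarize(values, missing_codes=()):
    """Summary statistics for one column, leaving out values equal to a missing value code."""
    data = [v for v in values if v not in missing_codes]
    n = len(data)
    mean = sum(data) / n
    # Central moments with divisor n (assumed definitions, see the note above).
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    m4 = sum((v - mean) ** 4 for v in data) / n
    std_dev = math.sqrt(m2 * n / (n - 1))  # sample standard deviation, divisor n - 1
    return {
        "mean": mean,
        "std dev": std_dev,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
        "coef of var": 100 * std_dev / mean,  # in percent
        "min": min(data),
        "max": max(data),
        "range": max(data) - min(data),
        "missing": len(values) - n,
        "nonmissing": n,
    }


stats = summarize([float(v) for v in range(1, 812)])
```

A column with fewer than two nonmissing values, or with all values equal, would divide by zero here; the sketch leaves such cases unhandled.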
For references, see ks.doc. Have fun.