Exploratory Regression using CFIT
Copyright (C) 1986
Joseph C. Hudson
Introduction
CFIT fits up to 196 different curves to paired data using
least squares regression. The program can report results sorted
by F value or by adjusted coefficient of determination. It can
plot the observed data together with any of the fitted curves.
CFIT can report a variety of diagnostic information, including
histograms of residuals, anova tables and various standard
errors. Fitted Y values may be calculated. The regressions
generated by CFIT can be saved to disk for future use.
Data Requirements
CFIT accepts data in SDA (space-delimited ASCII) format.
In SDA format, variables are in columns and observations are in
rows. Each row may be in free format, with one or more spaces
separating the values. Tabs may be used as separators, but
commas may not. The data set should contain only numbers, with no
letters or symbols except minus signs and decimal points. Missing
data must be coded with some value, not just left out, since
each row must have the same number of entries. Any number that
does not otherwise occur in the data set can serve as the
missing-value code. There can be up to 250 rows of data and any
number of columns. CFIT reads only the two columns it needs:
the column of X values and the column of Y values.
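The SDA rules above can be checked before a file is handed to CFIT. What follows is a minimal sketch in Python, not part of CFIT itself. The function name read_sda, the inline data and the missing-value code -99 are all made up for illustration.

```python
MAX_ROWS = 250  # row limit stated above for the distributed program

def read_sda(text, x_col, y_col, omit=()):
    """Parse SDA text and return (x, y) lists for the chosen 1-based columns.

    Rows whose X or Y equals a value in `omit` (missing-value codes) are skipped.
    """
    # split() with no argument accepts runs of spaces and tabs, but not commas
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("no data rows found")
    if len(rows) > MAX_ROWS:
        raise ValueError(f"{len(rows)} rows exceeds the {MAX_ROWS}-row limit")
    width = len(rows[0])
    xs, ys = [], []
    for number, fields in enumerate(rows, start=1):
        if len(fields) != width:
            raise ValueError(f"row {number} has {len(fields)} entries, expected {width}")
        x, y = float(fields[x_col - 1]), float(fields[y_col - 1])
        if x in omit or y in omit:
            continue
        xs.append(x)
        ys.append(y)
    return xs, ys

# Hypothetical three-row file: a space, a tab, and a missing Y coded as -99.
sample = "1.187 .980\n2.397\t.959\n2.564 -99\n"
print(read_sda(sample, x_col=1, y_col=2, omit={-99.0}))
```

A row with the wrong number of entries, or a value containing a comma or a letter, raises ValueError here instead of reaching the program.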
Printer Requirements
CFIT uses printer control codes for reset printer, formfeed,
condensed pitch, unidirectional printing, 6/72 inch linefeed and
12/72 inch linefeed. These codes appear in only one place in the
source code, the procedure SetPrintVars. Changing them there and
recompiling will adapt CFIT to any printer capable of these actions.
The unidirectional print command is fairly important, since CFIT
prints tables and graphs that can look ragged without it. CFIT
comes with printer codes that work with Epson FX-80, LX-80 and
RX-80 and Star Gemini-Delta printers. The codes do not work with
Star SG10 or Epson LQ printers.
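For readers adapting SetPrintVars, the six actions listed above correspond to the following byte sequences in Epson's ESC/P command set for 9-pin printers. This is a reference sketch in Python, not the Turbo Pascal source. The dictionary keys are made-up labels, and the values CFIT actually ships with were not checked against its source, so treat the table as an assumption to verify against your printer manual.

```python
ESC = b"\x1b"

# Standard Epson ESC/P sequences for 9-pin printers (FX/LX/RX class).
# Key names are illustrative; they are not identifiers from CFIT.
PRINTER_CODES = {
    "reset":          ESC + b"@",        # ESC @    initialize printer
    "formfeed":       b"\x0c",           # FF
    "condensed":      b"\x0f",           # SI       select condensed pitch
    "unidirectional": ESC + b"U\x01",    # ESC U 1  print in one direction only
    "linefeed_6_72":  ESC + b"A\x06",    # ESC A 6  line spacing of 6/72 inch
    "linefeed_12_72": ESC + b"A\x0c",    # ESC A 12 line spacing of 12/72 inch
}

for name, code in PRINTER_CODES.items():
    print(f"{name:15s} {code.hex(' ')}")
```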
Running CFIT
CFIT1287 requires an 8087 math coprocessor. If one is not
installed, use CFIT12, which does not use the 8087. Have the data
file and any output files from previous runs in the same directory.
To run, just type CFIT12 (or CFIT1287) and respond to the prompts.
CFIT Page 2
CFIT first asks you to turn your printer on and set it to
top of form. After doing this, strike any key and the printer
will be set up. CFIT keeps track of how many lines have been
printed and avoids splitting output over page boundaries whenever
possible. If the page alignment gets out of step, you can set up
the printer again from the main menu (option U).
The Screens
The first screen CFIT presents is the main menu:
*==============================================================*
|                                                              |
|  C  Compute Y Hat Values                                     |
|  D  Print Details of a Fit                                   |
|  F  Find Y in Original Units Given Y in Transformed Units    |
|  H  Print Histogram Of Residuals                             |
|  L  List Files On Disk                                       |
|  N  Run New Regression                                       |
|  P  Plot Fitted Curve                                        |
|  Q  Quit                                                     |
|  R  Report Regression Results                                |
|  S  Save Regressions to Disk                                 |
|  T  Print Table of Residuals                                 |
|  U  Setup Printer                                            |
|                                                              |
|  You can only use the G, L, N and Q options now.             |
*==============================================================*
The first time through, a regression needs to be run. Responding
N prompts the following questions:
*==============================================================*
|                                                              |
|  What is the name of the data file? SAMPLE.SDA               |
|                                                              |
|  There are 17 rows and 2 columns of data.                    |
|                                                              |
|                                                              |
|  Which column contains X? 1                                  |
|  Which column contains Y? 2                                  |
|                                                              |
|                                                              |
|  Do you want to omit any X or Y values ? N                   |
|                                                              |
*==============================================================*
The responses shown can be used to run the sample data. If
you want to omit some X and/or Y values, like codes for missing
data, answering Y to the last question will prompt a series of
questions allowing you to specify values to be omitted.
Entering G at the main menu produces the following prompt:
*====================================================================*
|                                                                    |
|The data file and the output files must be in the same directory or |
|subdirectory                                                        |
|                                                                    |
|Type a name for the output files, without an extension.             |
|                                                                    |
|You may include a path and must include a name, e.g. A:\DATA\OUTFILE|
|where DATA is a subdirectory and OUTFILE is a file name.            |
|Do not include a "dot" or a 3 letter extension.                     |
|                                                                    |
|B:SAMPLE                                                            |
*====================================================================*
If you saved the regression information from a previous session
with CFIT, this option allows you to retrieve it. The data file
and the two files created with the save command, a .REG file and
a .HDR file, should all be available in the same path you specify
as part of the name for output files. The output file name does
not have to be the same as the data file name, but the data file
should have the same name as it did when the regressions were
originally run.
After either getting or running a regression, you will be
presented with the main menu shown above, except now you have
all choices available. The program will return to this menu
after every operation. All of the menu selections lead to
additional prompts. Two of the most common are:
*=====================================*
|                                     |
|  Do you want the                    |
|     O  Original                     |
|  or T  Transformed data?            |
|                                     |
*=====================================*
*================================================================*
|                                                                |
|What is the regression number of the regression you want to use?|
|                                                                |
*================================================================*
Original data is the data as it comes from the data file.
Transformed data is what results when Y is fitted using any of
the forms used by CFIT except (1), described below; form (1)
leaves Y unchanged.
The regression number is a number assigned by CFIT when it
generates the regressions. It can be found by running Report
Regression Results. Run this option right after the regressions
are generated.
Crashing the System
CFIT has built-in error detection for most input and output.
There are only a few places where you need to be careful.
When running a regression for the first time, CFIT reads the
data file and counts the number of rows and columns. It does not
check that every row has the same number of entries. If a row is
short, CFIT could read beyond the end of the file and crash.
If the data set contains more than 250 rows, strange things
could happen. The program will continue reading past the 250th
row using memory that is probably in use for something else. The
only remedy for this is not to do it. Changing the constant
MaxNumRow at the beginning of the source code will allow larger
data sets to be used. I have successfully run CFIT with as many
as 1225 rows of data.
If you start an operation you want to abandon, you can
sometimes back out. If you just enter a carriage return when
CFIT asks for a name, CFIT will abort what you are doing and
go back to the main menu after another carriage return. Other
times, you'll have to grin and bear the unwanted output.
How CFIT Works
This section is a bit technical. It might help to go to the
sample run and come back here when you have questions.
CFIT uses these forms:
          F = 0                    (0)
          F = X                    (1)
          F = X^2                  (2)
          F = 1 / X                (3)
          F = 1 / X^2              (4)
          F = Ln ( X )             (5)
          F = X * Ln ( X )         (6)
          F = [ Ln ( X ) ] / X     (7)
to fit equations of the form
          YHAT = A + B1 * X1 + B2 * X2     (8)
using least squares. Y can use any of the forms (1) through (7).
Both X1 and X2 are transformations of the predictor variable X.
X1 can use any of the forms (1) through (7) and X2 can use (0)
through (6). All choices of form are subject to restrictions.
No form is allowed if it cannot be computed for all data points.
If X contains one or more zeros, forms (3) through (7) cannot be
used for either X1 or X2. X cannot use forms (5), (6) or (7)
with negative data. These same restrictions apply to Y, as well
as two additional ones. In order to compute fitted Y values,
YHAT, from (8), Y must be a monotonic function of the right hand
side of (8). Forms (2), (4), (6) and (7) are not monotonic, so
additional restrictions must be placed on Y when using these
forms. To use (2) and (4) for Y, all observations must be larger
than 0. To use (6), all Y values must be larger than 1/e. For
(7), all Y values must be larger than e. The forms are monotonic
over these ranges.
The program automatically checks these restrictions, as
well as one other. If X is dichotomous, for example if X can
only be 0 or 1, then X1 and X2 will be perfectly correlated
regardless of the forms chosen. The correlation between X1 and
X2 can be near 1 in other circumstances as well. If this hap-
pens, the model is not fitted.
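The form restrictions described above can be sketched in Python as follows. The function names are made up, and one rule is an inference, not something this manual states: X2's form number is assumed to be lower than X1's, because that pairing gives exactly the 196 curves quoted in the Introduction. The correlation check between X1 and X2 is left out, so this covers only the domain and monotonicity rules.

```python
import math

# Smallest value Y must exceed for each restricted form, per the text above.
Y_LOWER_BOUND = {2: 0.0, 4: 0.0, 5: 0.0, 6: 1 / math.e, 7: math.e}

def x_form_ok(form, values):
    """Can predictor form 0-7 be computed for every X value?"""
    if form in (0, 1, 2):
        return True
    if form in (3, 4):
        return all(v != 0 for v in values)
    return all(v > 0 for v in values)      # forms 5, 6, 7 need Ln(X)

def y_form_ok(form, values):
    """Domain rules plus the extra monotonicity limits placed on Y."""
    if form == 1:
        return True
    if form == 3:
        return all(v != 0 for v in values)
    return all(v > Y_LOWER_BOUND[form] for v in values)

def candidate_models():
    """(Y form, X1 form, X2 form) triples, ASSUMING X2's number is below X1's."""
    return [(y, x1, x2)
            for y in range(1, 8) for x1 in range(1, 8) for x2 in range(x1)]

def admissible_models(xs, ys):
    return [(y, x1, x2) for (y, x1, x2) in candidate_models()
            if y_form_ok(y, ys) and x_form_ok(x1, xs) and x_form_ok(x2, xs)]

# The five rows of SAMPLE.SDA that are listed in the Sample Run section.
xs = [1.187, 2.397, 2.564, 3.024, 3.364]
ys = [0.980, 0.959, 0.938, 0.917, 0.895]

print(len(candidate_models()))                  # 196
print(len(admissible_models(xs, ys)))           # 168
print(candidate_models().index((6, 6, 2)) + 1)  # 158
```

Under that assumption, the triple quoted for regression 158 in the Sample Run (Y and X1 in form 6, X2 in form 2) falls at position 158 of the list, which is some support for the inferred ordering. With these five rows, only form 7 for Y is ruled out.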
The regressions generated by CFIT are sorted in two ways,
by F Ratio and by Adjusted Coefficient of Determination. Sorting
is done using a heap sort procedure (Scheid [1982]). The results
of the sorts are saved as linked lists.
CFIT is written modularly in Turbo Pascal. Variable and
procedure names were chosen to be as mnemonic as possible. Those
names that are not mnemonic, as well as most that are, follow
the notation of Snedecor and Cochran [1967]. To follow the pro-
gram logic, start reading in the menu procedure and follow the
flow from there. There are few commonly used diagnostic statis-
tics that CFIT does not already calculate.
Sample Run
The file SAMPLE.SDA is an SDA data file containing data
from Myers et al. [1959, page 305]. This is failure data for a
reliability test in which 17 of the items on test ultimately
failed. Column 1 contains the failure times, in thousands of
hours, and column 2 contains the estimated reliability. The
purpose of running CFIT with this data is to establish a model
for estimated reliability, Y, as a function of time, X.
SAMPLE.OUT has CFIT output for this data. Type this file to
the printer to follow the discussion.
The Report Regression Results output is shown on page 1 of
SAMPLE.OUT. To generate this output, I specified "Sort by F Ratio"
and "Print 10 Regressions" in response to prompts. In the output,
CFIT prints a three-line header summarizing the run. It tells us
168 regressions were run; the other 28 of the 196 possible were
not, because one or more constraints were violated. The table shows
summary information about the 10 best regressions sorted by F
ratio and the 168th, the worst regression. The second column
shows regression numbers, which you need to tell CFIT which
regression to work with. The next three columns identify the
transformations used on Y, X1 and X2. These numbers are the
ones used in the "How CFIT Works" section. As an example of using
these, look at the first regression reported, number 158. Both
Y and X1 use form 6 and X2 uses form 2. Putting these forms
together with the coefficients shown in the A, B1 and B2 columns,
the regression equation is
YHAT * Ln(YHAT) = .0437 + .1730 * X * Ln(X) - .0172 * X^2 .
This equation cannot be put in the form Y = f(X). The C and F
options in the main menu are there to handle results like this.
Read the discussion of the table of residuals output for more
information.
The last columns in the Report Regression Results table
show the residual degrees of freedom, the F value, the coefficient
of determination, R^2, the adjusted coefficient of determination,
RBAR^2, and the standard error of estimate.
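For reference, the F value and RBAR^2 columns can be related to R^2 with the usual textbook definitions. The Python sketch below assumes those standard formulas, with k predictors plus an intercept. CFIT's source was not consulted, so whether it computes them exactly this way is an assumption. The function names and the R^2 value in the example are made up.

```python
def f_ratio(r2, n, k=2):
    """F statistic: regression mean square over residual mean square."""
    residual_df = n - k - 1
    return (r2 / k) / ((1 - r2) / residual_df)

def adjusted_r2(r2, n, k=2):
    """Coefficient of determination adjusted for degrees of freedom."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative only: 17 observations and two predictors leave 14 residual
# degrees of freedom; the R^2 of 0.90 is not taken from SAMPLE.OUT.
print(f_ratio(0.90, n=17))
print(adjusted_r2(0.90, n=17))
```

When X2 uses form (0), only one predictor is present, and k=1 would be the matching choice.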
The Print Details of a Fit output for regression 158
is also shown on page 1. The ANOVA table shows the regression
sum of squares partitioned in both possible ways. Following
this, a table shows the regression coefficients, beta coeffi-
cients, standard deviations of the regression coefficients and
Student's T ratios for the regression coefficients. Next, the
correlation between X1 and X2 is printed. Here, at -.9969, it is
high enough to cause some concern. Looking at alternative re-
gressions, like 143 or 153, might find one with a lower correla-
tion between predictors.
Finally, the values of C11, C22 and C12 are reported. These
values are needed to compute SMuHat, which we will do below.
Page 2 shows a histogram and table of residuals. Ideally,
the histogram of residuals in transformed units, as shown,
should be symmetric about the zero cell, bell shaped, with only
about 6% of the observations in the tail cells.
The table of residuals gives information necessary for
writing Student's T based confidence intervals. We go through
one example here. The data in SAMPLE.SDA is
     Row    Failure Time ( X )    Reliability ( Y )
      1           1.187                 .980
      2           2.397                 .959
      3           2.564                 .938
      4           3.024                 .917
      5           3.364                 .895
      .             .                     .
      .             .                     .
      .             .                     .
Suppose we want a 90% confidence interval for the relia-
bility of the system at X = 2,564 hours. This is the value of X
in the third row of data, so we use the third row of the resid-
uals table to find YHAT = -.04088217 and SMuHat = .0043426339.
From the top row of output, we find SEE = .01021524, and use
this to calculate SYHat,
SYHat = ( SMuHat^2 + SEE^2 )^.5 = .0111000.
Finally, because a two-sided 90% interval leaves 5% in each tail,
we need the 95th percentile of the Student's T distribution
with 14 degrees of freedom. This can be found in the T
table of many statistics texts, or we can compute it using TINV,
a program supplied with CFIT for doing just this. Running TINV
and specifying DF = 14 and Prob = .95, TINV reports Percentile
as 1.763. The confidence interval is the interval from
YHAT - SYHat * Percentile to YHAT + SYHat * Percentile.
Performing the calculations, we get -.0605 and -.0213.
These are confidence limits for the transformed Y. To get limits
for Y in the original units, we need to invert the transformation.
This is where menu option F comes in. Select option F with
regression 158 and respond to the prompt with -.0605. CFIT
reports back .9375 for Y in Orig Units. Doing the same with
-.0213, CFIT gives us .9785. So we are 90% confident that the
system reliability at 2,564 hours is between .9375 and .9785.
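The arithmetic of this example can be retraced with a short Python sketch. The numbers are the ones quoted above. The function name invert_form6 is made up, and bisection is my own stand-in for the inversion: this manual does not say how menu option F solves Y * Ln(Y) = value internally.

```python
import math

# Values quoted in the worked example above.
yhat = -0.04088217        # fitted value in transformed units (row 3)
s_mu_hat = 0.0043426339   # SMuHat from the table of residuals
see = 0.01021524          # standard error of estimate
t_95 = 1.763              # percentile reported by TINV for DF = 14, Prob = .95

s_y_hat = math.sqrt(s_mu_hat**2 + see**2)
# Rounded to four places, as the limits are when typed back into option F.
lower_t = round(yhat - s_y_hat * t_95, 4)
upper_t = round(yhat + s_y_hat * t_95, 4)

def invert_form6(target, lo=1 / math.e, hi=1.0, tol=1e-10):
    """Solve y * ln(y) = target for y on [1/e, 1], where form 6 is increasing."""
    def f(y):
        return y * math.log(y)
    if not f(lo) <= target <= f(hi):
        raise ValueError("target is outside the range of form 6 on [1/e, 1]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"SYHat = {s_y_hat:.7f}")
print(f"transformed limits: {lower_t} to {upper_t}")
print(f"original units:     {invert_form6(lower_t):.4f} to {invert_form6(upper_t):.4f}")
```

The search interval starts at 1/e because form 6 is only monotonic above that point, which is the same restriction placed on Y in "How CFIT Works".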
If we needed to do this computation for a value of X that
was not part of the original data, we would use menu option C to
compute YHAT. This option will also give SMuHat with transformed
data. Alternatively, SMuHat can be computed by hand using the
formula
SMuHat^2 = SEE^2 ( 1/n + C11*X1^2 + C22*X2^2 + 2*C12*X1*X2 ).
C11, C22 and C12 are given by Print Details of a Fit.
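The hand formula translates directly into a few lines of Python. This is only a sketch: the function name is made up, and the inputs in the example are invented so the arithmetic is easy to check. This manual does not say whether X1 and X2 in the formula are the transformed values themselves or their deviations from the sample means, so check the result against a SMuHat value printed by CFIT before relying on it.

```python
import math

def s_mu_hat(see, n, c11, c22, c12, x1, x2):
    """SMuHat from the formula above; x1 and x2 are its X1 and X2 terms."""
    variance = see**2 * (1 / n + c11 * x1**2 + c22 * x2**2 + 2 * c12 * x1 * x2)
    return math.sqrt(variance)

# Made-up inputs, NOT taken from SAMPLE.OUT:
# 4 * (0.25 + 0.5 + 1.0 - 0.5) = 5, so the result is sqrt(5).
print(s_mu_hat(see=2.0, n=4, c11=0.5, c22=0.25, c12=-0.125, x1=1.0, x2=2.0))
```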
The third page of SAMPLE.OUT shows a plot of regression 158
and the original data. The regression equation is plotted with
asterisks and the original data with zeros. If * and 0 occur at
the same plot position, the 0 is printed. Multiple observations
at the same plot position are plotted with the numbers 2 through
9. A number sign, #, indicates that 10 or more observations are
plotted at the same position.
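The symbol rules for one plot position can be summarized in a small Python function. The name and signature are made up for illustration and are not taken from the CFIT source. The rule for a position that holds both the curve and several observations is my reading of the text, which states it only for a single observation.

```python
def plot_symbol(n_observations, on_curve):
    """Return the character printed at one plot position."""
    if n_observations >= 10:
        return "#"                  # 10 or more observations
    if n_observations >= 2:
        return str(n_observations)  # 2 through 9 observations
    if n_observations == 1:
        return "0"                  # an observation hides the curve's asterisk
    return "*" if on_curve else " "

print([plot_symbol(n, on_curve=True) for n in (0, 1, 2, 9, 10, 25)])
```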
In the plot shown, notice that the fitted curve flattens
out abruptly at .37. This is because form 6, used for Y here,
truncates at 1/e, about .37. Since Y is reliability, a number
between 0 and 1, form 6 is probably inappropriate. We may want
to look at regressions 46, 31, 41 or 18, reported on page 1, for
alternate models.
Recompiling CFIT
If you recompile CFIT to increase maximum sample size or
change printer codes, be sure to give the appropriate value to
the constant TurboType. If you use Turbo Pascal, TurboType
should be 88. If you use Turbo-87, set TurboType to 87.
List of References
Myers, R. H., H. I. Dwyer, B. P. Goldsmith, R. L. McLaughlin
and J. D. Stevenson. 1959. Production and Field Reliability.
Electronics Division of the American Society for Quality
Control. Milwaukee.
Scheid, Francis. 1982. Computers and Programming. Schaum's
Outline Series in Computers. McGraw-Hill. New York.
Snedecor, George W. and William G. Cochran. 1967. Statistical
Methods, sixth edition. Iowa State University Press. Ames, Iowa.
The Bottom Line
I believe in a statement that appeared in a computer magazine
recently, that no user values any software above $4.95.
That's what I am asking for CFIT. (Gimme a break, let's make it
5 bucks.) This won't make me rich or you poor, but if enough
people think CFIT is worth $5 a copy, it will encourage me to
continue distributing programs this way.
So that's it. If you try CFIT and find it useful, please
pay $5.00 for each actively used copy. An invoice follows for
your convenience.
CFIT INVOICE
Make checks Payable to Joseph C. Hudson
4198 Warbler Dr.
Flint, MI 48504
Quantity Item Total
________ Copies of CFIT @ $5 each ______________
Name and address of sender:
____________________________________________
____________________________________________
____________________________________________
Your comments and suggestions are appreciated. Thanks for your
support.