Regression

Regression models have been implemented using XLISP-STAT's object and message sending facilities. These were introduced above in Section [*]. You might want to review that section briefly before reading on.

Let's fit a simple regression model to the bicycle data of Section [*]. The dependent variable is separation and the independent variable is travel-space. To form a regression model use the regression-model function:

> (regression-model travel-space separation)

Least Squares Estimates:

Constant               -2.182472   (1.056688)
Variable  0:           0.6603419   (0.06747931)

R Squared:              0.922901
Sigma hat:             0.5821083
Number of cases:              10
Degrees of freedom:            8

#<Object: 1966006, prototype = REGRESSION-MODEL-PROTO>
>
The basic syntax for the regression-model function is
(regression-model x y)
For a simple regression x can be a single list or vector. For a multiple regression x can be a list of lists or vectors, or a matrix. The regression-model function also takes three optional keyword arguments, :intercept, :print, and :weights. Both :intercept and :print are T by default. To fit a model without an intercept use the expression
(regression-model x y :intercept nil)
To form a weighted regression model use the expression
(regression-model x y :weights w)
where w is a list or vector of weights the same length as y. The variances of the errors are assumed to be inversely proportional to the weights w.
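The role of the weights can be illustrated outside XLISP-STAT. The following Python sketch (using numpy, with made-up data rather than the bicycle measurements) fits a line by minimizing the weighted sum of squared errors, which is the estimator implied by the assumption that Var(e_i) is proportional to 1/w_i:

```python
import numpy as np

# Hypothetical data, for illustration only (not the bicycle data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
w = np.array([1.0, 1.0, 2.0, 2.0, 4.0])   # Var(e_i) assumed proportional to 1/w_i

# Weighted least squares: minimize sum(w_i * (y_i - a - b*x_i)^2).
X = np.column_stack([np.ones_like(x), x])          # design matrix with intercept
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # solves (X'WX) beta = X'Wy
print(beta)                                        # (intercept, slope)
```

Cases with larger weights are treated as more precisely measured and pull the fitted line toward themselves.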

The regression-model function prints a very simple summary of the fitted model and returns a model object as its result. To examine the model further, assign the returned model object to a variable using an expression like

(def bikes (regression-model travel-space separation :print nil))
I have given the keyword argument :print nil to suppress the printing of the summary, since we have already seen it. To find out what messages are available, use the :help message:
> (send bikes :help)
REGRESSION-MODEL-PROTO
Normal Linear Regression Model
Help is available on the following:

:ADD-METHOD :ADD-SLOT :BASIS :COEF-ESTIMATES :COEF-STANDARD-ERRORS :COMPUTE :DF 
:DISPLAY :DOC-TOPICS :DOCUMENTATION :FIT-VALUES :GET-METHOD :HAS-METHOD
:HAS-SLOT :HELP :INTERCEPT :INTERNAL-DOC :ISNEW :LEVERAGES
:MESSAGE-SELECTORS :METHODS :NEW :NUM-CASES :NUM-COEFS :PARENTS
:PLOT-BAYES-RESIDUALS :PLOT-RESIDUALS :PRECEDENCE-LIST :PRINT :R-SQUARED
:RESIDUALS :SAVE :SHOW :SIGMA-HAT :SLOT-NAMES :SLOT-VALUE :SLOTS
:SUM-OF-SQUARES :WEIGHTS :X :X-MATRIX :XTXINV :Y PROTO
NIL
>
Many of these messages are self-explanatory, and many have already been used by the :display message, which regression-model sends to the new model to print the summary. As examples, let's try the :coef-estimates and :coef-standard-errors messages:
> (send bikes :coef-estimates)
(-2.182472 0.6603419)
> (send bikes :coef-standard-errors)
(1.056688 0.06747931)
>

The :plot-residuals message will produce a residual plot. To find out what the residuals are plotted against, let's look at the help information:

> (send bikes :help :plot-residuals)
:PLOT-RESIDUALS

Message args: (&optional x-values)
Opens a window with a plot of the residuals. If X-VALUES are not supplied 
the fitted values are used. The plot can be linked to other plots with the 
link-views function. Returns a plot object.
NIL
>
Using the expressions
	(plot-points travel-space separation)
	(send bikes :plot-residuals travel-space)
Figure: Linked raw data and residual plots for the bicycles example.
we can construct two plots of the data as shown in Figure [*]. By linking the plots we can use the mouse to identify points in both plots simultaneously. A point that stands out is observation 6 (starting the count at 0, as usual).

The plots both suggest that there is some curvature in the data; this curvature is particularly pronounced in the residual plot if you ignore observation 6 for the moment. To allow for this curvature we might try to fit a model with a quadratic term in travel-space:

> (def bikes2  (regression-model (list travel-space (^ travel-space 2))
                                 separation))

Least Squares Estimates:

Constant               -16.41924   (7.848271)
Variable  0:            2.432667   (0.9719628)
Variable  1:          -0.05339121   (0.02922567)

R Squared:             0.9477923
Sigma hat:             0.5120859
Number of cases:              10
Degrees of freedom:            7

BIKES2
>
I have used the exponentiation function ``^'' to compute the square of travel-space. Since I am now forming a multiple regression model, the first argument to regression-model is a list of the x variables.
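As a cross-check on the summary fields, the degrees of freedom, Sigma hat, and R Squared for a quadratic fit can be computed directly from the least squares residuals. A minimal Python sketch with hypothetical data (not the bicycle measurements):

```python
import numpy as np

# Hypothetical data standing in for travel-space/separation.
x = np.array([3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([1.1, 2.3, 4.2, 6.8, 9.9, 13.5, 18.1, 22.8])

# Design matrix for the quadratic model: intercept, x, x^2.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

n, p = X.shape
resid = y - X @ beta
rss = resid @ resid                       # residual sum of squares
tss = ((y - y.mean())**2).sum()           # total sum of squares about the mean
df = n - p                                # "Degrees of freedom" in the summary
sigma_hat = np.sqrt(rss / df)             # "Sigma hat"
r_squared = 1 - rss / tss                 # "R Squared"
print(df, sigma_hat, r_squared)
```

With eight cases and three coefficients this gives five degrees of freedom, matching the pattern of the summary above (ten cases, three coefficients, seven degrees of freedom).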

You can proceed in many directions from this point. If you want to calculate Cook's distances for the observations you can first compute internally studentized residuals as

(def studres (/ (send bikes2 :residuals)
                (* (send bikes2 :sigma-hat) 
                   (sqrt (- 1 (send bikes2 :leverages))))))
Then Cook's distances are obtained as
> (* (^ studres 2) 
     (/ (send bikes2 :leverages) (- 1 (send bikes2 :leverages)) 3))
(0.166673 0.00918596 0.03026801 0.01109897 0.009584418 0.1206654 0.581929 
0.0460179 0.006404474 0.09400811)
The seventh entry – observation 6, counting from zero – clearly stands out.
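The same computation can be spelled out from first principles. This Python sketch (hypothetical data, numpy only) forms the hat matrix, the leverages and residuals, the internally studentized residuals, and the Cook's distances, mirroring the :leverages, :residuals, and :sigma-hat messages used above:

```python
import numpy as np

# Hypothetical simple-regression data (not the bicycle measurements);
# the last point is given high leverage on purpose.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 9.0])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.0, 5.5])

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
h = np.diag(H)                            # leverages (:leverages)
e = y - H @ y                             # residuals (:residuals)
sigma_hat = np.sqrt(e @ e / (n - p))      # (:sigma-hat)

# Internally studentized residuals, then Cook's distances,
# dividing by p = number of coefficients (3 in the quadratic model above).
studres = e / (sigma_hat * np.sqrt(1 - h))
cooks = studres**2 * h / ((1 - h) * p)
print(cooks)
```

This reproduces the division structure of the Lisp expression (* (^ studres 2) (/ h (- 1 h) p)), since (/ a b c) in Lisp computes a/(b*c).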

Another approach to examining residuals for possible outliers is to use the Bayesian residual plot proposed by Chaloner and Brant [6], which can be obtained using the :plot-bayes-residuals message. The expression (send bikes2 :plot-bayes-residuals) produces the plot in Figure [*].

Figure: Bayes residual plot for bicycle data.
The bars represent the mean ±2SD of the posterior distribution of the actual realized errors, based on an improper uniform prior distribution on the regression coefficients. The y axis is in units of $\hat{\sigma}$. Thus this plot suggests that the probability that point 6 is three or more standard deviations from its mean is about 3%; the probability that it is at least two standard deviations from its mean is around 50%.
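Under this flat prior, with $\sigma$ treated as known and set to $\hat{\sigma}$, the realized error for case i has posterior mean equal to the residual e_i and posterior standard deviation $\sigma\sqrt{h_i}$, so the bars can be recomputed from the residuals and leverages. A minimal Python sketch with made-up data (not the bicycle data):

```python
import numpy as np

# Hypothetical simple-regression data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.3, 3.8, 5.2, 5.9])

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                            # leverages
e = y - H @ y                             # residuals
sigma_hat = np.sqrt(e @ e / (n - p))

# Posterior of realized error i: mean e_i, SD sigma*sqrt(h_i)
# (flat prior on coefficients, sigma set to sigma-hat).
# Expressed in units of sigma-hat, as on the plot's y axis:
center = e / sigma_hat
half_width = 2 * np.sqrt(h)               # +/- 2 posterior SDs
lo, hi = center - half_width, center + half_width
print(lo, hi)
```

A bar whose whole extent sits far from zero flags a case whose realized error is plausibly large, which is how point 6 stands out in the figure.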


