Linear regression


Linear regression analysis

Regression analysis is concerned with making predictions of the values of one variable (known as the dependent variable, usually 'y') from the values of another variable (known as the independent variable, usually 'x'). It is effectively concerned with investigating the nature of the relationship that may exist between two or more variables. Because it is concerned with making predictions it has greater investigative potential than simple correlation analysis between two variables which, although useful, simply informs us of the strength of the relationship in quantitative terms.

Related tools:

Pearson Product Moment correlation: Bivariate correlation analysis.
Spearman Rank correlation: Non-parametric correlation.

If a statistically significant correlation has been observed between two variables then they can be related mathematically through regression analysis, enabling a trend line, or line of best fit, to be fitted to the data when one variable is plotted against the other.

It is possible to calculate the equations of two regression lines for linear data relationships: the 'regression line of y on x' and the 'regression line of x on y'. The tool determines the equation for the former regression line (i.e., the regression line of the dependent variable 'y' on the independent variable 'x'). The reason for this is that it is orthodox to design experiments carefully so that the independent 'x' variable is the 'fixed' variable (e.g., concentration in µg/ml, temperature in °C, or, in the example data below, IQ level). In effect, this then allows us to concentrate on deviations about a theoretical 'y' population mean and make corresponding predictions of values of 'y'.

The regression line of 'y' on 'x' takes the form of:

y = bx + a (sometimes seen as: y = mx + c)
where,
'b' is the regression coefficient describing the slope of the regression line.
'a' is the intercept of the regression line where it cuts, or intercepts, the y-axis when the data are plotted.

Once both 'a' and 'b' have been obtained it is straightforward to predict values of 'y' by substituting values of 'x' into the expression. Make sure that you have not extrapolated the regression line beyond the range of the data used to derive it! At the extremes of the data range the relationship may not necessarily still be of the same linear nature.
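
As an illustration only, the slope and intercept can be reproduced from the example data further below with a few lines of Python. This is a minimal sketch of the underlying least-squares arithmetic, not the code used by the tool, and the variable names are purely illustrative:

  # Example data from the 'Raw data' table below
  xs = [118, 99, 118, 121, 123, 98, 131, 121, 108,
        111, 118, 112, 113, 111, 106, 102, 113, 101]   # independent variable (IQ)
  ys = [66, 50, 73, 69, 72, 54, 74, 70, 65,
        62, 65, 63, 67, 59, 60, 59, 70, 57]            # dependent variable (reading score)

  n      = len(xs)
  mean_x = sum(xs) / n
  mean_y = sum(ys) / n

  # slope:     b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
  # intercept: a = mean_y - b * mean_x
  sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
  sxx = sum((x - mean_x) ** 2 for x in xs)
  b = sxy / sxx                       # ~0.67 for these data
  a = mean_y - b * mean_x             # ~-11.26 for these data

  # Predict y for a value of x that lies WITHIN the observed range (98-131)
  y_predicted = b * 110 + a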

If the relationship between your 'x' and 'y' variables appears not to be linear then it may be possible to log, square root or arcsine transform the data so that it is legitimate to perform the regression analysis. If this is not the case then more complex analysis may be performed which is beyond the scope of this package. If you need to fit complex lines to curvilinear data then 'PolyFit' by Camiel Rouweler (Aminet misc/sci directory) is recommended.
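
For example, where 'y' appears to grow roughly exponentially with 'x', a log transform of the 'y' values may straighten the relationship enough for linear regression to be legitimate. A minimal Python sketch (the data here are hypothetical and assume all 'y' values are positive):

  import math

  ys_raw         = [2.1, 4.3, 8.8, 17.5, 36.2]      # hypothetical curved data
  ys_transformed = [math.log(y) for y in ys_raw]    # regress these against x instead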

Script operation

This tool operates in a similar way to the others with some small differences. Raw sample data is entered in the input requestor as two columns. During the calculations a further requestor will appear to obtain information about whether the dependent variable is to be found in the first data column. Answer 'Y' or 'N' to proceed.

Click here for information about general script usage.

Note that column headings for either variable may be included in the input range but will not be used in the output. Typical spreadsheet output is printed below:

 Raw data:

         IQ         Reading Scores
    (independent      (dependent
      variable)        variable)
          x                y

         118              66
          99              50
         118              73
         121              69
         123              72
          98              54
         131              74
         121              70
         108              65
         111              62
         118              65
         112              63
         113              67
         111              59
         106              60
         102              59
         113              70
         101              57


 Spreadsheet output:

 Least Squares Regression

 Predicted Values    Regression Statistics

     67.8932         n:                           18
     55.1486         Pearson r:                0.8999
     67.8932         r sq.:                    0.8098
     69.9055         Std.Err.of Est.:        146.8964
     71.247          Intercept(a):           -11.2576
     54.4778         Slope(b):                 0.6708
     76.6132
     69.9055         T-test
     61.1855
     63.1978         Std.Err.of Reg.Coef.:     0.0813
     67.8932         t:                        8.2548
     63.8685         d.f.:                         16
     64.5393         P(T<=t) one-tail:              0
     63.1978         T-Critical(95%):          1.7459
     59.8439         T-Critical(99%):          2.5835
     57.1609         P(T<=t) two-tail:              0
     64.5393         T-Critical(95%):          2.1199
     56.4901         T-Critical(99%):          2.9208


                     ANOVA
                     Source of     Sum of     Mean      Degrees of   F-ratio
                     Variation     Squares    Squares   Freedom

                     Regression:  625.6036   625.6036        1       68.141
                     Residual:    146.8964      9.181       16
                     Total:        772.5                    17

                     P(F Sample<=f) one-tail:        0
                     F-Critical(95%):            4.494
                     F-Critical(99%):           8.5309

Interpretation

There are several components to the output provided by this tool. On the left-hand side is a column of 'Predicted Values'. These are the predicted 'y', or dependent variable, values based on the equation of the regression line. For example, the equation of the line in the example above is:

y = 0.6708x - 11.2576

and when the known values of 'x' are fed into this expression the predicted values of 'y' are obtained and output to the spreadsheet.
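
For example, taking the first data point (x = 118): 0.6708 × 118 - 11.2576 ≈ 67.90, which corresponds to the first predicted value of 67.8932 (the small discrepancy arises because the displayed coefficients are rounded).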

The t-test and ANOVA output are designed to test the statistical significance of the regression analysis. In order to do this it is necessary to set up a null hypothesis which may be tested. For example, it may be assumed that if there were no functional relationship between 'x' and 'y' then the slope (the regression coefficient) would be zero. More importantly, the slope of the sample (known as 'b') may be something other than zero, but this sample may, or may not, be representative of the population slope (known as 'β') from which the sample was derived.

A typical null hypothesis may be proposed as:

H0: β = 0

This may be rejected only if the probability of obtaining the computed value of 'b', from sampling a population that actually has β = 0, is sufficiently small (i.e., less than 0.05 or 0.01, etc.).

ANOVA analysis

In the ANOVA section of the output several intermediate statistics are computed and output to the spreadsheet. The total SS (sum of squares) is a measure of the overall variability of the dependent variable and the regression SS is a measure of the amount of variability among the 'y' values resulting from a linear regression. These two values will be identical only if all data points fall exactly on the regression line. The degrees of freedom associated with the total variability of 'y' values is n-1 and that associated with the variability of 'y' values due to regression is always 1 in simple linear regression. The MS (mean squares) statistics are calculated from this information by MS = SS/d.f. and the F-ratio is determined by F = regression MS/residual MS.
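
These calculations can be sketched in Python as follows, repeating the least-squares setup from the earlier sketch for completeness. This is only an illustration of the arithmetic, not the tool's own code:

  xs = [118, 99, 118, 121, 123, 98, 131, 121, 108,
        111, 118, 112, 113, 111, 106, 102, 113, 101]
  ys = [66, 50, 73, 69, 72, 54, 74, 70, 65,
        62, 65, 63, 67, 59, 60, 59, 70, 57]

  n      = len(xs)
  mean_x = sum(xs) / n
  mean_y = sum(ys) / n
  sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
  sxx = sum((x - mean_x) ** 2 for x in xs)
  b = sxy / sxx
  a = mean_y - b * mean_x
  fitted = [b * x + a for x in xs]                        # the 'Predicted Values' column

  ss_total      = sum((y - mean_y) ** 2 for y in ys)      # total SS      (~772.5)
  ss_regression = sum((f - mean_y) ** 2 for f in fitted)  # regression SS (~625.6)
  ss_residual   = ss_total - ss_regression                # residual SS   (~146.9)

  df_regression = 1               # always 1 in simple linear regression
  df_residual   = n - 2           # 16 in this example
  ms_regression = ss_regression / df_regression           # MS = SS / d.f.
  ms_residual   = ss_residual / df_residual               # ~9.18
  f_ratio       = ms_regression / ms_residual             # ~68.1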

At the 0.05 level of significance the one-tailed F-critical value for 1 and 16 degrees of freedom can be seen to be 4.494. In this particular case it is found that the null hypothesis mentioned above should be rejected. The reason for this is that the computed F-statistic has exceeded the critical value, so the probability of obtaining such a value of 'b' from a population where β = 0 is lower than 5%.
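
Should you wish to verify these critical values independently, they can be reproduced with SciPy's F distribution (this assumes a Python environment with SciPy installed and is not part of the tool itself):

  from scipy import stats

  f_ratio   = 68.141                      # F-ratio from the output above
  f_crit_95 = stats.f.ppf(0.95, 1, 16)    # one-tailed 95% critical value, ~4.494
  f_crit_99 = stats.f.ppf(0.99, 1, 16)    # ~8.53
  p_value   = stats.f.sf(f_ratio, 1, 16)  # P(F >= 68.141), effectively 0

  reject_h0 = f_ratio > f_crit_95         # True, so H0: beta = 0 is rejected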

For further details about ANOVA see the one-way ANOVA section of this guide.

t-test analysis

In the t-test section of the output summary statistics are also generated. Here the t-test procedure is equivalent to the ANOVA test when the general null hypothesis is H0: β = 0. Although most significance testing of simple linear regression analysis will employ this hypothesis, the t-test, unlike the ANOVA test, also allows directional null hypotheses to be tested, such as β < 0 or β > 0.

Where H0: β = 0, it can be seen in the output that for a two-tailed test at the 0.05 level of significance the null hypothesis should also be rejected, as the calculated value of 't' (i.e., 8.2548) exceeds the critical value of 2.1199.
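
The same decision can be reproduced from the printed regression statistics. The sketch below uses SciPy for the critical values and is, again, only an illustration rather than the tool's own method:

  from scipy import stats

  b    = 0.6708                              # slope from the output above
  se_b = 0.0813                              # standard error of the regression coefficient
  t_stat = b / se_b                          # ~8.25 (the output prints 8.2548 from unrounded values)

  t_crit_95 = stats.t.ppf(1 - 0.05 / 2, 16)  # two-tailed 95% critical value, ~2.1199
  t_crit_99 = stats.t.ppf(1 - 0.01 / 2, 16)  # ~2.9208
  reject_h0 = abs(t_stat) > t_crit_95        # True, so H0: beta = 0 is rejected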

For further details about t-tests see the relevant sections beginning with the t-test of independent sample means.



Back to Main Document