Linear regression analysis
Regression analysis is concerned with making predictions of the values of one variable (known as the dependent variable, usually 'y') based upon the values of another variable (known as the independent variable, usually 'x'). It is effectively an investigation of the nature of the relationship that may exist between two or more variables. Because it is concerned with making predictions, it has greater investigative potential than simple correlation analysis between two variables which, although useful, simply informs us of the strength of the association in quantitative terms.
Related tools:
Pearson Product Moment correlation: Bivariate correlation analysis.
Spearman Rank correlation: Non-parametric correlation.
If statistically significant correlation has been observed between two variables then these can be related mathematically through regression analysis to enable a trend line, or line of best fit, to be fitted to the data when one variable is plotted against the other.
It is possible to calculate the equations of two regression lines for linear data relationships: the 'regression line of y on x' and the 'regression line of x on y'. The tool determines the equation for the former (i.e., the regression line of the dependent variable 'y' on the independent variable 'x'). The reason for this is that it is orthodox to design experiments carefully so that the independent 'x' variable is the 'fixed' variable (e.g., concentration in µg/ml, temperature in °C, or, in the example data below, IQ level). In effect, this then allows us to concentrate on deviations about a theoretical 'y' population mean and to make corresponding predictions of values of 'y'.
The regression line of 'y' on 'x' takes the form of:
y = bx + a (sometimes seen as: y = mx + C)
where,
'b' is the regression coefficient describing the slope of the regression line.
'a' is the intercept of the regression line where it cuts, or intercepts, the y-axis when the data are plotted.
Once both 'a' and 'b' have been obtained it is straightforward to predict values of 'y' by substituting values of 'x' into the expression. Make sure that you do not extrapolate the regression line beyond the range of the data used to derive it! At the extremes of the data range the relationship may not necessarily still be linear.
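As an illustration of how 'a' and 'b' are arrived at (independently of this package), the following short Python sketch computes them from the usual least-squares formulae and then predicts 'y' for a chosen 'x'. The function and variable names are purely illustrative and are not part of the tool:

    # Least-squares slope (b) and intercept (a) - an illustrative sketch only
    def fit_line(x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # b = sum of cross-products / sum of squared deviations of x
        b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
            / sum((xi - mean_x) ** 2 for xi in x)
        a = mean_y - b * mean_x     # intercept: the line passes through the means
        return a, b

    a, b = fit_line([118, 99, 118, 121], [66, 50, 73, 69])   # any paired data
    predicted_y = b * 110 + a   # predict 'y' for x = 110 (within the data range)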
If the relationship between your 'x' and 'y' variables does not appear to be linear then it may be possible to log, square-root or arcsine transform the data so that it becomes legitimate to perform the regression analysis. If this is not the case then more complex analyses are required, which are beyond the scope of this package. If you need to fit complex lines to curvilinear data then 'PolyFit' by Camiel Rouweler (Aminet misc/sci directory) is recommended.
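For instance, if 'y' increases exponentially with 'x', regressing the logarithm of 'y' on 'x' may restore linearity. A minimal Python sketch of the common transformations is given below (it assumes the numpy library, which is not part of this package; the values are invented purely for illustration):

    import numpy as np

    y = np.array([2.0, 4.1, 8.3, 15.9])   # illustrative curvilinear responses
    log_y  = np.log(y)                     # log transform (exponential-type data)
    root_y = np.sqrt(y)                    # square-root transform (e.g., count data)

    # the arcsine transform is applied to proportions (values between 0 and 1)
    p = np.array([0.10, 0.25, 0.40, 0.70])
    arcsin_p = np.arcsin(np.sqrt(p))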
Script operation
This tool operates in a similar way to the others, with some small differences. Raw sample data are entered in the input requestor as two columns. During the calculations a further requestor will appear asking whether the dependent variable is in the first data column. Answer 'Y' or 'N' to proceed.
Click here for information about general script usage.
Note that column headings for either variable may be included in the input range but will not be used in the output. Typical raw data and spreadsheet output are shown below:
Raw data: IQ Reading Scores (dependent variable)

      x     y
    118    66
     99    50
    118    73
    121    69
    123    72
     98    54
    131    74
    121    70
    108    65
    111    62
    118    65
    112    63
    113    67
    111    59
    106    60
    102    59
    113    70
    101    57

Spreadsheet output:

Least Squares Regression

Predicted Values    Regression Statistics
 67.8932            n:                     18
 55.1486            Pearson r:             0.8999
 67.8932            r sq.:                 0.8098
 69.9055            Std.Err.of Est.:       146.8964
 71.247             Intercept(a):          -11.2576
 54.4778            Slope(b):              0.6708
 76.6132
 69.9055            T-test
 61.1855
 63.1978            Std.Err.of Reg.Coef.:  0.0813
 67.8932            t:                     8.2548
 63.8685            d.f.:                  16
 64.5393            P(T<=t) one-tail:      0
 63.1978            T-Critical(95%):       1.7459
 59.8439            T-Critical(99%):       2.5835
 57.1609            P(T<=t) two-tail:      0
 64.5393            T-Critical(95%):       2.1199
 56.4901            T-Critical(99%):       2.9208

ANOVA

Source of     Sum of     Mean      Degrees of   F-ratio
Variation     Squares    Squares   Freedom
Regression:   625.6036   625.6036      1        68.141
Residual:     146.8964     9.181      16
Total:        772.5                   17

P(F Sample<=f) one-tail:  0
F-Critical(95%):          4.494
F-Critical(99%):          8.5309
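The figures above can be verified independently of the spreadsheet. For example, a short Python sketch (assuming the scipy library, which is not part of this package) fed with the same eighteen (x, y) pairs reproduces the slope, intercept, Pearson r and standard error of the regression coefficient to within rounding:

    from scipy.stats import linregress

    x = [118, 99, 118, 121, 123, 98, 131, 121, 108,
         111, 118, 112, 113, 111, 106, 102, 113, 101]
    y = [66, 50, 73, 69, 72, 54, 74, 70, 65,
         62, 65, 63, 67, 59, 60, 59, 70, 57]

    result = linregress(x, y)
    print(result.slope)       # approx.  0.6708  (b)
    print(result.intercept)   # approx. -11.2576 (a)
    print(result.rvalue)      # approx.  0.8999  (Pearson r)
    print(result.stderr)      # approx.  0.0813  (std. error of the regression coefficient)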
Interpretation
There are several components to the output provided by this tool. On the left-hand side is a column of 'Predicted Values'. These are the predicted 'y', or dependent variable, values based on the equation of the regression line. For example, the equation of the line in the example above is:
y = 0.6708x - 11.2576
and when the known values of 'x' are fed into this expression the predicted values of 'y' are obtained and output to the spreadsheet.
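For example, the first data pair has x = 118, so the predicted reading score is 0.6708 × 118 - 11.2576 ≈ 67.90, which agrees with the first entry in the 'Predicted Values' column (67.8932) apart from the rounding of the displayed coefficients.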
The t-test and ANOVA output is designed to test the statistical significance of the regression analysis. In order to do this it is necessary to set up a null hypothesis which may be tested. For example, if there were no functional relationship between 'x' and 'y' then the slope (the regression coefficient) would be expected to be zero. More importantly, the slope of the sample (known as 'b') may be something other than zero, but this sample may, or may not, be representative of the population slope (known as 'β') from which it was derived.
A typical null hypothesis may be proposed as:
H0: β = 0
This may be rejected only if the probability of obtaining the computed value of 'b', when sampling from a population that actually has β = 0, is sufficiently small (i.e., less than 0.05 or 0.01, etc.).
ANOVA analysis
In the ANOVA section of the output several intermediate statistics are computed and output to the spreadsheet. The total SS (sum of squares) is a measure of the overall variability of the dependent variable and the regression SS is a measure of the amount of variability among the 'y' values resulting from a linear regression. These two values will be identical only if all data points fall exactly on the regression line. The degrees of freedom associated with the total variability of 'y' values is n-1 and that associated with the variability of 'y' values due to regression is always 1 in simple linear regression. The MS (mean squares) statistics are calculated from this information by MS = SS/d.f. and the F-ratio is determined by F = regression MS/residual MS.
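Using the figures from the example output: regression MS = 625.6036 / 1 = 625.6036, residual MS = 146.8964 / 16 = 9.181, and F = 625.6036 / 9.181 ≈ 68.14, which is the F-ratio printed in the ANOVA table.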
At the 0.05 level of significance the one-tailed F-critical value for 1 and 16 degrees of freedom can be seen to be 4.494. In this particular case it is found that the null hypothesis mentioned above should be rejected. The reason for this is that the computed F-statistic exceeds the critical value, so the probability of obtaining a value of 'b' of this size from a population where β = 0 is lower than 5%.
For further details about ANOVA see the one-way ANOVA section of this guide.
t-test analysis
In the t-test section of the output summary statistics are also generated. Here the t-test procedure is equivalent to the ANOVA test when the general null hypothesis is H0: β = 0. Although most significance testing of simple linear regression analysis will employ this hypothesis, the t-test, unlike the ANOVA test, allows directional hypotheses to be tested, such as β < 0 and β > 0.
Where H0: β = 0, it can be seen in the output that for a two-tailed test at the 0.05 level of significance the null hypothesis should also be rejected, as the calculated value of 't' (i.e., 8.2548) exceeds the critical value of 2.1199.
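The t-statistic itself is simply the regression coefficient divided by its standard error, t = 0.6708 / 0.0813 ≈ 8.25, and in simple linear regression t² equals the ANOVA F-ratio (8.2548² ≈ 68.14), which is why the two tests are equivalent for the general null hypothesis H0: β = 0.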
For further details about t-tests see the relevant sections beginning with the t-test of independent sample means.