Welcome to the HTML documentation for "TCalcStats v2.0c" - A series of statistical analysis tools for TurboCalc v5.xx. There is also an AmigaGuide version of this documentation.
WARNING: This is an "almost ready for release" version. The Guide and HTML files are not completed. Some of the ARexx scripts may need correction/improvement.
What is it?! In brief... tools for number crunching!
The included ARexx scripts will allow you to perform a comprehensive range of statistical analysis tests on raw sample data. The statistical approach and the workings of all these scripts are extensively explained in this HTML guide and the associated AmigaGuide documentation. In effect, little prior knowledge of statistical analysis is assumed for application of many of the test procedures.
This package is aimed at:
Although several statistical analysis programs exist for the Amiga platform, this suite of ARexx scripts has been designed to provide an extensive range of summary statistics, together with parametric and non-parametric tests of hypotheses, for a variety of data sources.
This package is a series of analysis tools designed for use with TurboCalc v5.x by Michael Friedrich. A similar set of tools is available for Excel (©Microsoft Corporation) on the PC and Mac in the form of 'add-ins', but this package is intended to perform beyond those existing capabilities. It contains numerous extra features:
Be warned that this is a bit of a 'bare bones' crash course in statistical analysis! If you find it incomprehensible, please obtain an introductory text; if, on the other hand, you find it lacks sufficient detail, take a look at the references.
Numerical data gathered as part of scientific investigations is often subjected to some form of statistical analysis so that the data can be evaluated objectively in relation to experimental expectations and conclusions. Methods of statistical analysis can be categorised as follows:
This form of analysis is based on the organisation and summary of data and is aimed at describing particular characteristics of a sample set of data in order to obtain information about the overall population data. In many circumstances the population data is unavailable and its characteristics may only be ascertained from a sample. For example, the true mean height of all lamp-posts in Germany is physically unobtainable as it would entail measuring them all. The subject of descriptive statistics may also be broken down into two broad areas of study:
(a.) Measures of central tendency
Analysis aimed at describing properties of statistical populations by determination of the most typical data element. In other words, measures of central tendency represent central or focal points in the distribution of data elements. Typical examples of this type of analysis include measurement of the mode (most frequently occurring element of data), the median (point constituting the middle measurement in a set of data) and the mean (the average or sum of all measurements divided by the number of measurements).
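For example, for the small sample 2, 3, 3, 5, 7: the mode is 3 (the only value occurring more than once), the median is 3 (the middle value of the ordered data) and the mean is (2 + 3 + 3 + 5 + 7) / 5 = 4.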
(b.) Measures of dispersion and variability
Analysis designed to indicate the extent of scatter or clustering of data measurements around the centre of a data distribution. Typical examples of this type of analysis include calculation of the range (difference between the highest and lowest measurements in a group of data - a fairly crude measure of dispersion), the standard deviation (a measure of data dispersion in the form of a determination of the average extent of deviation of the data values from the mean) and the variance (a measure of data dispersion in the form of a determination of the average of the squared deviations from the mean). A further example is the standard error of the mean (the standard deviation of the means of an infinite number of data samples composing the population - in effect, an expression of the relationship between the standard deviation of the sample and the standard deviation of the population).
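For reference (the scripts' internal formulas are not reproduced in this guide, but these are the usual sample definitions, where n is the number of measurements in the sample):

  range              = highest measurement - lowest measurement
  variance (s²)      = sum of (measurement - mean)² / (n - 1)
  standard deviation = square root of the variance
  standard error     = standard deviation / square root of n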
Related tools:
Descriptive statistics
Measures of central tendency and variability are often used to describe the trends and characteristics of numerical data in conjunction with graphical means. Sample data may be arranged into classes or groups to illustrate the frequency of occurrence of data elements and the underlying distribution patterns of data.
In general, histograms and polygons are used to describe data which has been grouped into frequency, relative frequency and percentage distributions.
Related tools:
Frequency distribution histograms
This form of analysis is based on making general conclusions and reasonable estimates of the characteristics of potentially large data sets, derived from the characteristics of smaller (and usually more practical to quantify) sample data sets. It is this branch of statistics which is concerned with hypothesis testing and probability theory.
Numerical data is usually collected after consideration and formation of a research hypothesis. An hypothesis can be described as a suggested explanation of certain observed phenomena, tentatively adopted until it can be tested experimentally. For example, consider the following research hypothesis:
"There are differences in the distribution and abundance of mole-hills at different grassland sites and this is likely to have been caused by differences in the mowing regimes applied".
In order to test the validity of this hypothesis, numerical data must be obtained from each grassland site using a sampling methodology. As the sites concerned may be very large it would be futile to attempt a count of all mole-hills. Numerical data is therefore collected from random sections of the sites. The quantity of all mole-hills at a given site forms what is known as a statistical population. Data derived from sections of the site forms what is known as a statistical sample of the population.
It is important to realise that when random samples are selected to represent the statistical population there will invariably be an element of error present. This should be minimised as much as possible but may manifest itself as 'errors of measurement' or as 'sampling error'. In the latter case this simply means that the sample data may not be very representative of the true population data. In most scientific investigations 'sampling error' may be reduced by increasing the sample size.
Considerations in sampling
There are many types of sampling methodology which vary according to the requirements of the investigation. Detailed discussion is beyond the scope of this guide but all effective sampling methods should conform to the following essential requirements:
Using sample data to test the hypothesis
Probability is an important component of any form of inferential statistical hypothesis test. In the various hypothesis tests that may be applied (dependent on the type of data and the question to be answered) the hypothesis is neither proved nor disproved to be correct. The test invariably produces a test statistic (a numerical value of t, χ², F, etc.) which is used to determine the probability that the hypothesis is correct or not.
In this process, whereby a hypothesis is accepted or rejected, convention dictates that a probability level of 95% is usually used to determine whether acceptance or rejection is required. In other words analysis of sample data must indicate that there is at least 95% probability that the hypothesis is correct for it to be retained.
This method of testing the validity of hypotheses has important implications for science in general. In effect, scientists do not generally set out to 'prove' that underlying mechanisms of particular observed phenomena are true. It is more accurate to state that they make decisions about the probability of such mechanisms being operative based on available quantitative data.
Null hypotheses
Once a research hypothesis has been formulated it must be modified if it is to be tested for validity using inferential statistics. In other words, the research hypothesis must be re-stated in statistical terms as a null hypothesis (abbreviated as HO) and retained in its original form as the alternate hypothesis (abbreviated as HA). Consider the example research hypothesis provided earlier:
Null hypothesis:
"Mowing frequency has no effect on mole-hill distribution and abundance and any observed differences of mole-hill distribution and abundance between the grassland sites is a result of chance sampling".
Alternate hypothesis:
"There are differences in the distribution and abundance of mole-hills at different grassland sites and this is likely to have been caused by differences in the mowing regimes applied".
The 'null hypothesis' is so named because the hypothesis states that there is no difference between the sample data from each site, that they come from the same statistical population, and that any observed difference between them is due to chance sampling error. The reason for formulating a null hypothesis is that statistical hypothesis tests will provide a probability value that allows it to be rejected or accepted. If the null hypothesis is rejected the alternate hypothesis can then be tentatively accepted.
Null hypotheses and alternate hypotheses should be mutually exclusive and exhaustive. In other words, there must be no option for both to be true and there should be no possibility that some other unspecified alternate hypothesis is true.
Steps in hypothesis testing:
Case example
In order to test the null hypothesis for the mole-hills at different grassland sites, 20 x 4m² quadrats (a quadrat is a square sampling frame) were randomly positioned at each of the sites and the quantity of mole-hills in each quadrat was recorded. The Student's t-test was chosen as a hypothesis test to determine the probability of a statistically significant difference in the means of samples of mole-hill populations at each grassland site. As the raw sample data was observed to exhibit a normal distribution and an F-test established that the variance of each sample was equivalent, it was decided to employ the ordinary t-test based on two independent sample means. The t-tests were carried out between pairs of samples (i.e., between samples taken from site A and site B, site A and C, and site B and C). The main characteristics of each site are shown below:
                | Site A      | Site B        | Site C
----------------+-------------+---------------+------------
Mowing regime:  | Mown weekly | Mown annually | Never mown
For each t-test, calculations were made of the value of t (the test statistic), the t-critical value (at the 0.05 level of significance) and the probability figure that the samples were derived from the same statistical population.
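For reference, the t-statistic for two independent sample means with equal variances is conventionally calculated in the following form (the exact code used by the scripts is not reproduced in this guide):

  t = (mean1 - mean2) / square root of ( s²p x (1/n1 + 1/n2) )

where n1 and n2 are the two sample sizes and s²p is the pooled variance, ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2), with n1 + n2 - 2 degrees of freedom.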
             | Site A & B | Site A & C | Site B & C
-------------+------------+------------+-----------
t-statistic: | 1.74       | 4.706      | 3.17
t-critical:  | 2.02       | 2.059      | 2.04
P:           | 0.08       | 0.0000797  | 0.003
Test between samples from site A and B:
The value of the t-critical statistic represents a separation figure between rejection and non-rejection of the null hypothesis. In this test the t-statistic was found to be lower than the critical value at the 0.05 level of significance. This arbitrary level of significance should always be stated in results and represents the fact that the probability of making an error in deciding to retain or reject a null hypothesis is no higher than 5%. In brief, a null hypothesis is usually rejected if the t-statistic exceeds the critical value at a given level of significance (i.e., 0.05, 0.01, etc.)
In the case of the hypothesis test between the sample data from site A and site B the test statistics favoured a retention of the null hypothesis. Observation of the probability value shows that there was a greater than 5% probability that the sample means were derived from the same statistical population.
There is a good reason for the choice of very low levels of significance (e.g., 0.05 or 0.01). A widely used analogy in explaining the concept of a level of significance is the justice system in courts of law. Usually, a defendant is presumed innocent until proven guilty beyond a reasonable doubt. The prosecution is required to prove the defendant guilty and it is deemed preferable to free a guilty person rather than imprison an innocent one. In a statistical sense this is similar to saying that it is preferable to accept a null hypothesis that is actually false (known as a Type II error) than to reject a null hypothesis that is actually true (known as a Type I error). The 'reasonable doubt' in a hypothesis test is represented by the significance level and this allows little margin for error (e.g., no higher than 5% or 1%) in rejecting a null hypothesis.
Test between samples from sites A and C and sites B and C:
In both of these tests the t-statistic was found to be greater than the critical value at the 0.05 level of significance and the null hypothesis would be rejected in both cases. In other words, the results are statistically significant and a decision would then be made that the respective sample means in each test are highly likely to be from different statistical populations. This is reflected in the probability calculations: in each case there was a less than 5% probability that the samples were derived from the same statistical population.
The next logical step is the drawing of conclusions. As significant differences in sample means existed between sites A and C and sites B and C, but not between sites A and B, it may be considered that mowing caused an increase in mole-hill distribution and abundance but the frequency of mowing had little effect on this increase. However, it is important to remember that other unconsidered factors may have caused observed differences such as site exposure, geographical location, laziness of moles (!), etc.
This case example is meant to illustrate the main underlying mechanisms by which hypothesis tests function. Although a relatively simple two-sample t-test is used in this case, a one-way analysis of variance might be preferable if this were a real scenario, as it compares all three sites in a single procedure.
Parametric and non-parametric tests
The t-test outlined above is a form of parametric statistical test and there are several others of this nature included within this package. The common basis of such tests is that they require a number of assumptions about one or more population parameters. The most common assumption when determining the nature of a statistical population from the characteristics of a sample is that it has a normal distribution. In addition, when comparisons of a sample parameter (i.e., the mean or variance, etc.) are made between two or more samples it is usually assumed that there is homogeneity of variances.
Non-parametric tests, or distribution-free tests, make less stringent assumptions about the form of the underlying population distribution. In general, a given parametric test will have an equivalent non-parametric version but it is preferable to use the former where possible as it is usually more powerful. There is a greater risk of committing a Type II error with non-parametric methods. There are several circumstances under which a non-parametric test should be generally employed:
Related tools:
Descriptive statistics: Determination of modality and distribution curve shape, etc.
Goodness of fit (χ²) for normality: Test for normality using the chi-square goodness-of-fit test.
Related tools:
F-Ratio: Variance ratio test.
One-tailed and two-tailed tests
In the t-test example above, no specifications were provided as to whether we were interested in detecting any significant difference in the means of the samples or whether we were interested in one sample mean being significantly greater or smaller than another.
In a two-tailed test (also known as a two-sided or nondirectional test) the object of the test is to determine whether there is any significant difference between two or more samples or one sample and a hypothesised parameter (i.e., the mean, variance, etc.). For example, in the t-test outlined above we could be trying to detect whether there was any significant difference in the sample means. As this method employs probability in the form of the t-distribution curve, and we are not investigating whether one particular sample mean is larger/smaller than the other in a particular direction, the rejection region is divided into two tails of the t-distribution. Any calculated t-critical value will also be accompanied by an equivalent negative t-critical value because an extreme, or very improbable value of 't' in any direction will cause the rejection of the null hypothesis.
In a one-tailed test (also known as a one-sided or directional test) the objective is determination of a significant difference between samples or parameters of samples in one direction only. As a result we are only interested in whether there is a small (e.g., 5%, 1%, etc.) probability of the test statistic occurring by chance alone in one tail of the given distribution (t, χ², F, etc.). The one-tailed test is often preferable to the two-tailed test, if a direction may be specified in the alternate hypothesis, as it provides a higher chance (or power) of rejecting the null hypothesis.
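Because the t-distribution is symmetrical, the two-tailed probability for a given value of t is simply twice the one-tailed probability; this is why the script output in the case example later in this guide reports P(T<=t) one-tail = 0.003852 alongside P(T<=t) two-tail = 0.007703.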
A further branch of statistical analysis is concerned with examining the relationships that may exist between two quantitative variables represented in sampling efforts.
Correlation analysis
Investigation of the degree of correlation between two variables effectively determines the strength of association between them. There are two commonly used correlation techniques. The Spearman rank correlation coefficient is calculated for data variables that are ranked or do not exhibit a normal distribution, and is therefore a non-parametric form of analysis. The Pearson product-moment correlation coefficient is more sensitive and therefore preferable, but assumes that sample variables are normally distributed. Is there any statistically significant association between the number of late hours spent writing AmigaGuides and the number of accidental hard drive invalidating incidents per month? (!)
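For reference, the Pearson product-moment coefficient takes the familiar form:

  r = sum of (x - mean of x)(y - mean of y) / square root of ( sum of (x - mean of x)² x sum of (y - mean of y)² )

where r ranges from -1 (perfect negative association) through 0 (no association) to +1 (perfect positive association).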
Regression analysis
Regression analysis is concerned with making predictions of the values of one variable (known as the dependent variable 'y') based upon the values of another variable (known as the independent variable 'x'). It is effectively concerned with investigation of the nature of the relationship that may exist between two or more variables. A typical application of such a procedure may be to predict agricultural crop yields on the basis of the quantities of fertilizer applied to a field.
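For reference, simple linear regression fits a straight line of the form y = a + bx by the method of least squares, where:

  b (the slope)     = sum of (x - mean of x)(y - mean of y) / sum of (x - mean of x)²
  a (the intercept) = mean of y - b x mean of x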
Related tools:
Pearson Product Moment correlation: Bivariate correlation analysis.
Spearman Rank correlation: Non-parametric correlation.
Linear regression: Simple linear regression analysis.
This section contains details of specific use and workings of the statistical tools included in this package. All worked examples found in sections below and the scripts themselves will assume (where necessary) that statistical tests are conducted at the 0.05 or 0.01 level of probability or significance. Most of the information found in these sections will assume some prior knowledge of statistical analysis theory and application. If required, click here for further information.
You may also want to pursue further references for information beyond the scope of this document.
Required programs and libraries etc.
How to install this package.
Recommended method
Double-click on the "Install_TCalcStats2" icon. This installs all the necessary files to your hard-disk and adds an assign to your s:user-startup file. The installation script uses the standard Amiga installer program.
Manual installation
It may take the form of:
;BEGIN TurboCalc
Assign TurboCalc: Work:TurboCalc
;END TurboCalc
Change the directory path for the TurboCalc program to suit the location on your system.
Under normal circumstances it would not matter where you install these files but in this case it is necessary to maintain this directory structure in order that the main files are able to find their associated image files which are shared.
All the tools work more or less in the same fashion. Begin by entering the data you wish to analyse into a TurboCalc spreadsheet. In general, data will need to be entered in columns. Any deviations from this, and specific instructions for each test are outlined elsewhere.
The ARexx macros can be started either from a Shell or from within TurboCalc:
If started from a Shell, each file will check to see that TurboCalc is running and, if it is not, will start it and present the user with a "File/Open" requester. Load in the file you want to work on.
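For example (the script name and path shown here are purely illustrative - substitute the actual file name and location of the tool you wish to run):

  rx TCalcStats/SomeTest.rexx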
To start from within TurboCalc, open the spreadsheet containing your data, then open the "STATS_Macros.TCD" file. The macro sheet contains macros to start each of the ARexx files; their names appear in the "Play..." macro requester. Make sure your data spreadsheet is active before choosing the macro to play.
If the file selection is okay, the macro will be run. You will then be asked to supply the program with some data. The most common request is to enter both the cell range for the data and a cell reference for the output. This (and other program requests) will appear in normal requester windows. Enter the requested information and then press the <Return> key. The program will then proceed.
Note: Data ranges should be entered in the form e.g. B2:E25 - the macros will parse that input to find how many rows and columns there are, and to extract the data from the spreadsheet for internal calculations. It is very important that you include any blank cells at the end of columns to ensure that the total range is included. That is, if you have two or more columns with unequal numbers of data cells, the range has to cover a rectangle big enough to include the column with the most cells.
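As an illustration only (this is a simplified sketch, not the actual code used by the macros), a range string such as 'B2:E25' can be broken down in ARexx along the following lines:

  /* Sketch: split a range such as "B2:E25" into its start and end    */
  /* cells, then separate the column letters from the row numbers.    */
  range = 'B2:E25'
  PARSE UPPER VAR range startcell ':' endcell

  digits  = '0123456789'
  letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
  startcol = STRIP(TRANSLATE(startcell, COPIES(' ', LENGTH(digits)),  digits))
  startrow = STRIP(TRANSLATE(startcell, COPIES(' ', LENGTH(letters)), letters))
  endcol   = STRIP(TRANSLATE(endcell,   COPIES(' ', LENGTH(digits)),  digits))
  endrow   = STRIP(TRANSLATE(endcell,   COPIES(' ', LENGTH(letters)), letters))

  SAY 'Columns' startcol 'to' endcol', rows' startrow 'to' endrow

Running this fragment would report columns B to E and rows 2 to 25, from which the number of rows and columns in the range can be derived.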
Labels at the top of columns will be used where they are included in the data range selection. Otherwise, the programs will generate their own labels. Note that labels may contain spaces when entered into a cell, but in the script output those spaces will be replaced by '_' characters; i.e., the title 'Female Heights' will become 'Female_Heights'.
Please note that when you give the program a cell reference for output, the program uses that reference as the top left cell in the output. All cells in the output area will be overwritten, so be careful to select cells that are empty.
Once the ARexx file is running, the user is locked out until the program stops, to avoid any problems with stray mouse clicks or keyboard presses. The screen display will not be updated until the calculations are finished - so be patient when generating output derived from large sample sizes! For each test a console window provides progress details.
Operational errors.
Statistical errors.
Never trust the statistics generated unless you are sure you have conducted the test in question correctly! All of the supplied statistical tools have been designed to be as 'generic' as possible - one of the reasons why many calculate probability and critical values at more than one level of statistical significance and for both one-tailed and two-tailed tests for maximum flexibility. Bear in mind that many statistical tests are very flexible and are often modified for specific applications.
Case example
In this case a one-tailed paired sample t-test can result in problems which are not directly caused by operation of the script:
Consider first a two-tailed test:
HO: mean difference = 0; HA: mean difference is not = 0
  x  |  y  | Script Output     |
-----+-----+-------------------+----------
 142 | 138 | Mean of Diff.:    | 3.3
 140 | 136 | Variance:         | 9.3444
 144 | 147 | Std. Dev.:        | 3.0569
 144 | 139 | Std. Err.:        | 0.9667
 142 | 143 | t:                | 3.4137
 146 | 141 | Count:            | 10
 149 | 143 | d.f.:             | 9
 150 | 145 | P(T<=t) one-tail: | 0.003852
 142 | 136 | T-Critical (95%): | 1.8331
 148 | 146 | T-Critical (99%): | 2.8214
     |     | P(T<=t) two-tail: | 0.007703
     |     | T-Critical (95%): | 2.2622
     |     | T-Critical (99%): | 3.2498
This t-test is fine. The calculation for the t-statistic took the form of:
t = Mean of differences / standard error
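Substituting the values from the output above gives t = 3.3 / 0.9667 ≈ 3.41 (the script reports 3.4137, working from unrounded values), which exceeds the two-tailed T-Critical (95%) value of 2.2622 for 9 degrees of freedom.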
Therefore reject HO. There is a significant difference between the two samples. i.e.,
t: 3.4137
d.f.: 9
P: <0.01 (i.e., 0.007703)
Now consider a specific one-tailed test:
HO: mean difference <= 250; HA: mean difference > 250
   x  |   y  | Script Output     |
------+------+-------------------+-----------
 2250 | 1920 | Mean of Diff.:    | 295.5556
 2410 | 2020 | Variance:         | 6502.7777
 2260 | 2060 | Std. Dev.:        | 80.6398
 2200 | 1960 | Std. Err.:        | 26.8799
 2360 | 1960 | t:                | 10.9953
 2320 | 2140 | Count:            | 9
 2240 | 1980 | d.f.:             | 8
 2300 | 1940 | P(T<=t) one-tail: | 0.000002
 2090 | 1790 | T-Critical (95%): | 1.8595
      |      | T-Critical (99%): | 2.8965
      |      | P(T<=t) two-tail: | 0.000004
      |      | T-Critical (95%): | 2.306
      |      | T-Critical (99%): | 3.3554
This t-test is not correct as it stands because we have customised the null hypothesis: it is tested on the basis of a directional change against a specific criterion (i.e., that the mean difference is equal to, or less than, 250). The analysis tool calculated the t-test in the normal fashion, so its results are not 'illegal' as such; in effect, the output must then be modified by the user to accommodate the customised null hypothesis. The calculation for the t-test took the form of:
t = Mean of differences / standard error
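Substituting the values from the output above gives t = 295.5556 / 26.8799 ≈ 10.995, as reported by the script.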
This provided the following results and conclusion:
Reject HO. The mean difference is significantly greater than 250. i.e.,
t: 10.9953
d.f.: 8
P: <0.01
The modified t-test should take the form of:
t = (Mean of differences - 250) / standard error
To recalculate the value of the t-statistic simply use the necessary values provided by the output in a new cell calculation adjacent to the old t-statistic.
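For example (the cell references here are purely illustrative and depend on where the script placed its output in your spreadsheet), if the mean of differences appears in cell D2 and the standard error in cell D5, entering a formula such as

  = (D2 - 250) / D5

in an empty cell, i.e. (295.5556 - 250) / 26.8799, yields the corrected t-statistic of approximately 1.695 used below.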
This would then provide the following result:
Retain HO. The mean difference is not significantly greater than 250. i.e.,
t: 1.695
d.f.: 8
P: >0.05
Archaeological diggings...
v1.0
v2.0
Future versions
This package may still be in development. For those who are aware of Microsoft Excel's "Analysis Tools" add-in, this package is intended to be a (better) equivalent for TurboCalc.
It is intended that the following additions may be made (if there are a few nods of approval!):
Suggestions are welcome! For contact details see here.
The following reading list is presented as a useful source of further information about statistical analysis. It is by no means exhaustive; there are many good texts available on the subject.
If you would like to get in touch with the authors for any queries or to offer advice where you think improvements could be made, do so via the following addresses. Let us know if you find this software useful. Bug reports and/or error reports welcomed for future revisions! Please send these to the appropriate author:
Script programming and HTML documentation: Rudy Kohut.
E-mail address: kohutr@connexus.net.au
AmigaGuide documentation and some statistical analysis methods: Nick Bergquist.
Postal address:
175 Marlborough Road, Gillingham, Kent ME7 5HP, England
E-mail address: nick@nab.u-net.com
ICQ: 23842984
This package is ©1999 by R. Kohut and N. Bergquist. It is only copyrighted in order to protect its integrity. The best effort has been made to ensure that the contents are complete and correct and it is freely available to all. It may be freely distributed across any medium as long as no direct profit is made but should not be placed in a disk collection (CD-ROM versions of Aminet are exempt) without prior permission from the authors.
Disclaimer
This package is supplied 'as is' and no express or implied warranties are offered. Under no circumstance will the authors be liable for any direct, indirect, incidental, special or consequential damages (including, but not limited to, loss of data, profits or business interruption) however caused and under any theory of liability arising in any way from the use of this package.
Although every effort has been made to ensure the accuracy of results produced from analysis of data using this package, it is important that you familiarise yourself with the analysis techniques and apply the included tests correctly. See the limitations section for further details.
Source Material
The authors would like to acknowledge that the coding for some complex statistical procedures has been adapted from freely available sources on the internet (usually in C or Fortran programming language and adapted to ARexx). A major source has been StatLib, a system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW.
The apstat collection contains a nearly complete set of algorithms published in Applied Statistics. The collection is maintained by Alan Miller in Melbourne. Many of the algorithms came directly from the Royal Statistical Society.
Generally, though, most of the statistical algorithms in the ARexx scripts were adapted from formulas available in published text books.
TurboCalc v5.x © 1993-98 Michael Friedrich. TurboCalc is a spreadsheet program for the Amiga. The latest version (as of 20-01-99) is v.5.02. Visit the TurboCalc Home Page
There are several ways of obtaining the latest copy:
U.K. distribution:
Digita International Ltd., Black Horse House, Exmouth, Devon EX8 1JL, England
Tel: 01395 270273
Fax: 01395 268893
German distribution:
Stefan Ossowskis Schatztruhe, Gesellschaft für Software mbh, Veronikastr. 33, D-45131 Essen, Germany
Tel: 0201-788778
Fax: 0201-798447
French distribution:
Quartz Informatique, 2bis Avenue de Brogny, 74000 Annecy, France
Tel: Int+50.52.83.31
Fax: Int+50.52.83.31
Italian distribution:
NonSoLoSoft di Ferruccio Zamuner, il distributore di software Amiga, Casella Postale n. 63, I-10023 Chieri (TO), Italia
Tel: 011 9415237
Fax: 011 9415237
E-mail: solo3@chierinet.it
A demo version with some limitations is also available from the biz/demo directory of Aminet or on Aminet CDROM No.25.
© Rudy Kohut and Nick Bergquist, 1999