Histograms


Descriptive and summary statistics - histograms

To reveal how sample data is distributed, the raw data is often reorganised into a frequency distribution table. In other words, the values are split into groups, or classes, based on size.

The resulting size classes show how often data elements fall into specific numerical ranges within the overall range of the data. This makes it possible to see whether the data is clustered in a sub-range or dispersed throughout the overall range.

Script operation

This tool operates slightly differently from most of the others, so the instructions outlined below must be followed.


The Histogram.rexx script analyses a single column of data to generate frequencies, cumulative frequencies and cumulative percentages. This allows you to see the distribution of the data set before embarking on more elaborate statistical procedures.
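The relationship between the three output columns can be sketched in a few lines of Python. This is an illustration only, not the Histogram.rexx source: the function name cumulative_table is hypothetical, and the input is the list of class frequencies from the worked example further down this page.

```python
from itertools import accumulate

def cumulative_table(freqs):
    """Return cumulative frequencies and cumulative percentages.

    freqs is a list of class frequencies, lowest class first.
    """
    total = sum(freqs)
    cum_freq = list(accumulate(freqs))          # running total of the counts
    cum_pct = [round(100 * c / total, 2) for c in cum_freq]
    return cum_freq, cum_pct

# Class frequencies taken from "Spreadsheet output 1" in the example
cum_freq, cum_pct = cumulative_table([6, 3, 2, 2, 3, 2, 1])
print(cum_freq)   # [6, 9, 11, 13, 16, 18, 19]
print(cum_pct)    # [31.58, 47.37, 57.89, 68.42, 84.21, 94.74, 100.0]
```

The last cumulative frequency always equals the sample size, and the last cumulative percentage is always 100.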

NOTE: Only one column of data is allowed.

The program will ask you to nominate the start cell of a range of "bin" values (i.e. class interval values). If no range is nominated, the program will create its own bin values. The following example illustrates the use of bin values supplied by the user (the first set of outputs) and those calculated automatically by the program (the second set of outputs). In both outputs each bin value is the lower limit of its class: the six scores from 42 to 46, for instance, are counted against bin 40.

 Raw data:

 Exam scores   Bin values

          42           40
          45           50
          46           60
          46           70
          42           80
          55           90
          60          100
          94
          86
          72
          64
          59
          52
          44
          86
          99
         100
          77
          84

 Spreadsheet output 1 (bin values supplied by the user):

 Exam scores

 Bin   Frequency   Cum.Freq.    Cum.%

  40           6           6    31.58
  50           3           9    47.37
  60           2          11    57.89
  70           2          13    68.42
  80           3          16    84.21
  90           2          18    94.74
 100           1          19   100.00

 Spreadsheet output 2 (bin values calculated by the program):

 Exam scores

  Bin   Frequency   Cum.Freq.    Cum.%

   42           7           7    36.84
 53.6           4          11    57.89
 65.2           1          12    63.16
 76.8           4          16    84.21
 88.4           3          19   100.00
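The frequencies in both outputs can be reproduced with the short Python sketch below. It is not the Histogram.rexx source, and it rests on two assumptions inferred from the example figures rather than stated by the script's documentation: each bin value is treated as the lower limit of its class, and the automatic bins are five equal-width classes starting at the smallest score. How the script chooses the number of classes is not documented here, so the 5 is simply passed in. The function names are hypothetical.

```python
from bisect import bisect_right

scores = [42, 45, 46, 46, 42, 55, 60, 94, 86, 72,
          64, 59, 52, 44, 86, 99, 100, 77, 84]

def frequencies(data, bins):
    """Count values per class, treating each bin value as a lower limit."""
    counts = [0] * len(bins)
    for x in data:
        i = bisect_right(bins, x) - 1    # index of the last bin value <= x
        if i < 0:
            # Assumption: values below the first bin are treated as an error
            raise ValueError(f"{x} is below the first bin value")
        counts[i] += 1
    return counts

def equal_width_bins(data, n_classes):
    """Lower limits of n_classes equal-width classes spanning the data."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_classes
    return [lo + i * width for i in range(n_classes)]

user_bins = [40, 50, 60, 70, 80, 90, 100]
auto_bins = equal_width_bins(scores, 5)  # 5 is inferred from the example

print(frequencies(scores, user_bins))    # [6, 3, 2, 2, 3, 2, 1]
print([round(b, 1) for b in auto_bins])  # [42.0, 53.6, 65.2, 76.8, 88.4]
print(frequencies(scores, auto_bins))    # [7, 4, 1, 4, 3]
```

The bin limits are rounded only for display; the unrounded values are used for counting so that a score lying close to a class boundary is not misplaced.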

Interpretation

It is important to realise that frequency distributions and histograms are simply a visual aid for gaining insight into how the sample data is distributed. If the data must be checked for normality before choosing a parametric hypothesis testing method, the histogram output may be useful to an extent.

A more powerful way of testing for normality is to apply a χ² goodness-of-fit hypothesis test, which compares the observed data distribution with the expected normal distribution.
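As a rough illustration of that comparison, the Python sketch below computes a chi-square goodness-of-fit statistic for the exam scores used on this page. It is not the related TurboCalc tool and may differ from it. The choices made here are textbook assumptions: the normal curve is fitted using the sample mean and sample standard deviation, the class boundaries are the automatic bin values from the example, the two outer classes are treated as open-ended, and the degrees of freedom are taken as the number of classes minus 3. The function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev

scores = [42, 45, 46, 46, 42, 55, 60, 94, 86, 72,
          64, 59, 52, 44, 86, 99, 100, 77, 84]

# Interior class boundaries (assumed: the automatic bin values above 42)
cuts = [53.6, 65.2, 76.8, 88.4]

def chi_square_normal(data, cuts):
    """Chi-square statistic comparing class counts with a fitted normal."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Cumulative probability at each boundary; outer classes are open-ended
    cdf = [0.0] + [fitted.cdf(c) for c in cuts] + [1.0]
    expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]
    observed = [0] * len(expected)
    for x in data:
        observed[sum(x >= c for c in cuts)] += 1   # class index of x
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(expected) - 1 - 2    # two parameters (mean, sd) were estimated
    return chi2, dof, observed, expected

chi2, dof, observed, expected = chi_square_normal(scores, cuts)
print(f"chi-square = {chi2:.3f} on {dof} degrees of freedom")
```

With 2 degrees of freedom the 5% critical value of chi-square is 5.991. Bear in mind that with only 19 observations several expected class counts fall below 5, so the chi-square approximation is unreliable for a sample this small.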

Related tools:

Goodness of fit (χ²) for normality: Detect normality using the chi-square goodness-of-fit test.

Several useful inferences may be drawn from the data generated by the analysis tool:

The easiest way to gain familiarity with any trends in the sample data distribution is to plot the distributions generated using TurboCalc's graphing capabilities. A simple line graph gives the clearest insight.

The following is a plot of the frequency data displayed above:

[Figure: plot of frequency data]

Note that in this case the data seems very unlikely to follow a normal distribution and appears to be distinctly positively skewed.
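The visual impression of skew can be checked numerically. The Python sketch below computes the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5) for the exam scores. This is not something Histogram.rexx reports; it is added here only as a cross-check, and the function name is hypothetical.

```python
from statistics import fmean

scores = [42, 45, 46, 46, 42, 55, 60, 94, 86, 72,
          64, 59, 52, 44, 86, 99, 100, 77, 84]

def skewness(data):
    """Moment coefficient of skewness (population form, no bias correction)."""
    n = len(data)
    m = fmean(data)
    m2 = sum((x - m) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

g1 = skewness(scores)
print(f"skewness = {g1:.2f}")
```

A value above zero indicates a longer tail towards the high scores (positive skew), a value below zero indicates a longer tail towards the low scores, and a value near zero is consistent with a symmetric distribution.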

The following is a plot of the cumulative percentage frequency data for the example above:

[Figure: plot of cumulative percentage frequency data]

If the plot had resembled an S-shaped (sigmoid) curve, the sample data would be expected to follow a normal distribution. In this case the curve resembles the upper half of a sigmoid curve, reinforcing the likelihood that the sample data comes from a positively skewed distribution. This in turn seems to imply that the exam was a particularly difficult one, or the students were mostly a bit daft, or the tutor was useless. Take your pick...


