We will start by looking at a graphical
method for studying variation, known
as the 'Frequency Histogram'.
To create a frequency histogram, group
the data into ‘bins’, each bin containing
a range of values. The data below show
the test results for 25 students:
Results             | Bin     | Midpoint | Frequency
38, 10, 60, 90, 88  | >0-20   | 10       | 7
96, 1, 41, 86, 14   | >20-40  | 30       | 8
25, 5, 3, 16, 22    | >40-60  | 50       | 5
2, 29, 34, 55, 36   | >60-80  | 70       | 0
37, 36, 91, 47, 43  | >80-100 | 90       | 5
I've grouped them into 5 bins of equal
size. The first bin contains the frequency
(or number) of results that are greater
than zero and up to and including 20.
The second bin contains the frequency
of values greater than 20 up to and including
40. You can check in the Results columns
that there are eight such values (38, 25,
37, 29, 36, 34, 22 and 36).
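The tally described above can be checked with a short Python sketch. The function name `bin_frequencies` and its parameters are illustrative assumptions, not part of the original text; the data are the 25 results from the table.

```python
import math

# Test results for the 25 students, copied from the table.
results = [38, 10, 60, 90, 88,
           96, 1, 41, 86, 14,
           25, 5, 3, 16, 22,
           2, 29, 34, 55, 36,
           37, 36, 91, 47, 43]

def bin_frequencies(values, low, span, n_bins):
    """Count values in the bins (low, low+span], (low+span, low+2*span], ..."""
    counts = [0] * n_bins
    for v in values:
        # ceil sends a value sitting exactly on an upper boundary to the lower bin
        index = math.ceil((v - low) / span) - 1
        if not 0 <= index < n_bins:
            raise ValueError(f"{v} is outside the binned range")
        counts[index] += 1
    return counts

frequencies = bin_frequencies(results, low=0, span=20, n_bins=5)
for i, f in enumerate(frequencies):
    print(f">{i * 20}-{(i + 1) * 20}: {f}")
```

The five counts should match the Frequency column of the table: 7, 8, 5, 0 and 5.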
Now I can create a histogram of the results.
The vertical axis represents the frequency
of observations in each range:
There are two conventions
for showing the bin values on the horizontal
axis of the histogram:
1. show the midpoint of the bin range
2. show the upper limit of each bin range,
the 'cutpoint'
The histogram above shows the
midpoint convention.
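To make the two labelling conventions concrete, here is a small Python sketch that derives both sets of axis labels from the bin edges of the example. The variable names are assumptions made for illustration, and the text bars are drawn horizontally, so they are only a rough stand-in for a plotted histogram.

```python
low, span, n_bins = 0, 20, 5
frequencies = [7, 8, 5, 0, 5]          # Frequency column from the table

# Bin edges: 0, 20, 40, 60, 80, 100
edges = [low + i * span for i in range(n_bins + 1)]

# Convention 1: label each bin with the midpoint of its range
midpoint_labels = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

# Convention 2: label each bin with its upper limit, the cutpoint
cutpoint_labels = edges[1:]

print("midpoint  cutpoint  frequency")
for mid, cut, freq in zip(midpoint_labels, cutpoint_labels, frequencies):
    print(f"{mid:>8.0f}  {cut:>8}  {'#' * freq}")
```

Both conventions describe the same bins; only the labels on the horizontal axis differ.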
The reason for creating a histogram is
to see the 'pattern' of the data. The
number of bins you use will affect how
easy it is to see the pattern. If you
use too many bins you will have too few
values in each and the pattern will be
'ragged'. If you use too few bins you
may miss important details.
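The effect of the number of bins can be seen by tallying the same 25 results three ways. This is a hedged sketch: the helper `tally` and the choice of 2, 5 and 25 bins over a 0 to 100 scale are assumptions made purely for illustration.

```python
import math

results = [38, 10, 60, 90, 88, 96, 1, 41, 86, 14, 25, 5, 3,
           16, 22, 2, 29, 34, 55, 36, 37, 36, 91, 47, 43]

def tally(values, low, high, n_bins):
    """Frequencies for n_bins equal bins over (low, high], upper limit included."""
    span = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[math.ceil((v - low) / span) - 1] += 1
    return counts

few = tally(results, 0, 100, 2)     # too few: the empty 60-80 range disappears
chosen = tally(results, 0, 100, 5)  # the five bins used in the example
many = tally(results, 0, 100, 25)   # too many: most bins hold 0, 1 or 2 values

for label, counts in (("2 bins", few), ("5 bins", chosen), ("25 bins", many)):
    print(f"{label:>8}: {counts}")
```

With 2 bins the counts are 18 and 7, which hides the gap between 60 and 80; with 25 bins no bin holds more than 3 values, which is the 'ragged' pattern described above.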
There are various ways of calculating
the optimum number of bins. I find that
using the square root of the number of
data values is as satisfactory as the
more complicated methods. The result is
usually on the low side, but you will
probably want to adjust it anyway to
avoid awkwardly sized bins.
In the example there are 25 data values.
The square root rule gives 5 bins. The
smallest data value is 1, the largest
is 96. A scale stretching from 0 through
100 will contain all the values; this
conveniently gives 5 bins of span 20.
If there were 50 values on the same scale,
the calculation would suggest 7 bins of
span about 14 each, but that is an awkward
span, so I'd probably use 10 bins of span
10, which is a nice round number; bins of
span 15 would also be satisfactory.
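The arithmetic behind the square root rule can be sketched in Python as follows. The function names are hypothetical, and rounding the square root to the nearest whole number is an assumption; the text does not say whether to round or truncate (for 25 and 50 values both give the same answer).

```python
import math

def square_root_bins(n_values):
    """Suggested number of bins: square root of the count, rounded (assumed)."""
    return round(math.sqrt(n_values))

def raw_span(low, high, n_bins):
    """Bin span before adjusting it to a convenient round number."""
    return (high - low) / n_bins

for n in (25, 50):
    k = square_root_bins(n)
    print(f"{n} values: {k} bins, raw span {raw_span(0, 100, k):.1f}")
```

For 25 values this gives 5 bins of span 20; for 50 values it gives 7 bins with a raw span of a little over 14, which is the awkward figure that the text suggests replacing with a span of 10 or 15.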
When you allocate the data into the bins
you must decide on how to handle values
that fall on the boundary between two
bins. In the example I've included values
greater than the lower boundary up to
and including the upper boundary; this
is consistent with the convention used
by Excel.
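The boundary choice matters for any value that sits exactly on a bin edge. In the example data the only such value is 60. The sketch below contrasts the convention used here with the alternative in which the lower boundary is included instead; both function names are assumptions introduced for illustration.

```python
import math

def index_upper_inclusive(v, low, span):
    """Bin index when each bin is (lower, upper], as in the example."""
    return math.ceil((v - low) / span) - 1

def index_lower_inclusive(v, low, span):
    """Bin index when each bin is [lower, upper), the alternative convention."""
    return math.floor((v - low) / span)

# Bin indices start at 0: index 2 is the 40-60 bin, index 3 is the 60-80 bin.
for v in (55, 60):
    print(v, index_upper_inclusive(v, 0, 20), index_lower_inclusive(v, 0, 20))
```

A result of 55 lands in the same bin either way, but 60 moves from the 40-60 bin to the 60-80 bin under the alternative convention, which would change the frequencies from 5 and 0 to 4 and 1.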