The variance is a measure of the process
variation. The greater the scatter of
the data values, the larger the variance.
Loosely speaking, the variance describes
the average distance of the data values
from the mean. More precisely, the variance
is the mean of the squares of the distances
of the data values from the mean:
| i | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|
| 1 | -3 | 9 |
| 2 | -2 | 4 |
| 3 | -1 | 1 |
| 4 | +2 | 4 |
| 5 | +4 | 16 |
| Sum | 0 | 34 |
| Variance | | 8.5* |
* I'll explain why we divide by 'n-1'
shortly.
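The arithmetic in the table can be reproduced with a short Python sketch. The data values below are hypothetical (the table lists only the distances from the mean, not the values themselves); they were chosen so that their distances from the mean are -3, -2, -1, +2 and +4, as in the table.

```python
import statistics

# Hypothetical data values (not given in the table), chosen so that their
# distances from the mean match the table: -3, -2, -1, +2, +4.
values = [4, 5, 6, 9, 11]

n = len(values)
mean = sum(values) / n                     # 7.0
deviations = [x - mean for x in values]    # distances from the mean
squares = [d ** 2 for d in deviations]     # squared distances

sum_of_squares = sum(squares)              # 34.0
variance = sum_of_squares / (n - 1)        # 34 / 4 = 8.5

print(deviations, sum(deviations))
print(squares, sum_of_squares)
# The standard library's sample variance also divides by n - 1.
print(variance, statistics.variance(values))
```

Both the hand calculation and `statistics.variance` divide the sum of squares by n - 1, so they agree on 8.5.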
The values are squared because the square
of any non-zero value is positive. Notice
that if we used the unsquared distances:

$$x_i - \bar{x}$$

some of the values would be negative
and others positive, and the sum of all
the values would be zero. Squaring makes
all the values positive and is a convenient
way of overcoming this.
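That the unsquared distances always sum to zero, and not just in this example, follows from the definition of the mean. A short derivation, writing $x_i$ for the data values and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for their mean:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$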
Now to explain why we divide by 'n-1'.
The natural formula for the variance is:

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$

However, the parameter $\mu$ is not known,
and so the statistic $\bar{x}$ is substituted.
This would give the value of $s^2$ a bias:
it would be too small. Dividing by 'n-1'
exactly compensates for the bias:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The bias occurs because $\bar{x}$
was calculated using the selfsame data
values used to calculate $s^2$.
Recycling the data values in this way
introduces the bias. The mathematics of
this are complicated, but they are given
here
if you want to see them.
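For readers who would rather check the bias numerically than follow the algebra, here is a minimal sketch. It assumes a small made-up population (the faces of a die) and enumerates every possible sample of size 3 drawn with replacement, so the averages are exact rather than simulated. The names `population`, `sum_sq_dev` and the choice of n = 3 are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction
from itertools import product

def sum_sq_dev(sample):
    """Sum of squared distances of the sample values from the sample mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

# Hypothetical population: the six faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
n = 3  # sample size

# True population variance, using the known population mean.
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of the variance estimates over all samples, for each divisor.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("population variance:", sigma2)
print("average when dividing by n:", avg_div_n)
print("average when dividing by n-1:", avg_div_n_minus_1)
```

Under these assumptions, dividing by n gives an average that falls short of the population variance by the factor (n-1)/n, while dividing by n-1 recovers it exactly.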
The value 'n-1' is also known as the
number of 'degrees of freedom'. This is
the number of independent data values
in the formula. Suppose that the sample
contains 10 values (n = 10); then if you
know any 9 of the 10 values, and the value
of $\bar{x}$, you can calculate the
remaining data value. There are only
9 'independent values':
The average of the five values
is 7; find the missing value,
X5:
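The idea that the last value is fixed once the mean and the other values are known can be sketched in a few lines of Python. The four known values below are hypothetical stand-ins (the numbers from the original example are not reproduced here), and `missing_value` is an illustrative helper name.

```python
def missing_value(known_values, mean, n):
    """Return the one value that makes n values average to `mean`.

    Since mean = (sum of all n values) / n, the missing value is
    n * mean minus the sum of the values already known.
    """
    return n * mean - sum(known_values)

# Hypothetical known values X1..X4; the mean of all five values is 7.
known = [4, 5, 6, 9]
x5 = missing_value(known, mean=7, n=5)
print(x5)  # 11
```

Because X5 is forced to be 11 here, it carries no new information: only four of the five values are free to vary, which is what "n-1 degrees of freedom" refers to.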
Mathematically the value so calculated
is the value that minimizes the value
of $s^2$. The resulting value
of the variance is lower than if an actual
value taken at random from the process
(an 'independent value') were used, unless
the independent value happened to equal
the calculated value.
I've spent some time introducing the
number of 'degrees of freedom' because
it is an important, and somewhat puzzling,
concept that often crops up in statistics.