The variance is a measure of the process
variation. The greater the scatter of
the data values, the larger the variance.
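Loosely, the variance is the average distance of
the data values from the mean.
More precisely, the variance is the mean
of the squares of the distances of the
data values from the mean:

    s^{2} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}

where \bar{x} (xbar) is the mean of the n data values x_{i}.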

i      x_{i} - xbar    (x_{i} - xbar)^{2}
1          -3                  9
2          -2                  4
3          -1                  1
4          +2                  4
5          +4                 16
Sum         0                 34

Variance                     8.5*
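
The arithmetic in the table can be checked with a few lines of Python. This is only a sketch: it assumes the first three deviations are negative (the signs must be negative for the column to sum to zero), and the variable names are illustrative.

```python
# Deviations (x_i - xbar) read from the table; the first three are
# assumed to be negative so that the column sums to zero.
deviations = [-3, -2, -1, 2, 4]

n = len(deviations)
sum_of_deviations = sum(deviations)               # positives and negatives cancel
sum_of_squares = sum(d ** 2 for d in deviations)  # 9 + 4 + 1 + 4 + 16
variance = sum_of_squares / (n - 1)               # divide by n-1, not n

print(sum_of_deviations, sum_of_squares, variance)  # 0 34 8.5
```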

The values are squared because the square
of any value is never negative. Notice that
if we used the deviations themselves:

    \sum_{i=1}^{n} (x_{i} - \bar{x})

some of the values would be negative
and others positive, and the sum of all the
values would be zero. Squaring removes the
signs and is a convenient way
of overcoming this.
You will notice that to calculate the
average we divided by 'n-1' rather than
'n' as you might expect. If you knew the
process mean μ you would use the formula:

    s^{2} = \frac{\sum_{i=1}^{n} (x_{i} - \mu)^{2}}{n}

However, you do not normally know the
parameter μ, and
so you substitute xbar.
You use the sample values to calculate
xbar and then you reuse them to calculate
the variance. This gives the value of
s^{2} a bias, which can be removed
by dividing by 'n-1', and the formula becomes:

    s^{2} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}
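
The two formulas can be sketched as Python functions. The function names, the sample values and the process mean of 8 are all hypothetical and are not taken from the text. The sample is chosen only so that its deviations from xbar are -3, -2, -1, +2, +4.

```python
import math
import statistics


def variance_known_mean(values, mu):
    """Mean squared distance from a known process mean mu (divide by n)."""
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """xbar is calculated from the same values, so divide by n-1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


# Hypothetical sample with mean 7.
sample = [4, 5, 6, 9, 11]

print(sample_variance(sample))             # 8.5
print(statistics.variance(sample))         # standard library, also divides by n-1
print(variance_known_mean(sample, mu=8))   # mu = 8 is an assumed process mean
```

The standard library function statistics.variance uses the 'n-1' divisor, so it should agree with sample_variance here.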
The statistic xbar is an
unbiased estimate of μ.
It is an estimate based on a small sample,
so it is unlikely to exactly equal μ,
but it is not systematically too big
or too small. You might expect that using
this approximation would make the estimate
of the process variance less accurate,
but you might not expect it to introduce
a bias.
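We know that the deviations from xbar sum to zero:

    \sum_{i=1}^{n} (x_{i} - \bar{x}) = 0

A consequence of this (which
I won't prove) is that you could not substitute
any value for xbar that would give a
smaller value to the expression:

    \sum_{i=1}^{n} (x_{i} - \bar{x})^{2}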
Thus any difference between
xbar and the true value of μ
makes the sum of squares about xbar
smaller than the sum of squares about μ,
and so will always reduce the estimate of s^{2}.
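
You can check this numerically without a proof. The sketch below uses a hypothetical sample (not from the text) and tries several candidate centres in place of xbar. It relies on the standard identity that the sum of squares about any centre c equals the sum of squares about xbar plus n(c - xbar)^{2}, so no centre can do better than xbar.

```python
def sum_of_squares_about(values, centre):
    """Sum of squared distances of the values from a chosen centre."""
    return sum((x - centre) ** 2 for x in values)


# Hypothetical sample with xbar = 7.
sample = [4, 5, 6, 9, 11]
xbar = sum(sample) / len(sample)

# Candidate centres, standing in for possible true values of mu.
candidates = [5, 6, 6.5, 7, 7.5, 8, 9]
for centre in candidates:
    print(centre, sum_of_squares_about(sample, centre))

# The smallest sum of squares occurs at centre = xbar.
smallest = min(candidates, key=lambda c: sum_of_squares_about(sample, c))
print(smallest)
```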
The estimate of s^{2} is therefore biased, but it turns
out that dividing by 'n-1' instead of
'n' exactly compensates for this bias.
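
One way to see that the compensation is exact is to average the estimate over every possible sample. The sketch below assumes a hypothetical process (the six faces of a fair die, sampled with replacement in samples of three) and uses exact fractions so that rounding plays no part. None of these choices come from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical process: each face of a fair die is equally likely.
population = [1, 2, 3, 4, 5, 6]
n = 3  # sample size

# True process mean and variance (divide by the population size).
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)


def mean_of_estimates(divisor):
    """Average the variance estimate over every possible sample of size n."""
    total = Fraction(0)
    count = 0
    for sample in product(population, repeat=n):
        xbar = Fraction(sum(sample), n)
        total += sum((x - xbar) ** 2 for x in sample) / divisor
        count += 1
    return total / count


print(sigma2)                    # 35/12, the true variance
print(mean_of_estimates(n))      # 35/18, too small on average
print(mean_of_estimates(n - 1))  # 35/12, matches the true variance
```

Dividing by 'n' gives an average that is too small by the factor (n-1)/n, while dividing by 'n-1' reproduces the true variance exactly in this example.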
The reason for this is that,
because we originally calculated xbar
from the same x_{i} values, the
data values and the value of xbar are
not independent of each other. If you
know xbar and any 'n-1' of the data values then you can
calculate the missing value.
This is a specific example
of an important concept in statistics
known as the number of 'Degrees of Freedom'.
The number of degrees of freedom is the
number of independent data values in the
equation.
Put another way, the number of
degrees of freedom is the number of data
values you need to know to calculate the
remaining values if you know the values
of any statistics used in the formula.
In more advanced statistics
you will come across examples where you
might find it difficult to work out how
many degrees of freedom there are. It
will help greatly if you bear this rule
in mind.
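
The rule can be illustrated with the mean. If you know the mean of n values and any 'n-1' of the values, the last one is fixed. In the sketch below the function name and all the numbers are hypothetical and are not taken from the text.

```python
def missing_value(known_values, mean, n):
    """Recover the one unknown value from the mean and the other n-1 values.

    The n values must add up to n * mean, so the missing value is
    whatever is left after subtracting the known ones.
    """
    if len(known_values) != n - 1:
        raise ValueError("exactly n-1 values must be known")
    return n * mean - sum(known_values)


# Hypothetical data: five values with a mean of 10, four of them known.
known = [8, 12, 7, 13]
print(missing_value(known, mean=10, n=5))  # 10
```

Only four of the five values are free to vary once the mean is fixed, which is why the variance calculation has 'n-1' degrees of freedom.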

The average of the five values
is 7. Find the missing value,
X_{5}:

