Measures of variability

A measure of location alone is not enough to describe a population or sample, since it says nothing about the spread of the values. Distributions can differ considerably even when their means are the same, if their variability is different (see figure below).

[Figure: distributions with the same mean but different variability]

The spread within a population can be described by the standard deviation, the variance or the coefficient of variation. These can be used whenever you use the mean. If you use the median, be sure to describe the spread with quartiles or percentiles.

Standard deviation

The standard deviation gives you the average distance between the mean and the values within the population or sample. As with the mean, the standard deviation of a sample is an estimate of the standard deviation of the population, so different symbols are used to show whether we are talking about the estimate (s) or the real, true value (\sigma). The equations for calculating them also differ slightly.

The standard deviation for a population is calculated as:

\sigma=\sqrt{\frac{\sum (x-\mu)^2}{N}}

Where \sigma is the true standard deviation of the population, \mu is the true mean of the population, x is an observation value of the variable and N is the total number of units within the population.
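
Base R's sd() function uses the sample formula given below (dividing by n-1), so there is no built-in function for the population standard deviation. A minimal sketch in R, assuming the vector x holds every unit of a (hypothetical) population:

#treat x as a complete, hypothetical population
x<-c(12, 15, 11, 14, 13, 15, 10)
#population standard deviation: divide by N instead of n-1
sqrt(sum((x-mean(x))^2)/length(x))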

The estimated standard deviation based on a sample is calculated as:

s=\sqrt{\frac{\sum (x-\overline{x})^2}{n-1}}

Where s is the estimated standard deviation based on the sample, \overline{x} is the estimated mean, x is an observation value within the sample and n is the number of units within the sample.

The reason you take n-1 in the denominator of this equation is that you pay a penalty for using an estimate (the sample mean) instead of the true mean; n-1 is called the degrees of freedom. An important note on the standard deviation is that it is on the same scale as the variable, for example mm or US dollars. The standard deviation is an important measure of spread that forms the basis for the standard error and confidence intervals.

Example

Calculate the standard deviation of the following sample: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Calculate the square of the distance from every observation to the mean and sum them:

\sum (\overline{x}-x)^2 = 96.88

(2) Divide the sum of squares by the degrees of freedom (n-1):

\frac{\sum (\overline{x}-x)^2}{n-1}=\frac{96.88}{8} = 12.1

(3) Take the square root of the quotient:

\sqrt{\frac{\sum(\overline{x}-x)^2}{n-1}}=\sqrt{\frac{96.88}{8}} = \sqrt{12.1} = 3.48

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
sd(data)
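
As a quick check (an extra step, not part of the original example), the same value can be computed directly from the formula above and compared with the output of sd():

#manual calculation: sum of squared deviations divided by n-1, then the square root
sqrt(sum((data-mean(data))^2)/(length(data)-1))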

 

Variance

The variance is simply the square of the standard deviation and is useful in, for example, analysis of variance (ANOVA). The variance is thus not on the same scale as the mean, for example mm^2 or \$^2.

The variance for a population is calculated as:

\sigma^2=\frac{\sum (x-\mu)^2}{N}

Where \sigma^2 is the true variance of the population; the other terms are the same as for the standard deviation (\sigma).

The variance for a sample is calculated as:

s^2=\frac{\sum (x-\overline{x})^2}{n-1}

Where s^2 is the estimated variance of the population based on the sample; the other terms are the same as for the standard deviation.

Example

Calculate the variance of the following sample: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Calculate the square of the distance from every observation to the mean and sum them:

\sum (\overline{x}-x)^2 = 96.88

(2) Divide the sum of squares by the degrees of freedom (n-1):

\frac{\sum (\overline{x}-x)^2}{n-1}=\frac{96.88}{8} = 12.1

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
var(data)
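
Since the variance is just the square of the standard deviation, a quick check (again an extra step, not in the original) is to compare this with the squared output of sd():

#the variance equals the squared standard deviation
sd(data)^2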

 

Coefficient of variation

Sometimes you want to compare the variability of two populations. However, if the populations have different means, the standard deviation and variance do not tell you much about the spread of one population in relation to the other. Instead, you look at how large the standard deviation is in relation to the mean within each population. The coefficient of variation for a sample is calculated as:

CV=\frac{s}{\overline{x}} \times 100

Where CV is the coefficient of variation, s is the estimated standard deviation based on the sample and \overline{x} is the estimated mean based on the sample. Multiplying by 100 gives the CV in percent instead of as a proportion.

Example

Compare the variability of the weights of mice and elephants.

Mean of elephants = 2 694 266.6 grams

Standard deviation of elephants = 13 859.7 grams

CV elephants= \frac{13859.7}{2694266.6} \times 100 = 0.5 %

Mean of mice = 27.7 grams

Standard deviation of mice = 2.3 grams

CV mice= \frac{2.3}{27.7} \times 100 = 8.3 %

Note that the standard deviation of the elephants is much larger than that of the mice. Despite this, the variability within the mice population is larger than within the elephant population, because the standard deviation of the elephants is smaller in relation to their mean (0.5 %) than that of the mice (8.3 %).

How to do it in R

#Function for calculating the CV

	CV<-function(m,s) s/m * 100

#CV of elephants
	mean_elephants<-2694266.6
	sd_elephants<-13859.7

	CV.elephants<-CV(mean_elephants,sd_elephants)
	CV.elephants

#CV of mice
	mean_mice<-27.7
	sd_mice<-2.3

	CV.mice<-CV(mean_mice,sd_mice)
	CV.mice
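
As an aside (a hypothetical extra step, not part of the original example), the same helper can also be applied to a raw data vector through its sample mean and standard deviation:

	#CV of the sample used in the earlier examples
	data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
	CV(mean(data),sd(data))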

 

Quartiles and percentiles

Whenever you use the median, you should present the spread of your population or sample using quartiles or percentiles. As I mentioned before, the median is the middle value of all observations: 50 % of the data values lie below it and 50 % above it. The median is also called the 50th percentile or the second quartile (Q_2). To show the spread you simply calculate the 25th and 75th percentiles, which are the first (Q_1) and third (Q_3) quartiles. These are the middle values of the lower and upper halves of the ranked data set, respectively. This kind of data is usually presented with boxplots. They are called quartiles since they divide the data set into four parts: (1) from the lowest value to the first quartile, (2) from the first quartile to the second (the median), (3) from the second to the third, and (4) from the third quartile to the highest value.


Example

Calculate the median, the first and the third quartile of the following dataset: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Organize the data in ascending order: 0, 3, 3, 4, 4, 5, 7, 8, 12
(2) Calculate the median; the middle value is 4: 0, 3, 3, 4, 4, 5, 7, 8, 12
(3) Calculate the lower (Q_1) and upper (Q_3) quartiles.

We have a slight problem here: there is an even number of values below and above the median (four values on each side). How do we calculate the quartiles? Simply include the median in the calculation of both quartiles. You do not have to do this when the lower and upper halves of the data each contain an odd number of values.

To calculate the first quartile, take the median of 0, 3, 3, 4, 4, which is 3. To calculate the third quartile, take the median of 4, 5, 7, 8, 12, which is 7.

The median = 4, Q_1 = 3 and Q_3 = 7. This means that 25 % of the values fall above 7, 25 % fall below 3, and 50 % fall between 3 and 7.

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
#The first quartile or 25th percentile
quantile(data)[2]
#The third quartile or 75th percentile
quantile(data)[4]
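
Note that quantile() uses a slightly different default rule for computing percentiles than the hand calculation above, but for this dataset it returns the same values, Q_1 = 3 and Q_3 = 7. The median and both quartiles can also be obtained in one call with summary(), and, as mentioned above, this kind of data is usually presented with a boxplot:

#The median or 50th percentile
median(data)
#Minimum, first quartile, median, mean, third quartile and maximum
summary(data)
#Boxplot showing the median, the quartiles and the spread of the data
boxplot(data)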