
## Measures of variability

A measure of location alone is not enough to describe a population or sample, because it says nothing about the spread of the values. Two distributions can differ significantly when the mean is the same but the variability is different (see figure below).

The spread within a population can be described by the standard deviation, variance or coefficient of variation. These can be used whenever you use the mean. If you use the median, be sure to describe the spread with quartiles or percentiles.
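A minimal sketch in R of this point. The two samples `narrow` and `wide` are made-up values chosen for illustration, not data from this text: both have the same mean but a very different spread.

```r
# Two hypothetical samples with the same mean but different spread
narrow <- c(4, 5, 6)
wide   <- c(0, 5, 10)

mean(narrow)  # 5
mean(wide)    # 5

sd(narrow)    # 1
sd(wide)      # 5
```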

### Standard deviation

The standard deviation describes the typical distance between the mean and the values within the population or sample. As with the mean, the standard deviation of a sample is an estimate of the standard deviation of the population. Different symbols are therefore used to show whether we are talking about the estimate ($s$) or the real, true value ($\sigma$) of the standard deviation. The equations for calculating them also differ slightly.

The standard deviation for a population is calculated as:

$\sigma = \sqrt{\frac{\sum (\mu-x)^2}{N}}$

where $\sigma$ is the true standard deviation of the population, $\mu$ is the true mean of the population, $x$ is an observation value of the variable and $N$ is the total number of units within the population.

The estimated standard deviation based on a sample is calculated as:

$s = \sqrt{\frac{\sum (\overline{x}-x)^2}{n-1}}$

where $s$ is the estimated standard deviation based on the sample, $\overline{x}$ is the estimated mean, $x$ is an observation value within the sample and $n$ is the number of units within the sample.

The reason for using $n-1$ in the denominator of this equation is that you pay a penalty for making an estimation: the mean is itself estimated from the same sample, which uses up one of the $n$ independent pieces of information. The remaining $n-1$ is called the degrees of freedom. An important note on the standard deviation is that it is on the same scale as the variable, for example mm or US dollars. The standard deviation is an important measure of spread that forms the basis for the standard error and confidence intervals.
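As a rough illustration of the two denominators, the sketch below computes both versions for the same numbers. R's built-in `sd()` always divides by $n-1$. `pop_sd` is a hypothetical helper written for this illustration (it is not a base R function), and the values in `x` are made up.

```r
# Hypothetical helper: standard deviation with N in the denominator
pop_sd <- function(x) sqrt(sum((mean(x) - x)^2) / length(x))

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values for illustration

pop_sd(x)  # treats x as the whole population (divides by N)
sd(x)      # treats x as a sample (divides by n - 1), so it is slightly larger
```

The difference between the two shrinks as the number of observations grows.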

Exam­ple

Cal­cu­late the stan­dard devi­a­tion of the fol­low­ing sam­ple: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Calculate the square of the distance from every observation to the mean ($\overline{x} = 46/9 = 5.11$) and sum them: $\sum (\overline{x}-x)^2 = 96.89$

(2) Divide the sum of squares by the degrees of freedom ($n-1$): $\frac{\sum (\overline{x}-x)^2}{n-1}=\frac{96.89}{8} = 12.11$

(3) Take the square root of the quotient: $\sqrt{\frac{\sum(\overline{x}-x)^2}{n-1}}=\sqrt{\frac{96.89}{8}} = \sqrt{12.11} = 3.48$

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
sd(data)
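The three steps of the worked example can also be retraced by hand in R, roughly as follows. The variable names `ss`, `quotient` and `s` are chosen here for illustration; the last line checks the manual result against `sd()`.

```r
data <- c(7, 8, 4, 0, 4, 5, 12, 3, 3)

# (1) Sum of squared distances to the mean
ss <- sum((mean(data) - data)^2)
# (2) Divide by the degrees of freedom (n - 1)
quotient <- ss / (length(data) - 1)
# (3) Take the square root
s <- sqrt(quotient)

round(c(ss, quotient, s), 2)
all.equal(s, sd(data))  # TRUE if the manual result matches sd()
```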


### Variance

The variance is simply the square of the standard deviation and is useful in, for example, Analysis of Variance (ANOVA). The variance is thus not on the same scale as the mean but in squared units, for example $mm^2$ or $\$^2$.

The variance for a population is calculated as:

$\sigma^2 = \frac{\sum (\mu-x)^2}{N}$

where $\sigma^2$ is the true variance of the population; the other terms are the same as for the standard deviation ($\sigma$).

The variance for a sample is calculated as:

$s^2 = \frac{\sum (\overline{x}-x)^2}{n-1}$

where $s^2$ is the estimated variance of the population based on the sample; the other terms are the same as for the standard deviation ($s$).

Exam­ple

Cal­cu­late the vari­ance of the fol­low­ing sam­ple: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Calculate the square of the distance from every observation to the mean and sum them: $\sum (\overline{x}-x)^2 = 96.89$

(2) Divide the sum of squares by the degrees of freedom ($n-1$): $\frac{\sum (\overline{x}-x)^2}{n-1}=\frac{96.89}{8} = 12.11$

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
var(data)
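A quick check, as a sketch, that the variance really is the square of the standard deviation for this sample, and that taking the square root brings the value back to the original scale:

```r
data <- c(7, 8, 4, 0, 4, 5, 12, 3, 3)

# The sample variance equals the squared sample standard deviation
all.equal(var(data), sd(data)^2)

# The square root of the variance is on the same scale as the data again
sqrt(var(data))
```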


### Coefficient of variation

Sometimes you want to compare the variability of two populations. However, the standard deviation and variance do not tell you much about the spread of one population relative to the other if they have different means. So what you do is look at how large the standard deviation is in relation to the mean within each population.

The coefficient of variation for a sample is calculated as:

$CV = \frac{s}{\overline{x}} \times 100$

where $CV$ is the coefficient of variation, $s$ is the estimated standard deviation based on the sample and $\overline{x}$ is the estimated mean based on the sample. Multiplying by 100 gives the $CV$ in percent instead of as a proportion.

Exam­ple

Com­pare the vari­abil­i­ty of the weights of mice and elephants.

Mean of elephants = 2 694 266.6 grams

Standard deviation of elephants = 13 859.7 grams

$CV_{elephants} = \frac{13859.7}{2694266.6} \times 100 = 0.5\ \%$

Mean of mice = 27.7 grams

Standard deviation of mice = 2.3 grams

$CV_{mice} = \frac{2.3}{27.7} \times 100 = 8.3\ \%$

Note that the standard deviation of the elephants is much larger than that of the mice. Despite this, the variability within the mice population is larger than within the elephant population. This is because the standard deviation of the elephants is smaller in relation to their mean (0.5 %) than that of the mice (8.3 %).

How to do it in R

#Function for calculating the CV (in percent) from a mean m and a standard deviation s
CV<-function(m,s) s/m * 100

#CV of elephants
mean_elephants<-2694266.6
sd_elephants<-13859.7

CV.elephants<-CV(mean_elephants,sd_elephants)
CV.elephants

#CV of mice
mean_mice<-27.7
sd_mice<-2.3

CV.mice<-CV(mean_mice,sd_mice)
CV.mice
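When the raw observations are available rather than a precomputed mean and standard deviation, the CV can be computed directly from the data. This is a small sketch: `cv_sample` is a hypothetical helper name, and the vector is the sample reused from the earlier examples. Keep in mind that the CV is only meaningful for variables with a true zero and a mean well away from zero.

```r
# Hypothetical helper: CV in percent computed directly from a sample
cv_sample <- function(x) sd(x) / mean(x) * 100

x <- c(7, 8, 4, 0, 4, 5, 12, 3, 3)  # sample reused from the earlier examples
cv_sample(x)
```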


### Quartiles and percentiles

Whenever you use the median, you should present the spread of your population or sample using quartiles or percentiles. As I mentioned before, the median is the middle value of all observations: half of the remaining values lie below it and half lie above it. The median is also called the 50th percentile or second quartile ($Q_2$). To show the spread you simply calculate the 25th and 75th percentiles, which are the first ($Q_1$) and third quartile ($Q_3$). These are the middle values of the lower and upper half of the ranked data set, respectively. This kind of data is usually presented with boxplots.

They are called quartiles since they divide the data set into four parts: (1) from the lowest value to the first quartile, (2) from the first quartile to the second (the median), (3) from the second to the third and (4) from the third quartile to the highest value.

Example

Cal­cu­late the medi­an, the first and third quar­tile of the fol­low­ing dataset: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Organize the data in ascending order: 0, 3, 3, 4, 4, 5, 7, 8, 12
(2) Calculate the median; the middle value (the fifth of nine) equals 4.
(3) Calculate the upper ($Q_3$) and lower ($Q_1$) quartiles.

We have a slight problem here: there is an even number of values (four) below and above the median, so neither half has a single middle value. How do we calculate the quartiles? Simply include the median in both halves when calculating the quartiles. You don't have to do this if the lower and upper 50 % of the data each contain an odd number of values.

To calculate the first quartile, calculate the median of 0, 3, 3, 4, 4. That is 3. To calculate the third quartile, calculate the median of 4, 5, 7, 8, 12, which is 7.

The median = 4, $Q_1$ = 3 and $Q_3$ = 7. That means that roughly the upper 25 % of the values fall above 7, roughly 25 % of the values fall below 3, and about 50 % of the values fall between 3 and 7.
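The hand calculation above can be sketched in R as follows. `quartiles_incl` is a hypothetical helper written for this illustration, and it makes one simplifying assumption: whenever the number of observations is odd, the median is included in both halves, which is one common convention among several.

```r
# Hypothetical helper: quartiles with the median included in both halves
quartiles_incl <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- ceiling(n / 2)         # size of each half; includes the median when n is odd
  lower <- x[1:half]             # lowest value up to the middle of the data
  upper <- x[(n - half + 1):n]   # middle of the data up to the highest value
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

quartiles_incl(c(7, 8, 4, 0, 4, 5, 12, 3, 3))  # Q1 = 3, Q2 = 4, Q3 = 7
```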

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
#The first quartile or 25th percentile
quantile(data, 0.25)
#The third quartile or 75th percentile
quantile(data, 0.75)
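Base R also offers `fivenum()` and `IQR()`, which summarize the same information in one call each. The sketch below applies them to the example data. Note that `quantile()` supports several quantile definitions through its `type` argument, so for other data sets its result may differ slightly from the hand calculation.

```r
data <- c(7, 8, 4, 0, 4, 5, 12, 3, 3)

# Minimum, lower hinge, median, upper hinge, maximum
fivenum(data)  # 0 3 4 7 12

# Interquartile range: the distance between the third and first quartile
IQR(data)      # 4
```

Calling `boxplot(data)` draws the corresponding box plot.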