Measures of variability

A measure of location alone is not enough to describe a population or sample, because it says nothing about the spread of the values. Two distributions can have the same mean and still look very different if their variability differs (see figure below).

 

[Figure: Histogram variability]
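To make the point concrete, here is a small sketch in R. The two samples are made-up values chosen only for illustration (they are not from the figure), and the names narrow and wide are my own. Both samples have the same mean, but their spread is clearly different.

```r
# Two made-up samples (illustrative values only, not taken from the figure)
narrow <- c(4, 5, 5, 5, 6)
wide   <- c(1, 3, 5, 7, 9)

# Same measure of location: both means are 5
mean(narrow)
mean(wide)

# Different spread: the standard deviation of 'wide' is much larger
sd(narrow)
sd(wide)
```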

The spread within a population can be described by the standard deviation, the variance or the coefficient of variation. These can be used whenever you use the mean. If you use the median, describe the spread with quartiles or percentiles instead.

Standard deviation

The standard deviation tells you roughly how far the values within the population or sample typically lie from the mean. As with the mean, the standard deviation of a sample is an estimate of the standard deviation of the population. Different symbols are therefore used to show whether we are talking about the estimate (s) or the true value (\sigma) of the standard deviation. The equations for calculating them also differ slightly.

The standard deviation for a population is calculated as:

\sigma = \sqrt{\frac{\sum (\mu-x)^2}{N}}

Where \sigma is the true standard deviation of the population, \mu is the true mean of the population, x is an observation value of the variable and N is the total number of units within the population.

The estimated standard deviation based on a sample is calculated as:

s = \sqrt{\frac{\sum (\overline{x}-x)^2}{n-1}}

Where s is the estimated standard deviation based on the sample, \overline{x} is the estimated mean, x is an observation value within the sample and n is the number of units within the sample.

The reason for using n-1 in the denominator of this equation is that the mean itself is estimated from the same sample, so you pay a penalty for making an estimate. The quantity n-1 is called the degrees of freedom. An important property of the standard deviation is that it is on the same scale as the variable, for example mm or US dollars. The standard deviation is an important measure of spread that forms the basis for the standard error and confidence intervals.

Example

Calculate the standard deviation of the following sample: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Calculate the squared distance from every observation to the mean (here \overline{x} = 46/9 = 5.11) and sum them:

\sum (\overline{x}-x)^2 = 96.89

(2) Divide the sum of squares by the degrees of freedom (n-1):

\frac{\sum (\overline{x}-x)^2}{n-1}=\frac{96.89}{8} = 12.1

(3) Take the square root of the quotient:

\sqrt{\frac{\sum(\overline{x}-x)^2}{n-1}}=\sqrt{\frac{96.89}{8}} = \sqrt{12.1} = 3.48

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
sd(data)
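If you want to see what sd() does, the three steps of the example can be followed by hand. The sketch below is only an illustration: the variable names (x, ss, s2, s, sigma) are my own. Note that sd() always divides by n-1, so the population formula (dividing by N) has to be written out yourself, and it is only appropriate when the data really are the whole population.

```r
x <- c(7, 8, 4, 0, 4, 5, 12, 3, 3)
n <- length(x)

# (1) Sum of the squared distances from every observation to the mean
ss <- sum((mean(x) - x)^2)

# (2) Divide by the degrees of freedom (n - 1)
s2 <- ss / (n - 1)

# (3) Take the square root
s <- sqrt(s2)

# Rounded to two decimals: ss = 96.89, s2 = 12.11, s = 3.48
round(c(ss = ss, s2 = s2, s = s), 2)

# The manual result agrees with the built-in function
all.equal(s, sd(x))

# Population formula: divide by N instead of n - 1
# (assumes x is the complete population, which is rarely the case)
sigma <- sqrt(ss / n)
sigma
```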

 

Variance

The variance is simply the square of the standard deviation and is useful in, for example, Analysis of Variance (ANOVA). The variance is therefore not on the same scale as the variable but in squared units, for example mm^2 or \$^2.

The variance for a population is calculated as:

\sigma^2 = \frac{\sum (\mu-x)^2}{N}

where \sigma^2 is the true variance of the population. The other terms are the same as for the standard deviation (\sigma).

The variance for a sample is calculated as:

s^2 = \frac{\sum (\overline{x}-x)^2}{n-1}

where s^2 is the estimated variance of the population based on the sample. The other terms are the same as for the standard deviation.

Example

Calculate the variance of the following sample: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Calculate the squared distance from every observation to the mean and sum them:

\sum (\overline{x}-x)^2 = 96.89

(2) Divide the sum of squares by the degrees of freedom (n-1):

\frac{\sum (\overline{x}-x)^2}{n-1}=\frac{96.89}{8} = 12.1

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
var(data)
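A quick way to convince yourself that the variance is the square of the standard deviation is to compare the two in R. This is just a small check; the names x and s2_manual are my own.

```r
x <- c(7, 8, 4, 0, 4, 5, 12, 3, 3)

# Sample variance written out: sum of squares divided by n - 1
s2_manual <- sum((mean(x) - x)^2) / (length(x) - 1)

# Agrees with the built-in var() ...
all.equal(s2_manual, var(x))

# ... and with the squared standard deviation
all.equal(var(x), sd(x)^2)
```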

 

Coefficient of variation

Sometimes you want to compare the variability of two populations. However, the standard deviation and variance do not tell you much about the spread of one population relative to the other if they have different means. Instead, you look at how large the standard deviation is in relation to the mean within each population. The coefficient of variation for a sample is calculated as:

CV = \frac{s}{\overline{x}} \times 100

where CV is the coefficient of variation, s is the estimated standard deviation based on the sample and \overline{x} is the estimated mean based on the sample. Multiplying by 100 gives the CV in percent instead of as a proportion.

Example

Compare the variability of the weights of mice and elephants.

Mean of elephants = 2 694 266.6 grams

Standard deviation of elephants = 13 859.7 grams

CV elephants= \frac{13859.7}{2694266.6} \times 100 = 0.5 %

Mean of mice = 27.7 grams

Standard deviation of mice = 2.3 grams

CV mice= \frac{2.3}{27.7} \times 100 = 8.3 %

Note that the standard deviation of the elephant population is much larger than that of the mice population. Despite this, the variability within the mice population is larger than within the elephant population. This is because the standard deviation of the elephants is smaller in relation to their mean (0.5 %) than that of the mice (8.3 %).

How to do it in R

#Function for calculating the CV

	CV<-function(m,s) s/m * 100

#CV of elephants
	mean_elephants<-2694266.6
	sd_elephants<-13859.7

	CV.elephants<-CV(mean_elephants,sd_elephants)
	CV.elephants

#CV of mice
	mean_mice<-27.7
	sd_mice<-2.3

	CV.mice<-CV(mean_mice,sd_mice)
	CV.mice
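If you have the raw observations instead of a mean and standard deviation that are already calculated, the CV can be computed directly from the data. The helper below is a sketch; the name cv_from_data is my own, and it assumes a numeric vector with a positive mean. It is applied here to the small sample used in the earlier examples.

```r
# Hypothetical helper: CV in percent, calculated from the raw observations
cv_from_data <- function(x) sd(x) / mean(x) * 100

x <- c(7, 8, 4, 0, 4, 5, 12, 3, 3)

# Roughly 68 %: the standard deviation (3.48) relative to the mean (5.11)
cv_x <- cv_from_data(x)
cv_x
```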

 

Quartiles and percentiles

Whenever you use the median, you should present the spread of your population or sample using quartiles or percentiles. As I mentioned before, the median is the middle value of all observations: half of the remaining values lie below it and half lie above it. The median is also called the 50th percentile or second quartile (Q_2). To show the spread you simply calculate the 25th and 75th percentiles, which are the first (Q_1) and third (Q_3) quartiles. These are the middle values of the lower and upper halves of the ranked data set, respectively. This kind of data is usually presented with boxplots. They are called quartiles since they divide the data set into four parts: (1) from the lowest value to the first quartile, (2) from the first quartile to the second (the median), (3) from the second to the third and (4) from the third quartile to the highest value.


Example

Calculate the median, the first and third quartile of the following dataset: 7, 8, 4, 0, 4, 5, 12, 3, 3

(1) Organize the data in ascending order: 0, 3, 3, 4, 4, 5, 7, 8, 12
(2) Calculate the median; the middle (fifth) value is 4: 0, 3, 3, 4, 4, 5, 7, 8, 12
(3) Calculate the upper (Q_3) and lower (Q_1) quartiles.

We have a slight problem here, as there is an even number of values (four) on each side of the median. How do we calculate the quartiles? Simply include the median in the calculation of both quartiles. You do not have to do this if the lower and upper halves of the data each contain an odd number of values.

To calculate the first quartile, take the median of 0, 3, 3, 4, 4, which is 3. To calculate the third quartile, take the median of 4, 5, 7, 8, 12, which is 7.

The median = 4, Q_1=3 and Q_3=7. That means that roughly 25 % of the values fall above 7, roughly 25 % fall below 3, and roughly 50 % fall between 3 and 7.

How to do it in R

data<-c(7, 8, 4, 0, 4, 5, 12, 3, 3)
#The first quartile or 25th percentile
quantile(data)[2]
#The third quartile or 75th percentile
quantile(data)[4]
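The manual method from the example (including the median in both halves) can also be written out in R. This is a sketch with my own variable names (x, half, lower, upper). For this data set it gives the same quartiles as quantile() with its default settings and as fivenum(). Keep in mind that quantile() offers several calculation methods through its type argument, so for other data sets the results can differ slightly from the manual method.

```r
x <- sort(c(7, 8, 4, 0, 4, 5, 12, 3, 3))
n <- length(x)

# Size of each half; for an odd n this includes the median in both halves
half <- ceiling(n / 2)

lower <- x[1:half]            # 0 3 3 4 4
upper <- x[(n - half + 1):n]  # 4 5 7 8 12

# Q1 = 3, median = 4, Q3 = 7
c(Q1 = median(lower), median = median(x), Q3 = median(upper))

# Tukey's five-number summary: minimum, Q1, median, Q3, maximum
fivenum(x)
```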