Frequencies and probabilities

Here I present another way to describe your data: by frequencies and probabilities. This gives a graphical overview of your entire data set, which helps you see how the values you have observed are distributed.

The frequency table and histogram

I think the frequency table and histogram are best explained by an example:

You want to describe the height of men within a community, so you randomly collect 100 independent heights. The data look like…

Data_freq

This array of numbers gives no real feeling for how the sample is distributed, so you need to make a frequency table. You can do this in two ways: with tallies or with numbers.

First, the data can be presented in a table using tallies, which gives a visual impression of the data:

Tallies

The frequency table using tallies gives you an overview of the data and a sense of how they are distributed. However, it takes time to build this kind of table by hand, so it is not very practical. It actually resembles a barplot, and we'll make one of those in a moment. But first, here is the second way: the frequency table using numbers instead of tallies.

Frequency table

If you sum these frequencies you'll find that they add up to 100, so all the observations are accounted for. Now we can make a plot describing the sample:
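A frequency table like this can be sketched in R with table(). The example below uses a small made-up vector of ten heights, not the 100 observations from the example above, and the names heights and freq are only illustrative.

```r
# Hypothetical sample of ten heights in cm (made up for illustration)
heights <- c(170, 172, 170, 175, 172, 170, 168, 175, 170, 172)

# Count how many times each unique height occurs
freq <- table(heights)
print(freq)

# The frequencies should add up to the number of observations
sum(freq) == length(heights)
```

The names of freq are the unique heights and its values are the counts, which is the layout barplot() expects.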

Barplot

Probabilities

The data can be presented in another way: by using probabilities or proportions. If you divide the frequency of each height by the total number of units in your sample or population, you get the proportion for that height, in other words the share of the data set that the height takes up. This is also the probability of the height: the probability that a randomly chosen man in your community has this height (an estimate of it, when you are working with a sample).

So, we calculate the probability for each height by dividing its frequency by 100, the total number (n) of units in the sample:

Prop_freq

Check that everything is correct by summing all the probabilities; they should add up to exactly 1. Presenting this in a plot you get:
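The division can be sketched in R as follows. Again the ten heights are made up for illustration, and the names heights, freq and prob are assumptions of this sketch rather than part of the example data.

```r
# Hypothetical sample of ten heights in cm (made up for illustration)
heights <- c(170, 172, 170, 175, 172, 170, 168, 175, 170, 172)

# Frequency of each unique height
freq <- table(heights)

# Proportion = frequency / total number of observations (n)
# prop.table(freq) would give the same result
prob <- freq / sum(freq)
print(prob)

# Check that the proportions add up to 1; all.equal() allows for the
# tiny rounding error that floating-point arithmetic can introduce
isTRUE(all.equal(sum(prob), 1))
```

Comparing with all.equal() rather than == is a cautious choice: on a computer the sum may differ from 1 by a negligible amount even when the arithmetic is right.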

Barplot_prob

The class interval

It is not always convenient to display every value on the x-axis of the plot. There might be hundreds of unique values, and many of them might have only one observation. To deal with this problem we can group the height values arbitrarily. Let's say you group the heights into groups of five. Then, for example, the height 165 is part of the interval 163-167. This interval is called the class interval, and its midpoint, 165, is called the class mark. You then sum the frequencies of all heights within the interval and display the sum for the class interval. This is how it looks in a table:

Class interval

Smoother, hey? This table takes up less space than the previous one and presents a smoother distribution of the values. What does this look like in a plot? A barplot made from these class intervals is actually called a histogram:

Histogram frequencies
Histogram probabilities
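The grouping into class intervals can be sketched in R with cut(). The heights, the three class marks and all variable names below are made up for illustration. The sketch also assumes whole-centimetre heights, so that the boundaries of a class sit 2.5 cm on either side of its class mark.

```r
# Hypothetical heights in cm (made up for illustration)
heights <- c(163, 165, 167, 168, 170, 171, 172, 172, 174, 176)

# Class marks (midpoints) of the intervals 163-167, 168-172 and 173-177
class_marks <- c(165, 170, 175)

# Class boundaries: half a class width below each mark, plus the upper end
breaks <- c(class_marks - 2.5, max(class_marks) + 2.5)

# Assign every height to its class interval and count the members
classes <- cut(heights, breaks = breaks,
               labels = c("163-167", "168-172", "173-177"))
class_freq <- table(classes)
print(class_freq)

# hist() performs the same grouping; plot = FALSE returns the numbers only
hh <- hist(heights, breaks = breaks, plot = FALSE)
hh$counts  # frequency per class interval
hh$mids    # midpoints of the intervals, i.e. the class marks
```

Passing the class boundaries, rather than the class marks, as breaks is what makes the bars line up with intervals such as 163-167.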

Important to remember

(1) Frequency tables and histograms are important tools to describe the distribution of a sample or population

(2) The frequencies can easily be transformed to probabilities

(3) A histogram is a graphical presentation of the distribution of your data

How to make the graphs on this page in R

#Getting the data

	#barplot2() comes from the gplots package; install it once with install.packages("gplots")
	library(gplots)

	#For barplots

		p<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/data_prob.table_.csv",header=TRUE,dec=",",sep=";")

	#For histograms

		h<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/hist.csv",header=TRUE,dec=",",sep=";")
		raw.data<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/data_hist.csv",header=TRUE,dec=",",sep=";")
		raw.data<-as.numeric(raw.data[,1])

#Barplot with frequencies

		barplot2(p$Frequency,names=p$Height,col = "#C20000", main = "Frequency barplot",
    			 xlab = "Height (cm)", ylab="Frequency", las=1,cex.lab=1.2)

		abline(h=0)

#Barplot with probabilities

		#Open a new graphics window; dev.new() is a portable alternative to x11()
		dev.new()

		barplot2(p$Probability,names=p$Height,col = "#C20000", main = "Probability barplot",
    			 xlab = "Height (cm)", ylab="Probability", las=1,cex.lab=1.2)

		abline(h=0)

#Histogram with frequencies

	#The breaks are read from the Class.mark column of hist.csv;
	#hist() needs them to span the full range of raw.data
	hist(raw.data, breaks = h$Class.mark,
		freq = TRUE, col = "#C20000", main = "Histogram with frequencies",
		xlab = "Height (cm)", xaxt="n",ylab="Frequency",las=1,cex.lab=1.2)
	axis(side=1,at=h$Class.mark,labels=h$Class.mark,cex.lab=1.2,las=1,pos=0)

	abline(h=0)

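#Histogram with probabilities

	#freq = FALSE would plot densities (probability divided by class width), not probabilities,
	#so compute the histogram without plotting and rescale the counts to proportions
	hp<-hist(raw.data, breaks = h$Class.mark, plot = FALSE)
	hp$counts<-hp$counts/sum(hp$counts)

	plot(hp, freq = TRUE, col = "#C20000", main = "Histogram with probabilities",
		xlab = "Height (cm)", xaxt="n",ylab="Probability",las=1,cex.lab=1.2)
	axis(side=1,at=h$Class.mark,labels=h$Class.mark,cex.lab=1.2,las=1,pos=0)

	abline(h=0)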