Frequencies and probabilities

Here I present another way to describe your data: by frequencies and probabilities. This gives a graphic overview of your entire data set, which helps in determining how the values you have observed are distributed.

The frequency table and histogram

I think the frequency table and histogram are best explained by an example:

You want to describe the height of men within a community, so you randomly collect 100 independent heights. The data looks like…

This array does not really give any feeling for how the sample is distributed. So what you need to do is make a frequency table. You can do this in two ways:

The data is presented in a table using tallies, which gives a visual presentation of the data.

The frequency table using tallies gives you an overview of the data and a sense of how it is distributed. However, this kind of table takes time to make and is therefore not very practical. It actually resembles a barplot; we'll make one of those in a moment. But first we present the frequency table using numbers instead of tallies.

If you sum these values you'll find that they add up to 100, so all the observations are accounted for. Now we can make a plot describing the sample:
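The table-building step can be sketched in R roughly as follows. The heights here are simulated stand-ins (the variable names and the normal distribution with mean 178 and standard deviation 7 are assumptions made only for illustration), not the measurements used in this article.

```r
# Hypothetical data: 100 simulated heights in whole centimetres.
# These are NOT the article's measurements.
set.seed(1)
heights <- round(rnorm(100, mean = 178, sd = 7))

# table() counts how many times each unique height occurs
freq <- table(heights)
print(freq)

# All observations should be accounted for
total <- sum(freq)
print(total)  # 100

# A simple base-R barplot of the frequencies
barplot(freq, xlab = "Height (cm)", ylab = "Frequency", las = 1)
```

table() does the counting that the tallies do by hand, which is why the numeric frequency table is the more practical of the two.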

Probabilities

The data can be presented in another way: by using probabilities or proportions. If you divide the frequency for each height by the total number of units in your sample or population, you get the proportion for that height, in other words the share of the data set that the height takes up. This is also the probability of the height (an estimate, when it is computed from a sample): the probability that a man in your community will have this height.

So we go on to calculate the probability for each height by dividing its frequency by 100, the total number (n) of units in the sample:

Check that everything is correct by summing all the probabilities; they should add up to exactly 1. Presenting this in a plot you get:
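As a minimal sketch of the division described above, again on simulated heights rather than the article's data (the names heights, freq, n and prob are illustrative assumptions):

```r
# Hypothetical data: 100 simulated heights, not the article's measurements
set.seed(1)
heights <- round(rnorm(100, mean = 178, sd = 7))

freq <- table(heights)   # frequency of each unique height
n    <- sum(freq)        # total number of units in the sample

# Proportion (probability) of each height = frequency / n
prob <- freq / n
print(round(prob, 2))

# prop.table() performs the same division in one step
same_result <- isTRUE(all.equal(as.numeric(prob), as.numeric(prop.table(freq))))

# The probabilities should sum to 1; all.equal() compares with a small
# tolerance, which guards against floating-point rounding
sums_to_one <- isTRUE(all.equal(sum(prob), 1))
print(sums_to_one)
```

Comparing with all.equal() rather than == is a cautious choice: on paper the sum is exactly 1, but computed fractions can be off by a tiny rounding error.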

The class interval

It is not always very convenient to display all values on the x-axis in the plot. There might be hundreds of unique values, and many of them might have only one observation. To deal with this problem we can arbitrarily group the height values. Let's say you group the heights into groups of five. Then, for example, the height of 165 is part of the interval 163–167. This interval is called the class interval, and the height 165, the midpoint of the interval, is called the class mark. Then sum the frequencies for all heights within that interval and display the total for the class interval. This is how it looks in a table:

Smoother, hey? This table does not take up as much space as the former and presents a smoother distribution of the values. How does this look in the plot? A barplot made on these class intervals is actually called a histogram:
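The grouping into class intervals can be sketched in R like this. It is only an illustration under stated assumptions: the heights are simulated whole centimetres, and the class mark is taken as the nearest multiple of 5, so that 163 to 167 all map to 165 as in the example above. All names are hypothetical.

```r
# Hypothetical data: 100 simulated heights, not the article's measurements
set.seed(1)
heights <- round(rnorm(100, mean = 178, sd = 7))

# Class mark = nearest multiple of 5 (163, 164, 165, 166, 167 -> 165)
class_mark <- 5 * round(heights / 5)

# Sum the frequencies within each class interval
class_freq <- table(class_mark)
marks <- as.numeric(names(class_freq))

class_table <- data.frame(
  interval  = paste0(marks - 2, "-", marks + 2),
  mark      = marks,
  frequency = as.vector(class_freq)
)
print(class_table)

# For the histogram, the breaks are the class boundaries,
# which lie 2.5 cm on either side of each class mark
breaks <- seq(min(marks) - 2.5, max(marks) + 2.5, by = 5)
hist(heights, breaks = breaks, xlab = "Height (cm)",
     main = "Histogram with class intervals of 5 cm", las = 1)
```

Putting the breaks at half-centimetres means no whole-centimetre height ever falls exactly on a boundary, so each observation belongs to exactly one class.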

Important to remember

(1) Frequency tables and histograms are important tools to describe the distribution of a sample or population

(2) The frequencies can easily be transformed to probabilities

(3) A histogram is a graphical presentation of the distribution of your data

How to make the graphs on this page in R

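#Getting the data

#barplot2() comes from the gplots package; install it once with install.packages("gplots")
library(gplots)

#For barplots
p<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/data_prob.table_.csv",header=TRUE,dec=",",sep=";")

#For histograms: class marks and the raw heights
h<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/hist.csv",header=TRUE,dec=",",sep=";")
raw.data<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/data_hist.csv",header=TRUE,dec=",",sep=";")

#Keep the first column as a numeric vector
raw.data<-as.numeric(raw.data[,1])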

#Barplot with frequencies

barplot2(p$Frequency, names.arg = p$Height, col = "#C20000", main = "Frequency barplot",
         xlab = "Height (cm)", ylab = "Frequency", las = 1, cex.lab = 1.2)

#Draw the baseline
abline(h=0)

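#Barplot with probabilities

#Open a new plot window (dev.new() works on all platforms)
dev.new()

#The labels come from p$Height; the data frame is called p, not f
barplot2(p$Probability, names.arg = p$Height, col = "#C20000", main = "Probability barplot",
         xlab = "Height (cm)", ylab = "Probability", las = 1, cex.lab = 1.2)

#Draw the baseline
abline(h=0)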

#Histogram with frequencies

#xaxt="n" suppresses the default x-axis so that it can be drawn at the class marks
hist(raw.data, breaks = h$Class.mark,
     freq = TRUE, col = "#C20000", main = "Histogram with frequencies",
     xlab = "Height (cm)", xaxt = "n", ylab = "Frequency", las = 1, cex.lab = 1.2)
axis(side = 1, at = h$Class.mark, labels = h$Class.mark, cex.lab = 1.2, las = 1, pos = 0)

abline(h=0)

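#Histogram with probabilities

#freq = FALSE makes hist() plot densities: the bar areas sum to 1,
#and bar height times class width gives the proportion in that class
hist(raw.data, breaks = h$Class.mark,
     freq = FALSE, col = "#C20000", main = "Histogram with probabilities",
     xlab = "Height (cm)", xaxt = "n", ylab = "Density", las = 1, cex.lab = 1.2)
axis(side = 1, at = h$Class.mark, labels = h$Class.mark, cex.lab = 1.2, las = 1, pos = 0)

abline(h=0)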
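One detail worth checking when plotting a histogram with freq = FALSE is that hist() reports densities rather than proportions. The sketch below shows, on simulated heights (not the article's data; all names are illustrative assumptions), how the two relate and how to plot the proportions per class directly.

```r
# Hypothetical data: 100 simulated heights, not the article's measurements
set.seed(1)
heights <- round(rnorm(100, mean = 178, sd = 7))

# Class boundaries 5 cm apart that cover the whole range of the data
breaks <- seq(5 * floor(min(heights) / 5) - 2.5,
              5 * ceiling(max(heights) / 5) + 2.5, by = 5)

# Compute the histogram without drawing it
hh <- hist(heights, breaks = breaks, plot = FALSE)

# Proportion of observations in each class
proportion <- hh$counts / sum(hh$counts)

# Density times class width gives the same proportion
width <- diff(hh$breaks)
densities_match <- isTRUE(all.equal(hh$density * width, proportion))
print(densities_match)

# Plot the proportions, labelled by class mark (the midpoint of each class)
barplot(proportion, names.arg = hh$mids, space = 0,
        xlab = "Height (cm)", ylab = "Proportion", las = 1)
```

With a class width of 5, each density value is one fifth of the corresponding proportion, so the shape of the plot is the same and only the scale of the y-axis differs.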