Frequencies and probabilities
Here I present another way to describe your data, by frequencies and probabilities. This gives a graphic overview of your entire data set, which is helpful in determining the distribution of the values you have observed.
The frequency table and histogram
I think the frequency table and histogram are best explained by an example:
You want to describe the height of men within a community. So you randomly sample 100 men and measure their heights. The data looks like…
This array of numbers gives little sense of how the sample is distributed. What you need is a frequency table. You can make one in two ways:
The data is presented in a table using tallies, which gives a visual presentation of the data.
The frequency table using tallies gives you an overview of the data and a sense of how it is distributed. However, this kind of table takes time to make and is therefore not very practical. It actually resembles a barplot, and we'll make one of those in a moment. But first, here is the frequency table using numbers instead of tallies.
If you sum these values you'll find that they add up to 100, so all the observations are accounted for. Now we can go on to make a plot describing the sample:
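If you want to try this yourself in R, the sketch below builds a numeric frequency table and the matching barplot. The original 100 heights are not reproduced here, so `heights` is simulated data (an assumption) that only stands in for a real sample.

```r
# Simulated stand-in for the 100 measured heights (hypothetical data,
# not the sample used in the text)
set.seed(1)
heights <- round(rnorm(100, mean = 175, sd = 6))

# Frequency table: one count per unique height
freq <- table(heights)
print(freq)

# All observations should be accounted for, so this equals the sample size
total <- sum(freq)
print(total)

# Barplot of the frequencies, one bar per unique height
barplot(freq, xlab = "Height (cm)", ylab = "Frequency", las = 1)
```

With real data you would replace the simulated `heights` with your own vector of measurements; the rest stays the same.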
The data can be presented in another way: by using probabilities or proportions. If you divide the frequency for each height by the total number of units in your sample or population, you get the proportion for that height. In other words, the proportion is the share of the data set that the height takes up. It is also an estimate of the probability of the height: the probability that a randomly chosen man in your community will have this height.
So, we calculate the probability for each height by dividing its frequency by 100, the total number (n) of units in the sample:
Check that everything is correct by summing all the probabilities; they should add up to exactly 1. Presenting this in a plot you get:
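The same calculation can be sketched in R. As before, `heights` is simulated (an assumption, not the original sample). Because computers store decimals with limited precision, the check that the probabilities sum to 1 is done with `all.equal()`, which allows a tiny tolerance, rather than with `==`.

```r
# Simulated stand-in for the 100 measured heights (hypothetical data)
set.seed(1)
heights <- round(rnorm(100, mean = 175, sd = 6))

freq <- table(heights)   # frequency of each unique height
n <- length(heights)     # total number of units in the sample

# Proportion (estimated probability) of each height: frequency divided by n.
# prop.table(freq) gives the same result.
prob <- freq / n
print(prob)

# The probabilities should add up to 1; compare with a small tolerance
prob_sum_ok <- isTRUE(all.equal(sum(prob), 1))
print(prob_sum_ok)

# Barplot of the probabilities
barplot(prob, xlab = "Height (cm)", ylab = "Probability", las = 1)
```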
The class interval
It is not always convenient to display every value on the x-axis of the plot. There might be hundreds of unique values, and many of them might have only one observation. To deal with this problem we can group the height values. The grouping is arbitrary; let's say you group the heights into groups of five consecutive values. Then, for example, the height 165 is part of the interval 163-167. This interval is called the class interval, and its midpoint, 165, is called the class mark. You then sum the frequencies for all heights within the interval and display the sum for the class interval. This is what it looks like in a table:
Smoother, hey? This table does not take up as much space as the former one and presents a smoother distribution of the values. What does this look like in a plot? A barplot made on these class intervals is actually called a histogram:
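Here is a minimal sketch of the grouping in R, again with simulated `heights` (an assumption, not the original sample). It assumes whole-centimetre heights and class marks at multiples of five, so that 163-167 has the class mark 165 as in the example above. The names `class_mark`, `class_table` and `breaks` are my own choices for illustration.

```r
# Simulated stand-in for the 100 measured heights (hypothetical data)
set.seed(1)
heights <- round(rnorm(100, mean = 175, sd = 6))

width <- 5  # number of whole-centimetre values in each class interval

# Assign every height to its class mark, the nearest multiple of 5
# (163, 164, 165, 166 and 167 all get the class mark 165)
class_mark <- width * round(heights / width)

# Sum the frequencies within each class interval
class_freq <- table(class_mark)
marks <- as.numeric(names(class_freq))
class_table <- data.frame(
  interval   = paste0(marks - 2, "-", marks + 2),
  class_mark = marks,
  frequency  = as.vector(class_freq)
)
print(class_table)

# Histogram on the same class intervals: the breaks sit halfway between
# neighbouring intervals (for example 162.5 and 167.5 around 165)
breaks <- seq(min(marks) - width / 2, max(marks) + width / 2, by = width)
hg <- hist(heights, breaks = breaks, main = "Histogram with frequencies",
           xlab = "Height (cm)", ylab = "Frequency", las = 1)
```

Placing the breaks at half centimetres means no whole-centimetre height can fall exactly on a boundary, so every observation lands in exactly one interval.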
Important to remember
(1) Frequency tables and histograms are important tools to describe the distribution of a sample or population
(2) The frequencies can easily be transformed to probabilities
(3) A histogram is a graphical presentation of the distribution of your data
How to make the graphs on this page in R
# barplot2() comes from the gplots package
library(gplots)

# Getting the data
# For barplots
p <- read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/data_prob.table_.csv", header = TRUE, dec = ",", sep = ";")
# For histograms
h <- read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/hist.csv", header = TRUE, dec = ",", sep = ";")
raw.data <- read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/data_hist.csv", header = TRUE, dec = ",", sep = ";")
raw.data <- as.numeric(raw.data[, 1])

# Barplot with frequencies
barplot2(p$Frequency, names.arg = p$Height, col = "#C20000", main = "Frequency barplot", xlab = "Height (cm)", ylab = "Frequency", las = 1, cex.lab = 1.2)
abline(h = 0)

# Barplot with probabilities
x11()  # opens a new graphics window so the first plot is not overwritten
# The labels come from p$Height (the original referred to an undefined object f)
barplot2(p$Probability, names.arg = p$Height, col = "#C20000", main = "Probability barplot", xlab = "Height (cm)", ylab = "Probability", las = 1, cex.lab = 1.2)
abline(h = 0)

# Histogram with frequencies
hist(raw.data, breaks = h$Class.mark, freq = TRUE, col = "#C20000", main = "Histogram with frequencies", xlab = "Height (cm)", xaxt = "n", ylab = "Frequency", las = 1, cex.lab = 1.2)
axis(side = 1, at = h$Class.mark, labels = h$Class.mark, cex.lab = 1.2, las = 1, pos = 0)
abline(h = 0)

# Histogram with probabilities
# Note: with freq = FALSE, hist() plots densities, so the bar areas
# (height times interval width) sum to 1 rather than the bar heights
hist(raw.data, breaks = h$Class.mark, freq = FALSE, col = "#C20000", main = "Histogram with probabilities", xlab = "Height (cm)", xaxt = "n", ylab = "Probability", las = 1, cex.lab = 1.2)
axis(side = 1, at = h$Class.mark, labels = h$Class.mark, cex.lab = 1.2, las = 1, pos = 0)
abline(h = 0)