Population vs sample

Population and sample are two fundamental concepts of statistical theory. In every statistical test, you deal with at least one population and an associated sample.

Before you even think about collecting data, you need to define the population(s) involved in the test. A population is the group you want to make generalizations about. You want to make some sort of statement about this group, such as: “men are on average x meters tall”. When performing a statistical test you often want to see whether there is a difference between two potential populations, such as: “men are on average taller than women”. If the test detects no difference, the data are consistent with the heights of men and women belonging to the same population, although this does not prove that they do. You also need to decide whether you want to make generalizations about all men in the world or, for example, only about men in Sweden. That is, you need to be sure which group you are actually making generalizations about. How specific you should be is determined by the purpose of your study.

In practice it is impossible to gather information about the heights of all men in the world, or even in Sweden. Therefore you need to collect a subset of all the heights in the population. This is called a sample. The sample should be a random subset of the population; if it is not random, it does not really represent the population. Let’s say you are interested in describing the height of all men in the world. Besides statistics you are also very interested in basketball, and you play internationally. To save time you ask the members of the opposing team about their heights, which you use in the study. The problem with this study is that the heights are not randomly drawn from the population of all men in the world. They in fact represent the heights of professional basketball players.
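The basketball problem can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population is simulated from a normal distribution with an invented mean (1.75 m) and spread (0.07 m), and the "basketball team" is approximated by simply taking the tallest individuals. None of these numbers come from real data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 heights in meters (invented parameters).
population = [random.gauss(1.75, 0.07) for _ in range(100_000)]

# Random sample: every individual has the same chance of being picked.
random_sample = random.sample(population, 50)

# Biased sample: only the 50 tallest individuals, standing in for the
# basketball players in the example.
biased_sample = sorted(population)[-50:]

population_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {population_mean:.3f} m")
print(f"random sample mean: {random_mean:.3f} m")
print(f"biased sample mean: {biased_mean:.3f} m")
```

The mean of the random sample should land close to the population mean, while the mean of the biased sample is far too high. Both samples have the same size; only the way they were drawn differs.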

When you are working with a data set, it is important to know whether you are dealing with a population or a sample. In most cases the data is from a sample, but sometimes it is actually possible to collect data from the entire population. The equations used to describe a population differ depending on whether you have observations from all the units in the population or only from a random sample.
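A common instance of this difference is the variance. If the data cover the whole population, the sum of squared deviations is divided by the number of observations N. If the data are a sample, it is divided by n - 1. The text does not say which equations it has in mind, so take the following as an illustrative sketch; the five heights are invented.

```python
import statistics

# Invented heights in meters, used only for illustration.
heights = [1.70, 1.75, 1.80, 1.85, 1.90]

# Treat the data as the entire population: divide by N.
population_variance = statistics.pvariance(heights)

# Treat the data as a sample from a larger population: divide by n - 1.
sample_variance = statistics.variance(heights)

print(f"population variance: {population_variance:.5f}")
print(f"sample variance:     {sample_variance:.5f}")
```

The same five numbers give two different answers: the sample variance is larger because the same sum is divided by 4 instead of 5. Which function is correct depends entirely on whether the data is the whole population or a sample from it.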

The most important points from this section:

Be sure to define the population that you want to make generalizations about.

The sample of a population needs to be representative, which means it has to be randomly drawn from the population.

Be sure that you know whether your data is from the entire population or a sample.