## Two-tailed Z-test

The Z-test is used to compare the means of two large samples (more than 30 observations each). In a two-sided Z-test, you test the null hypothesis that µ_{1} = µ_{2}. In contrast, in a one-sided Z-test, you test the null hypothesis µ_{1} ≥ µ_{2} or µ_{1} ≤ µ_{2}.

You need to check the following assumptions before proceeding with the Z-test:

- The observations are independent
- The samples have the same variance
- That the Central Limit Theorem holds true (it does if each sample size is > 30)

The Z-test relies on the test statistic Z, which is calculated by:

Z = |x̄_{1} − x̄_{2}| / √(s²_{1}/n_{1} + s²_{2}/n_{2})

where x̄_{1} and x̄_{2} are the sample means, s²_{1} and s²_{2} are the sample variances, and n_{1} and n_{2} are the sample sizes of sample 1 and 2, respectively.

The null hypothesis is rejected when Z > 1.96 at a significance level of α = 0.05, or when Z > 2.58 at α = 0.01. That is, you can be 95 % or 99 % confident, respectively, that the null hypothesis can be rejected, i.e. that the means differ.
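Because Z follows the standard normal distribution, these critical values need no table; they can be checked with base R's `qnorm()`, remembering that a two-tailed test puts α/2 in each tail:

```r
# Two-tailed critical values: alpha/2 in each tail of the standard normal
qnorm(1 - 0.05 / 2)  # alpha = 0.05 -> 1.96
qnorm(1 - 0.01 / 2)  # alpha = 0.01 -> 2.58
```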

**Example**

You want to test if the monthly salaries of industrial workers differ between China and India.

1. Construct the null hypothesis

H_{0}: the mean salary in China and India do not differ (µ_{China} = µ_{India})

2. Take a random sample of at least 30 salaries of industrial workers from each country, and calculate the mean (x̄) and variance (s²) for each sample, in this case:

x̄_{China} = 7.8, s²_{China} = 2.4, n_{China} = 45

x̄_{India} = 3.5, s²_{India} = 2.1, n_{India} = 32

3. Check that the variances are equal

— Perform an F-test: calculate the F statistic as the ratio of the larger to the smaller sample variance

F = s²_{max}/s²_{min} = 2.4/2.1 = 1.14

— Calculate the degrees of freedom (v = n − 1)

v_{1} = 45 − 1 = 44, v_{2} = 32 − 1 = 31

— Check the critical value for F at α = 0.05 where v_{1} = 44 and v_{2} = 31 in a table of critical F values: F_{α=0.05} = 1.8–2.01

— Compare the calculated F statistic with F_{α=0.05}

F = 1.14 < 1.8 = F_{α=0.05}

— Reject or retain H_{0}

H_{0 }can’t be rejected; the assumption of equal variances holds true.

4. Calculate the Z statistic

Z = |x̄_{China} − x̄_{India}| / √(s²_{China}/n_{China} + s²_{India}/n_{India}) = |7.8 − 3.5| / √(2.4/45 + 2.1/32) ≈ 4.3/0.35 ≈ 12.3

— Look up the critical value for Z at α = 0.05

For critical values of Z we don't have to check a table: it is simply Z_{α=0.05} = 1.96, independent of the degrees of freedom of the samples, because the test relies on the Central Limit Theorem.

5. Compare the calculated Z statistic with Z_{α=0.05}

Z = 12.3 > 1.96 = Z_{α=0.05}

Z is even greater than Z_{α=0.01 }= 2.58

6. Reject or retain H_{0}

H_{0} can be rejected; the mean salaries in China and India differ (µ_{China} ≠ µ_{India})

7. Interpret the result

We can be more than 99 % confident that the mean salaries of industrial workers differ between China and India; given that the sample mean is higher in China, salaries there are higher than in India.

*How to do it in R*

```r
# Check the assumption of equal variances using an F-test
f.test <- function(var.max, var.min) var.max / var.min
f.test(2.4, 2.1)

# Function to calculate the Z statistic
# x1, x2: sample means; s1, s2: sample variances; n1, n2: sample sizes
z <- function(x1, x2, s1, s2, n1, n2) {
  abs(x1 - x2) / sqrt(s1 / n1 + s2 / n2)
}
z(7.8, 3.5, 2.4, 2.1, 45, 32)
```
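To avoid table lookups altogether, the critical values and the exact p-value can be computed directly (a sketch using the same example numbers; `qf()` and `pnorm()` are base R):

```r
# Exact critical value for the F-test at alpha = 0.05 (df = 44 and 31),
# instead of the tabulated range 1.8-2.01
qf(0.95, df1 = 44, df2 = 31)

# Z statistic for the salary example, without intermediate rounding
z.stat <- abs(7.8 - 3.5) / sqrt(2.4 / 45 + 2.1 / 32)

# Two-tailed p-value: probability of a |Z| at least this large under H0
p.value <- 2 * pnorm(-z.stat)
p.value < 0.01  # TRUE: reject H0 at the 1 % level
```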

### Two-tailed Z-test in depth

The theory behind the Z-test relies on the Central Limit Theorem and is quite straightforward.

The Central Limit Theorem says that estimates of a population parameter, calculated from a large number of samples of size n, conform to a normal distribution. The population parameter can, for example, be the mean or the standard deviation. Other parameters can also be considered, such as **the difference between the means of two populations (d).**

In the Z-test you want to know if there is a real difference between the mean of two populations, µ_{pop1 }and µ_{pop2}.

As means calculated from samples are estimates, and therefore can differ from the true means (µ_{pop1} and µ_{pop2}) due to sampling error, we need to perform a statistical test (a Z-test). We simply need to find out whether the difference between the estimated means is a true difference or due to chance.

If there is no difference between the means of two populations then d = µ_{pop1 }- µ_{pop2 }= 0

Every time you estimate the means of two populations, you can calculate a difference (d) between them. Just as in the case with means, a large number of d's estimated from samples conform to a normal distribution, according to the Central Limit Theorem.

If there is no real difference between the populations, i.e. if d = µ_{pop1} − µ_{pop2} = 0, the distribution of d is centered around µ = 0 with standard deviation σ. Due to sampling error, the estimated d from samples deviates from 0, but the variability decreases with sample size. Sounds familiar? Yes, the standard deviation of this distribution is the same as the standard error, in this case the standard error of d (SE_{d}).
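A short simulation illustrates this (a sketch, assuming two normal populations with identical means and the example variances 2.4 and 2.1): the simulated d's center on 0, and their standard deviation matches the theoretical SE_{d}.

```r
set.seed(42)
# Draw many pairs of samples from two populations with the SAME mean,
# and record the difference between the sample means each time
d <- replicate(5000,
               mean(rnorm(45, mean = 5, sd = sqrt(2.4))) -
               mean(rnorm(32, mean = 5, sd = sqrt(2.1))))
mean(d)                    # close to 0
sd(d)                      # close to the theoretical SE_d
sqrt(2.4 / 45 + 2.1 / 32)  # theoretical SE_d, about 0.34
```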

Now, this is what you want to find out: does my estimated difference (d) between the means of my two samples belong to a population of d's where µ_{d} = 0, i.e. a population of d's where there is no real difference between the means and any difference is due to chance?

How do we do that?

We simply calculate the number of standard errors (SE_{d}) between the estimated d and µ_{d} = 0.

1. Calculate the distance between the estimated d and µ_{d} = 0:

|d − µ_{d}| = |d − 0| = |d|

The bars mean that we take the absolute value of the difference, i.e. we ignore any minus sign. This is because we are not interested in which mean is greater than the other, just in whether there is a difference.

2. Divide this distance by the standard error of d (SE_{d}), where

SE_{d} = √(s²_{1}/n_{1} + s²_{2}/n_{2})

3. Putting the two steps together, the number of standard errors between d and µ_{d} = 0 is:

Z = |d − 0| / SE_{d} = |x̄_{1} − x̄_{2}| / √(s²_{1}/n_{1} + s²_{2}/n_{2})

If the Z value exceeds 1.96, the estimated d is unlikely to have been obtained if there is no difference between the means. In other words, the estimated d is far from 0: the chance is less than 5 % that the estimated d belongs to a population of d's with mean 0.
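The 5 % figure is simply the area in the two tails of the standard normal distribution beyond ±1.96, which is easy to verify in R:

```r
# Combined area in the two tails beyond +/- 1.96 of the standard normal
2 * (1 - pnorm(1.96))  # approximately 0.05
```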

### How to produce the graph in R

```r
mnorm_plot <- function(my, sigma) {
  x <- seq(my - 4 * sigma, my + 4 * sigma, 0.05)
  # Normal probability density function
  mnorm <- function(my, sigma, x) {
    (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - my) / sigma)^2)
  }
  p <- matrix(ncol = 1, nrow = length(x))
  for (i in 1:length(x)) {
    p[i] <- mnorm(my, sigma, x[i])
  }
  plot(x, p, ylab = "", xlab = "Difference between means", type = "n",
       las = 1, bty = "l", pch = 19, yaxt = "n")
  lines(x, p, type = "l", lwd = 1.5)
  abline(0, 0, col = "grey")
  segments(my, 0.00, my, max(p), col = "grey")
  mtext("µ", side = 3, line = -0.5)
  # Shade the rejection regions beyond +/- 1.96 standard errors
  K.L <- my - 1.96 * sigma
  K.U <- my + 1.96 * sigma
  cord.x <- c(min(x), seq(min(x), K.L, 0.01), K.L)
  cord.y <- c(0, mnorm(my, sigma, seq(min(x), K.L, 0.01)), 0)
  cord.x2 <- c(K.U, seq(K.U, max(x), 0.01), max(x))
  cord.y2 <- c(0, mnorm(my, sigma, seq(K.U, max(x), 0.01)), 0)
  polygon(cord.x, cord.y, col = "#C20000", border = "black")
  polygon(cord.x2, cord.y2, col = "#C20000", border = "black")
}
# The first argument is the difference between means and the second is the standard error
mnorm_plot(0, 0.35)
```