Knowing the mean causes Type I error

I needed to explain this to someone the other day, and it may not be as generally appreciated as it should be.

When learning about statistical tests, people often get exposed to the idea that in good experimental design you should decide your analysis before getting the data, trying it out on simulated data if need be.

But people often want the crutch of knowing some summary statistics about the data while making their plans. So I am just going to run through a specific case to show how that ruins results at the margins.

Let’s assume we have a sample from a normally distributed population with a known population standard deviation, a really easy hypothetical case for seeing what is going on.

At a 95% confidence level (a significance level of 0.05), in the long run 95% of our sample confidence intervals will contain the population mean, and we will correctly fail to reject the null hypothesis. Due to random chance, 5% of the time we will make a type I error and incorrectly reject the null hypothesis, thinking we have found something interesting (remember that in this example the sample is drawn from the population, so any rejection of the null hypothesis is fundamentally an error, even if the test is conducted correctly; that is random chance for you).
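A minimal sketch of that long-run claim, assuming the same known-sigma setup used in the rest of this post. The seed and the names `half_width`, `covered`, `coverage` and `type1_rate` are just illustrative choices.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

pop_mean <- 1000
sigma <- 10
n <- 100
n_sims <- 10000

# Half-width of a 95% z-interval when sigma is known
half_width <- qnorm(0.975) * sigma / sqrt(n)

# One sample mean per simulated experiment, all drawn under the null hypothesis
xbar <- replicate(n_sims, mean(rnorm(n, mean = pop_mean, sd = sigma)))

# TRUE when the interval around the sample mean contains the population mean
covered <- abs(xbar - pop_mean) <= half_width

coverage <- mean(covered)    # long-run share of intervals that cover the mean
type1_rate <- 1 - coverage   # share of samples that would wrongly reject

coverage
type1_rate
```

The coverage should land close to 0.95 and the type I error rate close to 0.05, with a little simulation noise either way.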

Now, we can choose between one two-tailed test (the option to reject if the results are extremely high or low) and two one-tailed tests (the option to reject if the results are extremely high, or the option to reject if the results are extremely low). Though they focus on different parts of the probability curve, each of them rejects 5% of the area. For those using R, here is a simulation of the situation so far (you might be thinking “this is obvious”, but it is setup for the next part).

pop_mean <- 1000
sigma <- 10
n <- 100

# 10000 sample means, each from a sample of size n drawn under the null hypothesis
xbar <- replicate(10000, mean(rnorm(n, mean = pop_mean, sd = sigma)))

# z statistic for each sample mean (sigma is known)
z <- (xbar - pop_mean) / (sigma / sqrt(n))

# proportion of type I errors with a two-sided test
reject0when2sided <- abs(z) > 1.96
sum(reject0when2sided) / 10000

# with a one-sided lower test
reject0whenlowerside <- z < -1.6449
sum(reject0whenlowerside) / 10000

# with a one-sided upper test
reject0whenupperside <- z > 1.6449
sum(reject0whenupperside) / 10000
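For anyone wondering where the cutoffs 1.96 and 1.6449 come from, here is a short sketch that recovers them from the standard normal distribution with `qnorm()` and confirms with `pnorm()` that each rejection region has area 0.05. The variable names are only for illustration.

```r
alpha <- 0.05

# Two-sided test: alpha is split between both tails
crit_two_sided <- qnorm(1 - alpha / 2)   # about 1.96

# One-sided test: all of alpha sits in a single tail
crit_one_sided <- qnorm(1 - alpha)       # about 1.6449

# Area of each rejection region under the standard normal curve
area_two_sided <- 2 * pnorm(-crit_two_sided)
area_lower <- pnorm(-crit_one_sided)
area_upper <- 1 - pnorm(crit_one_sided)

c(two_sided = area_two_sided, lower = area_lower, upper = area_upper)
```

All three areas are 0.05, which is why the three simulated rejection rates come out about the same.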

But now, let us apply a little knowledge (a dangerous thing). Say we calculate summary statistics of our data, which tell us where the sample mean sits relative to a number we are interested in (the population mean), and then decide one of three things:

  • The sample is above the expected value, so I will run a one-tailed test to determine if the distance above is significant
  • The sample is below the expected value, so I will run a one-tailed test to determine if the distance below is significant
  • The sample is exactly the expected value, so it doesn’t really matter what I do; the results will not be significant

Here is the thing though: by choosing the test on the basis of where the sample mean sits relative to the expected value, you are effectively applying a test (intended to be applied to all possible samples) only to the half of the samples that could lead to an interesting result. This has consequences. In particular, the area that is judged significant stays the same, while the test is no longer applied to the samples on the other side of the expected value, which could never be significant. So, of the samples you apply each test to, twice the proportion fall in the area where type I errors occur.
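The arithmetic behind this can be checked directly from the standard normal distribution, without any simulation. This is only a sketch of the idealised known-sigma case, and the variable names are illustrative.

```r
# One-sided cutoff for a 5% test
crit_one_sided <- qnorm(0.95)

# Picking the tail after seeing the data means rejecting whenever z is
# beyond the one-sided cutoff in EITHER direction
overall_rate <- pnorm(-crit_one_sided) + (1 - pnorm(crit_one_sided))

# Among the samples whose mean is above the expected value (half of them),
# the share that the upper-tailed test rejects
rate_within_upper_half <- (1 - pnorm(crit_one_sided)) / (1 - pnorm(0))

overall_rate             # about 0.10
rate_within_upper_half   # about 0.10
```

Both numbers are double the nominal 5%: the 5% rejection region is unchanged, but it is now measured against only half of the samples.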

# Now say you plan a two-sided test, but you look at the data first:
# if the sample mean is greater than expected, swap to an upper-tailed test,
# and if the sample mean is lower than expected, swap to a lower-tailed test

reject0withswap <- reject0when2sided
reject0withswap[xbar < pop_mean] <- reject0whenlowerside[xbar < pop_mean]
reject0withswap[xbar > pop_mean] <- reject0whenupperside[xbar > pop_mean]

# proportion of type I errors after swapping
mean(reject0withswap)

In this idealised case, choosing the tail after looking at the data doubles the long-run type I error rate, from 5% to 10%. With real data (distributions that are not perfectly normal, unknown population parameters, and variation from sample to sample) the exact figure will differ, but what you can say for an individual set of data is that it dramatically increases the chance of a type I error.
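As a rough check on a slightly less idealised case, here is a sketch that repeats the experiment with the standard deviation treated as unknown, using R's built-in `t.test()`. The sample size, seed, and the helper functions `peek_then_test()` and `planned_two_sided()` are hypothetical choices made for this illustration, not part of the simulation above.

```r
set.seed(2)  # arbitrary seed, only so the sketch is reproducible

pop_mean <- 1000
n <- 30
n_sims <- 5000

# Hypothetical helper: look at the sample mean first, then pick the tail
peek_then_test <- function(x, mu, alpha = 0.05) {
  direction <- if (mean(x) > mu) "greater" else "less"
  t.test(x, mu = mu, alternative = direction)$p.value < alpha
}

# Hypothetical helper: the test decided on before seeing the data
planned_two_sided <- function(x, mu, alpha = 0.05) {
  t.test(x, mu = mu, alternative = "two.sided")$p.value < alpha
}

# Every sample is drawn under the null hypothesis, so each rejection is a type I error
samples <- replicate(n_sims, rnorm(n, mean = pop_mean, sd = 10), simplify = FALSE)

peek_rate <- mean(vapply(samples, peek_then_test, logical(1), mu = pop_mean))
planned_rate <- mean(vapply(samples, planned_two_sided, logical(1), mu = pop_mean))

c(peek = peek_rate, planned = planned_rate)
```

Under these assumptions the planned test should reject about 5% of the time and the peek-then-test approach about 10%, so the inflation is not an artefact of knowing sigma.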

So decide your statistical test on the basis of the question you want to answer, not on the basis of what your data tells you.

