This is something I have always considered obvious, but as I cannot find a specific reference for it, I am writing out an explainer. It does feel a little like Gauss’s “I thought it was trivial” though, so if anyone has another explanation that predates this one, let me know and I will link to it.

“Common” advice for using Chi-squared tests of independence used to be “you can aggregate tiny samples to reach the minimum expected value”; now it is “use a test that can handle smaller values” (in large part because modern computing resources support more computation-intensive tests). But the advice has never been “you can aggregate large categories”, and there is not normally a detailed explanation of why this is bad.

The reason is that the Chi-squared test of independence is a comparison of the variability in your data. Aggregating your samples potentially reduces the variability in the data, due to the Central Limit Theorem and reversion to the mean. Testing one sample against an aggregation of other samples with a test of variability means you are comparing the variability of one sample against the reduced variability of the aggregate. As a result, the statistical measure of the unusualness of the sample (and so the significance of the result) is overstated.
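The shrinkage is easy to quantify. Under the binomial model used later in this post, the standard deviation of a sample proportion is sqrt(p(1-p)/n), so pooling k equal-sized samples divides it by sqrt(k). A minimal sketch (in Python rather than this post’s R, purely for illustration; p and n are assumed round numbers at roughly the scale of one year in the example):

```python
import math

# Standard deviation of a sample proportion under the binomial model
def proportion_sd(p, n):
    return math.sqrt(p * (1 - p) / n)

# Roughly the scale of one year in the example below (assumed values)
p, n = 0.505, 175000
single = proportion_sd(p, n)       # variability of one year's proportion
pooled = proportion_sd(p, 4 * n)   # variability after pooling four years

# Pooling four equal-sized years halves the sd: the aggregate looks
# twice as stable as any individual year
print(single / pooled)  # ratio is 2, up to floating point
```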

Working through an example, suppose I have 5 sample years of data:

```
data <- "
case1,case2
88099,85906
93230,92496
85270,83761
89777,88634
86788,85734"
example <- read.csv(text=data)
row.names(example) <- c("A", "B", "C","D","E")
knitr::kable(example)
```

| | case1 | case2 |
|---|---|---|
| A | 88099 | 85906 |
| B | 93230 | 92496 |
| C | 85270 | 83761 |
| D | 89777 | 88634 |
| E | 86788 | 85734 |

If I am asking whether year A is unusual compared to other years, I could make a plot.

```
plot(1:5, example$case1/(example$case1 + example$case2), xaxt="n", frame.plot=FALSE, xlim=c(0,6), xlab="year", ylab="proportion of case1 in year", ylim=c(0.5,0.51))
axis(1, at=1:5,labels=row.names(example))
points(1, example$case1[1]/(example$case1[1]+ example$case2[1]), pch=19, col="red")
```

It shows that year A is high. If we run a (correctly carried out) Chi-squared test using all of the samples

`chisq.test(example)`

```
##
## Pearson's Chi-squared test
##
## data: example
## X-squared = 7.7596, df = 4, p-value = 0.1008
```

It reports that this much variation in the data has around a 10% chance of occurring by chance, so we conclude nothing exciting is going on in the variation of the data. The results from year A may be significant against particular other years, but they are not significant given the overall variation in the data.
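As a cross-check outside R, the same Pearson test can be reproduced with scipy’s `chi2_contingency` (my own sketch; scipy is an assumption, not part of the post’s workflow):

```python
from scipy.stats import chi2_contingency

# The five years of case1/case2 counts from the table above
table = [[88099, 85906],
         [93230, 92496],
         [85270, 83761],
         [89777, 88634],
         [86788, 85734]]

chi2, p, df, expected = chi2_contingency(table)
# Matches the R output: X-squared = 7.7596, df = 4, p-value = 0.1008
print(round(chi2, 4), df, round(p, 4))
```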

However, if we pool every year except year A

```
pooled <- rbind(example[1,], colSums(example[2:5,]))
row.names(pooled)[2] <- "all other"
knitr::kable(pooled)
```

| | case1 | case2 |
|---|---|---|
| A | 88099 | 85906 |
| all other | 355065 | 350625 |

and perform the same Chi-squared test, the result is

`chisq.test(pooled)`

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: pooled
## X-squared = 5.5479, df = 1, p-value = 0.0185
```

Apparently significant to a p-value of less than 0.02.
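The pooled 2×2 result can likewise be reproduced in scipy, where `correction=True` (the default for 2×2 tables) applies the same Yates’ continuity correction R uses (again my own sketch, not part of the original workflow):

```python
from scipy.stats import chi2_contingency

# Year A against the sum of all the other years
pooled = [[88099, 85906],
          [355065, 350625]]

# correction=True applies Yates' continuity correction, as R does for 2x2 tables
chi2, p, df, expected = chi2_contingency(pooled, correction=True)
# Matches the R output: X-squared = 5.5479, df = 1, p-value = 0.0185
print(round(chi2, 4), round(p, 4))
```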

So there seems to be something fundamentally wrong here. We can model how the error develops by starting year A at the expected values given all other years, then shifting values from case2 to case1 and seeing what happens:

```
sigmulti <- numeric()
sigpooled <- numeric()
modeledp <- pooled
modeledi <- example
numberA <- sum(pooled[1,])
modeledp[1,1] <- as.integer(numberA * pooled[2,1]/sum(pooled[2,]))
modeledp[1,2] <- numberA - modeledp[1,1]
modeledi[1,1] <- modeledp[1,1]
modeledi[1,2] <- modeledp[1,2]
difference <- 1:1500
for (step in difference){
  modeledp[1,1] <- modeledp[1,1] + 1
  modeledp[1,2] <- modeledp[1,2] - 1
  modeledi[1,1] <- modeledi[1,1] + 1
  modeledi[1,2] <- modeledi[1,2] - 1
  sigpooled <- c(sigpooled, chisq.test(modeledp)$p.value)
  sigmulti <- c(sigmulti, chisq.test(modeledi)$p.value)
}
plot(difference*2, sigmulti, col="blue", type="l", ylim=c(0,1), main="p.values of testing individual years vs pooled testing", ylab="p.value", xlab="size of difference in year A between case 1 and case 2", frame.plot=F)
lines(difference*2, sigpooled, col="green", type="l")
abline(h=0.05, lty=2)
legend("topright",legend=c("testing individual years", "pooled value", "0.05 level"), col=c("blue","green","black"), lty=c(1,1,2), border=F, bty="n")
```

While the pooled test produces erroneous values until it converges with the individual test at effectively zero, of particular concern is that the pooled sample reports a p-value of 0.05 when the difference between case1 and case2 reaches 916, whereas testing against all years gives a p-value of 0.19527 at that same difference. Testing against individual years does not find a significant difference in the pattern of the data until the difference between case1 and case2 in year A reaches 1260. There is a danger zone, 344 wide in raw difference, where results are erroneously declared significant at the 0.05 level when in fact they are not. It so happens that year A falls into this zone.
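The two 0.05 crossing points can be located directly by repeating the same sweep and recording the first step at which each test dips below 0.05. A sketch in Python (scipy assumed; the R code above is the authoritative version):

```python
from scipy.stats import chi2_contingency

# The five years of counts from the example
years = [[88099, 85906], [93230, 92496], [85270, 83761],
         [89777, 88634], [86788, 85734]]
others = [sum(r[0] for r in years[1:]), sum(r[1] for r in years[1:])]
n_A = sum(years[0])

# Start year A at the proportion expected from the other years, as in the R loop
base1 = int(n_A * others[0] / sum(others))

def first_significant(build_table):
    """Smallest case1-vs-case2 difference in year A giving p < 0.05."""
    for step in range(1, 1500):
        row_A = [base1 + step, n_A - base1 - step]
        if chi2_contingency(build_table(row_A))[1] < 0.05:
            return 2 * step  # same x-axis scale as the plot
    return None

pooled_at = first_significant(lambda row: [row, others])      # 2x2, Yates corrected
multi_at = first_significant(lambda row: [row] + years[1:])   # full 5x2 table
print(pooled_at, multi_at)  # the pooled test crosses 0.05 at a much smaller difference
```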

Just to be explicit about the cause of the error that pooling the data produces: testing individual years against each other tests the amount of variation in one sample against the actual variation across the samples, while testing the pooled data tests the variation in year A against an expected (hypothetical) variation derived from treating all other years as a single sample. The actual observed variation in the data is greater than that predicted from a single sample, so the test is in error.

**Why this is a form of multiple testing error**

It might still not be obvious why this is a multiple testing error; for those who have already made the intellectual leap, you may want to skip this section, where I work through an example.

The fundamental issue is the difference between combining the variances of the sample data and the variance of the combined sample data. This is easier to visualise if we treat the problem as a binomial one of case1 vs. not case1 (case2), and then use the normal approximation to the binomial. I’m just doing this for visualisation purposes.

```
bino <- example
bino$n <- bino$case1 + bino$case2
bino$samplep <- bino$case1 / (bino$case1 + bino$case2)
bino$samplemean <- bino$case1
bino$samplesd <- (bino$n * bino$samplep * (1- bino$samplep)) ^ 0.5
knitr::kable(bino)
```

| | case1 | case2 | n | samplep | samplemean | samplesd |
|---|---|---|---|---|---|---|
| A | 88099 | 85906 | 174005 | 0.5063015 | 88099 | 208.5530 |
| B | 93230 | 92496 | 185726 | 0.5019760 | 93230 | 215.4780 |
| C | 85270 | 83761 | 169031 | 0.5044637 | 85270 | 205.5587 |
| D | 89777 | 88634 | 178411 | 0.5032033 | 89777 | 211.1893 |
| E | 86788 | 85734 | 172522 | 0.5030547 | 86788 | 207.6750 |

which we can graph, showing two standard deviation confidence intervals

```
cimax <- (2 * bino$samplesd + bino$samplemean) / bino$n
cimin <- (bino$samplemean - 2 * bino$samplesd) / bino$n
plot(1:5, bino$samplemean / bino$n, xaxt="n", frame.plot=FALSE, xlim=c(0,6), xlab="year", ylab="sample proportion", ylim=c(min(cimin),max(cimax)), main="Sample proportion with\n2 sd confidence intervals shown")
axis(1, at=1:5,labels=row.names(example))
points(1, bino$samplemean[1] / bino$n[1], col="red")
for (i in 1:5){
  lines(c(i,i), y=c(cimax[i], cimin[i]))
}
lines(c(1,1),y=c(cimax[1],cimin[1]), col="red")
```
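As a sanity check on the samplesd column, the normal-approximation standard deviation is sqrt(n·p·(1-p)); a quick hand calculation for year A (Python used only as a calculator here):

```python
import math

# Year A counts from the table above
case1, case2 = 88099, 85906
n = case1 + case2
p = case1 / n

sd = math.sqrt(n * p * (1 - p))  # normal approximation to the binomial
# n = 174005, p ≈ 0.5063015, sd ≈ 208.5530, matching the samplesd column
print(n, round(p, 7), round(sd, 4))
```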

Though year A has a higher proportion than other years, there is an overlap with some of the other years. But we aren’t doing this as a test; we are doing this to see what happens if you compound the variances of the normal approximations by calculating the variance of the combined data (the effect of combining all the data then measuring the variability), compared to the combined variance of the individual data points (correctly treating the samples as separate).

```
combn <- bino[2:5,]
combn$var <- combn$samplesd ^ 2
combn$varnminus1 <- combn$samplesd ^ 2 *(combn$n - 1)
combn$pooledsd <- (cumsum(combn$varnminus1) / cumsum(combn$n - 1)) ^ 0.5
combn$asonesd <- (cumsum(combn$samplemean) * (1 - cumsum(combn$samplemean)/cumsum(combn$n)))^0.5
cicmax <- (2 * combn$asonesd + cumsum(combn$samplemean)) / cumsum(combn$n)
cicmin <- (cumsum(combn$samplemean) - 2 * combn$asonesd) / cumsum(combn$n)
plot(2:5, cumsum(combn$samplemean) / cumsum(combn$n), xaxt="n", frame.plot=FALSE, xlim=c(0,6), xlab="year", ylab="sample proportion", ylim=c(.50,.51), main="Combined proportion with\n2 sd confidence intervals shown")
axis(1, at=1:5,labels=c("A","B","BC","BCD","BCDE"), cex.axis=0.5)
points(1, bino$samplemean[1] / bino$n[1], col="red")
for (i in 1:4){
  lines(c(i+1,i+1), y=c(cicmax[i], cicmin[i]))
}
lines(c(1,1),y=c(cimax[1],cimin[1]), col="red")
```

**Note**: I would like to make it clear that I know this is a completely wrong way to combine the sample variances; I am just trying to show visually the effects of aggregating all the other samples to form one sample before doing the comparison. In the previous graph, having the samples separate means taking the differences between the groups into account, and, when doing multiple testing, running a correction on the data to reduce the likelihood of “having tested a lot of things, you eventually get something that is significant” (c.f. Bonferroni). In this graph, by the time you reach the combination of B + C + D + E, the rich variability in the data has been lost, and you seem to be comparing two things (A and BCDE) which have no similarity whatsoever.

What makes it the same class of error as multiple testing errors is the way you are underestimating the variation present when comparing one sample with several other samples. The Bonferroni correction in multiple testing performs a similar role to the degrees of freedom in the Chi-squared test: it acknowledges that when you have a bunch of samples that vary, the variation between samples is important. The manifestation of this fundamental issue is slightly different in the two cases. In a traditional textbook multiple testing error you are ignoring the variation between samples and eventually striking one that appears significant. In this version, the information about between-sample variation is similarly ignored, and in that blindness the difference of the first sample (year A in this example) is overemphasised by the degree to which the first sample is influenced by the same inter-sample variation that is affecting the other samples.
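To make the analogy concrete (a trivial illustration of my own, not from the example above): comparing year A against each of the four other years individually would, under Bonferroni, require each pairwise test to clear a stricter threshold, which is the correction’s way of honouring between-sample variation.

```python
comparisons = 4            # year A tested against each of B, C, D and E
alpha = 0.05
bonferroni_alpha = alpha / comparisons  # per-test threshold after correction
print(bonferroni_alpha)    # 0.0125: each pairwise test must clear a stricter bar
```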

Blindness to what multiple samples are telling you is, in my opinion, a multiple sampling error.