I was reading an article that focused on 25 data points spread over a 72-point scale.

The article made various assertions based on the supposedly non-random distribution of the data. However, it gave no measure of uncertainty about the results and no estimate of how likely the pattern was to occur by chance. So I wrote my own simulation. With a 72-point scale and 25 values, the minimum possible clustering occurs when the data are spread evenly every 2-3 steps, so that no entry is 0 or 1 steps from another; at the other end of the spectrum, maximum clustering has all the data on the same value (0 steps from the next entry). This gives us a simple scale for how clustered the data are: sort the values and count how many consecutive entries are 0 or 1 steps apart. We can then see how common different degrees of clustering are when the data are placed at random. For the observed pattern to count as non-random, it needs to be unlikely to occur by random chance.

    # for figure from article
    bins <- 72
    occurrences <- 25
    bins / occurrences
    # maximum spacing: 25 entries at distance 3 or 2, and 0 at distance 1 or 0
    # replication test based on the idea that the gaps between clustered observations
    # are either very small (within sub-clusters) or very large (between sub-clusters).
    # This lets us build a simulation test under independence to see how unlikely the
    # observed distribution is: put the entries in order and count how many are within
    # 2 of the next one. Clustered entries should, by definition, have more entries
    # close together than non-clustered ones.
    # Maximum spread (no clustering) has 0 entries at spacing 0 or 1; at maximum
    # clustering all entries are on the same point, so we can measure the degree of
    # clustering by counting gaps of less than 2.
    sim <- function(){
      x <- sample(1:72, size = 25, replace = TRUE)
      # order the data
      sortx <- sort(x)
      # calculate gaps between consecutive entries
      gaps <- diff(sortx)
      # return the number of gaps of 0 or 1
      sum(gaps < 2)
    }

    million_reps <- replicate(1000000, sim())
    hist(million_reps)
    mean(million_reps)
    median(million_reps)

    # observed data (copied from graph) with top boundary point added
    actual <- c(7,9,9,11,11,26,30,33,40,42,44,44,48,56,57,59,62,62,63,63,65,65,67,70,70)
    actual_diff <- diff(actual)
    actual_count <- sum(actual_diff < 2)

    # likelihood of observing a cluster of this degree or greater with random data
    sum(million_reps >= actual_count) / length(million_reps)
    # = around .75 or 75%
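To sanity-check the statistic at its two extremes, here is a small Python sketch of the same gap-count idea (the R code above is the actual analysis; the helper name gap_count is mine). It confirms that maximally spread data scores 0, maximally clustered data scores 24, and the observed data scores 9.

    def gap_count(values):
        """Count gaps of 0 or 1 between consecutive sorted values."""
        s = sorted(values)
        return sum(1 for a, b in zip(s, s[1:]) if b - a < 2)

    # Maximum spread: 25 values spaced every 2 steps (1, 3, ..., 49) fit on the
    # 72-point scale with no gap under 2.
    spread = list(range(1, 50, 2))
    print(gap_count(spread))      # -> 0

    # Maximum clustering: all 25 values on the same point, so all 24 gaps are 0.
    clustered = [36] * 25
    print(gap_count(clustered))   # -> 24

    # Observed data from the article's figure:
    actual = [7, 9, 9, 11, 11, 26, 30, 33, 40, 42, 44, 44, 48,
              56, 57, 59, 62, 62, 63, 63, 65, 65, 67, 70, 70]
    print(gap_count(actual))      # -> 9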

So the observed data is actually less clustered than random data on average (whether measured by the mean or the median), with about 75% of random draws showing that degree of clustering or greater. So I don't feel there is much evidence for a non-random effect.