R

Testing clustering with simulation

I was reading an article that focused on 25 data points spread over a 72-point scale.

[Image: excerpt from the article]

The article made various assertions based on the non-random distribution of the data. However, it gave no measure of uncertainty for its results and no indication of how likely they were to occur by chance. So I wrote my own simulation. With a 72-point scale and 25 values, clustering is at its minimum when the data are spread evenly, one entry every 2 to 3 steps, so that no entry is 0 or 1 steps from the next. At the other end of the spectrum, clustering is at its maximum when every entry sits on the same value, 0 steps from the next. This means we can define a scale of how clustered a data set is by counting how many entries are 0 or 1 steps from the next one, and then see how common different degrees of clustering are when the data are distributed at random. For the observed pattern to count as non-random, it needs to be unlikely to occur by chance.
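The two ends of that scale can be checked directly. This is a minimal sketch, not code from the article: `count_close` is a hypothetical helper name, and the evenly spread and fully clustered vectors are made-up illustrations of the two extremes.

```r
# Hypothetical helper: number of sorted neighbours fewer than 2 steps apart
count_close <- function(x) sum(diff(sort(x)) < 2)

# Maximum spread: 25 values placed as evenly as possible on the 1..72 scale
# (rounded to whole steps, so every gap is 2 or 3)
even_spread <- round(seq(1, 72, length.out = 25))
count_close(even_spread)

# Maximum clustering: all 25 values on the same point (36 is an arbitrary choice)
all_same <- rep(36, 25)
count_close(all_same)
```

Because 25 sorted values have 24 gaps between neighbours, the count runs from 0 (fully spread) to 24 (fully clustered).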

# figures from the article
bins <- 72
occurrences <- 25
bins / occurrences
# at maximum spacing, every gap is 2 or 3 steps and none is 0 or 1

# Replication test. The gaps between observations in clustered data will be
# either very small (within sub-clusters) or very large (between sub-clusters).
# That lets us build a simulation test, assuming independence, of how unlikely
# this distribution is: put the entries in order and count how many are fewer
# than 2 steps from the next one. Clustered entries should, by definition, have
# more entries close together than non-clustered ones.
# At maximum spread (no clustering), no entries are at 0 or 1 spacing.
# At maximum clustering, all the entries are on the same point. We can therefore
# measure the degree of clustering by the number of gaps of 0 or 1.


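sim <- function(){
  # draw 25 values at random (with replacement) from the 72-point scale
  x <- sample(1:72, size=25, replace=TRUE)
  # order the data
  sortx <- sort(x)
  # calculate gaps between neighbouring entries
  gaps <- diff(sortx)
  # return the number of gaps of 0 or 1
  sum(gaps < 2)
}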

million_reps <- replicate(1000000, sim())
hist(million_reps)
mean(million_reps)
median(million_reps)
# observed data (copied from graph), with top boundary point added

actual <- c(7,9,9,11,11,26,30,33,40,42,44,44,48,56,57,59,62,62,63,63,65,65,67,70,70)
actual_diff <- diff(actual)
actual_count <- sum(actual_diff < 2)

# likelihood of observing clustering of this degree or greater with random data
sum(million_reps >= actual_count) / length(million_reps)
# = approximately 0.75, or 75%


So the observed data are less clustered than random data are on average (by the mean or, for that matter, the median), with about 75% of random cases showing that degree of closeness between entries (clustering) or greater. So I don’t feel there is a lot of evidence for a non-random effect.
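Since the simulated proportion is itself an estimate, it may help to attach a rough Monte Carlo error to it. The following is only a sketch under stated assumptions: the seed, the reduced number of replicates (10,000 rather than a million), the helper name `count_close`, and the normal-approximation interval are all my choices for illustration.

```r
set.seed(1)  # arbitrary seed, assumed here only for reproducibility

# Hypothetical helper: number of sorted neighbours fewer than 2 steps apart
count_close <- function(x) sum(diff(sort(x)) < 2)

# observed data, as copied from the graph
actual <- c(7,9,9,11,11,26,30,33,40,42,44,44,48,56,57,59,62,62,63,63,65,65,67,70,70)
observed <- count_close(actual)

# 10,000 random data sets instead of a million, to keep the run short
reps <- replicate(10000, count_close(sample(1:72, size = 25, replace = TRUE)))

# proportion of random data sets at least as clustered as the observed one
p_hat <- mean(reps >= observed)
# binomial standard error of that proportion
se <- sqrt(p_hat * (1 - p_hat) / length(reps))
# estimate with an approximate 95% interval
c(estimate = p_hat, lower = p_hat - 1.96 * se, upper = p_hat + 1.96 * se)
```

With this many replicates the standard error is at most 0.005, so the simulation noise is small compared with the distance between the estimate and any conventional threshold for an unlikely result.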
