New Zealand · R

Simulating Cadbury Easter Eggs

With discussion in the media over the frequency of favourites chocolates in Easter Eggs, Thomas Lumley over in StatsChat wrote a simulation to estimate the probabilities.

I thought I would add another way of thinking about simulating chocolate varieties in Easter Eggs- rather than thinking about it as packs of chocolates which you can test and summarise the results for, you can also think about it as (lots of) individual chocolates that belong to a pack and are of a variety. You can then use tools like dplyr to apply mass aggregation and summarise functions to get the frequency of events

library(dplyr)
#How many times are we doing this:
trials <- 100000
#How big is the pack size
packsize <- 8
#How many varieties are there
bartypes <- 5
# if we treat this as a whole bunch of bars each of which belong to a trial group, then there will be {packsize} entries for each trial group
trialnumber <- rep(1:trials, packsize)
# and if we draw enough bars to account for each bar in each trial group  
bar % group_by(trialnumber) %>% summarise(cherryRipes=sum(bar==1)) %>% group_by(cherryRipes) %>% summarise(frequency=n())
#maximum number of bars of the same type in a pack:
data.frame(trialnumber, bar) %>% group_by(trialnumber,bar) %>% summarise(occurance=n()) %>% group_by(trialnumber) %>% summarise(commonest=max(occurance)) %>% group_by(commonest) %>% summarise(frequency=n())
#how often are all five flavours present (TRUE)
data.frame(trialnumber, bar) %>% group_by(trialnumber) %>% summarise(allthere=all(1:bartypes %in% bar)) %>% group_by(allthere) %>% summarise(frequency=n())

However, even this is unnecessarily complex. All the actions taken on the data are “counting how many” actions, and using tables (or sections of tables) offers a very fast easy way to summarise things when you are comparing one or two variables on the basis of counting up occurrences. You still need to generate a list of what trial each chocolate belongs to, and what variety the chocolate is, but then you can do the rest from tables or parts of tables

#How many times are we doing this:
trials <- 100000
#How big is the pack size
packsize <- 8
#How many varieties are there
bartypes <- 5
# if we treat this as a whole bunch of bars each of which belong to a trial group, then there will be {packsize} entries for each trial group
trialnumber <- rep(1:trials, packsize)
# and if we draw enough bars to account for each bar in each trial group  
bar <- sample(1:bartypes, packsize*trials, replace=TRUE)
# then we have a list of trial identifiers and a list of bar identifiers we can summarise how many of each in a table
packs <- table(trialnumber, bar)
#this effectively makes each row in the packs table a pack
#with a count up for each type of bar in each column
#Now we can pick out summary information about that table to answer questions
#number of cherry ripes in a pack, assuming Cherry Ripe is bar outcome (column) 1:
table(packs[,1])
#maximum number of bars of the same type in a pack:
table(apply(packs,1,max))
#all five flavours present, since not being present must mean a zero minimum in at least one flavour,
#the zero results are missing at least one flavour
table(apply(packs,1,min))

Something I find very powerful, when addressing questions in R, is the think about approaches that do everything at one. So, treating it as one big problem (in this case a giant set of bars we know the pack number of) rather than a bunch of small problems (a big set of packs we know the bars of). Working with our data at the most atomic level of makes for swift, minimalistic code.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s