With discussion in the media over the frequency of favourite chocolates in Easter Eggs, Thomas Lumley over at StatsChat wrote a simulation to estimate the probabilities.

I thought I would add another way of thinking about simulating chocolate varieties in Easter Eggs: rather than thinking of the data as packs of chocolates which you can test and summarise the results for, you can think of it as (lots of) individual chocolates, each of which belongs to a pack and is of a variety. You can then use tools like dplyr to apply aggregation and summarise functions en masse to get the frequency of events.

```r
library(dplyr)

# How many times are we doing this:
trials <- 100000
# How big is the pack size
packsize <- 8
# How many varieties are there
bartypes <- 5

# If we treat this as a whole bunch of bars, each of which belongs to a
# trial group, then there will be packsize entries for each trial group
trialnumber <- rep(1:trials, packsize)
# and if we draw enough bars to account for each bar in each trial group
bar <- sample(1:bartypes, packsize * trials, replace = TRUE)

# number of Cherry Ripes in a pack, treating variety 1 as Cherry Ripe:
data.frame(trialnumber, bar) %>%
  group_by(trialnumber) %>%
  summarise(cherryRipes = sum(bar == 1)) %>%
  group_by(cherryRipes) %>%
  summarise(frequency = n())

# maximum number of bars of the same type in a pack:
data.frame(trialnumber, bar) %>%
  group_by(trialnumber, bar) %>%
  summarise(occurrence = n()) %>%
  group_by(trialnumber) %>%
  summarise(commonest = max(occurrence)) %>%
  group_by(commonest) %>%
  summarise(frequency = n())

# how often are all five flavours present (TRUE)
data.frame(trialnumber, bar) %>%
  group_by(trialnumber) %>%
  summarise(allthere = all(1:bartypes %in% bar)) %>%
  group_by(allthere) %>%
  summarise(frequency = n())
```
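The frequency counts this produces are easy to turn into probability estimates by dividing by the number of trials. Here is a minimal, self-contained sketch of that step; the smaller trial count and the `set.seed()` call are additions for speed and reproducibility, not part of the original code.

```r
library(dplyr)

# Same simulation on a smaller scale, with frequencies converted
# to estimated probabilities by dividing by the number of trials.
set.seed(1)       # reproducibility; an addition, not in the original
trials   <- 1000  # smaller than the 100000 used above, for speed
packsize <- 8
bartypes <- 5
trialnumber <- rep(1:trials, packsize)
bar <- sample(1:bartypes, packsize * trials, replace = TRUE)

# Count Cherry Ripes (variety 1) per pack, then tally packs by that
# count and convert each tally to a proportion of all trials
cherry <- data.frame(trialnumber, bar) %>%
  group_by(trialnumber) %>%
  summarise(cherryRipes = sum(bar == 1)) %>%
  group_by(cherryRipes) %>%
  summarise(frequency = n()) %>%
  mutate(probability = frequency / trials)
cherry
```

The `probability` column sums to 1 by construction, which is a handy sanity check on the summarising steps.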

However, even this is unnecessarily complex. All the actions taken on the data are "counting how many" actions, and tables (or sections of tables) offer a very fast, easy way to summarise things when you are comparing one or two variables by counting up occurrences. You still need to generate a list of which trial each chocolate belongs to, and which variety each chocolate is, but then you can do the rest from tables or parts of tables.

```r
# How many times are we doing this:
trials <- 100000
# How big is the pack size
packsize <- 8
# How many varieties are there
bartypes <- 5

# If we treat this as a whole bunch of bars, each of which belongs to a
# trial group, then there will be packsize entries for each trial group
trialnumber <- rep(1:trials, packsize)
# and if we draw enough bars to account for each bar in each trial group
bar <- sample(1:bartypes, packsize * trials, replace = TRUE)

# With a list of trial identifiers and a list of bar identifiers,
# we can summarise how many of each in a table
packs <- table(trialnumber, bar)
# This effectively makes each row in the packs table a pack,
# with a count for each type of bar in each column.
# Now we can pick out summary information from that table to answer questions.

# number of Cherry Ripes in a pack, assuming Cherry Ripe is bar outcome (column) 1:
table(packs[, 1])

# maximum number of bars of the same type in a pack:
table(apply(packs, 1, max))

# all five flavours present: not being present must mean a zero minimum in at
# least one flavour, so the zero results are missing at least one flavour
table(apply(packs, 1, min))
```
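To see what the `table()` trick is doing, it helps to shrink it down to a toy case where the whole cross-tabulation is visible at a glance. This is an illustrative sketch (the tiny pack sizes and values are made up for the example, not from the original post):

```r
# 3 packs of 4 bars each, with hand-picked varieties so the
# cross-tabulation is easy to read
trialnumber <- rep(1:3, each = 4)
bar <- c(1, 1, 2, 3,   # pack 1: two of variety 1, one each of 2 and 3
         2, 2, 2, 2,   # pack 2: all variety 2
         1, 2, 3, 1)   # pack 3: two of variety 1, one each of 2 and 3

# One row per pack, one column per variety, cells are counts
packs <- table(trialnumber, bar)

# Row-wise maxima give the commonest variety count in each pack:
# packs 1 and 3 top out at 2, pack 2 at 4
apply(packs, 1, max)

# ...and tabulating those maxima counts the packs in each category
table(apply(packs, 1, max))
```

Every question in the full-size code is a variation on this: reduce each row of `packs` to a single number with `apply()`, then `table()` the result.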

Something I find very powerful, when addressing questions in R, is to think about approaches that do everything at once: treating it as one big problem (in this case, a giant set of bars whose pack number we know) rather than a bunch of small problems (a big set of packs whose bars we know). Working with our data at the most atomic level makes for swift, minimalistic code.