Which is to say, use sets.

There is a class of problem where I have, a few times, seen people miss potentially useful approaches because they didn’t recognise it as a set problem. Basically, if the question involves terms like “combined” or “excluding” or “common to both”, it may be a question that can benefit from thinking in set terms.

To some extent whenever you are subsetting data you are using sets (even more so for multiple criteria), but I was thinking of the more formal operations like union and intersect.
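
As a minimal sketch of those formal operations, base R provides union(), intersect() and setdiff() for vectors. The two vectors here are made up for illustration and are not part of the log example further down.

```r
# two hypothetical groups of user ids, invented for illustration
monday <- c(1, 2, 3, 4)
tuesday <- c(3, 4, 5)

combined <- union(monday, tuesday)      # "combined": in either group
inBoth <- intersect(monday, tuesday)    # "common to both"
mondayOnly <- setdiff(monday, tuesday)  # "excluding": in monday but not in tuesday
```

Each result contains every id at most once, which is what makes these operations suitable for counting distinct people.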

The other day I encountered the problem of “Based on the log information of when people are active on a site, when would be the best time to hold a live event”, which is a basic aggregation question. But this question led to the more complex “what combination of two times would capture the most people”. Note that this question involves a combination, in this case the union of the people active at each of the two times. So here is an example:

Create a log of a fictitious week’s worth of data

#example data

people <- 1:200

times <- seq.POSIXt(from=as.POSIXct("2016-02-08 00:00:00"), to=as.POSIXct("2016-02-15 00:00:00"), by=60)

set.seed(365)

who <- sample(people,1000,replace=TRUE)

when <- times[sample(1:length(times),1000,replace=TRUE)]

exampleLog <- data.frame(who,when)

And now, analysing the log to determine matches

#round to hour

library(lubridate)

#round_date() from lubridate rounds to the nearest hour (half hours round up) and keeps the column as POSIXct

exampleLog$roundedHour <- round_date(exampleLog$when, unit="hour")

#get weekdays and hours of the day

exampleLog$hour <- hour(exampleLog$roundedHour)

exampleLog$dayHuman <- wday(exampleLog$roundedHour, label=TRUE)

exampleLog$dayComputer <- wday(exampleLog$roundedHour)

#stick them together for ease of processing

exampleLog$dayhourHuman <- paste(exampleLog$dayHuman, exampleLog$hour)

exampleLog$dayhourComputer <- exampleLog$dayComputer + (exampleLog$hour/100)

#in reality dayhourComputer is only useful for knowing the natural order of the entries, for example if making graphs

#The list of people for each time for which there are people

collected <- aggregate(who ~ dayhourHuman + dayhourComputer, data=exampleLog, FUN=c)
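
One thing to be aware of: with its default simplify=TRUE, aggregate() only returns a list column when the groups differ in size, and if every group happened to contain the same number of entries the result would be simplified to a matrix instead. As a hedged alternative sketch, split() always returns a list of vectors. The toyLog data frame and its slot column are invented for this illustration.

```r
# hypothetical miniature log, invented for illustration
toyLog <- data.frame(
  who = c(1, 2, 2, 3, 1),
  slot = c("Mon 9", "Mon 9", "Mon 9", "Tue 14", "Tue 14")
)

# one vector of people per time slot, named by slot
toyPeople <- split(toyLog$who, toyLog$slot)

# repeat visits are still in the vectors; union() drops them later
toyPeople[["Mon 9"]]
```

Whether this is preferable depends on whether the extra grouping columns that aggregate() keeps are needed afterwards.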

#the pairs of each potential time (for two, this is quite an efficient way) by their index numbers

time1 <- rep(1:length(collected$who), each=length(collected$who))

time2 <- rep(1:length(collected$who), times=length(collected$who))

#but we don't want to pair a time with itself

t3 <- time1 - time2

time1 <- time1[t3 != 0]

time2 <- time2[t3 != 0]

timeCombos <- data.frame(time1, time2)
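
Because a union gives the same result whichever way round the two times are taken, each unordered pair only needs checking once. As an alternative sketch (not what the code above does), base R's combn() lists every pair a single time. Here n is a stand-in for length(collected$who), and pairsOnce is a name made up for this example.

```r
# n stands in for length(collected$who); 4 is an arbitrary small value
n <- 4

# combn() returns a 2 x choose(n, 2) matrix, one column per unordered pair
idx <- combn(n, 2)

pairsOnce <- data.frame(time1 = idx[1, ], time2 = idx[2, ])

# half as many rows as the n * (n - 1) ordered pairs
nrow(pairsOnce)
```

With this version the best combination would be reported once rather than once for each ordering.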

Now we get the number of distinct people in the union of the groups at time1 and time2

timeCombos$totalPeople <- apply(timeCombos, 1, function(x){length(union(collected$who[[x[1]]],collected$who[[x[2]]]))})
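
The reason for taking the union rather than adding up the two head counts is that someone active at both times would otherwise be counted twice, so the two busiest times are not necessarily the best pair. A small made-up illustration (the ids are invented, not taken from exampleLog):

```r
# hypothetical people active at three different times
slotA <- c(1, 2, 3, 4, 5)
slotB <- c(1, 2, 3, 4)
slotC <- c(6, 7, 8)

# A and B are the two busiest times, but they share most of their people
length(union(slotA, slotB))

# A and C have fewer entries between them, yet reach more distinct people
length(union(slotA, slotC))
```

Here the first union has 5 members and the second has 8, even though A and B together account for more log entries.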

Then it is just a matter of finding the best, and acting on that knowledge

best <- timeCombos[timeCombos$totalPeople == max(timeCombos$totalPeople),]

best$when1 <- collected$dayhourHuman[best$time1]

best$when2 <- collected$dayhourHuman[best$time2]

print(paste(best$when1,"together with",best$when2))