Which is to say, use sets.

There is a class of problem where I have seen people miss potentially useful approaches a few times, because they didn’t recognise it as a set problem. Basically, if the question involves terms like “combined”, “excluding”, or “common to both”, it may benefit from thinking in set terms.

To some extent whenever you are subsetting data you are using sets (even more so for multiple criteria), but I was thinking of the more formal operations like union and intersect.
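As a quick illustration of those formal operations, here is a minimal sketch using base R's `union`, `intersect` and `setdiff`. The two vectors of user ids are made up for the example.

```r
# Two hypothetical groups of user ids (invented for illustration)
morning <- c(1, 2, 3, 5, 8)
evening <- c(3, 5, 13, 21)

union(morning, evening)      # "combined": 1 2 3 5 8 13 21
intersect(morning, evening)  # "common to both": 3 5
setdiff(morning, evening)    # morning "excluding" evening: 1 2 8
```

Each function treats its arguments as sets, so duplicates within a vector are dropped from the result.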

The other day I encountered the problem of “Based on the log information of when people are active on a site, when would be the best time to hold a live event”, which is a basic aggregation question. But this question led to the more complex “what combination of two times would capture the most people”. Note that this question involves a combination, in this case the union of the people active at the two times. So here is an example:

Create a log of a fictitious week's worth of data

#example data
people <- 1:200
times <- seq.POSIXt(from=as.POSIXct("2016-02-08 00:00:00"), to=as.POSIXct("2016-02-15 00:00:00"), by=60)
who <- sample(people,1000,replace=TRUE)
when <- times[sample(1:length(times),1000,replace=TRUE)]
exampleLog <- data.frame(who,when)

And now, analysing the log to determine matches

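#round to the nearest hour (hours(), minute(), hour() and wday() come from the lubridate package)
library(lubridate)
#trunc() returns POSIXlt, so convert back to POSIXct before storing in the data frame
t1 <- as.POSIXct(trunc(exampleLog$when, units="hours"))
exampleLog$roundedHour <- as.POSIXct(trunc(exampleLog$when + hours(1), units="hours"))
exampleLog$roundedHour[minute(exampleLog$when) < 30] <- t1[minute(exampleLog$when) < 30]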
#get weekdays and hours of the day
exampleLog$hour <- hour(exampleLog$roundedHour)
exampleLog$dayHuman <- wday(exampleLog$roundedHour, label=TRUE)
exampleLog$dayComputer <- wday(exampleLog$roundedHour)
#stick them together for ease of processing
exampleLog$dayhourHuman <- paste(exampleLog$dayHuman, exampleLog$hour)
exampleLog$dayhourComputer <- exampleLog$dayComputer + (exampleLog$hour/100)
#in reality dayhourComputer is only useful for knowing the natural order of the entries, for example if making graphs
#The list of people for each time for which there are people
collected <- aggregate(who ~ dayhourHuman + dayhourComputer, data=exampleLog, FUN=c)
#the pairs of each potential time (for two, this is quite an efficient way) by their index numbers
time1 <- rep(1:length(collected$who), each=length(collected$who))
time2 <- rep(1:length(collected$who), times=length(collected$who))
#but we don't want to double up the same times
t3 <- time1 - time2
time1 <- time1[t3 != 0]
time2 <- time2[t3 != 0]
timeCombos <- data.frame(time1, time2)

Now we get the size of the union of the groups of people at time1 and time2

timeCombos$totalPeople <- apply(timeCombos, 1, function(x){length(union(collected$who[[x[1]]],collected$who[[x[2]]]))})

Then it is just a matter of finding the best, and acting on that knowledge

best <- timeCombos[timeCombos$totalPeople == max(timeCombos$totalPeople),]
best$when1 <- collected$dayhourHuman[best$time1]
best$when2 <- collected$dayhourHuman[best$time2]
print(paste(best$when1,"together with",best$when2))
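One thing to note about the pairing step: dropping only the rows where time1 equals time2 keeps both orderings of each pair (1,2 as well as 2,1), so every combination is scored twice and the best pair shows up twice in the output. A sketch of an alternative using base R's `combn`, which lists each unordered pair once, is below. The `slots` list is a hypothetical stand-in for `collected$who`, and `pairCombos` is just an illustrative name.

```r
# Hypothetical stand-in for collected$who: the people active in each time slot
slots <- list(c(1, 2, 3), c(3, 4), c(5, 6, 7, 8))

# combn() returns each unordered pair of indices once, one pair per column
pairs <- combn(length(slots), 2)
pairCombos <- data.frame(time1 = pairs[1, ], time2 = pairs[2, ])

# size of the union of the two groups for every pair
pairCombos$totalPeople <- mapply(
  function(i, j) length(union(slots[[i]], slots[[j]])),
  pairCombos$time1, pairCombos$time2
)

# the pair(s) covering the most distinct people
pairCombos[pairCombos$totalPeople == max(pairCombos$totalPeople), ]
```

This halves the number of unions to compute, which starts to matter as the number of time slots grows.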

