New Zealand · R

Friday the 13th, Part 3

So, venturing into the “Does Friday the thirteenth affect births in New Zealand for a final time, after parts one and two.

But this part is about really showing what a little insight into the data can do to help an analysis – telling the story of the way a lever can unlock all kinds of insight (to make a very, very mixed metaphor)

Recap

We have an aggregate set of births in New Zealand by days of the month for the years 1980 to 2014, and are wondering if fear of 13ths of the months leads to avoiding scheduled c-sections and so influences birth figures. Part 1 basically found 13ths had a slightly smaller total of births, but not unusually so.

In part 2, I realised that 1980 to 2014 does not contain an even number of days of the week for each day- with the exception of February the 29th, each date contains 6 of one day of the week, 4 of another day of the week, and 5 instances of the other day (see code block 1). By comparing the 6, 5, and 4 values for particular days we can see how the number of births change as the number of instances of those days change. From that we can say that Wednesdays in particular, but also Thursdays are the days that scheduled c-sections are carried out within the aggregate period, but also that they are not conducted on Sundays in particular, but also not Saturday. With Fridays joining Mondays and Tuesdays as unremarkable days.

#code block 1, finding how many dates break the one 6, five 5s, one 4 pattern
dts <- seq.Date(from=as.Date("1980-01-01"), to=as.Date("2014-12-31"), by=1)
dys <- weekdays(dts)
library(lubridate)
mth <- month(dts, label=TRUE, abbr=FALSE)
mdt <- mday(dts)
alldates <- data.frame(dts,dys,mth, mdt)
library(dplyr)
library(tidyr)
alldates %>% group_by(mth,mdt) %>% summarise(Mondays = sum(dys == "Monday"),
                                             Tuesdays = sum(dys == "Tuesday"),
                                             Wednesdays = sum(dys == "Wednesday"),
                                             Thursdays = sum(dys == "Thursday"),
                                             Fridays = sum(dys == "Friday"),
                                             Saturdays = sum(dys == "Saturday"),
                                             Sundays = sum(dys == "Sunday")) %>%
  filter(!(mth == "February" & mdt == 29)) %>%
  gather(key=day, value=count, Mondays:Sundays) %>% group_by(mth, mdt) %>%
  summarise(fours = sum(count==4), fives= sum(count==5), sixes= sum(count==6)) %>%
  filter(fours != 1 | sixes != 1)
## Source: local data frame [0 x 5]
## Groups: mth [0]
## 
## Variables not shown: mth (fctr), mdt (int), fours (int), fives (int),
##   sixes (int)
#none break the pattern

But the story of the data is still not is clear as it can be, while each non-leap day has the same 6/ 5s/ 4 pattern, the distribution of the 6/5s/4 changes. In the case of the Friday the 13ths (see codeblock 2)

#codeblock 2 shows the 4 and 6 days for the 13ths. It needs the alldates variable created in codeblock 1
alldates %>% group_by(mth,mdt) %>% summarise(Mondays = sum(dys == "Monday"),
                                             Tuesdays = sum(dys == "Tuesday"),
                                             Wednesdays = sum(dys == "Wednesday"),
                                             Thursdays = sum(dys == "Thursday"),
                                             Fridays = sum(dys == "Friday"),
                                             Saturdays = sum(dys == "Saturday"),
                                             Sundays = sum(dys == "Sunday")) %>%
  gather(key=day, value=count, Mondays:Sundays) %>% 
  filter(count !=5, mdt == 13) %>%
  select(mth, mdt, count, day) %>%
  spread (key=count, value=day) -> thirteens
names(thirteens) <- c("Month", "Date", "has 4", "has 6")
knitr::kable(thirteens)
Month Date has 4 has 6
January 13 Saturdays Sundays
February 13 Tuesdays Wednesdays
March 13 Mondays Thursdays
April 13 Thursdays Sundays
May 13 Saturdays Tuesdays
June 13 Tuesdays Fridays
July 13 Thursdays Sundays
August 13 Sundays Wednesdays
September 13 Wednesdays Saturdays
October 13 Fridays Mondays
November 13 Mondays Thursdays
December 13 Wednesdays Saturdays

So, for example January the 13th has 4 Saturdays (fewer low days) but 6 Sundays (more low days), while May the 13th has 4 Saturdays (fewer low days) but 6 Tuesdays (more average days). There is still variation within the data.

However, each day has the same pattern as a week before and after (note: this was internet commentator Megan Pledger on Stats Chat), except for where the pattern is being disrupted by Leap Days- the week before and after the possible 29th (that is me again, figuring out the implementation). So we can assess the significance of a date by comparing it to the average of the date a week before and after (excluding aforementioned leap zone dates) giving an amount above or below the expected births for each date (see codeblock 3)

library(readxl)
library(lubridate)
birthData <- read_excel("b.xlsx", sheet = 2, skip=2)
names(birthData)[1] <- "day"
#take a handy leap year
thedate <- seq.Date(from = as.Date("2015-12-22"), to=as.Date("2017-01-07"), by="day")
wbefore <- thedate - days(7)
wafter <- thedate + days(7)
differences <- data.frame(thedate, wbefore, wafter)
actual <- apply(differences,1, function(x){birthData[[as.numeric(mday(x[["thedate"]])), as.character(month(x[["thedate"]], label=TRUE, abbr=FALSE))]]})
weekbf <- apply(differences,1, function(x){birthData[[as.numeric(mday(x[["wbefore"]])), as.character(month(x[["wbefore"]], label=TRUE, abbr=FALSE))]]})
weekaf <- apply(differences,1, function(x){birthData[[as.numeric(mday(x[["wafter"]])), as.character(month(x[["wafter"]], label=TRUE, abbr=FALSE))]]})
expected <- (weekbf + weekaf) / 2
propDiff <- (actual - expected) / expected
results <- data.frame(thedate, actual, expected, propDiff)
results <- results[year(results$thedate)==2016,]

We can plot the entire years variation from the expected (codeblock 4)

plot(results$thedate, results$propDiff, type="l", frame.plot = FALSE)

codeblock_04-1

The distortion at the end of the year is Christmas, the apparent distortion at the start of February many times the Christmas effect is the Leap Day affect, where the presence or absence of Feb29th is causing days in this zone to be matched against days with a different pattern of 6/5s/4. This also shows the strength of the differences of individual days (the Feb March variation) to the strength of an effect caused by a significant date (Christmas). So we blank out the period a week either side of February the 29th (codeblock 5)

early <- results$thedate < as.Date("2016-02-22")
late <- results$thedate > as.Date("2016-03-07")
cleaned <- early | late
errorRange <- !(cleaned)
fixedHolidays= as.Date(c("2016-01-01", "2016-01-02", "2016-02-06", "2016-04-25", "2016-12-25","2016-12-26"))
holidays <- results$thedate %in% fixedHolidays

And revisit the plot as the years variation excluding leap affected days (codeblock 6)

plot(results$thedate[cleaned], results$propDiff[cleaned], type="n", frame.plot = FALSE)
lines(results$thedate[early], results$propDiff[early])
lines(results$thedate[late], results$propDiff[late])
lines(results$thedate[errorRange], rep(0, sum(errorRange)), lty=3, lwd=0.5)
points(results$thedate[holidays], results$propDiff[holidays], pch=19,cex=0.5, col="red")

codeblock_06-1

I have added red dots for fixed holidays in New Zealand – New Years Day, Day after New Years Day, Waitangi Day, ANZAC day, Christmas Day, and Boxing day. I want to make a couple of points about holidays:

  • Holidays inflate the number above expectations of the dates a week either side- because Holidays are so low, not only are holidays themselves low, but as half of the dates that then entry a week before or after is being compared against they deflate the expected value of that date. This has the effect of comparing the actual value to a lower than true expected value, leading to artificially high entries one week either side of a holiday. I could correct for this to a degree, but it is more of the same as what I have done, and is beyond what I want to do for a recreational blog post, so I will leave untangling that one as an exercise for the reader. Though Christmas/New Years are particularly interesting in this respect, as the are exactly a week apart.
  • I have not included mobile holidays, like Easter (based on the moon), Queen’s Birthday (1st Monday in June), or Labour Day (4th Monday in October). These are wandering around in the data set have similar effects as the fixed days, but spread over the specific dates they have occurred on. For that matter, even fixed holidays do a form of wandering- in New Zealand Mondayisation influences when holidays actually equates to time off work. For employees who would not otherwise work on that Saturday or Sunday, the Public holiday must be treated as falling on the following Monday or Tuesday. For employees who would otherwise work on that Saturday or Sunday, the public holiday must be treated as falling on that day. It is again possible to address the wandering days off work effects, by figuring out how often the holidays occur on particular dates and how often Mondayisation takes place, but again this is a recreational post and I leave that as an exercise for the reader.

I think the holidays effect goes some way to explaining the shape of the histogram of the under/over expectations of each date (codeblock 7)

hist(results$propDiff[cleaned])

codeblock_07-1

There are more extreme values on the negative side, and more entries close to zero on the positive side. This is arguably in part because a holiday has one big negative effect on one date, and a mild positive effect on two other dates.

So where does Friday the thirteenth fit in all this

Holidays aside, there is not a lot of difference in the data, but if we aggregate all the clean dates we see (codeblock 8) that the 13th has the lowest median number of births compared to the expected amount based on dates a week either side, and the 5th lowest mean number of births (compared to as before).

library(dplyr)
results[cleaned,] %>% mutate(mthDate = mday(thedate)) %>% group_by(mthDate) %>% summarise(mean=mean(propDiff), median=median(propDiff)) %>% arrange(median) %>% mutate(medrank = 1:n()) %>% arrange(mean) %>% mutate(meanrank= 1:n()) %>% filter(mthDate == 13) %>% select(medrank, meanrank)
## Source: local data frame [1 x 2]
## 
##   medrank meanrank
##     (int)    (int)
## 1       1        5

This would not seem to be a very likely outcome if days were independent and the 13th is not being selected against. However, there are no holidays regularly on the 6th or the 20th of the Month, which means that the 13th is not being raised artificially (like 12 dates one week from fixed holidays, plus however many are affected by moving holidays). But I think we can clarify matters with what I am calling the “you have to come out sometime” theorem.

You have to come out sometime

If you look at the Christmas, New Years, and Waitangi Day, there are clear surges in days immediately adjacent to the holiday dates. This is because the broad delivery date was decided almost nine months before. Not being born on a particular date does not mean a non-event, it means shifting the event to another date. With genuinely independent events the occurrence of one of the events does not influence other occurrences, but when planning a scheduled delivery moving it from on date means putting it on another date.

This is, incidentally, why I’ve been keeping away from most formal statistical tests on this- the dependence of the data would make the results of most pretty meaningless. But this same reason is also a lever into judging the likely hood that the low level of births is the result of displacement (Fear of Friday the 13th) rather than natural variation in the number of people due to be born as a result of events almost nine months ago. To be a genuine relationship there has to be evidence of both a fall on the day, and a displacement of the births to neighboring days.

In the case of Friday 13ths, we can be pretty specific about this, we know from part 2 deliveries are not normally scheduled for the weekend and are scheduled on Wednesdays and Thursdays, so we are looking for unexpected increases in births on the 11th and 12th. And at an aggregate level (codeblock 9) there seems to be something to this. Calculating raw numbers over or under expected values, we find over the year the 11th is 377 births greater than expected, the 12th 210.5 births greater than expected (don’t ask :)) and the 13th is 507 births less than expect.

library(dplyr)
library(lubridate)
results$rawDiff <- results$actual - results$expected
results$mthDate = mday(results$thedate)
results %>% filter(mthDate == 11 | mthDate == 12 | mthDate == 13) %>% group_by(mthDate) %>% summarise(totalDifference = sum (rawDiff)) 
## Source: local data frame [3 x 2]
## 
##   mthDate totalDifference
##     (int)           (dbl)
## 1      11           377.0
## 2      12           210.5
## 3      13          -507.0

However, the all dates aggregated misses a critical limitation- A delivery that would otherwise be on May the 13th cannot shift to October 12th, it has to take place near May the 13th. This is the second insight being applied, so we are combining differences between days and looking for displacement of births rather than absence. So we need to compare each 13th with its immediately preceding 11th and 12th (codeblock 10)

library(dplyr)
library(tidyr)
results %>% filter(mthDate == 11 | mthDate == 12 | mthDate == 13) %>%
  mutate(mth = month(thedate, label=TRUE, abbr=FALSE), nmth = month(thedate)) %>%
  select(mth, nmth, mthDate, rawDiff) %>% spread(mthDate,rawDiff) %>%
  mutate(twoBeforeSum = `11` + `12`) %>%
  select(mth,twoBeforeSum,`13`) -> dsp
names(dsp) <- c("month", "twoDaysBefore","Thirteenth")
plot(c(1,13), c(min(dsp[,2:3]),max(dsp[,2:3])), type="n", frame.plot = FALSE)
abline(h=0, lty=3, col="#779999")
l <- lapply(1:12, function(x){lines(c(x,x+0.5),c(dsp[x,2],dsp[x,3]), col="#CCCCCC", lwd=0.5)})
l <- lapply(1:12, function(x){points(x,dsp[x,2], col="#0000AA", pch=15, cex=0.8)})
l <- lapply(1:12, function(x){points(x+0.5,dsp[x,3], col="#00EE00", pch=19, cex=0.8)})
knitr::kable(dsp)

codeblock_10-1

month twoDaysBefore Thirteenth
January 209.0 -162.5
February 236.5 216.0
March 160.5 -81.5
April -28.5 -224.0
May -9.5 -46.5
June 23.5 -143.0
July -163.0 -17.0
August 37.5 132.5
September -94.0 -33.0
October -34.0 -111.0
November 234.5 -83.5
December 15.0 46.5

If Friday the 13th was causing displacement, we would expect the following:

  • Fridays births (in relation to expected) should be lower than the combined Wednesday and Tuesday (in relation to expected).

In a “perfectly fearful” world every Friday is less than the previous two days, in a situation of no fear an equal number (6 each). In this case there are 8 dates, which is much more similar to a no fear than a fear situation.

  • The expected values for the Friday and the preceding days should be on opposite sides of zero.

In a situation of fear that has no daily variability (so effects are only caused by the displacement of births) the below expected days are matched by above expected days (as with fixed day holidays earlier in this post). This is only the case with four months.

More stringently, the offset from zero for the Friday in the negative should be matched by an offset in the previous two days in the positive direction. In this case one out of the twelve happens to fit that pattern.

So there is just as much of a lack evidence for Friday the 13th as in the previous parts

Advertisements

2 thoughts on “Friday the 13th, Part 3

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s