This talks through my thinking about using regression to predict the Rugby World Cup semifinal matches. I don’t actually know much about rugby, but I do know data, so I am currently leading my work’s “who do you think are going to win World Cup matches”. I am writing this up before the semi-final matches to “show my working” and to show others the thinking about data that went into this, and if I crash and burn this weekend, then damnit at least I was numerically justified my expectations (it would actually be quite reasonable for rugby to have a twenty point difference between the same teams on different occasions due to events on the day, so all the predictions need to be caveated with that).

I am starting from what I consider the reasonable plan that a rugby teams rank have some relation to how likely they are to beat teams of different ranks, so I need some ranks. Thank you Wikipedia.

```
u <- "https://en.wikipedia.org/wiki/World_Rugby_Rankings"
if(!(file.exists("rugby.html"))){
download.file(u, destfile="rugby.html", method = "libcurl")
}
library(XML)
tables = readHTMLTable("rugby.html", stringsAsFactors = FALSE)
lapply(tables,dim) # visually checking which the 4 column table is as that is the one I want
```

`ranks <- tables[[1]]`

Now I do need to do a little cleanup of the table I have brought in

```
names(ranks) <- make.names(ranks[2,])
ranks <- ranks[3:27,]
ranks$Points <- as.numeric(ranks$Points)
```

Next step is getting the match results to date for the Rugby World Cup. I don’t have a convenient online source to import them from so enter them by hand.

```
#results to date
matchnum <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44)
matchdate <- as.Date(c("18/09/15" , "19/09/15" , "19/09/15" , "19/09/15" , "19/09/15" , "20/09/15" , "20/09/15" , "20/09/15" , "23/09/15" , "23/09/15" , "23/09/15" , "24/09/15" , "25/09/15" , "26/09/15" , "26/09/15" , "26/09/15" , "27/09/15" , "27/09/15" , "27/09/15" , "29/09/15" , "1/10/15" , "1/10/15" , "2/10/15" , "3/10/15" , "3/10/15" , "3/10/15" , "4/10/15" , "4/10/15" , "6/10/15" , "6/10/15" , "7/10/15" , "7/10/15" , "9/10/15" , "10/10/15" , "10/10/15" , "10/10/15" , "11/10/15" , "11/10/15" , "11/10/15" , "11/10/15" , "17/10/15" , "17/10/15" , "18/10/15" , "18/10/15"), "%d/%m/%y")
score1 <- c(35, 10, 50, 32, 32, 25, 54, 26, 45, 28, 38, 58, 54, 23, 46, 25, 65, 39, 44, 35, 23, 41, 43, 5, 34, 13, 45, 16, 15, 47, 64, 16, 47, 33, 15, 60, 64, 32, 9, 18, 23, 62, 20, 35)
score2 <- c(11, 17, 7, 34, 10, 16, 9, 16, 10, 13, 11, 14, 9, 18, 6, 28, 3, 16, 10, 21, 13, 18, 10, 26, 16, 33, 16, 9, 17, 15, 0, 17, 9, 36, 6, 3, 19, 22, 24, 28, 19, 13, 43, 34)
team1 <- c("England", "Tonga", "Ireland", "South Africa", "France", "Samoa", "Wales", "New Zealand", "Scotland", "Australia", "France", "New Zealand", "Argentina", "Italy", "South Africa", "England", "Australia", "Scotland", "Ireland", "Tonga", "Wales", "France", "New Zealand", "Samoa", "South Africa", "England", "Argentina", "Ireland", "Canada", "Fiji", "South Africa", "Namibia", "New Zealand", "Samoa", "Australia", "England", "Argentina", "Italy", "France", "United States", "South Africa", "New Zealand", "Ireland", "Australia")
team2 <- c("Fiji", "Georgia", "Canada", "Japan", "Italy", "United States", "Uruguay", "Argentina", "Japan", "Fiji", "Romania", "Namibia", "Georgia", "Canada", "Samoa", "Wales", "Uruguay", "United States", "Romania", "Namibia", "Fiji", "Canada", "Georgia", "Japan", "Scotland", "Australia", "Tonga", "Italy", "Romania", "Uruguay", "United States", "Georgia", "Tonga", "Scotland", "Wales", "Uruguay", "Namibia", "Romania", "Ireland", "Japan", "Wales", "France", "Argentina", "Scotland")
wcr <- data.frame(matchnum, matchdate, score1, score2, team1, team2)
```

Because I am actually interest in how each team does against other teams, I am going to do a bit of flipping around the match results and joining it to itself, so rather than be data where each entry is the match result between two teams, there are two entries for each match (1 from each teams perspective). I will be treating team1 as the “viewpoint” team.

```
wcr2 <- data.frame(matchnum, matchdate, score1=score2, score2=score1,team1=team2, team2=team1) #match results viewed from other team
scores <- rbind (wcr, wcr2)
```

The next job on the list is to combine the ranks with the match scores, and work out the differences in rank and the differences in score.

```
scores <- merge(scores, ranks[, c("Team", "Points")], by.x="team1", by.y="Team")
names(scores)[names(scores) == "Points"] <- "Points1"
scores <- merge(scores, ranks[, c("Team", "Points")], by.x="team2", by.y="Team")
names(scores)[names(scores) == "Points"] <- "Points2"
scores$sdiff <- scores$score1 - scores$score2
scores$rdiff <- scores$Points1 - scores$Points2
```

Let’s pause for a moment and look at a visualisation

`plot(scores$rdiff, scores$sdiff, pch=19, cex=0.8, col="#44444466")`

Even allowing for the fact the the bottom left and the top right are mirrors (because one team loses by the amount the other team wins), that looks pretty good as a relationship. We can formalise that pretty good as a linear regression model

```
basic_model <- lm(sdiff ~ rdiff, data=scores)
plot(scores$rdiff, scores$sdiff, pch=19, cex=0.8, col="#44444466")
abline(basic_model)
```

`summary(basic_model)`

```
##
## Call:
## lm(formula = sdiff ~ rdiff, data = scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.26 -6.91 0.00 6.91 33.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.222e-15 1.372e+00 0.00 1
## rdiff 1.956e+00 1.017e-01 19.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.87 on 86 degrees of freedom
## Multiple R-squared: 0.8113, Adjusted R-squared: 0.8091
## F-statistic: 369.8 on 1 and 86 DF, p-value: < 2.2e-16
```

Interpreting the model numbers, there is effectively no chance that the relationship is imaginary, and the relationship explains 80% of the pattern in the data. But what about the part of the results not explained by the overall relationship. Let’s first work out the score difference for each team in each match- the difference between what would be expected from them on the basis of the rank difference and what was actually scored.

```
scores$expected <- basic_model$coefficients[1] + scores$rdiff * basic_model$coefficients[2]
scores$diffdiff <- scores$sdiff - scores$expected
```

Now, for the four finalists, lets visualise how they have been under or over preforming through the tournament. We can do this either as visualising how they have been doing through time as the tournament progresses, or we can do so as how they have been doing on the basis of the rank of the other team. Tackling the through time first:

```
par(mfrow=c(2,2))
expected <- 2
actual <- 1
t1 <- "New Zealand"
col1 <- "black"
mtitle <- paste(t1,": expected (dashed)\nvs. actual (solid) points difference")
scores <- scores[order(scores$matchdate),]
plot(scores$matchdate, scores$sdiff, main=mtitle, cex=0.8, pch=3, col="#44444466", xlab="match date", ylab="points difference")
lines(scores$matchdate[scores$team1 == t1], scores$sdiff[scores$team1 == t1], col=col1, lty=actual)
lines(scores$matchdate[scores$team1 == t1], scores$expected[scores$team1 == t1], col=col1, lty=expected)
t1 <- "Australia"
col1 <- "gold1"
mtitle <- paste(t1,": expected (dashed)\nvs. actual (solid) points difference")
scores <- scores[order(scores$matchdate),]
plot(scores$matchdate, scores$sdiff, main=mtitle, cex=0.8, pch=3, col="#44444466", xlab="match date", ylab="points difference")
lines(scores$matchdate[scores$team1 == t1], scores$sdiff[scores$team1 == t1], col=col1, lty=actual)
lines(scores$matchdate[scores$team1 == t1], scores$expected[scores$team1 == t1], col=col1, lty=expected)
t1 <- "South Africa"
col1 <- "green"
mtitle <- paste(t1,": expected (dashed)\nvs. actual (solid) points difference")
scores <- scores[order(scores$matchdate),]
plot(scores$matchdate, scores$sdiff, main=mtitle, cex=0.8, pch=3, col="#44444466", xlab="match date", ylab="points difference")
lines(scores$matchdate[scores$team1 == t1], scores$sdiff[scores$team1 == t1], col=col1, lty=actual)
lines(scores$matchdate[scores$team1 == t1], scores$expected[scores$team1 == t1], col=col1, lty=expected)
t1 <- "Argentina"
col1 <- "blue"
mtitle <- paste(t1,": expected (dashed)\nvs. actual (solid) points difference")
scores <- scores[order(scores$matchdate),]
plot(scores$matchdate, scores$sdiff, main=mtitle, cex=0.8, pch=3, col="#44444466", xlab="match date", ylab="points difference")
lines(scores$matchdate[scores$team1 == t1], scores$sdiff[scores$team1 == t1], col=col1, lty=actual)
lines(scores$matchdate[scores$team1 == t1], scores$expected[scores$team1 == t1], col=col1, lty=expected)
```

`par(mfrow=c(1,1))`

So that’s an interesting set of graphs, but the interpretation is a bit speculative because there are only five points for each team. Maybe New Zealand underpreformed in the earlier stage, then has been improving, but can it keep overpreforming relative to its expected results. Maybe Australia started as well as expected but performance has been slipping, but will it continue to slip relative to what is expected from a team at its rank. South Africa seems to have had an uneven tournament, though playing about as well as expected overall. And Argentina never seems to have had a match where they played worse than expected.

Shall we try seeing how the same differences can be visualised against the rank of the opposing team? Putting them on one graph with the same colour set, we see:

```
par(mfrow=c(2,2))
expected <- 2
actual <- 1
t1 <- "New Zealand"
col1 <- "black"
mtitle <- paste(t1,": expected (dashed)\nvs. actual (solid) points difference")
scores <- scores[order(scores$rdiff),]
plot(scores$rdiff[scores$team1 == t1], scores$sdiff[scores$team1 == t1], main=mtitle, type="l", col=col1, lty=actual, ylim=c(-20,80), xlim=c(-10,40), xlab="rank difference compared to other team", ylab="score difference")
points(scores$rdiff[scores$team1 == t1], scores$sdiff[scores$team1 == t1], pch=19, cex=0.3, col=col1)
lines(scores$rdiff, scores$expected, col="#44444466", lty=expected)
points(scores$rdiff[scores$team1 == t1], scores$expected[scores$team1 == t1], pch=19, cex=0.3, col="#44444466")
t1 <- "Australia"
col1 <- "gold1"
mtitle <- paste(t1,": expected (dashed)\nvs. actual (solid) points difference")
scores <- scores[order(scores$rdiff),]
plot(scores$rdiff[scores$team1 == t1], scores$sdiff[scores$team1 == t1], main=mtitle, type="l", col=col1, lty=actual, ylim=c(-20,80), xlim=c(-10,40), xlab="rank difference compared to other team", ylab="score difference")
points(scores$rdiff[scores$team1 == t1], scores$sdiff[scores$team1 == t1], pch=19, cex=0.3, col=col1)
lines(scores$rdiff, scores$expected, col="#44444466", lty=expected)
points(scores$rdiff[scores$team1 == t1], scores$expected[scores$team1 == t1], pch=19, cex=0.3, col="#44444466")
t1 <- "South Africa"
col1 <- "green"
mtitle <- paste(t1,": expected (dashed)\nvs. actual (solid) points difference")
scores <- scores[order(scores$rdiff),]
plot(scores$rdiff[scores$team1 == t1], scores$sdiff[scores$team1 == t1], main=mtitle, type="l", col=col1, lty=actual, ylim=c(-20,80), xlim=c(-10,40), xlab="rank difference compared to other team", ylab="score difference")
points(scores$rdiff[scores$team1 == t1], scores$sdiff[scores$team1 == t1], pch=19, cex=0.3, col=col1)
lines(scores$rdiff, scores$expected, col="#44444466", lty=expected)
points(scores$rdiff[scores$team1 == t1], scores$expected[scores$team1 == t1], pch=19, cex=0.3, col="#44444466")
t1 <- "Argentina"
col1 <- "blue"
mtitle <- paste(t1,": expected (dashed)\nvs. actual (solid) points difference")
scores <- scores[order(scores$rdiff),]
plot(scores$rdiff[scores$team1 == t1], scores$sdiff[scores$team1 == t1], main=mtitle, type="l", col=col1, lty=actual, ylim=c(-20,80), xlim=c(-10,40), xlab="rank difference compared to other team", ylab="score difference")
points(scores$rdiff[scores$team1 == t1], scores$sdiff[scores$team1 == t1], pch=19, cex=0.3, col=col1)
lines(scores$rdiff, scores$expected, col="#44444466", lty=expected)
points(scores$rdiff[scores$team1 == t1], scores$expected[scores$team1 == t1], pch=19, cex=0.3, col="#44444466")
```

`par(mfrow=c(1,1))`

There are a host of stories that might be the case here- New Zealand underpreforming against week teams (which is to say still winning, but not by as much as a pure linear relationship might suggest), Australia seem to be struggling against better sides. South Africa who knows (but you clearly don’t want to play South Africa as a weak team, you want to play New Zealand), Argentina performing well against opponents of all levels. But with only five data points it is very much the case of “might be the case”.

And with those five data points, what you predict will happen in the semi-finals hinges on how you interpret those graphs, in particular the arguable question of “Is Australia fading?”. It may be underpreforming against better teams and may be underpreforming more as the tournament goes one. Or they might just be playing a bit below expectations with some variation.

I am going to calculate to forms of correction to the data, one form is for if the patterns are true, so building a linear model for the patterns in the data of the differences to the expected results, the other form assumes that is is just variation and takes the median difference to expectations.

```
#calculate the degree outperforming by rank for each country
intercep_rank <- sapply(unique(scores$team1), function(x){lm(diffdiff ~ rdiff, data= scores, subset=(team1==x) )$coefficients[1]})
slope_rank <- sapply(unique(scores$team1), function(x){lm(diffdiff ~ rdiff, data= scores, subset=(team1==x) )$coefficients[2]})
corrections <- data.frame(Team=unique(scores$team1), intercept=intercep_rank, slope=slope_rank)
overpreform <- aggregate(diffdiff ~ team1, data=scores, median)
```

For deploying those in predictions, I am going to take the halfway point between the two models as being good enough. So New Zealand vs. South Africa:

```
t1 <- "New Zealand"
t2 <- "South Africa"
r1 <- ranks$Points[ranks$Team == t1]
r2 <- ranks$Points[ranks$Team == t2]
rvar <- r1 - r2
m1 <- basic_model$coefficients[1] + rvar * basic_model$coefficients[2] + corrections$intercept[corrections$Team == t1] + rvar * corrections$slope[corrections$Team == t1] - corrections$intercept[corrections$Team == t2] - rvar * corrections$slope[corrections$Team == t2]
m2 <- basic_model$coefficients[1] + rvar * basic_model$coefficients[2] + overpreform$diffdiff[overpreform$team1 == t1] - overpreform$diffdiff[overpreform$team1 == t2]
m1
```

```
## (Intercept)
## 29.87665
```

`m2`

```
## (Intercept)
## 5.943232
```

`(m1 + m2) /2`

```
## (Intercept)
## 17.90994
```

For a not very contentious comfortable win to New Zealand. Then Argentina vs. Australia:

```
t1 <- "Argentina"
t2 <- "Australia"
r1 <- ranks$Points[ranks$Team == t1]
r2 <- ranks$Points[ranks$Team == t2]
rvar <- r1 - r2
m1 <- basic_model$coefficients[1] + rvar * basic_model$coefficients[2] + corrections$intercept[corrections$Team == t1] + rvar * corrections$slope[corrections$Team == t1] - corrections$intercept[corrections$Team == t2] - rvar * corrections$slope[corrections$Team == t2]
m2 <- basic_model$coefficients[1] + rvar * basic_model$coefficients[2] + overpreform$diffdiff[overpreform$team1 == t1] - overpreform$diffdiff[overpreform$team1 == t2]
m1
```

```
## (Intercept)
## 21.18233
```

`m2`

```
## (Intercept)
## -0.6176184
```

`(m1 + m2) /2`

```
## (Intercept)
## 10.28235
```

The prediction is a somewhat more contentious Argentina by ten points.

Just as a final note on dragging more information out of the data. If I did want to take this a step further, I would normalise the under/overpreformance of team1 against the aggregate under/overpreformance of team2. For example, England arguably underpreformed at the World Cup, so for teams that played England part of their overperformance is caused by the difference when measured against England. But that adds a lot of work, so not today.