People are social: data matching individuals by associations

This is a brief write-up explaining how, if you have an individual identifier and a group identifier, you can match individuals between data sources on the basis of the other members of the group. In this hypothetical example I use birth date and household, but the principle holds for any individual identifier where a group identifier is also present (when I was working on this in the early 2000s, name was the individual identifier).

Say I have a population of 10000 people with an age range of about 65 years

DoB <- sample(seq(from=as.Date("1950-01-01"), to=as.Date("2015-12-31"), by=1),10000, replace=TRUE)

This is a large enough group that some collisions occur

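# number of people born on each distinct date
DoBfreq <- tapply(DoB, DoB, length)
# how many dates are shared by 1, 2, 3, ... people
table(DoBfreq)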
## DoBfreq
##    1    2    3    4    5    6 
## 6544 1354  218   18    2    2

Of the 10000 inhabitants, 6544 have unique birth dates, and 2 birth dates each have 6 people born on them.

In theory, the 3456 individuals who do not have unique birth dates are difficult to match between hypothetical source A and hypothetical source B.

However, if we have some information associating individuals with other individuals, such associations tend to persist over time. So we can apply these relationships to the individual as an identifying feature, effectively using their social network graph as a matching criterion.

So let's say the 10000 people belong to a potential 3000 households
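The number of people without a unique birth date can be read off the frequencies directly: it is everyone who is not in a group of size one. A minimal sketch, assuming a fixed seed (the seed is my addition for reproducibility, so the counts it produces will differ from the run shown above):

```r
set.seed(1)  # assumed seed, not part of the original run
DoB <- sample(seq(from = as.Date("1950-01-01"), to = as.Date("2015-12-31"), by = 1),
              10000, replace = TRUE)

# number of people born on each distinct date
DoBfreq <- as.vector(table(DoB))

# people who are the only one born on their date
uniquePeople <- sum(DoBfreq == 1)
# people who share their date with at least one other person
sharedPeople <- sum(DoBfreq[DoBfreq > 1])

uniquePeople
sharedPeople
uniquePeople + sharedPeople  # the two groups always add up to the population
```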

hhd <- sample(1:3000,10000, replace=TRUE)
sourceA <- data.frame(DoB, hhd)

The same household might carry a different identifier in each of the sources we are matching

sourceB <- sourceA
sourceB$hhd <- 3001 - sourceA$hhd

So the household identifier initially seems to contain no useful data. However, we can think of the household as an aggregation of the individuals that make it up, and build a key from its members' birth dates.

sourceAextendedKey <- aggregate(DoB ~ hhd, sourceA, function(x){paste(sort(as.character(x)), collapse= " ")})
sourceBextendedKey <- aggregate(DoB ~ hhd, sourceB, function(x){paste(sort(as.character(x)), collapse= " ")})

sourceA <- merge(sourceA, sourceAextendedKey, by="hhd")
names(sourceA) <- c("hhd", "DoB", "aggKey")
sourceB <- merge(sourceB, sourceBextendedKey, by="hhd")
names(sourceB) <- c("hhd", "DoB", "aggKey")
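To make the aggregated key concrete, here is a small sketch on made-up data. The data frames `toyA` and `toyB` and the helper `makeKey` are hypothetical names for illustration, and the birth dates are stored as character strings to keep the example simple. The point is that relabelling the households leaves the keys unchanged.

```r
# hypothetical toy data: household 1 has three members, household 2 has one
toyA <- data.frame(
  DoB = c("1950-03-14", "1980-11-23", "1952-07-01", "1975-06-30"),
  hhd = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)
# same people, but the households carry different labels
toyB <- toyA
toyB$hhd <- 3 - toyA$hhd

# household key: the sorted birth dates of all members, pasted together
makeKey <- function(x) paste(sort(as.character(x)), collapse = " ")
keyA <- aggregate(DoB ~ hhd, toyA, makeKey)
keyB <- aggregate(DoB ~ hhd, toyB, makeKey)

keyA$DoB
# the household labels differ, but the set of keys is the same
setequal(keyA$DoB, keyB$DoB)
```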

print("a number of 10000 means no ambiguity in matching people in households")
## [1] "a number of 10000 means no ambiguity in matching people in households"
nrow(merge(sourceA, sourceB, by=c("DoB","aggKey")))
## [1] 10010
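The merge returns more than 10000 rows because every person with a given key in one source pairs with every person with that key in the other, so a key shared by two people contributes 2 * 2 = 4 rows rather than 2. A small sketch with made-up data (the keys `k1`, `k2`, `k3` and the columns `idA`, `idB` are hypothetical):

```r
# toy data: key "k2" is shared by two people in each source
a <- data.frame(key = c("k1", "k2", "k2", "k3"), idA = 1:4)
b <- data.frame(key = c("k1", "k2", "k2", "k3"), idB = 1:4)

matched <- merge(a, b, by = "key")
nrow(matched)  # 6: one row each for k1 and k3, 2 * 2 rows for k2

# the same figure from the group sizes: sum of (size in a) * (size in b)
sum(as.vector(table(a$key)) * as.vector(table(b$key)))
```

By the same reasoning, 9990 unique keys plus 5 keys shared by two people each would give 9990 + 5 * 4 = 10010 rows.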

So rather than matching on just the date of birth, we match on the date of birth as part of a larger collection of dates of birth, and get a big decrease in ambiguity.

DobplusH <- paste(sourceA$DoB,sourceA$aggKey)
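# number of people sharing each combined birth date and household key
DoBwithH <- tapply(DobplusH, DobplusH, length)
# how many combined keys are shared by 1, 2, ... people
table(DoBwithH)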
## DoBwithH
##    1    2 
## 9990    5

In this example there are 10 individuals who have an identity collision on date of birth plus a household described by its dates of birth, rather than the 3456 problem entries initially (so from 34.6% problem entries to 0.1%).

In real-world data there is the added issue that while relationships are persistent, they are not permanent. So rather than a direct match against the aggregate, you need your matching techniques to be sensitive to the break-up or extension of the aggregate units. But that can be addressed by more analysis and code.
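One possible way to tolerate households that gain or lose members is to score the overlap between two households' sets of birth dates instead of requiring the pasted keys to be identical. The sketch below uses Jaccard similarity, which is my choice of measure for illustration and not something the analysis above used. The function name `jaccard` and the example households are hypothetical.

```r
# Jaccard similarity between two sets of birth dates:
# number of shared dates divided by number of distinct dates across both
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# hypothetical household as seen in source A, and the same household in
# source B after one member has left and a new one has joined
hhA <- c("1950-03-14", "1952-07-01", "1980-11-23", "1983-02-09")
hhB <- c("1950-03-14", "1952-07-01", "1983-02-09", "2010-05-30")
# an unrelated household with no dates in common
unrelated <- c("1961-01-05", "1990-09-17")

jaccard(hhA, hhB)        # 3 shared dates out of 5 distinct = 0.6
jaccard(hhA, unrelated)  # 0
```

A candidate match could then be accepted when the score clears some threshold. Where to set that threshold is an assumption that would need tuning against real data.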