Data Linkage

What is data linkage?

Data linkage is the action of bringing together data that relate to the same individual, family, place or event. The resulting datasets reveal rich information that can be used for research, service planning, delivery and evaluation. For example, if data related to each person’s driving licences was linked to their public hospital admissions data, we could see whether people who are caught speeding are more likely to be admitted to hospital with heart problems over their lifetime. Data linkage development and centres that perform data linkage are rooted in health research, but are spreading across other service areas. Australia is one of the world leaders in data linkage, along with Canada and the UK, and has set an exceptionally high standard for the process and the use of its outputs.

Data linkage depends on the collection of data and produces rich datasets for analysis, but neither the collection nor the analysis of data is part of the definition of data linkage.

The data linkage process is designed to minimise the risk of anyone having access to identifying information (e.g. names and addresses) and linked data at the same time. The process usually seeks to avoid a situation where a massive dataset exists with everyone’s identifying information as well as information from more than one source.

Why do we need data linkage?

Many people assume that their state and federal governments have a massive database with information about everyone on it. This just isn’t the case. Every government department has a separate data system, or even several. This means that when you enrol your child at school, the school can’t just look up the name and date of birth you gave when you applied for your driving licence. It can become repetitive giving each government agency the same information over and over again, but departments are not allowed to simply share people’s information with each other.

But if we do link data we can answer really difficult questions and even save lives. For example, if we link birth records to prescription records we can see whether certain drugs may be harmful if taken during pregnancy or childhood. This is important because many drugs haven’t been evaluated for pregnant women and children.[1] Transport NSW linked its crash data to data from a number of other agencies to better understand the factors related to serious injuries and therefore improve road safety.[2]

The Bureau of Crime Statistics and Research in NSW provides information to the public and policy makers all over the world. Unlike most government agencies, it continuously links Police data with data from the courts and corrections systems.[3]

The primary purpose for developing linked datasets might be for research, service planning, delivery or evaluation. There are, however, secondary benefits to these datasets. For example, data linkage to evaluate the provision of health services to juvenile offenders will also provide valuable information that allows public servants to understand the health needs of juvenile offenders, so that they can plan to better address these needs and track changes over time. Other benefits include increased collaboration between agencies, improved analytical skills in the public service and new datasets that may be de-identified and provided for public access.

Is Community Insight Australia data linkage?

Community Insight Australia is technically data linkage as it links data by location, but our data is not linked by individual person, which is what most people mean when they refer to data linkage. So you can’t see how many divorced single parents born in Greece are on a disability pension. You can only define an area and see the proportion of divorcees, proportion of single parents, proportion of people born in Greece and proportion of people on a disability pension (see below).

Excerpt from Community Insight Australia dashboard. Figures are percentages of the population.

Area                 Divorced  One parent family  Born in Greece  Disability Support Pension
Adelaide City        7.7       12                 0.4             5.9
Gold Coast combined  10        17.3               0.1             4.7
Gold Coast North     10.4      18.6               0.1             4.9
Greater Hobart       8.1       17.6               0.2             6.4
Sutherland Shire     6.2       14.9               0               1.5

 Data sources: ABS census 2011, DSS payments data December 2015.
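
To illustrate why area-level proportions cannot answer person-level questions, here is a minimal Python sketch. The two areas and all of their records are invented for illustration, and the function names are hypothetical. Both areas produce identical proportions, yet the number of people who are both divorced and single parents differs, and that count is only visible when data is linked by individual.

```python
# Hypothetical person-level records for two areas (all values invented).
area_a = [
    {"divorced": True,  "single_parent": True,  "born_in_greece": False},
    {"divorced": True,  "single_parent": False, "born_in_greece": False},
    {"divorced": False, "single_parent": True,  "born_in_greece": True},
    {"divorced": False, "single_parent": False, "born_in_greece": False},
]
area_b = [
    {"divorced": True,  "single_parent": True,  "born_in_greece": True},
    {"divorced": True,  "single_parent": True,  "born_in_greece": False},
    {"divorced": False, "single_parent": False, "born_in_greece": False},
    {"divorced": False, "single_parent": False, "born_in_greece": False},
]


def area_proportions(records):
    """Percentage of people with each characteristic (the area-level view)."""
    n = len(records)
    return {field: 100 * sum(r[field] for r in records) / n for field in records[0]}


def count_with_all(records, fields):
    """Number of people with every listed characteristic (needs person-level data)."""
    return sum(all(r[field] for field in fields) for r in records)


# The area-level view cannot tell the two areas apart...
print(area_proportions(area_a) == area_proportions(area_b))
# ...but the person-level counts differ.
print(count_with_all(area_a, ["divorced", "single_parent"]))
print(count_with_all(area_b, ["divorced", "single_parent"]))
```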

Why are people worried about data linkage?

One of the biggest challenges in making data available is ensuring that no person or organisation is likely to be identified in the data. Recently people were worried about their census data being linked to other datasets held by the government. Data linkage has risks:

  1. Because linkage brings together more information about each individual, a person is more likely to be identifiable from the new dataset than from the original datasets. For this reason, researchers working on linked data usually do so within secure environments (e.g. the Sax Institute’s Secure Unified Research Environment), which means that they cannot print or download the data, and their actions are tracked by secure systems.
  2. Most data linkage processes ensure that personal information and linked data are never held by the same person (see example from CHeReL below). This means that even if data is hacked or leaked, the possibility of people and their information being identified is very low.
  3. Current data linkage processes in Australia are very risk-averse and contain multiple safeguards. It is possible, however unlikely, that future changes to ABS legislation and other laws and processes could allow the creation and storage of linked datasets that include identifying information.
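
The first risk in the list above can be made concrete with a small Python sketch. The records, field names and the `unique_records` helper are all invented for illustration; this is not a method described by any of the agencies mentioned. It counts how many records have a combination of values that nobody else shares, and shows that number growing as fields from additional sources are added.

```python
from collections import Counter

# Hypothetical records with names already removed (all values invented).
records = [
    {"postcode": "2000", "birth_year": 1980, "licence_class": "C",  "admitted_2015": True},
    {"postcode": "2000", "birth_year": 1980, "licence_class": "C",  "admitted_2015": False},
    {"postcode": "2000", "birth_year": 1980, "licence_class": "HR", "admitted_2015": False},
    {"postcode": "2010", "birth_year": 1975, "licence_class": "C",  "admitted_2015": True},
    {"postcode": "2010", "birth_year": 1975, "licence_class": "C",  "admitted_2015": True},
]


def unique_records(rows, fields):
    """Count rows whose combination of values on `fields` appears exactly once."""
    combos = Counter(tuple(row[f] for f in fields) for row in rows)
    return sum(1 for row in rows if combos[tuple(row[f] for f in fields)] == 1)


# Fields that might come from a single source.
print(unique_records(records, ["postcode", "birth_year"]))
# Adding a field from a second source.
print(unique_records(records, ["postcode", "birth_year", "licence_class"]))
# Adding a field from a third source.
print(unique_records(records, ["postcode", "birth_year", "licence_class", "admitted_2015"]))
```

A record that is unique on its combination of values is easier to tie back to a real person by someone who already knows those facts about them, which is one reason linked data is usually analysed inside secure environments.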

An example of a data linkage process – from CHeReL

The Centre for Health Record Linkage (CHeReL) helps researchers, planners and policy makers access linked health data about people in NSW and the ACT.[4] We describe the process they use to link data below. You can see how the process is constructed so that no one person or agency has access to both datasets as well as identifying information.

  1. Approval

The researcher applies to CHeReL for specific data from specific custodians to be linked. Use of linked data through CHeReL requires three approvals, from:

  • CHeReL
  • the custodians of all data collections used in a linkage study
  • a human research ethics committee

Before releasing any data, the NSW Ministry of Health requests a signed confidentiality agreement from the researchers.  Confidentiality agreement templates are available on the Ministry’s website at:

  2. Linking

Data custodians split each dataset into the information that identifies a person and the information about their condition or history. They provide only the identifiers for each person to CHeReL. Identifiers are information fields that can be matched, such as name, birthday and address, as well as government-generated information such as a Medicare number.

  3. CHeReL links the identifiers (using a master linkage key) and creates a Project Person Number (PPN) for each person in the datasets. It sends the PPNs for each person’s identifiers back to the data custodians.

  4. The data custodians attach the PPN to each person’s information in the original dataset (e.g. medical history, educational qualifications, traffic offences). The PPN replaces the identifiers. The data custodians provide the datasets, with PPNs attached, to the researcher.

  5. The researcher uses the PPN to link the datasets and create one big linked dataset. They receive no identifiers. The linked dataset is not available to CHeReL or the data custodians, only to the researcher.
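
The separation of roles described above can be sketched in a few lines of Python. This is a simplified illustration, not CHeReL's actual system: the datasets, field names, function names, the local `record_id` and the PPN format are all invented, and matching identifiers exactly on name and date of birth is an assumption made to keep the example short. Real linkage has to cope with misspellings, changed addresses and missing values.

```python
import itertools

# Hypothetical source datasets held by two custodians (all values invented).
hospital = [
    {"record_id": "H1", "name": "Ana Lee", "dob": "1980-02-01", "diagnosis": "asthma"},
    {"record_id": "H2", "name": "Bo Chan", "dob": "1975-07-12", "diagnosis": "fracture"},
]
licensing = [
    {"record_id": "L1", "name": "Ana Lee", "dob": "1980-02-01", "offences": 2},
    {"record_id": "L2", "name": "Cy Diaz", "dob": "1990-11-30", "offences": 0},
]

IDENTIFIER_FIELDS = ("name", "dob")


def split(dataset):
    """Custodian: separate identifiers from content, keeping a local record ID on both."""
    identifiers = [
        {"record_id": r["record_id"], **{f: r[f] for f in IDENTIFIER_FIELDS}}
        for r in dataset
    ]
    content = [{k: v for k, v in r.items() if k not in IDENTIFIER_FIELDS} for r in dataset]
    return identifiers, content


def assign_ppns(*identifier_sets):
    """Linkage unit: give matching identifiers the same PPN. It never sees content.

    Returns one {record_id: ppn} mapping per identifier set.
    """
    counter = itertools.count(1)
    ppn_by_person = {}
    mappings = []
    for identifiers in identifier_sets:
        mapping = {}
        for row in identifiers:
            key = tuple(row[f] for f in IDENTIFIER_FIELDS)  # exact match (assumption)
            if key not in ppn_by_person:
                ppn_by_person[key] = f"PPN{next(counter):04d}"
            mapping[row["record_id"]] = ppn_by_person[key]
        mappings.append(mapping)
    return mappings


def attach_ppn(content, ppn_map):
    """Custodian: replace the local record ID with the PPN before release."""
    return [
        {"ppn": ppn_map[r["record_id"]], **{k: v for k, v in r.items() if k != "record_id"}}
        for r in content
    ]


def link(*datasets):
    """Researcher: merge content rows that share a PPN. No identifiers are present."""
    linked = {}
    for dataset in datasets:
        for row in dataset:
            linked.setdefault(row["ppn"], {}).update(row)
    return linked


hospital_ids, hospital_content = split(hospital)
licensing_ids, licensing_content = split(licensing)
hospital_map, licensing_map = assign_ppns(hospital_ids, licensing_ids)
linked = link(
    attach_ppn(hospital_content, hospital_map),
    attach_ppn(licensing_content, licensing_map),
)
for row in linked.values():
    print(row)
```

In this sketch the linkage unit only ever handles names and dates of birth, and the researcher only ever handles PPNs and content, so neither holds identifying information and linked data at the same time.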
