What is Record Linkage? - Deakin University

Nov 21, 2013

GRHANITE Technology

A technology for ethical data acquisition in
Australia


Dr Douglas Boyle PhD, FACHI

Head, GRHANITE™ Health Informatics Unit

University of Melbourne


Next 50 minutes…

Data Linkage and the GRHANITE Technology

Background


Data linked research in Australia


The law and policy


Technological challenges

The GRHANITE Tool


Privacy-protecting record linkage


Generic interfacing


Addressing privacy


Patient re-identification


Security


Large-scale deployment and management

The GRHANITE Health Informatics Unit

What is Record Linkage?

Record linkage . . .

“[is] a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events (said to be matched).”

Fellegi IP & Sunter AB (1969) A theory for record linkage. Journal of the American Statistical Association 64, 1183-1210

Western Australia linkage enabled by legislation

Australian Institute of Health and Welfare

Centre for Health Record Linkage, NSW

Ref:
http://www.cherel.org.au/


SA-NT DataLink



Data Linkage in SA and NT

Ref:
https://www.santdatalink.org.au/application_process


Curtin University Centre for Data Linkage

Ref: James Boyd, Anna Ferrante and Sean Randall, Curtin University, IDLC Conference Proceedings 2012

Cross-Jurisdictional Linkage in Australia

Ref: Katrina Spilsbury, Curtin University and the Public Health Research Network, IDLC Conference Proceedings 2012

Victorian Data Linkages Unit

Ref:
http://www.health.vic.gov.au/vdl/index.htm


BioGrid Australia Victoria

www.biogrid.org.au


A common theme…


Research in Australia where record linkage is performed has almost exclusively involved linking secondary care or administrative datasets:


Hospital Admissions and Outcomes


Clinic Outpatients, Death


Clinic Inpatients


MBS / PBS

The Law and Health Policy

Data Protection

Privacy Legislation

Best Practice in Anonymisation


Legal Aspects of Data Ownership and Privacy in Australia

D’Arcy Holman

Ref: D’Arcy Holman, University of Western Australia, IDLC Conference Proceedings 2012


Challenges and Gaps


A need for data for research


Representative


Longitudinal


Robust


Challenges


Privacy & Confidentiality


Patient Consent


Security


Provider perceptions of risk


Scale of the task if primary care and other community providers are to be involved


No effective standards


Need to interface to many technologies


Need to deal with constant change




Some of the components that need to be addressed…

Ethics and consent prior to obtaining data

Managing the differing data platforms and means of obtaining data from them

Cleaning and packaging data that has been approved before a researcher receives it

Dealing with on-going changes to the underlying clinical databases

Dealing with the occasional replacement of servers, PCs and their software

Securing the data in transit and at the data destination

Providing the means to record-link the data within the appropriate legislation

Ensuring the data is available only to those authorised to have it

Providing the on-going backup and security services to ensure on-going success

The GRHANITE Tool

GRHANITE - Primary Care and other Large-Scale Linkage

[Diagram: three-tier architecture. Client Tier: clinical systems (CCare, Genie, Zedmed, MD2/3, BP, Lab, Hospital). Middle Tier: GRHANITE™ Web Service. Data Tier: GRHANITE™ Databank. Data is shown moving between tiers as encrypted text.]

The GRHANITE Tool

Privacy-Protecting Record Linkage

Traditional Record Linkage

[Diagram: Database 1 and Database 2 supply person identifiers to a Trusted Data Linkages Unit and de-identified clinical data to the Research Data (linked) database.]

Rec 1: John Smith DoB 27/03/1956

Rec 2: John Smyth DoB 27/03/1956

Rec3:...

Privacy-Protecting Record Linkage

[Diagram: Database 1 and Database 2 each supply de-identified clinical data including linkage keys; the output is the Research Data (linked) database.]

Rec 1: John Smith DoB 27/03/1956 -> 823y8734dhjck348hcj3847h898jx8weudj8eeamazaks^&%11==

Rec 2: John Smyth DoB 27/03/1956 -> 823y8734dhjck348hcj3847h898jx8weudj8eeamazaks^&%11==
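As a rough illustration of the idea on this slide, the sketch below links two de-identified datasets purely on equal linkage keys. The record layout and the field name `linkage_key` are assumptions made for the example, not GRHANITE's actual data format.

```python
def link_on_key(db1, db2):
    """Pair up records from two datasets that carry the same linkage key."""
    # Index the first dataset by key so each lookup is a dictionary access.
    index = {}
    for record in db1:
        index.setdefault(record["linkage_key"], []).append(record)

    pairs = []
    for record in db2:
        for match in index.get(record["linkage_key"], []):
            pairs.append((match, record))
    return pairs


# Hypothetical de-identified records: no names or dates of birth are present.
database_1 = [{"linkage_key": "k1", "admission": "2010-03-01"}]
database_2 = [
    {"linkage_key": "k1", "test_result": "negative"},
    {"linkage_key": "k2", "test_result": "positive"},
]
linked = link_on_key(database_1, database_2)
```

Only the records sharing `k1` are paired; the linking step never sees a person's identifiers.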

SHA Hash Functions

The SHA hash functions are a set of cryptographic hash functions designed by the National Security Agency (NSA) and published by the NIST as a U.S. Federal Information Processing Standard. SHA stands for Secure Hash Algorithm.


GRHANITE identifier hashing utilises the SHA-256 algorithm

No known weakness identified to date

References:

http://en.wikipedia.org/wiki/SHA1#cite_note-4

Gilbert H, Handschuh H. Security analysis of SHA-256 and sisters, Selected Areas in Cryptography 3006: 175-193, 2004

GRHANITE Identifier Hashing

Surname + Forename + Date of Birth

“Boyle” + “Douglas” + “19690118”

“BoyleDouglas19690118”

SHA-256 hash generation and additional AES encryption using a secured encryption passphrase

“u62KHSJyZQUt2QJkOUSXyFie4B5g2yVv/8kzvj4FUrsLV0EvZjsig3keoCIh3TcMcDR5/m5SOEgsl8Z/diucZQAzVaX+iBKz/mzfFfiDdCA=”
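The concatenate-then-hash step can be sketched with the Python standard library as follows. This is a minimal illustration only: the function name `linkage_key` and the Base64 output encoding are assumptions, and the additional AES encryption with a secured passphrase is deliberately left out because it needs a third-party cryptography library. The output will therefore not match the string shown on the slide.

```python
import base64
import hashlib


def linkage_key(surname: str, forename: str, dob: str) -> str:
    """Concatenate the identifiers and return a Base64-encoded SHA-256 digest."""
    combined = f"{surname}{forename}{dob}"  # e.g. "BoyleDouglas19690118"
    digest = hashlib.sha256(combined.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


key_a = linkage_key("Boyle", "Douglas", "19690118")
key_b = linkage_key("Boyle", "Douglas", "19690118")
key_c = linkage_key("Boyle", "Douglas", "19690119")
```

Identical inputs always give the same key (`key_a == key_b`), while a one-digit change in the date gives an unrelated key (`key_c`), which is why the identifiers are cleaned and standardised before hashing.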

Name Cleaning and Standardisation

"DOT", "DOROTHY"
"EVON", "YVONNE"
"GRAZIA", "GRACE"
"GRIETJE", "MARGARET"
"GWEN", "GWENDOLINE"
"GWYNETH", "GWENDOLINE"
"JENNY", "JENNIFER"
"JO", "JOANNE"
"JOE", "JOANNE"
"JUDY", "JUDITH"
"KATE", "CATHERINE"
"KATHY", "CATHERINE"
"KATHRYN", "CATHERINE"
"KYRIAKI", "KOULA"
"LILIJANA", "LILLIAN"
"LILY", "LILLIAN"
"LIBBY", "ELIZABETH"
"LIZ", "ELIZABETH"
"LUCILLE", "LUCY"
"NICK", "NICHOLAS"
"NIK", "NICHOLAS"
"NIC", "NICHOLAS"
"NICKOLAS", "NICHOLAS"
"PANAGIOTIS", "PETER"
"PANTALEONE", "LEO"
"RAY", "RAYMOND"
"ROB", "ROBERT"
"ROD", "RODNEY"
"ROGER", "RODGER"
"RON", "RONALD"
"RONNIE", "RONALD"
"PIOTR", "PETER"
"RIK", "RICHARD"
"RICK", "RICHARD"
"SPIRIDON", "SPIROS"
"SPYRIDON", "SPIROS"
"STAN", "STANLEY"
"TADEUS", "TED"
"TADEUSZ", "TED"

“RIK101” -> “RICHARD”
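A minimal sketch of how such a cleaning step might work: strip anything that is not a letter, upper-case the result, then look it up in a variant table. The table below holds only a few pairs taken from the slide, and the function name `clean_forename` is hypothetical. Stripping to A-Z is a simplification that would also drop accented letters.

```python
import re

# A small subset of the variant -> standard name pairs listed on the slide.
NAME_VARIANTS = {
    "RIK": "RICHARD",
    "RICK": "RICHARD",
    "LIZ": "ELIZABETH",
    "LIBBY": "ELIZABETH",
    "KATE": "CATHERINE",
}


def clean_forename(raw: str) -> str:
    """Remove non-letters, upper-case, then map known variants to a standard form."""
    letters = re.sub(r"[^A-Za-z]", "", raw).upper()
    return NAME_VARIANTS.get(letters, letters)


cleaned = clean_forename("RIK101")
```

Names that are not in the table pass through unchanged apart from the clean-up, so `clean_forename("Douglas")` gives `"DOUGLAS"`.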

Phonetic Encoding and Date Manipulation

Phonetic encoding can ensure a significant percentage of spelling mistakes are captured (Modified DoubleMetaphone Algorithm)

Spelling mistakes in names:

Douglas, Dooglas, DuGlass, Doouglas -> TKLS

Boyle, Boil, Boyel -> PL

Similar techniques can address date errors such as an inaccurate year of birth

Error in a year of birth:

1969-01-12 -> 19690112

1969-12-01 -> 19690112
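The slide's date example maps 1969-01-12 and 1969-12-01 to the same value. One way to get that behaviour, offered here purely as an assumption since the slides do not describe GRHANITE's actual rule, is to put the day and month into a fixed order before building the key. The function name is hypothetical.

```python
def transposition_tolerant_date_key(iso_date: str) -> str:
    """Build a date key that is unchanged when day and month are swapped."""
    year, month, day = iso_date.split("-")
    # Zero-padded two-digit strings sort the same way as their numeric values.
    low, high = sorted([month, day])
    return f"{year}{low}{high}"


key_1 = transposition_tolerant_date_key("1969-01-12")
key_2 = transposition_tolerant_date_key("1969-12-01")
```

Both calls return `"19690112"`. The price is that two genuinely different dates such as 12 January and 1 December become indistinguishable on this key alone, so a key like this would normally be one of several used for linkage.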

Analysis of the characteristics of RMH patients

(What identifiers are important in record linkage)


871,061 patient demographics records analysed


Standard data-cleaning steps based on typical data errors formulated -> standard, cleaned data


136,772 distinct surnames, 8.6% occur only once


After phonetic encoding, 12,385 distinct surnames remain


Frequency distribution of names recorded

Forenames   n        Surnames   n
John        16,389   Smith      6,281
Peter       9,593    Brown      3,288
Michael     8,458    Jones      2,927
Robert      8,200    Wilson     2,909
David       7,947    Nguyen     2,569
Maria       7,702    Taylor     2,555
William     6,802    Anderson   2,001
George      6,141    Johnson    1,812
Mary        6,015    White      1,809


Date of Birth Distribution


Date spikes and other anomalies allow frequency assessment of the likelihood that a date is correct and allow cut-offs to be set
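A simple way to picture the spike check is to count how often each date of birth occurs and flag dates that are far more common than average. The cut-off rule below (a multiple of the mean count) and the function name are assumptions chosen for illustration; the slides do not state how the cut-offs were actually set.

```python
from collections import Counter


def date_spikes(dobs, ratio_cutoff=5.0):
    """Return dates whose frequency exceeds ratio_cutoff times the mean frequency."""
    counts = Counter(dobs)
    if not counts:
        return {}
    mean_count = sum(counts.values()) / len(counts)
    return {d: n for d, n in counts.items() if n > ratio_cutoff * mean_count}


# Hypothetical data: one placeholder-style date recorded 20 times, nine dates once each.
sample = ["19000101"] * 20 + [f"1969011{i}" for i in range(1, 10)]
spikes = date_spikes(sample)
```

Here only `"19000101"` is flagged, the kind of default value that is unlikely to be a true date of birth.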






GRHANITE Hashing for Record Linkage

What is the probability of any single hash correctly identifying a patient?


Frequency distributions of each identifier were calculated

An Agreement Frequency Ratio (AFR) is generated for each identifier: AFR = ln(m/u) / ln(2)

m = probability of a match being correct (varies by name / date)

u = probability of a match being incorrect

The AFR of each component of the hash is added to give a combined AFR for the whole hash

Reference:

Blakely T, Salmond C. “Probabilistic record linkage and a method to calculate the positive predictive value”. International Journal of Epidemiology 2002;31:1246-1252
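The arithmetic on this slide can be sketched as below: each component's ratio is ln(m/u)/ln(2), which is the base-2 logarithm of m/u, and the component values are summed for the whole hash. The (m, u) pairs are invented purely for illustration and are not figures from the RMH analysis.

```python
import math


def agreement_frequency_ratio(m: float, u: float) -> float:
    """ln(m/u) / ln(2), i.e. the base-2 logarithm of m/u."""
    return math.log(m / u) / math.log(2)


def combined_afr(components) -> float:
    """Add the per-identifier ratios to score the whole hash."""
    return sum(agreement_frequency_ratio(m, u) for m, u in components)


# Hypothetical (m, u) pairs for surname, forename and date of birth.
example_components = [(0.95, 0.01), (0.90, 0.05), (0.98, 0.001)]
total = combined_afr(example_components)
```

A rarer identifier value has a smaller u and so contributes a larger ratio, which is why agreement on an uncommon surname counts for more than agreement on "Smith".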


GRHANITE™ Demonstration

[Diagram: three-tier architecture. Client Tier: clinical systems (CCare, Genie, Zedmed, MT32, BP, Practix, MD2/3). Middle Tier: GRHANITE™ Web Service. Data Tier: GRHANITE™ Databank. Data is shown moving between tiers as encrypted text.]

Language for querying

Best Practice Example

Participant Consent Management

Supporting:

Opt-in

Opt-out

Waiver of consent

Data Export Preview

Data as it leaves the practice or hospital

Typical GRHANITE Data

Patients

Consultation dates

Chlamydia test results

Typical GRHANITE Data

Data used to link patients

(NO names, dates of birth, Medicare ID or other usual identifier)

Participant Re-identification

Site monitoring via web or Smartphone

The GRHANITE Health Informatics Unit

www.grhanite.com


GRHANITE Installations - ~200 Nationally

Current HIU Project Status

GRHANITE Research Data repositories implemented at

University of Melbourne (Shepparton)

Burnet Institute (Melbourne)

UNSW x 2 (Sydney)

Deakin Uni (Melbourne)

National Prescribing Services x 3 (Sydney)

NPS MedicineInsight

POC phase currently underway

rollout to 500 GP surgeries planned for 2013-2015

ACCESS Chlamydia Surveillance project successful (13 laboratories, 27 GP clinics, 8 ACCHS, 7 Family Planning Clinics)

Every State and Territory

ACCEPt RCT

~145 sites installed, 79 sites in six months

Current HIU Project Status

ATSHIP Project

5 Aboriginal projects, UNSW

ePBRN Project

11 GP surgeries, UNSW

BioGrid Australia

GRHANITE Linkage implemented

Project completed, AIHW Death Data integrated to BioGrid research projects

Lab data: Chlamydia test data for 750,000 Chlamydia tests in 2008-2010 representing 550,000 individuals after privacy-protecting record linkage

MAGDA Gestational Diabetes Project moving to an implementation phase

This project is to perform data linkage NEVER seen in Australia before

Linkage of NDSS data to Pathology Data

Linkage of this data to DoH Perinatal Birth Records

Benefits

Organisations can choose to use GRHANITE to implement their own research or audit networks

The principal benefit is when data needs to be record linked - other tools work well for simple audit

Because the linkage keys cannot be reversed, you do not need a separate record linkage unit

risk of identity exposure greatly reduced (SOPs simplified)

Small Excel or Access databases can contribute data as well as larger, standard databases

no reliance on compliance with standards (pragmatic)

GRHANITE does not need to be re-written to add or modify data extract definitions

Once installed, different organisations can benefit from GRHANITE (with the practice's permission of course)

Install once, use many.


Dr Douglas Boyle

dboyle@unimelb.edu.au

grhanite.unimelb.edu.au

www.grhanite.com


MAGDA Lab Export

1. Lab Database: .csv export of data from lab system

2. Lab PC running GRHANITE: GRHANITE looks for the .csv files in an agreed folder, reads the files, de-identifies the data and extracts data permitted for the study

3. De-identified study data, securely transmitted (shown as encrypted text: cDOpy28aFAKyaqDdq5xo+OhmxGlOMGYNTyJ1qf+TSHZhC974lkxaixZSdTNGp5ne8UZPKF)
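The export flow on this slide (look in an agreed folder, read the .csv files, de-identify, keep only permitted fields) might be sketched as follows. Every column name, the list of permitted fields and the use of a plain SHA-256 hex digest as the linkage key are assumptions made for the example; secure transmission and GRHANITE's own key generation are not shown.

```python
import csv
import hashlib
from pathlib import Path

IDENTIFIER_FIELDS = ["surname", "forename", "dob"]  # hypothetical identifier columns
PERMITTED_FIELDS = ["test_date", "test_result"]     # hypothetical study fields


def deidentify_row(row: dict) -> dict:
    """Replace the identifier columns with one hashed key; keep permitted fields only."""
    combined = "".join(row[field] for field in IDENTIFIER_FIELDS)
    out = {"linkage_key": hashlib.sha256(combined.encode("utf-8")).hexdigest()}
    out.update({field: row[field] for field in PERMITTED_FIELDS})
    return out


def process_folder(folder: Path) -> list:
    """Read every .csv file in the agreed folder and return de-identified rows."""
    rows = []
    for csv_path in sorted(folder.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            rows.extend(deidentify_row(r) for r in csv.DictReader(handle))
    return rows
```

Calling `process_folder(Path("exports"))` on a folder of lab exports would return rows containing only `linkage_key`, `test_date` and `test_result`; any other column in the source file is dropped.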