Nov 20, 2013

Association Rule Mining in Type-2 Diabetes Risk Prediction

Gyorgy J. Simon

Dept. of Health Sciences Research

Mayo Clinic

SHARPn Summit 2012

Outline

Introduction
Modeling Diabetes Risk
Association Rule Mining
Results
    Diabetes Disease Network Reconstruction
    Diabetes Risk Prediction
Applicability to SHARP

Diabetes

In the US, 25.8 million people (8% of the population) suffer from Diabetes Mellitus
Type 2 Diabetes Mellitus (DM)
DM leads to significant medical complications
Effective preventive treatments exist
Identifying subpopulations at risk is important

Pre-Diabetes (PreDM) is a condition that precedes DM
    fasting glucose 100-125 mg/dL

Identify sets of risk factors that significantly increase the risk of developing diabetes in a pre-diabetic population
Risk factors:
    Co-morbid diseases: obesity, cardiac and vascular conditions
    Vitals, lab test results, medications, co-morbid conditions
85k Mayo Patients 1999-2004 with research consent

Design

[Figure: cohort flow diagram. Timeline labels: 1/1/1999, 12/31/2004, 7/2010; Study Period; Follow-Up. Box labels: Normal 84,708; DM 424; PreDM 23,828; Normal 44,156; DM 19,013; Normal 43,809; PreDM 21,826. Transition counts: 2,002; 347; 16,664.]

Data

Follow-up Time (FUT): time since PreDM Dx
Co-morbidities: before elevated glucose measurement
    hypertension, hyperlipidemia, obesity, various cardiac and vascular diseases

Age and follow-up time (FUT) are predictive of DM
    They are not modifiable, so we need to compensate for them

Goal is different from high-throughput phenotyping
    None of the patients have the disease
    Predict the risk that patients progress to DM

PID | Co-morbidities: OB | HTN | … | Glucose | Age | FUT | DM
001 | Y | Y | … | 110 | 55 | 1.8 | Y
002 |   |   | … | 115 | 19 | 2.5 | N
…
Computational Model

[Figure: model diagram with nodes Age, Sex, Unknown Disease Mechanism, bmi, Tobacco, hdl, HTN, statin, glucose, DM Dx]

Level 1: Unmodifiable “nuisance” factors
Level 2: Clinical factors of interest
Level 3: Glucose, the “definition” of DM

We have to adjust for level 1 factors before we can assess the effect of level 2 factors!

Goal
Find sets of clinical factors (level 2) that are associated with elevated risk of DM

Modeling Approaches

1. Logistic regression / Survival Analysis
    No ability to discover interactions
2. Decision Trees / Random Forest / Gradient-boosted Trees
    Greedy approach to discover interactions
    No ability to compensate for age and follow-up time (FUT)
3. Association Rule Mining (ARM)
    Specifically designed to discover interactions
    No ability to compensate for age and FUT

Regression Analysis + Association Rule Mining

Remove the effect of age, gender and FUT
Find association between the risk factors and the DM risk not explained by age and FUT
Simon et al. AMIA 2011

1st Phase
    Data: PID, DM, Age, FUT (e.g. 001: Y, 55, 1.8; 002: N, 19, 2.5; …)
    $E_1$: Expected number of DM incidents based on age and sex only
    $O$: Observed number of DM incidents
    1st Phase Residual: $R_1 = O - E_1$

2nd Phase
    Data: $R_1$, Co-morbidities (Obese, HTN, …; e.g. Y, Y)
    $E_2$: Expected number of DM incidents based on co-morbidities only (after adjusting for age and sex)
    2nd Phase Residual: $R_2 = O - (E_1 + E_2) = R_1 - E_2$

3rd Phase
    Data: $R_2$, Glucose (e.g. 103, 112, …)
    $E_3$: Expected number of DM incidents based on glucose (after adjusting for everything else)

Final Prediction: $E = E_1 + E_2 + E_3$
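
The residual bookkeeping of the three phases can be illustrated with a small additive sketch. This is not the authors' implementation: the slides name survival or logistic models for the first phase, whereas the sketch substitutes simple group means and a one-variable least-squares line so that the identities R1 = O - E1 and R2 = R1 - E2 are easy to follow. The cohort values, the age cut-off of 50, and all function names are hypothetical.

```python
from statistics import mean

# Hypothetical toy cohort: (age, has_itemset, glucose, developed_dm)
patients = [
    (35, False, 101, 0),
    (38, False, 104, 0),
    (41, True, 110, 1),
    (44, False, 103, 0),
    (62, True, 118, 1),
    (65, True, 121, 1),
    (68, False, 106, 0),
    (71, True, 112, 0),
]

def group_means(keys, values):
    """Mean of `values` within each distinct key."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    return {k: mean(vs) for k, vs in groups.items()}

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

observed = [p[3] for p in patients]

# Phase 1: expected outcome from the nuisance factor only (an age band here).
age_band = [p[0] >= 50 for p in patients]
band_rate = group_means(age_band, observed)
e1 = [band_rate[b] for b in age_band]
r1 = [o - e for o, e in zip(observed, e1)]        # R1 = O - E1

# Phase 2: explain the phase-1 residual with a co-morbidity itemset.
has_itemset = [p[1] for p in patients]
itemset_effect = group_means(has_itemset, r1)
e2 = [itemset_effect[h] for h in has_itemset]
r2 = [r - e for r, e in zip(r1, e2)]              # R2 = R1 - E2

# Phase 3: explain what is left with glucose.
glucose = [p[2] for p in patients]
slope, intercept = fit_line(glucose, r2)
e3 = [slope * g + intercept for g in glucose]

# Final prediction: E = E1 + E2 + E3
prediction = [a + b + c for a, b, c in zip(e1, e2, e3)]
```

Because each phase is fitted to the residual of the previous one, a later phase can only account for risk that the earlier factors did not already explain, which is the point of adjusting for age, sex and FUT first.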

Overview

Regression modeling
    Survival model or
    Logistic regression
Association Rule Mining

Association Rule Mining

Origins in sales data
Items (columns): co-morbid conditions
Transactions (rows): patients
Itemsets: sets of co-morbid conditions
Goal: find all itemsets (sets of conditions) that frequently co-occur in patients.
    One of those conditions should be DM.

Support: # of transactions the itemset I appeared in
    Support({OB, HTN, IHD}) = 3
Frequent: an itemset I is frequent if support(I) > minsup

Columns: Patient, OB, HTN, IHD, …, DM (Y marks listed in source order; blank cells were lost in extraction)
001: Y Y Y Y
002: Y Y Y Y
003: Y Y
004: Y
005: Y Y Y

X: infrequent
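
The definitions of support and frequent itemsets can be sketched in a few lines. The transactions below are a hypothetical toy dataset chosen so that Support({OB, HTN, IHD}) = 3 as in the slide; they are not the slide's table, and the function names are illustrative. The enumeration is brute force, which is fine for a handful of items; practical miners prune the search (for example Apriori-style, using the fact that every subset of a frequent itemset is frequent).

```python
from itertools import combinations

# Hypothetical toy transactions: one set of diagnosed conditions per patient.
transactions = {
    "001": {"OB", "HTN", "IHD", "DM"},
    "002": {"OB", "HTN", "IHD", "DM"},
    "003": {"OB", "HTN"},
    "004": {"HTN"},
    "005": {"OB", "HTN", "IHD"},
}

def support(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)

def frequent_itemsets(transactions, minsup):
    """Brute-force enumeration of all itemsets with support > minsup."""
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            s = support(candidate, transactions)
            if s > minsup:
                result[candidate] = s
    return result

ob_htn_ihd_support = support({"OB", "HTN", "IHD"}, transactions)
frequent = frequent_itemsets(transactions, minsup=2)
```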

Distributional Association Rule Mining

Distributional Association Rules associate an itemset with a continuous outcome.

Columns: PID, A, B, C, D, …, R (Y marks listed in source order; blank cells were lost in extraction)
01: Y Y Y Y, R = .40
02: Y Y Y, R = .38
03: Y Y Y Y, R = .39
04: Y Y Y, R = .41
05: Y Y, R = .00
06: Y Y, R = .01
07: Y, R = .02
08: Y, R = .00

[Figure: two histograms of R; x-axis: R, y-axis: Frequency]

Application to Diabetes

Find all sets I of co-morbid conditions such that the distribution of risk R is significantly different between the patient populations with and without I
Simon et al, KDD 2011a
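
A minimal sketch of the comparison behind a distributional association rule: split the patients by whether they carry the itemset I and compare the distribution of R in the two groups. The R values are taken from the example table; which patients carry I is an assumption made here for illustration. The sketch compares means with an exact permutation test, which is a stand-in chosen for simplicity and not the test used in the study.

```python
from itertools import combinations
from statistics import mean

# Residual risk R per patient (values from the example table).
residual = {"01": 0.40, "02": 0.38, "03": 0.39, "04": 0.41,
            "05": 0.00, "06": 0.01, "07": 0.02, "08": 0.00}
# Assumption for illustration: patients 01-04 carry the itemset I.
has_itemset = {"01", "02", "03", "04"}

def mean_difference(group, residual):
    """Mean R among patients in `group` minus mean R among the rest."""
    inside = [r for pid, r in residual.items() if pid in group]
    outside = [r for pid, r in residual.items() if pid not in group]
    return mean(inside) - mean(outside)

def exact_permutation_p_value(group, residual):
    """Two-sided p-value over all relabelings with the same group size."""
    observed = abs(mean_difference(group, residual))
    splits = list(combinations(sorted(residual), len(group)))
    extreme = sum(
        1 for s in splits
        if abs(mean_difference(set(s), residual)) >= observed - 1e-12
    )
    return extreme / len(splits)

diff = mean_difference(has_itemset, residual)
p_value = exact_permutation_p_value(has_itemset, residual)
```

With eight patients there are only 70 ways to choose four of them, so the test enumerates every split instead of sampling.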

Why Association Rule Mining?

Challenge | Solution
Interactions | Designed to discover associations
Missing data | Asymmetry in items: absence of an item does not mean that the risk factor was not present
Clinical question | Directly extracts sets of risk factors; allows for differences in modeling for prediction and for disease mechanism discovery
Computational Efficiency | Efficient algorithms exist

Outline

Introduction
Modeling Diabetes Risk
Association Rule Mining
Results
    Diabetes Disease Network Reconstruction
    4.5-yr DM Risk Prediction
Applicability to SHARP

Diabetes Disease Network Reconstruction

Metabolic Syndrome: DM + cardiac/vascular diseases
Use Association Rule Mining to map out the relationships between DM and other metabolic syndrome diseases
Also measure their effect on DM progression risk

Predictors: Age, sex, FUT; co-morbid disease Dx
1st Phase model is survival model
2nd Phase ARM

Results

Sup | Cases | P-value | RR | Itemset
7116 | 819 | 2.0e-7 | 1.32 | HTN
4729 | 560 | 1.7e-8 | 1.45 | OB
8612 | 964 | 2.6e-8 | 1.31 | HL
1980 | 291 | 1.9e-9 | 1.78 | HTN,OB
4171 | 534 | 1.5e-8 | 1.47 | HTN,HL
553 | 85 | 8.3e-4 | 1.86 | OB,IHD
2434 | 335 | 4.3e-9 | 1.68 | OB,HL
382 | 66 | 7.7e-4 | 2.08 | HTN,OB,IHD
1271 | 204 | 2.8e-8 | 1.93 | HTN,OB,HL
470 | 76 | 7.2e-4 | 1.93 | OB,IHD,HL
339 | 61 | 6.1e-4 | 2.15 | HTN,OB,IHD,HL

Interpretation: Patients with HTN, OB, IHD and HL have an age- and FUT-adjusted RR of DM of 2.15.

Effect of age and FUT adjustment
    The entire PreDM population has 8.04% chance of DM.
    Without age and FUT adjustment, the above population has 61/339 = 17.9%
    With age and FUT adjustment, $1 - (1 - .084)^{2.15} = 17.2\%$

Legend
OB: Obesity
HTN: Hypertension
IHD: Ischemic Heart Disease
HL: Hyperlipidemia

37 Distributional Association Rules were discovered
11 are significant (Poisson test; Bonferroni adjusted 5%)
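
The arithmetic on this slide can be checked with a short sketch. It assumes that "Bonferroni adjusted 5%" means dividing 5% by the 37 discovered rules, which the slide does not spell out, and it applies that threshold to the p-values as reported without recomputing the Poisson test. The baseline risk of .084 is the value that appears in the slide's formula; the variable and function names are illustrative.

```python
# (support, cases, p_value, relative_risk, itemset) as reported in the table
rules = [
    (7116, 819, 2.0e-7, 1.32, ("HTN",)),
    (4729, 560, 1.7e-8, 1.45, ("OB",)),
    (8612, 964, 2.6e-8, 1.31, ("HL",)),
    (1980, 291, 1.9e-9, 1.78, ("HTN", "OB")),
    (4171, 534, 1.5e-8, 1.47, ("HTN", "HL")),
    (553, 85, 8.3e-4, 1.86, ("OB", "IHD")),
    (2434, 335, 4.3e-9, 1.68, ("OB", "HL")),
    (382, 66, 7.7e-4, 2.08, ("HTN", "OB", "IHD")),
    (1271, 204, 2.8e-8, 1.93, ("HTN", "OB", "HL")),
    (470, 76, 7.2e-4, 1.93, ("OB", "IHD", "HL")),
    (339, 61, 6.1e-4, 2.15, ("HTN", "OB", "IHD", "HL")),
]

N_RULES_TESTED = 37   # rules discovered, per the slide
ALPHA = 0.05
# Assumed reading of "Bonferroni adjusted 5%": alpha divided by the rule count.
bonferroni_threshold = ALPHA / N_RULES_TESTED

significant = [r for r in rules if r[2] < bonferroni_threshold]

def adjusted_risk(baseline_risk, relative_risk):
    """Risk implied by the slide's formula 1 - (1 - baseline)^RR."""
    return 1.0 - (1.0 - baseline_risk) ** relative_risk

# Crude (unadjusted) risk in the HTN,OB,IHD,HL subpopulation: cases / support.
crude_risk = 61 / 339
# Age- and FUT-adjusted risk for the same subpopulation.
adjusted = adjusted_risk(0.084, 2.15)
```

Under that reading the threshold is about 0.00135, and every p-value in the table falls below it, which agrees with the count of 11 significant rules.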

Results

[Figure: diabetes disease network; each node shows Condition(s) and Subpop. Size (Relative Risk). Recoverable nodes: IHD, 2366 (1.16) [p-value .11]; HTN, OB, IHD, 382 (2.08); HTN, IHD, HL, 1210 (1.36) [p-value .015]]

DM Progression Risk Prediction

Predicting the probability of progression to DM within 4.5 years

Predictors: age, sex, co-morbid Dx, laboratory results and medication orders
1st Phase: spline logistic regression to adjust for age and sex
2nd Phase: ARM
3rd Phase: linear regression using glucose

Machine Learned Indices

Comparison to machine learning methods
    Gradient Boosted Trees (GBM)
        10,000 trees
    Linear Model (LM)
    Random Forest (RF)
        275-325 trees
    Association Rule Mining (ARM)
        100 rules

10-fold CV repeated 50 times
Same predictive performance but more interpretable model

[Figure: comparison plot; axis: C-statistic]
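
For readers unfamiliar with the metric in the comparison, the C-statistic is the probability that a randomly chosen patient who progressed to DM received a higher predicted risk than a randomly chosen patient who did not. A minimal sketch follows; the scores and outcomes are hypothetical and are not results from the study.

```python
def c_statistic(scores, outcomes):
    """Fraction of (case, non-case) pairs ranked correctly; ties count 0.5."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                concordant += 1.0
            elif c == n:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Hypothetical predicted risks and observed progression to DM (1 = yes).
scores = [0.05, 0.10, 0.20, 0.20, 0.35, 0.60]
outcomes = [0, 0, 0, 1, 1, 1]
c = c_statistic(scores, outcomes)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of progressors from non-progressors.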

Traditional Indices

Performance similar to San Antonio (Refit)
ARM readily provides a justification as to why the risk is high
Proposed method places the patient on a path in the diabetes network

Clinical Validation

Work in progress…

Apply the rules to both normo-glycemic and Pre-DM patients
Each point is a rule
Patterns similar for lower-risk subpopulations
For high-RR rules, risk of DM is higher for Pre-DM patients


Outline


Introduction


Modeling Diabetes Risk


Association Rule Mining


Results


Interpretability


Predictive Performance


Applicability to SHARP

High-Throughput Phenotyping (HTP)

We can use the Association Rules as an HTP algorithm
    Discover the rules with ARM
    Validate the rules with an expert clinician

High-throughput Phenotyping | DM Risk Assessment
Does the patient currently have DM? | Will the patient progress to DM in 4.5 yrs? Interventions are possible
Binary decision (DM or not) | Probability of diabetes; prob. can be dichotomized into DM/no DM
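
One way the discovered rules might be applied to an individual patient is sketched below. The rule set and the baseline risk of .084 come from the results slides; everything else is an assumption made for illustration: picking the matching rule with the highest RR, treating patients who match no rule as RR 1.0, and the 0.15 cut-off used to dichotomize the probability. None of these choices is stated in the slides.

```python
# Significant rules from the results table: itemset -> relative risk.
RULES = {
    frozenset({"HTN"}): 1.32,
    frozenset({"OB"}): 1.45,
    frozenset({"HL"}): 1.31,
    frozenset({"HTN", "OB"}): 1.78,
    frozenset({"HTN", "HL"}): 1.47,
    frozenset({"OB", "IHD"}): 1.86,
    frozenset({"OB", "HL"}): 1.68,
    frozenset({"HTN", "OB", "IHD"}): 2.08,
    frozenset({"HTN", "OB", "HL"}): 1.93,
    frozenset({"OB", "IHD", "HL"}): 1.93,
    frozenset({"HTN", "OB", "IHD", "HL"}): 2.15,
}
BASELINE_RISK = 0.084   # value used in the slides' risk formula
THRESHOLD = 0.15        # hypothetical cut-off for the binary decision

def best_rule(conditions):
    """Highest-RR rule whose itemset is contained in the patient's conditions."""
    matches = [(rr, items) for items, rr in RULES.items() if items <= conditions]
    return max(matches, key=lambda m: m[0]) if matches else None

def progression_risk(conditions):
    """Probability from the slides' formula 1 - (1 - baseline)^RR."""
    match = best_rule(conditions)
    rr = match[0] if match else 1.0   # assumption: no matching rule means RR 1.0
    return 1.0 - (1.0 - BASELINE_RISK) ** rr

def high_risk(conditions):
    """Dichotomize the probability into a binary decision."""
    return progression_risk(conditions) >= THRESHOLD

patient = {"HTN", "OB", "IHD", "HL"}
rr, matched_items = best_rule(patient)
```

The matched itemset doubles as the justification for the score, which is the interpretability advantage the slides attribute to ARM.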

Acknowledgment

Peter W. Li, PhD
Health Sciences Research, Mayo Clinic, MN

Pedro J. Caraballo, MD
Internal Medicine, Mayo Clinic, MN

M. Regina Castro, MD
Division of Endocrinology and Metabolism, Mayo Clinic, MN

Terry M. Therneau, PhD
Health Sciences Research, Mayo Clinic, MN

Vipin Kumar, PhD
Department of Computer Science, University of Minnesota

References

Vemuri P, Simon G, Kantarci K, Whitwell J, Senjem M, Przybelski S, Gunter J, Josephs K, Knopman D, Boeve B, Ferman T, Dickson D, Parisi J, Petersen R and Jack C. Antemortem differential diagnosis of dementia pathology using structural MRI: Differential-STAND. NeuroImage, 2010.

Caraballo P, Li P, Simon G. Use of Association Rule-mining to Assess Diabetes Risk in Patients with Impaired Fasting Glucose. AMIA, 2011.

Simon G, Kumar V, Li P. A Simple statistical model and association rule filtering. In Proc. ACM International Conference on Data Mining and Knowledge Discovery (KDD), 2011.

Simon G, Li P, Jack C, Vemuri P. Understanding Atrophy Trajectories in Alzheimer’s Disease Using Association Rules on MRI images. In Proc. ACM International Conference on Data Mining and Knowledge Discovery (KDD), 2011.