Mining Clinical Data


Challenges and Techniques for
Mining Clinical Data

Wesley W. Chu

Laura Yu Chen


Outline


Introduction to SmartRule association rule mining


Case I: mining pregnancy data to
discover drug exposure side effects


Case II: mining urology clinical data for
operation decision making

SmartRule Features


Generate maximal frequent itemsets (MFIs) directly from tabular data

Reduce the search space and the support-counting
time by taking advantage of column structures

User selects MFIs for rule generation

The user can select a subset of MFIs that include
certain attributes as targets in rule generation

Derive rules from the targeted MFIs (see the sketch below)

Efficient support-counting by building inverted
indices for the collection of itemsets

Hierarchically organize rules into trees and
use spreadsheets to present the rule trees
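To make the MFI-to-rule step concrete, here is a minimal sketch of deriving rules with support and confidence from one user-selected itemset. The toy transactions and item names are illustrative only and are not part of SmartRule itself.

```python
from itertools import combinations

# Toy one-hot transactions; item names are illustrative only.
transactions = [
    {"cita_3rd", "alcohol", "preterm"},
    {"cita_3rd", "preterm"},
    {"alcohol"},
    {"vitamin"},
    {"cita_3rd", "alcohol"},
]

def support(itemset):
    """Fraction of records that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from_itemset(itemset, target, min_conf=0.0):
    """Derive rules 'antecedent => target' from one (maximal) frequent itemset."""
    items = itemset - {target}
    for r in range(1, len(items) + 1):
        for antecedent in map(set, combinations(items, r)):
            sup = support(antecedent | {target})
            conf = sup / support(antecedent)
            if conf >= min_conf:
                yield sorted(antecedent), target, sup, conf

for ante, tgt, sup, conf in rules_from_itemset({"cita_3rd", "alcohol", "preterm"}, "preterm"):
    print(f"IF {ante} THEN {tgt}: sup={sup:.4f}, conf={conf:.4f}")
```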

System overview of SmartRule

[System diagram] Data flow of SmartRule: tabular data → TMaxMiner (computes MFIs from the tabular data) → MFIs → InvertCount (expands MFIs into frequent itemsets (FIs) and counts their supports) → FI supports → RuleTree (generates rules and organizes them into rule trees) → rules presented as an Excel workbook. Domain experts review the rules and supply the configuration that guides mining.

Computation Complexity


Efficient MFI mining:

Does not require superset checking

Gathers past tail information to determine
the next node to explore during the mining
process

Efficient rule generation:

Reduces the computation for support-counting
by building inverted indices (see the sketch below)
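A minimal sketch of the inverted-index idea (not the actual InvertCount code): map each item to the set of record ids that contain it, so the support of any itemset is the size of the intersection of its items' id sets.

```python
from functools import reduce

# Illustrative transactions; record ids are their positions.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

# Build the inverted index: item -> set of record ids containing it.
index = {}
for rid, t in enumerate(transactions):
    for item in t:
        index.setdefault(item, set()).add(rid)

def support_count(itemset):
    """Support of an itemset = size of the intersection of its items' id sets."""
    return len(reduce(set.intersection, (index[i] for i in itemset)))

print(support_count({"A", "C"}))   # 3
print(support_count({"A", "B"}))   # 2
```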

Scalability


Limitation: a Microsoft Excel worksheet holds at
most 65,536 rows

When the dataset exceeds the
worksheet size limit:

Partition the dataset into multiple groups of
at most the maximum worksheet size and derive
the MFIs for each partition

Then join these MFIs to generate the
association rules (see the sketch below)
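One straightforward reading of this partition-and-join step is sketched below; `mine_mfi` is a hypothetical per-partition miner, and joining is interpreted here as taking the union of per-partition MFIs and keeping only the maximal ones.

```python
MAX_ROWS = 65_536  # worksheet row limit mentioned above

def partitions(records, size=MAX_ROWS):
    """Split the dataset into worksheet-sized partitions."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def join_mfis(records, mine_mfi):
    """Mine each partition, then join the per-partition MFIs into one
    candidate collection, keeping only the maximal itemsets."""
    candidates = set()
    for part in partitions(records):
        candidates.update(map(frozenset, mine_mfi(part)))
    return [s for s in candidates
            if not any(s < other for other in candidates)]
```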

Case I:

Mining Pregnancy Data


Data set: the Danish National Birth Cohort (DNBC)

Dimension: 4,455 patients x 20 attributes

Each patient record contains:

Exposure status: drug type, timing, and sequence
of different drugs

Possible confounders: vitamin intake, smoking,
alcohol consumption, socio-economic status, and
psycho-social stress

Endpoint: preterm birth, malformations, and
prenatal complications (an illustrative record
encoding is sketched below)
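As a purely illustrative example, a record with these attribute groups could be flattened into boolean items for association mining as follows; the exact DNBC attribute coding is an assumption, not taken from the data set.

```python
def record_to_items(record):
    """Flatten one (illustrative) patient record into items for mining."""
    items = set()
    for drug, trimester in record.get("exposures", []):       # e.g. ("cita", "3rd")
        items.add(f"{drug}_{trimester}")
    for confounder, value in record.get("confounders", {}).items():
        if value:
            items.add(confounder)                              # e.g. "alcohol"
    if record.get("preterm_birth"):
        items.add("preterm")
    return items

example = {
    "exposures": [("cita", "3rd")],
    "confounders": {"alcohol": True, "smoking": False},
    "preterm_birth": True,
}
print(record_to_items(example))  # {'cita_3rd', 'alcohol', 'preterm'}
```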

Sample Pregnancy Data

Challenges


Problem: discover side effects of drug
exposure during pregnancy

E.g.: study how antidepressants and
confounders influence preterm birth of the
newborn

Difficulties in finding side effects:

Only a small number of patients suffer a side effect

Sensitivity to the drug exposure time

Exposure to sequences of multiple drugs

Derive Drug Side Effects via SmartRule (1):
low-support, low-confidence rules

Low-support or low-confidence rules can
still be significant because of their contrast
to normal pregnant women

For example:

If patients are exposed to cita (citalopram) in the 3rd trimester,
then they have preterm birth with support=0.0011,
confidence=0.1786

If patients are not exposed to cita, then they have preterm
birth with support=0.0433,
confidence=0.0444

Derive Drug Side Effects via SmartRule (2):
temporally sensitive rules

Divide the pregnancy period into time slots (e.g.
trimesters) and combine drug exposure by time:

If patients are exposed to cita in the 1st trimester and drink
alcohol, then they have preterm birth with support=0.0011
and confidence=0.132

If patients are exposed to cita in the 2nd trimester and drink
alcohol, then they have preterm birth with support=0.0011 and
confidence=0.417

If patients are exposed to cita in the 3rd trimester and drink
alcohol, then they have preterm birth with support=0.0009 and
confidence=0.364

The time slot division is flexible; the domain user can
control the granularity (see the sketch below)
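A minimal sketch of a configurable time-slot encoding: with 13-week slots the items roughly correspond to trimesters, and the user can shrink the slot size for finer granularity. The helper name and coding scheme are assumptions for illustration.

```python
def exposure_item(drug, gestation_week, weeks_per_slot=13):
    """Map a drug exposure to a time-slot item; 13-week slots ~ trimesters."""
    slot = gestation_week // weeks_per_slot + 1
    return f"{drug}_slot{slot}"

print(exposure_item("cita", 5))                       # cita_slot1 (1st trimester)
print(exposure_item("cita", 30))                      # cita_slot3 (3rd trimester)
print(exposure_item("cita", 30, weeks_per_slot=4))    # cita_slot8 (monthly slots)
```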

Rule Presentation


Hierarchically organize
rules into trees

View general rules first and
then extend to more specific
rules

Use a spreadsheet to
present the rule trees

Easy to sort, filter, or
extend the rule trees to
search for interesting
rules (a sketch follows the example below)

A part of the rule hierarchy for exposure to the antidepressant
citalopram (cita) and alcohol at different time periods of
pregnancy, with preterm birth as the outcome:

1) In general, patients have preterm birth (sup=0.0454, conf=0.0454)

2) If exposed to cita in the 1st trimester, then preterm birth (sup=0.0016, conf=0.0761)

3) If exposed to cita in the 2nd trimester, then preterm birth (sup=0.0013, conf=0.1714)

4) If exposed to cita in the 3rd trimester, then preterm birth (sup=0.0011, conf=0.1786)

5) If no exposure to cita, then preterm birth (sup=0.0433, conf=0.0444)

6) If exposed to cita in the 1st trimester and drink alcohol, then preterm birth (sup=0.0011, conf=0.132)

7) If exposed to cita in the 2nd trimester and drink alcohol, then preterm birth (sup=0.0011, conf=0.417)

8) If exposed to cita in the 3rd trimester and drink alcohol, then preterm birth (sup=0.0009, conf=0.364)
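A small sketch of how such a rule tree can be built and rendered from the rules above: a child rule extends its parent's antecedent by one item, so printing the tree goes from general rules to their specific refinements, which mirrors the spreadsheet presentation described on the previous slide. The supports and confidences are copied from the slide; item names are shorthand.

```python
# Rules from the hierarchy above: (antecedent, support, confidence), consequent = preterm birth.
rules = [
    (frozenset(), 0.0454, 0.0454),
    (frozenset({"cita_1st"}), 0.0016, 0.0761),
    (frozenset({"cita_1st", "alcohol"}), 0.0011, 0.132),
    (frozenset({"cita_2nd"}), 0.0013, 0.1714),
    (frozenset({"cita_2nd", "alcohol"}), 0.0011, 0.417),
]

def print_tree(parent, depth):
    """Children extend the parent's antecedent by exactly one item."""
    kids = [r for r in rules if parent < r[0] and len(r[0]) == len(parent) + 1]
    for ante, sup, conf in sorted(kids, key=lambda r: sorted(r[0])):
        print("  " * depth + f"IF {' and '.join(sorted(ante))} THEN preterm "
              f"(sup={sup}, conf={conf})")
        print_tree(ante, depth + 1)

_, root_sup, root_conf = rules[0]          # the most general rule
print(f"In general: preterm (sup={root_sup}, conf={root_conf})")
print_tree(frozenset(), 1)
```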

Knowledge Discovery from
Data Mining Results


Challenges:


Examining the vast number of rules
manually is too labor-intensive

Exploring knowledge (rules) without a
specific goal

Existing approach:
top-down in the rule hierarchy

Association rules are represented as general rules,
summaries, and exception rules (GSE patterns). The
GSE pattern presents the discovered rules in a
hierarchical fashion. Users can browse the hierarchy
from the top down to find interesting exception rules.

Due to the low occurrence of drug side effects,
interesting rules are exception rules and reside at
the lower levels of the hierarchy. Without user
guidance, locating the interesting exception rules
requires exploring the entire GSE hierarchy.


Reference:

B. Liu, M. Hu, and W. Hsu, "Multi-level organization and summarization of the discovered rules,"
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
Aug. 2000, Boston, USA.

B. Liu, M. Hu, and W. Hsu, "Intuitive representation of decision trees using general rules and
exceptions," Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000),
July 30 - Aug. 3, 2000, Austin, Texas, USA.



New effective bottom-up technique
to find exception rules

Derive a set of seed attributes from
high-confidence rules (see the sketch below)

For example, given the high-confidence rule:

If exposed to Anxio in the pre, in, and post time
periods and use tobacco and have symptoms of
depression, then preterm birth with
confidence = 0.6

List of seed attributes: Anxio_pre, Anxio_in,
Anxio_post, tobacco, and symptoms of
depression
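A minimal sketch of the seed-attribute extraction; the rule representation and the confidence threshold are assumptions for illustration.

```python
def seed_attributes(rules, min_conf=0.5):
    """Collect attributes appearing in high-confidence rules.
    Each rule is (antecedent_items, confidence)."""
    seeds = set()
    for antecedent, conf in rules:
        if conf >= min_conf:
            seeds.update(antecedent)
    return seeds

high_conf_rules = [
    ({"Anxio_pre", "Anxio_in", "Anxio_post", "tobacco", "depression"}, 0.6),
    ({"vitamin"}, 0.1),
]
print(seed_attributes(high_conf_rules))
# Anxio_pre, Anxio_in, Anxio_post, tobacco, depression (order may vary)
```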

Using seed attributes to explore
exception rules via the rule hierarchy

Explore more rules based on these seed
attributes in the rule hierarchies (see the sketch below)

First look for rules that represent the effect of each
single seed attribute on preterm birth

Then further explore combinations of multiple
seed attributes

[Flow diagram] High-confidence rule → seed attributes → rule hierarchies
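A sketch of how the seed attributes could then filter the rule hierarchy, first for single-seed rules and then for seed combinations. The rule tuples and their numbers below are illustrative only.

```python
def rules_with_seeds(rules, seeds, max_seeds=1):
    """Keep rules whose antecedent uses between 1 and `max_seeds` seed
    attributes.  Each rule is (antecedent, support, confidence)."""
    return [r for r in rules if 1 <= len(set(r[0]) & seeds) <= max_seeds]

seeds = {"Anxio_pre", "tobacco", "depression"}
hierarchy = [                          # illustrative rules, outcome = preterm birth
    ({"Anxio_pre"}, 0.002, 0.21),
    ({"Anxio_pre", "tobacco"}, 0.001, 0.45),
    ({"alcohol"}, 0.010, 0.05),
]
print(rules_with_seeds(hierarchy, seeds))                  # single-seed rules first
print(rules_with_seeds(hierarchy, seeds, max_seeds=2))     # then seed combinations
```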

New Findings from Data
Mining



Finding: combined exposure to
citalopram and alcohol during
pregnancy is associated with
an increased risk of preterm
birth

Not initially discovered by the
epidemiological study due to the
large number of combinations
among all the attributes and
their values

(See the citalopram/alcohol rule hierarchy above, rules 1-8.)

Statistical Analysis vs. Data Mining

Statistical analysis

Infeasible to test all
potential hypotheses
for a large number of
attributes

Testing hypotheses
with a small sample
size has limited
statistical power

Data mining

No hypothesis needed; mine
associations in a large
dataset with multiple
temporal attributes

Can generate association
rules independent of the
sample size

Derives rules with temporal
information of drug
exposure

Case II:

Mining Urology Clinical Data


Data set: urology surgeries performed
from 1995 to 2002 at the UCLA
Pediatric Urology Clinic

Dimension: 130 patients x 28 attributes

[Figure] Bladder body & bladder neck

Training Data Attributes


Each patient record contains:

Pre-operative conditions:

Demographic data: age, gender, etc.

Patient ambulatory status (A)

Catheterizing skills (CS)

Amount of creatinine in the blood (SerumCrPre)

Leak point pressure (LPP)

Urodynamics, such as the minimum volume of saline infused into the
bladder when its pressure reaches 20 cm of water (20%min)

Type of surgery performed:

Op-1: Bladder Neck Reconstruction with Augmentation

Op-2: Bladder Neck Reconstruction without Augmentation

Op-3: Bladder Neck Closure without Augmentation

Op-4: Bladder Neck Closure with Augmentation

Post-op complications: infection, complication, etc.

Final outcome of the surgery: urine continence (wet or dry)

Sample of Urology Clinical Data

Goals and Challenges


Goal:

Derive a set of rules from the clinical data set
(training set) that summarize the outcome based
on patients' pre-op data

Predict the operation outcome based on a given
patient's pre-op data (test set), and recommend
the best operation to perform

Challenge:

Small sample size, large number of attributes

Continuous-value attributes such as urodynamics
measurements

Data Mining Steps


1. Separate the patients into four groups based on
the type of surgery performed

2. In each group, partition the continuous-value
attributes into discrete intervals or cells. Since the
sample size is very small, we use a hybrid
technique to determine the optimal number of cells
and cell sizes.

3. Generate association rules for each patient group
based on the partitioned continuous-value attributes

4. For a given patient with a specific set of pre-op
conditions, use the rules generated from the training set
to predict the success or failure rate of a
specific operation (a sketch of steps 1-3 follows)
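A sketch of how steps 1-3 fit together; `discretize` and `mine_rules` are hypothetical helpers standing in for the hybrid partitioning and the rule generation described on the following slides.

```python
def mine_by_operation(records, discretize, mine_rules):
    """Steps 1-3 of the procedure above, with hypothetical helpers."""
    groups = {}
    for rec in records:                        # step 1: group by surgery type
        groups.setdefault(rec["operation"], []).append(rec)
    rules_by_op = {}
    for op, group in groups.items():
        cells = discretize(group)              # step 2: hybrid partitioning
        rules_by_op[op] = mine_rules(cells)    # step 3: rule generation
    return rules_by_op                         # step 4 uses these rules to predict
```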

Partitioning Continuous-Value
Attributes

Current approaches to partitioning a continuous attribute:

Using domain expert guidance can be biased and
inconsistent

Statistical clustering techniques fail when the training set
size is small and the number of attributes is large

New hybrid approach:

Use a data mining technique to select a small set of key
attributes

Use a statistical classification technique to perform the
optimal partition (determine the cell sizes and the number of
cells) over the small set of key attributes

Hybrid Clustering Technique


Select a small key attribute set (via data mining):

Use the domain expert partition to perform mining on the
training set

Select a set of key attributes that contribute to high-confidence,
high-support rules

Optimal partition (via statistical classification):

Use statistical classification techniques (e.g. CART) to
determine the optimal number of cells and their
corresponding cell sizes for the attributes (see the sketch below)

Mining optimally partitioned attribute data yields better quality
rules
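A minimal sketch of the CART-style partitioning step using scikit-learn's decision tree (the tooling is an assumption; the slides only name CART): fit a shallow tree on one key attribute against the outcome and read its split thresholds off as cell boundaries. The sample values are illustrative, not the UCLA data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cart_cells(values, outcomes, max_cells=4):
    """Return split thresholds (cell boundaries) for one continuous attribute."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    tree = DecisionTreeClassifier(max_leaf_nodes=max_cells).fit(X, outcomes)
    t = tree.tree_
    return sorted(t.threshold[t.feature >= 0])   # thresholds at internal nodes

lpp = [5, 12, 18, 22, 30, 35, 38, 40, 55, 60]                      # illustrative
outcome = ["dry", "dry", "dry", "wet", "wet",
           "dry", "dry", "wet", "wet", "wet"]
print(cart_cells(lpp, outcome))   # e.g. two thresholds -> three cells
```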


Partition of continuous variables for the operations

Partition of the continuous variables into the optimal number of
discrete intervals (cells) and cell sizes for the four types of
operations.

Operation Type 1
Cell# | LPP        | SerumCrPre
1     | [0, 19]    | [0, 0.75]
2     | (19, 33.5] | [0.75, 2.2]
3     | (33.5, 40] | n/a
4     | normal     | n/a

Operation Type 4
Cell# | LPP      | 20%mean
1     | [0, 19]  | [0, 33.37]
2     | (19, 69] | (33.37, 37.5]
3     | normal   | (37.5, 52]
4     | n/a      | (52, 110]

Operation Type 2
Cell# | 20%min     | 20%mean    | 30%min     | 30%mean    | LPP      | SerumCrPre
1     | [80, 118]  | [50, 77]   | [100, 170] | [51, 51]   | [12, 20] | [0, 0.5]
2     | [145, 178] | [88, 104]  | [206, 241] | [94, 113]  | [24, 36] | [0.7, 1.4]
3     | [221, 264] | [135, 135] | n/a        | [135, 135] | normal   | n/a

Operation Type 3
Cell# | 20%min     | 20%mean   | 30%min     | 30%mean    | LPP      | SerumCrPre
1     | [103, 130] | [57, 75]  | [129, 157] | [86, 93]   | [6, 29]  | [0.3, 0.7]
2     | [156, 225] | [92, 105] | [188, 223] | [100, 121] | [30, 40] | [1.0, 1.5]

Recommending an operation based on
rules derived from the training set

Transform the patient's pre-op data for the continuous-value
attributes using the optimal partitions for each
operation

Find the set of rules (from the training set) that matches
the patient's pre-op data

Compare the matched rules from each operation and
recommend the type of surgery that provides the best
match (see the sketch below)
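A sketch of the matching and recommendation step; the rule and patient data structures are assumptions chosen to mirror the tables in the following example, where the operation matching the most pre-op attributes is preferred.

```python
def matching_rules(rules_by_op, patient_cells):
    """For each operation, keep the rules whose conditions are all satisfied
    by the patient's discretized pre-op data."""
    matched = {}
    for op, rules in rules_by_op.items():
        cells = patient_cells.get(op, {})
        matched[op] = [r for r in rules
                       if all(cells.get(a) == v for a, v in r["conditions"].items())]
    return matched

def recommend(matched):
    """Prefer the operation whose best matching 'Success' rule covers the most
    pre-op attributes, breaking ties by confidence."""
    best = {}
    for op, rules in matched.items():
        wins = [r for r in rules if r["outcome"] == "Success"]
        if wins:
            best[op] = max(wins, key=lambda r: (len(r["conditions"]), r["confidence"]))
    return max(best, key=lambda op: (len(best[op]["conditions"]),
                                     best[op]["confidence"]))
```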


Example: Prediction for Matt

Patient Matt's pre-operative conditions:

A (Ambulatory Status) | CS (Cath Skills) | SerumCrPre | 20%min | 20%mean(M) | 30%min | 30%mean | LPP | UPP
4                     | 1                | 0.5        | 31     | 20         | 50     | 33      | 27  | unknown

Discretized pre-operative conditions of patient Matt
(attributes not used in rule generation are denoted n/a):

      | A | CS | SerumCrPre | 20%min | 20%mean(M) | 30%min | 30%mean | LPP
Op-1  | 4 | 1  | 1          | n/a    | n/a        | n/a    | n/a     | 2
Op-2  | 4 | 1  | 1          | <1     | <1         | <1     | <1      | 2
Op-3  | 4 | 1  | 1          | <1     | <1         | <1     | <1      | 1
Op-4  | 4 | 1  | n/a        | n/a    | 1          | n/a    | n/a     | 2

Surgery

Conditions

Outcome

Support

Support(%)

Confidence

Op
-
1

CS=1

Success

10

41.67

0.77

CS=1 and LPP=2

Success

3

12.5

0.75

Op
-
2

CS=1 and LPP=2

Fail

2

16.67

0.67

20%min=1 and LPP=2

Fail

2

16.67

0.67

Op
-
3

CS=1 and SerumCrPre=1

Success

5

50

0.83

CS=1, SerumCrPre=1 and LPP=1

Success

2

20

1

Op
-
4

A=4

Success

14

32.55

0.78

A=4 and CS=1

Success

11

25.58

0.79

A=4, CS=1 and LPP=2

Success

8

18.6

0.8

A=4, CS=1 and M=1

Success

6

13.95

1

A=4, CS=1, M=1 and LPP=2

Success

6

13.95

1

Based on the rule tree, we note that Operations 3 and 4 both match patient
Matt’s pre
-
op conditions. However, Operation 4 matches more attributes in
Matt’s pre
-
op conditions than Operation 3. Thus, Operation 4 is more
desirable for patient Matt.

Representing rules in a
hierarchical structure

Rule tree for Op-4 (represented as a spreadsheet):

A=4 → Success (sup=32.55%, conf=0.78)
    A=4, CS=1 → Success (sup=25.58%, conf=0.79)
        A=4, CS=1, LPP=2 → Success (sup=18.6%, conf=0.8)
        A=4, CS=1, M=1 → Success (sup=13.95%, conf=1)
            A=4, CS=1, M=1, LPP=2 → Success (sup=13.95%, conf=1)

Favorable user feedback on the spreadsheet
interface because of its ease of rule searching and
sorting

Lessons learned from mining data
with a small sample size

For small sample sizes, hybrid clustering yields
better results than conventional unsupervised
clustering techniques

Hybrid clustering enables us to generate
useful rules for small sample sizes, which
could not be done using data mining or
statistical classification methods alone



Conclusion


Mining pregnancy data:

Discover drug exposure side effects (associations)

Advantages over traditional statistical approaches:

Independent of hypotheses

Independent of the sample size

Derive rules with temporal information

Use the seed attribute approach to effectively discover exception
rules via the rule hierarchy

Mining urology clinical data:

Derive association rules based on patients' pre-op conditions and
their operation outcomes for the different types of operations

Use a hybrid clustering technique to derive the optimal partition for
continuous-value attributes. This technique is critical for deriving
high-quality rules for small sample sizes with large numbers of
attributes



Reference


Qinghua Zou, Yu Chen, Wesley W. Chu and Xinchun Lu. Mining association rules
from tabular data guided by maximal frequent itemset. Book chapter in
"Foundations and Advances in Data Mining", edited by Wesley W. Chu and T.Y.
Lin, Springer, 2005.

Yu Chen, Lars Henning Pedersen, Wesley W. Chu and Jorn Olsen. "Drug
Exposure Side Effects from Mining Pregnancy Data." In SIGKDD Explorations
(Volume 9, Issue 1), June 2007, Special Issue on Data Mining for Health
Informatics, Guest Editors: Raymond Ng and Jian Pei.

Q. Zou, W.W. Chu, and B. Lu. SmartMiner: A depth-first search algorithm guided
by tail information for mining maximal frequent itemsets. In Proc. of the IEEE
Intl. Conf. on Data Mining, 2002.

R. Agrawal and R. Srikant: Fast algorithms for mining association rules. In
Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.

D. Burdick, M. Calimlim, and J. Gehrke: MAFIA: a maximal frequent itemset
algorithm for transactional databases. In Intl. Conf. on Data Engineering, Apr.
2001.

K. Gouda and M.J. Zaki: Efficiently Mining Maximal Frequent Itemsets. Proc. of
the IEEE Int. Conference on Data Mining, San Jose, 2001.
Reference


B. Liu, M. Hu, and W. Hsu, "Multi-level organization and summarization of the
discovered rules," Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, Aug. 2000, Boston, USA.

B. Liu, M. Hu, and W. Hsu, "Intuitive representation of decision trees using
general rules and exceptions," Proceedings of the Seventeenth National Conference
on Artificial Intelligence (AAAI-2000), July 30 - Aug. 3, 2000, Austin, Texas, USA.

Frequent Itemset Mining Implementations Repository,
http://fimi.cs.helsinki.fi/

http://www.ics.uci.edu/~mlearn/MLRepository.html