1
CSE
300
Data Mining and Its Application and Usage in Medicine
By Radhika
Data Mining and Medicine
History
Past 20 years with relational databases
More dimensions to database queries
earliest and most successful area of data mining
Mid-1800s: London hit by infectious disease
Two theories
– Miasma theory: bad air propagated disease
– Germ theory: water-borne
Advantages
– Discover trends even when we don’t understand the reasons
– Discover irrelevant patterns that confuse rather than enlighten
– Protection against unaided human inference of patterns: provide quantifiable measures and aid human judgment
Data Mining
Patterns persistent and meaningful
Knowledge Discovery of Data
The future of data mining
10 biggest killers in the US
Data mining = Process of discovery of interesting,
meaningful and actionable patterns hidden in large
amounts of data
Major Issues in Medical Data Mining
Heterogeneity of medical data
Volume and complexity
Physician’s interpretation
Poor mathematical categorization
Canonical Form
Solution: Standard vocabularies, interfaces
between different sources of data integrations,
design of electronic patient records
Ethical, Legal and Social Issues
Data Ownership
Lawsuits
Privacy and Security of Human Data
Expected benefits
Administrative Issues
Why Data Preprocessing?
Patient records consist of clinical, lab parameters,
results of particular investigations, specific to tasks
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names
Temporal: chronic disease parameters
No quality data, no quality mining results!
Data warehouse needs consistent integration of quality data
In the medical domain, handling incomplete, inconsistent or noisy data requires people with domain knowledge
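The three data-quality problems named on this slide can be illustrated with a small check. This is only a sketch: the records, the field names, the `SEX_CODES` mapping and the plausible BMI range are all made up for illustration, and real rules would come from people with domain knowledge.

```python
# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"id": 1, "sex": "F", "glucose": 148, "bmi": 33.6},
    {"id": 2, "sex": "female", "glucose": None, "bmi": 26.6},  # missing value, non-standard code
    {"id": 3, "sex": "F", "glucose": 137, "bmi": 0.0},         # a BMI of 0 is not possible
]

# Assumed canonical coding for the "sex" field.
SEX_CODES = {"F": "F", "female": "F", "M": "M", "male": "M"}

def check(record):
    """Return the list of data-quality problems found in one record."""
    problems = []
    if any(value is None for value in record.values()):
        problems.append("incomplete")
    bmi = record.get("bmi")
    if bmi is not None and not (10 <= bmi <= 80):  # assumed plausible range
        problems.append("noisy")
    if record.get("sex") != SEX_CODES.get(record.get("sex")):
        problems.append("inconsistent")
    return problems

report = {r["id"]: check(r) for r in records}
clean = [r for r in records if not report[r["id"]]]
print(report)
```

Only records with an empty problem list are kept in `clean`; whether to drop, repair or impute the others is a separate decision.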
What is Data Mining? The KDD Process
[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model that views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
W. H. Inmon: “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”
Data Warehouse vs. Heterogeneous DBMS
Data warehouse: update-driven, high performance
Information from heterogeneous sources is
integrated in advance and stored in warehouses for
direct query and analysis
Do not contain most current information
Query processing does not interfere with
processing at local sources
Store and integrate historical information
Support complex multidimensional queries
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
Different functions and different data:
Missing data: Decision support requires historical
data which operational DBs do not typically maintain
Data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
Data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
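Roll-up and slice can be sketched on a tiny in-memory fact table. The rows, the dimension names in `DIMS` and the helper names `roll_up` and `slice_` are hypothetical; a real OLAP server would run these operations against the warehouse, not a Python list.

```python
from collections import defaultdict

# Hypothetical fact table rows: (item, quarter, city, dollars_sold).
facts = [
    ("aspirin", "Q1", "Boston", 100),
    ("aspirin", "Q2", "Boston", 150),
    ("insulin", "Q1", "Boston", 300),
    ("insulin", "Q1", "Hartford", 200),
]
DIMS = ("item", "quarter", "city")

def roll_up(rows, keep):
    """Roll up by dimension reduction: sum the measure over all dimensions not in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_(rows, dim, value):
    """Slice: select the sub-cube where one dimension is fixed to a value."""
    i = DIMS.index(dim)
    return [row for row in rows if row[i] == value]

by_item = roll_up(facts, keep=("item",))
q1_by_city = roll_up(slice_(facts, "quarter", "Q1"), keep=("city",))
print(by_item)
print(q1_by_city)
```

Drill-down is the reverse direction: calling `roll_up` with more dimensions in `keep` returns the finer-grained summary.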
Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture. Data sources (operational DBs, other sources) are extracted, transformed, loaded and refreshed through a monitor & integrator with metadata into data storage (data warehouse, data marts); an OLAP server (OLAP engine) serves front-end tools for analysis, query, reports and data mining]
Steps of a KDD Process
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining
summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns,
etc.
Use of discovered knowledge
Common Techniques in Data Mining
Predictive Data Mining
Most important
Classification: Relate one set of variables in data to
response variables
Regression: estimate some continuous value
Descriptive Data Mining
Clustering: Discovering groups of similar instances
Association rule extraction
Variables/Observations
Summarization of group descriptions
Leukemia
Different types of cells look very similar
Given a number of samples (patients)
can we diagnose the disease accurately?
Predict the outcome of treatment?
Recommend best treatment based on previous treatments?
Solution: Data mining on microarray data
38 training patients, 34 testing patients, ~7000 patient attributes
2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
Clustering/Instance Based Learning
Uses specific instances to perform classification rather than general IF THEN rules
Nearest Neighbor classifier
Most studied algorithms for medical purposes
Clustering – Partitioning a data set into several groups (clusters) such that
Homogeneity: Objects belonging to the same cluster are similar to each other
Separation: Objects belonging to different clusters are dissimilar to each other
Three elements
The set of objects
The set of attributes
Distance measure
Measure the Dissimilarity of Objects
Find best matching instance
Distance function
Measure the dissimilarity between a pair of
data objects
Things to consider
Usually very different for interval-scaled, boolean, nominal, ordinal and ratio-scaled variables
Weights should be associated with different variables based on applications and data semantics
Quality of a clustering result depends on both the distance measure adopted and its implementation
Minkowski Distance
Minkowski distance: a generalization
$$d(i,j) = \sqrt[q]{|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q} \quad (q > 0)$$
If q = 2, d is Euclidean distance
If q = 1, d is Manhattan distance
Example: X_i = (1, 7), X_j = (7, 1)
q = 1: d = 6 + 6 = 12
q = 2: d = 8.48
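The slide's example can be checked with a short function. This is a minimal sketch of the definition, with the function name `minkowski` chosen here for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q > 0 between two equal-length points."""
    if q <= 0:
        raise ValueError("q must be positive")
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
manhattan = minkowski(xi, xj, 1)  # q = 1
euclidean = minkowski(xi, xj, 2)  # q = 2
print(manhattan, round(euclidean, 2))
```

For these two points the Manhattan distance is 12 and the Euclidean distance is the square root of 72, about 8.48, matching the values on the slide.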
Binary Variables
A contingency table for binary data

                  Object j
                  1      0      sum
Object i    1     a      b      a+b
            0     c      d      c+d
          sum     a+c    b+d    p

Simple matching coefficient
$$d(i,j) = \frac{b+c}{a+b+c+d}$$
Dissimilarity between Binary Variables
Example

            A1   A2   A3   A4   A5   A6   A7
Object 1    1    0    1    1    1    0    0
Object 2    1    1    1    0    0    0    1

                  Object 2
                  1      0      sum
Object 1    1     2      2      4
            0     2      1      3
          sum     4      3      7

$$d(O_1, O_2) = \frac{2+2}{2+2+2+1} = \frac{4}{7}$$
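The same computation can be written out in a few lines. This is a sketch under the slide's definitions; the helper names `contingency` and `simple_matching` are chosen here for illustration.

```python
from fractions import Fraction

def contingency(x, y):
    """Counts (a, b, c, d) of the (1,1), (1,0), (0,1), (0,0) attribute pairs."""
    a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    d = sum(1 for u, v in zip(x, y) if (u, v) == (0, 0))
    return a, b, c, d

def simple_matching(x, y):
    """Dissimilarity (b + c) / (a + b + c + d) between two binary vectors."""
    a, b, c, d = contingency(x, y)
    return Fraction(b + c, a + b + c + d)

object_1 = [1, 0, 1, 1, 1, 0, 0]  # attributes A1..A7
object_2 = [1, 1, 1, 0, 0, 0, 1]
print(contingency(object_1, object_2), simple_matching(object_1, object_2))
```

Using `Fraction` keeps the result exact, so the output can be compared directly with the 4/7 worked out by hand.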
K-means algorithm
Initialization
Arbitrarily choose k objects as the initial cluster centers (centroids)
Iteration until no change
For each object O_i
Calculate the distances between O_i and the k centroids
(Re)assign O_i to the cluster whose centroid is the closest to O_i
Update the cluster centroids based on current assignment
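The initialization and iteration steps above translate almost line for line into code. This is a minimal sketch, not a tuned implementation: it assumes Euclidean distance, takes the first k objects as the arbitrary initial centroids, and uses made-up 2-D points.

```python
import math

def kmeans(points, k, max_iter=100):
    """Centroid-based clustering of equal-length numeric tuples."""
    centroids = [points[i] for i in range(k)]  # assumption: first k objects as initial centers
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster whose centroid is closest.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no change: stop iterating
            break
        assignment = new_assignment
        # Update each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical points forming two visible groups.
points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

The result depends on the initial centroids, which is why the slide calls the choice arbitrary; in practice the algorithm is often run from several starting points.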
k-Means Clustering Method
[Figure: four scatter plots (both axes 0 to 10) showing k-means iterations: current clusters, cluster means, objects relocated, new clusters]
Dataset
Data set from UCI repository
http://kdd.ics.uci.edu/
768 female Pima Indians evaluated for diabetes
After data cleaning 392 data entries
Hierarchical Clustering
Groups observations based on dissimilarity
Compacts database into “labels” that represent the
observations
Measure of similarity/Dissimilarity
Euclidean Distance
Manhattan Distance
Types of Clustering
Single Link
Average Link
Complete Link
Hierarchical Clustering: Comparison
[Figure: clusterings of six points under single-link, complete-link, average-link and centroid distance]
Compare Dendrograms
[Figure: dendrograms for single-link, complete-link, average-link and centroid distance; three show leaf order 1 2 5 3 6 4 and one shows 2 5 3 6 4 1]
Which Distance Measure is Better?
Each method has both advantages and disadvantages; application-dependent
Single-link
Can find irregular-shaped clusters
Sensitive to outliers
Complete-link, Average-link, and Centroid distance
Robust to outliers
Tend to break large clusters
Prefer spherical clusters
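The four cluster-to-cluster measures compared here differ only in how they summarize the pairwise distances. The sketch below assumes Euclidean distance between points; the function name `linkage_distance` and the two sample clusters are hypothetical.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method):
    """Dissimilarity between two clusters of points under a linkage rule."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":    # closest pair of observations
        return min(pairwise)
    if method == "complete":  # farthest pair of observations
        return max(pairwise)
    if method == "average":   # mean over all pairs
        return sum(pairwise) / len(pairwise)
    if method == "centroid":  # distance between the cluster means
        mean_a = tuple(sum(col) / len(cluster_a) for col in zip(*cluster_a))
        mean_b = tuple(sum(col) / len(cluster_b) for col in zip(*cluster_b))
        return math.dist(mean_a, mean_b)
    raise ValueError(f"unknown method: {method}")

a = [(0, 0), (0, 2)]
b = [(3, 0), (5, 0)]
results = {m: linkage_distance(a, b, m) for m in ("single", "complete", "average", "centroid")}
print(results)
```

Single-link always gives the smallest of the pairwise values and complete-link the largest, which is one way to see why single-link chains through outliers while complete-link favors compact clusters.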
Dendrogram from dataset
Minimum spanning tree through the observations
The single observation that is last to join the cluster is a patient whose blood pressure is in the bottom quartile, skin thickness is in the bottom quartile and BMI is in the bottom half
Her insulin, however, was the largest, and she is a 59-year-old diabetic
Dendrogram from dataset
Maximum dissimilarity between observations in one
cluster when compared to another
Dendrogram from dataset
Average dissimilarity between observations in one
cluster when compared to another
Supervised versus Unsupervised Learning
Supervised learning (classification)
Supervision: Training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on training set
Unsupervised learning (clustering)
Class labels of training data are unknown
Given a set of measurements, observations, etc.,
need to establish existence of classes or clusters in
data
Classification and Prediction
Derive models that can use patient-specific information, aid clinical decision making
A priori decision on predictors and variables to predict
No method to find predictors that are not present in the data
Numeric Response
Least Squares Regression
Categorical Response
Classification trees
Neural Networks
Support Vector Machine
Decision models
Prognosis, Diagnosis and treatment planning
Embed in clinical information systems
Least Squares Regression
Find a linear function of predictor variables that minimizes the sum of squared differences from the response
Supervised learning technique
Predict insulin in our dataset from glucose and BMI
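For a single predictor, the least-squares line has a closed form. The sketch below simplifies the slide's setting to one predictor (glucose only, no BMI) and uses made-up numbers, not values from the Pima dataset.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical values chosen to lie exactly on insulin = 2 * glucose - 130.
glucose = [90, 110, 130, 150]
insulin = [50, 90, 130, 170]

slope, intercept = fit_line(glucose, insulin)
predicted = intercept + slope * 120  # prediction for a new glucose value
print(slope, intercept, predicted)
```

With two or more predictors the same idea applies, but the coefficients come from solving the normal equations rather than from this one-line formula.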
Decision Trees
Decision tree
Each internal node tests an attribute
Each branch corresponds to attribute value
Each leaf node assigns a classification
ID3 algorithm
Based on training objects with known class labels to
classify testing objects
Rank attributes with information gain measure
Minimal height
least number of tests to classify an object
Used in commercial tools eg: Clementine
ASSISTANT
Deal with medical datasets
Incomplete data
Discretize continuous variables
Prune unreliable parts of tree
Classify data
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Attributes are categorical (if continuous-valued, they are discretized in advance)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all training examples are at the root
Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain)
Examples are partitioned recursively based on selected attributes
Training Dataset
ID     Age      BMI      Hereditary   Vision      Risk of Condition X
P1     <=30     high     no           fair        no
P2     <=30     high     no           excellent   no
P3     >40      high     no           fair        yes
P4     31…40    medium   no           fair        yes
P5     31…40    low      yes          fair        yes
P6     31…40    low      yes          excellent   no
P7     >40      low      yes          excellent   yes
P8     <=30     medium   no           fair        no
P9     <=30     low      yes          fair        yes
P10    31…40    medium   yes          fair        yes
P11    <=30     medium   yes          excellent   yes
P12    >40      medium   no           excellent   yes
P13    >40      high     yes          fair        yes
P14    31…40    medium   no           excellent   no
Construction of a Decision Tree for “Condition X”
[P1,…P14] Yes: 9, No: 5
Age?
  <=30: [P1,P2,P8,P9,P11] Yes: 2, No: 3 → History (Hereditary)?
    no: [P1,P2,P8] Yes: 0, No: 3 → NO
    yes: [P9,P11] Yes: 2, No: 0 → YES
  31…40: [P4,P5,P6,P10,P14] Yes: 3, No: 2 → Vision?
    fair: [P4,P5,P10] Yes: 3, No: 0 → YES
    excellent: [P6,P14] Yes: 0, No: 2 → NO
  >40: [P3,P7,P12,P13] Yes: 4, No: 0 → YES
Entropy and Information Gain
S contains s_i tuples of class C_i for i = 1, ..., m
Information measures info required to classify any arbitrary tuple
$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
Entropy of attribute A with values {a_1, a_2, …, a_v}
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$
Information gained by branching on attribute A
$$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$
Entropy and Information Gain
Select attribute with the highest information gain (or greatest entropy reduction)
Such attribute minimizes information needed to classify samples
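The selection rule can be checked against the 14-patient training dataset from the earlier slide. This is a sketch: the rows are copied from that table (with the age range written as "31..40"), and the names `info` and `gain` are chosen here to mirror I(...) and Gain(A).

```python
import math
from collections import Counter

# Rows P1..P14 from the training table: (Age, BMI, Hereditary, Vision, Risk).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    (">40", "high", "no", "fair", "yes"),
    ("31..40", "medium", "no", "fair", "yes"),
    ("31..40", "low", "yes", "fair", "yes"),
    ("31..40", "low", "yes", "excellent", "no"),
    (">40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    ("31..40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    (">40", "medium", "no", "excellent", "yes"),
    (">40", "high", "yes", "fair", "yes"),
    ("31..40", "medium", "no", "excellent", "no"),
]
ATTRS = ("Age", "BMI", "Hereditary", "Vision")

def info(labels):
    """I(s_1, ..., s_m): expected bits needed to classify a tuple."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, attr_index):
    """Gain(A) = I(s_1, ..., s_m) - E(A) for the attribute in column attr_index."""
    labels = [r[-1] for r in rows]
    expected = 0.0  # E(A): weighted information of the partitions induced by A
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        expected += len(subset) / len(rows) * info(subset)
    return info(labels) - expected

gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, {name: round(g, 3) for name, g in gains.items()})
```

Age comes out with the highest gain, which is consistent with Age being the root test in the tree constructed for “Condition X”.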
Rule Induction
IF conditions THEN Conclusion
E.g., CN2
Concept description:
Characterization: provides a concise and succinct summarization of a given collection of data
Comparison: provides descriptions comparing two or more collections of data
Training set, testing set
Imprecise
Predictive Accuracy
P/(P+N)
Example used in a Clinic
Hip arthroplasty: trauma surgeon predicts patient’s long-term clinical status after surgery
Outcome evaluated during follow-ups for 2 years
2 modeling techniques
Naïve Bayesian classifier
Decision trees
Bayesian classifier
P(outcome=good) = 0.55 (11/20 good)
Probability gets updated as more attributes are considered
P(timing=good | outcome=good) = 9/11 (0.818)
P(outcome=bad) = 9/20
P(timing=good | outcome=bad) = 5/9
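How the probability "gets updated" can be shown with the four numbers on this slide. The sketch applies Bayes' rule for one observed attribute (timing = good); the function name `posterior` is chosen here for illustration, and further attributes would be multiplied in the same way under the naïve independence assumption.

```python
from fractions import Fraction

# Priors and conditional probabilities as given on the slide.
prior = {"good": Fraction(11, 20), "bad": Fraction(9, 20)}
likelihood_timing_good = {"good": Fraction(9, 11), "bad": Fraction(5, 9)}

def posterior(prior, likelihood):
    """Bayes update: P(class | evidence) is proportional to P(evidence | class) * P(class)."""
    joint = {c: prior[c] * likelihood[c] for c in prior}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

post = posterior(prior, likelihood_timing_good)
print(post["good"], float(post["good"]))
```

Observing good timing raises the probability of a good outcome from 11/20 to 9/14, about 0.64.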
Nomogram
Bayesian Classification
Bayesian classifier vs. decision tree
Decision tree: predict the class label
Bayesian classifier: statistical classifier; predict class membership probabilities
Based on Bayes theorem; estimate posterior probability
Naïve Bayesian classifier:
Simple classifier that assumes attribute independence
High speed when applied to large databases
Comparable in performance to decision trees
Bayes Theorem
Let X be a data sample whose class label is unknown
Let H_i be the hypothesis that X belongs to a particular class C_i
P(H_i) is the class prior probability that X belongs to a particular class C_i
Can be estimated by n_i/n from training data samples
n is the total number of training data samples
n_i is the number of training data samples of class C_i
Formula of Bayes Theorem
$$P(H_i \mid X) = \frac{P(X \mid H_i)\, P(H_i)}{P(X)}$$
More classification Techniques
Neural Networks
Similar to pattern recognition properties of biological
systems
Most frequently used
Multi-layer perceptrons – Input with bias, connected by weights to hidden, output
Backpropagation neural networks
Support Vector Machines
Separate database to mutually exclusive regions
Transform to another problem space
Kernel functions (dot product)
Output of new points predicted by position
Comparison with classification trees
Not possible to know which features or combination of
features most influence a prediction
Multilayer Perceptrons
Non-linear transfer functions to weighted sums of inputs
Werbos algorithm
Random weights
Training set, Testing set
Support Vector Machines
3 steps
Support Vector creation
Maximal distance between points found
Perpendicular decision boundary
Allows some points to be misclassified
Pima Indian data with X1(glucose) X2(BMI)
What is Association Rule Mining?
Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction
databases, relational databases, and other information
repositories
Example of Association Rules
{High LDL, Low HDL} → {Heart Failure}

PatientID   Conditions
1           High LDL, Low HDL, High BMI, Heart Failure
2           High LDL, Low HDL, Heart Failure, Diabetes
3           Diabetes
4           High LDL, Low HDL, Heart Failure
5           High BMI, High LDL, Low HDL, Heart Failure

People who have high LDL (“bad” cholesterol) and low HDL (“good” cholesterol) are at higher risk of heart failure.
Association Rule Mining
Market Basket Analysis
Groups of items bought together are placed together
Healthcare
Understanding associations among patients with demands for similar treatments and services
Goal: find items for which joint probability of occurrence is high
Basket of binary-valued variables
Results take the form of association rules, augmented with support and confidence
Association Rule Mining
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets and X ∩ Y = ∅
Rule Evaluation Metrics
Support (s): Fraction of transactions that contain both X and Y
$$s = P(X \cup Y) = \frac{\#\,\text{trans containing } (X \cup Y)}{\#\,\text{trans in } D}$$
Confidence (c): Measures how often items in Y appear in transactions that contain X
$$c = P(Y \mid X) = \frac{\#\,\text{trans containing } (X \cup Y)}{\#\,\text{trans containing } X}$$
[Figure: Venn diagram inside D showing trans containing X, trans containing Y, and trans containing both X and Y]
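Both metrics can be computed for the rule {High LDL, Low HDL} → {Heart Failure} on the five-patient example from the earlier slide. This is a sketch that treats "High LDL" and "Low HDL" as two separate items; the function names `support` and `confidence` are chosen for illustration.

```python
# Condition sets per PatientID, copied from the association rule example.
transactions = [
    {"High LDL", "Low HDL", "High BMI", "Heart Failure"},
    {"High LDL", "Low HDL", "Heart Failure", "Diabetes"},
    {"Diabetes"},
    {"High LDL", "Low HDL", "Heart Failure"},
    {"High BMI", "High LDL", "Low HDL", "Heart Failure"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(lhs, rhs, db):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(lhs | rhs, db) / support(lhs, db)

x = {"High LDL", "Low HDL"}
y = {"Heart Failure"}
s = support(x | y, transactions)
c = confidence(x, y, transactions)
print(s, c)
```

Four of the five patients have all three conditions, and every patient with high LDL and low HDL also has heart failure, so the rule has support 0.8 and confidence 1.0 on this toy table.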
The Apriori Algorithm
Starts with most frequent 1-itemset
Include only those “items” that pass threshold
Use 1-itemset to generate 2-itemsets
Stop when threshold not satisfied by any itemset

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do
    Candidate Generation: C_{k+1} = candidates generated from L_k;
    Candidate Counting: for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_sup
return ∪_k L_k;
Apriori-based Mining
Min_sup = 0.5

Database D
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates
Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets
Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates: ab, ac, ae, bc, be, ce

Scan D: Counting
Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets
Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates: bce

Scan D: Freq 3-itemsets
Itemset   Sup
bce       2
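The level-wise search can be reproduced on database D with a short function. This is a sketch of the Apriori idea rather than an optimized implementation; the function name `apriori` and the join-then-prune way of generating candidates are choices made here.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    min_count = min_sup * len(transactions)
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]  # 1-candidates
    frequent = {}
    k = 1
    while candidates:
        # Candidate counting: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: unions of frequent k-itemsets that have k+1 items.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent

# Database D from the example (TIDs 10, 20, 30, 40).
db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(db, min_sup=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

The prune step is why bce is the only 3-candidate: abc and ace are dropped without a scan because ab and ae are not frequent.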
Principal Component Analysis
Principal Components
In cases of a large number of variables, it is highly possible that some subsets of the variables are very correlated with each other. Reduce variables but retain variability in dataset
Linear combinations of variables in the database
Variance of each PC maximized – Display as much spread of the original data
PCs orthogonal to each other – Minimize the overlap in the variables
Each component normalized so that its sum of squares is unity – Easier for mathematical analysis
Number of PC < Number of variables
Associations found
Small number of PC explain large amount of variance
Example: 768 female Pima Indians evaluated for diabetes
Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, diastolic blood pressure, triceps skin fold thickness, two-hour serum insulin, BMI, diabetes pedigree function, age, diabetes onset within last 5 years
PCA Example
National Cancer Institute
CancerNet
http://www.nci.nih.gov
CancerNet for Patients and the Public
CancerNet for Health Professionals
CancerNet for Basic Researchers
CancerLit
Conclusion
The medical records of about ¾ billion people are electronically available
Data mining in medicine is distinct from other fields due to the nature of the data: heterogeneous, with ethical, legal and social constraints
The most commonly used technique is classification and prediction, with different techniques applied for different cases
Association rules describe the data in the database
Medical data mining can be the most rewarding despite the difficulty
Thank you !!!