Data Mining

Posted 20 Nov 2013

Discriminant Model of the Rupture Status of Aneurysm:
Analysis by the Data Mining Method

Chang-Woo Ryu¹, Yong Seong Park¹, Hee Dong Jung¹, Jun-Seok Koh², Eui Jong Kim³, Woo Suk Choi³

¹Department of Radiology and ²Neurosurgery, Gangdong Kyung Hee University Hospital

³Department of Radiology, Kyung Hee University Medical Center

Unruptured Cerebral Aneurysms


Surgical management of unruptured cerebral aneurysms is controversial.

Predicting whether an aneurysm will rupture is an indispensable challenge, because surgery on an unruptured aneurysm is the lesser of two evils.


Data mining!

How can we deal with numerous variables?

Data mining is…

Exploring and modeling large amounts of data

In the clinical field, deriving a model that uses patient-specific information to predict the outcome of interest and thereby support clinical decision-making

Data mining is a process of data analysis rather than a specific method of data analysis


Decision tree model

The decision tree model is one of the main techniques used in data mining

It is a supervised data mining model used for hierarchical classification or prediction

[Figure: example decision tree for classification and prediction. Decision nodes: Outlook? (sunny, overcast, rain), Humidity? (≤ 70, > 70), Windy (true, false). Node counts (Play / Don't play): 9/5 at the root; 4/0, 3/2, and 2/3 after the first split; 2/0, 0/3, 0/2, and 3/0 below the Humidity and Windy splits.]

Dependent variable: PLAY

It is organized as a rooted tree with two types of nodes, called decision nodes and class nodes
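A rooted tree with decision nodes and class nodes can be sketched in a few lines of Python. This is only an illustration: the class names, the `classify` helper, and the record keys are hypothetical, and the assignment of the Outlook values to subtrees (sunny to Humidity, rain to Windy) is an assumption borrowed from the classic weather example, since the slide text does not preserve which branch leads where.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class ClassNode:
    """Leaf of the tree: holds the predicted class."""
    label: str


@dataclass
class DecisionNode:
    """Internal node: asks a question and follows the matching branch."""
    question: str
    test: Callable[[dict], str]      # maps a record to a branch key
    branches: Dict[str, "Node"]


Node = Union[ClassNode, DecisionNode]


def classify(node: Node, record: dict) -> str:
    """Walk from the root to a class node and return its label."""
    while isinstance(node, DecisionNode):
        node = node.branches[node.test(record)]
    return node.label


# Assumed branch layout (not recoverable from the slide text).
humidity = DecisionNode(
    "Humidity?",
    lambda r: "<= 70" if r["humidity"] <= 70 else "> 70",
    {"<= 70": ClassNode("Play"), "> 70": ClassNode("Don't play")},
)
windy = DecisionNode(
    "Windy?",
    lambda r: "true" if r["windy"] else "false",
    {"true": ClassNode("Don't play"), "false": ClassNode("Play")},
)
tree = DecisionNode(
    "Outlook?",
    lambda r: r["outlook"],
    {"sunny": humidity, "overcast": ClassNode("Play"), "rain": windy},
)

prediction = classify(tree, {"outlook": "sunny", "humidity": 65, "windy": False})
```

Each prediction is a single path from the root to a class node, which is why such a model is easy to read off as a set of rules.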

Objective

We aimed to build a prediction model of aneurysm rupture using a data mining method.

Methods

Flow chart for the decision-making model

[Define Problem]

Discriminate the rupture status of aneurysm

[Data Collection]

553 aneurysms in 448 consecutive patients (mean age 57.44 ± 11.21 years; 157 males and 291 females)

Consecutive patients, May 2006 to August 2009

Aneurysms diagnosed on 3D rotational angiography

Aneurysm rupture

SAH defined on CT or lumbar puncture

Multiple aneurysms: confirmed in the surgical field

327 unruptured aneurysms

226 ruptured aneurysms


Clinical variables

1. Age (years)
2. Sex
3. Hypertension (HTN)
4. Diabetes mellitus (DM)
5. Smoking
6. Drinking
7. BMI (body mass index) (kg/m²)

Please refer to the appendix!

[Data Preprocessing]

[Selection of Classification Technique]

Algorithm

Logistic regression

Decision tree: chi-square, Gini, entropy
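The three decision tree criteria listed above differ only in how they score a candidate split. As a rough sketch, assuming the textbook definitions of Gini impurity, entropy, and the Pearson chi-square statistic (the slides do not say which software or exact variants were used), the first split of the weather example (9 Play / 5 Don't play, divided into 4/0, 3/2, and 2/3) can be scored as follows. All function names are hypothetical.

```python
import math


def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (in bits) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted


def chi_square(children):
    """Pearson chi-square statistic of the branch-by-class table."""
    n = sum(sum(ch) for ch in children)
    n_classes = len(children[0])
    class_totals = [sum(ch[k] for ch in children) for k in range(n_classes)]
    stat = 0.0
    for ch in children:
        for k, observed in enumerate(ch):
            expected = sum(ch) * class_totals[k] / n
            stat += (observed - expected) ** 2 / expected
    return stat


parent = [9, 5]                       # Play, Don't play at the root
children = [[4, 0], [3, 2], [2, 3]]   # counts after the first split

gini_gain = impurity_reduction(parent, children, gini)
entropy_gain = impurity_reduction(parent, children, entropy)
chi2 = chi_square(children)
```

A tree builder would compute such a score for every candidate split and keep the best one; the Gini and entropy criteria prefer the largest impurity reduction, while the chi-square criterion prefers the split with the strongest association between branch and class.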


A randomly selected 80% of subjects is used to derive the model


[Build the Classifier Model]

Is the selected model good?

Yes: [Analyze Results]

No: return to [Data Preprocessing]

The remaining 20% of subjects is used to validate the model
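A random 80%/20% partition of this kind might look like the sketch below. It treats each of the 553 aneurysms as one record, which is an assumption (the slides speak of "subjects"), and the function name, seed, and rounding rule are illustrative choices rather than the authors' procedure.

```python
import random


def split_indices(n_records, train_fraction=0.8, seed=0):
    """Shuffle record indices and cut them into training and validation sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n_records))
    rng.shuffle(indices)
    cut = int(n_records * train_fraction)
    return indices[:cut], indices[cut:]


# 553 aneurysms, as reported in the data collection step.
train_idx, valid_idx = split_indices(553)
```

The model is fitted on `train_idx` only; `valid_idx` is held back so that the reported validation error is measured on records the model has never seen.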


ROC curve


Validation of model
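For readers unfamiliar with the ROC curve, the sketch below shows one common way to build it: sweep a threshold over the predicted scores and record the false positive and true positive rates, then integrate with the trapezoid rule to get the area under the curve. The labels and scores are made-up toy values, not data from this study, and the function name is hypothetical.

```python
def roc_curve_auc(labels, scores):
    """Return (fpr, tpr, auc) for binary labels (1 = positive) and scores."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Records with the same score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    # Trapezoid rule over consecutive ROC points.
    auc = sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )
    return fpr, tpr, auc


# Toy example: 1 = ruptured, 0 = unruptured (hypothetical values).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr, auc = roc_curve_auc(toy_labels, toy_scores)
```

An area of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two classes.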



[Analyze Results]

Results

Abbreviations

BMI: body mass index
HWR: height-width ratio
BNF: bottleneck factor
NoofAn: number of aneurysms
AR: aspect ratio
PAM: parent artery diameter
MAX: maximum diameter of aneurysm
DT: decision tree
Cp: planar anisotropy

Logistic regression (stepwise)

Variables              Odds ratio
Height-Width Ratio     3.570
Aspect Ratio           2.179
Size Ratio             1.404
BMI                    0.908
Age                    0.967
No. of An (2 vs 1)     0.241
Side (2 vs 0)          0.514
Side (1 vs 0)          0.448
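An odds ratio is the exponential of the corresponding logistic regression coefficient, so the table can be read back as coefficients and used to see how the odds scale. The sketch below assumes each odds ratio is expressed per one-unit increase of the variable (for example per year of age), which the slides do not state; the helper name is hypothetical.

```python
import math

# Odds ratios as reported in the stepwise logistic regression table.
odds_ratios = {
    "Height-Width Ratio": 3.570,
    "Aspect Ratio": 2.179,
    "Size Ratio": 1.404,
    "BMI": 0.908,
    "Age": 0.967,
}

# Regression coefficient = natural log of the odds ratio.
coefficients = {name: math.log(value) for name, value in odds_ratios.items()}


def odds_multiplier(odds_ratio, delta):
    """Factor by which the odds change when the variable increases by delta units."""
    return odds_ratio ** delta


# Assumed per-year odds ratio for age: effect of a 10-year difference.
age_10_years = odds_multiplier(odds_ratios["Age"], 10)
```

Odds ratios above 1 (positive coefficients) go with higher odds of the modeled outcome, and those below 1 with lower odds.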

Decision Tree: Chi-square

Leaves   Training   Validation
1        0.4264     0.3784
2        0.3359     0.4144
3        0.3307     0.3423
4        0.3307     0.3423
5        0.3204     0.3153
6        0.3152     0.3243
[Figure: chi-square decision tree. Node labels shown: AR: 1.35, Max: 5.56, PAM: 2.39, HTN, BMI, BNF, AR, AR]
Decision Tree: Gini & Entropy

Leaves   Training   Validation
1        0.4264     0.3784
2        0.2972     0.2432
3        0.2920     0.2162
4        0.2713     0.1982
5        0.2661     0.1802
6        0.2661     0.1802
7        0.2661     0.1802
8        0.2610     0.1712
9        0.2610     0.1712
10       0.2610     0.1712
11       0.2610     0.1712
12       0.2610     0.1712
13       0.2610     0.1712
14       0.2610     0.1712
15       0.2610     0.1712
16       0.2222     0.1622
17       0.2222     0.1622
18       0.2222     0.1622
19       0.2222     0.1622
20       0.2222     0.1622
21       0.2222     0.1622
22       0.2222     0.1622
23       0.2222     0.1622
24       0.2222     0.1622
25       0.2222     0.1622
26       0.2196     0.1622
27       0.2196     0.1622
28       0.2119     0.1622
29       0.2093     0.1622
30       0.1990     0.1712
31       0.2093     0.1802
32       0.2016     0.1802
33       0.1990     0.1802
34       0.1886     0.1892
35       0.1835     0.1982
36       0.1757     0.2072
37       0.1628     0.2162
38       0.1628     0.2252
39       0.1576     0.2432
[Figure labels: DT: Chi; DT: Gini & Entropy; Logistic]

Description           Target    Misclassification Rate   Valid: Misclassification Rate   Test: Misclassification Rate
Logistic Regression   RUPTURE   0.206                    0.279                           0.236
Tree (Chi)            RUPTURE   0.320                    0.315                           0.309
Tree (Entropy)        RUPTURE   0.222                    0.162                           0.236
Tree (Gini)           RUPTURE   0.222                    0.162                           0.236
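Misclassification rates convert to accuracy as 1 minus the rate, which makes the models easier to compare. The sketch below does that for the table above; it assumes the first, unlabeled rate column refers to the training data, and the variable names are illustrative.

```python
# model -> (training, validation, test) misclassification rates from the table.
# Treating the first column as the training rate is an assumption.
rates = {
    "Logistic Regression": (0.206, 0.279, 0.236),
    "Tree (Chi)": (0.320, 0.315, 0.309),
    "Tree (Entropy)": (0.222, 0.162, 0.236),
    "Tree (Gini)": (0.222, 0.162, 0.236),
}

# Accuracy = 1 - misclassification rate, rounded to three decimals.
accuracies = {
    name: tuple(round(1.0 - r, 3) for r in triple)
    for name, triple in rates.items()
}

# Models with the lowest validation misclassification rate (ties kept).
lowest_valid = min(triple[1] for triple in rates.values())
best_on_validation = [
    name for name, triple in rates.items() if triple[1] == lowest_valid
]
```

For instance, a misclassification rate of 0.320 corresponds to an accuracy of 0.68.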


As a result of the stepwise logistic regression, the independent variables correlated with aneurysm rupture were location 2, location 3, shape, number of daughter sacs, diameter, bottleneck factor, size ratio, BMI, HTN, DM, and age.

In the entropy, chi-square, and Gini algorithms of the decision tree models, the numbers of leaves at which the misclassification rate of the test model matched that of the validation model were 5, 36, and 35, respectively.

In the chi-square decision tree, the independent variables correlated with aneurysm rupture were HWR, BNF, Cp1, and number of aneurysms.

The decision tree analysis suggests an accuracy of 68% for the ruptured aneurysms.

[In summary…]

Cont'd

Decision tree (Chi): easy to understand and interpret, but lower predictive probability

Decision tree (Gini, entropy): high predictive probability, but requires too much pruning

Logistic regression: high predictive probability


Discussion

Advantages of Decision Tree

Easy to understand and interpret

Works for categorical and quantitative data

Attributes can be chosen in any desired order

Pruning a decision tree is very easy

Can be used to identify outliers

Can be used even when domain experts are absent




[In summary…]

Comparison between data mining and conventional statistics

                 Data mining                         Conventional statistics
Volume of data   The larger, the better              Can use small data
Purpose          To find unsuspected relationships   To validate a hypothesis made by the data owner

Conclusion


Based on cases, data mining techniques can be used to establish an expert system that automatically decides the management of intracranial aneurysms.

Its clinical value needs to be further evaluated.


Appendix: Geometric variables

Maximum diameter: the longest among height, depth, and width

Aspect ratio: height/neck

Bottleneck ratio: maximum diameter/neck diameter

Size ratio: maximum diameter/parent artery diameter

Height/width ratio

Volume neck ratio: volume/neck area (ml/m²)
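The ratio definitions above translate directly into code. The sketch below is a minimal illustration: the class and field names are hypothetical, the example measurements are invented, and all lengths are assumed to share one unit. The volume neck ratio is left out because it needs a volume and a neck area that the other definitions do not provide.

```python
from dataclasses import dataclass


@dataclass
class AneurysmGeometry:
    """Linear measurements of one aneurysm, all in the same length unit."""
    height: float
    width: float
    depth: float
    neck: float            # neck diameter
    parent_artery: float   # parent artery diameter

    @property
    def max_diameter(self) -> float:
        # The longest among height, depth, and width.
        return max(self.height, self.depth, self.width)

    @property
    def aspect_ratio(self) -> float:
        return self.height / self.neck

    @property
    def bottleneck_ratio(self) -> float:
        return self.max_diameter / self.neck

    @property
    def size_ratio(self) -> float:
        return self.max_diameter / self.parent_artery

    @property
    def height_width_ratio(self) -> float:
        return self.height / self.width


# Invented example measurements, for illustration only.
example = AneurysmGeometry(height=6.0, width=4.0, depth=5.0, neck=3.0, parent_artery=2.5)
```

With these example values the aspect ratio and bottleneck ratio coincide because the height is also the maximum diameter; they differ whenever width or depth is the longest dimension.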


[Figure: aneurysm schematic labeling height, width, neck, depth, and parent artery]