University of South Australia


School of Computer and Information Science

Bachelor of Software Engineering


Research Proposal


Discover Patterns in Adverse Drug Reaction


Name: Ernst J Joham

ID Number: 10005126


SUPERVISOR:
DR JIUYONG LI





: DR JAN STANEK



ABSTRACT

This research will use medical data to investigate and find patterns in adverse drug reactions through data mining. Wilson, Thabane and Holbrook (2003) define data mining as the extraction of valid, unknown and actionable information from databases. According to Furey (2005), ‘each year 2.2 million Americans suffer serious adverse reactions to drugs which are referred to as Adverse Drug Reaction (ADR)’. The World Health Organization (2002) overview of adverse events clearly highlights this importance and describes these adverse events as fatal, life-threatening, permanently or significantly disabling, or requiring or prolonging hospitalization.

Data mining can discover patterns in which factors such as age, height and weight, combined with certain conditions or with taking different drugs together, lead to outcomes that cause adverse events. The purpose of the research is to try to discover such patterns through data mining on a far from ideal data set that contains noise and missing values. Two core questions are explored: (1) is it possible to discover patterns in sparse datasets?, and (2) what patterns can be identified through data mining for ADR?

This research project will seek answers to these questions using pre-recorded data. The data being used will provide real-world evidence for detecting adverse drug reactions. An interpretative quantitative methodology will be used. The research will involve sorting through approximately twelve thousand existing records and selecting the relevant information. The R statistical package will be used to find patterns and interpret commonalities. R (R Project for Statistical Computing) is an open source package with functional language capabilities allowing graphical display and statistical exploration of datasets. Once the results are obtained, an in-depth analysis and interpretation of the data will take place. The conclusion of the research will determine whether a far from ideal data set can be mined with techniques that are more suitable for medical datasets.










DECLARATION


I declare the following to be my own work, unless otherwise referenced, as
defined by the University’s policy on plagiarism.




Ernst J Joham




TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Background
   1.2 Motivation
   1.3 Research Objective and Study Questions
   1.4 Thesis Structure
2. LITERATURE REVIEW
3. METHODOLOGY
   3.1. Dataset
   3.2. Research Process
   3.3. Data Mining Tool
   3.4. Algorithms
4. SCHEDULE
REFERENCE







1. INTRODUCTION

1.1 Background

Discovering patterns in medical datasets is still very difficult and challenging, but very rewarding (Roddick & Graco 2003). Medical data is demanding compared to data from other fields: techniques that can mine medical datasets are likely to work on other datasets as well. There are many more constraints and issues that limit the way data mining is undertaken for medical datasets. Some of the issues facing medical data are the way the data is collected, the accuracy of the data, and the ethical, legal and social issues that come with patient records (Cios & Moore 2002).



The World Health Organization (2002) reports that in some countries admissions due to ADRs are more than 10%. The growing problem of this medical morbidity and mortality places a high financial burden on hospitals. This growing problem needs to be addressed by monitoring systems and other alternatives.


Data mining can be one of these alternatives, helping to detect ADRs by following a data mining process and using certain techniques to extract patterns from medical datasets that identify the cause of adverse events that are life-threatening and prolong hospitalization.


Data mining techniques have improved since data mining began and with the introduction of databases, but a database does not benefit health professionals until its contents are turned into useful information. By using effective data mining tools and algorithms and a step-by-step data mining process, it is possible to produce useful and new information from the dataset (Wilson, Thabane & Holbrook 2003).



This thesis explores the use of data mining techniques to discover patterns in medical data. Many issues make mining medical data difficult, and it is important to overcome this complexity. By using medical datasets, data mining techniques and technologies are pushed to their limits (Roddick & Graco 2003). This aspect will test the effectiveness of the various algorithms used when their results are evaluated.

1.2 Motivation

The motivation for this project is my personal interest in data mining and the challenges that are involved in today’s knowledge discovery in databases. With the project I hope to discover patterns of interest using low quality medical data. There is a clear need for more research into data mining of medical applications, as little research has been published so far. Data quality and other issues with medical datasets do affect the patterns that are finally discovered.

Many techniques now have built-in mechanisms to help with noise and missing values. In this research a number of algorithms will be tested to see whether they can handle a data set that is far from ideal for data mining. The R statistical tool will be used for the data mining process. R was chosen because it is an open source tool with the benefit of a programming language, and it is widely used by data mining professionals. The patterns discovered will be interpreted and a conclusion will be made on the soundness of the algorithms.

1.3 Research Objective and Study Questions

The aim of this research is to use data mining methods in an attempt to produce relevant results from real world data. The interpretation of the results from this research will determine whether data sets that face issues and constraints such as noise, incompleteness and a limited number of attributes can still produce patterns of interest.

The following research questions will be addressed in this thesis:

(1) Is it possible to discover patterns in sparse datasets?

(2) What patterns can be identified through data mining for ADR?




1.4 Thesis Structure

The layout for the thesis is as follows:

Section 2 is an overview of the literature. It reviews current studies in the area of data mining that deal with noisy and incomplete data, and with data from which patterns are generally hard to extract because of issues with the data. The best techniques for this kind of data are also reviewed.

Section 3 describes the methodology used for this research. It includes an overview of the data used for the project, the data mining tools used to analyse the dataset, and the techniques used to produce the models and results.

Section 4 provides an overview of data mining and the process involved in a data mining project. It also looks at some of the techniques likely to be used for data mining.

Section 5 answers the research questions. The models are interpreted and the results are discussed.

Section 6 is a summary of the entire study, the limitations that affected it, and suggestions for future research.













2. LITERATURE REVIEW

With the growth of data mining and the search for useful information in datasets, it is not surprising that more research is needed into data quality and into effective data mining algorithms that can detect interesting relationships within a dataset. There are still relatively few publications and little research on data mining, especially for medical datasets with noise and missing values. Several studies have focused on the problems encountered with datasets and the best techniques to use when data mining medical applications. For example, Cios & Moore (2002) address the difficulty and constraints of collecting medical data to mine and the technical and social reasons behind missing values in the data set. A study by Brown & Kros (2003) focuses further on the impact of missing data and how existing methods can help with the problems of missing data.

They categorise methods for dealing with missing data into:

• Use complete data only

• Delete selected cases or variables

• Data imputation

• Model-based approaches


Before any of these methods can be applied to the data set, the analyst must understand each type of missing value; only then can a decision be made on how to address it (Brown & Kros 2003). Missing values can be data missing at random, data missing completely at random, non-ignorable missing data, or outliers treated as missing data (Brown & Kros 2003).
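As a rough illustration of the first and third categories above, the following minimal sketch contrasts "use complete data only" with a simple form of data imputation. It is written in Python purely for illustration (the project itself uses R), the toy records and field names are hypothetical, and filling with the most common value is only one of many possible imputation rules, not the method described by Brown & Kros.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"age_group": 1, "route": "oral", "recovered": 1},
    {"age_group": None, "route": "oral", "recovered": 0},
    {"age_group": 3, "route": None, "recovered": 1},
    {"age_group": 3, "route": "intravenous", "recovered": 1},
]


def complete_cases(rows):
    """Use complete data only: drop every record that has a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mode(rows):
    """Data imputation: fill each gap with the column's most common observed value."""
    filled = [dict(r) for r in rows]  # copy so the input is left untouched
    for col in rows[0]:
        observed = [r[col] for r in rows if r[col] is not None]
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[col] is None:
                r[col] = mode
    return filled


kept = complete_cases(records)
imputed = impute_mode(records)
print(len(records), "records,", len(kept), "complete,", len(imputed), "after imputation")
```

Complete-case analysis halves this toy data set, while imputation keeps every record at the cost of inserting values that were never observed, which is the trade-off the analyst has to weigh for each type of missing value.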

An alternative approach to handling missing values is conceptual reconstruction, where only conceptual aspects of the data are mined from the incomplete data set (Aggarwal & Parthasarathy 2001). They further argue that some methods, such as data imputation, are prone to errors. Aggarwal & Parthasarathy (2001) give an example, shown in Table 1, in which 20% to 40% of the entries in each data set are missing. Using the conceptual reconstruction method, results on the first three data sets were at least 92% as accurate as on the original data sets.





Dataset              Cao     CAM(20%)   CAM(40%)
BUPA                 62.4    0.963      0.927
Musk (1)             76.2    0.943      0.92
Musk (2)             95.0    0.96       0.945
Letter Recognition   84.9    0.825      0.62

Table 1 Conceptual reconstructed data sets (Aggarwal & Parthasarathy 2001)

Other studies have gone beyond the impact of missing values and explored the impact of noise and how it can influence the output of models. Zhu & Wu (2004) divide noise into class noise and attribute noise. Their research concentrated on attribute noise, as class noise is much cleaner than first thought (Zhu & Wu 2004). Attribute noise is more difficult to handle and includes:

(1) Incorrect attribute values

(2) Missing or don’t know attribute values

(3) Incomplete attributes or don’t care values

Some researchers have focused on data cleansing tools to help eliminate noise, but these can only achieve a reasonable result (Zhu & Wu 2004). Noise handling methods can help to eliminate noise in data sets. Hulse et al (2007) introduce the Pairwise Attribute Noise Detection Algorithm (PANDA), which can detect attribute noise within datasets, allowing noisy data to be removed only if required. The other algorithm introduced is the distance-based outlier detection technique (DM), which is similar to PANDA but not as good at detecting attribute noise. Once noise is detected it can be removed; if it is not removed, it may lead to a low quality set of hypotheses. Table 2 displays the result for a dataset using PANDA and DM. PANDA identifies more noise instances.

Instance category   1-10         11-20        21-30        1-30
                    PANDA  DM    PANDA  DM    PANDA  DM    PANDA  DM
Noise               6      6     7      4     8      8     21     18
Outliers            2      4     2      6     1      2     5      12
Exceptions          2      0     1      0     1      0     4      0
Typical             0      0     0      0     0      0     0      0

Table 2 10% of a dataset of 30 most suspicious instances (Hulse et al 2007)
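To make the idea of ranking "most suspicious instances" concrete, here is a minimal Python sketch of a generic distance-based outlier score: each instance is scored by the distance to its k-th nearest neighbour, and the highest scores are flagged first. This is an assumption-laden illustration only. It is neither PANDA nor necessarily the DM technique evaluated by Hulse et al, and the points and the choice of k are hypothetical.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def most_suspicious(points, k=2, top=1):
    """Return the indices of the `top` points with the largest scores."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:top]


# Hypothetical two-attribute instances: four close together, one far away.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 9.0)]
suspects = most_suspicious(points, k=2, top=1)
print("most suspicious instance index:", suspects)
```

A ranking like this only says which instances look unusual; deciding whether a flagged instance is noise, an outlier or a genuine exception, as in Table 2, still needs inspection by someone who knows the data.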




Several researchers have focused on techniques that have built-in mechanisms to handle noise and missing values and that are more appropriate for medical applications. Lavrač (1999) reviews a number of techniques that have been applied and are more suited to medical data sets.

These include decision trees, logic programs, K-nearest neighbour, and Bayesian classifiers. Lavrač (1999) describes these as ‘intelligent data analysis techniques in the extraction of knowledge, regularities, trend and representation cases from patients data stored in medical records’. Lee et al (2000) believe that techniques from which users can easily extract specific knowledge are the key to making medical decisions, and studies have concluded that Bayesian networks and decision trees are the primary techniques applied in medical information systems. Fayyad et al (cited in Lee et al 2000, p. 85) indicate that the diverse fields for knowledge discovery draw upon the main components and methods shown in figure 1.


Figure 1 Main components of KDD and DM and their relationship (Lee et al 2000)


Obenshain (2004) reports a study on drug discovery which showed that neural networks performed better than logistic regression, but that the decision tree did better at identifying the active compounds most likely to have biological activity.

Other researchers into data mining for medical datasets have focused on the data mining process, which includes dealing with missing values and noise and choosing the techniques for knowledge discovery. Cios & Moore (2002) acknowledge that it is important for medical data mining to follow a procedure for success in knowledge discovery. This can be a nine-step process or the DMKD process, which adds several steps to the CRISP-DM model and has been applied to several medical problem domains. Figure 2 shows how the process model works, which can be semi-automated for medical applications (Cios & Moore 2002).


Figure 2 DMKD process model (Cios & Moore 2002)


Wang & Wang (2008) argue that most process models focus on the results rather than on gaining new knowledge. Medical data mining applications are expected to discover new knowledge and should follow a five-stage data mining development cycle: planning tasks, developing data mining hypotheses, preparing data, selecting data mining tools, and evaluating data mining results.


Current literature has focused on ways to improve data sets by applying methods for missing values and noise. Not many of these methods have been applied to medical data sets. The same is true of techniques: tests have been done, but there is still room for further research into which techniques work when real-world medical data sets are mined. This study will further investigate ways to achieve a successful outcome in discovering patterns in a medical data set. The CRISP-DM data mining process will be used, together with the R statistical package for handling noise and missing values. Zhu & Wu (2004) indicate that powerful, cost-effective tools are necessary and can greatly assist the data cleansing process, helping to achieve the level of data quality needed for data mining. A number of algorithms will also be tested on the medical data set to see how well they perform on a data set that contains noise.




3. METHODOLOGY

3.1. Dataset

The dataset for the project is a pre-recorded dataset provided by external clients who are kept anonymous. Because of the confidentiality, ethical and legal issues surrounding the dataset, sensitive information had to be removed before we were able to view and use the data. There are a total of 1286 records of patients with ADR that will be used for the data mining project.

The information in the dataset includes characteristics of the patients and drugs involved in adverse drug reactions. The information that was made available in the dataset includes:

• Date when the patient was admitted for ADR

• Age recorded in days

• Brand is the generic drug for the main drug

• Drug that was given to the patient

• Route of administration

• Probability of the drug being the cause of ADR

• Severity of the ADR

• Recovered or not

• UR number, which includes the patient's details

• ATC (Anatomical Therapeutic Chemical), a classification system for drugs


It is worth noting that, due to the limited attributes and the incomplete and missing information, only a few attributes were chosen for use.








3.2. Research Process

The project uses the CRISP-DM data mining method, for which the consortium defines a six-step data mining process as shown in figure 1.

Figure 1: CRISP-DM six-step process model (CRISP-DM, 2000)

Understand the business: this is where the project was reviewed by the client, supervisor and team member to decide which direction to take and what the goal of the project was. The main aim of this research is to test techniques to see whether patterns are formed using a sparse dataset.

Understand the dataset: for this stage the dataset was reviewed using the Rattle tool to give a summary of the attributes as a whole and to query each attribute separately, visualising the data in various formats to aid the decision on which attributes to keep for further analysis. Since the attributes for this dataset were limited, a few attributes stood out more and were considered for the next phase.

Data preparation: this is where the data went through two extra processes, Data Cleaning and Data Transformation, both done in the R tool because scripts make the data cleansing and transformation easy to carry out. The objective for this phase was to decide on the structure of the data for the next phase. Five attributes were chosen: Date, Age in Days, Route, Recovered, and ATC code for the drug. These attributes were chosen because they were expected to give a better result for modelling. Table 1 shows each attribute's abbreviated name and the values it was given.





Variable                                                          Abbreviation

Date when the patient was admitted to hospital for ADRs          ADRDATE
(October-March = 1, April-September = 0)

How old the patient is, categorised into equal numbers of        AGE
records (0-2 years old = 1, 2-5 years old = 2,
5-11 years old = 3, 11-16 years old = 4,
and above 16 years of age = 5)

The administration of the medication that caused the ADR         ROUTE
is either oral or intravenous (Oral = 1, Intravenous = 0)

Recovered from ADRs or not                                        RECOV
(Recovered = 0, Not recovered = 1)

The drugs given to the patient are either classified as          ATC
antibiotics or not (Antibiotics = 1, Not antibiotics = 0)

Table 1 shows the binary values that the attributes were given.
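The coding in Table 1 can be sketched as a small transformation step. The sketch below is in Python for illustration only (the project's own preparation scripts were written in R and are not reproduced here). The input field names, the 365.25 days-per-year conversion, the handling of ages that fall exactly on a band boundary, and the use of the ATC prefix J01 (antibacterials for systemic use) to decide "antibiotic" are all assumptions, not details taken from the project.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: the source does not say how days map to years


def encode_adrdate(admitted):
    """ADRDATE: October-March = 1, April-September = 0."""
    return 1 if admitted.month >= 10 or admitted.month <= 3 else 0


def encode_age(age_in_days):
    """AGE: five bands coded 1-5.

    Assumption: an age exactly on a boundary goes to the older band,
    except 16 years, which stays in band 4 because band 5 is 'above 16'.
    """
    years = age_in_days / DAYS_PER_YEAR
    if years < 2:
        return 1
    if years < 5:
        return 2
    if years < 11:
        return 3
    if years <= 16:
        return 4
    return 5


def encode_route(route):
    """ROUTE: Oral = 1, Intravenous = 0 (any other value raises KeyError)."""
    return {"oral": 1, "intravenous": 0}[route.strip().lower()]


def encode_recov(recovered):
    """RECOV: Recovered = 0, Not recovered = 1."""
    return 0 if recovered else 1


def encode_atc(atc_code):
    """ATC: Antibiotics = 1, otherwise 0 (assumes antibiotics are ATC group J01)."""
    return 1 if atc_code.strip().upper().startswith("J01") else 0


def encode_record(rec):
    """Map one raw record (hypothetical field names) to the Table 1 coding."""
    return {
        "ADRDATE": encode_adrdate(rec["date"]),
        "AGE": encode_age(rec["age_days"]),
        "ROUTE": encode_route(rec["route"]),
        "RECOV": encode_recov(rec["recovered"]),
        "ATC": encode_atc(rec["atc"]),
    }


row = encode_record({
    "date": date(2008, 11, 7),
    "age_days": 400,
    "route": "Oral",
    "recovered": False,
    "atc": "J01CA04",
})
print(row)
```

Keeping each attribute's rule in its own function makes the boundary assumptions easy to change once the real cut-offs are confirmed with the data owners.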

Modelling: this phase involved selecting the most appropriate algorithms for the research, which for this study were logistic regression, decision tree, and risk pattern mining.

Evaluation: this was the last phase for the project, where the models were interpreted and the results determined whether the project objectives were met. Due to time constraints the results of the three techniques were used to answer the project objectives, and the first three phases were only completed once.




3.3. Data Mining Tool

The data mining tool chosen for the project is the R package for statistical computing and graphics, which has programming capabilities, together with Rattle, a user interface that can be combined with R. These tools run on a variety of platforms including UNIX, Windows, and MacOS, and R also allows binding with other languages and technologies such as Python, XML, SOAP, and Perl. Both packages are free software and provide a sophisticated way of performing data mining. A screenshot of the R and Rattle tools is shown in figure 2.



Figure 2: R and Rattle tool for data mining screenshot.

Rattle is used by many government and private organisations around the world, including the Australian Taxation Office, and is being adopted by a number of colleges and universities for teaching data mining.

R and Rattle combined provide a good set of data mining algorithms to select from for modelling. They include clustering, association rules, linear models, trees, and neural network models. Besides the models, there are a variety of ways to visualise the data, such as histograms and plots. Data from almost any source can also be loaded and used.




Most of the data preparation was done in R using its scripting language, and the decision tree and logistic regression were modelled using Rattle. The only other algorithm used for the project was Mining Risk Patterns. The software for this algorithm was run on a Linux 9.0 platform.

3.4. Algorithms

The data mining techniques adopted for the project were logistic regression, decision tree, and the risk pattern mining algorithm. Each of these techniques provides its own way of analysing the medical dataset that was provided.

Decision tree and logistic regression have been used across a wide range of applications, including medical applications. Ji et al (2009, p. 2), in reporting Andrews' study, emphasise the benefits of the logistic regression and decision tree methods for identifying commonalities and differences in medical database variables. The risk pattern algorithm has also been applied to medical data for patients on ACE inhibitors who have an allergic event (Li et al, 2005). As this project explores the use of a medical dataset to detect adverse drug reactions, it was important to use techniques that are reliable and have proven to work in similar studies.

The techniques differ as follows. Logistic regression is appropriate for variables with two possible values (0, 1) as well as variables with multiple categories. This makes the logistic regression method useful for this study in determining whether the patient's medical details have any association with the patient not recovering from adverse drug reactions. The decision tree is also well suited to binary values, but can be modelled with more than two values as well, and it is easily understood by people because of its tree-like structure and leaf nodes, which can easily be analysed to determine the patterns found. The last algorithm ‘makes use of anti-monotone property to efficiently prune searching space’ (Li et al, 2005). Optimal risk pattern mining returns the highest relative risk pattern among the patterns discovered. This model is easily interpreted and shows the odds ratio, risk ratio and the fields associated with the pattern.
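The odds ratio and risk ratio reported for a pattern can be computed from a two-by-two table of counts. The following Python sketch shows that arithmetic using the standard definitions of the two measures. It is not the risk pattern mining software used in the project, and the counts are hypothetical: "exposed" stands for records that match a pattern and "cases" for patients who did not recover.

```python
def risk_measures(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Return the risk ratio and odds ratio for a 2x2 table of counts."""
    a, b = exposed_cases, exposed_noncases
    c, d = unexposed_cases, unexposed_noncases
    if a + b == 0 or c + d == 0 or b == 0 or c == 0:
        raise ValueError("counts give an undefined ratio (division by zero)")
    risk_exposed = a / (a + b)        # share of pattern-matching records that are cases
    risk_unexposed = c / (c + d)      # share of the remaining records that are cases
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }


# Hypothetical counts: 20 of 100 matching records and 10 of 100 others did not recover.
example = risk_measures(20, 80, 10, 90)
print(example)
```

A risk ratio above 1 means the pattern is associated with a higher proportion of non-recovery than the rest of the data, which is the sense in which the algorithm ranks patterns by relative risk.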




4. SCHEDULE

Activities               Date                       Description
Project Plan             August 25, 2008            Ongoing until end of first semester
SRS Document             August 25, 2008            Ongoing until end of first semester
Test Plan                August 25, 2008            Ongoing until end of first semester
Data preparation         August - October 2008      Cleaning and preparation of dataset
Thesis proposal          November 07, 2008          Research presentation
Modelling                November - December 2008   Modelling of dataset
Proof of concept         March 30, 2009             Produce framework and process description
User documentation       May 12, 2009               User guide for the project
Test Results             May 15, 2009               Test results for the techniques used on the dataset
Final Technical Report   May 30, 2009               Deployment (final report for the data mining project)
Research proposal        June 16, 2009              Final written proposal
Research paper           August 7, 2009             Final
Research presentation    September 4, 2009          Final




REFERENCE

Aggarwal, CC & Parthasarathy, S 2001, Mining massively incomplete data sets by conceptual reconstruction, ACM, San Francisco, California.

Brown, ML & Kros, JF 2003, 'Data mining and the impact of missing data', Industrial Management & Data Systems, vol. 103, pp. 611-621.

Cios, KJ & Moore, GW 2002, 'Uniqueness of medical data mining', Artificial Intelligence in Medicine, vol. 26, no. 1-2, pp. 1-24.

CRISP-DM 2000, Cross Industry Standard Process for Data Mining, viewed 27 August 2008, <http://www.crisp-dm.org/Partners/index.htm>.

Lavrač, N 1999, 'Selected techniques for data mining in medicine', Artificial Intelligence in Medicine, vol. 16, no. 1, pp. 3-23.

Lee, I-N, Liao, S-C & Embrechts, M 2000, 'Data mining techniques applied to medical information', Medical Informatics & the Internet in Medicine, vol. 25, no. 2, pp. 81-102.

Li, J, Fu, AW-c, He, H, Chen, J, Jin, H, McAullay, D, Williams, G, Sparks, R & Kelman, C 2005, Mining risk patterns in medical data, ACM, Chicago, Illinois, USA.

Obenshain, MK 2004, 'Application of Data Mining Techniques to Healthcare Data', Infection Control and Hospital Epidemiology, vol. 25, no. 8, pp. 690-695.

Roddick, JF, Fule, P & Graco, WJ 2003, 'Exploratory medical knowledge discovery: experiences and issues', SIGKDD Explor. Newsl., vol. 5, no. 1, pp. 94-99.

Wang, H & Wang, S 2008, 'Medical knowledge acquisition through data mining', paper presented at the IEEE International Symposium on IT in Medicine and Education (ITME 2008), Xiamen.

Wilson, AM, Thabane, L & Holbrook, A 2003, 'Application of data mining techniques in pharmacovigilance', British Journal of Clinical Pharmacology, vol. 57, no. 2, pp. 127-134.

World Health Organization 2002, Safety of Medicines: A Guide to Detecting and Reporting Adverse Drug Reactions: Why Health Professionals Need to Take Action, WHO publications, viewed 15 April 2008, <http://whqlibdoc.who.int/hq/2002/WHO_EDM_QSM_2002.2.pdf>.

Zhu, X, Khoshgoftaar, T, Davidson, I & Zhang, S 2007, 'Editorial: Special issue on mining low-quality data', Knowledge and Information Systems, vol. 11, no. 2, pp. 131-136.