Analysing Microarray Data Using Bayesian Network Learning


Name: Phirun Son

Supervisor: Dr. Lin Liu

Contents

- Aims
- Microarrays
- Bayesian Networks
- Classification
- Methodology
- Results

Aims and Goals

- Investigate the suitability of Bayesian networks for the analysis of microarray data
- Apply Bayesian learning to microarray data for classification
- Compare with other classification techniques

Microarrays

- An array of microscopic dots, each representing the expression level of a gene
- Gene expression is the process by which genes in DNA are transcribed into RNA
- Short sections of genes are attached to a surface such as glass or silicon
- The array is treated with dyes to obtain expression levels

Challenges of Microarray Data

- Very large number of variables, low number of samples
- Data is noisy and incomplete
- Standardisation of data formats:
  - MGED
  - MIAME, MAGE-ML, MAGE-TAB
  - ArrayExpress, GEO, CIBEX

Bayesian Networks

- Represents the conditional independencies of random variables
- Two components (illustrated below):
  - Directed Acyclic Graph (DAG)
  - Probability table
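
As a concrete illustration of these two components, a minimal sketch in MATLAB, assuming the Bayes Net Toolbox named under Methodology is on the path; the two-node Smoke -> Cancer network and all of its probabilities are invented for the example, not part of the project:

  % Minimal sketch, assuming the Bayes Net Toolbox (Murphy, 2001) is on
  % the MATLAB path. The network and its numbers are illustrative only.
  N = 2; Smoke = 1; Cancer = 2;             % node indices
  dag = zeros(N); dag(Smoke, Cancer) = 1;   % component 1: the DAG
  ns = [2 2];                               % both nodes binary (1 = no, 2 = yes)
  bnet = mk_bnet(dag, ns);
  % Component 2: the probability tables. BNT orders CPT dimensions as
  % (parents, child), with earlier dimensions varying fastest.
  bnet.CPD{Smoke}  = tabular_CPD(bnet, Smoke,  [0.7 0.3]);           % P(Smoke)
  bnet.CPD{Cancer} = tabular_CPD(bnet, Cancer, [0.95 0.6 0.05 0.4]); % P(Cancer | Smoke)
  % Query P(Cancer | Smoke = yes) with the junction-tree engine:
  engine = jtree_inf_engine(bnet);
  ev = cell(1, N); ev{Smoke} = 2;
  engine = enter_evidence(engine, ev);
  marg = marginal_nodes(engine, Cancer);
  disp(marg.T)   % posterior distribution over Cancer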

Methodology

- Create a program to test the accuracy of classification (one fold of such a test is sketched after this list)
- Written in MATLAB using the Bayes Net Toolbox (Murphy, 2001) and the Structure Learning Package (Leray, 2004)
- Uses a naive network structure, K2 structure learning, and a pre-determined structure
- Test the program on synthetic data
- Test the program on real data
- Compare Bayes net and decision tree classification
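
The sketch below shows what one cross-validation fold of such an accuracy test might look like for the naive structure; the function name, the nodes-by-cases data layout, and the index vectors are assumptions for illustration, not the project's actual code:

  % Hedged sketch of one cross-validation fold with a naive Bayes structure,
  % using core BNT calls only. 'data' is a nodes-by-cases matrix of discrete
  % states, 'cls' is the class node, and the index vectors pick the cases.
  function acc = nb_fold_accuracy(data, ns, cls, train_idx, test_idx)
    nnodes = size(data, 1);
    dag = zeros(nnodes);
    dag(cls, setdiff(1:nnodes, cls)) = 1;   % naive structure: class -> features
    bnet = mk_bnet(dag, ns);
    for i = 1:nnodes
      bnet.CPD{i} = tabular_CPD(bnet, i);   % random CPTs, re-estimated below
    end
    bnet = learn_params(bnet, data(:, train_idx));  % ML parameter estimation
    engine = jtree_inf_engine(bnet);
    correct = 0;
    for t = test_idx(:)'
      ev = num2cell(data(:, t));
      ev{cls} = [];                         % hide the true class label
      eng = enter_evidence(engine, ev);
      marg = marginal_nodes(eng, cls);
      [~, pred] = max(marg.T);              % predict the most probable class
      correct = correct + (pred == data(cls, t));
    end
    acc = correct / numel(test_idx);
  end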

Synthetic Data

- Data created from well-known example Bayesian networks
- Asia network, car network, and ALARM network
- Samples generated from each network (see the sketch after this list)
- Tested with the naive structure, the known structure, and structure learning
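
A sketch of the sample-generation and K2 structure-learning steps, assuming 'bnet' is one of the example networks built with BNT as in the earlier sketch; the sample size matches the 50-sample runs below:

  % Draw a synthetic data set from a known all-discrete network 'bnet',
  % then re-learn a structure from it with the K2 algorithm.
  nnodes = length(bnet.dag);
  nsamples = 50;
  data = zeros(nnodes, nsamples);
  for m = 1:nsamples
    data(:, m) = cell2num(sample_bnet(bnet));   % one fully observed case
  end
  % K2 needs a node ordering; this assumes the node numbering is already
  % topological with respect to the true DAG.
  dag_k2 = learn_struct_K2(data, bnet.node_sizes, 1:nnodes);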

Synthetic Data - Results

Asia Network
Lauritzen and Spiegelhalter, 'Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems', 1988, p. 164

50 Samples, 10 Folds, 100 Iterations; Class Node: Dyspnoea

               Correct
  Naive        81.0%
  K2 Learning  83.4%
  Known Graph  85.0%

100 Samples, 10 Folds, 50 Iterations; Class Node: Dyspnoea

               Correct
  Naive        83.1%
  K2 Learning  84.3%
  Known Graph  85.1%

Synthetic Data - Results

Car Network
Heckerman et al., 'Troubleshooting under Uncertainty', 1994, p. 13

50 Samples, 10 Folds, 100 Iterations; Class Node: Engine Starts

               Correct
  Naive        53.5%
  K2 Learning  58.3%
  Known Graph  62.4%

100 Samples, 10 Folds, 50 Iterations; Class Node: Engine Starts

               Correct
  Naive        56.5%
  K2 Learning  58.7%
  Known Graph  61.2%

Synthetic Data - Results

ALARM Network
- 37 Nodes, 46 Connections
- Beinlich et al., 'The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks', 1989

50 Samples, 10 Folds, 10 Iterations; Class Node: InsufAnesth

               Correct
  Naive        72.4%
  K2 Learning  78.7%
  Known Graph  89.6%

50 Samples, 10 Folds, 10 Iterations; Class Node: Hypovolemia

               Correct
  Naive        69.0%
  K2 Learning  77.8%
  Known Graph  93.6%

Lung Cancer Data Set

- Publicly available data sets:
  - Harvard: Bhattacharjee et al., 'Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses', 2001
    - 11,657 attributes, 156 instances, Affymetrix
  - Michigan: Beer et al., 'Gene-Expression Profiles Predict Survival of Patients with Lung Adenocarcinoma', 2002
    - 6,357 attributes, 96 instances, Affymetrix
  - Stanford: Garber et al., 'Diversity of Gene Expression in Adenocarcinoma of the Lung', 2001
    - 11,985 attributes, 46 instances, cDNA
    - Contains missing values



Feature Selection

- Li (2009) provides a feature-selected set of 90 attributes
  - Selected using WEKA feature selection
  - Also allows comparison with decision-tree-based classification
- Discretised the data in 3 forms (sketched after this list):
  - Undetermined values left unknown
  - Undetermined values put into either category (two-category)
  - Undetermined values put into a separate category (three-category)
- WEKA: Ian H. Witten and Eibe Frank, 'Data Mining: Practical machine learning tools and techniques', 2005
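
A minimal sketch of the three discretisation forms listed above; the expression cut-offs lo/hi and the nearest-category rule used for the two-category form are assumptions for illustration, not the project's actual values:

  % Hedged sketch: discretise one gene's expression values three ways.
  % x: vector of expression values; lo/hi: illustrative cut-offs.
  function d = discretise(x, lo, hi, scheme)
    d = nan(size(x));
    d(x <= lo) = 1;                 % under-expressed
    d(x >= hi) = 2;                 % over-expressed
    mid = (x > lo) & (x < hi);      % the undetermined band
    switch scheme
      case 'unknown'                % form 1: leave undetermined as missing
        % nothing to do; NaN marks an unknown value
      case 'two-cat'                % form 2: force into the nearer category
        d(mid) = 1 + (x(mid) > (lo + hi) / 2);
      case 'three-cat'              % form 3: a separate third category
        d(mid) = 3;
    end
  end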

Harvard Set
(Counts are correctly classified test instances; DT = decision tree.)

Harvard Training on Michigan (of 96 instances):

                       MATLAB      WEKA        DT
  2-Cat -> 2-Cat NF    95 (99.0%)  95 (99.0%)  95 (99.0%)
  2-Cat -> 2-Cat F     94 (97.9%)  93 (96.9%)  92 (95.8%)
  3-Cat -> 3-Cat NF    94 (97.9%)  95 (99.0%)  94 (97.9%)
  3-Cat -> 3-Cat F     88 (91.7%)  95 (99.0%)  94 (97.9%)

Harvard Training on Stanford (of 46 instances):

                       MATLAB      WEKA        DT
  2-Cat -> 2-Cat NF    41 (89.1%)  46 (100%)   43 (93.5%)
  2-Cat -> 2-Cat F     41 (89.1%)  45 (97.8%)  36 (78.3%)
  3-Cat -> 3-Cat NF    41 (89.1%)  46 (100%)   42 (91.3%)
  3-Cat -> 3-Cat F     41 (89.1%)  46 (100%)   42 (91.3%)

Michigan Set

Michigan Training on Harvard (of 156 instances):

                       MATLAB       WEKA         DT
  2-Cat -> 2-Cat NF    150 (96.2%)  154 (98.7%)  153 (98.1%)
  2-Cat -> 2-Cat F     144 (92.3%)  153 (98.1%)  150 (96.2%)
  3-Cat -> 3-Cat NF    145 (92.9%)  153 (98.1%)  153 (98.1%)
  3-Cat -> 3-Cat F     140 (89.7%)  152 (97.4%)  153 (98.1%)

Michigan Training on Stanford (of 46 instances):

                       MATLAB      WEKA        DT
  2-Cat -> 2-Cat NF    41 (89.1%)  46 (100%)   41 (89.1%)
  2-Cat -> 2-Cat F     41 (89.1%)  46 (100%)   40 (87.0%)
  3-Cat -> 3-Cat NF    41 (89.1%)  45 (97.8%)  39 (84.8%)
  3-Cat -> 3-Cat F     41 (89.1%)  46 (100%)   39 (84.8%)

Stanford Set

Stanford Training on Harvard (of 156 instances):

                       MATLAB       WEKA         DT
  2-Cat -> 2-Cat NF    139 (89.1%)  153 (98.1%)  139 (89.1%)
  2-Cat -> 2-Cat F     139 (89.1%)  150 (96.2%)  124 (79.5%)
  3-Cat -> 3-Cat NF    139 (89.1%)  150 (96.2%)  154 (98.7%)
  3-Cat -> 3-Cat F     139 (89.1%)  150 (96.2%)  152 (97.4%)

Stanford Training on Michigan (of 96 instances):

                       MATLAB      WEKA        DT
  2-Cat -> 2-Cat NF    86 (89.6%)  95 (99.0%)  86 (89.6%)
  2-Cat -> 2-Cat F     86 (89.6%)  92 (95.8%)  72 (75.0%)
  3-Cat -> 3-Cat NF    86 (89.6%)  95 (99.0%)  94 (97.9%)
  3-Cat -> 3-Cat F     86 (89.6%)  95 (99.0%)  91 (94.8%)

Future Work

- Use structure learning for Bayesian classifiers
- Increase the amount of homogeneous data
- Other methods of classification