Unlocking the potential of public available gene expression data for large-scale analysis

Jonatan Taminau

PhD defense, November 2012


Introduction

Diagram labels: Data, Information, Knowledge

In this thesis:

+ Focus on the data-to-information step.
+ Focus on microarray technology.

Introduction

Diagram labels: Data, Information

Data Repositories:

+ Massive amounts
+ Examples: GEO, ArrayExpress
+ Publicly available!

Analysis Software:

+ Commercial: CLC Bio, Spotfire, etc.
+ Free: Bioconductor, GenePattern, Galaxy, etc.
+ A lot of existing research

Introduction

“Although hundreds of thousands of samples are publicly available, and several powerful analysis software solutions exist, the research community is facing a chasm between these two resources.” (Coletta et al., 2012)

“One of the challenges for the future is how to integrate all the DNA microarray data that have been generated and deposited in public databases.” (Larsson et al., 2006)

Introduction

We identified two hurdles for large-scale microarray analysis:

+ Consistent retrieval of individual datasets.
+ Integrative analysis of multiple datasets.


Outline

Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9

Outline

Retrieval of data
+ Problem Statement
+ inSilico DB

Integrative Analysis
+ Problem Statement
+ Meta-Analysis
+ Merging
+ Application

Retrieval of genomic data

Data is online and freely available.

But: it is difficult to consistently retrieve the data (example: Baggerly & Coombes, 2011).

What does it mean?

+ Data retrieval is reproducible and tractable
+ No manual intervention needed
+ All data is preprocessed the same way

Retrieval of genomic data

Typical microarray workflow:

DNA microarray -> Scanner -> Image -> Image Analysis -> CEL file (numerical ‘raw’ data) -> Preprocessing -> Gene expression matrix


Retrieval of genomic data

CEL file (numerical ‘raw’ data) -> Preprocessing -> Gene expression matrix

Complex:

+ normalization/background correction
+ probe-to-gene mapping
+ versioning issues
+ etc.

Not documented!

Only 48% of all data in GEO and ArrayExpress was submitted with raw data (Larsson et al., 2006).

Retrieval of genomic data

+ Features
  + Genes or probes
  + range: 20k-30k
+ Instances
  + Patients, tissues, etc.
  + range: 10-100

Gene Expression Value x_ij:

+ Expression of gene i in sample j
+ range between 2-14
+ log2 scaled


Retrieval of genomic data

What about phenotypical data or meta-data?

+ Extra information about the samples (age, gender, disease, etc.)
+ No standard way of formatting this information
+ MIAME / Ontologies / Free text / etc.
+ Also still an open problem

Retrieval of genomic data

Why is consistent retrieval from public repositories so important?

+ Reproducibility of results
+ Comparison of new results with existing studies
+ Combining different studies

The inSilico Database

+ Result of the InSilico project
+ Innoviris (2007-2012)
+ 8 persons from VUB & ULB
+ Provides consistently preprocessed and expert-curated genomic data
+ Being commercialized

The inSilico Database

What makes the inSilico Database so valuable?

+ Not the fact that all data is precomputed
+ But how it is precomputed

What is the underlying engine?

+ Genomic Pipelines
+ Backbone


The inSilico DB | Genomic Pipelines

For every data type there is a different pipeline.

Microarray pipeline (figure labels: Jobs, Dependencies, Backbone)

The inSilico DB | Backbone

Automatic Workflow System

+ Barely any manual intervention needed
+ Control of intermediate results
+ Pre-computation saves time (for the user)
+ Streamlined error management
+ Automatic monitoring




The inSilico DB | Backbone

How does it work?

+ Java daemon (recently replaced by application server)
+ Configuration files




inSilicoDb package

One thing missing for large-scale analysis...

Programmatic access via scripting

+ Contains the basic functionality of InSilico DB
+ Makes automatic retrieval of data possible!
+ Seamlessly integrates with other Bioconductor analysis tools
+ Published in Bioinformatics, downloaded more than 2000 times


Integrative Analysis

Combining the information of multiple, independent but related studies in order to extract more general and more reliable results.

Problem: How to do it?

Two approaches:

+ Meta-Analysis
+ Merging

Integrative Analysis

Figure panels: Merging, Meta-Analysis

Meta-Analysis

+ Combining p-values
+ Combining effect sizes
+ Combining ranks
+ Vote counting
+ etc.

+ Depends on goal
+ Much focus on finding DEGs (differentially expressed genes)
+ Defines what the results look like

+ Consistent retrieval is essential!
+ inSilicoDb package
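
As one concrete instance of "combining p-values", Fisher's method sums -2 ln(p) over k independent studies and compares the result with a chi-squared distribution with 2k degrees of freedom. The sketch below assumes independent studies; the function name and inputs are illustrative and do not come from the thesis or from the inSilicoDb package.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-squared distribution
    with 2k degrees of freedom under the null hypothesis. For an even number
    of degrees of freedom the survival function has a closed form, so no
    statistics library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    # P(chi2 with 2k dof > X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total
```

With a single study the function returns the input p-value unchanged; with several studies that each show weak evidence, the combined p-value can become smaller than any individual one.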

Meta-Analysis | Stable Genes

365 studies were screened for stable genes.

Motivation:

+ Interested in reference genes
+ Currently used genes (housekeeping genes) are not ideal
+ Need a compact and diverse list of genes that are stable under most conditions

In collaboration with Dr Bram de Craene (VIB-UGent)


Meta-Analysis | Stable Genes

(1) Retrieve Data

+ inSilicoDb package
+ All 365 datasets downloaded in less than 100 min

(2) Calculate Stability Scores

+ For each gene: Coefficient of Variation (CV) = sd / mean
+ Avoid lowly expressed genes

(3) Combine Stability Scores

+ For each gene take the median of CVs
+ Rank and take top 100

(4) Semantic Similarity Filtering

+ Exclude genes that are related
+ Uses gene annotation from GO
+ Innovative step!
+ From 100 to 10 genes
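
Steps (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis code: the data layout (one dict per dataset), the function names, and the `min_mean` cutoff used to skip lowly expressed genes are all hypothetical.

```python
from statistics import mean, median, stdev

def coefficient_of_variation(values):
    """sd / mean for one gene's expression values within one dataset."""
    return stdev(values) / mean(values)

def rank_stable_genes(datasets, min_mean=0.0, top_n=100):
    """Rank genes by the median of their per-dataset CVs (lowest first).

    datasets: list of dicts mapping gene -> list of expression values.
    min_mean: hypothetical cutoff; a gene's CV in a dataset is skipped
              when its mean expression there is at or below this value.
    """
    cvs = {}
    for data in datasets:
        for gene, values in data.items():
            if len(values) < 2 or mean(values) <= min_mean:
                continue
            cvs.setdefault(gene, []).append(coefficient_of_variation(values))
    # Combine the per-dataset scores with the median, then rank.
    scores = {gene: median(v) for gene, v in cvs.items()}
    return sorted(scores, key=scores.get)[:top_n]
```

The median is used to combine scores because it is less sensitive than the mean to a single dataset in which a gene happens to vary strongly.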

Meta-Analysis | Stable Genes

Status:

+ August 2012 | waiting for results…
+ September 2012 | first positive results!
+ November 2012 | second test case, positive feedback from NAR, manuscript in preparation…




Merging

+ Consistent retrieval is essential!
+ inSilicoDb package

+ Batch effects
+ Methods to remove
  - Location-scale
  - Matrix Factorization
  - Discretization
+ Makes data compatible
+ Preprocessing not sufficient

+ Same as with single studies
+ Increased sample size!
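
To make the "location-scale" family concrete, the sketch below centers and scales every gene within each study before concatenating the samples. It is the simplest member of that family and is shown only as an illustration; it is not claimed to be what the inSilicoMerging package implements, and the function names and data layout are assumptions.

```python
from statistics import mean, pstdev

def standardize_batch(matrix):
    """Center and scale every gene (row) within one batch.

    matrix: list of rows, one row per gene, one column per sample.
    Rows with zero spread are centered only.
    """
    adjusted = []
    for row in matrix:
        mu, sigma = mean(row), pstdev(row)
        scale = sigma if sigma > 0 else 1.0
        adjusted.append([(x - mu) / scale for x in row])
    return adjusted

def merge_batches(batches):
    """Standardize each batch separately, then join samples gene by gene.

    Assumes all batches list the same genes in the same row order.
    """
    n_genes = len(batches[0])
    if any(len(b) != n_genes for b in batches):
        raise ValueError("all batches must contain the same genes")
    standardized = [standardize_batch(b) for b in batches]
    return [sum((b[g] for b in standardized), []) for g in range(n_genes)]
```

One caveat of this simple variant: if the proportion of normal and tumor samples differs strongly between studies, per-study centering can also remove part of the biological signal.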

Merging | Batch Effects

Illustrative example of what batch effects can cause:

+ We merged 4 different studies with thyroid samples
+ All studies contained normal and tumor samples
+ In collaboration with Wilma Van Staveren (IRIBHM, ULB)

+ Samples are plotted in MDS space
+ We expect two clusters

Merging | Batch Effects

Figure panels: merging without batch effect removal; merging with batch effect removal

Legend:

+ symbol for study
+ color for normal/tumor

inSilicoMerging package

R/Bioconductor package combining:

+ 6 different merging methods
+ 5 visual inspection tools
+ 6 quantitative measures

Only resource so far combining all this functionality!

Seamlessly integrates with inSilicoDb package


Identification of DEGs in Lung Cancer

+ Idea: compare meta-analysis and merging approaches for integrative analysis
+ We used lung cancer as a case study, based on the content of inSilico DB.
+ Ignore subtypes: DEGs can be seen as playing a role in the basic mechanisms of lung cancer

Identification of DEGs in Lung Cancer

What is our hypothesis?

+ Due to the small sample sizes of individual studies there are a lot of False Negatives when using meta-analysis
+ Can we avoid this by using merging as an alternative approach?

Identification of DEGs in Lung Cancer

Constraints:

+ fRMA preprocessed
+ > 30 samples
+ both normal and tumor
+ GPL96 or GPL570

Methodology:

+ apply limma
  - p-value < 0.05
  - FC > 2
+ robustness test
  - 100 iterations with 90% of data
  - resampling

Merging:

+ inSilicoMerging package

Meta-Analysis:

+ take intersection
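
The methodology above can be sketched as follows. Several points are assumptions made for illustration only: limma is replaced by a plain Welch statistic with a normal approximation, "FC > 2" is read as an absolute log2 fold change above 1, a gene counts as robust only if it is called in every one of the 100 subsamples, and the intersection is taken across studies. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance

def call_degs(expr, labels, p_cut=0.05, log2fc_cut=1.0):
    """Return genes passing both a p-value and a fold-change cutoff.

    expr: dict gene -> list of log2 expression values (one per sample).
    labels: list of "normal"/"tumor", aligned with the sample order.
    Uses a Welch statistic with a normal approximation, not limma.
    """
    tumor = [i for i, lab in enumerate(labels) if lab == "tumor"]
    normal = [i for i, lab in enumerate(labels) if lab == "normal"]
    if len(tumor) < 2 or len(normal) < 2:
        return set()
    hits = set()
    for gene, values in expr.items():
        t = [values[i] for i in tumor]
        n = [values[i] for i in normal]
        diff = mean(t) - mean(n)  # log2 fold change
        se = math.sqrt(variance(t) / len(t) + variance(n) / len(n))
        if se == 0:
            continue
        p = 2 * (1 - NormalDist().cdf(abs(diff) / se))
        if p < p_cut and abs(diff) > log2fc_cut:
            hits.add(gene)
    return hits

def robust_degs(expr, labels, iterations=100, fraction=0.9, seed=0):
    """Keep only genes called in every random subsample of the samples."""
    rng = random.Random(seed)
    n = len(labels)
    size = max(1, int(round(fraction * n)))
    kept = None
    for _ in range(iterations):
        idx = rng.sample(range(n), size)
        sub_expr = {g: [v[i] for i in idx] for g, v in expr.items()}
        sub_labels = [labels[i] for i in idx]
        hits = call_degs(sub_expr, sub_labels)
        kept = hits if kept is None else kept & hits
    return kept or set()

def meta_analysis_degs(studies, **kwargs):
    """Call robust DEGs per study and keep genes found in all studies."""
    per_study = [robust_degs(e, l, **kwargs) for e, l in studies]
    return set.intersection(*per_study) if per_study else set()
```

For the merging branch, `robust_degs` would be applied once to the merged expression matrix; for the meta-analysis branch, `meta_analysis_degs` applies it to each study and intersects the resulting gene lists.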

Identification of DEGs in Lung Cancer

Meta-Analysis: [figure]

Identification of DEGs in Lung Cancer

Merging: [figure]

Identification of DEGs in Lung Cancer

Findings:

+ Resampling helps to remove false positives
+ Relatively low impact of batch effect removal methods
+ More DEGs identified through merging (102) than via meta-analysis (25)

Deriving separate statistics and then averaging is often less powerful than directly computing statistics from aggregated data. (Xu et al., 2008)

No False Positives?

+ checked literature
+ initial pathway analysis

Outline

Retrieval of data
+ Problem Statement
+ inSilico DB

Integrative Analysis
+ Problem Statement
+ Meta-Analysis
+ Merging
+ Application

+ Contributions
+ Conclusions

Contributions

+ Genomic pipelines / backbone (Ch 4)
+ Release of 2 publicly available R/Bioconductor packages (Ch 4 & 7)
+ Survey of batch effect removal methods (Ch 7)
+ Two applications
  + Identification of stable genes via meta-analysis (Ch 6)
  + Screening of potential biomarkers via integrative analysis (Ch 8)

Conclusions

We identified two hurdles for large-scale microarray analysis:

+ Consistent retrieval of individual datasets.
+ Integration of multiple datasets for integrative analysis.


Conclusions

+ Consistent retrieval of individual datasets: inSilicoDb package
+ Integration of multiple datasets for integrative analysis: inSilicoMerging package

Paving the road towards unlocking the potential of publicly available gene expression studies


Thanks!

+ InSilico Team!
+ Jury!
+ Audience!
+ Yann-Michaël!