Big Data in Drug Discovery

David J. Wild
Assistant Professor & Director, Cheminformatics Program
Indiana University School of Informatics and Computing
djwild@indiana.edu
http://djwild.info




Big Data in Drug Discovery
David Wild, July 2010. http://djwild.info.

Epochs in drug discovery

Empirical (up until the 1960s)
  754: first pharmacy opened in Baghdad
  Late 1800s: major pharmaceutical companies, mass production
  1900-1960: major discoveries (insulin, penicillin, the pill, …)

Rational (1960s to 1990s)
  Designing molecules to target protein active sites
  “Lock and key”
  Computational drug discovery
  Biggest success: HIV (RT, protease inhibitors)

Big Experiment (1990s to 2000s)
  High-throughput screening
  Microarray assays
  Gene sequencing and the Human Genome Project

Big Data (2010s onwards)
  Informatics-driven drug discovery
  Accepting that the body is amazingly complex and we don’t understand it well
  Everything is connected



The metabolic pathways of a single cell

David Wild, December 2009. http://djwild.info

The inner life of the cell

http://video.google.com/videoplay?docid=-2351549868099343381&hl=en#

Big Data in the public domain


There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters there is in the public domain information on:

  69 million compounds and 449,392 bioassays (PubChem)
  4,763 drugs (DrugBank)
  9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
  14 million human nucleotide sequences (EMBL)
  19 million life science publications - 800,000 new each year (PubMed)
  Multitude of other sets (drugs, toxicogenomics, chemogenomics, SAR, …)



Even more important are the relationships between these entities. For example, a chemical compound can be linked to a gene or a protein target in a multitude of ways:

  Biological assay with percent inhibition, IC50, etc.
  Crystal structure of ligand/protein complex
  Co-occurrence in a paper abstract
  Computational experiment (docking, predictive model)
  Statistical relationship
  System association (e.g. involved in the same pathways or cellular processes)
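
One way to picture the list above is as typed links between a compound and a target, each carrying the kind of evidence behind it. The sketch below is only an illustration: the `Evidence` and `Link` names, the identifiers, and the source strings are all hypothetical and do not come from any of the databases mentioned.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    """Kinds of compound-target relationship named on the slide."""
    BIOASSAY = "biological assay"
    CRYSTAL_STRUCTURE = "crystal structure"
    CO_OCCURRENCE = "co-occurrence in an abstract"
    COMPUTATIONAL = "computational experiment"
    STATISTICAL = "statistical relationship"
    SYSTEM = "system association"


@dataclass(frozen=True)
class Link:
    compound: str       # placeholder identifier, not a real CID
    target: str         # placeholder identifier, not a real gene/protein
    evidence: Evidence
    source: str         # free-text provenance, hypothetical here


def evidence_by_pair(links):
    """Group the distinct evidence types seen for each (compound, target) pair."""
    grouped = defaultdict(set)
    for link in links:
        grouped[(link.compound, link.target)].add(link.evidence)
    return dict(grouped)


links = [
    Link("compound:A", "target:T1", Evidence.BIOASSAY, "assay record (hypothetical)"),
    Link("compound:A", "target:T1", Evidence.CO_OCCURRENCE, "abstract (hypothetical)"),
    Link("compound:B", "target:T1", Evidence.COMPUTATIONAL, "docking run (hypothetical)"),
]
summary = evidence_by_pair(links)
for pair, kinds in summary.items():
    print(pair, sorted(k.value for k in kinds))
```

A pair supported by several independent evidence types is, under this view, a better-attested relationship than one supported by a single type.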

PubChem growth since 2005

[Chart: PubChem Substance Size 2005-2010. Monthly axis from 2005-01 to 2010-07; vertical axis 0 to 80,000,000. Labeled values: 2,824,265; 35,379,748; 56,774,950; 69,088,100. Annotation: addition of ChemSpider.]

[Chart: PubChem Bioassays 2005-2010. Axis from 2005-01 to 2010-07; logarithmic vertical axis 1 to 1,000,000. Labeled value: 434,635. Annotation: addition of ChEMBL.]

Large amount of data and links for each compound

Proteins & Genes

http://www.genome.jp/en/db_growth.html

Chem2Bio2RDF: The FaceBook of Drug Discovery

You are a big pile of data too!

Large-scale predictive modeling adds even more data

Chen, B. and Wild, D.J. PubChem BioAssays as a data source for predictive models, Journal of Molecular Graphics and Modelling. 2010; 28, 420-426.

Informatics-based drug discovery

Predicting new molecular targets for known drugs. Nature 462, 175-181 (12 November 2009)

“Systems chemical biology” and chemogenomics

Recent enabling technologies for SCB / Chemogenomics

Cloud computing allows processing and data mining on a vast scale

Integrative cheminformatics & bioinformatics connects compounds, targets, genes, pathways, diseases and side effects

Health informatics (PHRs and EHRs) allows integration of the molecular and patient models (QP)

Semantic technologies and complex systems tools allow seamless integration and human-scale data mining

Analysis: Visualization, projection, data mining, hypothesis generation, network tools

Integration: RDF, XML, Triple Stores, Ontologies, SPARQL, Graph algorithms

Access: Web Services, RPC, Information extraction

ChemBioGrid.org: Web service infrastructure for cheminformatics

Dong, X., Gilbert, K.E., Guha, R., Heiland, R., Kim, J., Pierce, M.E., Fox, G.C. and Wild, D.J. Web service infrastructure for chemoinformatics, J. Chem. Inf. Model., 2007; 47(4) pp 1303-1307.

The Semantic Web: meaning & relationships

Chem2Bio2RDF: RDF integration & SPARQL querying

Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11, 255

Chem2Bio2RDF context

Chem2Bio2RDF Relationships

Linked Open Data Cloud (linkeddata.org)

Converting data into RDF
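
As a rough illustration of what such a conversion involves, the sketch below turns tabular (compound, relation, target) rows into N-Triples lines. The `example.org` namespace, the URI layout, and the row values are assumptions made for this example; they are not the URIs or vocabulary that Chem2Bio2RDF actually uses.

```python
# Hypothetical namespace, for illustration only.
BASE = "http://example.org/c2b2r/"


def to_ntriples(rows):
    """Turn (compound_id, relation, target_id) rows into N-Triples lines."""
    lines = []
    for compound_id, relation, target_id in rows:
        subject = f"<{BASE}compound/{compound_id}>"
        predicate = f"<{BASE}vocab/{relation}>"
        obj = f"<{BASE}target/{target_id}>"
        # An N-Triples statement is subject, predicate, object, then " ."
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


# Placeholder rows standing in for records exported from a relational source.
rows = [
    ("CID_0001", "active_against", "TARGET_A"),
    ("CID_0001", "active_against", "TARGET_B"),
]
triples = to_ntriples(rows)
for line in triples:
    print(line)
```

Once every source is expressed as triples that share identifiers, records from different databases can be joined simply by matching URIs.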

Finding multi-target inhibitors of the MAPK pathway with a SPARQL query
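
The query on this slide is not reproduced in the text, so the following is only a plain-Python stand-in for the kind of join such a SPARQL query would express: find compounds that inhibit at least two proteins belonging to one pathway. The predicate names (`inhibits`, `member_of`) and all identifiers are hypothetical toy data.

```python
from collections import defaultdict

# Toy triples with placeholder identifiers; not real Chem2Bio2RDF data.
TRIPLES = [
    ("compound:A", "inhibits", "protein:P1"),
    ("compound:A", "inhibits", "protein:P2"),
    ("compound:B", "inhibits", "protein:P1"),
    ("compound:C", "inhibits", "protein:P3"),
    ("protein:P1", "member_of", "pathway:MAPK"),
    ("protein:P2", "member_of", "pathway:MAPK"),
    ("protein:P3", "member_of", "pathway:other"),
]


def multi_target_inhibitors(triples, pathway, min_targets=2):
    """Return compounds inhibiting at least min_targets proteins in the pathway."""
    # First pattern: which proteins belong to the pathway?
    pathway_proteins = {s for s, p, o in triples if p == "member_of" and o == pathway}
    # Second pattern: which compounds inhibit those proteins?
    hits = defaultdict(set)
    for s, p, o in triples:
        if p == "inhibits" and o in pathway_proteins:
            hits[s].add(o)
    # Filter step: keep compounds with enough distinct targets.
    return {c: sorted(t) for c, t in hits.items() if len(t) >= min_targets}


result = multi_target_inhibitors(TRIPLES, "pathway:MAPK")
print(result)
```

In SPARQL the same idea would be two triple patterns sharing a protein variable, grouped by compound and filtered on the count of distinct proteins.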

Finding compounds with similar polypharmacology using SPARQL

Projecting queries into chemical space

GTM / MDS projection and embedding of all PubChem using clouds

Plotting and embedding unknown compounds with SCB property labels

Dynamic querying and projection into chemical space

Projecting queries into chemical space

Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL

“Doppler Radar Plot”: Kinase Specificity

Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL

Chem2Bio2RDF Dashboard: finding paths

Pathfinder

Nodes shown: Dexamethasone, Triamcinolone, NFKB1, Glucocorticoid Receptor

http://ella.slis.indiana.edu/~yuysun/flex/pathfinder.html

Dynamic exploration with clouds and Cytoscape

Virtuoso runs Chem2Bio2RDF queries on the cloud

Cytoscape plugins give access to Chem2Bio2RDF, LPG and chemical structure visualization

Dynamic exploration in Cytoscape

Hydrocortisone - Dexamethasone links

Fig. Use case 1. Network diagram of the paths obtained between Hydrocortisone and Dexamethasone using ChemBioScape. The DrugBank interaction data contains information about every drug’s target. In this case, DB00741 and DB01234 share common targets through several different DrugBank interaction IDs.

Tolcapone and Entacapone links


Fig. Use case 2. Tolcapone and Entacapone are connected to each other through DrugBank interactions 2348 and 1962. Also, the two drugs appear in PubMed articles 8119326 and 8223912 via their CID (Compound ID).


Isoniazid and Ethionamide: replicate paper results

Banerjee, A., Dubnau, E., Quemard, A., Balasubramanian, V., Um, K., Wilson, T., et al.: inhA, a gene encoding a target for isoniazid and ethionamide in Mycobacterium tuberculosis. Science, 263(5144), 227-230 (1994).


WENDI: exploring compound knowledge space

Doxorubicin (anthracycline antibiotic)


WENDI v1.0 - insights from the literature

WENDI v2.0 - Automated reasoning with RDF

  Simple OWL ontology for relationships
  Large RDF network expands out from Query
  RDF inference engines applied & results filtered / prioritized


[Diagram: RDF network expanding out from the QUERY compound. Node labels: CID 86427, CID 8642, AID 328, PubMed 12856, HER2, Breast Cancer (three nodes). Edge labels: similar_to, active_against, contains_term, contained_in, predicted_inactive_against.]
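
The expand, infer, and prioritize steps above can be sketched in a few lines. This is not WENDI's reasoner or an OWL inference engine: it is a minimal rule-chaining illustration in which each rule is a chain of predicates followed outward from the query. The rule set and the toy identifiers are hypothetical; only the predicate names are borrowed from the diagram.

```python
# Toy triples; identifiers are placeholders, not real PubChem or PubMed records.
TRIPLES = [
    ("query", "similar_to", "compound:X1"),
    ("query", "similar_to", "compound:X2"),
    ("compound:X1", "active_against", "assay:A1"),
    ("assay:A1", "contains_term", "disease:D1"),
    ("compound:X2", "contained_in", "paper:P1"),
    ("paper:P1", "contains_term", "disease:D1"),
    ("paper:P1", "contains_term", "gene:G1"),
]

# Hypothetical rules: a reason label mapped to the predicate chain to follow.
RULES = {
    "similar compound is active in an assay mentioning the term":
        ("similar_to", "active_against", "contains_term"),
    "similar compound appears in a paper mentioning the term":
        ("similar_to", "contained_in", "contains_term"),
}


def follow(triples, nodes, predicate):
    """One expansion step: objects reachable from nodes via predicate."""
    return {o for s, p, o in triples if s in nodes and p == predicate}


def infer(triples, query, rules):
    """Collect inferred terms for the query and rank them by evidence count."""
    evidence = {}
    for reason, chain in rules.items():
        frontier = {query}
        for predicate in chain:
            frontier = follow(triples, frontier, predicate)
        for term in frontier:
            evidence.setdefault(term, []).append(reason)
    # Prioritize terms supported by more rules; break ties by name.
    return sorted(evidence.items(), key=lambda kv: (-len(kv[1]), kv[0]))


ranked = infer(TRIPLES, "query", RULES)
for term, reasons in ranked:
    print(term, len(reasons), reasons)
```

Here the disease term is reached by both rules and so ranks above the gene term, which is reached by only one.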

WENDI OWL/RDF Network & Inferred Associations

Semantic text mining of journal articles

Jiao, D. and Wild, D.J. Extraction of CYP Chemical Interactions from Biomedical Literature Using Natural Language Processing Methods, Journal of Chemical Information and Modeling, 49(2); pp 263-269

Chemical & Biological Literature Extraction

Validating topics by experimental relationships

Topic 26: cell, expression, cancer, tumor, …

Related Disease: DNA Damage, Melanoma, Glioblastoma, …

[Figure legend: unproved link; link proved by c2b2r_chemogenomics; Target; Drug]

Bio-LDA III

Entropy
  In information theory, entropy is a measure of the uncertainty associated with a random variable.
  Here we can compute the bio-term entropies over topics

Kullback-Leibler divergence (KL divergence)
  A non-symmetric measure of the difference between two probability distributions.
  Here we used the KL divergence as the non-symmetric distance measure for two bio-terms over topics
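
Both measures are short to compute once each bio-term is represented as a probability distribution over topics. The sketch below assumes such distributions are already available as normalized lists; the numbers are made up for illustration and do not come from the Bio-LDA model. Logs are taken base 2, so results are in bits, and the `eps` floor for zero probabilities is a choice made here, not something stated on the slide.

```python
import math


def entropy(p):
    """Shannon entropy H(p) = -sum p_i * log2(p_i), in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) = sum p_i * log2(p_i / q_i), in bits.

    q_i is floored at eps to avoid division by zero (an assumption of this sketch).
    """
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


# Made-up topic distributions for two hypothetical bio-terms over three topics.
term_a = [0.5, 0.25, 0.25]
term_b = [0.8, 0.1, 0.1]

print("H(a)    =", entropy(term_a))
print("H(b)    =", entropy(term_b))
print("D(a||b) =", kl_divergence(term_a, term_b))
print("D(b||a) =", kl_divergence(term_b, term_a))
```

A term concentrated on few topics has low entropy, and the two directions of the divergence differ, which is why it is described as a non-symmetric distance.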

Combining path finding and Bio-LDA

Detect semantic association
  Path finding algorithm
  Millions of RDF triples from Chem2Bio2RDF

Assess semantic association
  Bio-LDA model
  Entropy and KL divergence
  Additional knowledge base: 50, 100 and 200 topics using the recent 336,899 MEDLINE abstracts, which contain 13,338 identical bio-terms
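
The detection step can be illustrated with a breadth-first search over triples. The slides do not say which path finding algorithm is used, so this is only a generic sketch: edges are treated as undirected, paths are limited to a fixed number of hops, and the identifiers and predicate names are toy placeholders.

```python
from collections import deque

# Toy triples with placeholder identifiers; not real Chem2Bio2RDF data.
TRIPLES = [
    ("drug:D1", "binds", "target:T1"),
    ("drug:D2", "binds", "target:T1"),
    ("drug:D1", "affects", "gene:G1"),
    ("drug:D2", "affects", "gene:G1"),
    ("drug:D3", "binds", "target:T2"),
]


def find_paths(triples, start, goal, max_hops=3):
    """Enumerate simple paths from start to goal, shortest first.

    Each path alternates nodes and predicates: [node, predicate, node, ...].
    """
    neighbours = {}
    for s, p, o in triples:
        # Treat every triple as an undirected edge labelled by its predicate.
        neighbours.setdefault(s, []).append((p, o))
        neighbours.setdefault(o, []).append((p, s))
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) // 2 >= max_hops:   # number of edges used so far
            continue
        for predicate, nxt in neighbours.get(node, []):
            if nxt not in path[::2]:     # do not revisit a node on this path
                queue.append(path + [predicate, nxt])
    return paths


paths = find_paths(TRIPLES, "drug:D1", "drug:D2")
for path in paths:
    print(" -> ".join(path))
```

Each path found this way is a candidate association; the assessment step would then score the terms along it, for example with the entropy and KL divergence measures.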


Summary


Drug discovery is entering a new era that is arguably centered on informatics analysis of the vast amount of biological and chemical data now being produced, and which looks at the effect of drugs on biological systems as a whole. This new approach underlies the new fields of systems chemical biology and chemogenomics.

Analyzing this data, and particularly the relationships between compounds, drugs, proteins, genes, diseases, pathways and people, promises to provide important understanding of the nature of disease and treatment.

The Semantic Web provides an effective framework for logically managing the data, and Cloud Computing provides a physical framework for computation and searching.

Early-stage methods developed at Indiana allow integrated access to this data, path finding between any two points, visualization in chemical space and network tools, and advanced handling of the scholarly literature.

Critical next steps include ranking and intelligent filtering of paths and relationships to provide aggregate evidence-based approaches, and integration of NGS and patient data.

Cheminformatics group at Indiana University


http://djwild.info