General December 2000 - IMA

Novel Bioinformatics Techniques
in Functional Genomics
Children’s Hospital Informatics Program
www.chip.org
Children’s Hospital Boston
Harvard Medical School
Massachusetts Institute of Technology
Atul Butte, MD
atul_butte@harvard.edu
Bioinformatics at the
Children’s Hospital Informatics Program
• Funded by 12 NIH grants across 7 institutes
• NHGRI PhD Training Program “Bioinformatics and Integrative Genomics”
• NLM funded “Biomedical applications of the Next Generation Internet” N01 LM093536
• NCI funded “Improved diagnosis in ALL” R21 CA95618, PI Dietrich Stephan: bioinformatics support
• NCI funded “Cancer and Leukemia Group B Statistical Center” U10 CA033601: bioinformatics support (pending)
• NHLBI Program in Genomic Applications “Genomics of Cardiovascular Development, Adaptation, and Remodeling” U01 HL066582: bioinformatics core
• NHLBI Program in Genomic Applications “Innate Immunity in Heart, Lung, and Blood Disease” U01 HL066800: bioinformatics core
• NHLBI funded “AT2 receptor-mediated gene programming of smooth muscle” R01 HL58516: bioinformatics support
• NIDDK “Diabetes Genome Initiative Project” DK60837: bioinformatics core
• NIDDK “Biotechnology Center” processing 50 microarrays/week U24 DK058739: bioinformatics core
• NIDDK “Surrogate Markers for Early Stage Diabetic Retinopathy”: bioinformatics support
• NINDS program project grant “Gene expression in normal and diseased muscle during development” P01 NS040828: bioinformatics core
• NINDS funded “Functional Genomic Analysis of the Developing Cerebellum” R21 NS041764: bioinformatics core
• NIAID funded “Novel Approaches to Achieve Allograft Tolerance” R01 AI050987: bioinformatics core
Relevance Networks
• How can we find gene regulatory networks or physiological regulatory networks with little or no a priori knowledge (unsupervised learning)?
• How can we link microarray measurements to clinical measurements?
• Relevance Networks are an approach to analyzing these data sets
Butte AJ, Kohane IS, Unsupervised Knowledge Discovery in Medical Databases Using
Relevance Networks, Symposia AMIA, 1999.
Butte and Kohane, Children’s Hospital, Patent Pending.
Construction of Relevance Networks 1
• Patients and cell lines are analyzed as cases
• Clinical parameters, laboratory tests, RNA expression, and susceptibility to anti-cancer agents are all example features of those cases
• Features include functional genomics as well as phenotypic measurements
[Figure: table of cases (patient, cell line, time, etc.) against features (Lab Test 1, Lab Test 2, Clinical Param 1, RNA Expr J02923, Susceptibility to Anti-cancer Agent 169517)]
Construction of Relevance Networks 2
• For all pairs of features, we take overlapping values over the cases and make a scatter plot of values
[Figure: the Lab Test 2 and Susceptibility to Anti-cancer Agent 169517 columns are pulled out of the cases-by-features table and plotted against each other]
Construction of Relevance Networks 3
• Perform a pairwise comparison between all features
• For each scatter plot, we compute how similar the two features are (dissimilarity measure)
– Fit a linear model and store correlation coefficient r
• Every feature is completely connected to every other feature by a dissimilarity measure (e.g. a linear model)
[Figure: scatter plot of Lab Test 2 against Susceptibility to Anti-cancer Agent 169517, r = 0.65]
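The pairwise step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the relnet software: the feature names and values are made up, and the rule that a pair needs at least three shared cases is an assumption of the sketch.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 3:  # assumed minimum overlap; the slides do not state one
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def pairwise_r(features):
    """features: {name: {case_id: value}} -> {(name_a, name_b): r}.

    Only the cases where both features have a value are compared.
    """
    links = {}
    for a, b in combinations(sorted(features), 2):
        shared = sorted(set(features[a]) & set(features[b]))
        r = pearson_r([features[a][c] for c in shared],
                      [features[b][c] for c in shared])
        if r is not None:
            links[(a, b)] = r
    return links

# Toy data, invented for illustration (not the values shown on the slides).
features = {
    "lab_test_2": {"case1": 3.7, "case2": 4.5, "case3": 5.1, "case4": 6.0},
    "agent_169517": {"case1": 8.1, "case2": 7.0, "case3": 6.2, "case4": 5.3},
    "clinical_param_1": {"case2": 105, "case3": 99, "case4": 102, "case5": 110},
}
links = pairwise_r(features)
for pair, r in sorted(links.items()):
    print(pair, round(r, 2))
```

Storing each feature as a mapping from case to value makes the "overlapping values" step a plain set intersection, so features measured on different subsets of cases can still be compared.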
Construction of Relevance Networks 4
• Choose a threshold r to split the network
• Drop links with r under threshold
• Breaks the completely connected network into islands where connections are stronger than threshold
• Islands are what we call “relevance networks”
• Display graphically, with thick lines representing strongest links
[Figure: network diagram with nodes Lab Test 1, Lab Test 2, Clinical Param 1, Clinical Param 2, Expression of J02923, and Susceptibility to 169517, joined by links labeled r]
Software available at www.chip.org/relnet
Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS. Proceedings of the National Academy of Sciences, 97:12182, 2000.
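The thresholding step can be sketched as dropping weak links and then collecting connected components. This is an illustrative sketch only: the link values are invented, and comparing |r| to the threshold (so that strong negative associations survive) is an assumption, since the slide says only that links with r under the threshold are dropped.

```python
from collections import defaultdict

def relevance_networks(links, threshold):
    """links: {(a, b): r}. Keep links with |r| >= threshold; return the islands."""
    neighbors = defaultdict(set)
    for (a, b), r in links.items():
        if abs(r) >= threshold:  # assumption: magnitude, not signed r
            neighbors[a].add(b)
            neighbors[b].add(a)
    islands, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk over the surviving links from this feature.
        island, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in island:
                continue
            island.add(node)
            stack.extend(neighbors[node] - island)
        seen |= island
        islands.append(island)
    return islands

# Hypothetical correlations between the features named on the slide.
links = {
    ("lab_test_1", "lab_test_2"): 0.91,
    ("lab_test_2", "clinical_param_1"): 0.84,
    ("expr_J02923", "agent_169517"): -0.88,
    ("clinical_param_2", "agent_169517"): 0.30,
    ("lab_test_1", "expr_J02923"): 0.12,
}
islands = relevance_networks(links, threshold=0.8)
for island in islands:
    print(sorted(island))
```

Features with no link above the threshold (here `clinical_param_2`) belong to no island and simply drop out of the display.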
Causality
• So far, our potential genetic networks are undirected
• Temporality is the first step to determining causality
• Multi-frequency analysis, assuming nothing about genomic frequency
[Figure: Gene A (transmitter) feeding a biological system whose output is Gene B (receiver)]
Statistical Signal Processing
• Statistical signal processing techniques allow us to analyze how much signal / energy is transferred in a potential system
[Figure panels: Gain, Phase, Coherence]
Determining the Black Boxes
• Assume a system for all pairs of genes
• Determine if one is an echo of another by measuring phase shift at each frequency
• Measure gain at those frequencies: louder or softer
• Strength proportional to coherence at those frequencies
Signal Processing Finds Gene Regulation
• Given enough time (and enough measurements), this can be successfully applied to all possible pairs of genes
Butte AJ, et al. Comparing the Similarity of Time-Series Gene Expression Using Signal Processing Metrics. Journal of Biomedical Informatics 2002; 34: 396.
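Gain, phase, and coherence for one pair of time series can be estimated from segment-averaged spectra. The sketch below is one standard way to do that in plain Python and is not necessarily the procedure used in the cited paper; the two "gene" signals are synthetic, and the segment length, noise level, and delay are arbitrary assumptions.

```python
import cmath
import math
import random

def dft(segment):
    """Plain O(n^2) discrete Fourier transform, non-negative frequencies only."""
    n = len(segment)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(segment))
            for k in range(n // 2 + 1)]

def gain_phase_coherence(x, y, seg_len):
    """Per-frequency (gain, phase, coherence) of y relative to x.

    Spectra are averaged over non-overlapping segments; with a single
    segment the coherence would be 1 at every frequency by construction.
    """
    n_bins = seg_len // 2 + 1
    pxx, pyy, pxy = [0.0] * n_bins, [0.0] * n_bins, [0j] * n_bins
    for start in range(0, len(x) - seg_len + 1, seg_len):
        fx = dft(x[start:start + seg_len])
        fy = dft(y[start:start + seg_len])
        for k in range(n_bins):
            pxx[k] += abs(fx[k]) ** 2
            pyy[k] += abs(fy[k]) ** 2
            pxy[k] += fx[k].conjugate() * fy[k]
    return [(abs(pxy[k]) / pxx[k],                        # gain: louder or softer
             cmath.phase(pxy[k]),                         # phase: negative if y lags x
             abs(pxy[k]) ** 2 / (pxx[k] * pyy[k]))        # coherence in [0, 1]
            for k in range(n_bins)]

# Synthetic pair: gene_b is a half-amplitude copy of gene_a, two samples late.
rng = random.Random(0)
seg_len, n_seg, delay = 16, 8, 2
n = seg_len * n_seg
full = [math.sin(2 * math.pi * t / 8) + rng.gauss(0, 0.1) for t in range(n + delay)]
gene_a = full[delay:]                                            # "transmitter"
gene_b = [0.5 * full[t] + rng.gauss(0, 0.1) for t in range(n)]   # "receiver"

metrics = gain_phase_coherence(gene_a, gene_b, seg_len)
gain, phase, coherence = metrics[2]  # bin 2 = one cycle per 8 samples
print(round(gain, 2), round(phase, 2), round(coherence, 2))
```

At the frequency where the shared oscillation lives, the gain should sit near 0.5, the phase near -π/2 (a two-sample lag on an eight-sample cycle), and the coherence close to 1. At other frequencies only noise is present, so the estimates there are not meaningful.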
Has my gene shown up before?
Cross-PGA results integration
• NHLBI Programs in Genomic Applications (PGA)
• 11 funded nationwide
• PGAgene: Integrated gene-centric view
• Available today at pgagene.chip.org
• Data from over 1,200 publicly available microarrays
• Contains and references 5+ million pieces of PGA data
• No additional work required from the external sites
Cross-PGA results integration
• Uses LocusLink as common denominator
• Links microarrays, SNPs and mutations
• Out of 156,386 genes in LocusLink
– 29,320 have been measured on a PGA microarray
– 485 have been sequenced for SNPs
– 4 have been sequenced for mutations
• One solution to the problem of linking cDNA and Affymetrix microarrays
Lee K, Kohane IS, Butte AJ. Bioinformatics 2003, 19:778.
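Using a shared gene identifier as the common denominator amounts to a group-by on that identifier. The sketch below shows the idea with invented probe names and invented locus IDs; it is not the PGAgene schema.

```python
from collections import defaultdict

# Hypothetical probe annotations: (platform, probe) -> LocusLink-style ID.
probe_to_locus = {
    ("affymetrix", "1001_at"): 7001,
    ("affymetrix", "1002_at"): 7002,
    ("cdna", "IMAGE:0001"): 7001,
    ("cdna", "IMAGE:0002"): 7003,
}

def group_by_locus(probe_to_locus):
    """Build the gene-centric view: locus -> platform -> list of probes."""
    by_locus = defaultdict(lambda: defaultdict(list))
    for (platform, probe), locus in probe_to_locus.items():
        by_locus[locus][platform].append(probe)
    return by_locus

def shared_loci(by_locus, platforms):
    """Loci measured on every one of the given platforms."""
    return sorted(locus for locus, per_platform in by_locus.items()
                  if all(p in per_platform for p in platforms))

by_locus = group_by_locus(probe_to_locus)
shared = shared_loci(by_locus, ["affymetrix", "cdna"])
print(shared)
```

Because each data source only has to supply its own probe-to-locus mapping, new platforms or data types can be added without changing the sources that are already linked.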
Example of Integrating Data
• Work done by Kyungjoon Lee
• Combining 893 genes measured in rat, mouse and human, across 997 arrays
How many samples do we need?
• To prove (p < 0.05) an 8% difference in transplant rejection rate between two drugs, is it easier to study 10 patients or 100 patients?
• To make a list of genes that differentiate patients with early rejection from long-term non-rejection, is it easier to use 1 sample of each, or 100 samples of each?
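The first question is a classical power calculation, which can be sketched with the normal approximation for comparing two proportions. The slide gives only the 8% difference, so the rejection rates of 20% and 12% below are assumptions, as is reading "10 or 100 patients" as patients per drug.

```python
from math import ceil, sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar))
    se_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    z = (abs(p1 - p2) * sqrt(n_per_group) - z_alpha * se_null) / se_alt
    return NormalDist().cdf(z)

def n_per_group_for_power(p1, p2, power=0.8, alpha=0.05):
    """Smallest per-group sample size reaching the requested power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar))
    se_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(((z_alpha * se_null + z_beta * se_alt) / abs(p1 - p2)) ** 2)

# Assumed rejection rates: 20% on one drug, 12% on the other.
p10 = power_two_proportions(0.20, 0.12, 10)
p100 = power_two_proportions(0.20, 0.12, 100)
n_needed = n_per_group_for_power(0.20, 0.12)
print(round(p10, 2), round(p100, 2), n_needed)
```

Under these assumed rates, 100 patients per drug gives far more power than 10, yet still falls short of the conventional 80%, which is the sense in which sample size is "about power" for a single clinical endpoint.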
Modified from Yeoh, et al. Cancer Cell 2002, 1: 133.
[Figure: two heat maps of genes across Reject and Non-reject samples]
With microarray diagnostics, sample size is less about power and much more about modeling the variation of the condition
How do we avoid overfitting?
• In other words, with too few samples, it is too easy to overfit the measurements, especially when measuring 20 to 30 thousand genes
• We have techniques like support vector machines that even further expand the number of features
• And even for the ones we get wrong, we later find they’ve been misclassified, or define a new subgroup…
Yeoh, et al. Cancer Cell 2002, 1: 133.
Cross-validation
• Random permutation and cross-validation are commonly used in evaluating strategies for picking diagnostic genes
• These can help reduce the danger of overfitting
• But only additional samples will allow algorithms to learn the variation in disease
• This reduces false positives
Pan W. Genome Biology 2002. http://genomebiology.com/2002/3/5/research/0022
Wolfinger RD. Journal of Computational Biology 2001, 8:625.
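Evaluating a gene-picking strategy by cross-validation means the picking itself has to be repeated inside every fold, so the held-out sample never influences which genes are chosen. The sketch below uses leave-one-out cross-validation with a deliberately simple strategy (largest difference in class means, nearest-centroid classifier); both the strategy and the toy expression values are assumptions for illustration, not the methods of the cited papers.

```python
def class_means(samples, labels, genes):
    """Mean expression of each selected gene within each class."""
    means = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        means[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    return means

def select_genes(samples, labels, k):
    """Top-k genes by absolute difference between the two class means."""
    all_genes = range(len(samples[0]))
    means = class_means(samples, labels, all_genes)
    first, second = sorted(means)
    gaps = [abs(a - b) for a, b in zip(means[first], means[second])]
    return sorted(all_genes, key=lambda g: -gaps[g])[:k]

def predict(sample, centroids, genes):
    """Assign the class whose centroid is nearest on the selected genes."""
    def dist(label):
        return sum((sample[g] - c) ** 2 for g, c in zip(genes, centroids[label]))
    return min(sorted(centroids), key=dist)

def loocv_accuracy(samples, labels, k):
    """Leave-one-out accuracy with gene selection redone in every fold."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        genes = select_genes(train_x, train_y, k)  # held-out sample not used
        centroids = class_means(train_x, train_y, genes)
        correct += predict(samples[i], centroids, genes) == labels[i]
    return correct / len(samples)

# Toy expression matrix (rows = patients, columns = genes); values invented.
samples = [
    [5.0, 1.0, 3.0, 2.0], [5.2, 1.2, 2.9, 2.4], [4.8, 0.9, 3.1, 1.9],
    [1.0, 1.1, 3.0, 2.1], [1.3, 1.0, 3.2, 2.2], [0.9, 1.3, 2.8, 2.0],
]
labels = ["reject"] * 3 + ["non-reject"] * 3
accuracy = loocv_accuracy(samples, labels, k=1)
print(accuracy)
```

With only six toy samples the estimate says little about real performance, which mirrors the slide's point: resampling guards against overfitting, but it cannot substitute for additional samples.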
Unchip translation tables
• Affymetrix now has a web-site (www.netaffx.com) to find information about each probe set
• Sometimes, however, you want direct access to a database
• We developed a new database for links between Affymetrix microarrays and national databases solely using the data files (public information)
[Figure: database diagram with the following labels and record counts]
Accession Number (214,301)
Links (5,884,490)
Original explanation (155,281)
GenBank accession number (199,335)
UniGene symbol (183,685)
LocusLink accession (186,086)
OMIM disease (44,272)
Chromosomal position (98,501)
Functional domains (212,274)
Golden path exon locations (11,103,407)
Other diagram labels: GenBank, TIGR, UniGene, Custom
www.unchip.org
• “Before and after” example
– 33640_at:Cluster incl. Y14768: Homo sapiens DNA, cosmid clone
– Unchip: Allograft inflammatory factor 1
• Search for “95654_at”
– valyl-tRNA synthetase 2, heat shock protein cognate 70, etc.
• Search for “M35416_at”
– Affymetrix: RALB V-ral simian leukemia viral oncogene homolog B
– Edition 2: v-ral simian leukemia viral oncogene homolog B (ras related; GTP binding protein)
– Edition 3: stathmin 1/oncoprotein 18
• Can find official names, symbols, and synonyms for accessions
• Can search for expression by functional domain or meaning
• Available as database and web-site, at www.unchip.org
• Updated monthly
More examples
What probes are available testing for genes on human chromosome 4?
Select distinct a.link_text, b.link_int, c.link_text, d.accession, d.array_type
From chip_accession_link a, chip_accession_link b, chip_accession_link c, chip_accession d
Where
a.link_text = 'Homo sapiens' and a.link_type = 35 and
b.link_type in (15,27,23) and
c.link_text = '4' and c.link_type = 28 and
a.parent_accession = b.parent_accession and
a.parent_accession = c.parent_accession and
a.parent_accession = d.id;
+--------------+----------+-----------+------------------+---------------+
| link_text | link_int | link_text | accession | array_type |
+--------------+----------+-----------+------------------+---------------+
| Homo sapiens | 85462 | 4 | 44096_at | u95_b |
| Homo sapiens | 85438 | 4 | 76249_at | u95_e |
| Homo sapiens | 85013 | 4 | 65166_at | u95_c |
| Homo sapiens | 84992 | 4 | 45734_at | u95_b |
| Homo sapiens | 84869 | 4 | 48697_at | u95_b |
| Homo sapiens | 84803 | 4 | 52883_at | u95_b |
| Homo sapiens | 84740 | 4 | 66234_at | u95_c |
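To try the query shape without access to the Unchip database, a toy copy can be built in SQLite. Everything here is an assumption made for illustration: the table layout is inferred only from the column names used in the query above, the rows and identifiers are invented, and the meaning of each `link_type` code (35 as species, 28 as chromosome, 15 as a numeric gene identifier) is guessed from how the query uses them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chip_accession (
    id INTEGER PRIMARY KEY, accession TEXT, array_type TEXT);
CREATE TABLE chip_accession_link (
    parent_accession INTEGER, link_type INTEGER,
    link_text TEXT, link_int INTEGER, version INTEGER);
""")

# Two made-up probes: one annotated to chromosome 4, one to chromosome 7.
conn.executemany("INSERT INTO chip_accession VALUES (?, ?, ?)", [
    (1, "00001_at", "toy_a"),
    (2, "00002_at", "toy_a"),
])
conn.executemany("INSERT INTO chip_accession_link VALUES (?, ?, ?, ?, ?)", [
    (1, 35, "Homo sapiens", None, 3),
    (1, 28, "4", None, 3),
    (1, 15, None, 90001, 3),
    (2, 35, "Homo sapiens", None, 3),
    (2, 28, "7", None, 3),
    (2, 15, None, 90002, 3),
])

# Same self-join as the slide: three aliases of the link table, one per
# kind of annotation, all tied to the same parent probe.
query = """
SELECT DISTINCT a.link_text, b.link_int, c.link_text, d.accession, d.array_type
FROM chip_accession_link a, chip_accession_link b,
     chip_accession_link c, chip_accession d
WHERE a.link_text = ? AND a.link_type = 35
  AND b.link_type IN (15, 27, 23)
  AND c.link_text = ? AND c.link_type = 28
  AND a.parent_accession = b.parent_accession
  AND a.parent_accession = c.parent_accession
  AND a.parent_accession = d.id
"""
rows = conn.execute(query, ("Homo sapiens", "4")).fetchall()
print(rows)
```

Passing the species and chromosome as bound parameters, rather than pasting them into the SQL text, avoids the quoting problems that arise when queries are copied out of slides or word processors.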
More examples
Which probes are for genes thought to be involved in oncogenesis and contain a zinc finger (possibly binding DNA)?
Select distinct c.link_text, b.link_text, d.accession, d.array_type
From chip_accession_link a, chip_accession_link b, chip_accession_link c, chip_accession d
Where
a.link_text = 'oncogenesis' and a.link_type = 40 and
b.link_text like 'Zinc finger%' and b.link_type = 22 and
c.link_type in (17,37) and
a.parent_accession = c.parent_accession and
a.parent_accession = b.parent_accession and
a.parent_accession = d.id and
a.version = 2 and b.version = 3 and c.version = 3;
+--------------------------------------+------------------------------------+-------------+---------------+
| link_text | link_text | accession | array_type |
+--------------------------------------+------------------------------------+-------------+---------------+
| zinc finger protein 217 | Zinc finger|pfam00096 | 32034_at | u95v2_a |
| zinc finger protein 217 | Zinc finger|pfam00096 | 32034_at | u95_a |
| zinc finger protein 151 (pHZ-67) | Zinc finger|pfam00096 | 41532_at | u95_a |

Take Home Points
• Not all pathways will be reverse engineered by microarrays
• With microarrays, sample size plays a larger role in accuracy than in power
• Due to rapidly changing information, one is never truly finished analyzing a microarray data set
Perou CM. Nature Genetics 2001, 29:373.
Bioinformatics and Integrative Genomics
big.chip.org
NIH Funded
New PhD training program in bioinformatics for quantitative individuals
Includes training in wet- and dry-biology, clinical medicine
First class Fall 2002
Microarrays for an Integrative Genomics
• The first textbook on microarray analysis and experimental design
• Barnes and Noble, Borders, Amazon: US$32-40
Use and Analysis of Microarray Data
• Butte AJ. Nature Reviews Drug Discovery 2002, 1:651.
Collaborators and Support
• Collaborations
– Scott Weiss / Channing Laboratory
NHLBI Program of Genomics Applications
Nurses Health Study
Physicians Health Study
Normative Aging Study
– Seigo Izumo / Beth Israel
NHLBI Program of Genomic Applications
Framingham Heart Study
– David Rowitch / Dana Farber
NINDS Innovative Technologies
– Dietrich Stephan / Children’s National Medical Center
Leukemia Diagnostics
– Towia Libermann / Beth Israel
NIDDK Biotechnology Center
– Victor Dzau / Brigham and Women’s
Angiotensin signaling
– Terry Strom / Beth Israel
NIAID Immune Tolerance Network
– Louis Kunkel / Children’s Hospital
Muscular Dystrophy
– C. Ron Kahn and M. E. Patti / Joslin Diabetes Center
Diabetes Genomic Anatomy Project
• Support
– NIH: NLM, NINDS, NHLBI, NIDDK, NIAID, NHGRI, NCI, NIGMS
– Lawson Wilkins NovoNordisk Award
– Merck / MIT Fellowship
– Genentech Foundation Fellowship
– Endocrine Fellow Foundation
Bioinformatics at the
Children’s Hospital Informatics Program
www.chip.org
Staff
• Isaac Kohane, Director
• Atul Butte
• Steven Greenberg
• Alvin Kho
• Peter Park
• Marco Ramoni
• Alberto Riva
• Yao Sun
• Zoltan Szallasi
Fellows
• Ashish Nimgaonkar
• Sunil Saluja
• Dominic Alloco
Post-doctoral fellows
• Zhaohui Cai
• Sangeeta English
• Voichita Marinescu
• Eric Tsung
• Asher Schachter
• Alex Turchin
Students
• Kyungjoon Lee
Alumni
• Ling Bao
• Jinyun Chen
• Aaron Homer
• Janet Karlix
• Ju Han Kim
• Winston Kuo
• Mark Whipple
• Maneesh Yadav
Atul Butte, MD
atul_butte@harvard.edu