Automation of Biological Research:
Robotics and Machine Learning

Session 1 - Introduction

Robert F. Murphy

Lane Center for Computational Biology

Carnegie Mellon University

http://lane.compbio.cmu.edu

murphy@cmu.edu


Copyright © 2010 R. F. Murphy. These lecture notes are provided under a Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported license. See
http://creativecommons.org/licenses/by-nc-sa/3.0/ for complete terms of the license.

Automation of Biological Research:
Robotics and Machine Learning


Instructors: Robert F. Murphy, Jaime Carbonell, Jeff Schneider


Fall 2010, 02-450: 9 units, 02-750: 12 units


MW 1:30 pm to 2:50 pm, GHC 4303


Biology has been revolutionized by automated methods for generating large amounts of data on diverse
biological processes. This, in addition to the finding that many more components are involved in each
process than had earlier been thought, has led to a transition from a “reductionist” paradigm of
biological research involving detailed study of single molecules or events to a “systems biology”
paradigm involving comprehensive, systematic studies combined with computational data analysis.
Integration of data from many types of experiments will be required to construct detailed,
predictive models of cell, tissue or organism behaviors, and the complexity of the systems suggests the
need for these models to be constructed automatically. This will require iterative cycles of acquisition,
analysis, modeling, and experimental design, since it is not feasible to do all possible biological
experiments. This course will cover a range of automated biological research methods (especially
high-throughput, robotic methods for protein structure determination, gene sequencing, cell-based drug
screening, and nanoassays), and a range of relevant computational methods (especially active learning,
proactive learning, compressed sensing and model structure learning). It assumes a basic knowledge of
machine learning. Class sessions will consist of a combination of lectures and discussions of important
research papers. Grading will be based on class participation, homework, and a final project.


Prerequisites:


450: 15-381 or instructor permission.


750: 10-601 or 10-701 or instructor permission.


Course web page: http://lane.compbio.cmu.edu/courses/automationbioresearch

02-750 Coursework and grading


Class participation (30%)


Two homework assignments using active learning on
existing datasets (30%)


Final project/poster presentation (40%)

02-450 Coursework and grading


Class participation (50%)


One of two homework assignments using active learning
on existing datasets (30%)


Final project/poster presentation (20%)

Final project


Pick a biological process


Describe current state of knowledge


Describe current efforts at quantitative modeling


Describe types of experiments that can provide
information about process


Describe current state of instrumentation for their
automation


Describe structured data sources to be used (optional)


Describe proposed new
instrumentation/experimentation (optional)


Describe proposed modeling strategy


Describe active learning strategy


Submit paper (max 10 pages) and present poster

Interface between biology and
computer science


The National Academy of Sciences report
(2005)

Executive Summary


Twenty-first century biology will integrate a number of diverse intellectual notions.


One integration is that of the reductionist and systems approaches: a focus on
components of biological systems combined with a focus on interactions among
these components.


A second integration is that of many distinct strands of biological research:
taxonomic studies of many species, the enormous progress in molecular genetics,
steps toward understanding the molecular mechanisms of life, and a consideration
of biological entities in relationship to their larger environment.


A third integration is that computing will become highly relevant to both
hypothesis testing and hypothesis generation in empirical work in biology.


Finally, 21st century biology will also encompass what is often called discovery
science: the enumeration and identification of the components of a biological
system independently of any specific hypothesis about how that system functions
(a canonical example being the genomic sequencing of various organisms).
Twenty-first century biology will embrace the study of an inclusive set of biological
entities, their constituent components, the interactions among components, and
the consequences of those interactions, from molecules, genes, cells, and
organisms to populations and even ecosystems.


NAS report page 2

Roles for computing in biology


Computational tools are artifacts, usually
implemented as software but sometimes hardware,
that enable biologists to solve very specific and
precisely defined problems. Such biologically oriented
tools acquire, store, manage, query, and analyze
biological data in a myriad of forms and in enormous
volume for its complexity. These tools allow biologists
to move from the study of individual phenomena to
the study of phenomena in a biological context; to
move across vast scales of time, space, and
organizational complexity; and to utilize properties
such as evolutionary conservation to ascertain
functional details.

NAS report pages 2-3

Roles for computing in biology


Computational models are abstractions of biological
phenomena implemented as artifacts that can be used to
test insights, to make quantitative predictions, and to help
interpret experimental data. These models enable
biological scientists to understand many types of biological
data in context, even in very large volume, and to make
model-based predictions that can then be tested
empirically. Such models allow biological scientists to tackle
difficult problems that could not readily be posed without
visualization, rich databases, and new methods for making
quantitative predictions. Biological modeling itself has
become possible because data are available in
unprecedented richness and because computing itself has
matured enough to support the analysis of such complexity.

NAS report pages 2-3

Roles for computing in biology


A computational perspective on or metaphor
for biology applies the intellectual constructs
of computer science and information technology
to biological phenomena. Such abstractions may
well provide an alternative and more appropriate
language and set of abstractions for representing
biological interactions, describing biological
phenomena, or conceptualizing some
characteristics of biological systems.

NAS report pages 2-3

Roles for computing in biology


Cyberinfrastructure (high-end general-purpose computing centers
that provide supercomputing capabilities to the community at
large; well-curated data repositories that store and make available
to all researchers large volumes and many types of biological data;
digital libraries that contain the intellectual legacy of biological
researchers and provide mechanisms for sharing, annotating,
reviewing, and disseminating knowledge in a collaborative context;
and high-speed networks that connect geographically distributed
computing resources) will become an enabling mechanism for
large-scale, data-intensive biological research that is distributed
over multiple laboratories and investigators around the world. New
data acquisition technologies such as genome sequencers will
enable researchers to obtain larger amounts of data of different
types and at different scales, and advances in information
technology and computing will play key roles in the development of
these technologies.

NAS report pages 2-3

Biological entities and their
relationships



Entities


metabolites, macromolecules, complexes,
organelles, cells, tissues, organisms


Relationships


binds to, regulates, is regulated by, inhibits,
enhances


Levels of biological organization

http://changchangchangchang.blogspot.com/2009/09/reading-journal-ch01.html

Length and time scales



http://www.soton.ac.uk/~chemphys/jessex/group/research/research_dan.html

Sensors and readouts


amount


spatial distribution


interactions


activity


environment


Cellular processes


Metabolism - chemical reactions sustaining life


Transcription - copying genetic instructions from DNA to RNA


Translation - creating proteins from RNA


Reproduction - replication and division of cells to create new cells


DNA repair - correction of errors in genetic instructions


Membrane Transport (osmosis/passive transport/active transport) - import/export small molecules


Membrane Traffic (endocytosis/phagocytosis/organelle transport/secretion) - import/export macromolecules and larger entities


Programmed cell death - death caused by recognition of a signal


Cell senescence - death caused by age


Cell signaling - recognition and propagation of external conditions


Adhesion - binding of a cell to a surface, extracellular matrix or another cell


Motility/cell migration - controlled cell movement


Automation


Many automated instruments developed over
past 20 years that permit high-throughput
experimentation


Many sophisticated computational techniques
developed for analyzing and modeling
biological data

Ontologies


The Open Biological and Biomedical Ontologies - http://www.obofoundry.org/


Notable ontologies


GO (Gene Ontology: process, component, function) - http://www.geneontology.org/


CHEBI (Chemical entities of biological interest) - http://www.ebi.ac.uk/chebi/


PRO (Protein ontology) - http://pir.georgetown.edu/pro/


Workflows


Experimental workflows


Computational workflows


Modeling biological processes as workflows

Laws and Theories


‘Laws are often times mathematically defined (…
a description of how nature behaves) whereas
theories are often non-mathematical. Looking at
things this was [sic] helps to explain, in part, why
physics and chemistry have lots of "laws"
whereas biology has few laws (and more
theories). In biology, it is very difficult to describe
all the complexities of life with "simple"
(relatively speaking!) mathematical terms.’


R. Matson: http://science.kennesaw.edu/~rmatson/3380theory.html

What’s the hard part of (natural)
science?


Physics - developing theories?


Chemistry - creativity?


Biology - getting enough data!


understanding it


Molecular Complexity


Each cell/tissue type in a given organism
expresses tens of thousands of proteins
that combine to produce behaviors


Need to understand how they change from
cell type to cell type or under various
conditions

Molecular Complexity


Reductionist paradigm: assume that a
change in one molecule affects few other
molecules and effects are largely
independent


Systems biology paradigm: all can
potentially influence each other in complex
ways

Complex System


A system composed of interconnected parts
that as a whole exhibit one or more properties
(behavior among the possible properties) not
obvious from the properties of the individual
parts (Wikipedia)

Human metabolic pathway diagram

http://www.genome.jp/kegg/

Human protein interaction map

http://www.mdc-berlin.de/de/highlights/archive/2005/highlight11/index.html

Generalizations and Explanations


Most (all?) biological results are stated as
attempted generalizations


ABC regulates expression of DEF


XYZ degrades TUV


Often seek to explain phenomena using
previously learned generalizations


This is rarely quantitative, yet many systems
have been demonstrated to be sensitive to
small (less than two-fold) changes

Paradigms for biological research


hypothesis testing


discovery approaches


vary conditions and observe response


screen for effectors (mutations/inhibitors) giving a
desired response


active and proactive learning


AI for Scientific discovery


Goal of “traditional” AI for scientific discovery
is to build a system that learns a new law (or
theory)


Experience suggests this is hard for computers and
easier for people


What if the goal is detailed, iterative, empirical
predictive modeling of a complex system?


Computers should be better at this than people

Active Learning


The task of building empirical, predictive
models requires “probing” the systems for
many, many combinations of variables


A learner given control over the data it can
learn from is said to be “active”


Traditional modeling: build model, pick
experiment to test it


Active learning: build model, pick experiment
with best chance of improving model
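
The contrast between traditional modeling and active learning can be illustrated with a toy sketch. Assume, purely for illustration, that the model is a single dose threshold above which a compound has an effect, and that `run_experiment` is a hypothetical stand-in for a real assay (neither the function nor the threshold value comes from the course). The learner always requests the untested dose that rules out the most remaining candidate models, which here is the midpoint of the still-uncertain range.

```python
def run_experiment(dose, true_threshold=37):
    """Hypothetical assay: True if the dose produces an effect.
    The hidden threshold (37) is an illustrative assumption."""
    return dose >= true_threshold


def active_learn_threshold(doses, oracle):
    """Find the index of the smallest effective dose.

    At each step the learner picks the experiment that halves the set of
    thresholds still consistent with the data (the midpoint query).
    Returns (index_of_threshold, number_of_experiments_run).
    """
    lo, hi = 0, len(doses)          # threshold index lies somewhere in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2        # most informative untested dose
        queries += 1
        if oracle(doses[mid]):
            hi = mid                # effect seen: threshold is at or below mid
        else:
            lo = mid + 1            # no effect: threshold is above mid
    return lo, queries


doses = list(range(100))            # candidate doses, sorted ascending
idx, n_queries = active_learn_threshold(doses, run_experiment)
print(f"threshold dose = {doses[idx]}, experiments run = {n_queries}")
```

Testing the doses in a fixed order would need up to 100 experiments for the same pool, while the learner that chooses its own experiments needs about log2(100). Real biological models have many interacting variables, so the selection rule is far less simple than a midpoint, but the loop structure (model, choose, measure, update) is the same.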

Established active learning
applications


Learning of motor skills by robots


Natural language processing


Google ads/web page layout


Netflix movie ranking


A biological active learning scenario


Have many molecules/cells/tissues/patients for
which we would like to measure something (set
A)


Have many conditions/genes/compounds whose
effects we would like to learn (set B)


Build computational (generative) model to
capture correlations among items in set A with
respect to effects of set B and correlations among
items in set B with respect to effects on set A

The need for active learning for
biology - Motivating examples



Drug development


Learning regulatory motifs

Drug side effects and toxicity


During drug development, candidate drugs
are screened for their specific desired effect
but not against the very large number of
potential other effects


Many drugs advance well along the
development pipeline (and even into
widespread use) before their side effects are
discovered

Drug combinations


Complexity of many diseases makes existence
of single drug cures unlikely


Combinations of drugs increasingly used


Treatment with complex series of drugs can
be contemplated if sufficiently detailed
predictive models available


Automated microscopy
for drug screening


Automated microscopy often used to
determine which of many compounds (drug
candidates) have a desired effect on a
particular target molecule


It is a major effort just to analyze tens of
thousands of drugs for a particular target in a
particular cell line

Automated microscopy for drug
screening


But what about doing this for all potential
targets?


10^4 proteins x 10^4 drugs = 10^8


And how about for pairs of drugs?


10^4 proteins x (10^4 drugs)^2 = 10^12


Can’t do this by brute force
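
The counts on this slide can be checked with a few lines of arithmetic. The sketch below also shows the count for unordered drug pairs (where drug A plus drug B is the same experiment as B plus A), which is about half the slide's figure and does not change the conclusion. The throughput of 10^5 experiments per day is an assumed number used only to give a sense of scale.

```python
from math import comb

n_proteins = 10**4
n_drugs = 10**4

single = n_proteins * n_drugs                     # one drug per experiment
ordered_pairs = n_proteins * n_drugs**2           # the slide's estimate for pairs
unordered_pairs = n_proteins * comb(n_drugs, 2)   # distinct two-drug combinations

# Assumed screening throughput (illustrative, not from the slides).
experiments_per_day = 10**5
years_for_pairs = ordered_pairs / experiments_per_day / 365

print(f"single drugs:    {single:.1e}")
print(f"ordered pairs:   {ordered_pairs:.1e}")
print(f"unordered pairs: {unordered_pairs:.1e}")
print(f"years at assumed throughput: {years_for_pairs:,.0f}")
```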

Learning drug effects


Use active learning and robotics to iteratively
select and execute batches of experiments chosen
as most likely to improve the model


Experiments no longer carried out in regular
patterns, so need “random access” robotics
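
A minimal sketch of the select-and-execute loop follows. It makes several assumptions that are not from the slides: the experiment space is a small protein x drug grid, `run_batch` is a hypothetical stand-in for the robot, and "most likely to improve the model" is approximated by a crude coverage heuristic (prefer pairs whose protein and drug have been measured least). A real system would replace that heuristic with uncertainty estimates from the predictive model.

```python
import random


def select_batch(observed, n_proteins, n_drugs, batch_size):
    """Greedily choose untested (protein, drug) pairs whose row and column
    have the fewest measurements so far (a proxy for model uncertainty)."""
    row_counts = [0] * n_proteins
    col_counts = [0] * n_drugs
    for p, d in observed:
        row_counts[p] += 1
        col_counts[d] += 1
    chosen = set(observed)
    batch = []
    for _ in range(batch_size):
        candidates = [(row_counts[p] + col_counts[d], p, d)
                      for p in range(n_proteins) for d in range(n_drugs)
                      if (p, d) not in chosen]
        if not candidates:
            break
        _, p, d = min(candidates)   # least-covered pair, ties broken by index
        batch.append((p, d))
        chosen.add((p, d))
        row_counts[p] += 1          # update counts so the batch stays diverse
        col_counts[d] += 1
    return batch


def run_batch(batch, true_effects):
    """Hypothetical robot: returns the measured effect for each request."""
    return {(p, d): true_effects[p][d] for p, d in batch}


rng = random.Random(0)
n_proteins, n_drugs = 6, 8
true_effects = [[rng.random() for _ in range(n_drugs)] for _ in range(n_proteins)]

observed = {}
for round_idx in range(3):
    batch = select_batch(observed, n_proteins, n_drugs, batch_size=4)
    observed.update(run_batch(batch, true_effects))
    print(f"round {round_idx}: {batch}")
```

The chosen wells are scattered across the grid rather than filling it row by row, which is why the slide notes that the robotics must support "random access" to samples and reagents.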



Learning regulatory motifs


Genes contain many subsequences (motifs) that are
involved in their regulation


Lengths and positions unknown


Deleting or mutating motifs can be done to check their
effect


Such experiments are time-consuming and expensive


Number of potential motifs of a given size is roughly the size of
the genome (~3 billion)


Number of gene products that could regulate each
potential motif is ~10 thousand

Learning regulatory motifs


Have full genome sequence and examples of
known regulatory motifs


Build model from known motifs, use it to
identify candidates (both motifs and genes
that regulate them), and prioritize them for
testing
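
One common way to "build a model from known motifs" is a position weight matrix (PWM); the slides do not say which model the course uses, so the sketch below is an assumption chosen for simplicity. It builds log-odds scores from a handful of example motifs, scores every window of a sequence, and returns the highest-scoring candidates for testing. The sequences are toy data, and the sketch ranks motif positions only, not the gene products that might regulate them.

```python
import math


def build_pwm(known_motifs, pseudocount=1.0):
    """Log-odds position weight matrix against a uniform background (0.25)."""
    length = len(known_motifs[0])
    pwm = []
    for i in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for motif in known_motifs:
            counts[motif[i]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / 0.25)
                    for base in "ACGT"})
    return pwm


def rank_candidates(sequence, pwm, top_k=3):
    """Score every window of the sequence and return the top_k candidates
    as (score, start_position, window), best first."""
    width = len(pwm)
    scored = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        scored.append((score, start, window))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


# Toy examples (illustrative data only).
known = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
genome = "GGCGTATAATGCCGTTACGATCGGA"

top = rank_candidates(genome, build_pwm(known))
for score, start, window in top:
    print(f"{window} at position {start}: score {score:.2f}")
```

In an active learning setting, the ranked list would determine which deletion or mutation experiments to run first, and their results would be added to the known examples before the model is rebuilt.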

Approaches for modeling biology


Theoretical/Mathematical models


Biochemical models (reaction mechanisms, rates)


Formal methods (see Sadot et al 2008)


IEEE/ACM Transactions on Computational Biology and
Bioinformatics 5(2):223-234, 2008


Probabilistic formal methods (see Kwiatkowska et al 2008)


ACM SIGMETRICS Performance Evaluation Review
35(4):14-21, 2008


Probabilistic/Empirical

Model verification vs. active learning


Language of formal methods and current
practice of systems biology emphasizes the
role of experimentation in testing/verifying
models


Active learning emphasizes role of
experimentation in improving models

Reading: For this class


NAS report:


Catalyzing Inquiry at the Interface of Computing
and Biology (2005) John C. Wooley and Herbert S.
Lin, editors. National Academies Press.


Read Chapters 1, 2 and 3


Other reference material on website

Reading: For next class


Overview:


Thomas C. Terwilliger, David Stuart, and Shigeyuki
Yokoyama (2009) Lessons from Structural Genomics.
Annu. Rev. Biophys. 38:371-83


Detail:


Ian M. Berry, O. Dym, R. M. Esnouf, K. Harlos, R. Meged,
A. Perrakis, J. L. Sussman, T. S. Walter, J. Wilson and
Albrecht Messerschmidt (2006) SPINE
high-throughput crystallization, crystal imaging and
recognition techniques: current state, performance
analysis, new technologies and future aspects.
Acta Cryst. D62:1137-1149

For next class


Prepare a list of criteria you propose for
evaluating biological automation papers


Evaluate the Berry et al paper according to
those criteria


Submit the above via email to murphy@cmu.edu
before class and bring copy to class


We will discuss proposed criteria and create a
consensus set to be used for future papers