A rapid tour of Bioinformatics

Saurabh Sinha, Lenny Pitt
Bioinformatics, or Computational Biology?


sometimes used interchangeably


latter sometimes includes former


often, latter means molecular modeling to
investigate properties and behaviors of
molecules via computer simulation


often, former refers to application of
databases, algorithms, computational and
statistical techniques to solve problems
arising from the management and analysis of
biological data.
Computational Biology


Example: protein folding


http://www.youtube.com/watch?v=lijQ3a8yUYQ



http://fold.it/portal/

Molecular Biology 101

Cells


Cells are the
fundamental units of
living organisms


Cells are born, do their
jobs, and die


Study of life =
study of cells
Proteins


Many of the processes (chemical
reactions) inside cells are carried out by
proteins
iwrwww1.fzk.de/biostruct/ Assets/1a00x500.jpg
DNA


DNA carries the information on which
proteins to produce in a cell, and how
SOURCE: http://www.microbe.org/espanol/news/human_genome.asp
Chromosome
DNA


DNA is a string written in
the alphabet {A,C,G,T}


Human DNA is a string with
3 billion characters !
adenine, cytosine, guanine (DNA and RNA); thymine (DNA); uracil (RNA)
DNA and Proteins
www.ornl.gov/.../slides/ images/01-0037low.jpg
Genes


Genes are “substrings” of DNA (~1000 bp)


A gene is used as a template for
producing a protein


Each protein comes from a different gene


~25,000 genes in the human DNA


The process of making a protein from a
gene can be regulated in the cell:
GENE REGULATION

The initial successes of
bioinformatics

Some problems & successes
1. Sequence alignment
2. Comparative genomics
3. Sequencing the genome
4. Gene search
5. Evolutionary biology & phylogenetic trees
1. Sequence Alignment
(fundamental question)



Is this string equal to that one?


Does this string contain a copy of that
one?


Is this string “like” that one? How much
alike?


Is this string “like” a portion of that one?
Sequence alignment


Could you have done this task, for two strings of length 1
million characters, by hand ?


Sequence analysis algorithms are the bread and butter of
bioinformaticians.
CS has already studied these!


Is this string equal to that one?


compare two files


Is this string equal to a portion of that
one?


find a word in a document


Is this string “like” a portion of that one?


find suggested spelling corrections
Edit Distance


how much alike are two strings?


CATTGAGCT


CTTAGCCTA



CATTGAGCT-
C-TTAGCCTA


Is this the best possible?



CATTGAGC-T-
C-TT-AGCCTA


Charge one for each mismatch, each
insertion, each deletion.


Problem: find the least cost alignment


Extensions: charge different amounts
for A/C mismatch, for insertion, etc.,
reflecting (un)likelihood of certain
genetic mutations.


There are reasonably efficient
algorithms for all of these problems
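

A minimal sketch of the least-cost alignment problem described above, assuming the unit-cost scheme from the slide (one for each mismatch, insertion, or deletion). It uses the standard dynamic-programming recurrence; the function name `edit_distance` is illustrative and not from the original slides.

```python
def edit_distance(s, t):
    """Least number of mismatches, insertions, and deletions turning s into t."""
    # prev[j] holds the cost of aligning the first i-1 characters of s
    # with the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # aligning i characters of s with the empty prefix of t
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match (cost 0) or mismatch (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("CATTGAGCT", "CTTAGCCTA"))
```

The running time is proportional to the product of the two string lengths. The extensions mentioned above (different charges for particular mismatches or for gaps) would replace the constants 1 and `(a != b)` with a lookup into a cost table.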
2. Comparative genomics



Human and mouse share the genetic
“toolkit” for development


Compare the two genomes and find the
conserved features


These are likely to be of functional
importance


How to compare two genomes ?


Sequence alignment
2. Comparative genomics

http://genome.ucsc.edu/cgi-bin/hgGateway

3. Sequencing the Genome
The Human Genome Project



Human genome: a “string” of length
3,000,000,000 characters !


Starting with a human cell, how can we
obtain this sequence ?


The problem of sequencing


2001
Shotgun Sequencing


Lab technology: can sequence snippets of
1000-2000 nucleotides.


Idea: “shotgun” apart multiple copies of whole
genome, sequence all snippets, reconstruct.
http://en.wikipedia.org/wiki/Shotgun_sequencing



3 billion / 1000 = 3 million snippets.


Want multiple copies divided in different
spots, so many snippets overlap


From overlap, we can tell how things go
together.


Need about 7-fold replication for near-complete coverage

Need large databases !
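

The snippet arithmetic above can be sketched as follows. This is a rough estimate that assumes snippets land uniformly at random on the genome (the Lander-Waterman model, an assumption not stated in the slides); under it, the expected fraction of bases left uncovered at c-fold coverage is about e^(-c). The function names are illustrative.

```python
import math


def snippets_needed(genome_length, snippet_length, coverage):
    """Number of snippets required for the given fold coverage."""
    return math.ceil(coverage * genome_length / snippet_length)


def expected_uncovered_fraction(coverage):
    """Expected fraction of bases hit by no snippet, assuming random placement."""
    return math.exp(-coverage)


if __name__ == "__main__":
    genome, snippet = 3_000_000_000, 1000
    for c in (1, 7):
        n = snippets_needed(genome, snippet, c)
        missed = expected_uncovered_fraction(c)
        print(f"{c}-fold: {n:,} snippets, ~{missed:.4%} of bases uncovered")
```

Under this model, 7-fold replication leaves on the order of 0.1% of bases uncovered, so coverage is very likely rather than strictly guaranteed.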
This is what you get

How is the genome sequenced ?

http://www.wiley.com/legacy/college/boyer/0470003790/cutting_edge/shotgun_seq/computer.gif
Assembly Methods


Greedy approach


Graph approaches:


Hamiltonian path


Traveling Salesman (TSP) in k-mer graph


Eulerian path in k-mer graph
READ:
http://www.cbcb.umd.edu/research/assembly_primer.shtml

Greedy Approach


Merge two snippets with greatest overlap


Repeat
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Problem: may merge repeated segments

(>50% of the human genome consists of repeats)
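

A minimal sketch of the greedy approach, assuming error-free snippets from a single strand. The function names and the three example snippets are hypothetical (the snippets were cut by hand from a short example string); real assemblers must also handle sequencing errors, reverse complements, and the repeat problem noted above.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(snippets):
    """Repeatedly merge the two snippets with the greatest overlap."""
    snippets = list(dict.fromkeys(snippets))  # drop exact duplicates, keep order
    while len(snippets) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(snippets):
            for j, b in enumerate(snippets):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = snippets[best_i] + snippets[best_j][best_k:]
        snippets = [s for idx, s in enumerate(snippets)
                    if idx not in (best_i, best_j)] + [merged]
    return snippets


if __name__ == "__main__":
    print(greedy_assemble(["CATCC", "TCCAG", "CAGTCA"]))
```

Because each merge is chosen locally, two snippets from different copies of a repeated segment can be joined by mistake, which is the failure mode the slide warns about.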
Hamiltonian Path


Create graph


vertices = snippets


edges = overlap
http://www.cbcb.umd.edu/research/assembly_primer.shtml
red edges correspond to repeated segments
Find a path that visits
each vertex exactly once
Other graph approaches


Unknown sequence.


Challenge: here are the “3-mers”:


CAG, ATC, GTC, CCA,


CAT, AGT, TCC, TCA


Max TSP approach


3-mers are vertices


Eulerian Path approach


3-mers are edges
Solution



CATCCAGTCA

Max TSP approach
3-mers sequenced (extracted from unknown sequence):
{ ATC, CCA, CAG, TCC, AGT }
[Figure: graph with the 3-mers ATC, TCC, CCA, CAG, AGT as vertices; each edge is weighted by the overlap between its two 3-mers (0, 1, or 2)]
Find max-weight tour visiting all vertices
Result: ATCCAGT
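
A brute-force sketch of the Max TSP idea on the five 3-mers above: vertices are the 3-mers, the weight of an edge is the overlap between them, and the heaviest path through all vertices spells the sequence. Trying every ordering is only feasible for tiny inputs, and the function names are illustrative, not from the slides.

```python
from itertools import permutations


def overlap(a, b):
    """Longest proper suffix of a that is a prefix of b (at most k-1 for k-mers)."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def max_overlap_path(kmers):
    """Return (sequence, total overlap) for the heaviest path visiting every k-mer once."""
    best_score, best_order = -1, None
    for order in permutations(kmers):
        score = sum(overlap(a, b) for a, b in zip(order, order[1:]))
        if score > best_score:
            best_score, best_order = score, order
    # Spell the sequence by appending the non-overlapping tail of each k-mer.
    sequence = best_order[0]
    for a, b in zip(best_order, best_order[1:]):
        sequence += b[overlap(a, b):]
    return sequence, best_score


if __name__ == "__main__":
    print(max_overlap_path(["ATC", "CCA", "CAG", "TCC", "AGT"]))
```

The number of orderings grows factorially with the number of k-mers, which is one reason the Eulerian formulation is attractive.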
Eulerian paths and k-mers


Collect all k-mers of the sequence (including
multiplicities)


edges are k-mers


vertices are the (k-1)-bp prefix and suffix of each k-mer


find Eulerian path (traverses each edge exactly once)
3-mers sequenced:
{ ATC, CCA, CAG, TCC, AGT }
[Figure: graph with vertices AT, TC, CC, CA, AG, GT and one directed edge per 3-mer: ATC, TCC, CCA, CAG, AGT]
Find tour using all edges: AT, TC, CC, CA, AG, GT spells ATCCAGT
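
A minimal sketch of the Eulerian-path construction, assuming error-free k-mers given with their multiplicities and a graph that actually has an Eulerian path. It uses Hierholzer's algorithm, a standard method that the slides do not name; the function name `eulerian_assemble` is illustrative.

```python
from collections import defaultdict


def eulerian_assemble(kmers):
    """Reconstruct one sequence whose k-mers are exactly the given list."""
    if not kmers:
        return ""
    graph = defaultdict(list)    # (k-1)-mer prefix -> list of (k-1)-mer suffixes
    balance = defaultdict(int)   # out-degree minus in-degree for each vertex
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # A path starts at the vertex with one extra outgoing edge; if every
    # vertex is balanced the graph has a circuit and any start works.
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())  # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit the vertex
    path.reverse()
    if len(path) != len(kmers) + 1:
        raise ValueError("no Eulerian path uses every k-mer")
    return path[0] + "".join(v[-1] for v in path[1:])


if __name__ == "__main__":
    print(eulerian_assemble(["ATC", "CCA", "CAG", "TCC", "AGT"]))
```

Unlike the Hamiltonian and Max TSP formulations, this runs in time linear in the number of k-mers. When several Eulerian paths exist, the sketch returns just one of them, so the reconstruction is not always unique.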
Exercise


Length-9 DNA sequence was deconstructed


3-mers = {GTT, TCG, CGT, TTA, ACG, TTC,
TAC}


Draw graph with directed edges labeled by
these 3-mers, and vertices labeled with the
corresponding 2-mers


Find a directed path through this graph that
crosses each edge exactly once, and write
down the possible original length-9 sequence
that can be reconstructed from the path
4. Gene Search



Find out where the genes are located in this long string


Genes cover ~2% of human genome


Finding them using computer algorithms and statistics
http://www.broad.mit.edu/annotation/argo/help/usecase/index_files/image012.jpg
4. Gene Search



Comparative genomics - similar regions
to known genes for other organisms
likely indicate similar function


Similarity to gene-like patterns


Reverse engineering from expressed
proteins


(http://en.wikipedia.org/wiki/Gene_prediction)
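
One very simple "gene-like pattern" is an open reading frame: a start codon (ATG) followed in the same reading frame by a stop codon (TAA, TAG, or TGA). The sketch below scans the forward strand only and is an illustration under those assumptions, not a real gene finder; human genes are interrupted by introns, so practical tools rely on statistical models. The function name, the `min_codons` parameter, and the example string are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand open reading frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including the start codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```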
5. Evolutionary Biology and
Phylogenetic Trees



See presentation by Jana Sperschneider

21st century biology:
bioinformatics drives the revolution

Special issue of the journal Science, July 1, 2005.

>What Is the Universe Made Of?
>What is the Biological Basis of Consciousness?
>Why Do Humans Have So Few Genes?
>To What Extent Are Genetic Variation and Personal Health Linked?
>Can the Laws of Physics Be Unified?
>How Much Can Human Life Span Be Extended?
>What Controls Organ Regeneration?
>How Can a Skin Cell Become a Nerve Cell?
>How Does a Single Somatic Cell Become a Whole Plant?
>How Does Earth's Interior Work?
>Are We Alone in the Universe?
>How and Where Did Life on Earth Arise?
>What Determines Species Diversity?
>What Genetic Changes Made Us Uniquely Human?
>How Are Memories Stored and Retrieved?
>How Did Cooperative Behavior Evolve?
>How Will Big Pictures Emerge from a Sea of Biological Data?
>How Far Can We Push Chemical Self-Assembly?
>What Are the Limits of Conventional Computing?
>Can We Selectively Shut Off Immune Responses?
>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
>Is an Effective HIV Vaccine Feasible?
>How Hot Will the Greenhouse World Be?
>What Can Replace Cheap Oil -- and When?
>Will Malthus Continue to Be Wrong?
A simple organism
GENE

Raw materials
Environmental signal
Response (protein)
A simple organism
GENE1, GENE2, GENE3
Environmental signal
Raw materials
A simple organism
GENE1, GENE2, GENE3, GENE4, GENE5, GENE6, GENE7, GENE8, GENE9, GENE10
A complex organism
GENE1, GENE2, GENE3, GENE4, GENE5, GENE6, GENE7, GENE8, GENE9, GENE10
Complex circuit of interactions
Do not need more genes; additional complexity
comes from more interconnections among genes
Regulatory networks


Genes are switches, transcription factors are input
signals, proteins are outputs



Proteins (outputs) are the signals for other
genes (switches)


This may be the reason why humans have so
few genes (the circuit, not the number of
switches, carries the complexity)


Bioinformatics can unravel such networks,
given the genome (DNA sequence) and gene
activity information
Decoding the regulatory network



Find patterns (“binding sites”) in DNA sequence


Analyze high throughput measurements of gene
activity levels (“microarrays”)


Analyze measurements of protein-DNA interaction
(“ChIP-on-chip”)


Integration of heterogeneous sources of data
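

As an illustration of the first item, finding binding-site patterns can be sketched as scanning a DNA string for windows that match a short motif with a few mismatches allowed. The motif `GATTA`, the example sequence, and the function name are hypothetical; real binding-site models usually score each position with a weight matrix rather than counting mismatches.

```python
def find_sites(dna, motif, max_mismatches=1):
    """Return (position, window, mismatches) for each near-match of motif in dna."""
    hits = []
    width = len(motif)
    for i in range(len(dna) - width + 1):
        window = dna[i:i + width]
        # Hamming distance between the window and the motif.
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


if __name__ == "__main__":
    for hit in find_sites("ACGTGATTACCGATCA", "GATTA"):
        print(hit)
```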
REGULATORY NETWORK DISCOVERY

http://www.chiponchip.org/Images/scheme_800x600_crop.jpg
Microarrays
ChIP-on-chip
Patterns in DNA sequence
“How does a single somatic
cell become a whole plant ?”

Developmental biology


The timeline from a single cell (with
genetic material from mother and father)
to a multicellular embryo, and to an
adult


A paradox: all cells in the adult body
have the same DNA, so how can
different cells be different?
How does a single cell lead to this ? …
… and to this ?
Drosophila
(fruitfly)
Answer: Regulatory networks
(Again !)


Bioinformatics used to scan entire genome for
regions that participate in “segmenting” the
embryo


Hidden Markov models, a popular technique in
signal processing, used to detect such regions


Multiple species comparison aids discovery
“How did cooperative behavior
evolve?”

Cooperative social behavior


What is the genetic (molecular) basis of social behavior ?


Social behavior in honey bees


Young worker bees are nurses in the hive; older ones go
out to forage


This behavioral pattern is determined by needs of colony


How do the bees know ?
Bioinformatics of social
behavior


UIUC team scanned the honeybee
genome to understand this


Regulatory network of social behavior


Statistical tools, machine learning,
sequence analysis used for this project
“How will big pictures emerge
from a sea of biological data?”

The sea


Genomes: 3 x 10^9 bp of human genome


Similar numbers for other genomes: mouse, rat,
dog, chicken, chimp etc.


Microarray: snapshots of 1000s of genes’
activities at one time and condition. Thousands
of microarrays.


ChIP-on-chip data: measurements of a
transcription factor’s binding affinity for 1000s of
genes (promoters).
Segal et al. Nature Genetics 2005.
Big pictures
A compendium of cancer
genes and their regulation
The sea of biological data


Biological literature, capturing decades of
painstaking experimental work on genetics
and molecular biology


Can we glean useful information from this
vast body of knowledge ?


Biological literature mining.


Natural language processing


Text Information Retrieval (statistical approaches)
Some other challenges


Protein structure prediction


Can we predict the 3-D structure of a protein
from its sequence ?


Why ?


One good reason: structure gives clues about
function. If we can tell the structure, we can
perhaps tell the function


We can design amino acid sequences that will fold
into proteins that do what we want them to do.
Drug design !!


Neural networks, a popular technique in
computer science, applied to this problem
Some other challenges


“Metagenomics”


Most studies to date are on genomes of one
species


A sample from the soil contains hundreds of
bacteria, thousands of viruses. Can we study
all of these ?


Bioinformatics is indispensable !!


New types of data, new types of algorithms
Many more challenges


New types of data arise from
technological breakthroughs in biology


High-throughput data carry an
unprecedented amount of information


Too much noise


Bioinformatics helps filter out the noise and
reveal the underlying signal
Bioinformatics


Is not about one problem (e.g., designing
better computer chips, better compilers,
better graphics, better networks, better
operating systems, etc.)


Is about a family of very different problems,
all related to biology, all related to each other


How can computers help solve any of this
family of problems ?
Bioinformatics and You


You can learn the tools of bioinformatics


These tools owe their origin to computer
science, information theory, probability
theory, statistics, etc.


You can learn the language of biology,
enough to understand what the problems are


You can apply the tools to these problems
and contribute to science