Whirlwind tour of bioinformatics

Bioinformatics: whirlwind intro (DRAFT)
James A. Foster
11 September 2000

1 What is Bioinformatics?
2 Why is Bioinformatics hot?
3 What questions does Bioinformatics answer?
3.1 How to convert information to data
3.2 Questions about sequences
3.2.1 Comparing two sequences
3.2.2 Finding similar sequences in huge databases
3.2.3 Finding similarities in collections of sequences
3.2.4 Organizing all that data
3.2.5 Phylogenetic relationships
3.2.6 Predicting 2D structure and function from sequences
3.2.7 Classifying sequences
3.2.8 Micro Array Data
3.2.9 Annotation
3.3 Questions about molecules
3.3.1 Predicting 3D structure from sequences
3.3.2 Predicting function
3.3.3 Predicting behavior in larger systems
3.4 Questions about systems
3.4.1 Regulatory systems
3.4.2 Development (ontogeny)

1 What is Bioinformatics?
Bioinformatics (bio+informatics=life+computer science): Development,
implementation, and analysis of tools and techniques (computational and mathematical)
for analyzing biological data.

A broader suggestion, less bio-centric: Development, implementation, and analysis of
tools and techniques (computational and mathematical) for analyzing biological data and
systems similar to biological systems.
2 Why is Bioinformatics hot?
1. Scientists gather biological data with techniques T, analyze them with tools A, which
produces knowledge K.
a. K is very valuable: helps us understand our world and ourselves,
improves health, leads to advances in biotechnology, etc. So: lives are on
the line, and so are mountains of money.
b. T has moved from slow, expensive techniques only doable by highly
trained experts to very fast, easy, automated techniques that even idiots
and computer scientists can perform. So: there are tons of data, and
more coming.
c. A is very sparse. There are very few tools, and most of them require a
great deal of expertise to use correctly. Most tools are only accurate if pigs
can fly. So: there are only a couple ounces of tools.
2. To overcome bottleneck A, one must have solid grounding in mathematics, computer
science, and biology, which very few people have. Even real collaborations are
difficult, since the three disciplines speak different languages, and practitioners think
differently. So: the analytical bottleneck requires very rare skills to address.
3 What questions does Bioinformatics answer?
We can roughly address this by considering different types of biological data.
3.1 How to convert information to data
To sequence DNA, one breaks the DNA into pieces of about 500 bp to 2 kbp, breaks these
into smaller sequences, sequences the pieces, sorts the sequences by size, reads the last
nucleotide on each sequence, then reassembles the information from an analog signal
into the familiar sequence of nucleotides.
Automated sequencers take as input all initial fragments of a piece (called what?). For
example, if the piece is 1kbp the input will be 1000 fragments of length from 1 to 1000
nucleotides, each of which is the same as the first part of the original piece. These
fragments have a fluorescent (right?) nucleotide at the end, which radiates a different
color for each nucleotide when stimulated. The fragments assort by size on either a gel or
a capillary, so that the shorter and lighter ones reach the sensor sooner than the
longer and heavier ones. The sequencer shines a laser on each fragment as it
passes the sensor, which causes the end nucleotide to glow with a characteristic color.
Light sensors read this emission and record the intensity of each of the four colors as raw
data. The output of an automated sequencer is a file containing four intensity traces, one
for each of the four possible nucleotides, with the amplitude corresponding to the
intensity of light in that color as the gel or capillary contents moved past the sensor.
Bioinformatics algorithms break this analog signal into a sequence of real values
corresponding (hopefully) to nucleotides. They then determine which signal actually
came from the nucleotide at that position, which is difficult when different sensors
detect approximately the same signal strength from the glowing DNA. The scientist, with
more algorithmic tools, then edits these sequences to remove obvious problem areas
(usually at the beginning and end of a sequence). Next the pieces must be reassembled
into the original sequence. This is difficult since the process of splitting the original
fragment into pieces does not preserve information about where in the original sequence
they came from.
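As a rough sketch of the base-calling step, one can take the brightest of the four colors
at each position and refuse to call a base when the two strongest colors are too close.
The channel order, the sample intensities, and the 1.5x margin below are assumptions made
up for illustration; real base callers are far more careful.

```python
# Naive base caller: pick the brightest of four color channels per position.
# CHANNELS order, the sample data, and min_ratio are illustrative assumptions.
CHANNELS = "ACGT"

def call_bases(intensities, min_ratio=1.5):
    """Return one base per sample; 'N' when the top two channels are too close."""
    calls = []
    for sample in intensities:
        ranked = sorted(zip(sample, CHANNELS), reverse=True)
        (best, base), (second, _) = ranked[0], ranked[1]
        if second > 0 and best / second < min_ratio:
            calls.append("N")  # ambiguous: two sensors report similar strength
        else:
            calls.append(base)
    return "".join(calls)

peaks = [
    (0.9, 0.1, 0.0, 0.1),   # clearly A
    (0.1, 0.8, 0.1, 0.0),   # clearly C
    (0.5, 0.1, 0.45, 0.0),  # A and G nearly tied
    (0.0, 0.1, 0.1, 0.7),   # clearly T
]
print(call_bases(peaks))  # ACNT
```

The third sample is the hard case: two sensors report nearly the same strength, so the
sketch reports N rather than guess.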
3.2 Questions about sequences
DNA sequences and protein sequences are fundamental biological data. Some
biologically inspired technologies, such as evolutionary computation, also create or
analyze inherently sequential data, such as bitstrings, machine code, and most programs.
Bioinformatics answers questions about (at least) the following issues.
3.2.1 Comparing two sequences
String matching algorithms are inadequate. One compares biological sequences in order
to find homologies, rather than similarities. Homologous sequences share a common
evolutionary history, and are related to each other by descent from a common ancestral
sequence (not necessarily a common ancestral organism). Different characters in
biological sequences also have physical properties that matter and must be considered.
For example, in DNA an A is more similar to a G than to a T (both are purines), and
GAAAAAC is more similar to GAC than to GCCCCCC (slippage can happen in DNA
replication).
Global alignments compare strings and find the best way to match up all of their
characters. Local alignments find the best way to match up parts of sequences. Global
alignments are important when two sequences are related in their entirety. Local
alignments are important for finding shared elements in different sequences, or for
looking up one small sequence in another very long one (such as a whole genome!).
Substitution matrices
To compare sequences, one needs a substitution matrix, which assigns relative costs to
different types of mismatches. Building these matrices and selecting an appropriate one
are important Bioinformatics topics. Often, the cost of a mismatch is derived from the log
of the probability of seeing that particular mismatch in the system from which the
substitution matrix was derived (typically taken relative to the probability of seeing it
by chance).
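In practice such scores are usually log-odds ratios: the log of how often a pair is
observed in related sequences divided by how often it would turn up by chance. A minimal
sketch, assuming base-2 logs and invented frequencies (neither comes from a real matrix):

```python
import math

def log_odds(pair_freq, freq_a, freq_b):
    """Score = log2(observed pair frequency / frequency expected by chance)."""
    return math.log2(pair_freq / (freq_a * freq_b))

# Hypothetical numbers: both characters have background frequency 0.25,
# so a chance pairing has frequency 0.25 * 0.25 = 0.0625.
print(log_odds(0.125, 0.25, 0.25))    # 1.0: seen twice as often as chance
print(log_odds(0.03125, 0.25, 0.25))  # -1.0: seen half as often as chance
```

Positive scores mark substitutions seen more often than chance, negative scores mark
ones seen less often.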
PAM (point accepted mutation) matrices include costs for mismatches at different
levels of divergence between two sequences. More divergence means a mismatch costs
less, because one expects more mismatches for more distantly related sequences. A
PAM-n matrix has costs for mismatches in sequences that have accumulated n accepted
point mutations per 100 elements. PAM-1 was determined empirically by analyzing statistics
over closely related protein sequences aligned by hand, and higher n are extrapolations.
(PAM200 and PAM250 are common).
BLOSUM matrices were derived by statistical analysis from collections of aligned,
conserved regions of distantly related protein sequences (blocks). The BLOSUM-n matrix
treats sequences that are more than n% identical as essentially the same, so that
substitution costs are mostly due to sequences in the blocks that differ more than that
from the others. (BLOSUM62 is common).
Gap penalty models account for regions in which sequences do not match, where one
must introduce a gap in one sequence that is not aligned with anything in the other.
Uniform gap penalties assess a single cost for any character which is not matched. Affine
gap penalties assess one cost for beginning a gap, and another penalty for extending it.
Other models are possible. Gap penalty models are usually chosen to make analysis
easier, or to make programming easier, rather than for biological reasons. There are no
good models for gap penalties.
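The difference between the two models is easy to see in code. This is a sketch with
made-up costs; it also assumes the convention that the opening cost covers the first
gapped character, which is only one of the conventions in use.

```python
def uniform_gap_cost(length, per_char=2):
    """Every unmatched character costs the same."""
    return per_char * length

def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """One cost to begin a gap, a smaller one for each further character.

    Assumed convention: gap_open pays for the first character of the gap.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)

# One gap of six characters versus six separate one-character gaps.
print(uniform_gap_cost(6), 6 * uniform_gap_cost(1))  # 12 12
print(affine_gap_cost(6), 6 * affine_gap_cost(1))    # 10 30
```

The uniform model cannot tell one long gap from many short ones; the affine model
prefers the single long gap.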
Statistical significance of comparisons
One needs to know how reasonable a comparison is: how likely the comparison
observed is to have happened by chance. This requires some model of change, or
evolution, which relates the sequences, or at least some model of the distribution of
changes. Statistical things that I don't understand (yet) come into play here, like extreme
value distributions and Gibbs sampling.
The degree of similarity between two sequences is not a statistically or biologically
interesting measure of significance. One must account for the underlying model of
change, or the distribution of changes in some population of similar sequences when
quantifying significance.
Dynamic programming
Dynamic programming algorithms compare two sequences and find the optimal
alignment between them. The optimal alignment minimizes the cost, given a specific
substitution matrix. DP algorithms are efficient in computer science terms (usually
O(n^2)), but far too slow in practice for long sequences or large numbers of comparisons.
The Needleman-Wunsch (NW) algorithm was the first global DP alignment algorithm for
two strings. It compares strings from left to right, at each step finding the best alignment
that extends the previous alignment.
Picture should go here.
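Until the picture exists, here is a minimal sketch of a global DP alignment in the NW
style. It maximizes a score rather than minimizing a cost (the two are equivalent up to
sign), and it assumes a toy scoring scheme of +1 match, -1 mismatch, and a uniform -1 gap
instead of a real substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # a prefix aligned against nothing
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):          # fill the table left to right
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))

total, top, bottom = needleman_wunsch("GAAAAAC", "GAC")
print(total)   # -1: three matches and four gapped characters
print(top)
print(bottom)
```

Several alignments can share the optimal score; the traceback above returns just one of
them.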
The Smith-Waterman (SW) algorithm adapts NW to be a local DP alignment algorithm. It
never lets a partial score fall below zero, so an alignment can begin anywhere in either
string, and it recovers the best local alignment by tracing back from the highest-scoring
position, right to left, until the score reaches zero.
Picture should go here.
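In place of the picture, a minimal sketch of local DP alignment in the SW style. The two
changes from a global alignment are that scores are clamped at zero and that the traceback
starts at the best cell anywhere in the table. The scoring values (+2 match, -1 mismatch,
-1 gap) are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_part_of_a, aligned_part_of_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                        # start a fresh alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("CCCCGATTACACCCC", "TTGATTACATT"))
# (14, 'GATTACA', 'GATTACA')
```

The flanking characters that do not match are simply left out of the result, which is
what makes the alignment local.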
Heuristics
Dot-plot algorithms place sequences on a matrix and compare subsequences in a sliding
window along the two edges of the matrix. Each point in the matrix is the score of the
alignment of two subsequences in those windows. Similar regions show up as strong
diagonal lines.
Picture should go here.
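In place of the picture, a small text-mode sketch. It marks a cell whenever the two
windows starting there agree in enough positions; counting identical characters is an
assumption made for brevity, where a real tool would score the windows with a
substitution matrix. The window and threshold values are also arbitrary.

```python
def dot_plot(a, b, window=3, threshold=3):
    """Return rows of '*' and '.'; '*' marks windows with >= threshold identities."""
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            identities = sum(x == y for x, y in
                             zip(a[i:i + window], b[j:j + window]))
            row += "*" if identities >= threshold else "."
        rows.append(row)
    return rows

for line in dot_plot("GATTACAGATTACA", "TTGATTACATT"):
    print(line)
```

Runs of stars along a diagonal are the similar regions; repeats in one sequence show up
as parallel diagonals.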
FASTA is a heuristic (CS jargon for rule of thumb) local alignment algorithm, usually
used for database search. It matches subsequences using a scoring matrix, identifies likely
matching regions, then does SW on those regions. FASTA produces very good statistics
on the significance of a match.
BLAST is another heuristic. BLAST uses partial matches between subsequences, but only
uses the scoring matrix if the similarity is high enough. Early versions of BLAST did not
consider gaps, but newer versions do.
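Both heuristics get their speed from the same trick: find short word matches first, and
only spend alignment effort near them. The sketch below shows just that seeding step with
exact words of length k. It is a simplification and not the actual FASTA or BLAST code;
the function names and the choice k=3 are hypothetical.

```python
from collections import defaultdict

def index_words(sequence, k):
    """Map every length-k word in the sequence to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact shared word of length k."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds

print(find_seeds("GATTACA", "TTACAG"))  # [(2, 0), (3, 1), (4, 2)]
```

The three seeds share the same offset (query position minus subject position is 2), so
they lie on one diagonal; that is the kind of region a heuristic would hand on to a
slower, more careful alignment.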
3.2.2 Finding similar sequences in huge databases
Usually, one has a sequence and performs local comparisons with everything in the
database, using this sequence. Each comparison produces a scored alignment, and the
search returns sufficiently high-scoring sequences. Database lookup is sequence
comparison. One must understand the significance of the scores returned by a database
query.
3.2.3 Finding similarities in collections of sequences
Multiple sequence alignment algorithms line up several sequences so that, hopefully,
homologous regions are in the same columns. There is no efficient algorithm for
optimally aligning more than two sequences, for a given scoring matrix. This is a
mathematical limitation, not a biological one. Most multiple sequence alignment
algorithms are not optimal, and the likelihood that their alignments reflect biological
reality is very hard, if not impossible, to assess. Bioinformaticians do these assessments
and design these algorithms.
Once one has an alignment, one can use the conserved regions to determine which
sequences diverged from a common ancestor more recently than others: the more similar
two sequences are, the less time since they diverged.
n-dimensional DP adapts the optimal DP alignment algorithms to n sequences.
Unfortunately, it is not computationally feasible even in a theoretical sense (it is
O(k^n) for n sequences of k characters).
Progressive alignments begin with a seed alignment and align it with whole new
sequences in some pre-defined order, often determined by a guide tree. For example,
CLUSTAL W creates a phylogeny (putative evolutionary history) from all pairwise
distances between sequences using the neighbor-joining algorithm of Saitou
and Nei. It then combines alignments from the leaves of the tree to the root with DP. The
progressive alignment technique of Feng and Doolittle, used by the PILEUP program,
begins with pairwise distances and builds its guide tree with the clustering algorithm of
Fitch and Margoliash.
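To make the guide tree idea concrete, here is a sketch that turns pairwise distances into
a join order. It uses plain average-linkage clustering as a stand-in, which is neither
neighbor-joining nor the Fitch and Margoliash method, and it assumes equal-length
sequences so that distance can be the fraction of differing positions. The sequences are
invented.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of positions that differ; assumes equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def guide_tree(seqs):
    """seqs maps name -> sequence. Returns nested tuples giving the join order."""
    clusters = {name: [name] for name in seqs}  # tree label -> member names

    def dist(c1, c2):
        # Average distance over all cross-cluster pairs of sequences.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(p_distance(seqs[x], seqs[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merged = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = merged  # closest pair becomes one subtree
    return next(iter(clusters))

toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "CCCCAAAA", "D": "CCCCAATT"}
print(guide_tree(toy))  # (('A', 'B'), ('C', 'D'))
```

A progressive aligner would then align A with B, C with D, and finally the two
alignments with each other, following the tree from the leaves to the root.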
Iterative alignment algorithms begin with a complete alignment, and make improvements
to it. For example, Barton and Sternberg first align all the sequences ranked by their
similarity to the profile of sequences aligned so far. They then remove each sequence in
turn, build a profile for the remaining sequences, and re-align the removed sequence to
the new profile. MultiAlin first aligns pairs of sequences, then clusters these pairs into a
guide tree and recursively combines alignments, recomputes scores, and adjusts the guide
tree.
Hidden Markov models (HMMs) build a statistical model of alignments, accounting for
matching states, insertions, and deletions. One then uses the HMM to guide alignment of
new sequences, by finding the most probable way to fit the new sequences to the
model, using dynamic programming. This is a very powerful technique.
3.2.4 Organizing all that data
I havent a clue how this is done. The big databases look like immense flat files to me.
But surely this cant be true.
There is no checking for accuracy when sequences are submitted. Only the submitter can
modify sequences in Genbank. So, there is a serious data integrity problem.
Some databases store structures, rather than data. For example, some have HMMs to
represent models of conserved regions.
3.2.5 Phylogenetic relationships
All organisms, and therefore all their DNA and proteins, are related by common ancestry,
some more closely than others. The phylogeny of a collection of taxa (either organisms or
sequences or molecules or whatever) is a putative evolutionary relationship. A
phylogenetic tree is a tree structure in which branches coalesce at nodes representing
putative ancestors.
One forms these trees from sequence data by identifying highly conserved regions in the
sequences. These conserved regions indicate possible evolutionary relationships. For
such inferences to be sound, multiple sequence alignment algorithms must accurately
identify homologous regions. Alignment and phylogeny go together inextricably, and
we don't know how to do either.
There are two basic categories of algorithms for inferring phylogenies from sequences.
Parsimony techniques posit phylogenies (such alliteration!) that minimize the number of
changes necessary to explain the sequences. The maximum parsimony tree is the one in
which the smallest number of changes possible occur along the branches of the tree, or
the changes with the smallest cost according to some substitution matrix. These
algorithms are generally fast, but the assumption of parsimony is unjustified by
evolutionary models and the statistical significance of the resulting tree is hard to assess.
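Counting the changes a given tree needs is the easy part, and Fitch's algorithm does it
for one aligned column at a time. The sketch below assumes a rooted binary tree written
as nested pairs of leaf names and unit cost for every change; the leaf names and states
are invented. Searching over all possible trees is the hard part and is not shown.

```python
def fitch(tree, leaf_states):
    """Return (candidate ancestral states, minimum number of changes).

    tree is a leaf name or a pair (left_subtree, right_subtree).
    leaf_states maps each leaf name to its character at one aligned column.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:                       # children can agree: no change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

column = {"A": "A", "B": "A", "C": "G", "D": "G"}
print(fitch((("A", "B"), ("C", "D")), column)[1])  # 1 change
print(fitch((("A", "C"), ("B", "D")), column)[1])  # 2 changes
```

For this column the first tree is the more parsimonious one; a full analysis would sum
the counts over every column of the alignment.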
Maximum likelihood algorithms attempt to recover the tree that is most probable, given
an underlying model of evolution together with the observed data. These are generally
very hard to compute, in that one must simulate a great number of alternative phylogenies
in order to assess the confidence of the one that the algorithm produces. They are also
sensitive to the underlying model of evolution. However, reliable statistical techniques
exist and are constantly improving. Most people use maximum likelihood when they
have few enough taxa for it to be computationally feasible.
3.2.6 Predicting 2D structure and function from sequences
Some sequences represent strings of molecules that fold into a 2D structure.
DP algorithms exist to predict the least costly series of 2D manipulations for a given
sequence. One must model manipulations (such as hairpins, bulges, loops, and
dangling ends) and their costs. This cost modeling is a biological problem, but the
structure prediction problem is algorithmic.
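A simple relative of these algorithms shows the shape of the DP. Instead of minimizing a
modeled cost, the sketch below just maximizes the number of base pairs in an RNA sequence
(a Nussinov-style recurrence). That swap is a deliberate simplification, and the allowed
pairs and the minimum hairpin loop of 3 are assumptions.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs, with >= min_loop bases in each hairpin."""
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]  # best[i][j]: pairs within rna[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case 1: j is unpaired
            for k in range(i, j - min_loop):        # case 2: j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAACCC"))  # 3: a stem of three G-C pairs around AAA
```

Real predictors replace the "+ 1" with energy terms for each kind of manipulation, which
is where the biological cost modeling comes in.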
3.2.7 Classifying sequences
Some sequences form molecules with particular properties, or are closely associated with
certain classes of organisms (or sets of genes). One frequently needs to infer the class or
function of a molecule from its sequence. For example, one might want to determine the
type of protein formed by an amino acid sequence that one has just extracted from an
organism, in order to determine its function.
Some regions of sequences also have particular structure or function. Some things that a
Bioinformatics algorithm might identify are: regulatory regions, introns, genes, alpha
helices and beta sheets, and transmembrane subsequences.
In artificial intelligence, such problems are known as classification problems, and there
are several different approaches, including: artificial neural nets, genetic algorithms,
genetic programming, suffix trees and parsing. One can also use statistical approaches
such as hidden Markov modeling and expectation maximization, provided one has an
underlying model for the target object.
Gene discovery classifies raw sequence data as either being genic (meaning that it is
expressed as a protein) or non-genic. Big pharmas run gene discovery algorithms in
order to find possible targets for pharmaceuticals. They then patent the genes literally by
the thousands, on the off chance that they might be lucrative some day.
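The crudest possible gene finder just looks for open reading frames: a start codon
followed, in the same frame, by a stop codon. The sketch below does only that, on the
forward strand, and ignores introns, the reverse strand, and every statistical signal that
real gene discovery relies on. The minimum length and the example sequence are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of ATG...stop stretches, forward strand only."""
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # look for the next ORF
    return orfs

dna = "CCATGAAACCCGGGTAACC"
for begin, end in find_orfs(dna):
    print(begin, end, dna[begin:end])  # 2 17 ATGAAACCCGGGTAA
```

Anything this simple produces many false positives on real genomes, which is why gene
discovery is treated as a classification problem.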
3.2.8 Micro Array Data
Micro arrays are small glass plates, about the size of one's thumbnail. Short sequences,
known as probes, are anchored to the arrays in a regular pattern. Some arrays have the
probes laid down very carefully and densely, and are very expensive. Some attach the
probes to beads, which are then attached to the glass with a little less precision and a lot
less cost. One can match a biological sample, usually DNA, to the probes by labeling the
sample with radioactive or fluorescing molecules (right?), then washing the sample over
the micro array. The probes complementary to strands in the sample will bind to the
sample. One can then recognize the presence of a probe's complement in the sample by
finding where the micro array glows.
Data from micro arrays is a 2D matrix of intensity levels, which indicates which probes
bound to the sample and at what strength. Probe data contains a great deal of noise: some
high and low intensity signals are actually manufacturing flaws, placement is not
uniform, and there are other errors. I have heard that micro array data is not reliable
because one can re-run the same sample over the same array and get very different data.
But, classifying the content of the sample from the micro array data is a hot
Bioinformatics area. So is designing probes for micro arrays.
3.2.9 Annotation
An annotated sequence identifies the structural and functional components of the
sequence. For example, an annotated protein sequence identifies the helices and sheets.
Annotated gene sequences (and protein sequences) identify promoters, introns, and more.
Many sequences are poorly or inadequately annotated when added to the public
databases. Moreover, as new data becomes available, more accurate annotations can be
made. Automated annotation systems are badly needed, since hand annotations cannot
keep pace with new data. But annotation requires classification, and we do not know
enough either algorithmically or biologically to do this well.
3.3 Questions about molecules
3.3.1 Predicting 3D structure from sequences
Sequences wad up in 3D. DNA coils and supercoils. Proteins bend and fold. They tend
to assume structures very reliably, meaning that the same sequence almost always
assumes the same structure. But it is very hard to make this prediction. We don't know
enough chemistry or computer science.
Many of the best algorithms use threading, which uses knowledge about the structure of
previously classified proteins to restrict the possible conformations of a new protein. This
is a starting point for predicting the new structure. Very hard to do.
3.3.2 Predicting function
Even harder. But conserved structure hints at conserved function, so one can often predict
the function of a protein by finding other proteins of similar structure.
But, similarly shaped proteins can have very different functions, and very differently
shaped proteins can have very similar functions.
3.3.3 Predicting behavior in larger systems
How does one predict how several proteins will interact, given their sequences?
Very hard to say.
3.4 Questions about systems
3.4.1 Regulatory systems
Genes switch on and off at different times, and produce proteins at different rates,
depending on their chemical environment. The order in which genes are expressed, and
the strength of that expression, are critical biological phenomena. How does one model or
predict these control reactions?
Metabolic pathways are interesting, too, whatever they are.
3.4.2 Development (ontogeny)
The regulation of some genes determines how a single cell, the fertilized egg, develops
into very complicated collections of highly differentiated tissues. How does one extract
this process from the biological data?