Bioinformatics

underlingbuddhaΒιοτεχνολογία

2 Οκτ 2013 (πριν από 3 χρόνια και 10 μήνες)

74 εμφανίσεις

Computational Questions

Bioinformatics

Where CS and Biology Meet


Bioinformatics: Applications of CS to
the life sciences


What are the computational issues?


Storage and retrieval of genetic data, data
mining, tools


Analysis of genetic data: similarities,
differences, structure


Processing experimental data

Problem Solving in

Computer Science


Program: Sequence of instructions that
perform a particular
task


Task (problem) expressed as: Given
data (input), produce results (output)


From problems to programs


Formulate the problem


Develop and verify an
algorithm


Write and test the program

Algorithm Analysis


Algorithm: Conceptual/theoretical form
of a program


What is analyzed?


Correctness: does it solve the problem?


Complexity: how much resources (time
and memory) does it consume?


Tradeoffs: sometimes, we need to sacrifice
correctness for efficiency

Example 1: Searching for an
Element in a List


Problem formulation:


Input: sorted list L of n elements (e.g.,
names) and a target element x


Output: the position of the target element
if it exists in the list


Possible algorithms


Linear search


Binary search

Linear Search


Algorithm:


For each element in L (from the first to the
last element), compare it with x and return
the position if equal


Time complexity:


Up to n comparisons performed


On the average, n/2 comparisons


Runs in linear time (proportional to the list
size n)

Binary Search


Algorithm:


Compare middle element of the list with x,
return the position if equal; if not, reduce
the list to either the lower half or the upper
half of original list; repeat the process


Time complexity


Up to log
2
n comparisons performed


Runs in logarithmic time

Linear vs Logarithmic Time

n

n/2

log n

10

5

4

100

50

7

1000

500

10

1,000,000

500,000

20

2,000,000

1,000,000

21

Comparing Running Times


Exercise: tabulate values of the following
run
-
time functions for different values of n


Functions:


log n (logarithmic)


n (linear)


n
2

(quadratic)


n
3
(cubic)


2
n

(exponential)


n!

Example 2: Substring Search


Problem formulation:


Input: Strings s and t of characters


Output: If s is a substring of t, its position
in t


Example:


Input: s = “ctct”, t = “agtctcttctaac”,


Output: 4


Algorithm? Time Complexity?

Example 3: Traveling
Salesman


Problem Formulation:


Input: n cities, distances between cities


Output: shortest tour of all cities


Algorithm:


Consider all permutations of the cities,
compute total distances for each
permutation, select the minimum among
all total distances

Exponential Algorithms and
Intractable Problems


The Traveling Salesman problem is an
example of an intractable (“NP
-
complete”)
problem


Characterized by:


The existence of a correct exponential algorithm


No known polynomial algorithm


Exponential algorithm is impractical. Now
what?

Heuristics


There are polynomial algorithms for
intractable problems that do not always yield
the correct answer


Example: Start with any city, go to the
nearest unvisited city, repeat process


Not always correct. Counterexample?


Selection of nearest city is called a
heuristic


Compromise: Can prove some statements on
the (incorrect) algorithm and that may be
enough in practice

Back to Bioinformatics:

Some Objectives


Formulate problems relevant to biology


Devise/understand algorithms for these
problems


Computer scientists and biologists need to
talk more


Computer scientists have a tendency to make
(often unreasonable) assumptions


Biologists may place too much faith on results
returned by automated systems

Overview: Selected Problems
in Bioinformatics


Sequence alignment


Phylogeny


Dealing with experimental results

BLAST Search

Blast Results

DNA Sequence Databases


Data representation, integrity, accuracy


Search and scoring methods


Meaning and reliability of results


e.g., how does BLAST

(Basic Local Alignment Search Tool)
respond to random data?


Sequence Alignment Problem


Given two nucleotide sequence, obtain
an optimal alignment between the
sequences



Example:

AT
-
C
-
TGAT




-
TGCAT
-
A
-

Dynamic Programming

T

G

C

A

T

A

A

T

C

T

G

A

T

Phylogeny


Construction of phylogenetic trees
based on genomic distance


Problems to be solved:


Determining genomic distance


Tree construction from the distances

Determining Genomic Distance


Given two genomes, determine the
number of mutations necessary to
obtain one from the other


Common distance model (least number
of mutations)


Mutation on the genome level:
rearrangement (sorting!) operations on
permutations

Sorting Permutations and a
Graph Theoretic Model

0 3 5 6 7 2 1 4 8 9

0 1 2 7 6 5 3 4 8 9

Phylogenetic Tree
Reconstruction


Given a set of species and genomic
distances between the species,
construct a phylogenetic tree that is
(most) consistent with the distances



Problem shown to be NP
-
complete


This means we should try some heuristics

Phylogenetic Tree

Mouse

Human

Monkey

Experimental Results


Image or data directly drawn from a
device


e.g., microarray, scanner


Need to make objective, discrete
conclusions


e.g., pixel intensity vs. gene expression


Need to handle errors and
imperfections

Microarray Image

Image Analysis to Aid
Microarray Experiments


Automatically locating the grid of spots


Use Fourier transforms to compute periods
and offsets


Extracting intensity


Refine spot sample to collect significant,
normalized data


Make conclusions on genetic function

Summary


Bioinformatics: a perfect opportunity for
interdisciplinary research within the
sciences


Academics from the different
backgrounds need to study, discuss,
debate with each other