# Bioinformatics in Computer Sciences at NJIT - Department of ...

Biotechnology

Oct 2, 2013 (4 years and 7 months ago)

143 views

Lecture 1

BNFO 601

Usman Roshan

Course overview

Perl progamming language (and some Unix basics)

Unix basics

Intro Perl exercises

Dynamic programming and Viterbi algorithm in Perl

Sequence analysis

Algorithms for exact and heuristic pairwise alignment

Hidden Markov models

BLAST

Short read and genome alignments

Genome
-
wide association studies (if time permits)

Overview (contd)

Grade: Two 25% exams, 35% final exam, and
two 15% projects

Exams will cover Perl and bioinformatics
algorithms

Recommended Texts:

Introduction to Bioinformatics Algorithms by Pavel
Pevzner and Neil Jones

Biological sequence analysis by Durbin et. al.

Introduction to Bioinformatics by Arthur Lesk

Beginning Perl for Bioinformatics by James Tisdall

Introduction to Machine Learning by Ethem Alpaydin

Nothing in biology makes sense,
except in the light of evolution

AAGACTT

-
3 mil yrs

-
2 mil yrs

-
1 mil yrs

today

AAGACTT

T_GACTT

AAGGCTT

_GGGCTT

TAGACCTT

A_CACTT

ACCTT

(Cat)

ACACTTC

(Lion)

TAGCCCTTA

(Monkey)

TAGGCCTT

(Human)

GGCTT

(Mouse)

T
_
GACTT

AAG
G
CTT

AAGACTT

_
G
GGCTT

T
AG
A
C
CTT

A
_
C
ACTT

AAGGCTT

T_GACTT

AAGACTT

TAG
G
CCTT

(Human)

TAG
C
CCTT
A

(Monkey)

A_C
_
CTT

(Cat)

A_CACTT
C

(Lion)

_G
_
GCTT

(Mouse)

_GGGCTT

TAGACCTT

A_CACTT

AAGGCTT

T_GACTT

AAGACTT

Representing DNA in a format
manipulatable by computers

DNA is a double
-
up of four nucleotides:

Cytosine (C)

Thymine (T)

Guanine (G)

Since A (adenosine) always pairs
with T (thymine) and C (cytosine)
always pairs with G (guanine)
knowing only one side of the ladder is
enough

We represent DNA as a sequence of
letters where each letter could be
A,C,G, or T.

For example, for the helix shown here
we would represent this as CAGT.

Transcription and translation

Amino acids

Proteins are chains of

amino acids. There are

twenty different amino

acids that chain in

different ways to form

different proteins.

For example,

FLLVALCCRFGH

(this is how we could store

it in a file)

This sequence of amino

acids folds to form a 3
-
D

structure

Protein folding

Protein folding

The protein folding

problem is to determine

the 3
-
D protein structure

from the sequence.

Experimental techniques

are very expensive.

Computational are cheap

but difficult to solve.

By comparing sequences

we can deduce the

evolutionary conserved

portions which are also

functional (most of the time).

Protein

structure

Primary structure: sequence of

amino acids.

Secondary structure: parts of the

chain organizes itself into alpha
helices, beta sheets, and coils. Helices
and sheets are usually evolutionarily
conserved and can aid sequence
alignment.

Tertiary structure: 3
-
D structure of
entire chain

Quaternary structure: Complex of
several chains

Key points

DNA can be represented as strings
consisting of four letters: A, C, G, and T.
They could be very long, e.g. thousands
and even millions of letters

Proteins are also represented as strings of
20 letters (each letter is an amino acid).
Their 3
-
D structure determines the
function to a large extent.

Comparison of sequences

Fundamental in bioinformatics

Many applications

Evolutionary conserved regions

Functional sites in proteins

Phylogeny reconstruction

Protein structural contact prediction

Gene splicing

Genome assembly

Database search

Pairwise sequence alignment

Comparison of two sequences

We want to determine the substitutions,
insertions, and deletions that occurred
between the two sequences

Pairwise alignment

How to align two sequences?

We use dynamic programming

Treat DNA sequences as strings over the
alphabet {A, C, G, T}

Pairwise alignment

Dynamic programming

Define
V(i,j)

to be the optimal pairwise alignment
score between
S
1..i

and
T
1..j
(|S|=m, |T|=n)

Dynamic programming

Time and space complexity is
O(mn)

Define
V(i,j)

to be the optimal pairwise alignment
score between
S
1..i

and
T
1..j
(|S|=m, |T|=n)

How do we understand this dynamic
programming algorithm?

Let’s first look at some example
alignments

Let’s look at gaps. How do we know where
to insert gaps

Let’s look at the structure of an optimal
alignment of two sequences
x

and
y

and
how it relates optimal alignments of
subsequences of
x

and
y

How do we pick gap
parameters?

Structural alignments

Recall that proteins have 3
-
D structure.

Structural alignment
-

example
1

Alignment of thioredoxins from

human and fly taken from the

Wikipedia website. This protein

is found in nearly all organisms

and is essential for mammals.

PDB ids are 3TRX and 1XWC.

Structural alignment
-

example
2

Computer generated

aligned proteins

Unaligned proteins.

2bbm and 1top are

proteins from fly and

chicken respectively.

Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

Structural alignments

We can produce high quality manual
alignments by hand if the structure is
available.

These alignments can then serve as a
benchmark to train gap parameters so that
the alignment program produces correct
alignments.

Benchmark alignments

Protein alignment benchmarks

BAliBASE, SABMARK, PREFAB,
HOMSTRAD are frequently used in studies for
protein alignment.

Proteins benchmarks are generally large and
have been in the research community for
sometime now.

BAliBASE 3.0

Comparison of two alignments

How can we quantify the “correctness” of a
test alignment with respect to a reference?

We measure the number of pairs in the
test that are also aligned in the reference
alignment.

This is also called the sum
-
of
-
pairs
accuracy.

Biologically realistic scoring matrices

PAM and BLOSUM are most popular

PAM was developed by Margaret Dayhoff
and co
-
workers in 1978 by examining
1572 mutations between 71 families of
closely related proteins

BLOSUM is more recent and computed
from blocks of sequences with sufficient
similarity

Computing scoring matrices

Start with a set of reference alignments

Suppose we want to compute the score of A
aligning to C

Count the number of times A aligns to C

Count the number of A’s and C’s

Compute
p
AC

the probability of A aligning to C
and
p
A

and
p
C

the background probabilities of A
and C

Compute the log likelihood ratio

Next week

Basics of Unix

Perl programming

Basics

Exercises

Dynamic programming solution to sequence
alignment in Perl