CS 5263 Bioinformatics - Department of Computer Science


Oct 2, 2013


CS5263 Bioinformatics

Lecture 1: Introduction

Outline


Administrivia


What is bioinformatics


Why bioinformatics


Topics in bioinformatics


What you will & will not learn


Introduction to molecular biology

Student info


Your name


Email


Enrollment status


Academic background


Interests


Course Info


Instructor: Jianhua Ruan


Office: S.B. 4.01.48


Phone: 458-6819


Email: jruan@cs.utsa.edu


Office hours: Tues 6:30-7:30, Wed 3-4pm


Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2007/

Course description


A survey of algorithms and methods in
bioinformatics, approached from a
computational viewpoint.


Discussions balanced between algorithmic
analyses and biological applications


Prerequisite:


Knowledge of algorithms and data structures


Programming experience


Basic understanding of statistics and probability


Appetite to learn some biology

Textbooks


Required:


An Introduction to Bioinformatics Algorithms


by Jones and Pevzner



Recommended:


Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids


by Durbin, Eddy, Krogh and Mitchison



Additional resources


See course website

Grading


Attendance: 10%


At most 2 classes missed without affecting
grade


Homework: 50%


No late submission accepted


Read the collaboration policy!


Final project and presentation: 40%

What is bioinformatics


National Institutes of Health (NIH):


Research, development, or application of
computational tools and approaches for
expanding the use of biological, medical,
behavioral or health data, including those to
acquire, store, organize, archive, analyze, or
visualize such data.

What is bioinformatics


National Center for Biotechnology
Information (NCBI):


The field of science in which biology, computer
science, and information technology merge to
form a single discipline. The ultimate goal of
the field is to enable the discovery of new
biological insights as well as to create a global
perspective from which unifying principles in
biology can be discerned.

What is bioinformatics


Wikipedia


Bioinformatics refers to the creation and
advancement of algorithms, computational
and statistical techniques, and theory to solve
formal and practical problems posed by or
inspired from the management and analysis
of biological data.

Why bioinformatics


Modern biology generates huge amounts of data


Human genome sequence has 3 billion bases


Complex relationships among different types of data


Challenges to integrate and analyze data


Algorithmic challenges


Biologists trained in programming are probably not sufficient


Tremendous need in both academia and industry


Job opportunities


You get the chance to learn something different


Some examples of the central
role of CS in bioinformatics

1. Genome sequencing

AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT

3x10^9 nucleotides

~500 nucleotides

AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT

3x10^9 nucleotides

Computational Fragment Assembly


Introduced ~1980


1995: assemble up to 1,000,000 long DNA pieces


2000: assemble whole human genome

A big puzzle

~60 million pieces
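
The puzzle can be illustrated with a toy greedy assembler: repeatedly merge the two reads with the longest suffix/prefix overlap. This is only a sketch under strong assumptions (error-free reads, all from the same strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` threshold are illustrative and not taken from the lecture, and real assemblers are far more involved.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping reads cut from the first 22 bases shown on the slide.
reads = ["AGTAGCACAG", "CACAGACTAC", "ACTACGACGAGA"]
print(greedy_assemble(reads))
```

The all-pairs comparison is quadratic in the number of reads, which is exactly why tens of millions of pieces make assembly an algorithmic challenge.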

1. Genome sequencing

Where are the genes?

2. Gene Finding

In humans:


~22,000 genes

~1.5% of human DNA

[Figure: gene structure, 5’ to 3’: start codon (ATG), Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, stop codon (TAG/TGA/TAA); splice sites mark the exon/intron boundaries]
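
As a first approximation of the start/stop signals in the figure, one can scan a DNA string for stretches that begin with ATG and end at the first in-frame stop codon. The sketch below assumes a single strand and no introns, so it ignores splice sites entirely; `find_orfs` and `min_codons` are hypothetical names chosen for illustration.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) pairs of ATG...stop stretches; end is exclusive."""
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, in frame with the start codon.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                # Count codons before the stop, including the ATG itself.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


seq = "CCATGGCTTGACC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

Real gene finders must also model exons, introns and splice sites, which is where the Hidden Markov Models on the next slide come in.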

2. Gene Finding

Hidden Markov Models

(Well studied for many years
in speech recognition)

3. Protein Folding


The amino-acid sequence of a protein determines the 3D fold


The 3D fold of a protein determines its function


Can we predict the 3D fold of a protein given its amino-acid sequence?


Holy grail of compbio

40-year-old problem


Molecular dynamics, computational geometry, machine learning, robotics

4. Sequence Comparison

Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  || |||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence Alignment


Introduced ~1970


BLAST: 1990, one of the most cited papers in history


Still very active area of research
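
A minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence) is given here. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions made for illustration; the lecture does not specify a scoring scheme.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of `a` and `b`, keeping one DP row at a time."""
    # Row 0: aligning the empty prefix of `a` with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against the empty prefix of `b`
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a)+1) x (len(b)+1) cells, so time is quadratic in the sequence lengths; recovering the alignment itself would additionally require traceback pointers.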

query

DB

BLAST

Efficient string matching algorithms

Fast database index techniques
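
The idea of a database index can be sketched with a k-mer lookup table: every length-k substring of the database is stored with its positions, so exact seed matches for a query are found without scanning the whole database. This is a simplified illustration, not BLAST's actual implementation; the names `build_kmer_index` and `find_seeds` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(db: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map each k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def find_seeds(query: str, index: dict[str, list[tuple[str, int]]],
               k: int) -> list[tuple[int, str, int]]:
    """Return (query offset, sequence name, database offset) for exact k-mer hits."""
    hits = []
    for q in range(len(query) - k + 1):
        for name, pos in index.get(query[q:q + k], []):
            hits.append((q, name, pos))
    return hits


db = {"s1": "ACGTACGT", "s2": "TTACGG"}  # toy database, made up for this example
index = build_kmer_index(db, k=4)
print(find_seeds("GTACG", index, k=4))
```

A heuristic search tool would then try to extend each seed into a longer alignment; that extension step is omitted here.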

Sequence comparison is key to



Finding genes



Determining function



Uncovering the evolutionary processes

Sequence conservation implies function

5. Evolution

More than 200 complete genomes have been sequenced

5. Evolution

6. Microarray analysis

Clinical prediction of Leukemia type


2 types


Acute lymphoid (ALL)


Acute myeloid (AML)


Different treatment & outcomes


Predict type before treatment?


Bone marrow samples: ALL vs AML

Measure amount of each gene

Some goals of biology for the next 50 years


List all molecular parts that build an organism


Genes, proteins, other functional parts


Understand the function of each part


Understand how parts interact


Study how function has evolved across all species


Find genetic defects that cause diseases


Design drugs rationally


Sequence the genome of every human, use it for
personalized medicine



Bioinformatics is an essential component for
all the goals above

Major conferences


ISMB (Summer every year)


RECOMB (and its satellites) (Spring every year)


PSB (Jan every year, Hawaii)


ECCB (Europe)


CSB (July every year, Stanford)


Conferences in computer science


ICDM (conference on data mining)


ICML (conference on machine learning)


AAAI (conference on AI)

Major journals


Bioinformatics


Journal of Computational Biology


PLoS Computational Biology


BMC Bioinformatics


Genome Biology


Genome Research


Nucleic Acids Research


IEEE/ACM Trans on Computational Biology and Bioinformatics


Science, Nature, PNAS, Cell, Nature Genetics,
Nature Biotech, …

Major Bioinfo research topics

Covered topics


Sequence analysis


Alignment


Motif finding


Pattern matching


Phylogenetic tree


Sequence-based predictions


Gene components


RNA structure


Functional Genomics


Microarray analysis


Biological networks

What you will learn?


Basic concepts in molecular biology and
genetics


Selected topics in bioinformatics and
challenges


Algorithms:


DP, graph, string algorithms


Statistical learning algorithms: HMM, EM,
Gibbs sampling


Data mining: clustering / classification

What you will not learn?


Existing tools / databases


Design / perform biological experiments


Protein structure prediction (commonly
avoided by most bioinfo researchers…)


Building bioinformatics software tools (GUI,
database, Perl / Python, …)

Goals


Basis of sequence analysis and other
computational biology algorithms


Overall picture about the field


Read / criticize research articles


Think about the sub-field that best suits
your background to explore


Communicate and exchange ideas with
(computational) biologists

Computer Scientists vs
Biologists


(courtesy Serafim Batzoglou, Stanford)

Biologists vs computer scientists


(almost) Everything is true or false in
computer science



(almost) Nothing is ever true or false in
Biology


Biologists vs computer scientists


Biologists seek to understand the
complicated, messy natural world



Computer scientists strive to build their
own clean and organized virtual world


Biologists vs computer scientists


Computer scientists are obsessed with
being the first to invent or prove something



Biologists are obsessed with being the first
to discover something


Biologists vs computer scientists


Biologists are comfortable with the idea
that all data have errors, and every rule
has exceptions



Computer scientists are not

Biologists vs computer scientists


Computer scientists get high-paying jobs
after graduation



Biologists typically have to complete one
or more 5-year post-docs...

Molecular biology 101


Cell


DNA, RNA, Protein


Genome, chromosome, gene


Central dogma

Life


Categories


Prokaryotes (e.g. bacteria)


Unicellular


No nucleus


Eukaryotes (e.g. fungi, plant, animal)


Unicellular or multicellular


Has nucleus


The most important distinction among
groups of organisms

Prokaryote vs Eukaryote


Eukaryotes have many membrane-bound
compartments inside the cell


Different biological processes occur at different
cellular locations

Chemical contents of cell


Small molecules


Sugar


Ions (Na+, K+, Ca2+, Cl-, …)





Macromolecules (polymers):


DNA


RNA


Protein





Polymers: “strings” made by linking
monomers from a specified set (alphabet)

Polymer    Monomer
DNA        Deoxyribonucleotides
RNA        Ribonucleotides
Protein    Amino acids

DNA


DNA: forms the genetic material of all living
organisms


Can be replicated and passed to descendents


Contains information to produce proteins


To computer scientists, DNA is a string made
from alphabet {A, C, G, T}


e.g. ACAGAACGTAGTGCCGTGAGCG


Each letter is called a base


A deoxyribonucleotide


Length varies. From hundreds to billions
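
Treating DNA as a string over {A, C, G, T} makes basic checks one-liners. A small sketch, using the example sequence from this slide; the helper names `is_dna` and `base_counts` are made up for illustration.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def is_dna(seq: str) -> bool:
    """True if `seq` is non-empty and uses only the letters A, C, G, T."""
    return bool(seq) and set(seq) <= DNA_ALPHABET


def base_counts(seq: str) -> Counter:
    """Count how often each base occurs."""
    return Counter(seq)


example = "ACAGAACGTAGTGCCGTGAGCG"  # sequence shown on the slide
print(is_dna(example), len(example), dict(base_counts(example)))
```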

RNA


Historically thought to be information carrier only


DNA => RNA => Protein


New roles have since been found for RNA


To computer scientists, RNA is a string made
from alphabet {A, C, G, U}


e.g. ACAGAACGUAGUGCCGUGAGCG


Each letter is called a base


A ribonucleotide


Length varies. From tens to thousands
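
At the string level, the DNA and RNA examples on these two slides differ only in T versus U. A minimal sketch, assuming the DNA string is the strand whose sequence matches the RNA (the function names are illustrative):

```python
def transcribe(dna: str) -> str:
    """Rewrite a DNA string in the RNA alphabet by replacing T with U."""
    return dna.replace("T", "U")


def back_transcribe(rna: str) -> str:
    """Rewrite an RNA string in the DNA alphabet by replacing U with T."""
    return rna.replace("U", "T")


print(transcribe("ACAGAACGTAGTGCCGTGAGCG"))
```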

Protein


Protein: the actual “worker” for almost all processes in
the cell


Enzymes: speed up reactions


Signaling: information transduction


Structural support


Production of other macromolecules


Transport


To computer scientists, protein is a string built from 20
letters


E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP


Each letter is called an amino acid


Lengths: from tens to thousands

Central dogma of molecular biology
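
The last step of DNA => RNA => Protein can be sketched as a table lookup: each group of three bases (a codon) maps to one amino acid. The code below builds the standard genetic code from its compact 64-letter representation and translates a coding DNA string. It assumes the input is already in the correct reading frame and contains only A, C, G, T; the input sequence is constructed for this example and is not from the lecture.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate codon by codon from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGGTGATTAA"))
```

Any character outside A, C, G, T raises a `KeyError` here; a more forgiving version might map unknown codons to a placeholder such as `X`.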

DNA/RNA zoom-in


Commonly referred to as Nucleic Acid


DNA: Deoxyribonucleic acid


RNA: Ribonucleic acid


Found mainly in the nucleus of a cell
(hence “nucleic”)


Contain phosphoric acid as a component
(hence “acid”)


They are made up of nucleotides

Nucleotides


A nucleotide has 3 components


Sugar (ribose in RNA, deoxyribose in DNA)


Phosphoric acid


Nitrogen base


Adenine (A)


Guanine (G)


Cytosine (C)


Thymine (T) or Uracil (U)

Monomers of RNA


A ribonucleotide has 3 components


Sugar: ribose


Phosphate group


Nitrogen base


Adenine (A)


Guanine (G)


Cytosine (C)


Uracil (U)

Monomers of DNA


A deoxyribonucleotide has 3 components


Sugar: deoxyribose


Phosphoric acid


Nitrogen base


Adenine (A)


Guanine (G)


Cytosine (C)


Thymine (T)

Polymerization: Nucleotides => nucleic acids

[Figure: nucleotides, each made of a phosphate, a sugar, and a nitrogen base, linked into a chain]

DNA

5’-AGCGACTG-3’, or simply AGCGACTG

[Figure: DNA strand with labels Phosphate, Sugar, Base, positions 1 to 5, and the 5’ and 3’ ends]

Many biological processes go from 5’ to 3’,
e.g. DNA replication, transcription, etc.

RNA

5’-AGUGACUG-3’, or simply AGUGACUG

[Figure: RNA strand with labels Phosphate, Sugar, Base, positions 1 to 5, and the 5’ and 3’ ends]

Many biological processes go from 5’ to 3’,
e.g. transcription.

Base-pair:
A = T
G = C

5’-AGCGACTG-3’   Forward (+) strand
3’-TCGCTGAC-5’   Backward (-) strand

One strand is said to be reverse-complementary to the other

Reverse-complementary sequences


5’-ACGTTACAGTA-3’


The reverse complement is:


3’-TGCAATGTCAT-5’
=>
5’-TACTGTAACGT-3’


Or simply written as
TACTGTAACGT
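
The two steps on this slide (complement each base, then reverse so the result reads 5’ to 3’) can be sketched in a few lines. The function name is illustrative, and the input is assumed to contain only uppercase A, C, G, T.

```python
# Translation table that swaps each base for its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # slide example; prints TACTGTAACGT
```

Applying the function twice returns the original string, which reflects the fact that each strand is the reverse complement of the other.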

DNA double helix

Orientation of the double helix


Double helix is anti-parallel


5’ end of each strand at 3’ end of the other


5’ to 3’ motion in one strand is 3’ to 5’ in the other


Double helix has no orientation


Biology has no “forward” and “reverse” strand


Relative to any single strand, there is a “reverse
complement” or “reverse strand”


Information can be encoded by either strand or both
strands


5’TTTTACAGGACCATG 3’

3’AAAATGTCCTGGTAC 5’
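
Because information may sit on either strand, a search usually has to cover both. A motif on the opposite strand appears on the given strand as the motif's reverse complement, so one pass over a single string is enough. The sketch below uses the 15-base example from this slide; `find_both_strands` is a hypothetical helper, and positions are reported in the coordinates of the given (top) strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_both_strands(dna: str, motif: str) -> list[tuple[int, str]]:
    """Return (position, strand) hits; '-' means the motif lies on the other strand."""
    rc_motif = reverse_complement(motif)
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        if window == rc_motif:
            hits.append((i, "-"))
    return hits


top = "TTTTACAGGACCATG"  # 5' to 3' strand from the slide
print(find_both_strands(top, "ATG"))
```

A motif that equals its own reverse complement is reported on both strands at the same position.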

RNA Secondary structures


RNAs are normally single-stranded


Can form complex structures by self-base-pairing


A=U, C=G
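
One classical way to reason about self-base-pairing is to ask for the largest set of nested A=U / C=G pairs a sequence can form (a Nussinov-style dynamic program). The sketch below is illustrative only: it uses just the two pair types listed above, requires at least `min_loop` unpaired bases inside any pair (an assumed value of 3), and maximizes the pair count rather than modeling energy.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in `rna`."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving >= min_loop bases between
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # a hairpin: three G-C pairs around an AAA loop
```

The three nested loops give cubic running time in the sequence length, which is acceptable for RNAs of tens to thousands of bases.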