Powerpoint slides - George Washington University

CS 177 Introduction to Bioinformatics

Fall 2004


Instructor: Anna Panchenko (panch@ncbi.nlm.nih.gov)

Instructor: Tom Madej (madej@ncbi.nlm.nih.gov)

Co-Instructor: Rahul Simha (simha@gwu.edu)


Lecture 1: Introduction




Instructors



Course goals



Grading policy



Motivating problem



Course overview



Molecular basis of cellular processes



Historical timeline


Course Goals


The student will be introduced to the fundamental
problems and methods of bioinformatics.



The student will become thoroughly familiar with on-line
public bioinformatics databases and their available
software tools.



The student will acquire a background knowledge of
biological systems so as to be able to interpret the
results of database searches, etc.



The student will also acquire a general understanding of
how important bioinformatics algorithms/software tools
work, and how the databases are organized.

Grading Policy


Homework: 50%, weekly assignments



Final exam: 50%

“All examinations, papers, and other graded
work products and assignments are to be
completed in conformance with: The George
Washington University Code of Academic
Integrity”.

Optional Texts

P.E. Bourne and H. Weissig (2003), Structural Bioinformatics, Wiley & Sons.

What is Bioinformatics?


A merger of biology, computer science, and information
technology.



Enables the discovery of new biological insights and
unifying principles.



Born from necessity, because of the massive amount of
information required to describe biological organisms
and processes.

Severe Acute Respiratory Syndrome
(SARS)


SARS is a respiratory illness caused by a previously
unrecognized coronavirus; first appeared in Southern
China in Nov. 2002.



Between Nov. 2002 and July 2003, there were 8,098
cases worldwide and 774 fatalities (WHO).



The global outbreak was over by late July 2003. A few
new cases have arisen sporadically since then in China.



There is currently no vaccine or cure available.

Fig. 2 from Rota et al.

Phylogenetic analysis of coronavirus proteins

Fig. 2 from Rota et al.

Conserved motifs in coronavirus S proteins.

Fig. 2 from Rota et al.

Exercise!


Look up the SARS genome on the NCBI website: www.ncbi.nlm.nih.gov


The (ever expanding) Entrez System

Entrez

PopSet

Structure

PubMed

Books

3D Domains

Taxonomy

GEO/GDS

UniGene

Nucleotide

Protein

Genome

OMIM

CDD/CDART

Journals

SNP

UniSTS

PubMed Central

Course Overview

Lecture 1: Introduction




Instructors



Grading policy



Motivating problem



Course overview



Molecular basis of cellular processes



Historical timeline


Lecture 2: General principles of DNA/RNA
structure and stability




Physico-chemical properties of nucleic acids



RNA folding and structure prediction



Gene identification



Genome analysis

Lecture 3: General principles of protein structure
and stability




Physico-chemical properties of proteins



Prediction of protein secondary structure



Protein domains and prediction of domain boundaries



Protein structure-function relationships

Lecture 4: Sequence alignment algorithms




The alignment problem



Pairwise sequence alignment algorithms



Multiple sequence alignment algorithms



Sequence profiles and profile alignment methods



Alignment statistics

Lecture 5: Computational aspects of protein structure,
part I




Protein folding problem



Problem of protein structure prediction



Homology modeling



Protein design



Prediction of functionally important sites

Lecture 6: Computational aspects of protein structure,
part II




Structure-structure alignment algorithms



Significance of structure-structure similarity



Protein structure classification

Lecture 7: Bioinformatics databases




Sequence and sequence alignment formats, data exchange



Public sequence databases



Sequence retrieval and examples



Public protein structure databases



Lab exercises

Lecture 8: Bioinformatics database search tools




Sequence database search tools



Structure database search tools



Assessment of results, ROC analysis



Lab exercises

Lecture 9: Phylogenetic analysis, part I



Molecular basis of evolution


Taxonomy and phylogenetics


Phylogenetic trees and phylogenetic inference


Software tools for phylogenetic analysis

Lecture 10: Phylogenetic analysis, part II




Accuracies and statistical tests of phylogenetic trees



Genome comparisons



Protein structure evolution

Lecture 11: Experimental techniques for
macromolecular analysis




Sequencing, PCR



Protein crystallography



Mass spectroscopy



Microarrays



RNA interference

Lecture 12: Systems biology




Genomic circuits



Modeling complex integrated circuits



Protein-protein interaction



Metabolic networks

Lectures 13, 14: To be decided…

Molecular Biology Background


Cells



general structure/organization



Molecules



that make up cells



Cellular processes



what makes the cell alive

Two Cell Organizations


Prokaryotes



lack a nucleus, have a simpler internal structure, and are
generally much smaller



Eukaryotes



with nucleus (containing DNA) and various
organelles

Selected organelles…


Nucleus



contains chromosomes/DNA



Mitochondria



generate energy for the cell; contain
mitochondrial DNA



Ribosomes



where translation from mRNA to proteins
takes place (protein synthesis machinery)



Lysosomes



where protein degradation takes place

Cells can become specialized…


Three domains of life


Prokarya



Bacteria



Archaea



Eukarya



Eukaryotes


Universal phylogenetic tree.

Fig. 1 from: N.R. Pace, Science 276 (1997) 734-740.

Molecules in the cell


Proteins



catalyze reactions, form structures, control
membrane permeability, cell signaling, recognize/bind
other molecules, control gene function



Nucleic acids



DNA and RNA; encode information
about proteins



Lipids



make up biomembranes



Carbohydrates



energy sources, energy storage,
constituents of nucleic acids and surface membranes



Other small molecules


e.g. ATP, water, ions, etc.

The Central Dogma of Molecular Biology

Exercise!



Retrieve a protein structure from the SARS
coronavirus from the NCBI website; you can use:
www.ncbi.nlm.nih.gov/Structure/



Look at the structure for the SARS protease using
Cn3D.

Timeline

1859

Darwin publishes
On the Origin of Species…


1865

Mendel’s experiments with peas show that hereditary
traits are passed on to offspring in discrete units.


1869

Miescher isolates DNA.


1895

Röntgen discovers X-rays.


1902

Sutton proposes the chromosome theory of heredity.

Timeline

(cont.)

1911

Morgan and co-workers establish the chromosome
theory of heredity, working with fruit flies.


1943

Astbury observes the first X-ray pattern of DNA.


1944

Avery, MacLeod, and McCarty show that DNA
transmits heritable traits (not proteins!).


1951

Pauling and Corey predict the structure of the alpha-helix
and beta-sheet.

Timeline

(cont.)

1953

Watson and Crick propose the double helix model for
DNA based on X-ray data from Franklin and Wilkins.


1955

Sanger announces the sequence of the first protein to
be analyzed, bovine insulin.


1955

Kornberg and co-workers isolate the enzyme DNA
polymerase (used for copying DNA, e.g. in PCR).


1958

The first integrated circuit is constructed by Kilby at
Texas Instruments.

Timeline

(cont.)

1960

Perutz and Kendrew obtain the first X-ray structures
of proteins (hemoglobin and myoglobin).


1961
Brenner, Jacob, and Meselson discover that mRNA
transmits the information from the DNA in the nucleus to
the cytoplasm.


1965

Dayhoff starts the Atlas of Protein Sequence and
Structure.


1966

Nirenberg, Khorana, Ochoa and colleagues crack the
genetic code!


1970

The Needleman-Wunsch algorithm for sequence
comparison is published.

Timeline

(cont.)

1972
Dayhoff develops the Protein Sequence Database
(PSD).


1972

Berg and colleagues create the first recombinant
DNA molecule.


1973

Cohen invents DNA cloning.


1975

Sanger and others (Maxam, Gilbert) invent rapid DNA
sequencing methods.

Timeline

(cont.)

1980

The first complete gene sequence for an organism
(Bacteriophage ΦX174) is published. The genome
consists of 5,386 bases coding 9 proteins.


1981

The Smith-Waterman algorithm for sequence
alignment is published.


1981

IBM introduces its Personal Computer to the market.


1982

The GenBank sequence database is created at Los
Alamos National Laboratory.

Timeline

(cont.)

1983

Mullis and co-workers describe the PCR reaction.


1985

The FASTP algorithm is published by Lipman and
Pearson.


1986

The SWISS-PROT database is created.


1986
The Human Genome Initiative is announced by DOE.


1988

The National Center for Biotechnology Information
(NCBI) is established at the National Library of Medicine
in Bethesda.

Timeline

(cont.)

1992

Human Genome Sciences, in Gaithersburg, MD, is
founded by Haseltine.


1992

The Institute for Genomic Research (TIGR) is
established by Venter in Rockville, MD.


1995

The Haemophilus influenzae genome is sequenced
(1.8 Mb).


1996

Affymetrix produces the first commercial DNA chips.

Timeline

(cont.)

1988

The FASTA algorithm for sequence comparison is
published by Pearson and Lipman.


1990

Official launch of the Human Genome Project.


1990

The BLAST program by Altschul et al., is published.


1991

The CERN research institute in Geneva announces
the creation of the protocols which make up the World
Wide Web.

Timeline

(cont.)

1996

The yeast genome is sequenced; the first complete
eukaryotic genome.


1996

Human DNA sequencing begins.


1997

The E. coli genome is sequenced (4.6 Mb, approx. 4k
genes).


1998

The C. elegans genome is sequenced (97 Mb,
approx. 20k genes); the first genome of a multicellular
organism.

Timeline

(cont.)

1998

Venter founds Celera in Rockville, MD.


1998
The Swiss Institute of Bioinformatics is established in
Geneva.


1999

The HGP completes the first human chromosome
(no. 22).


2000

The Drosophila genome is completed.

Timeline

(cont.)

2000

Human chromosome no. 21 is completed.


2001

A draft of the entire human genome (3,000 Mb) is
published.


2003

The Human Genome is “completed”! Approx. 30,000
genes (estimated).

DNA, RNA, protein overview

Questions about the genome in an organism:


How much DNA, how many nucleotides?


How many genes are there?


What types of proteins appear to be coded by these genes?

Questions about the proteome:


What proteins are present?


Where are they?


When are they present - under what conditions?


DNA



RNA



Mutations


Amino acids,
protein structure



DNA overview

DNA: deoxyribonucleic acid

4 bases:
A = Adenine
T = Thymine
C = Cytosine
G = Guanine

Nucleoside: base + sugar (deoxyribose)

Nucleotide: base + sugar + phosphate

Purine (C5N4H4)

Pyrimidine (C4N2H4): Thymine, Cytosine

Numbering of carbons?

Linking nucleotides

What next?

Hydrogen bonds: N-H···N and N-H···O

Adenine, Guanine, Thymine, Cytosine

Linking nucleotides:

The 3’-OH of one nucleotide is linked to the 5’-phosphate of
the next nucleotide

Base pairing

Base pairing (Watson-Crick):

A/T (2 hydrogen bonds)

G/C (3 hydrogen bonds)

Always pairing a purine and a pyrimidine yields a constant width

DNA base composition:

A + G = T + C (Chargaff’s rule)
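The pairing rules above can be checked on a small double-stranded example. The following is a minimal sketch, not part of the original slides: the function names are illustrative, and the sequence is the one used on the DNA conventions slide. It builds the complementary strand, counts hydrogen bonds (2 per A/T pair, 3 per G/C pair), and confirms that purines and pyrimidines balance in the duplex.

```python
# Hydrogen bonds contributed by each base of the sense strand when paired
# (A/T pairs have 2, G/C pairs have 3, as stated on the slide).
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the Watson-Crick complement, base by base (same left-to-right order)."""
    return "".join(COMPLEMENT[base] for base in strand)


def duplex_h_bonds(strand):
    """Total hydrogen bonds holding the strand to its complement."""
    return sum(H_BONDS[base] for base in strand)


def duplex_base_counts(strand):
    """Count each base over both strands of the duplex."""
    both = strand + complement(strand)
    return {base: both.count(base) for base in "ACGT"}


sense = "ATCGCAATCAGCTAGGTT"  # example sequence from the DNA conventions slide
counts = duplex_base_counts(sense)
purines = counts["A"] + counts["G"]
pyrimidines = counts["T"] + counts["C"]
print(counts, purines == pyrimidines, duplex_h_bonds(sense))
```

Chargaff's rule holds here by construction, because every purine on one strand is matched by a pyrimidine on the other; it need not hold for a single strand taken alone.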

DNA conventions

1. DNA is a right-handed helix

2. The 5’ end is to the left by convention

5’-ATCGCAATCAGCTAGGTT-3’   sense (forward)

3’-TAGCGTTAGTCGATCCAA-5’   antisense (reverse)
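Because sequences are written 5’ to 3’ by convention, the antisense strand is usually reported as the reverse complement of the sense strand. A minimal sketch of that conversion, using the slide's example sequence (the function name is an illustrative choice, not from the slides):

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand_5_to_3):
    """Return the opposite strand, also written 5' to 3'."""
    return strand_5_to_3.translate(COMPLEMENT)[::-1]


sense = "ATCGCAATCAGCTAGGTT"                # 5' -> 3'
antisense_5_to_3 = reverse_complement(sense)  # opposite strand, 5' -> 3'
antisense_3_to_5 = antisense_5_to_3[::-1]     # same strand as drawn under the sense strand

print(antisense_5_to_3)
print(antisense_3_to_5)
```

Reading `antisense_3_to_5` left to right reproduces the lower strand as it is drawn on the slide, aligned base by base under the sense strand.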

DNA structure

Some more facts:

1. Forces stabilizing DNA structure: Watson-Crick H-bonding and base stacking
(planar aromatic bases overlap geometrically and electronically → energy gain)

2. Genomic DNAs are large molecules:
Escherichia coli: 4.7 x 10^6 bp; ~ 1 mm contour length
Human: 3.2 x 10^9 bp; ~ 1 m contour length

3. Some DNA molecules (plasmids) are circular and have no free ends:
mtDNA
bacterial DNA (only one circular chromosome)

4. Average gene of 1000 bp can code for average protein of about 330 amino acids

5. Percentage of non-coding DNA varies greatly among organisms

Organism          # Base pairs    # Genes             Non-coding DNA
small virus       4 x 10^3        3                   very little
‘typical’ virus   3 x 10^5        200                 very little
bacterium         5 x 10^6        3000                10-20%
yeast             1 x 10^7        6000                > 50%
human             3.2 x 10^9      30,000?             99%
amphibians        < 80 x 10^9     ?                   ?
plants            < 900 x 10^9    23,000 - >50,000    > 99%
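The contour lengths and the gene-to-protein size estimate quoted in the facts above follow from simple arithmetic. The sketch below assumes the standard B-DNA rise of about 0.34 nm per base pair, a value that is not stated on the slides, and uses illustrative function names.

```python
RISE_PER_BP_NM = 0.34  # assumed helical rise per base pair for B-DNA, in nanometres


def contour_length_m(base_pairs):
    """Approximate end-to-end length of stretched-out DNA, in metres."""
    return base_pairs * RISE_PER_BP_NM * 1e-9


def max_codons(gene_bp):
    """Number of complete 3-base codons that fit in a coding region."""
    return gene_bp // 3


print(f"E. coli: {contour_length_m(4.7e6) * 1e3:.1f} mm")
print(f"Human:   {contour_length_m(3.2e9):.2f} m")
print(f"1000 bp gene: up to {max_codons(1000)} codons")
```

Under this assumption the estimates come out at roughly 1.6 mm for E. coli and 1.1 m for human DNA, the same order of magnitude as the rounded slide values, and 1000 bp gives 333 codons, consistent with "about 330 amino acids".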


RNA: ribonucleic acid

4 bases:
A = Adenine
U = Uracil
C = Cytosine
G = Guanine

Pyrimidine (C4N2H4)

Purine (C5N4H4)

Nucleoside: base + sugar (ribose)

Nucleotide: base + sugar + phosphate

RNA structure

3 major types of RNA

messenger RNA (mRNA): template for protein synthesis

transfer RNA (tRNA): adaptor molecules that decode the genetic code

ribosomal RNA (rRNA): catalyzes the synthesis of proteins

Thymine (DNA)

Uracil (RNA)

Base interactions in RNA

Base pairing:

U/A/(T) (2 hydrogen bonds)

G/C (3 hydrogen bonds)

RNA base composition:

A + G ≠ U + C; Chargaff’s rule does not apply (RNA usually exists as a single strand)

RNA structure:

- usually single stranded
- many self-complementary regions: RNA commonly exhibits an intricate secondary structure
  (relatively short, double helical segments alternating with single stranded regions)
- complex tertiary interactions fold the RNA into its final three-dimensional form
- the folded RNA molecule is stabilized by interactions (e.g. hydrogen bonds and base stacking)

RNA structure

Primary structure

Secondary structure

A) single stranded regions: formed by unpaired nucleotides
B) duplex: double helical RNA (A-form with 11 bp per turn)
C) hairpin: duplex bridged by a loop of unpaired nucleotides
D) internal loop: nucleotides not forming Watson-Crick base pairs
E) bulge loop: unpaired nucleotides in one strand, other strand has contiguous base pairing
F) junction: three or more duplexes separated by single stranded regions
G) pseudoknot: tertiary interaction between bases of hairpin loop and outside bases

RNA structure

Primary structure

Secondary structure

Tertiary structure

RNA structure

How to predict RNA secondary/tertiary structure?

Probing RNA structure experimentally:

- physical methods (single crystal X-ray diffraction, electron microscopy)
- chemical and enzymatic methods
- mutational analysis (introduction of specific mutations to test change in some
  function or protein-RNA interaction)

Thermodynamic prediction of RNA structure:

- RNA molecules obey the laws of thermodynamics, so it should be
  possible to deduce RNA structure from its sequence by finding the conformation
  with the lowest free energy
- Pros: only one sequence required; no difficult experiments; does not rely on
  alignments
- Cons: thermodynamic data are experimentally determined, but not always accurate;
  possible interactions of RNA with solvent, ions, and proteins

Comparative determination of RNA structure:

- basic assumption: the secondary structure of a functional RNA will be conserved in the
  evolution of the molecule (at least more conserved than the primary structure);
  when a set of homologous sequences has a certain structure in common, this structure can
  be deduced by comparing the structures possible from their sequences
- Pros: very powerful in finding secondary structure, relatively easy to use, only sequences
  required, not affected by interactions of the RNA and other molecules
- Cons: a large number of sequences to study is preferred, structural constraints in fully conserved
  regions cannot be inferred, extremely variable regions cause problems with alignment
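To give a feel for how structure can be computed from a single sequence, the sketch below uses base-pair maximization (a Nussinov-style dynamic program). This is a deliberately simplified stand-in for free-energy minimization, not the thermodynamic method itself: it counts only Watson-Crick pairs, treats every pair as equally favorable, and cannot represent pseudoknots. The function name and the minimum hairpin loop size of 3 are assumptions made for illustration.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested Watson-Crick pairs in an RNA sequence.

    min_loop is the assumed minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base j stays unpaired.
            score = best[i][j - 1]
            # Case 2: base j pairs with some base k, splitting the problem in two.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in WATSON_CRICK:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # a small hairpin: three G-C pairs around an AAA loop
```

Practical prediction tools replace the "+1 per pair" score with experimentally derived stacking and loop energies, which is where the accuracy caveats listed under the cons come in.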

Amino acids/proteins

The “central dogma” of modern biology: DNA → RNA → protein

Getting from DNA to protein:

Two parts:

1. Transcription, in which a short portion of chromosomal DNA is used to
   make an RNA molecule small enough to leave the nucleus.

2. Translation, in which the RNA code is used to assemble the protein at the
   ribosome.

- Bases are read in groups of 3 (= a codon)
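The two steps above can be sketched in a few lines. This is an illustrative simplification, not a model of the cellular machinery: transcription is approximated by rewriting the coding (sense) strand with U in place of T, translation starts at the first AUG and stops at the first in-frame stop codon, and the example DNA string and function names are made up for the demonstration. The codon table is the standard genetic code written in the usual compact UCAG order.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """mRNA with the same sequence as the coding strand, with U replacing T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "GGATGGCCTGGTAAAC"  # hypothetical example sequence
mrna = transcribe(dna)
print(mrna, translate(mrna))
```

Here the reading frame is set by the AUG at position 3, giving the codons AUG, GCC, UGG and then the stop codon UAA.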

The genetic code

- The code consists of 64 codons (4^3 = 64)
- All codons are used in protein synthesis:
  - 20 amino acids
  - 3 stop codons
- The code problem: 4 nucleotides in RNA, but 20 amino acids in proteins
- AUG (methionine) is the start codon (also used internally)
- The code is non-overlapping and punctuation-free
- The code is degenerate (but NOT ambiguous): most amino acids are specified by
  more than one codon, but each codon specifies only one amino acid
- The code is universal (virtually all organisms use the same code)

The genetic code

In-class exercise

1. Which amino acids are specified by single codons?
   methionine and tryptophan

2. How many amino acids are specified by the first two nucleotides only?
   five: proline, threonine, valine, alanine, glycine

3. What is the RNA code for the start codon?
   AUG
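The exercise answers can be checked against the standard genetic code. The sketch below is one possible way to do so; the variable names are illustrative, and "specified by the first two nucleotides only" is interpreted here as an amino acid whose codons all share the same first two bases and cover all four third-position bases.

```python
from collections import defaultdict

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid (one-letter code) they specify.
codons_for = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_for[amino_acid].append(codon)
stop_codons = sorted(codons_for.pop("*"))

# 1. Amino acids specified by a single codon.
single_codon = sorted(aa for aa, codons in codons_for.items() if len(codons) == 1)

# 2. Amino acids fully determined by the first two nucleotides.
two_base_only = sorted(
    aa for aa, codons in codons_for.items()
    if len(codons) == 4 and len({codon[:2] for codon in codons}) == 1
)

# 3. The start codon is the only codon for methionine (M).
start_codon = codons_for["M"][0]

print(single_codon, two_base_only, start_codon, stop_codons)
```

In one-letter codes, M and W are methionine and tryptophan, and A, G, P, T, V are alanine, glycine, proline, threonine and valine, matching the answers given in the exercise.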
