here - Wishart Research Group


Feb 20, 2013


Dr. Wishart


Office hours


Generally available right after class


Best time to catch me is after 4:00 pm in
my office in Athabasca Hall


Usually around until 6:00 pm


Arranging an appointment is always
best (email)


Email responses take 1-2 days

Where to get your notes


http://redpoll.pharmacy.ualberta.ca/


Look under “Courses”

General Course Outline


Bioinformatics introduction


Bioinformatics Databases


Sequence Alignment


Protein Feature ID


Computational microbiology


Peptide/Protein Analysis


Protein Structure (Xray)

General Course Outline


Protein Structure (NMR)


Mass Spectrometry 1


Mass Spectrometry 2


Proteomics


Systems Biology


Enzymology/Systems Biology


Protein Structure (Xray)

Introduction to
Bioinformatics

Microbiology 343

David Wishart Rm. 3-41 Ath

david.wishart@ualberta.ca

Objectives & Outline


Definitions and roles of bioinformatics


DNA sequencing (foundation to
bioinformatics)


Genomes and Genomics


Gene finding in prokaryotes


From genes to proteins



Bioinformatics

Definition - A field of information technology which endeavours to improve the storage, management and analysis of biological, medical and pharmaceutical data.

A blend of information technology and biotechnology

Bioinformatics - Converting Data to Knowledge

[Figure: Data, Bioinformatics Software, Knowledge]

What’s the Appeal?


No spills or smells


No need for a lab coat


No need to get hands or clothes messy


Provides a faster or alternative route to
create hypotheses, perform difficult
experiments, avoid unnecessary
experiments, compare, visualize and analyze
data, make predictions, see what is
“unseeable” and handle a growing tidal wave
of both data and knowledge

Bioinformatics

[Timeline figure, 1990 to 2020: Genomics, Proteomics, Bioinformatics]

High Throughput DNA
Sequencing

Shotgun Sequencing

Isolate Chromosome

Shear DNA into Fragments

Clone into Seq. Vectors

Sequence

Principles of DNA
Sequencing

[Figure: plasmid pBR322 (Amp, Tet, Ori) carrying the DNA fragment, with the primer site marked]

Denature with heat to produce ssDNA

Klenow + ddNTP + dNTP + primers

The Secret to Sanger
Sequencing

Principles of DNA Sequencing

5’ Primer    3’ Template    G C A T G C

Reaction mix (one tube each)        Terminated fragments
dATP dCTP dGTP dTTP + ddATP         GC-ddA
dATP dCTP dGTP dTTP + ddCTP         G-ddC, GCATG-ddC
dATP dCTP dGTP dTTP + ddTTP         GCA-ddT
dATP dCTP dGTP dTTP + ddGTP         ddG, GCAT-ddG
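
The four chain-termination reactions above can be mimicked in a few lines of Python. This is only a rough sketch, not part of the original slides: the function names `sanger_fragments` and `read_gel` are made up for illustration, and it assumes that every possible termination product is present and that fragments separate perfectly by length.

```python
def sanger_fragments(new_strand):
    """Group chain-terminated fragments by the ddNTP that ended them.

    Each fragment is a prefix of the newly synthesized strand whose last
    base is the dideoxy nucleotide (hypothetical helper, idealized chemistry).
    """
    reactions = {base: [] for base in "ACGT"}
    for i, base in enumerate(new_strand):
        reactions[base].append(new_strand[: i + 1])
    return reactions


def read_gel(reactions):
    """Read the sequence back by ordering all fragments from short to long."""
    bands = sorted(
        (len(fragment), base)
        for base, fragments in reactions.items()
        for fragment in fragments
    )
    return "".join(base for _, base in bands)


reactions = sanger_fragments("GCATGC")
for base, fragments in reactions.items():
    print(f"dd{base}TP reaction:", fragments)
print("Read from gel:", read_gel(reactions))
```

Sorting the fragments by length and noting which reaction each came from is the same step a person does when reading a sequencing gel from the shortest band to the longest.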


Principles of DNA Sequencing

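[Figure: sequencing gel with lanes G, C, T, A and electrodes marked + and _; reading the bands from short to long fragments gives G C A T G C]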

Capillary Electrophoresis

Separation by Electro-osmotic Flow

Multiplexed CE with
Fluorescent detection

ABI 3700

96x700 bases

Shotgun Sequencing

Sequence Chromatogram

Send to Computer

Assembled Sequence

Shotgun Sequencing


Very efficient process for small-scale (~10 kb) sequencing (preferred method)


First applied to whole genome sequencing in 1995 (H. influenzae)


Now standard for all prokaryotic genome sequencing projects


Successfully applied to D. melanogaster


Moderately successful for H. sapiens
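
The "send to computer" assembly step of shotgun sequencing can be illustrated with a toy greedy overlap merger. This is a minimal sketch under strong assumptions (error-free reads, all from the same strand, no repeats); real assemblers are far more involved. The names `overlap` and `greedy_assemble`, the `min_len` cutoff and the example reads are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping toy reads from one short made-up sequence
print(greedy_assemble(["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]))
```

Greedy merging is easy to follow but can misassemble repeated regions, which is one reason the approach is harder for large, repeat-rich genomes.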

The Finished Product

GATTACAGATTACAGATTACAGATTACAGATTACAG

ATTACAGATTACAGATTACAGATTACAGATTACAGA

TTACAGATTACAGATTACAGATTACAGATTACAGAT

TACAGATTAGAGATTACAGATTACAGATTACAGATT

ACAGATTACAGATTACAGATTACAGATTACAGATTA

CAGATTACAGATTACAGATTACAGATTACAGATTAC

AGATTACAGATTACAGATTACAGATTACAGATTACA

GATTACAGATTACAGATTACAGATTACAGATTACAG

ATTACAGATTACAGATTACAGATTACAGATTACAGA

TTACAGATTACAGATTACAGATTACAGATTACAGAT

Sequenced Genomes

http://www.genomenewsnetwork.org/

Genomes to Date


8 vertebrates (human, mouse, rat, fugu, dog, chimp)


3 plants (Arabidopsis, rice, poplar)


2 insects (fruit fly, mosquito)


2 nematodes (C. elegans, C. briggsae)


1 sea squirt


4 parasites (Plasmodium, Guillardia)


4 fungi (S. cerevisiae, S. pombe)


200+ bacteria and archaebacteria


2000+ viruses

Prokaryotes


Simple gene structure


Small genomes (0.5 to 10 million bp)


No introns (uninterrupted)


Genes are called Open Reading Frames or “ORFs” (include start & stop codon)


High coding density (>90%)


Some genes overlap (nested)


Some genes are quite short (<60 bp)


Prokaryotic Gene Structure

ORF (open reading frame)

Start codon

Stop codon

TATA box

ATGACAGATTACAGATTACAGATTACAGGATAG

Frame 1

Frame 2

Frame 3
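
To make the three forward reading frames concrete, the sequence from the slide can be cut into codons starting at offsets 0, 1 and 2. This is a small illustrative sketch; the function name `reading_frames` is an assumption, and incomplete codons at the 3’ end are simply dropped.

```python
def reading_frames(seq):
    """Return the three forward reading frames as lists of complete codons."""
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]


seq = "ATGACAGATTACAGATTACAGATTACAGGATAG"
for number, codons in enumerate(reading_frames(seq), start=1):
    print(f"Frame {number}:", " ".join(codons))
```

Only Frame 1 begins with the start codon ATG and ends with the stop codon TAG, so it is the frame that contains the ORF.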

Simple Gene Finding


1. Scan the forward strand until a start codon is found


2. Staying in the same frame, scan in groups of three until a stop codon is found


3. If the number of codons between start and stop is greater than 50, identify the ORF as a gene, go to the last start codon and proceed with step 1


4. If the number of codons between start and stop is less than 50, go back to the last start codon and go to step 1


5. At the end of the chromosome, repeat the process for the reverse complement
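
The scanning procedure above could be sketched in Python roughly as follows. This is one possible reading of the steps, not the course's reference implementation: it assumes ATG as the only start codon, the three standard stop codons, and that "number of codons" means the codons strictly between start and stop. The names `scan_strand` and `find_genes` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_codons=50):
    """Yield (start, end) for ORFs on one strand; end is just past the stop codon."""
    pos = seq.find("ATG")                            # find a start codon
    while pos != -1:
        for i in range(pos + 3, len(seq) - 2, 3):    # stay in frame
            if seq[i:i + 3] in STOP_CODONS:
                codons_between = (i - pos - 3) // 3
                if codons_between > min_codons:      # long enough: call it a gene
                    yield pos, i + 3
                break
        # resume scanning just after the last start codon
        pos = seq.find("ATG", pos + 1)


def find_genes(seq, min_codons=50):
    """Scan both strands; '-' coordinates refer to the reverse complement."""
    seq = seq.upper()
    genes = [("+", s, e) for s, e in scan_strand(seq, min_codons)]
    genes += [("-", s, e)
              for s, e in scan_strand(reverse_complement(seq), min_codons)]
    return genes


toy_gene = "ATG" + "GCT" * 60 + "TAA"   # made-up ORF with 60 internal codons
print(find_genes(toy_gene))
```

Because scanning resumes right after each start codon, an in-frame ATG inside a reported gene produces a second, shorter hit that ends at the same stop codon; a real tool would have to decide which start to keep.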

Advanced Gene Finding


Identify all ORFs (open reading frames)
> 200 bases on both strands using
normal and alternate start/stop codons


Find high scoring -10, -35 and RBS sites at 5’ ends of putative ORFs


Find high scoring rho terminators at 3’
ends of putative ORFs


Exclude ORFs without identified signals
at 5’ or 3’ ends


Key Prokaryotic Gene
Signals


Alternate start codons


RNA polymerase promoter site (-10, -35 site or TATA box)


Shine-Dalgarno sequence (Ribosome binding site - RBS)


Stem-loop (rho-independent) terminators


High GC content (CpG islands)

Alternate Start Codons (E. coli)

Class I




Class IIa

ATG Met

GTG Val

TTG Leu

CTG Leu

ATT Ile

ATA Ile

ACG Thr

-10, -35 Site (RNA pol Promoter)

-36 -35 -34 -33 -32  ….  -13 -12 -11 -10  -9  -8

  T   T   G   A   C        T   A   t   A   A   T
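
One simple way to look for "high scoring" -35 and -10 sites is to count matches against the consensus letters shown above (TTGAC and TAtAAT). The sketch below is an illustration only: the slide's numbering leaves 18 positions between the two elements, and the spacer range of 16 to 19 used here is an assumption, as are the names `matches` and `best_promoter`. Real gene finders use weighted position matrices, not plain match counts.

```python
MINUS_35 = "TTGAC"    # consensus at -36..-32 on the slide
MINUS_10 = "TATAAT"   # consensus at -13..-8 on the slide


def matches(window, consensus):
    """Number of positions where the window agrees with the consensus."""
    return sum(a == b for a, b in zip(window, consensus))


def best_promoter(seq, spacers=range(16, 20)):
    """Return (score, pos_35, pos_10) for the best-matching pair of sites."""
    seq = seq.upper()
    best = (0, None, None)
    for i in range(len(seq) - len(MINUS_35) + 1):
        score_35 = matches(seq[i:i + len(MINUS_35)], MINUS_35)
        for gap in spacers:
            j = i + len(MINUS_35) + gap
            if j + len(MINUS_10) > len(seq):
                break  # the -10 window would run off the end
            total = score_35 + matches(seq[j:j + len(MINUS_10)], MINUS_10)
            if total > best[0]:
                best = (total, i, j)
    return best


upstream = "CCC" + "TTGAC" + "G" * 18 + "TATAAT" + "CCC"   # made-up example
print(best_promoter(upstream))
```

A perfect match scores 11, the combined length of the two consensus elements; lower scores indicate weaker candidate promoters.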

RBS (Shine Dalgarno Seq)

-13 -12 -11 -10 -9 -8 .. -1 0 1 2 3 4


G G G G G G n
A T G

n C

Terminator Stem-loops

More Sophisticated Methods

RBS site

promoter site

HMM

Really Sophisticated Methods


GLIMMER


http://www.tigr.org/software/glimmer/


Uses interpolated Markov models (IMM)


Requires training on sample genes


Takes about 1 minute/genome


GeneMark.hmm


http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi


Available as a web server


Uses hidden Markov models (HMM)
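
The idea of training a Markov model on sample genes can be shown with a much simpler stand-in. The sketch below is neither GLIMMER's interpolated model nor GeneMark's HMM: it trains two plain fixed-order Markov chains (one on "coding" examples, one on "background") and scores a sequence by its log-odds. The order of 2, the add-one smoothing, the toy training strings and the names `train_markov` and `log_odds` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_markov(seqs, order=2):
    """Count which base follows each context of `order` bases."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        row = counts.get(context, Counter())
        # add-one smoothing over the four bases
        return (row[base] + 1) / (sum(row.values()) + 4)

    return prob


def log_odds(seq, coding, background, order=2):
    """Positive scores mean the sequence looks more like the coding examples."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding(context, base) / background(context, base))
    return score


# Toy training data, invented for this example
coding = train_markov(["ATGGCTGCTGCTGCT" * 3])
background = train_markov(["AT" * 21])

print(log_odds("GCTGCTGCT", coding, background))
print(log_odds("ATATATAT", coding, background))
```

An interpolated model extends this by combining contexts of several lengths, giving more weight to the longer contexts that were seen often enough in training.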

Glimmer Performance

What Next?


Raw DNA sequence → Gene sequences


Gene seqs → Protein sequences


Gene + Protein seqs → Databases


Gene + Protein info → Databases


Most protein and DNA sequence data is
entered into GenBank through XXX


Next Lecture: Databases

Sample Exam Question


Describe an algorithm or sketch a
flowchart for gene finding in
prokaryotes


What are the key features of a
prokaryotic ORF?


Following is a gene sequence; identify and label all major features