Information - WV IDeA Network of Biomedical Research Excellence


Oct 2, 2013 (4 years and 1 month ago)




Bioinformatics

Introduction

Acknowledgments


These slides and exercises were prepared as
Bioinformatics Teaching Modules developed
by Elizabeth Murray, Ph.D. and Andrew
Rieser, Integrated Science and Technology,
Marshall University.


Development of the slide shows and exercises
was funded in part by the National Center for
Research Resources (NCRR) of the NIH Grant
#P20 RR16477.


Acknowledgments


These slides have been inspired by many
sources available on the Internet. We have
tried to acknowledge the contributions of
others and the source of the images in the
notes.


If we have overlooked an acknowledgment,
please let us know and we will correct this.



We are basing some of the examples on the
excellent new text “Bioinformatics and
Functional Genomics” by Jonathan Pevsner
(Wiley-Liss, 2004).


What is Computational Biology?
The development and application of
data-analytical and theoretical methods,
mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and
social systems.


What is Bioinformatics?
Research, development, or
application of computational tools and approaches for
expanding the use of biological, medical, behavioral or
health data, including those to acquire, store, organize,
archive, analyze, or visualize such data.

- NIH Biomedical Information Science and Technology Initiative Consortium

Computational Biology vs.
Bioinformatics

What is Informatics?


The term informatics is widely used in both
health care and computer science.


Computer specialists use the term informatics
for computer hardware, software, and
information theory


Medical informatics includes all data
management in a hospital, from patient
records, billing, and images to the medical
literature.



A Good Definition



Bioinformatics is the use of computers
for the acquisition, management, and
analysis of biological information.


It incorporates elements of molecular
biology, computational biology, database
computing, and the Internet


The key element of the definition is
information management


What Kinds of Information?


Bioinformatics deals with any type of data that is of
interest to biologists


DNA and protein sequences


Gene expression (microarray)


Articles from the literature and databases of
citations


Images of microarrays or 2-D protein gels


Raw data collected from any type of field or
laboratory experiment


Software


The analysis of DNA sequence data dominates the
field of bioinformatics, but the term can be used to
describe any type of biological data that can be
recorded as numbers or images and handled by
computers.

Who Works in Bioinformatics?


Bioinformatics is clearly a multi-disciplinary
field including:



computer systems management


networking, database design


computer programming


molecular biology



How to Get a Job in
Bioinformatics


Few scientists describe themselves as
specialists in bioinformatics.


It is difficult to train people to specialize
in this field, since using computer tools to
analyze data requires different skills from
designing those tools.


Other specialists create the mathematical
algorithms used to build the tools.


Strong knowledge of molecular biology is
also needed to frame meaningful questions
and problems for software development
and analysis.

Every Molecular Biologist must be
“Bioinformatics Literate”



Most biologists are “users” not “developers”
of software and algorithms.


This series of presentations and exercises
is intended to help you be a knowledgeable
user of software packages and be able to
frame interesting questions and interpret
your results.


A Good Day Using Bioinformatics



A scientist studying a
model organism,
Arabidopsis, finds a
T-DNA insertional
mutation in a gene they
are studying.


They use the T-DNA
sequence as a probe to
hybridize to a genomic
DNA library and identify
a clone of the genomic
DNA.

Now they can determine the sequence of the gene they
mutated with T-DNA.

A Good Day Using Bioinformatics


The scientist enters the sequence into a
search tool (BLAST, FASTA) and
compares their DNA sequence with all
the DNA sequences in all the databases.


The scientist finds a group of sequences
related to the gene with the T-DNA insertion.


BUT the scientist doesn’t know anything
about those related genes.

A Good Day Using Bioinformatics


The scientist can:


Search publications on the related gene to
determine the gene function in other organisms (or
even their own organism).


Look at the structure of the domains in related
genes to analyze function.


Compare sequences with those of other organisms
to develop “trees” of sequence relationships.


Analyze the promoter sequence of the gene to see
what transcription factor binding sites are there.


Analyze expression data if the gene was included
on microarrays.


All without lifting a pipette or thawing a tube!

A Good Day Using Bioinformatics


Now the scientist can use bioinformatics to
guide their next experiments in the lab.


Design PCR primers to amplify the DNA


Search the sequence for restriction enzyme cut
sites for cloning the DNA for additional
experiments.


Test hypotheses about structure and function of
the protein suggested by the sequence similarities.
If the protein looks like it has a kinase domain,
clone, express and purify it to see if it actually is a
kinase!
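
The restriction site search mentioned above can be sketched in a few
lines of Python. This is only an illustration, not the method used by
any particular software package: the function name find_sites and the
choice of three enzymes are assumptions made for the example, although
the recognition sequences listed are the standard ones for EcoRI, BamHI
and HindIII.

```python
# Illustrative sketch: locate restriction enzyme recognition sites in a DNA string.
# The enzyme selection and the helper name find_sites are assumptions for this example.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(sequence, sites=RECOGNITION_SITES):
    """Return {enzyme: [0-based start positions]} for every recognition site found."""
    sequence = sequence.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start)
            # Continue one base further along so overlapping sites are not missed.
            start = sequence.find(site, start + 1)
        hits[enzyme] = positions
    return hits


if __name__ == "__main__":
    print(find_sites("ttGAATTCaaGGATCCgaattc"))
```

Positions are reported for the given strand only; because these three
sites are palindromic, the same sites are present on the opposite strand.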

Introduction to Molecular
Genetics


Using these slides requires some familiarity
with the principles of molecular biology and
genetics.


If you are from a mathematics or computer
science background, the information in these
slides may be too jargon-filled and detailed
for you.


There are many excellent resources on the
Internet to help you learn some of the basic
terminology of molecular biology.


Excellent Introductory
Resources


The US Department of Energy has created a
useful Primer on Molecular Genetics.


http://www.ornl.gov/sci/techresources/Human_Genome/publicat/primer/toc.html


On Line Biology Textbook


http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html


NCBI’s Science Primer


http://www.ncbi.nlm.nih.gov/About/primer/index.html

The Challenges of Molecular
Biology Computing



The big dataset problem


DNA sequencing


Pairwise and Multiple Alignments


Similarity searching the databanks


Structure-function relationships: Can
sequence patterns predict function?


Phylogenetic analysis: Sequence
conservation across evolution


Genomics

The Big Dataset Problem


Biologists have been very successful in
finding the sequences of DNA and
protein molecules


Automated DNA sequencers


The Human Genome Project


High throughput sequencing of cDNAs
(ESTs)


Information scientists have to develop
tools to keep up with the data


The Big Dataset Problem


Information is being collected, organized, and
made available in databases:


GenBank is the central sequence information
database in the United States


Data is shared between GenBank and European
Molecular Biology Laboratory (EMBL) and the DNA
Database of Japan (DDBJ)


All sequence data submitted to any of these
databases is automatically integrated into the
others.


Sequence data is also incorporated from the
Genome Sequence Data Base (GSDB) and from
patent applications.

The Big Dataset Problem


These presentations will familiarize students
with these databases and their organization.


Students will learn to enter data into the
databases and search for and download data
from the databases.


Students will learn to use some of the
additional bioinformatics tools used to
organize the databases (LocusLink, COGs,
OMIM, SNP, UniGene and others).

DNA Sequencing


One technician with an automated DNA
sequencer can produce over 20 kb (kilobases)
of raw sequence data per day.


The real challenge of DNA sequencing is in
the analysis of the data


DNA sequence reads of ~500 base pairs
must be assembled into complete genes and
chromosomes


These 500 bp reads contain errors, both
incorrect bases and insertions/deletions.
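
To make the assembly step concrete, here is a minimal sketch of greedy
overlap merging in Python. It is a toy under strong assumptions: the
reads are error-free, all come from the same strand, and the function
names overlap and greedy_assemble are invented for this example. Real
assemblers must also cope with the sequencing errors described above,
with repeats, and with reverse-complement reads.

```python
# Toy greedy assembler: repeatedly merge the two reads with the longest
# exact suffix/prefix overlap. Assumes error-free, same-strand reads.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads until no pair overlaps by at least min_len bases."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to merge; remaining reads stay as separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTTAG", "TAGGCA"]))
```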



DNA Sequencing


These presentations allow students to become
familiar with different strategies for genome
sequencing projects.


Students will learn to analyze electronic DNA
sequence files and to use the Accelrys
Wisconsin Package Software to assemble
DNA sequences from such projects.


Pairwise and Multiple Alignments


Pairwise alignment is the basis of similarity
searching


Pairwise alignment has been "solved" as a
computational problem through dynamic
programming


However, the "optimal" alignment calculated
by the computer may not be the best
representation of the true biological
alignment.
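
As an illustration of the dynamic programming idea, the sketch below
fills in a global alignment score table in Python. The scoring values
(match +1, mismatch -1, gap -2) and the function name are assumptions
chosen for the example; real programs use substitution matrices and
more elaborate gap penalties, and they also trace back through the
table to recover the alignment itself.

```python
# Minimal global alignment score by dynamic programming.
# Scoring parameters are illustrative assumptions, not recommended values.

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap   # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

The table guarantees the best score under the chosen scoring scheme,
which is exactly why the result is "optimal" in a mathematical sense
yet may still differ from the biologically correct alignment.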

Pairwise and Multiple Alignments


Multiple Alignment is the basis for the
analysis of protein families and functional
domains.


When pairwise alignment is expanded to
compare multiple sequences, it becomes a
computationally huge problem.


To reduce the enormous number of possible
alignments, a simplified heuristic (approximate)
algorithm known as progressive pairwise
alignment is used

Since this problem is so complex, it is not
practical to compute a truly optimal
alignment of multiple sequences.


Pairwise and Multiple Alignments


These presentations will explain the dynamic
programming algorithm and its application.


These presentations will allow the student to
distinguish between global and local alignment
algorithms and apply them appropriately.


Students will learn the significance of the
Needleman-Wunsch and Smith-Waterman
algorithms and their application.


These slides will allow the student to
understand the role of scoring matrices (PAM
and BLOSUM) and gaps in sequence alignment.
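
The difference between global and local alignment can be seen in a small
Python sketch of local alignment scoring. The only change from the
global recurrence is that a cell is never allowed to fall below zero,
so an alignment can start and end anywhere. The scoring values and the
function name are assumptions for illustration; a PAM or BLOSUM matrix
would replace the simple match/mismatch scores for proteins.

```python
# Minimal local alignment score (the zero floor lets an alignment restart anywhere).
# Scoring parameters are illustrative assumptions.

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best score of any local alignment between a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    # The shared core GATTACA is found despite unrelated flanking sequence.
    print(local_alignment_score("TTTTGATTACATTTT", "CCCGATTACACCC"))
```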

Pairwise and Multiple Alignments


These exercises will allow the student to use,
display and interpret data generated by the
Pairwise and Multiple alignment programs
included in the GCG Wisconsin Package.
These programs include:


Pairwise Comparison


Gap; FrameAlign; Compare; DotPlot; GapShow;
ProfileGap


Multiple Comparison


PileUp; HmmerAlign; SeqLab®; PlotSimilarity;
Pretty; PrettyBox; MEME; HmmerCalibrate;
ProfileMake; ProfileGap; Overlap; NoOverlap;
OldDistances.



Similarity Searching the
Databanks


"Are there any sequences in the databanks
similar to my sequence?"


Directly searching the databanks by
comparing sequences consumes too much
computer time.


The scientist uses timesaving heuristic tools:
FASTA and BLAST


Meaningful interpretation relies on the
informed judgment of the biologist and
interpretation of the statistics.
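
The time saving in heuristic search comes largely from indexing short
"words" instead of aligning the query against every database sequence.
The Python sketch below shows only that word-lookup idea; it is not the
actual FASTA or BLAST algorithm, and the names build_index and
seed_hits, the word length of 4, and the tiny example database are all
assumptions made for illustration.

```python
# Sketch of word (k-mer) seeding: count exact words shared by a query
# and each database sequence. Illustrative only; not FASTA or BLAST itself.
from collections import defaultdict


def build_index(database, k=4):
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query, index, k=4):
    """Return (name, shared word count) pairs, best candidates first."""
    counts = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for name, _ in index.get(query[pos:pos + k], []):
            counts[name] += 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    database = {"seqA": "ATGCGTACGTTAG", "seqB": "GGGGCCCCAAAA"}
    print(seed_hits("GCGTACG", build_index(database)))
```

Only the sequences that share words with the query would then be passed
on to a slower, more careful alignment step.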


Similarity Searching the
Databanks


Students will master the popular search tools
BLAST and FASTA (in their many versions)
used to search the databanks and learn to
interpret the significance of the statistics
and output from these programs.


Students will learn additional sequence
searching and retrieval programs within the
GCG Wisconsin Package (FrameSearch;
MotifSearch; ProfileMake; ProfileSegments;
FindPatterns; Motifs; WordSearch;
Segments; Fetch and NetFetch).


Structure-function relationships:



Sequence patterns that predict function.


The prediction of the function of protein
molecules from their sequence is one of the most
challenging areas of computational molecular
biology.


Sequence determines 3-D structure,
structure determines function


Currently, we can’t predict a 3-D protein structure
from amino acid sequence alone. The best current
approach is based on comparing sequence similarity
to proteins of known structure = "threading"

Structure-function relationships:



Some aspects of 3-D structure can be
predicted from sequence:


α-helix vs. β-sheet


membrane spanning region


helix-turn-helix


signal peptide


Identifying conserved regions (domains or
motifs).


Functions of these conserved domains are
defined by laboratory research.


Domain databases can be used to scan any
unknown protein sequence for the presence of
over a thousand known domains.
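
Scanning a protein for a short conserved motif can be sketched with a
regular expression in Python. The example uses the N-glycosylation site
pattern N-{P}-[ST]-{P} (asparagine, any residue except proline, serine
or threonine, any residue except proline). The function name and the
test sequence are invented for the example, and real domain databases
rely on far richer models than a single pattern.

```python
# Sketch: scan a protein sequence for the N-glycosylation motif N-{P}-[ST]-{P}.
# The lookahead lets overlapping matches be reported as well.
import re

N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (0-based start, matched residues) for each occurrence of the motif."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein.upper())]


if __name__ == "__main__":
    # Hypothetical peptide: one true site (NGSA) and one excluded by proline (NPTQ).
    print(find_motif("MKNGSAANPTQ"))
```

A pattern match only suggests a possible site; as noted above, the
function of a conserved element is established by laboratory research.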

Structure-function relationships:


Databases of important conserved elements
within DNA sequences have been developed:


transcription factor binding sites


restriction enzyme recognition sites


Some 3D RNA structures can also be
predicted based strictly on sequence


by sequence comparison with other known
sequences (such as tRNA)


by simple detection of stem-loop structures
as inverted repeats
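
Detecting a stem-loop as an inverted repeat can be sketched as follows
in Python: for each short stretch of sequence, look a few bases
downstream for its reverse complement. This is a simplified
illustration, not the method used by MFold or StemLoop. The function
names, the fixed stem length of 4, and the loop size limits are
assumptions, and only Watson-Crick pairs are considered (no G-U wobble
pairs, no energy calculation).

```python
# Sketch: find candidate RNA stem-loops as inverted repeats.
# Fixed stem length and loop bounds are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (Watson-Crick pairs only)."""
    return rna.translate(COMPLEMENT)[::-1]


def find_stem_loops(rna, stem=4, min_loop=3, max_loop=8):
    """Return (left start, right start, left arm, loop) for each perfect inverted repeat."""
    rna = rna.upper().replace("T", "U")
    hits = []
    for i in range(len(rna) - 2 * stem - min_loop + 1):
        left = rna[i:i + stem]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop  # start of the candidate right arm
            if rna[j:j + stem] == target:
                hits.append((i, j, left, rna[i + stem:j]))
    return hits


if __name__ == "__main__":
    print(find_stem_loops("GGGCAAAAGCCC"))
```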

Structure-function relationships:


Students will learn to use PubMed and other
literature databases to obtain on-line journal
articles, abstracts and texts.


Students will analyze proteins using software
which identifies sequence motifs, predicts
peptide properties, looks at secondary
structure, hydrophobicity, and antigenicity,
and identifies repeats and regions of low
complexity.
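
As one example of this kind of protein analysis, a hydrophobicity
profile can be sketched in Python as a sliding-window average over the
Kyte-Doolittle hydropathy scale. The window size of 9 and the function
name are assumptions for illustration; longer windows are often used
when looking for membrane-spanning regions.

```python
# Sketch: sliding-window hydropathy profile using the Kyte-Doolittle scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Return the mean hydropathy of each window along the protein."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical peptide: a hydrophobic stretch followed by a charged one.
    for position, value in enumerate(hydropathy_profile("IIIIIKKKKK", window=5)):
        print(position, round(value, 2))
```

Positive values indicate hydrophobic stretches and negative values
hydrophilic ones. Residues outside the 20 standard amino acids raise a
KeyError in this sketch rather than being silently ignored.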


Structure-function relationships:


Students will analyze sequences to
predict RNA or DNA structure. GCG
Wisconsin Package programs include
MFold, PlotFold; StemLoop.


Students will use Gene Prediction
software packages available on the
internet, including Genefinder, Genscan
and GrailII.

Structure-function relationships:


Students will learn to view 3-D protein
structures using Chime, Cn3d, Mage,
Rasmol and Swiss 3D viewer, Spdbv.


Students will design primer pairs using
Oligo3, Prime, PrimePair and TempMelt
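
Two quantities that primer design programs report can be approximated
by hand. The Python sketch below uses the simple "2 + 4" rule (2 °C for
each A or T, 4 °C for each G or C) to estimate melting temperature,
which is only a rough guide for short oligonucleotides. It is not the
calculation used by the programs named above, and the function names
and the example primer are assumptions.

```python
# Sketch: rough primer statistics using the 2 + 4 rule for melting temperature.
# This is an approximation for short oligos, not a replacement for design software.

def rough_tm(primer):
    """Estimate melting temperature in degrees C as 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def gc_fraction(primer):
    """Return the fraction of bases that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


if __name__ == "__main__":
    primer = "ATGCATGCATGC"  # hypothetical 12-mer
    print(rough_tm(primer), gc_fraction(primer))
```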


Phylogenetic Analysis


There are evolutionary assumptions
underlying the science of molecular
sequence analysis.


evolution = mutation of DNA sequences


two species that have genes that are
similar in sequence are more closely related
than are two species that have less
sequence similarity.



It is possible to collect sequence data
from several different organisms, add
up the differences, and estimate their
relationships.
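
"Adding up the differences" can be sketched in Python as the proportion
of differing positions between each pair of aligned sequences. The
species names and sequences below are made up for the example, and the
function names are assumptions. Real analyses correct these raw
proportions for multiple substitutions at the same site before building
a tree.

```python
# Sketch: pairwise proportion of differing sites between aligned sequences.
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_table(aligned):
    """Return {(name1, name2): distance} for every pair of sequences."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


if __name__ == "__main__":
    # Hypothetical aligned sequences, invented for illustration.
    aligned = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACCTTCGA"}
    for pair, distance in distance_table(aligned).items():
        print(pair, distance)
```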



A Phylogenetic Tree

Phylogenetic Analysis


There are many controversies and
objections to such simplistic analyses.


Not all DNA sequences mutate at the same
rate: protein coding regions mutate more
slowly than non-coding regions.


Some positions in protein coding DNA
sequences are more free to mutate than
others


Parsimony vs. maximum likelihood methods
of measuring distance.

Phylogenetic Analysis


Students will investigate the relationships
within an aligned set of sequences through
computation of the pairwise distance between
sequences, construction of phylogenetic
trees, or calculation of the degree of divergence
of two protein coding regions.


The student will be able to collect a set of
related DNA sequences and calculate
phylogenetic distances and create a tree
using software programs in GCG Wisconsin
package (PAUPSearch; PAUPDisplay;
GrowTree; Diverge ).

Genomics


What is genomics?




An operational definition: The application
of high throughput automated technologies
to molecular biology.




A philosophical definition:



A holistic or systems approach to the study
of information flow within a cell

Genomics


Genomics Technologies include:



Automated DNA sequencing and annotation
of sequences


Gene Finding and Pattern Recognition


DNA microarrays


gene expression (measuring RNA levels)


single nucleotide polymorphisms (SNPs)


Protein chips


Protein-protein interactions


Genomics


The student will learn to use microarray
software available from NCBI and Marshall
University’s microarray facility to analyze
gene expression data.


The student will learn to use GCG Wisconsin
package programs designed for genome
analysis, including TestCode, Codon
Preferences, Frame, Repeat, FindPatterns,
Composition, CodonFrequency, Window,
StatPlot, Consensus, FitConsensus, Xnu and
Seg.
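
To give a feel for what a codon frequency analysis computes, here is a
minimal Python sketch that counts codons in a coding sequence. It is
not the GCG CodonFrequency program; the function name and the example
sequence are assumptions, and the sequence is presumed to start in the
correct reading frame.

```python
# Sketch: count codons in a coding sequence read in frame from the first base.
from collections import Counter


def codon_counts(cds):
    """Return a Counter of complete codons; trailing bases that do not fill a codon are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


if __name__ == "__main__":
    # Hypothetical short coding sequence: ATG GCC GCC TAA
    print(codon_counts("ATGGCCGCCTAA"))
```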