Information - WV IDeA Network of Biomedical Research Excellence


Oct 2, 2013





These slides and exercises were prepared as
Bioinformatics Teaching Modules developed
by Elizabeth Murray, Ph.D. and Andrew
Rieser, Integrated Science and Technology,
Marshall University.

Development of the slide shows and exercises
was funded in part by the National Center for
Research Resources (NCRR) of the NIH Grant
#P20 RR16477.


These slides have been inspired by many
sources available on the Internet. We have
tried to acknowledge the contributions of
others and the source of the images in the

If we have overlooked an acknowledgment,
please let us know and we will correct this.

We are basing some of the examples on the
excellent new text “Bioinformatics and
Functional Genomics” by Jonathan Pevsner
(Wiley-Liss, 2004).

What is Computational Biology?
The development and application of
data-analytical and theoretical methods,
mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and
social systems.

What is Bioinformatics?
Research, development, or
application of computational tools and approaches for
expanding the use of biological, medical, behavioral or
health data, including those to acquire, store, organize,
archive, analyze, or visualize such data.

NIH Biomedical Information Science and Technology Initiative Consortium

Computational Biology vs. Bioinformatics

What is Informatics?

The term informatics is widely used in both
health care and computer science.

Computer specialists use the term informatics
for computer hardware, software, and
information theory

Medical informatics includes all data
management in a hospital from patient
records, billing, images, to medical literature

A Good Definition

Bioinformatics is the use of computers
for the acquisition, management, and
analysis of biological information.

It incorporates elements of molecular
biology, computational biology, database
computing, and the Internet

The key element of the definition is
information management

What Kinds of Information?

Bioinformatics deals with any type of data that is of
interest to biologists

DNA and protein sequences

Gene expression (microarray)

Articles from the literature and databases of

Images of microarrays or 2-D protein gels

Raw data collected from any type of field or
laboratory experiment


The analysis of DNA sequence data dominates the
field of bioinformatics, but the term can be used to
describe any type of biological data that can be
recorded as numbers or images and handled by

Who Works in Bioinformatics?

Bioinformatics is clearly a multidisciplinary
field including:

computer systems management

networking, database design

computer programming

molecular biology

How to Get a Job in

Few scientists describe themselves as
specialists in bioinformatics.

It is difficult to train people to specialize
in this field, since using computer tools to
analyze data requires different skills than
designing those tools.

Other specialists create the mathematical
algorithms used to build the tools.

Strong knowledge of molecular biology is
also needed to frame meaningful questions
and problems for software development
and analysis.

Every Molecular Biologist must be

“Bioinformatics Literate”

Most biologists are “users” not “developers”
of software and algorithms.

This series of presentations and exercises
is intended to help you become a knowledgeable
user of software packages, able to
frame interesting questions and interpret
your results.

A Good Day Using Bioinformatics

A scientist studying a
model organism,
Arabidopsis, finds a
T-DNA insertional
mutation in a gene they
are studying.

They use the T-DNA
sequence as a probe to
hybridize to a genomic
DNA library and identify
a clone of the genomic

Now they can determine the sequence of the gene they
mutated with T-DNA.

A Good Day Using Bioinformatics

The scientist enters the sequence into a
search tool (BLAST, FASTA) and
compares their DNA sequence with all
the DNA sequences in all the databases.

The scientist finds a group of sequences
related to the gene with the T-DNA insertion.

BUT the scientist doesn’t know anything
about those related genes.

A Good Day Using Bioinformatics

The scientist can:

Search publications on the related gene to
determine the gene function in other organisms (or
even their own organism).

Look at the structure of the domains in related
genes to analyze function.

Compare sequences with those of other organisms
to develop “trees” of sequence relationships.

Analyze the promoter sequence of the gene to see
what transcription factor binding sites are there.

Analyze expression data if the gene was included
on microarrays.

All without lifting a pipette or thawing a tube!

A Good Day Using Bioinformatics

Now the scientist can use bioinformatics to
guide their next experiments in the lab.

Design PCR primers to amplify the DNA

Search the sequence for restriction enzyme cut
sites for cloning the DNA for additional

Test hypotheses about structure and function of
the protein suggested by the sequence similarities.
If the protein looks like it has a kinase domain,
clone, express and purify it to see if it actually is a
kinase.
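
The restriction-site search mentioned on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular software package: the function name `find_sites` and the example sequence are made up for this sketch, while the recognition sequences for EcoRI, BamHI and HindIII are the standard ones. Because all three sites are palindromic, scanning one strand is enough.

```python
# Recognition sequences for three common restriction enzymes.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(sequence, enzymes=ENZYMES):
    """Return {enzyme name: [0-based start positions of its site]}.

    `find_sites` is a hypothetical helper written for this example.
    """
    sequence = sequence.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start)
            # Resume one base later so overlapping sites are not missed.
            start = sequence.find(site, start + 1)
        hits[name] = positions
    return hits


# Invented toy sequence, not taken from any real gene.
example = "ccGAATTCttaagcttGGATCCaaGAATTC"
print(find_sites(example))
```

An enzyme with an empty list does not cut the fragment, which is often the property wanted when choosing cloning sites.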

Introduction to Molecular

Using these slides requires some familiarity
with the principles of molecular biology and

If you are from a mathematics or computer
science background, the information in these
slides may be too jargon-filled and detailed
for you.

There are many excellent resources on the
Internet to help you learn some of the basic
terminology of molecular biology.

Excellent Introductory

The US Department of Energy has created a
useful Primer on Molecular Genetics.

On Line Biology Textbook

NCBI’s Science Primer

The Challenges of Molecular
Biology Computing

The big dataset problem

DNA sequencing

Pairwise and Multiple Alignments

Similarity searching the databanks

Structure-function relationships: Can
sequence patterns predict function?

Phylogenetic analysis: Sequence
conservation across evolution


The Big Dataset Problem

Biologists have been very successful in
finding the sequences of DNA and
protein molecules

Automated DNA sequencers

The Human Genome Project

High throughput sequencing of cDNAs

Information scientists have to develop
tools to keep up with the data

The Big Dataset Problem

Information is being collected, organized, and
made available in databases:

GenBank is the central sequence information
database in the United States

Data is shared between GenBank and European
Molecular Biology Laboratory (EMBL) and the DNA
Database of Japan (DDBJ)

All sequence data submitted to any of these
databases is automatically integrated into the

Sequence data is also incorporated from the
Genome Sequence Data Base (GSDB) and from
patent applications.

The Big Dataset Problem

These presentations will familiarize students
with these databases and their organization.

Students will learn to enter data into the
databases and search for and download data
from the databases.

Students will learn to use some of the
additional bioinformatics tools used to
organize the databases (LocusLink, COGs,
OMIM, SNP, UniGene and others).

DNA Sequencing

One technician with an automated DNA
sequencer can produce over 20 kb (kilobases)
of raw sequence data per day.

The real challenge of DNA sequencing is in
the analysis of the data

DNA sequence reads of ~500 base pairs
must be assembled into complete genes and

These 500 bp reads contain errors, both
incorrect bases and insertions/deletions.
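
The core step of assembly, joining two reads where the end of one matches the start of the other, can be sketched as follows. This is a deliberately simplified illustration: the function names and the two toy reads are invented for this example, and the overlap must match exactly, whereas real assembly programs have to tolerate the base and insertion/deletion errors described above.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads on their exact overlap; return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Invented toy reads; real reads are hundreds of bases long.
read_a = "ATGGCGTACGTTAG"
read_b = "CGTTAGGCTAACC"
contig = merge_reads(read_a, read_b)
print(contig)
```

Requiring a minimum overlap guards against joining reads that share only a few bases by chance.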

DNA Sequencing

These presentations allow students to become
familiar with different strategies for genome
sequencing projects.

Students will learn to analyze electronic DNA
sequence files and to use the Accelrys
Wisconsin Package Software to assemble
DNA sequences from such projects.

Pairwise and Multiple Alignments

Pairwise alignment is the basis of similarity

Pairwise alignment has been "solved" as a
computational problem through dynamic
programming.
However, the "optimal" alignment calculated
by the computer may not be the best
representation of the true biological

Pairwise and Multiple Alignments

Multiple Alignment is the basis for the
analysis of protein families and functional

When pairwise alignment is expanded to
compare multiple sequences, it becomes a
computationally huge problem.

To cut down the enormous number of possible
alignments, a heuristic (approximate) algorithm
known as progressive pairwise alignment is used.

Since this problem is so complex, it is not
practical to compute a provably optimal
alignment of many sequences.

Pairwise and Multiple Alignments

These presentations will explain the dynamic
programming algorithm and its application.

These presentations will allow the student to
distinguish between global and local alignment
algorithms and apply them appropriately.

Students will learn the significance of the
Needleman-Wunsch and Smith-Waterman
algorithms and their application.

These slides will allow the student to
understand the role of scoring matrices (PAM
and BLOSUM) and gaps in sequence alignment.
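
As a rough illustration of the dynamic programming idea behind global (Needleman-Wunsch style) alignment, here is a minimal Python sketch. It makes several simplifying assumptions that are not from these slides: a flat match/mismatch score stands in for a PAM or BLOSUM matrix, the gap penalty is linear, and the function name and default scores are chosen only for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""

    def pair_score(x, y):
        # Stand-in for a lookup in a scoring matrix such as PAM or BLOSUM.
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell holds the best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,   # gap in b
                score[i][j - 1] + gap,   # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

A local (Smith-Waterman style) variant differs mainly in that cell scores are never allowed to drop below zero and the traceback starts from the highest-scoring cell rather than the corner.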

Pairwise and Multiple Alignments

These exercises will allow the student to use,
display and interpret data generated by the
Pairwise and Multiple alignment programs
included in the GCG Wisconsin Package.
These programs include:

Pairwise Comparison

Gap; FrameAlign; Compare; DotPlot; GapShow;

Multiple Comparison

PileUp; HmmerAlign; SeqLab®; PlotSimilarity;
Pretty; PrettyBox, MEME; HmmerCalibrate;
ProfileMake; ProfileGap; Overlap; NoOverlap;

Similarity Searching the Databanks

"Are there any sequences in the databanks
similar to my sequence?"

Directly searching the databanks by
comparing sequences is too computer time

The scientist uses timesaving heuristic tools:

Meaningful interpretation relies on the
informed judgment of the biologist and
interpretation of the statistics.
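
Heuristic search tools save time by first looking for short exact "word" matches instead of aligning the query against every database sequence in full. The sketch below illustrates only that seeding idea; it is not the actual BLAST or FASTA algorithm, and the function names, the word size of 4, and the tiny database are all invented for this example.

```python
from collections import defaultdict


def build_index(database, k=4):
    """Map each k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query, index, k=4):
    """Count exact word matches per database sequence, best-scoring first."""
    counts = defaultdict(int)
    for pos in range(len(query) - k + 1):
        word = query[pos:pos + k]
        for name, _ in index.get(word, []):
            counts[name] += 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Invented toy database; a real databank holds millions of sequences.
database = {
    "seqA": "ATGGCGTACGTTAGC",
    "seqB": "TTTTCCCCGGGGAAAA",
    "seqC": "GGCGTACGAA",
}
ranking = seed_hits("GCGTACGT", build_index(database))
print(ranking)
```

Only the sequences that share words with the query would then be passed on to a slower, more careful alignment step, and the statistics of those alignments still need the informed interpretation described above.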

Similarity Searching the Databanks

Students will master the popular search tools
BLAST and FASTA (in their many versions)
used to search the databanks and learn to
interpret the significance of the statistics
and output from these programs.

Students will learn additional sequence
searching and retrieval programs within the
GCG Wisconsin Package (FrameSearch;
MotifSearch; ProfileMake; ProfileSegments;
FindPatterns; Motifs; WordSearch;
Segments; Fetch and NetFetch).

Structure-function relationships:

Sequence patterns that predict function.

The prediction of the function of protein
molecules from their sequence is one of the most
challenging areas of computational molecular

Sequence determines 3-D structure,
structure determines function

Currently, we can’t predict a 3-D protein structure
from amino acid sequence alone. The best current
approach, known as "threading", is based on comparing
sequence similarity to proteins of known structure.

Structure-function relationships:

Can predict some aspects of 3-D structure
from sequence:

alpha-helix vs. beta-sheet

membrane spanning region


signal peptide

Identifying conserved regions (domains or

Functions of these conserved domains are
defined by laboratory research.

Domain databases can be used to scan any
unknown protein sequence for the presence of
over a thousand known domains.

Structure-function relationships:

Databases of important conserved elements
within DNA sequences have been developed:

transcription factor binding sites

restriction enzyme recognition sites

Some 3D RNA structures can also be
predicted based strictly on sequence

by sequence comparison with other known
sequences (such as tRNA)

by simple detection of stem-loop structures as
inverted repeats
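
Detecting a stem-loop as an inverted repeat can be sketched as a search for a short segment that is followed, after a loop of a few bases, by its own reverse complement. The sketch below is an assumption-laden illustration, not the method of any named program: the function names, the stem length of 4, the loop limits, and the toy RNA are invented here, and only standard A-U and G-C pairs are considered (G-U wobble pairs and mismatches in the stem are ignored).

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def find_stem_loops(rna, stem=4, min_loop=3, max_loop=8):
    """Return (stem_start, loop_length) for each exact inverted repeat found."""
    rna = rna.upper()
    hits = []
    last_start = len(rna) - (2 * stem + min_loop)
    for start in range(last_start + 1):
        # The 3' arm must be the reverse complement of the 5' arm to pair with it.
        target = reverse_complement(rna[start:start + stem])
        for loop in range(min_loop, max_loop + 1):
            right = start + stem + loop
            if rna[right:right + stem] == target:
                hits.append((start, loop))
    return hits


# Invented toy RNA: stem GCAU, loop UUCG, then AUGC closing the stem.
toy_rna = "AAGCAUUUCGAUGCAA"
print(find_stem_loops(toy_rna))
```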

Structure-function relationships:

Students will learn to use PubMed and other
literature databases to obtain on-line journal
articles, abstracts and texts.

Students will analyze proteins using software
which identifies sequence motifs, predicts
peptide properties, looks at secondary
structure, hydrophobicity, and antigenicity,
and identifies repeats and regions of low

Structure-function relationships:

Students will analyze sequences to
predict RNA or DNA structure. GCG
Wisconsin Package programs include
MFold, PlotFold; StemLoop.

Students will use Gene Prediction
software packages available on the
internet, including Genefinder, Genscan
and GrailII.

Structure-function relationships:

Students will learn to view 3-D protein
structures using Chime, Cn3d, Mage,
Rasmol and Swiss 3D viewer, Spdbv.

Students will design primer pairs using
Oligo3, Prime, PrimePair and TempMelt
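
One quantity every primer design tool has to estimate is the melting temperature (Tm) of each primer. As a rough illustration only, the sketch below uses the simple "Wallace rule" for short oligonucleotides, Tm = 2 x (A+T) + 4 x (G+C) in degrees Celsius. This is not claimed to be the formula used by the programs named above; the function names and the example primer are invented for this sketch.

```python
def gc_content(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees C using the Wallace rule.

    A rule of thumb for short oligonucleotides; longer primers need
    more detailed (e.g. nearest-neighbor) calculations.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Invented 14-base primer used only to exercise the functions.
primer = "ATGCGTACGTTAGC"
print(gc_content(primer), wallace_tm(primer))
```

When designing a primer pair, the two primers are usually chosen so that their estimated melting temperatures are close to each other.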

Phylogenetic Analysis

There are evolutionary assumptions
underlying the science of molecular
sequence analysis.

evolution = mutation of DNA sequences

two species that have genes that are
similar in sequence are more closely related
than are two species that have less
sequence similarity.

It is possible to collect sequence data
from several different organisms, add
up the differences, and estimate their

A Phylogenetic Tree

Phylogenetic Analysis

There are many controversies and
objections to such simplistic analyses.

Not all DNA sequences mutate at the same
rate: protein coding regions mutate more
slowly than non-coding regions.

Some positions in protein coding DNA
sequences are more free to mutate than

Parsimony vs. maximum likelihood methods
of inferring trees.

Phylogenetic Analysis

Students will investigate the relationships
within an aligned set of sequences through
computation of the pairwise distance between
sequences, construction of phylogenetic
trees, or calculation of the degree of divergence
of two protein coding regions.

The student will be able to collect a set of
related DNA sequences and calculate
phylogenetic distances and create a tree
using software programs in GCG Wisconsin
package (PAUPSearch; PAUPDisplay;
GrowTree; Diverge).
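
The simplest pairwise distance, the proportion of aligned positions at which two sequences differ (often called the p-distance), can be sketched as below. This is an illustration under stated assumptions rather than what the GCG programs compute: the function names, the taxon labels, and the three short aligned sequences are invented, and no correction is made for multiple mutations at the same site.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of differing positions between two aligned sequences.

    Columns where either sequence has a gap ("-") are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must come from the same alignment")
    columns = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in columns) / len(columns)


def distance_matrix(alignment):
    """Return {(name1, name2): p-distance} for every pair in the alignment."""
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


# Invented toy alignment; not real sequence data.
alignment = {
    "taxonA": "ATGGCGTACG",
    "taxonB": "ATGGCGTACA",
    "taxonC": "ATGACGTTCA",
}
distances = distance_matrix(alignment)
for pair, value in distances.items():
    print(pair, value)
```

A tree-building method would then group the pair with the smallest distance first, subject to the caveats about unequal mutation rates raised earlier.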


What is genomics?

An operational definition: The application
of high throughput automated technologies
to molecular biology.

A philosophical definition:

A holistic or systems approach to the study
of information flow within a cell


Genomics Technologies include:

Automated DNA sequencing and annotation
of sequences

Gene Finding and Pattern Recognition

DNA microarrays

gene expression (measuring RNA levels)

single nucleotide polymorphisms (SNPs)

Protein chips

protein interactions


The student will learn to use microarray
software available from NCBI and Marshall
University’s microarray facility to analyze
gene expression data.

The student will learn to use GCG Wisconsin
package programs designed for genome
analysis, including TestCode, Codon
Preferences, Frame, Repeat, FindPatterns,
Composition, CodonFrequency, Window,
StatPlot, Consensus, FitConsensus, Xnu and