BioInformatics at FSU

What it is, who’s doing it, and why it needs to be done now.

Steve Thompson

Florida State University School of Computational Science and Information Technology (CSIT)

Introductory outline:

What is bioinformatics, genomics, sequence analysis, computational molecular biology . . .

Reverse Biochemistry & Evolution.

Database growth & CPU power.

A very brief ‘Show and Tell’: NCBI Resources, GCG’s SeqLab, phylogenetics.

High quality training is essential!

Graduates need to be competitive on a world biotechnology market.

The University’s role in all of this; outreach.

My definitions:

Biocomputing and computational biology are synonymous
and describe the use of computers and computational
techniques to analyze any biological system, from
molecules, through cells, tissues, and organisms, all
the way to populations.

Bioinformatics describes using computational techniques
to access, analyze, and interpret the biological
information in any of the available biological
databases.

Sequence analysis is the study of molecular sequence
data for the purpose of inferring the function,
mechanism, interactions, evolution, and perhaps
structure of biological molecules.

Genomics analyzes the context of genes or complete
genomes (the total DNA content of an organism) within
and across genomes.

Proteomics is the subdivision of genomics concerned with
analyzing the complete protein complement, i.e. the
proteome, of organisms, both within and between
different organisms.

The reverse biochemistry analogy:

From a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round.

Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.

The computer and molecular databases are an essential part of this process.

The exponential growth of molecular sequence databases & CPU power.

Year    BasePairs  Sequences
1982       680338        606
1983      2274029       2427
1984      3368765       4175
1985      5204420       5700
1986      9615371       9978
1987     15514776      14584
1988     23800000      20579
1989     34762585      28791
1990     49179285      39533
1991     71947426      55627
1992    101008486      78608
1993    157152442     143492
1994    217102462     215273
1995    384939485     555694
1996    651972984    1021211
1997   1160300687    1765847
1998   2008761784    2837897
1999   3841163011    4864570
2000  11101066288   10106023
2001  15849921438   14976310
2002  28507990166   22318883

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Doubling time ~ 1 year!
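
The doubling-time estimate can be checked against the GenBank figures listed above. The following is a minimal sketch, assuming steady exponential growth between the first (1982) and last (2002) rows; the function name `doubling_time_years` is hypothetical and chosen only for illustration.

```python
import math

# (year, base pairs) endpoints copied from the GenBank growth figures above
FIRST = (1982, 680338)
LAST = (2002, 28507990166)


def doubling_time_years(first, last):
    """Average doubling time, assuming steady exponential growth between two points."""
    (year0, size0), (year1, size1) = first, last
    # Number of times the database doubled between the two years
    doublings = math.log2(size1 / size0)
    return (year1 - year0) / doublings


DOUBLING_TIME = doubling_time_years(FIRST, LAST)
print(f"Average doubling time: {DOUBLING_TIME:.2f} years")
```

Averaged over the whole two decades the result comes out a little above one year, which is in line with the rough "~ 1 year" figure; individual years in the table grow faster or slower than that average.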

Database growth (cont.)

The Human Genome Project and numerous other genome projects have kept the data coming at alarming rates. As of April 2003 (50 years after the Watson-Crick double helix!), 16 Archaea, 128 Bacteria, and 10 Eukaryote complete, finished genomes, plus 4 Vertebrate and 5 Plant essentially complete genome maps, are publicly available for analysis, not counting all the virus and viroid genomes available.

The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature.

Some neat stuff from those papers:

We, Homo sapiens, aren’t nearly as special as we had once hoped we were. Of the 3.2 billion base pairs in our DNA:

Traditional textbook estimates of the number of genes were often in the 100,000 range; it turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!

The protein coding region of our genome is only about 1% or so. Much of the remaining ‘junk’ is ‘jumping,’ ‘selfish DNA,’ of which much may be involved in regulation and control. Understanding this network is a huge challenge.

100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)

What are primary sequences?

(Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.

The symbols are the one letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence; however, much of this type of information is available in the reference documentation sections associated with primary sequences in the databases.
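
To make the idea of a primary sequence as a string of one-letter symbols concrete, here is a minimal sketch of the Central Dogma applied to such a string. It assumes the standard genetic code; the function names `transcribe` and `translate`, and the choice to report any codon containing an ambiguity code as `X`, are assumptions made for illustration only.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """DNA -> RNA: same symbols as the coding strand, with U in place of T."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """DNA coding strand -> protein, reading codons from position 0 to the first stop."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguity codes (e.g. N) are not in the table: report X
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGGCCTAA"))
print(translate("ATGGCCTAA"))
```

Everything the functions need is in the one-dimensional symbol string itself, which is the sense of ‘primary’ used here.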

What are sequence databases?

These databases are an organized way to store the
tremendous amount of sequence information that
accumulates from laboratories worldwide. Each
database has its own specific format. Three major
database organizations around the world are
responsible for maintaining most of this data; they
largely ‘mirror’ one another.

North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept.

Also Georgetown University’s NBRF Protein Identification Resource: PIR & NRL_3D.

Europe: European Molecular Biology Laboratory (also EBI & ExPASy): EMBL & Swiss-Prot.

Asia: The DNA Data Bank of Japan (DDBJ).

Content & organization:

Most sequence database installations are examples of complex
ASCII/Binary databases, but they usually are not Oracle or SQL or
Object Oriented (proprietary ones often are). They often contain
several very long text files containing different types of information
all related to particular sequences, such as all of the sequences
themselves, versus all of the title lines, or all of the reference
sections. Binary files often help ‘glue together’ all of these other
files by providing index functions.

Software is usually required to successfully interact with these
databases and access is most easily handled through various
software packages and interfaces, either on the World Wide Web
or otherwise. Nucleic acid databases are split into subdivisions
based on taxonomy (historical). Protein databases are often
organized into sections by level of annotation.
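
The ‘long text files plus index files’ layout described above can be sketched in a few lines. This is an illustration only, assuming a FASTA-style flat file in which each record starts with a `>` header line; the real database installations use their own formats and binary indices, and the names `build_offset_index` and `fetch_record` are hypothetical.

```python
import io


def build_offset_index(handle):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:  # end of file
            break
        if line.startswith(b">"):
            record_id = line[1:].split()[0].decode("ascii")
            index[record_id] = offset
    return index


def fetch_record(handle, index, record_id):
    """Jump straight to one record using the index instead of scanning the file."""
    handle.seek(index[record_id])
    header = handle.readline().decode("ascii").rstrip()
    sequence_lines = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):  # next record begins
            break
        sequence_lines.append(line.decode("ascii").strip())
    return header, "".join(sequence_lines)


# A tiny in-memory stand-in for a very long sequence text file (made-up records)
FLAT_FILE = io.BytesIO(b">seq1 hypothetical record\nACGT\nACGT\n>seq2 another\nTTGA\n")
INDEX = build_offset_index(FLAT_FILE)
print(fetch_record(FLAT_FILE, INDEX, "seq2"))
```

The index is what lets software retrieve one entry from a multi-gigabyte text file without reading it from the top, which is why access normally goes through a software package rather than a text editor.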

What are other biological databases?

Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database.

Still more; these can be considered ‘non-molecular’:

Reference Databases: e.g. OMIM (Online Mendelian Inheritance in Man) and PubMed/MedLine (over 11 million citations from more than 4 thousand bio/medical scientific journals).

Phylogenetic Tree Databases: e.g. the Tree of Life.

Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).

Population studies data: which strains, where, etc.

And then databases that most biocomputing folk don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

So how do you do bioinformatics?

Often on the Internet over the World Wide Web:

Site                         URL (Uniform Resource Locator)           Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/             databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/          protein sequence database
IUBIO Biology Archive        http://iubio.bio.indiana.edu/            database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/         database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                 databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/           databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                    databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                 databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                    databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                 3D mol' structure database
Molecules R Us               http://molbio.info.nih.gov/cgi-bin/pdb/  3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                      The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/          various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                     esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                 HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html    overview of all phylogeny
Ribosomal Database Proj’     http://rdp.cme.msu.edu/html/             databases/analysis/software
WIT Metabolism               http://wit.mcs.anl.gov/WIT2/             metabolic reconstruction
Harvard Bio' Laboratories    http://golgi.harvard.edu/                nice bioinformatics links list

What other resources are available?

Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So,

commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Therefore, UNIX server-based solutions, public domain or commercial (e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.]): the SeqLab Graphical User Interface.

One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!

University bioinformatics objectives:

The university tripartite mission: Education, Research, and Service.

Education: outreach programs and undergraduate and graduate courses.

Research: bioinformatics is becoming an indispensable tool in most biological research, particularly in molecular and cellular biology.

Service: those faculty and staff who know bioinformatics should be available to assist with consultation, systems administration, and hardware access.

Education:

Workshops: continue to teach GCG SeqLab tutorial series; each of the four sessions offered once per semester.

Modules: across the university curricula within existing courses, interdisciplinary by nature, implications, & ethics.

Graduate and Undergraduate Courses: presently three cross-listed biology courses; one introductory, team-taught survey, stressing practical, project-oriented approaches; one advanced algorithms lecture; one programming practicum.

Computational Molecular Biology Program: proposed; to be in association and cooperating with students’ present major department, coordinated by CSIT. Pros and cons . . .

Summer Short Course: long-range ‘dream.’ Participants from world-wide disparate disciplines learning bioinformatics techniques and theory.

GCG SeqLab workshop series:

Four different sessions:

Intro’ to SeqLab & Multiple Sequence Analysis and its supplement,

Rational Primer Design,

Database Searching & Pairwise Comparisons: Significance,

Molecular Evolutionary Phylogenetics.

For more info: http://bio.fsu.edu/~stevet/workshop.html

Modules in existing courses:

Cooperate with extant programs to incorporate bioinformatics into their existing curricula.

Key is to demonstrate necessity of knowledge & offer full cooperation with departments.

Potential courses exist across many different departments, and even across different colleges. Identify potential courses from the General Catalog and approach individual instructors and chairs.

So far: http://bio.fsu.edu/~stevet/modules.html

Courses at Florida State:

Four different Special Topics Biology Department BSC4933/5936 Bioinformatics sections:

First (first offered Spring 2002): “Introduction to Bioinformatics” Steve Thompson et al. Covers both sequence and structural analysis.

Team-taught; lecture + optional lab; pragmatic, real-world, project-oriented approach.

Survey level: introduction to the theory + practical applications. Pluses and minuses; the problems.

Based on Washington State University’s Biochemistry 578 model.

Required by Biomedical Mathematics program.

Courses (cont.)

Second (first offered Fall 2002): “Programming Skills for Computational Biology and Bioinformatics” David Swofford. The Java model, an object oriented framework.

Third (first offered Spring 2003): “Advanced Bioinformatics: Computational Methods” David Swofford. The theory behind sequence analysis algorithms.

New (Fall 2003): “Genomics and Evolution” Thomas Hansen.

Courses (cont.)

Departments other than Biology:

Mathematics: MAP 5485 “Introduction to Mathematical Biophysics” Jack Quine. Mathematical tools in Biophysics. An integral part of their Biomedical Mathematics Program.

Institute of Molecular Biophysics: Center Of Excellence In Biomolecular Computer Modeling & Simulation.

In all courses: don’t ignore implications, ramifications, & ethics of bioinformatics research.

Undergraduate opportunities:

A special undergraduate fellowship: The Howard Hughes Undergraduate Program in Mathematical and Computational Biology.

Twelve Hughes Fellows per year earn a $5000 stipend, a $1200 summer housing allowance, and a $1000 professional meeting allowance.

Supported by two new undergraduate majors programs, one each in the Mathematics and Biology Departments.

Computational Biology Program:

Presently FSU computational biology is
composed of a confusing mix of
undergraduate and graduate programs
across at least three different
Departments from the College of Arts and
Sciences.

We propose a CSIT-coordinated ‘balloon’
program, in association with each student’s
major department, that would consolidate
these efforts. Pros and Cons . . .

Undergraduate and/or Graduate?

Avoid duplication of effort.

Candidate department collaborations.

Summer short course:

Long-range ‘pipe dream’?

Broad spectrum: both instructors and students from many different disciplines and world-wide distribution.

See MBL Mol’ Evol’ Workshop for a model.

One or two weeks?

On-campus room & board support.

To be undertaken, this course will need the potential to achieve an exceptional world-wide reputation!

Bioinformatics degree programs around the world:

Relatively rare, but more are being created all the time. Biocomputing education URLs are documented at:

http://www.csit.fsu.edu/HHP/gradprog.html

Most are graduate course lists, many
are graduate Masters or Ph.D.
programs, some include
undergraduate courses or programs.
There is a huge need for
bioinformatics education worldwide.

Gunnar von Heijne in his old but quite readable treatise,
Sequence Analysis in Molecular Biology; Treasure Trove
or Trivial Pursuit (1987), provides a very appropriate
conclusion:

“Think about what you’re doing; use your knowledge of the
molecular system involved to guide both your interpretation of
results and your direction of inquiry; use as much information as
possible; and do not blindly accept everything the computer offers
you.”

He continues:

“. . . if any lesson is to be drawn . . . it surely is that to be able to
make a useful contribution one must first and foremost be a
biologist, and only second a theoretician . . . . We have to
develop better algorithms, we have to find ways to cope with the
massive amounts of data, and above all we have to become
better biologists. But that’s all it takes.”

Conclusions:

Many fine texts are also starting to become available in the field.

To ‘honk-my-own-horn’ a bit, check out the new Current Protocols in Bioinformatics from John Wiley & Sons, Inc:

http://www.does.org/cp/bioinfo.html

They asked me to contribute a chapter on multiple sequence analysis using GCG software.

Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their Introduction to Bioinformatics: A Theoretical And Practical Approach:

http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0

Both volumes are now available.

For more info:

Visit my Web page: http://bio.fsu.edu/~stevet/cv.html

Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or long distance collaboration.