The World of Microbes on the Internet - NYU Langone Medical Center

whipmellificiumBiotechnology

Feb 20, 2013 (4 years and 4 months ago)

181 views

Teaching
Bioinformatics

to Undergraduates


http://www.med.nyu.edu/rcr/ASM

Stuart M. Brown

Research Computing, NYU School of Medicine

I.
What is Bioinformatics
?

II. Challenges of teaching


bioinformatics to undergraduates

III.
Common bioinformatics tools



that you can use for teaching

IV. The limits of knowledge

V. Resources for the teacher

I. What is Bioinformatics?


The use of information technology to
collect, analyze,
and interpret
biological data.


The use of software tools that deal with biological
sequences, genome analysis, molecular structures,
gene expression, regulatory and metabolic modeling


Computational biology
-

the design of new algorithms
and software to support biology research


The routine use of computers in all phases of biology
and medicine

The Human Genome Project

A Genome Revolution in Biology
and Medicine


We are in the midst of a "Golden Era" of
biology


The Human Genome Project has produced a
huge storehouse of data that will be used to
change every aspect of biological research
and medicine


The revolution is about treating biology as an
information science, not about specific
biochemical technologies.

The job of the biologist is changing


As more biological information becomes
available …and laboratory equipment becomes
more automated ...


The biologist will spend more time using computers


The biologist will spend more time on experimental
design and data analysis (and less time doing
tedious lab biochemistry)


Biology will become a more quantitative science
(think how the periodic table affected chemistry)


II. Why teach bioinformatics in
undergraduate education?


Demand for trained graduates from the biomedical
industry



Bioinformatics is essential to
understand

current
developments in all fields of biology



We need to educate an entire new generation of
scientists, health care workers, etc.


Use bioinformatics to enhance the teaching of other
subjects: genetics, evolution, biochemistry

Biochemistry & Protein Structures


"Hands
-
on graphics is a powerful enhancement to
learning, particularly individualized learning. There
is powerful synergy in learning about proteins and
learning simultaneously about how to represent and
manipulate them with computer graphics. When
students learn to use graphics they see proteins and
other complex biomolecules in a new and vivid way,
and discover personal solutions to the problem of
"seeing" new structural concepts."


Molecular Graphics Manifesto

Gale Rhodes

Chemistry Department, Univ. of Southern Maine

Challenges of presenting
bioinformatics to undergraduates


Requires a deep understanding of molecular
biology
-

lots of prerequisites


Training users or makers of these tools?


A good bioinformatics program will require
substantially more math and statistics than
most existing molecular biology and
computer science curricula.


Who will teach?

Different Programs,

Different Goals


Integrate into existing biology courses:



genetics, molecular biology, microbiology


Make one or a few cross
-
disciplinary courses


jointly taught by biology and computing faculty


open to both biology and computing students


Create a curriculum for a true bioinformatics major
(is this a double major?)


Are you training for employment or providing the
fundamentals for advanced training?

Shallow End


This workshop will focus on faculty skills
needed at the shallow end of the continuum


(a few lectures or a short course).


Use bioinformatics to teach biological
concepts

»
Evolution

»
Genetics

»
Protein structure and function

How much Computing skills?


Bioinformatics can be seen as a tool that the
biologist needs to use
-

like PCR


Or should biologists be able to write their own
programs and build databases?


it is a big advantage to be able to design exactly the
tool that you want


this may be the wave of the future


Is your school going to train "bioinformatics
professionals" or biologists with informatics
skills?"

Designing a Curriculum


To really master bioinformatics, students need
to learn a lot of molecular biology and genetics
as well as become competent programmers.


Then they need to learn specific bioinformatics
skills
-

dealing with sequence databases,
similarity algorithms, etc.


How can students learn this much material and
still manage a well rounded education?


Graduates of these programs will become scientists
and managers. Writing and presentation skills are
essential components of their education.

Different Schools have

Different Biases


There are still only a handful of
bioinformatics undergraduate programs


[Many more schools offer a single course or a
"specialized track" similar to a
biotechnology

major]


You can generally predict the bias according
to what school/department hosts the program


Computer Science vs. biology


Biomedical engineering


Medical informatics (library science)

Teaching the Teachers


There are more graduate level bioinformatics
programs, but they are all very new.


Graduates of these programs will have many
opportunities as more schools gear up to offer
bioinformatics training


The reality is that most schools will draft
existing faculty
-

often jointly from Bio and
CompSci departments


We need to train an entire generation of
existing faculty in a new discipline

Teaching Tips


Strike a balance between theory and practical
experience


early bioinformatics training should be about what
you can
do

with the tools


deeper training can focus on how they work


Balance the
"click here"

tutorials against letting
them figure it out for themselves


it will be different when they look at it next time


real bioinformatics work involves finding ways to
overcome frustrations with balky computer systems

Training "computer savvy"

scientists



Know the right tool for the job




Get the job done with tools available




Network connection is the lifeline of
the scientist




Jobs change, computers change,
projects change, scientists need to be
adaptable

III. Bioinformatics Tools

You

Can Use


GenBank
-

genes, proteins, genomes


Similarity Search tools: BLAST


Alignment: CLUSTAL


Protein families: Pfam, ProDom


Protein Structures: PDB, RasMol


Whole Genomes: UCSC, Entrez Genomes


Human Mutations: OMIM


Biochemical Pathways: KEGG


Integrated tools: Biology Workbench,





BCM SearchLauncher

Large Databases


Once upon a time,
GenBank

sent out
sequence updates on CD
-
ROM disks a few
times per year.



Now
GenBank

is over 40 Gigabytes


(
11
billion

bases)



Most biocomputing sites update their copy of
GenBank

every day over the internet.



Scientists access
GenBank

directly over the
Web

Finding Genes in GenBank




These billions of G, A, T, and C letters
would be almost useless without
descriptions of what genes they contain,
the organisms they come from, etc.



All of this information is contained in the
"annotation" part of each sequence record.

Entrez

is a Tool for Finding Sequences



GenBank

is managed by the
NCBI

(National Center
for Biotechnology Information) which is a part of
the US National Library of Medicine.



NCBI has created a Web
-
based tool called
Entrez

for finding sequences in
GenBank
.




http://www.ncbi.nlm.nih.gov



Each sequence in
GenBank

has a unique “
accession
number
”.



Entrez

can also search for keywords such as gene
names, protein names, and the names of orgainisms
or biological functions

Entrez Databases contain more than
just DNA & protein sequences


Type in a Query term


Enter your search words in the


query box and hit the “
Go
” button



Refine the Query


Often a search finds too many (or too few) sequences, so
you can go back and try again with more (or fewer)
keywords in your query


The “
History
” feature allows you to combine any of your
past queries.


The “
Limits
” feature allows you to limit a query to specific
organisms, sequences submitted during a specific period of
time, etc
.


[Many other features are designed to search for literature in
MEDLINE]

You can search for a text term in sequence annotations or in
MEDLINE

abstracts, and find all articles, DNA, and protein
sequences that mention that term.


Then from any article or sequence, you can move to "related
articles" or "related sequences".



Relationships between sequences are computed with
BLAST



Relationships between articles are computed with "MESH" terms
(shared keywords



Relationships between DNA and protein sequences rely on accession
numbers



Relationships between sequences and
MEDLINE

articles rely on both
shared keywords and the mention of accession numbers in the articles.

Related Items

Database Search Strategies


General search principles
-

not limited to
sequence (or to biology)


Use accession numbers whenever possible


Start with broad keywords and narrow the
search using more specific terms


Try variants of spelling, numbers, etc.


Search all relevant databases


Be persistent!!

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.


Length = 369



Score = 272 bits (137), Expect = 4e
-
71


Identities = 258/297 (86%), Gaps = 1/297 (0%)


Strand = Plus / Plus




Query: 17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76


|||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||

Sbjct: 1 aggatccaacgtcgctgcggctacccttaaccact
-
cgcagaccccccgcagccatggcc 59




Query: 77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136


|||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||

Sbjct: 60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119




Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196


|||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||

Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179




Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256


||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||

Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239




Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313


|| || ||||| || ||||||||||| | |||||||||||||||||| ||||||||

Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296

Sample Multiple Alignment

Protein domains


(from ProDom database)

Original studied protein from which

annotation was inherited.

New sequence

Closest database annotated entry

Limits on best Matched Annotation Inheritance

result from many things including multi domain proteins transitivity.

Protein Structure


It is not really possible to predict protein
structure from just amino acid sequence


PDB

is a database of know protein structures
(determined by X
-
ray crystallography and
NMR)


There are also very handy structure viewers
such as
RasMol

that are free for any
computer

Genome Browsers


Scientists need to work with a lot of
layers of information about the genome


coding sequence of known genes and
cDNAs


computer
-
predicted genes


genetic maps (known mutations and
markers)


gene expression


cross species homology



UCSC

Ensembl at EBI/EMBL

Human Alleles


The
OMIM

(
O
nline
M
endelian
I
nheritance
in
M
an
) database at the NCBI tracks all human
mutations with known phenotypes.



It contains a total of about
2,000 genetic
diseases

[and another ~11,000 genetic loci with
known phenotypes
-

but not necessarily known gene
sequences]



It is designed for use by physicians:


can search by disease name


contains summaries from clinical studies

KEGG: Kyoto Encylopedia of

Genes and Genomes


Enzymatic and regulatory pathways


Mapped out by EC number and cross
-
referenced to genes in all known organisms



(wherever sequence information exits)


Parallel maps of regulatory pathways

Integrated Online Tools


National Database/Sequence Analsysis
Servers:


NCBI, EMBL/EBI, DDBJ



Tools for specific types of data or problems


Expasy

(Protein, Mass Spec, 2
-
D PAGE)


3
-
D Protein Structures:



PDB, Predict Protein Server


Education oriented tools


Biology Workbench


Collections of links to other servers


BCM SearchLauncher

The Limits of our Knowledge


Bioinformatics is a very dynamic discipline


Teachers can't know everything in the field


The databases are clumsily built


Biology is vastly more complex than our
software


Lots of our current bioinformatics programs don't
work well


We don't have even theoretical solutions for Gene
prediction, alternative splicing, protein structure &
function prediction, regulatory networks

What is a
Gene
?


For every 2 biologists, you get 3 definitions




“A DNA sequence that encodes a



heritable trait.”


The unit of heredity



Is it an abstract concept, or something you
can isolate in a tube or print on your screen?



“Classic” vs.. “modern” understanding of
molecular biology

Genome Confusion


The sequence of a gene in the genome includes:


protein coding sequence


introns and exons


5' and 3' untranslated regions on the mRNA


promoter and 5' transcription factor binding sites


enhancers??


What about alternative splicing?


Multiple cDNAs with different sequences (that
produce different proteins) can be transcribed from
the same genomic locus




V. Teaching Resources


The Biology Student WorkBench


http://peptide.ncsa.uiuc.edu/bioswb.html


RasMol/Chime/Protein Explorer

http://www.umass.edu/microbio/rasmol/

http://www.umass.edu/microbio/chime/

http://www.umass.edu/microbio/chime/explorer/


Bioinformatics.org



Other courses
-

It ain't cheating to learn from
your peers


Terri Attwood's Web Biocomputing tutorials

http://www.biochem.ucl.ac.uk/bsm/dbbrowser/c32/index.html

http://www.biochem.ucl.ac.uk/bsm/dbbrowser/jj/prefacefrm.html



Sequence Analysis on the Web


Christian Büschking and Chris Schleiermacher

http://bibiserv.techfak.uni
-
bielefeld.de/sadr/


Online Lectures on Bioinformatics



Hannes Luz
,
Max Planck Institute for Molecular Genetics




http://lectures.molgen.mpg.de/


Using Computers in Molecular Biology



Stuart Brown, NYU School of Medicine



http://www.med.nyu.edu/rcr/rcr/course/index.html

Teach Yourself Bioinformatics on the Web



http://www.med.nyu.edu/rcr/rcr/btr/index.html

Long Term Implications



A "periodic table for biology" will lead to
an explosion of research and discoveries
-

we will finally have the tools to start
making systematic analyses of biological
processes (quantitative biology).




Understanding the genome will lead to
the ability to change it
-

to modify the
characteristics of organisms and people in
a wide variety of ways

Genomics in Medical Education

“The

explosion

of

information

about

the

new

genetics

will

create

a

huge

problem

in

health

education
.

Most

physicians

in

practice

have

had

not

a

single

hour

of

education

in

genetics

and

are

going

to

be

severely

challenged

to

pick

up

this

new

technology

and

run

with

it
.
"


Francis Collins


Stuart M. Brown, Ph.D.

stuart.brown@med.nyu.edu

www.med.nyu/rcr

Bioinformatics:
A Biologist's
Guide to
Biocomputing
and the
Internet