Bioinformatics and Databases

abalonestrawBiotechnology

Oct 2, 2013 (3 years and 8 months ago)

Unit 2.4: Bioinformatics and Databases

Objectives: At the end of this unit, students will

- have been introduced to some basic concepts and considerations
  in bioinformatics and computational biology

- know what a relational database is

- understand why databases are useful for dealing with large
  amounts of data

- have been introduced to some of the major online biological
  databases and their features

- have gained experience in extracting data from online
  biological databases

Reading:

Stein, L.D. 2003. Integrating biological databases. Nat Rev
Genet 4: 337-345.

Assignments:

Read the excerpts from Current Protocols in Bioinformatics on
Entrez and the UCSC Browser. Follow along with the examples in
Protocol 1 of each section.

“Genomic research makes it possible to look at biological
phenomena on a scale not previously possible: all genes in a
genome, all transcripts in a cell, all metabolic processes in a
tissue. One feature that all of these approaches share is the
production of massive quantities of data. GenBank, for example,
now accommodates >10^10 nucleotides of nucleic acid sequence
data and continues to more than double in size every year. New
technologies for assaying gene expression patterns, protein
structure, protein-protein interactions, etc., will provide even
more data. How to handle these data, make sense of them, and
render them accessible to biologists working on a wide variety of
problems is the challenge facing bioinformatics, an emerging
field that seeks to integrate computer science with applications
derived from molecular biology. We are swimming in a rapidly
rising sea of data. . . how do we keep from drowning?”

Roos (2001). Science 291:1260.

Bioinformatics is one solution to this problem: a way of coping
with large data sets and making sense of genomic-scale data. But
as with most approaches, it is important to have a sense of what
is and is not possible to achieve using bioinformatics.

Learn to know the difference

Bioinformatics is:

• sometimes a time-saver: you can automate common and/or
  repetitive tasks, and parse large files

• sometimes essential: how else would you analyze results from a
  25,000-gene microarray experiment?

• sometimes not helpful/not useful/unimportant: it can be easier
  and more straightforward to do a simple wet-lab experiment than
  to devise an elaborate computational approach

• sometimes not possible: computers can’t do everything!

It’s also important to have an understanding of the underlying concepts
and algorithms in bioinformatics, just as it’s important to understand the
basic concepts and chemical basis of molecular biology, or genetics, or
biochemistry, if you’re going to do wet-lab experiments.


“Many biologists are comfortable using algorithms like BLAST or
GenScan without really understanding how the underlying algorithm
works. . . . BLAST solves a particular problem only approximately and it
has certain systematic weaknesses. . . . Users that do not know how
BLAST works might misapply the algorithm or misinterpret the results it
returns.” [Pevzner (2004). Bioinformatics 20(14): 2159-2161.]





A historical perspective

The 1960s: the birth of bioinformatics

• High-level computer languages
• Protein sequence data
• Academic access to computers
• Margaret Oakley Dayhoff
  - First protein database
  - First program for sequence assembly

[Image: IBM 7090 computer]

Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458

By way of comparison…

IBM 7090 computer
• 32 Kbytes RAM
• 2.18 µs memory cycle time (roughly 0.46 MHz)
• $2,900,000 in 1960

20-inch Apple iMac
• 1 GB RAM
• 2.4 GHz
• $1199 in 2008

Solving problems in computer science

Necessary parameters for assessing the difficulty of a computer
science problem:

• Algorithmic complexity
  - Is the problem theoretically solvable?
  - If so, what is the most efficient solution?

• Current state of computer technology
  - Memory
  - CPU speed
  - Cost

Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
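
The question of "the most efficient solution" can be made concrete with a small sketch. The example below counts comparisons for two ways of finding one identifier in a sorted list of 25,000 entries. The gene identifiers are invented for illustration, and counting comparisons is only a rough stand-in for running time.

```python
def linear_search(items, target):
    """Scan items one by one; return (index, comparisons made)."""
    comparisons = 0
    for index, item in enumerate(items):
        comparisons += 1
        if item == target:
            return index, comparisons
    return -1, comparisons


def binary_search(sorted_items, target):
    """Halve the search interval each step; the input must be sorted."""
    comparisons = 0
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1  # one probe of the list per loop iteration
        if sorted_items[mid] == target:
            return mid, comparisons
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, comparisons


# Hypothetical, zero-padded identifiers so that string order matches numeric order.
gene_ids = [f"GENE{n:05d}" for n in range(25000)]
target = "GENE24999"

linear_index, linear_steps = linear_search(gene_ids, target)
binary_index, binary_steps = binary_search(gene_ids, target)
print("linear search probes:", linear_steps)
print("binary search probes:", binary_steps)
```

Both functions solve the same problem, but the linear scan needs a number of probes proportional to the list length, while the binary search needs a number proportional to its logarithm.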

Algorithms

• An algorithm is a sequence of instructions that one must
  perform in order to solve a well-formulated problem

• First you must identify exactly what the problem is!

• A problem describes a class of computational tasks. A problem
  instance is one particular input from that class

• In general, you should design your algorithms to work for any
  instance of a problem (although there are cases in which this
  is not possible)
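
As a minimal illustration of the problem/instance distinction, take the problem "compute the GC fraction of a DNA sequence". Each particular sequence is one instance, and the function below is meant to work for any of them. The function name and the example sequences are made up for this sketch.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        # The problem is not well formulated for an empty sequence.
        raise ValueError("sequence must not be empty")
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


# Three instances of the same problem (invented sequences).
instances = ["ATGC", "GGGCCC", "atatat"]
results = [gc_content(seq) for seq in instances]
print(results)
```

Note that the function has to decide what to do with awkward instances (lower-case input, an empty string); pinning that down is part of identifying exactly what the problem is.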


Computer technology: memory, CPU speed, cost

• Dramatic improvements on a yearly basis

• We do a lot of our work using desktop Macs out of the box
  - 2 quad-core 2.8 GHz processors, 500 GB disk space, 4 GB RAM
    for ~$3000
  - 2 quad-core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM
    for ~$6000

• CPU speed vs. memory: which is more important?
  - for protein structure, might need many calculations but
    limited memory
  - for genome searches, might have few calculations but huge
    amounts to store in memory

• Reading from memory is several orders of magnitude faster
  than reading from disk

Databases

• What is a database?
  - A collection of related data elements
  - tables
  - columns (fields)
  - rows (records)

• Records retrieved using a query language

• Database technology is well established

• Tables (entities)
  - basic elements of information to track, e.g., gene, organism,
    sequence, citation

• Columns (fields)
  - attributes of tables, e.g., for the citation table: title,
    journal, volume, author

• Rows (records)
  - actual data
  - whereas fields describe what data is stored, the rows of a
    table are where the actual data is stored
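
The table/column/row vocabulary can be tried out directly. The sketch below builds the citation table mentioned above using Python's built-in sqlite3 module and an in-memory database. The column types are assumptions, and titles that these notes do not give are stored as NULL rather than guessed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The columns (fields) describe WHAT is stored.
conn.execute(
    "CREATE TABLE citation (title TEXT, journal TEXT, volume INTEGER, author TEXT)"
)

# The rows (records) are WHERE the actual data is stored.
conn.executemany(
    "INSERT INTO citation VALUES (?, ?, ?, ?)",
    [
        ("Integrating biological databases", "Nat Rev Genet", 4, "Stein, L.D."),
        (None, "Science", 291, "Roos"),            # title not given in these notes
        (None, "Bioinformatics", 20, "Pevzner"),   # title not given in these notes
    ],
)

cursor = conn.execute("SELECT * FROM citation")
columns = [description[0] for description in cursor.description]
records = cursor.fetchall()

# Records are retrieved with a query language (SQL).
by_journal = conn.execute(
    "SELECT author FROM citation WHERE journal = ?", ("Science",)
).fetchall()

print(columns)
print(len(records), "records")
print(by_journal)
```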

Databases

A very simple form of (non-electronic) database is a filing
cabinet. In the filing cabinet, you can store many different
records (sheets of paper), each containing multiple data elements.

Example: a filing cabinet of invoices

• the filing cabinet is a table

• the columns are the fields of data on the individual invoices
  (customer, product, price, quantity)

• the rows (records) are the individual invoices

The biggest problem with a filing cabinet is that you can only
store your data one way (e.g., in alphabetical order of the
customer’s last name), and there’s no good way of searching your
files based on any other criterion (say, by product ordered).

Databases

A flat-file database, such as a spreadsheet, is the electronic
analogue of the filing cabinet.

This is more easily searchable than a paper filing cabinet, but
is still very unwieldy, especially for large amounts of data.

Databases

Suppose you now want to be able to send an advertisement to
every customer who bought the Acme Snow Machine. You could
add a column to your table that includes the address for each
customer, but this is very inefficient: you will keep repeating
information for customers (like Elmer) who make multiple
purchases. Plus, as the number of rows and columns grows,
searching a flat file becomes more and more time consuming.
Also, it is difficult to construct complex queries (e.g.,
customers who bought the Snow Machine and who like opera or
live in the Southwest desert).

Relational Databases

The solution is the relational database. A relational database
contains multiple tables and defines the relationships between
them. Thus, alongside the invoice table, you might also have a
customer table and a product table.

Relational Databases

Relationships can be built between tables and fields. Together,
the tables, fields, and relationships make up the database
“schema”.
Relational Databases

Now only three items need to be filled in for an invoice: a
customer, a product, and a quantity. The price and total fields
can be filled in automatically: price from a product_table
“lookup” and total by “calculation” (price * qty).
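
The lookup and calculation can be sketched with sqlite3. The table names customer_table, product_table, and invoice follow these notes; the column names in product_table, the address, and the price are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT);
CREATE TABLE product_table  (product TEXT PRIMARY KEY, price REAL);
-- An invoice stores only a customer, a product, and a quantity.
CREATE TABLE invoice (
    customer TEXT REFERENCES customer_table(name),
    product  TEXT REFERENCES product_table(product),
    qty      INTEGER
);
-- Hypothetical sample data.
INSERT INTO customer_table VALUES ('Elmer', '1 Hypothetical Lane');
INSERT INTO product_table  VALUES ('Acme Snow Machine', 250.0);
INSERT INTO invoice        VALUES ('Elmer', 'Acme Snow Machine', 2);
""")

# price comes from a lookup in product_table; total is calculated.
row = conn.execute("""
    SELECT invoice.customer,
           invoice.product,
           invoice.qty,
           product_table.price,
           product_table.price * invoice.qty AS total
    FROM invoice
    JOIN product_table ON invoice.product = product_table.product
""").fetchone()
print(row)
```

Because the price lives in one place only, changing it in product_table changes every invoice total computed this way, which may or may not be what a real invoicing system wants.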

Relational Databases

Now we can send our advertisement to every customer who bought the Acme
Snow Machine by getting their addresses from the customer_table table.

To do this, we use Structured Query Language (SQL):

SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = 'Acme Snow Machine'
AND invoice.customer = customer_table.name;
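
To see the query run, here is a small sketch using sqlite3 with an in-memory database. The two tables follow the names used in the query; the sample customers, addresses, and the second product are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT);
CREATE TABLE invoice (customer TEXT, product TEXT, qty INTEGER);
-- Hypothetical sample data, invented for this sketch.
INSERT INTO customer_table VALUES ('Elmer', '1 Hypothetical Lane');
INSERT INTO customer_table VALUES ('Sam', '2 Placeholder Road');
INSERT INTO invoice VALUES ('Elmer', 'Acme Snow Machine', 1);
INSERT INTO invoice VALUES ('Sam', 'Acme Anvil', 3);
""")

# Same structure as the query in the notes; SQL string literals
# take straight single quotes.
mailing_list = conn.execute("""
    SELECT customer_table.name, customer_table.address
    FROM customer_table, invoice
    WHERE invoice.product = 'Acme Snow Machine'
      AND invoice.customer = customer_table.name
""").fetchall()
print(mailing_list)
```

A customer with several Snow Machine invoices would appear once per invoice; adding DISTINCT after SELECT is one way to avoid sending duplicate mailings.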

Relational Databases

We can also make our complex query, “customers who bought the
Snow Machine and who like opera or live in the Southwest desert”:

SELECT customer_table.name
FROM customer_table, invoice
WHERE invoice.product = 'Acme Snow Machine'
AND invoice.customer = customer_table.name
AND (customer_table.notes LIKE '%opera%' OR
     customer_table.address = 'Southwest desert');
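
A runnable sketch of the complex query follows, again with sqlite3. The notes column appears in the query above; all the sample rows are hypothetical and chosen so that each condition is exercised. Values are passed as parameters (the ? placeholders) instead of being pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (customer TEXT, product TEXT);
-- Hypothetical sample data.
INSERT INTO customer_table VALUES ('Elmer', 'Forest', 'likes opera');
INSERT INTO customer_table VALUES ('Sam', 'Southwest desert', '');
INSERT INTO customer_table VALUES ('Pat', 'Coast', 'likes jazz');
INSERT INTO customer_table VALUES ('Lee', 'Coast', 'likes opera');
INSERT INTO invoice VALUES ('Elmer', 'Acme Snow Machine');
INSERT INTO invoice VALUES ('Sam', 'Acme Snow Machine');
INSERT INTO invoice VALUES ('Pat', 'Acme Snow Machine');
INSERT INTO invoice VALUES ('Lee', 'Acme Anvil');
""")

query = """
    SELECT customer_table.name
    FROM customer_table, invoice
    WHERE invoice.product = ?
      AND invoice.customer = customer_table.name
      AND (customer_table.notes LIKE ? OR customer_table.address = ?)
    ORDER BY customer_table.name
"""
params = ("Acme Snow Machine", "%opera%", "Southwest desert")
names = [row[0] for row in conn.execute(query, params)]
print(names)
```

The parentheses matter: without them, AND binds more tightly than OR, and every desert resident would be returned whether or not they bought the product.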

Online Databases

When you query an online database, your query is translated
into SQL, the database is interrogated, and the answer is
displayed in your web browser.

The components involved:

• Your computer and browser (the “client”)

• Software to receive and translate the instructions you enter
  into your browser (on the “server”)

• The database itself

Image source: David Lane and Hugh E. Williams.
Web Database Applications with PHP & MySQL. O’Reilly (2002).
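
As a rough sketch of what the server-side software does, the function below turns the fields of a web form into a parameterized SQL query and runs it. Everything here is hypothetical: the sequence table, its columns, the accession values, and the form fields are invented, and real database servers are considerably more elaborate.

```python
import sqlite3


def build_query(form):
    """Translate hypothetical web-form fields into parameterized SQL."""
    allowed = {"organism", "gene"}  # columns the form is allowed to filter on
    clauses, params = [], []
    for field, value in sorted(form.items()):
        if field not in allowed:
            raise ValueError(f"unsupported field: {field}")
        clauses.append(f"{field} = ?")
        params.append(value)
    sql = "SELECT accession FROM sequence"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# The "database itself": a tiny in-memory stand-in.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (accession TEXT, gene TEXT, organism TEXT);
INSERT INTO sequence VALUES ('ACC0001', 'GENE1', 'yeast');
INSERT INTO sequence VALUES ('ACC0002', 'GENE1', 'zebrafish');
""")

# What the "client" might submit from a search form.
form = {"organism": "yeast", "gene": "GENE1"}

sql, params = build_query(form)
hits = conn.execute(sql, params).fetchall()
print(sql)
print(hits)
```

User-supplied values travel as parameters and column names are checked against a fixed list, so text typed into the form is never executed as SQL.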

Biological Databases

• Over 1000 biological databases

• Vary in size, quality, coverage, level of interest

• Many of the major ones covered in the annual Database Issue of
  Nucleic Acids Research

• What makes a good database?
  - comprehensiveness
  - accuracy
  - is up to date
  - good interface
  - batch search/download
  - API (web services, DAS, etc.)


“The Ten Commandments When Using Servers”

• Remember the server, the database, and the program version used

• Write down sequence identification numbers

• Write down the program parameters

• Save your internet results the right way
  (use screenshots or PDFs if necessary)

• Databases are not like good wine
  (use up-to-date builds)

• Use local installs when it becomes necessary

Source: Bioinformatics for Dummies
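
One way to follow the first few commandments is to save a small, machine-readable record next to every set of results. The sketch below is only a suggestion: the field names are invented, and the server name, version string, and sequence identifier are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class SearchRecord:
    """What was searched, where, and how (all field names are assumptions)."""
    server: str
    database: str
    program_version: str
    sequence_ids: list
    parameters: dict = field(default_factory=dict)
    run_date: str = field(default_factory=lambda: date.today().isoformat())


record = SearchRecord(
    server="blast.example.org",      # placeholder server name
    database="nr",
    program_version="EXAMPLE 1.0",   # placeholder version string
    sequence_ids=["SEQ_0001"],       # placeholder identifier
    parameters={"evalue": 1e-5},
)

# Serialize to JSON text that can be stored alongside the results.
text = json.dumps(asdict(record), indent=2, sort_keys=True)
print(text)
```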

“Ten Important Bioinformatics Databases”

Database     URL                     Contents
GenBank      www.ncbi.nlm.nih.gov    nucleotide sequences
Ensembl      www.ensembl.org         human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov    literature references
NR           www.ncbi.nlm.nih.gov    protein sequences
SWISS-PROT   www.expasy.ch           protein sequences
InterPro     www.ebi.ac.uk           protein domains
OMIM         www.ncbi.nlm.nih.gov    genetic diseases
Enzymes      www.chem.qmul.ac.uk     enzymes
PDB          www.rcsb.org/pdb/       protein structures
KEGG         www.genome.ad.jp        metabolic pathways

Source: Bioinformatics for Dummies

NCBI (National Center for Biotechnology Information)

• over 30 databases including GenBank, PubMed, OMIM, and GEO

• Access all NCBI resources via Entrez
  (www.ncbi.nlm.nih.gov/Entrez/)
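
Entrez can also be queried by programs through NCBI's E-utilities web service. The sketch below only builds an ESearch request URL and does not contact NCBI; the helper name esearch_url and the search term are made up for illustration, and the current NCBI documentation should be checked for usage limits and parameters before sending real requests.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an E-utilities ESearch request URL."""
    query = urlencode(
        {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Illustrative search term using Entrez field tags.
url = esearch_url(
    "nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]", retmax=5
)
print(url)

# Decode the query string again to confirm what would be sent.
sent = parse_qs(urlsplit(url).query)
print(sent["term"][0])
```

Opening such a URL in a browser, or fetching it with urllib.request.urlopen, would return the matching record identifiers.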






GenBank® is the NIH genetic sequence database, an annotated
collection of all publicly available DNA sequences. There are
approximately 65,369,091,950 bases in 61,132,599 sequence
records in the traditional GenBank divisions and 80,369,977,826
bases in 17,960,667 sequence records in the WGS division as of
August 2006.

www.ncbi.nlm.nih.gov/GenBank

The Reference Sequence (RefSeq) database is a non-redundant
collection of richly annotated DNA, RNA, and protein sequences
from diverse taxa. Each RefSeq represents a single, naturally
occurring molecule from one organism. The goal is to provide a
comprehensive, standard dataset that represents sequence
information for a species. It should be noted, though, that
RefSeq has been built using data from public archival databases
only.

RefSeq biological sequences (also known as RefSeqs) are derived
from GenBank records but differ in that each RefSeq is a
synthesis of information, not an archived unit of primary
research data. Similar to a review article in the literature, a
RefSeq represents the consolidation of information by a
particular group at a particular time.

Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)

The MOD squad

• Most model organism communities have established
  organism-specific Model Organism Databases (MODs)

• Many of these databases have different schemas and implementations,
  although there is movement toward harmonizing many features via the
  Generic Model Organism Database project.

The MOD squad

• SGD: yeast (www.yeastgenome.org)

• WormBase: C. elegans (www.wormbase.org)

• FlyBase: Drosophila (flybase.bio.indiana.edu)

• ZFIN: zebrafish (zfin.org)

• and many others (Xenopus, Dictyostelium, Arabidopsis…)

The MOD squad: what about Homo sapiens?

There is not a true “model organism” database for human.
The two main sources of genome information that have
evolved are the UCSC Genome Browser and Ensembl.

• Ensembl: www.ensembl.org

• UCSC: genome.ucsc.edu

[Screenshots: UCSC Browser; Ensembl]

[Figure: Protein Data Bank (PDB), total and yearly series]