In-depth Background Information on Bioinformatics


Oct 1, 2013


What is bioinformatics?

(Adapted from the Frequently Asked
Questions page at

In a broad sense bioinformatics describes any use of computers to
handle biological information.

In practice, the definition used by most people is narrower;
bioinformatics is a synonym for "computational molecular biology":
the use of computers to characterize the molecular components of
living things.

"Classical" bioinformatics

When most biologists talk about bioinformatics they are referring to
the practice of using computers to study biomolecules. Biomolecules
include your genetic material (DNA and RNA) and the products of your
genes: proteins. These are the concerns of "classical" bioinformatics,
dealing primarily with sequence analysis.

Fredj Tekaia at the Institut Pasteur offers this definition of
bioinformatics:

The mathematical, statistical and computing methods that
aim to solve biological problems using DNA and amino acid
sequences and related information.

Most large biological molecules are polymers, or ordered chains of
simpler molecular modules called monomers. Think of the monomers
as beads or building blocks which, despite having different colors and
shapes, all have the same thickness and the same way of connecting
to one another.

Monomers that can combine in a chain are of the same general class,
but each kind of monomer in that class has its own well-defined set of
characteristics. Many monomer molecules can be joined together to
form a single, far larger, macromolecule. Macromolecules can have
exquisitely specific informational content and/or chemical properties.

According to this scheme, the monomers in a given macromolecule of
DNA or protein can be treated computationally as letters of an
alphabet, put together in pre-programmed arrangements to carry
messages or do work in a cell. Bioinformatics uses computational
methods to try and figure out the information contained in these
strings of letters.
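
As a small illustration of treating a macromolecule as a string of
letters, the sketch below counts the letters in a DNA sequence and
reports the fraction that are G or C. The example sequence and the
function names are made up for illustration; this is a minimal sketch,
not a standard tool.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")  # the four DNA monomers (bases) written as letters


def base_composition(sequence: str) -> Counter:
    """Count how often each letter occurs in a DNA string."""
    sequence = sequence.upper()
    unknown = set(sequence) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"not DNA letters: {sorted(unknown)}")
    return Counter(sequence)


def gc_content(sequence: str) -> float:
    """Return the fraction of letters that are G or C."""
    counts = base_composition(sequence)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty sequence")
    return (counts["G"] + counts["C"]) / total


example = "ATGGCCATTGTAATGGGCCGC"  # hypothetical sequence, for illustration only
print(base_composition(example))
print(gc_content(example))
```

The same idea applies to proteins, with an alphabet of 20 amino acid
letters instead of 4 bases.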

"New" bioinformatics

The greatest achievement of bioinformatics to date, the Human
Genome Project, is now completed. While sequencing of various
genomes is still ongoing, the nature and priorities of bioinformatics
research and applications are changing to focus on deciphering what
the information contained in these genomes tells us. Here are some of
the ways in which people are attacking this problem:

Now that we have sequenced multiple whole genomes we can
look for differences and similarities between all the genes of
multiple species. From such studies we can draw particular
conclusions about species and general ones about evolution. This
kind of science is often referred to as comparative genomics.

There are now technologies designed to measure the relative
number of copies of a genetic message (levels of gene
expression) at different stages in development or disease or in
different tissues. Such technologies, for example DNA microarrays
(often referred to collectively as functional genomics), are
producing large amounts of data that must be stored and
analyzed using computational methods.

Large-scale methods of investigating the functions and
associations of proteins (for example, yeast two-hybrid methods)
are frequently referred to as proteomics (a combination of
protein and genomics).

Medical informatics is the more direct application of genomic
and proteomic technologies to the investigation of disease, and
usually includes the incorporation of traditional clinical data. The
merging of individual clinical data with newer molecular data,
while providing exciting opportunities for finding the causes of
complex diseases like diabetes, heart disease, or cancer, also
provides some unique problems in data management,
integration, and analysis.

How old is the discipline?

"How old is bioinformatics?" The answer to this depends on which
source you choose to read. Bioinformatics is a relatively young field
compared to physics, chemistry, biochemistry, molecular biology, or
computer science, but people have been engaging in some of its
practices since the dawn of molecular biology 40 years ago.

From Attwood and Parry-Smith's "Introduction to Bioinformatics":

The term bioinformatics is used to encompass almost all
computer applications in biological sciences, but was
originally coined in the mid-1980s for the analysis of
biological sequence data.

From Mark S. Boguski's article in the "Trends Guide to
Bioinformatics":

The term "bioinformatics" is a relatively recent invention,
not appearing in the literature until 1991 and then only in
the context of the emergence of electronic publishing...

...However, some of my role models when I was a
graduate student (Margaret O. Dayhoff, Russell F.
Doolittle, Walter M. Fitch and Andrew D. McLachlan) had
been building databases, developing algorithms and
making biological discoveries by sequence analysis since
the 1960s, long before anyone thought to label this
activity with a special term (if anything it was called
‘molecular evolution’). Even a relatively new kid on the
block, the National Center for Biotechnology Information
(NCBI), is celebrating its 10th anniversary this year,
having been written into existence by US Congressman
Claude Pepper and President Ronald Reagan in 1988. So
bioinformatics has, in fact, been in existence for more than
30 years and is now middle-aged.


Hall 1999 (Longman Higher Education; ISBN 0582327881)


Elsevier, Trends Supplement 1998, p. 1.

Other Fields Related to Bioinformatics


Biophysics

Molecular biology itself grew out of biophysics. The British
Biophysical Society defines biophysics as "an interdisciplinary field
which applies techniques from the physical sciences to understanding
biological structure and function".

Computational Biology

Computational Biology is a broader term than bioinformatics, and
encompasses various computational approaches to modeling biological
problems ranging in scope from ecosystems, to blood flow in the heart,
to a cell, to the dynamics of individual protein molecules.


Genomics

Genomics is the intersection of genetics and bioinformatics, and
involves the analysis or comparison of genomes or subsets of
genomes.

Mathematical Biology

Mathematical biology is less tied to the collection and analysis of
sequence data than bioinformatics, and generally entails developing
mathematical models (or applying existing models) to explain various
features of biological systems.


Proteomics

Proteomics involves characterizing the many tens of thousands of
proteins expressed in a given cell type at a given time (e.g.
measuring their biochemical properties, identifying what other
proteins and smaller molecules they interact with, determining the
spatial location within the cell where they are found, or determining
their three-dimensional structures) and involves the storage and
analysis of vast amounts of data.

Medical Informatics

Medical informatics traditionally deals with the collection and
management of patient data in a health care setting. More recently,
medical informatics has come to involve the combination of this
traditional patient data with newer molecular data to investigate
questions of disease at the molecular level.

Where to go for more information

The information in this document was adapted from:

Wikipedia has a fairly comprehensive page on Bioinformatics:

The European Bioinformatics Institute (EBI) has an extensive
information and tutorial section on their website titled “2can”:

The National Center for Biotechnology Information (part of the
National Institutes of Health, and the home for the BLAST sequence
alignment program) has an education site:

The Howard Hughes Medical Institute (HHMI) has a variety of
educational materials, both online and available for free
on DVD, relating to modern biomedical science at:

Careers in Bioinformatics


This will involve the acquisition and analysis of data from
collaborators, public databases, and genome projects, and scientific

Biomedical Computer Scientist

The role would involve the design and development of programs
and/or databases to be used in the biological field. Strong
programming skills are usually a requirement for this role.


Geneticist

These usually fall into three categories: Research Geneticist,
Laboratory Geneticist and Genetic Counselor. The first two usually
require bioinformatics degrees at graduate level, and all require a
strong understanding of genetics.

Computational Biologist

Computational Biologists develop computational tools and methods to
solve complex theoretical and mathematical problems as they relate to
interpreting genomic information. Knowledge of how to correlate
statistical and mathematical analyses with genetic and biological
information is needed. A bioinformatics biologist would have to
collaborate with other researchers and departments, and their duties
would include development of tools that support research objectives
and the compilation and analysis of data, including writing and
editing reports for journal publication as needed.


Biostatistician

A biostatistician's responsibilities include reviewing potential
bioinformatics publications for statistical accuracy, and writing
reports for in-house team members and collaborators, as well as
information gathering on various studies. At a higher level they would
ensure the consistent application of statistical analysis across
different studies. Experience with a wide range of statistical
methods, such as ANOVA, logistic regression analysis, survival
analysis, linkage analysis, and multivariate analysis would be
necessary. Proficiency in S-PLUS or SAS (computer-based statistical
analysis tools) might be necessary in some cases.
Biomedical Chemist

Biomedical Chemists analyze pharmaceutical materials for quality,
purity and strength. They use approved methodology and observe
safety practices. They produce sample batches of a drug for
troubleshooting and help design the scaling-up process that takes drug
manufacture up to factory proportions. Advanced positions require
extensive record keeping and the supervision and integration of a lab.
Clinical Data Manager

The Clinical Data Manager uses complex computer systems within
bioinformatics environments and needs the analytical skill to
detect and resolve data problems in clinical research studies. A good
understanding of the data generated in a clinical research study,
methodologies for data storage, reviewing data, database design and
testing, and the ability to extract information are all skills that a
Clinical Data Manager must have.

Molecular Microbiologist

The Microbiologist will support efforts to characterize pathogenic
bacteria. The Microbiologist will determine bacterial/spore resistance
to standard and novel antimicrobials and decontaminants under various
conditions. A Bachelor of Science in Microbiology with experience in
general microbiology is sometimes required.

Software/Database Programmer

The bioinformatics programmer is responsible for performing analyses
on data from genomic and other biological databases, clinical trials
and other sources, including listings, tabulations, graphical
summaries and formal statistical estimates and tests. The role
requires the ability to assess the quality of analysis data, perform
cross-study analyses, and create and use SAS macros to automate all of
the above functions. Additionally, the person in this role will design
and create analysis databases. A thorough knowledge of study design
and protocol requirements is fundamental.

Medical Writer/Technical Writer

The duties of the medical writer comprise assisting departments in the
preparation and writing of documents required for regulatory
submissions, and writing study protocols, other documents needed for
clinical studies, and clinical study reports in accordance with
regulatory requirements. Other tasks include drafting and coordinating
the preparation of manuscripts for publication. A Master's or PhD
degree is usually required for this position.

Research Associates and Research Scientists

An advanced degree is needed. Research Associates participate in and
contribute to a scientific objective. They must be conversant with
laboratory equipment and software use as well as safety and protocols.
Research Associates also monitor and collect clinical trial data;
coordinate designed trials; and prepare written reports, protocols,
and study tracking documents. They are responsible for overall site
management, including conducting initiation, interim, and close-out
visits. A high level of interaction is required with physicians and
pharmaceutical companies. Reviewing study documentation and
ensuring compliance with clinical objectives and procedures is also a
requirement for this role.

How does BLAST work?


BLAST stands for Basic Local Alignment Search Tool. It takes as input
a sequence (either DNA or protein), and returns a list of sequences
from the database that are ranked according to how similar they are to
the input sequence. The underlying premise for this technique is that
sequences that are similar to the input, or query, sequence are
homologous to it; that is, they are derived from a common precursor
sequence. Over time the two sequences have accumulated mutations
and are no longer identical, but we can calculate statistically the
likelihood that they are as similar as they are due to pure chance.
Once the sequences diverge sufficiently from each other, we are no
longer able to tell with certainty whether they are homologous or not;
they are no more similar than sequences from genes that are not
derived from the same common precursor molecule (i.e. unrelated
sequences). This is why the ranking according to similarity is
important: we want to find homologous sequences, because then we
can assume that the genes we find in the database have similar
biological functions.

The BLAST page for doing a nucleotide database search (blastn).
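
To make the idea of "a ranked list of database sequences" concrete,
here is a toy sketch that ranks a tiny made-up database by how many
short words (runs of k letters) each entry shares with the query. This
is only an illustration: the function names, the word length, and the
sequences are assumptions, and real BLAST uses shared words only as a
starting point before extending and statistically scoring alignments.

```python
def kmers(sequence: str, k: int = 3) -> set:
    """Return the set of all length-k words found in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def rank_by_shared_words(query: str, database: dict, k: int = 3) -> list:
    """Rank database entries by the number of k-letter words shared with the query."""
    query_words = kmers(query, k)
    scored = [
        (len(query_words & kmers(sequence, k)), name)
        for name, sequence in database.items()
    ]
    # Highest count first; ties are broken alphabetically by name.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical database entries, for illustration only.
toy_database = {
    "seq_close": "ATGGCCATTGTA",
    "seq_partial": "ATGGCCTTTTTT",
    "seq_unrelated": "CCCCCCCCCCCC",
}

for shared, name in rank_by_shared_words("ATGGCCATTGTA", toy_database):
    print(name, shared)
```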


Each database sequence (or database “hit”) has an E-value listed after
it. The “E” stands for “expectation”. This number tells us the
likelihood that the database and query sequences are homologous.
This is sort of backwards from what you might expect, but the number
is telling us how many times we should “expect” (hence the name,
“expectation value”) to see that amount of similarity between two
sequences simply by chance (i.e. by searching a database of randomly
generated DNA sequences). An E-value of 1 means we would expect
to see that level of similarity about once, on average, each time we
search the database with our query sequence. An E-value of 0.1 (or
1 x 10^-1 in scientific notation) means we would expect to see that
level of similarity 1 in every 10 times we searched the database.
Usually scientists use an E-value of 0.001 (or 10^-3) as a cutoff: if
the E-value is smaller than this then the “hit” is assumed to be a
homologous gene (since we would expect to see that level of similarity
by chance only about once in a thousand searches), while if it is
larger than this then we assume that we cannot tell for sure whether
the “hit” is homologous or not; the two sequences are not sufficiently
similar for us to make that determination.

These sequences in the database can be considered homologs to the query
sequence, since there is only 1 chance in 3 x 10
that the level of similarity between the query and database sequences
is due to chance.

These sequences cannot be assumed to be homologs to the query sequence,
since there is roughly a 1 out of 1 chance that we would see this level of
similarity between two sequences simply by chance.
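
The cutoff rule described above can be sketched in a few lines. The
hit names and E-values below are invented for illustration. The helper
that converts an E-value into the chance of seeing at least one such
match assumes that chance matches follow a Poisson distribution with
mean E, which is the usual simplification rather than something stated
in this document.

```python
import math

E_VALUE_CUTOFF = 1e-3  # the 0.001 cutoff mentioned in the text


def chance_of_at_least_one(e_value: float) -> float:
    """Chance of one or more matches this good arising at random.

    Assumes chance matches are Poisson-distributed with mean e_value.
    """
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-e_value)


def classify_hit(e_value: float, cutoff: float = E_VALUE_CUTOFF) -> str:
    """Apply the cutoff: below it, treat the hit as a likely homolog."""
    return "likely homolog" if e_value < cutoff else "cannot tell"


# Hypothetical hits, for illustration only.
hypothetical_hits = {"hit_A": 3e-40, "hit_B": 0.1, "hit_C": 1.0}

for name, e_value in hypothetical_hits.items():
    print(name, e_value, classify_hit(e_value), chance_of_at_least_one(e_value))
```

Note that for very small E-values the two numbers are nearly equal,
which is why the E-value is often read informally as a probability.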


The way BLAST comes up with the similarity measure between two
sequences is to align them. It does this by going through a process of
matching up the appropriate letters (A, G, T, and C, representing the 4
bases, or building blocks, of DNA), maximizing the number of matches
between like letters (e.g. an A aligned to an A), and
minimizing the number of mismatches (e.g. an A aligned to a G). It
does this by giving high scores to matches and negative scores to
mismatches, then adding up the score for the whole alignment. You
can view the alignment between the query sequence and any of the
high-scoring database sequences in the BLAST output.

A BLAST alignment.
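
The scoring idea described above (reward matches, penalize mismatches,
add everything up) can be sketched as follows. The score values of +1
and -1 and the function names are assumptions chosen for illustration,
and sliding the query along the other sequence one position at a time
is a brute-force stand-in, not the faster search strategy BLAST itself
uses.

```python
MATCH_SCORE = 1      # assumed reward for identical letters
MISMATCH_SCORE = -1  # assumed penalty for different letters


def score_ungapped(a: str, b: str) -> int:
    """Add up per-position scores for two equal-length, already aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(MATCH_SCORE if x == y else MISMATCH_SCORE for x, y in zip(a, b))


def best_offset(query: str, subject: str) -> tuple:
    """Slide the query along the subject; return (best score, offset)."""
    best = None
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        score = score_ungapped(query, window)
        if best is None or score > best[0]:
            best = (score, offset)
    if best is None:
        raise ValueError("query is longer than subject")
    return best


# Hypothetical sequences, for illustration only.
print(score_ungapped("ACGT", "ACGA"))
print(best_offset("GATT", "CCGATTACA"))
```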


One of the ways that sequences diverge from each other over time is
to either add or delete nucleotides (i.e. the letters) from one
sequence or the other. This is accounted for by putting gaps in the
alignment; these are represented in the BLAST output by aligning a
letter in one sequence with a dash (-) in the other sequence. These
are also given negative scores, just as with mismatched letters.

A BLAST alignment of protein sequences with gaps (lower right
corner). Note that gaps (shown as dashes) can appear in either the query
sequence or the database sequence.
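
A gapped alignment written with dashes can be scored the same way, with
one extra rule for gap positions. The sketch below assumes a flat
penalty per dash and made-up score values. Real BLAST typically charges
one cost to open a gap and a smaller cost to extend it, so treat this
as a simplified illustration only.

```python
MATCH_SCORE = 1      # assumed reward for identical letters
MISMATCH_SCORE = -1  # assumed penalty for different letters
GAP_SCORE = -2       # assumed flat penalty for each dash
GAP = "-"


def score_gapped(aligned_query: str, aligned_subject: str) -> int:
    """Score two aligned strings of equal length that may contain dashes."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q == GAP and s == GAP:
            raise ValueError("a column cannot pair a gap with a gap")
        if q == GAP or s == GAP:
            total += GAP_SCORE       # letter aligned to a dash
        elif q == s:
            total += MATCH_SCORE     # identical letters
        else:
            total += MISMATCH_SCORE  # different letters
    return total


# Hypothetical alignment, for illustration only.
print(score_gapped("ACG-TA", "ACGTTA"))
```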


One of the useful things about the BLAST site is that one can quickly
find out which organism a sequence belongs to. This can be done for
single sequences by clicking on the sequence description itself, which
takes you to a page giving information about that particular sequence,
or for all the hits at once by clicking on the “Taxonomy Reports” link.
Sometimes it may take a bit of sleuthing to figure out what the
scientific name means in everyday language (e.g. that
is a plant and
Drosophila melanogaster is a fruit fly), but this
is very useful information to have when evaluating the list of hits to
your query sequence.

Click on the “Taxonomy Reports” link to see more detailed information
about the organisms that the sequences found by BLAST come from.