Bioinformatics - Faculty Web Directory


Feb 20, 2013




Jordan Martinez











2.1 Economic Growth
2.2 Advances
3.1 Software
3.1.1 Algorithms
3.1.2 Microarray
3.1.3 Protein modeling
3.2 Hardware
3.2.1 Supercomputer on Chip
3.2.2 Distributed processing
3.3 Perl

Major advances in molecular biology and genomic research have led to an enormous growth of information. The new research has forced biologists to turn to new technologies: bioinformatics and computational biology. "Bioinformatics is one of the most rapidly growing areas of biological science" (2 p. xi). It may not be a researcher's main area of interest, but the techniques that come from it have become an integral tool. Bioinformatics is so valuable because it improves upon research in genetic engineering, diseases, evolutionary biology, genomics, and proteomics.

Although a major part of bioinformatics is databases and database mining, this is not the extent of it. There is also genomics and proteomics, covering studies of the genome and proteome. Genomics is the study of DNA sequences, including gene finding, pairwise sequence alignment, multiple sequence alignment, and transcription factor binding site identification. Proteomics has many of the same functions as genomics, but also includes protein folding, active site determination, protein interaction, and protein localization. There is also computer-aided drug design, molecular phylogenetics, and systems biology.


A new addition to bioinformatics is the focus on subcellular and molecular levels of biology. Microarrays are the best tools for studying gene expression on a genomic scale. Another technique, protein folding, simulates the way a protein folds into its characteristic three-dimensional structure. A protein that does not fold into its intended shape becomes an inactive protein with different properties. Sequence alignment is done on DNA, RNA, and protein sequences. It compares two bio-sequences, adding spaces where needed to find similarities.

. Also see appendix.
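The gap-insertion idea described above can be sketched with a classic global alignment algorithm (Needleman-Wunsch); the scoring values here (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from the text.

```python
# Minimal global sequence alignment sketch (Needleman-Wunsch).
def align(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back, inserting '-' (a space in the text's terms) for gaps.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

print(align("GATTACA", "GCATGCU"))
```

Both returned strings have equal length; stripping the inserted '-' characters recovers the original sequences.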

Bioinformatics is a broad field. It encompasses mathematics, computer science, biology, microbiology, and many others. The language barrier between disciplines is often a problem in bioinformatics. Few schools offer bioinformatics as a course of graduate studies; even fewer offer undergraduate studies. In order for a person to be proficient with both computers and biology, two or three courses of study need to be attempted.



In the past, biology and computers were two completely separate fields. The emergence of one had little to do with the success or failure of the other. As the techniques in biology advanced, the need for processing and storing new data grew. There are archives of data that have been made electronic, databases full of protein, DNA, and RNA sequences, and new computer-generated results that require a different way of handling data than has been used in the past.


Economic Growth

A new industry is growing around bioinformatics, estimated at $1.82 billion in 2007. Computers have been used in the field of biology for a long time, but mostly in the biomedical field. Bridging the gap between knowledge of both biology and computers is a major impediment. Bioinformatics courses for biologists and computer scientists can help solve this.


As the industry grows, so will the need for developers and researchers. The field is ever expanding and could provide a lucrative position for an ambitious individual. As new techniques of gathering data for bioinformatics grow, so do the tools for analysis and standardization. As long as there are advances in the field, there will be new positions. This means there will always be economic growth in this field.


Advances

Even with the emergence of the industry, management information systems still lag behind the amount of information being generated. There are impediments on both the hardware and software sides of biocomputing mechanisms. They both need to increase their pace for projects like the Human Genome Project and other genome sequencing projects. Algorithms also need to be developed to make the process more efficient. Although hardware advances make processing information much faster, algorithms have to be modified or replaced to make proper use of the added power.

There have been major advances in hardware, both specialized and common. Some of the hardware resides in the same physical location, while other systems use distributed computing to generate results from computers around the world.


A major problem with new hardware is the
need for software to take advantage of the new
processing power. It is not sufficient to have faster
hardware if the software is not optimized to run on it.



Some of the key components of bioinformatics are algorithms, hardware, software, and programming languages. Without these tools bioinformatics could never have evolved into the massive entity it is today. Algorithms are developed to make better use of database mining so information can be classified more easily. Hardware has developed faster than the software in bioinformatics. Programming languages provide the tools required for the analysis of information. Software, hardware, and programming languages all play equal roles in the development and progression of bioinformatics.



Software

Software is written to help researchers make sense of the information they gather. Much of the software used is for database mining. This is the searching and matching of gene or protein sequences in a database. Databases allow researchers easy access to information and methods for extracting only the information needed to answer a specific question. The information gathered is not enough on its own; it also has to be analytically and statistically analyzed. These analyses give biologists the information needed to draw conclusions from the data collected.

One example of the software used is Entrez. It is a data retrieval system developed by the National Center for Biotechnology Information. Entrez provides integrated access to data including literature, nucleotide and protein sequences, complete genomes, and three-dimensional structures, among others. Other examples of software simulate protein folding, analyze trees or other complex data structures, and interact with databases.



Algorithms

Database mining is the process by which hypotheses are generated regarding the function or structure of a gene or protein by identifying similar sequences in other organisms. There are several examples of mining that will be discussed: clustering, disease-specific genomic analysis, minimum spanning tree over k vertices (k-MST), and index-based similarity search. The k-MST approximation applies a highly refined clustering algorithm to approximate relationships in trees.


Clustering is the method of classifying objects of a similar kind into respective categories. It puts objects into groups based on similarities and differences. This method of categorizing information allows scientists to find similarities in gene and protein sequences so diseases can be grouped together for treatment and diagnosis. Other methods like Disease-Specific Genomic Analysis or k-MST are improvements on a clustering algorithm to find specific information more efficiently.


The algorithms used give biologists the information needed to analyze the data and draw conclusions. Without the implementation of algorithms on computer systems, biologists would have limited means of analyzing gathered data.

Clustering, also referred to as cluster analysis, is an array of algorithms for grouping objects of similar kind into respective categories. This categorization, in theory, sorts different objects into groups in a way that the degree of association is high if they belong to the same group and low otherwise. This technique does not, however, provide any explanation, meaning, or interpretation of the categorization; it simply states that there is an association.


There are many applications for clustering. It is used in medicine for clustering diseases, cures for diseases, and symptoms of diseases. It is also used in other fields, like psychiatry to diagnose clusters of symptoms, or archaeology to find clusters of tools and other artifacts.


Many types of cluster analysis exist. There are the joining method, two-way joining analysis, k-means clustering, and expectation maximization clustering. Each of these follows different logic to link its clusters, and each has varying levels of complexity.

The first is the joining or tree clustering method. It joins objects into larger clusters by some measure of similarity or distance. The distance can be measured by the Euclidean distance, the squared Euclidean distance, the city-block distance, and others. After the distances have been calculated, the clusters are amalgamated. The process is then repeated until all the clusters have been linked to form a hierarchical tree.


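The three distance measures named above translate directly into small functions; the points used in the example are made up.

```python
import math

# Three common distance measures used by joining (tree) clustering.
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def city_block(p, q):  # also called Manhattan distance
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))          # 5.0
print(squared_euclidean((0, 0), (3, 4)))  # 25
print(city_block((0, 0), (3, 4)))         # 7
```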
There are numerous methods to link the clusters in the joining method. Some of the common ones are single linkage, complete linkage, un-weighted pair-group average, weighted pair-group average, and Ward's method. Single linkage connects a cluster to its nearest neighbor. Complete linkage connects a cluster to its farthest neighbor. The un-weighted pair-group average method calculates the distance between two clusters as the average distance between all pairs of objects in the two different clusters. Weighted pair-group average works the same as the un-weighted version but uses the size of the clusters as a weight. It is a better method when the cluster sizes are uneven. Finally, Ward's method can be used to amalgamate clusters. It minimizes the sum of squares of any two hypothetical clusters to do the linking. It is very efficient, but leads to clusters of small size.

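The first three linkage rules above can be sketched as small functions over two clusters, assuming a plain pairwise-distance function; the example clusters are made up one-dimensional points.

```python
# Between-cluster distance rules for joining (tree) clustering.
# Clusters are lists of points; `dist` is any pairwise distance function.
def single_linkage(c1, c2, dist):
    # distance to the nearest neighbor across the two clusters
    return min(dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2, dist):
    # distance to the farthest neighbor
    return max(dist(p, q) for p in c1 for q in c2)

def unweighted_average(c1, c2, dist):
    # average distance over all cross-cluster pairs
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

dist = lambda p, q: abs(p - q)
a, b = [0.0, 1.0], [4.0, 6.0]
print(single_linkage(a, b, dist))     # 3.0
print(complete_linkage(a, b, dist))   # 6.0
print(unweighted_average(a, b, dist)) # 4.5
```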

The second clustering method is the two-way joining analysis. This method is identical in all ways to the joining method; however, instead of a single linkage, two variables are linked to cluster. Data rarely needs to be linked together in this way, and doing so can cause conflicting results. Because of the problems with the resulting clusters, it is the least used method.


Another method of clustering is k-means clustering. This method takes a hypothesis for the number of clusters, k, from the researcher. The best k is either known a priori or is computed from the data. The algorithm then calculates exactly k different clusters of greatest distinction. There are two major steps in the computation. It starts with k random clusters. It then tries to first minimize variability within clusters and then maximize variability between clusters.

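A minimal sketch of those two steps, assuming one-dimensional points and a fixed iteration count; the data values are made up.

```python
import random

# Minimal k-means sketch: start with k random clusters, then iteratively
# reassign points to the nearest center and recompute each center's mean,
# which reduces within-cluster variability.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random initial clusters
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):  # recompute each center as its mean
            if c:
                centers[i] = sum(c) / len(c)
    return sorted(centers)

# Two well-separated groups around 1.0 and 10.0:
print(kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2))
```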

Expectation maximization (EM) clustering is similar to the k-means algorithm. It varies from other methods of clustering by using statistical methods to link clusters. It can be used with both continuous and categorical variables. Other algorithms have to be modified to work with categorical variables. To make this accommodation, different weights are added to each category for each cluster. At each iteration, probabilities are improved to maximize the likelihood of the data given the specified number of clusters.


The EM algorithm works with samples of distribution clusters. Each can have different means and standard deviations. It works by using the resulting distribution rather than individuals. The EM algorithm approximates the observed distributions based on mixtures of different distributions in different clusters. The distributions can be normal, log-normal, or Poisson. The clusters can be a combination of any of the different distributions. It computes classification probabilities rather than assignments of observations to clusters, so each observation belongs to its corresponding cluster with a certain probability attached. The final result is viewed as an assignment of observations to clusters by the largest classification probability.

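The classification-probability idea can be illustrated with a single E-step over two normal clusters; the cluster parameters and mixture weights are illustrative assumptions, not values from the text.

```python
import math

# EM's key difference from hard clustering: compute the probability that
# an observation belongs to each cluster, rather than a single assignment.
def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def classification_probs(x, clusters, weights):
    # E-step: posterior probability of each cluster given observation x
    likelihoods = [w * normal_pdf(x, m, s) for (m, s), w in zip(clusters, weights)]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

clusters = [(0.0, 1.0), (5.0, 1.0)]  # (mean, std) of two normal clusters
probs = classification_probs(1.0, clusters, [0.5, 0.5])
print(probs)  # observation 1.0 belongs to the first cluster with high probability
```

Reading off the largest classification probability turns this soft result back into the final cluster assignment the text describes.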

There are numerous methods for classifying
data using clustering algorithms. Each works in a
unique way, and should be used in the correct
context, depending on the problem.

Disease-Specific Genomic Analysis (DSGA)

A problem with gathering too much information is making sense of it. Too much information makes an experiment difficult to manage. Data becomes hard to analyze, categorize, and draw conclusions from. Genomic research has to deal with large amounts of information, but must still make sense of it so biologists can draw conclusions.

DSGA is a specialized case of clustering and class prediction. It measures how one disease deviates from a normal phenotype. After this, the deviating information is isolated. The DSGA method, in some experiments, has outperformed standard clustering methods. There are many uses for disease-specific genomic analysis. It can be used on microarrays, DNA, RNA, and any other highly dimensional genomic or proteomic data.


The method of disease-specific decomposition uses microarray data for analysis. The method described is based on decomposing expression in diseased tissue. The decomposition is defined by computing a linear model of disease tissue expression fit onto normal expression data.

The data decomposition contains two sets of data. The first set is the diseased tissue microarray data. The next set is the normal tissue microarray data. The number of samples in each of these sets does not have to be equal in order for the analysis to be completed. The normal set of components is used to fit the diseased tissue into a linear model. The decomposition is then

    Ti = Nc.Ti + Dc.Ti

where Nc.Ti is the fit to the linear model (the Normal Component) and Dc.Ti is the vector of residuals from the linear model (the Disease Component).

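A toy sketch of this decomposition, using a single normal expression vector as a one-dimensional stand-in for the normal expression space (ordinary least squares without intercept); all data values are made up.

```python
# Illustrative decomposition T = Nc.T + Dc.T: fit the diseased expression
# vector onto a normal expression vector by least squares and take the
# residual as the disease component.
def decompose(diseased, normal):
    # least-squares coefficient projecting `diseased` onto `normal`
    beta = sum(d * n for d, n in zip(diseased, normal)) / sum(n * n for n in normal)
    nc = [beta * n for n in normal]             # Nc.T: fit to the linear model
    dc = [d - f for d, f in zip(diseased, nc)]  # Dc.T: vector of residuals
    return nc, dc

normal = [1.0, 2.0, 3.0]
diseased = [2.0, 4.1, 5.9]
nc, dc = decompose(diseased, normal)
# Nc.T + Dc.T reconstructs the diseased measurements:
print([round(n + d, 6) for n, d in zip(nc, dc)])  # [2.0, 4.1, 5.9]
```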
The next step in DSGA is estimation of the normal expression space N. This is the process of reducing the dimension of the normal expression data. A modified principal component analysis (PCA) was used, although it had limitations because it needed more normal tissue samples. Using this, however, DSGA performed better than traditional methods of analysis.

DSGA addresses a certain series of biological characteristics:



Disease is a deviation of expression from a
normal/healthy state.


The model N for the normal state includes
biological diversity. The set of normal data
can come from any number of sources, and
there is a natural fluctuation in conditions.


For the testing, patients were not required to provide a normal tissue sample. This would be impossible where all or most of an organ was altered in some way by disease.


Each diseased sample is analyzed without
reference to any other diseased tissues.

DSGA was compared to other methods of analysis on a gastric cancer dataset and a breast cancer dataset. In the gastric cancer data set, DSGA performed better for both tumor site and tumor type distinctions. The results were similar in the breast cancer dataset, where DSGA performed better on both log-ratio data and zero-transformed data.


The case for using disease-specific genomic analysis is overwhelming. With minimal alteration to current formulas, DSGA outperformed every other analysis method it was tested against.
Minimum spanning tree over k vertices (k-MST)
Another method of data analysis used in bioinformatics is graph mining. This technique puts information in trees and then uses specialized clustering algorithms to gather the data needed to make hypotheses based on the relationships. This is a very specialized set of algorithms that need to run efficiently and reliably. These algorithms make use of techniques like microarray analysis.

One algorithm that is used for this task is the approximation algorithm for the k-MST. k-MST is an algorithm that performs hierarchical clustering, gets the best subset of clusters having k vertices, connects them to the root vertex, and extracts the optimal k-MST from the resultant tree. This algorithm shows good running time, quality, and statistical guarantees.

The approximation algorithm for k-MST works in seven steps. Also see appendix.


1. First, the shortest distance is computed from each vertex of the graph to the root vertex using Dijkstra's algorithm.

2. Second, the optimal cost OPT is guessed. Steps 3-7 are repeated to find the best k-MST.

3. Next, Kruskal's algorithm is used for single linkage. At each iteration, the next two minimum edges are added to form a super-cluster. This is repeated until all clusters are merged.

4. The distance from each cluster to the root is calculated using recursion. The cost of this is the edge weights plus the distance to the root vertex.

5. The optimal subset of disjoint clusters C that has exactly k vertices is chosen from the hierarchical clusters.

6. Each cluster is connected to the root vertex using the shortest path.

7. Finally, the optimum k-MST is found from the resultant tree.

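The first step (shortest distances from every vertex to the root) can be sketched with a standard Dijkstra implementation; the example graph is made up.

```python
import heapq

# Dijkstra's algorithm: shortest distance from a root vertex to every
# reachable vertex. The graph is an adjacency dict of
# {vertex: [(neighbor, edge_weight), ...]}.
def dijkstra(graph, root):
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already improved
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {
    "r": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("c", 5)],
    "b": [("c", 1)],
    "c": [],
}
print(dijkstra(graph, "r"))  # {'r': 0, 'a': 1, 'b': 3, 'c': 4}
```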
The total running time of this algorithm is low-order polynomial, so even for a large value of k the algorithm scales very well.

An example is shown using the algorithm to find biological pathways in a yeast network. The graph contained 5552 vertices and 34000 edges. Even though the graph was missing edges between parts of the graph, the algorithm was able to circumvent the broken chain to find the rest of the pathway.

Graph mining is a very important problem to solve in a timely manner. The approximation algorithm for k-MST is an efficient algorithm for finding substructures. The results show the algorithm has a great time complexity and approximation ratio. Also see appendix.
Index-based similarity search

Index-based similarity search is a method of finding similarities in protein structure databases. This extraction method improves the running time of VAST by 3 to 3.5 times while keeping sensitivity similar.


There are two methods of index-based similarity searches. The first method matches a pairwise alignment tool like VAST. First, vector features are extracted from triplets of Secondary Structure Elements (SSEs) of proteins. Following this, the features are indexed by a multidimensional index structure. It finds proteins similar to a given query protein in a given dataset. After this, proteins are aligned to the query using a pairwise alignment tool. The second method joins two protein datasets to find all-to-all similarities.

A tool developed by the National Center for Biotechnology Information is Entrez. This software is a data retrieval system that has access to many domains. These include literature, nucleotide and protein sequences, complete genomes, three-dimensional structures, and many others. Some of the databases Entrez has access to are GenBank, the curated RefSeq database, nucleotide sequences from Protein Data Bank (PDB) records, and the Third Party Annotation (TPA) database.

Many useful functions can be accomplished using Entrez. One can find representative sequences, retrieve associated literature and protein records, identify conserved domains, identify similar proteins and known mutations, find a resolved three-dimensional structure, view genomic context, and download the sequence region. The amount of information and the power behind Entrez make it a very useful tool.

Entrez has access to approximately thirty-five kinds of databases. Each type is then linked to numerous other databases for cross-referencing, matching, and finding many other characteristics.

Some of the most powerful searches through a database of biological information produce unexpected results. For example, when doing a query using Entrez with a specific name, sequence, or other identification, not only will the results for the sequence show, but the related journals, abstracts, and other resources as well. Entrez is a compilation of tools. It uses algorithms like clustering for sequence comparisons. Others are used to link queries to more information.

Entrez is an amazing example of software for bioinformatics. It combines a number of algorithms, search results, programming tools, and many customizations. It has a staggering amount of data linked to its 30+ databases. Entrez is an enormous set of tools that has many practical uses for biologists and those in the computer field.



Microarray

The next topic of discussion is microarrays. Microarrays are small chips used to study gene expression. Microarray chips identify which genes are active at the time of testing, because not all genes are active at one time. Software captures, manages, and analyzes DNA microarray expressions.

The future goal of microarray technology is treating diseases at a local level.


Microarray analysis works by having fragments of human DNA stuck to spots on the chip. Next, modified DNA fragments are added and will stick to the previous fragments where they match. Two samples taken at different states, identified by fluorescent coloring, allow an image to be derived from the microarray. The colors are spots of green, yellow, and red.

The complete process is as follows:



1. Prepare the DNA chip using the chosen target.

2. Generate a hybridization solution that contains a mixture of fluorescently labeled cDNAs.

3. Incubate the hybridization mixture containing the cDNAs with the DNA chip.

4. Detect bound cDNA using lasers and store the data in a computer.

This whole process is known as hybridization probing. Fluorescently labeled nucleic acid molecules called mobile probes identify complementary molecules. These are sequences that are able to base pair with one another. DNA is made of four different nucleotides: adenine, thymine, guanine, and cytosine. Adenine complements thymine and guanine complements cytosine. When complementary sequences match, where the immobilized target DNA and the mobile probe DNA, cDNA, or mRNA lock together, the process is known as hybridizing.

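The complement rules just described can be written as a small lookup; the sequences in the example are made up.

```python
# Base-pairing rules: adenine <-> thymine, guanine <-> cytosine.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq):
    return "".join(COMPLEMENT[base] for base in seq)

def hybridizes(probe, target):
    # a probe locks onto a target whose bases complement it position by position
    return len(probe) == len(target) and complement(probe) == target

print(complement("GATTACA"))             # CTAATGT
print(hybridizes("GATTACA", "CTAATGT"))  # True
```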

When the hybridization is complete, the microarray can be scanned or read. Lasers, microscopes, or cameras are used for the job. The fluorescent tags are excited by the laser, and the microscope and camera create the digital image of the microarray. This process is only the beginning of the analysis. Also see the appendix.

The next part of the analysis is analytical and statistical. The microarray data goes through image analysis to quantify gene expressions. "Some microarray experiments can contain up to 30,000 target spots." There are multiple steps in the image analysis. There is the semi-automatic grid construction, where the area in which spots are expected is defined. Next, either automatic or manual grid adjustments are made to ensure the grid spot is adjusted. Spot intensities are calculated through either an integral of non-saturated pixels or spot medians. The local background is subtracted from this intensity. After the intensity has been calculated, the data can be added to a database, Microsoft Excel®, or some other data management system. A common analysis of the data gathered from a microarray is clustering, with the data taken at different states. Other methods include the k-means and k-MST algorithms.
There are a number of tools that manage the information. One of the tools is TM4, which is a suite of software consisting of Microarray Data Manager (MADAM), TIGR_Spotfinder, Microarray Data Analysis System (MIDAS), and Multiexperiment Viewer (MeV), as well as a Minimal Information About a Microarray Experiment (MIAME)-compliant MySQL database. These tools are used for spotted two-color arrays, but there are others that will run other color formats.

The National Center for Biotechnology Information (NCBI) is currently working on ways to manage the data from microarrays. One of their projects is the Gene Expression Omnibus (GEO). GEO is NCBI's online resource for storage and retrieval of gene expressions. The gene expressions can come either from organisms or from artificial sources. Another project supported by NCBI is Microarray Markup Language (MAML). MAML is being developed by the Microarray Gene Expression Database group. Their primary goal is to adopt standards for microarray experiments. These standards can be for annotation, data representation, experimental controls, and data normalization methods. The ultimate goal for the NCBI is to have data in many formats, including a new version of MAML called MicroArray Gene Expression Markup Language.

Protein modeling

Protein modeling is the act of taking a protein sequence and imitating the shape it would naturally fold into. It is a subset of proteomics. The shape a protein takes when it folds specifies what it does. Proteins are remarkable because they start as a simple sequence of amino acids and then assemble themselves. Understanding protein folding is important in the discovery and treatment of diseases and other health issues. When a protein does not fold into its correct shape, there are numerous ramifications. Diseases are caused by incorrect folding, as well as other adverse effects. Some famous examples of diseases caused by misfolding are Alzheimer's, Mad Cow, and Parkinson's. Protein modeling will help scientists determine where problems occur in protein structures, and may give a method of treatment.

Protein modeling is a simulation built using experimentally determined protein structures. Experimental methods, such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, are more accurate ways of determining a protein's structure. Modeling gives researchers a starting point to confirm a structure by X-ray crystallography or NMR spectroscopy.

There are four steps to model a protein:

1. Related proteins with known three-dimensional structures need to be found.

2. An alignment is done between the related three-dimensional structures and the target sequence.

3. A model is constructed based on the alignment with the related structures.

4. Statistics and algorithms are run to determine if the model is acceptable.

Another aspect of protein folding is determining how different proteins interact with one another. Modeling can give biologists a way of locating where a problem is in a protein structure that causes disease.

The problem with protein modeling is that in nature, proteins assemble themselves in as little as a millionth of a second. The speed at which proteins naturally assemble themselves is difficult to simulate because they work on a timescale much faster than any processor available. In fact, it would take 30 CPU years to calculate one result where a protein folded in 10,000 nanoseconds.

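A quick back-of-the-envelope check of those figures; the per-nanosecond cost is derived here, not stated in the text.

```python
# 30 CPU-years for a 10,000-nanosecond folding event implies roughly one
# CPU-day of compute per simulated nanosecond.
cpu_years = 30
simulated_ns = 10_000

cpu_days = cpu_years * 365
days_per_ns = cpu_days / simulated_ns
print(days_per_ns)  # 1.095 CPU-days per simulated nanosecond
```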

The results of protein modeling will fit into the much larger section of bioinformatics, proteomics. The combination of searching, folding, active site determination, and protein interaction will give researchers much needed information.


Hardware

Special hardware is needed to perform some of the processing done in bioinformatics. Because of the sheer volume of information, new hardware architectures are needed to run the sorting algorithms previously stated. One of the newer architectures is the supercomputer on chip (SCoC). It uses a shared memory multiprocessor to do calculations. Due to the availability of parts to make an SCoC, they are desirable to use where normally specialized hardware would be required. Another method of advanced hardware is distributed processing. Distributed processing is a collection of loosely coupled processors connected by a communication network. A major advantage of distributed processing is computation speedup.


Every hardware architecture has its own advantages and disadvantages. Supercomputers on chip are limited by their ability to work concurrently with existing hardware, and are difficult to program. Distributed processing is limited by the amount of information that can be sent through a network connection and by concurrency problems. These are the examples of hardware that will be discussed in the upcoming sections.

Supercomputer on Chip

Overwhelming amounts of data, enormous jobs, and limited computer hardware are all reasons for new computer architectures. The supercomputer on chip (SCoC) is newer specialized hardware that has many benefits over single processors. A simple search that looks for an exact match between a query string and a string in a database is a very computationally demanding task. A good search would need to compensate for characters being mutated or deleted from a database. The need for more processing power is apparent as the searches and algorithms become more complex.
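The exact-match search just described can be sketched naively; the toy database is made up, and a real system would also tolerate mutated or deleted characters, which this sketch does not.

```python
# Naive exact-match search: scan every database string for an exact
# occurrence of the query. Cost grows with database size times string
# length, which is why more processing power is needed as searches grow.
def exact_match(query, database):
    return [name for name, seq in database.items() if query in seq]

database = {
    "seq1": "GATTACAGGTA",
    "seq2": "TTTTCCCCGGG",
    "seq3": "AAGATTACATT",
}
print(exact_match("GATTACA", database))  # ['seq1', 'seq3']
```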

A supercomputer on chip is the creation of a large integrated circuit with many fairly small processors. A system bus allows data to pass between processors and shared memory. For database items that are only read and written once, memory and cache can be low; for larger applications, more on-chip shared memory may be more efficient. With this simple architecture, SCoCs are cost effective, easy to implement, and extremely fast.

One SCoC design uses ARM9 processors with 8KB of instruction cache and 8KB of data cache. It uses an asynchronous first-in, first-out (FIFO) pipeline design. The asynchronous FIFO has three stages. First-stage timing is constrained by the local clock of the sending synchronous block. The second stage decouples the two end stages. And the last stage's timing is constrained by the local clock of the receiving synchronous block.


The results of an SCoC are staggering. In 2003, with this design using then-current 130nm technology, 227 processors each rated at 250MHz had a combined frequency of 57GHz. By 2006, using smaller technology, the combined frequency was 644GHz. It is estimated that by 2016, using 22nm technology with processors rated over 4000MHz, a combined 24,300GHz can be obtained. This design is not limited to ARM processors and can be used with a variety of processor/bus combinations depending on the application.

SCoC will be an effective way to have an enormously powerful architecture that can be custom made for an application. The design scales well for many implementations, and can only improve with newer process technology.

Distributed processing

Distributed processing is the use of loosely coupled processors interconnected by a communication network. The architecture can be small or large, as can the processors. The hardware may include small microprocessors, workstations, and minicomputers, among others. The power behind distributed processing is resource sharing.

Resources are limited on any machine regardless of its individual specifications. Resources may include memory, storage, processor speed, number of processors, and the time the resource is available. The shortcomings of individual computers are overcome by distributed processing. The collective resources of a distributed system are much more than those of any one individual system.

The extra resources add numerous benefits. First, there is computational speedup. This is due to the number of processors being allocated to a job. The more processors there are, the faster the computation. Another benefit is reliability. If an individual system fails, the remaining system can continue to operate. This is true as long as there is enough redundancy in both hardware and data. There are problems with reintegrating a site that goes down, though.

There are a number of mechanisms unique to distributed processing that make it very efficient. First there is computation migration. This approach transfers the computation rather than the data across a network. The result of processing can be much smaller than the data needed to do the computation. This speeds up the process and makes better use of the network resources. The next approach is process migration. This is an extension of computation migration where a process runs on a different site than where it was initiated. The reasons for this are numerous. First there is load balancing. Some sites using distributed processing may be overworked, while others remain more or less idle. Process migration distributes the processes to even out the workload. Computational speedup can also be achieved by splitting a process into subprocesses and having them run concurrently on different sites. The distributed system can also make better use of individual hardware through process migration. Some hardware may be specialized to handle certain types of data better than others. Another facet of process migration is software. Certain software may only exist on a few sites in a distributed system. Having the software run only on one machine can save money and time. Both approaches provide an architecture that individual computers cannot match.
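The speedup-by-splitting idea can be sketched with worker threads standing in for networked sites; this is an illustration of the concept, not a real distributed system.

```python
from concurrent.futures import ThreadPoolExecutor

# Split one job into chunks that run concurrently on different "sites"
# (here: worker threads on one machine), then combine the small partial
# results, mirroring computation migration's goal of moving work, not data.
def site_work(chunk):
    # each site computes a partial result over its share of the data
    return sum(x * x for x in chunk)

def distributed_sum_of_squares(data, sites=4):
    size = (len(data) + sites - 1) // sites
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=sites) as pool:
        partials = pool.map(site_work, chunks)  # chunks run concurrently
    return sum(partials)  # combine the partial results

data = list(range(1, 101))
print(distributed_sum_of_squares(data))  # 338350
```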

The distributed processing model is most often seen in the various applications on the World Wide Web. This can be seen in numerous places in bioinformatics. Databases have clients connect to them, and only the result is returned to the client. Other distributed systems only provide data migration from server to client.

It is easy to see that distributed processing has many advantages over other methods of processing. It is relatively easy to implement, is very robust, and provides a high level of service. An example of distributed processing that focuses primarily on protein folding, misfolding, data aggregation, and related diseases is Folding@home, a distributed system with clients running its software around the world.

Folding@home was a project that started at Stanford University, and it is a primary example of distributed processing: the sharing of resources over a network. Its client application simulates protein folding and misfolding. The information can be used to find causes of diseases, create new drugs, and solve other biological problems.


This distributed processing system has achieved numerous awards and goals since the beginning of the project. In 2007, Folding@home reached a power of 1 petaflop, which is one quadrillion floating point operations per second. Even if Folding@home had access to all the supercomputers in the world, fewer clock cycles would be achieved than with the distributed processing model.

A primer on protein folding is necessary to understand the need for a distributed system. Proteins are long chains of amino acids. They act as enzymes, the driving force behind all biochemical reactions. The sequence is not nearly as important as the shape a protein takes to carry out its function; the process by which that shape is reached is called folding. The remarkable thing about proteins is that they assemble themselves through folding before doing their work. A protein that misfolds, that does not assemble correctly, is known to have serious effects, including diseases.


As seen earlier, protein modeling is a very advanced function. The time it takes for a protein to fold can be as short as a millionth of a second, yet it would take 10,000 CPU days to simulate that folding on a single machine. Distributed algorithms used with the project spread the work across over 100,000 processors to match the microsecond barrier. This processing power has allowed the Folding@home project to find how proteins actually fold.
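A rough sanity check on those figures, assuming, unrealistically, perfect parallelism and no communication overhead, shows why the distributed approach changes the picture:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Back-of-the-envelope check: 10,000 CPU-days of work spread
# (ideally) across 100,000 processors.
my $cpu_days   = 10_000;
my $processors = 100_000;

my $days  = $cpu_days / $processors;   # 0.1 day per processor
my $hours = $days * 24;                # 2.4 hours

printf "%.1f hours of wall-clock time\n", $hours;
```

In practice the speedup is far from ideal, but even a fraction of it turns a 27-year single-CPU computation into something tractable.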

Distributed processing allows computers and biologists to solve problems that would be impossible by any other means. All the supercomputers in the world cannot compete with the processing power of the Folding@home project. On smaller scales, distributed processing is seen in web applications, like database interactions. Distributed processing has helped computers gain the mainstream appeal they have with the World Wide Web. No single computer in the world can compete with the potential power of a distributed processing system.



Programming languages play a major role in bioinformatics. Software runs databases; analyzes, stores, and retrieves data; provides modeling; and enables distributed systems to be established. Without programming languages, including the very low level ones, bioinformatics would have no way of making use of or managing its data.

Many languages lend themselves well to bioinformatics. Java is used in many applications to update databases after processing has been done. Other languages, like PHP and Perl, provide a relatively easy starting place for someone in biology to get into coding. The ease of learning allows someone without a computer science background to develop custom applications easily. This also helps lessen the gap between computer scientists and biologists.

There are numerous languages that provide support to bioinformatics. The goals of an application often determine the language that will be used. For performance-critical work, lower level programming languages like C or C++ can be used. Java is sometimes used for database interaction without human intervention. For direct human interaction with databases and results, scripting languages like Perl are very useful. Perl has both the power and flexibility to apply itself to many bioinformatics applications.

The strengths of the programming language Perl make it an ideal fit for bioinformatics. It does pattern matching, web posting, and database interaction. A set of modules, Bioperl, was written for managing all aspects of bioinformatics. It is a set of tools, under a very unrestrictive license, for biologists to develop programs as they are needed.
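The pattern-matching strength is easy to illustrate. The sketch below scans a DNA string for EcoRI restriction sites (the enzyme's real recognition sequence is GAATTC) and for a minimal open reading frame; the sequence itself is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One-line regexes like these are the pattern-matching strength
# Perl is known for in bioinformatics.
my $dna = "CCGAATTCATGAAATAGGGGAATTCTT";   # made-up test sequence

# All 0-based positions where the EcoRI site GAATTC occurs:
my @sites;
while ($dna =~ /GAATTC/g) {
    push @sites, pos($dna) - 6;    # pos() is just past the match
}
print "EcoRI sites at: @sites\n";

# A minimal ORF: ATG, some whole codons, then a stop codon.
if ($dna =~ /(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA))/) {
    print "ORF found: $1\n";
}
```

The same idiom, a regular expression plus the /g match position, scales from toy strings to whole-chromosome scans with no change in the code's shape.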


There is also a module for Perl called DataBase Independent. This module gives Perl a common interface to many different relational database systems without rewriting code. Perl is a good fit for its flexibility and ease of use in bioinformatics.

Modular programming is what makes Perl so adept at handling problems in bioinformatics. Modular programming is a way of organizing code into collections of interacting parts. Perl modules are the mechanism for defining object-oriented classes. Each module is a library file that uses package declarations to create its own namespace. The use of modules gives Perl the flexibility to create large programs with security.

Perl modules use package declarations to create their own namespace. Namespaces are tables containing the names of the variables and subroutines of a program. It is common for parts of a program to share variable names; when multiple variables have the same name, a namespace collision occurs. Modular programming prevents this by creating separate namespaces for each module. Other methods of preventing namespace collisions include pragmas such as use strict. These are less effective than modular programming, but should be used as good practice. The use of modules allows code reusability and a more logical way of programming.
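A minimal sketch of how package declarations avoid such collisions (the package and subroutine names here are made up): two packages each define a subroutine called count, and both can be called without conflict because each lives in its own namespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two modules (inlined here as package blocks) each define a
# subroutine named "count" -- no namespace collision occurs.
package SeqTools {
    sub count { my ($s) = @_; return length $s }          # residue count
}

package FileTools {
    sub count { my (@lines) = @_; return scalar @lines }  # line count
}

package main;

print SeqTools::count("MKVL"), "\n";          # length of a peptide string
print FileTools::count("a", "b", "c"), "\n";  # number of lines passed in
```

In a real project each package would live in its own .pm file and be pulled in with use, but the namespace mechanism is exactly the same.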

There are many archives of modules available with unrestrictive copyrights and licenses. The Comprehensive Perl Archive Network (CPAN) is a collection of Perl code prewritten for nearly any use. The reuse of code is a major benefit of modular programming, and also a timesaver. Many of the tasks programmers want to do are prewritten in modules and other code on CPAN. The top-level organization of the modules included is as follows:


Development support

Operating System Interfaces

Networking Devices IPC

Data Type Utilities

Database Interfaces

User Interfaces

Language Interfaces

File Names Systems Locking

Security and Encryption

World Wide Web HTML HTTP CGI

Server and Daemon Utilities

Archiving and Compression

Images Pixmaps Bitmaps

Mail and Usenet News

Control Flow Utilities

File Handle Input Output

Microsoft Windows Modules

Miscellaneous Modules

Commercial Software Interfaces

Not in Modulelist

Using the CPAN archive can make any project easier through the reuse of existing code.

As seen earlier, databases are a major part of bioinformatics. Most bioinformaticians need to know the basics of how to use them; some specialize in them. Perl's database modules are an extension to Perl that saves a lot of rewriting of code in case of a change in database management software.


There are two main modules written for Perl database interactions: DataBase Independent (DBI) and DataBase Dependent, or DataBase Driver (DBD). DBI handles the interactions from the program code, providing a common interface to many different relational database systems. DBD handles the communication with the database management system. Both DBI and DBD are CPAN modules.



Web programming has become the primary way of distributing programs to users. This is especially true for bioinformaticians, who need to collaborate, disseminate, and promote research. Standard installations of Perl come with the Perl CGI module for providing dynamic web content.


The Perl CGI module is primarily used for creating interactive web pages. Each time a CGI script is called from a remote location, the CGI program is run and the output is displayed on the requesting computer. This can create a different web page each time the script is run. The Common Gateway Interface is not limited to Perl scripts, but it was an early example of success on the web. CGI programming is simply another example of how Perl is an amazing tool for bioinformatics.
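At its core, a CGI program is just a program that writes an HTTP header and then a document body to standard output; the web server relays that output to the browser. A minimal dynamic page in plain core Perl (no modules assumed; the page content is invented for the example) might look like:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The header line plus a blank line is all the CGI protocol requires
# before the body.  Because the body is generated at request time,
# each run can produce a different page -- here, the current time.
print "Content-type: text/html\n\n";

my $now = localtime();
print "<html><body>\n";
print "<h1>Sequence server status</h1>\n";   # hypothetical page title
print "<p>Generated at: $now</p>\n";
print "</body></html>\n";
```

Real deployments would typically use the CGI module (or a modern web framework) to handle form parameters and escaping, but the request/response shape is exactly this.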

Another tool for Perl is a group of modules called Bioperl. Bioperl contains a collection of over 500 Perl modules. The purpose of the Bioperl project was to give researchers a starting place for the programs they need to develop. When the project initially launched there was little documentation; as the project gained popularity and users, this improved. Bioperl officially launched in 1995 with the release of Perl 5, the version that gave Perl support for object-oriented programming. With the professional work done on the project, Bioperl became a piece of software widely accepted and used in the bioinformatics community.

There are numerous types of programs included in the Bioperl project. Among these are sequence, database, alignment, genomics, and proteomics programs. Given the enormity of the project, Bioperl is a mainstream project for many bioinformaticians.

Sequence programs allow researchers to input, output, run tests on, and manipulate gene and proteomic sequences. Some example modules are SeqIO, SeqStats, LiveSeq, and LargeSeq. SeqIO handles sequence file inputs and outputs. SeqStats does statistical analyses on sequences. LiveSeq handles sequence data, but unlike the others it addresses the problem of features whose locations change over time. The LargeSeq module provides a combination of the previous modules for very large sequences: a sequence can be over 100MB yet run on a system with less than half that amount of real memory.


Database programs give access to biological databases. There are modules for most major databases, such as SwissProt, GenBank, BLAST, and others. The GenBank module provides access to the GenBank database. RemoteBlast runs a BLAST search remotely, while BPlite parses reports from BLAST.


Alignment programs handle alignments of sequences like the pair-wise alignment discussed previously. There are many types of alignment algorithms included in the package. Some of these are UnivAln, LocatableSeq, pSW, BPbl2seq, AlignIO, Clustalw, and SeqDiff. UnivAln is a module that manipulates and displays multiple sequence alignments. The LocatableSeq module creates an object that contains a start and end point for locating sequences relative to other sequences or alignments. pSW uses the Smith-Waterman algorithm to align two sequences. BPbl2seq is a parser for reports from a pair-wise BLAST alignment search. AlignIO handles input and output of alignment files, and Clustalw interfaces to the Clustalw multiple sequence alignment package. The SeqDiff module handles sets of mutations and variants of sequences.


The genomics and proteomics section contains a vast number of modules to search, write, and run statistics on sequences. Example modules include RestrictionEnzyme, Sigcleave, Oddcode, SeqPattern, Split, Fuzzy, Genscan, Results, Exon, ESTScan, and MZEF. RestrictionEnzyme locates the sites on a sequence where restriction enzymes cut. Sigcleave finds amino acid cleavage sites. Oddcode rewrites amino acid sequences in an abbreviated code for statistical purposes; the rewritten code uses a hydrophilic/hydrophobic classification. Using SeqPattern, one can write regular expression descriptions of sequence patterns. Split is a useful module that gives location information for a sequence; the location may be multiple ranges, and possibly multiple sequences. The module Fuzzy gives location information when exact positions are uncertain. Genscan, Results, Exon, ESTScan, and MZEF are all examples of interfaces to gene finding programs.


Perl contains a massive set of modules for use in bioinformatics, with more that can be added. DataBase Dependent and DataBase Independent add connections and interactions with databases. Bioperl is a preassembled collection of bioinformatics algorithms, database interactions, and standalone programs. These modules are unique to bioinformatics, and create a foundation for any developer wanting to help the bioinformatics community. Unlike many other programming languages, Perl has many resources that are freely available for developers to use, and the reuse of existing code saves countless hours of development. Perl is not alone in its use in bioinformatics, yet it has the many advantages of its free source code, software, and other resources.



Bioinformatics is an ever expanding field, and there are many challenges facing it. Bioinformaticians are limited in their ability to gather, represent, and analyze data. New software and hardware have to be developed to keep up with an increasing amount of data.

There are numerous software titles specialized for bioinformatics, and even more algorithms. Entrez is a data retrieval and search tool. BLAST is NCBI's multiple database search interface. There are algorithms like clustering that group objects together based on similarities; however, the reason the objects are grouped has to be inferred by the researcher. The approximation algorithm for k-MST and the algorithm for disease specific genomic analysis were made to improve upon previous clustering methods.
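As a toy illustration of similarity-based grouping (not the k-MST or disease-specific algorithms themselves), the sketch below clusters one-dimensional values with a simple single-linkage threshold; the data and threshold are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group 1-D "expression levels" so that neighboring points closer
# than a threshold end up in the same cluster (single-linkage
# agglomeration on sorted data).
my @levels    = (0.1, 0.15, 0.9, 1.0, 5.2, 5.3, 5.25);
my $threshold = 0.5;

my @sorted   = sort { $a <=> $b } @levels;
my @clusters = ([ shift @sorted ]);
for my $x (@sorted) {
    if ($x - $clusters[-1][-1] < $threshold) {
        push @{ $clusters[-1] }, $x;      # close enough: same cluster
    } else {
        push @clusters, [$x];             # start a new cluster
    }
}

print "cluster: @$_\n" for @clusters;
```

The algorithm happily produces the three groups, but, as the text notes, deciding what those groups mean biologically is still up to the researcher.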

Databases are the backbone of bioinformatics. Without the ability to effectively and efficiently store information, bioinformatics would have no way of distributing its information. There are numerous databases that contain all different varieties of information pertinent to bioinformatics. The distribution of information has allowed researchers to work together on problems that would have been impossible before.

Whenever a new technique for gathering biological data is developed, there is a need for computers to help facilitate the data. It can be seen with microarrays that there is room for creating standards, new approaches to analysis, and unique languages. The need for supercomputers on chip comes about when very specialized problems arise and a huge amount of processing power is needed. Folding@home had a problem that could only be solved with worldwide participation. Without distributed processing, there would be no way Folding@home could simulate protein folding at a truly representative level.


There are countless examples of new needs coming from new developments that are not specific to bioinformatics.

New software can improve performance as much as, if not more than, hardware. Improvements to algorithms allow software to take advantage of added processing power by fixing bottlenecks in systems. There is always room for improvement in software.

Different hardware implementations have advantages and disadvantages. Distributed processing on a massive scale requires participation from a wide audience, and there may not be enough people to support a project like Folding@home. However, the resources added through distributed processing greatly increase the efficiency of work, and physical boundaries do not limit such a system. Supercomputers on chip are a very specialized form of hardware. They require very low level knowledge of a computer to implement on a current architecture. Using a group of tightly coupled processors creates enormous gains in processing time, which allows researchers to be more productive.

The Perl programming language is extremely beneficial to bioinformaticians. There are numerous resources available for use in bioinformatics. Some of these are specialized, like the modules in Bioperl. Others are built-in features of the language, or modules that can be added under an unrestrictive license.

Continuous effort is needed to expand the field of bioinformatics. New software has to be developed, and it needs to take advantage of existing algorithms as well as improve them. Hardware has to be developed to meet the needs of bioinformatics; the techniques used are always pushing current hardware to its limits. Finally, programming languages need to be developed and matured into something useful for both computer scientists and biologists. If there is a lack of communication between the two, new implementations cannot be developed. As software, hardware, and programming languages advance, so do the results of bioinformatics.

The evolution of bioinformatics has limitless possibilities for our future. New hardware, software, and algorithms will allow researchers to identify, diagnose, and potentially treat diseases in humans as well as other living organisms. New drugs can be engineered and produced. Biology and computer science are the perfect marriage of seemingly unrelated fields.




Sequence alignment example

Unaligned sequences:

Gates likes cheese
Grated cheese

Aligned sequences (dashes mark gaps, bars mark matching positions):

G-ates likes cheese
| |||       |||||||
Grated------ cheese
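A toy alignment like this can be produced by the classic Needleman-Wunsch dynamic-programming algorithm for global pair-wise alignment. The sketch below uses simple illustrative scores (match +1, mismatch -1, gap -1) on two short made-up DNA strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal global (Needleman-Wunsch) pair-wise alignment:
# fill a score matrix, then trace back inserting '-' for gaps.
sub align {
    my ($a, $b) = @_;
    my ($m, $n) = (length $a, length $b);
    my ($MATCH, $MISS, $GAP) = (1, -1, -1);

    my @s;                                 # score matrix
    $s[$_][0] = $_ * $GAP for 0 .. $m;
    $s[0][$_] = $_ * $GAP for 0 .. $n;
    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $d = $s[$i-1][$j-1]
                  + (substr($a, $i-1, 1) eq substr($b, $j-1, 1)
                     ? $MATCH : $MISS);
            my $u = $s[$i-1][$j] + $GAP;
            my $l = $s[$i][$j-1] + $GAP;
            $s[$i][$j] = ($d >= $u && $d >= $l) ? $d
                       : ($u >= $l)             ? $u : $l;
        }
    }

    my ($x, $y, $i, $j) = ('', '', $m, $n);    # trace back
    while ($i > 0 || $j > 0) {
        if ($i > 0 && $j > 0
            && $s[$i][$j] == $s[$i-1][$j-1]
               + (substr($a,$i-1,1) eq substr($b,$j-1,1) ? $MATCH : $MISS)) {
            $x = substr($a, --$i, 1) . $x;
            $y = substr($b, --$j, 1) . $y;
        } elsif ($i > 0 && $s[$i][$j] == $s[$i-1][$j] + $GAP) {
            $x = substr($a, --$i, 1) . $x;
            $y = '-' . $y;
        } else {
            $x = '-' . $x;
            $y = substr($b, --$j, 1) . $y;
        }
    }
    return ($x, $y, $s[$m][$n]);
}

my ($top, $bot, $score) = align("GAATTC", "GATC");
print "$top\n$bot\nscore: $score\n";
```

Production tools use the same dynamic-programming core, just with biologically derived substitution matrices and affine gap penalties instead of these toy scores.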





Notes on algorithm

1. Step 2 (guessing the bound) is used to guarantee the approximation ratio.

2. In step 5 (subset selection of clusters), the goal could be changed to finding the optimal subset of clusters having at least k points.

3. Step 7 extracts the best k-MST from the resultant tree. If one extracts a simple subtree instead, the approximation ratio is still preserved and the time complexity of the step drops from quadratic to linear, which could be useful for large inputs.

Notes on running time

Step 1 computes a minimum spanning tree (faster if using a Fibonacci heap). Step 2 repeats a logarithmic number of times, and steps 3 through 6 each contribute lower-order terms. Step 7 takes quadratic time (or less if extracting a simple subtree), so step 7 and the logarithmic repetition dominate the total time complexity.

Example of running time


Microarray image and data



What is bioinformatics?

What is clustering?

What are some improved
clustering algorithms?

What is an SCoC?

How does distributed processing work?

What makes Perl a good choice for use in bioinformatics?



1. Bioinformatics. [Online] March 29, 2004. [Cited: January 30, 2008.]

2. Tisdall, James D. Mastering Perl for Bioinformatics. s.l.: O'Reilly & Associates, Inc., 2003.

3. Nair, Achuthsankar S. Computational Biology & Bioinformatics: A Gentle Overview. Communications of the Computer Society of India.

4. Saeed, A.I., et al. TM4: A Free, Open-Source System for Microarray Data Management and Analysis. BioTechniques, 2003, p. 374.

5. Folding@home distributed computing. [Online]

6. Module 1: General Biocomputing. [Online] NMSU, 2001. [Cited: February 15, 2008.]

7. Entrez: Making use of its power. Briefings in Bioinformatics. s.l.: Henry Stewart Publications, June 2003.

8. [Online] March 29, 2004. [Cited: January 27, 2008.]

9. Cluster Analysis. s.l.: Statsoft, 2008.

10. Disease Specific Genomic Analysis: Identifying the Signature of Pathologic Biology. Monica, Tibshirani, Robert, Borresen-Dale, Anne-Lise and Jeffrey, Stefanie. 2007, Bioinformatics Advance Access, p. 2.

11. Saeed, A.I., et al. TM4: A Free, Open-Source System for Microarray Data Management and Analysis. 34, s.l.: BioTechniques, 2003.

12. Smith, Scott F. and Frenzel, James F. Bioinformatics Application of a Scalable Supercomputer-on-chip Architecture. 2003, Parallel and Distributed Processing Techniques and Applications, p. 1.

13. Silberschatz, Galvin and Gagne. Operating System Concepts with Java. s.l.: John Wiley & Sons.

14. He, Huahai and Singh, Ambuj K. Efficient Algorithms for Mining Significant Substructures in Graphs with Quality Guarantees. pp. 1, 10.

15. Camoglu, Orhan, Kahveci, Tamer and Singh, Ambuj K. Towards Index-based Similarity Search for Protein Structure Databases. 2003, IEEE Computer Science Bioinformatics Conference.

16. What is GenBank? [Online] January 9, 2008. [Cited: January 30, 2008.]

17. PubMed Central. [Online] April 16, 2007. [Cited: January 28, 2008.]

18. What is the Human Genome Project? [Online] December 07, 2005. [Cited: January 24, 2008.]

19. Stanford MicroArray Database. [Online] 2008. [Cited: January 29, 2008.] http://genome

20. The Bioinformatics, Distributed Systems, and Databases Lab (DBL). [Online] 2008. [Cited: ]