Bioinformatics


Jordan Martinez







Contents

1. Introduction
2. Background
   2.1 Economic Growth
   2.2 Advances
3. Research
   3.1 Software
      3.1.1 Algorithms: Database-mining
      3.1.2 Microarray
      3.1.3 Protein modeling
   3.2 Hardware
      3.2.1 Supercomputer on Chip
      3.2.2 Distributed processing
   3.3 Perl
4. Conclusion
Appendix
References






1. Introduction

Major advances in molecular biology and genomic research have led to the growth of biological information. The new research has forced biologists to turn to new technologies: bioinformatics and computational biology (1). “Bioinformatics is one of the most rapidly growing areas of biological science.” (2 p. xi) It may not be a researcher’s main area of interest, but the information from it has become an integral tool. Bioinformatics is significant because it improves research in genetic engineering, diseases, evolutionary biology, genomics, and proteomics.

Although a major part of bioinformatics is bio-databases and database mining, this is not the extent of it. There is also genomics and proteomics, covering studies of the genome and proteome. Genomics is the study of DNA sequences and includes gene finding, pair-wise sequence alignment, multiple sequence alignment, and transcription factor binding site identification. Proteomics has many of the same functions as genomics, but also includes protein folding, active site determination, protein interaction, and sub-cellular localization. There is also computer-aided drug design, molecular phylogenetics, and systems biology (3).

A new addition to bioinformatics is the focus on the sub-cellular and molecular levels of biology (3). Microarrays are the best tools for studying gene expression on a genomic scale (4). Another area is protein folding, which simulates the way a protein folds into its characteristic three-dimensional structure. A protein that does not fold into its intended shape becomes an inactive protein with different properties (5). Sequence alignment is done on DNA, RNA, and protein sequences. It compares two bio-sequences, adding spaces where needed to find similarities (3). Also see the appendix.

Bioinformatics is a broad field. It encompasses mathematics, computer science, biology, microbiology, and many others. The language barrier between these disciplines is often a problem in bioinformatics. Few schools offer bioinformatics as a course of graduate study; even fewer offer undergraduate studies. For a person to be proficient with both computers and biology, two or three courses of study need to be attempted.

2. Background

In the past, biology and computers were two completely separate fields. The emergence of one had little to do with the success or failure of the other. As the techniques in biology advanced, the need for processing and storing new data became inevitable. There are archives of data that have been made electronic, databases full of protein, DNA, and RNA sequences, and new computer-generated results that require a different way of handling data than has been used in the past (6).

2.1 Economic Growth

A new industry is growing around bioinformatics, estimated at $1.82 billion in 2007. Computers have been used in the field of biology for a long time, but mostly in the bio-medical field (3). Bridging the gap between knowledge of biology and knowledge of computers is a major impediment. Bioinformatics courses for biologists and computer scientists can help solve this (6).

As the industry grows, so will the need for developers and researchers. The field is ever-expanding and could provide a lucrative position for an ambitious individual. As new techniques of gathering data for bioinformatics grow, so do the tools for analysis and standardization. As long as there are advances in the field, there will be new positions, which means there will always be economic growth in this field.

2.2 Advances

Even with the emergence of the industry, management information systems still lag behind the amount of information being generated. There are impediments on both the hardware and software sides of biocomputing (6). Both need to increase their pace for projects like the Human Genome Project and other genome sequencing projects. Algorithms also need to be developed to make the process more efficient. Although hardware advances make processing information much faster, algorithms have to be modified or replaced to make proper use of the added power.

There have been major advances in hardware, both specialized and commodity. Some of the hardware resides in a single physical location, while other systems use distributed computing to generate results from computers around the world (5).


A major problem with new hardware is the need for software that takes advantage of the new processing power. It is not sufficient to have faster hardware if the software is not optimized to run on it.

3. Research

Some of the key components of bioinformatics are algorithms, hardware, software, and programming languages. Without these tools bioinformatics could never have evolved into the massive entity it is today. Algorithms are developed to make better use of database mining so information can be classified more easily. Hardware has developed faster than software in bioinformatics. Programming languages provide the tools required for the analysis of information. Software, hardware, and programming languages all play equal roles in the development and progression of bioinformatics.



3.1 Software

Software is written to help researchers make sense of the information they gather. Much of the software used is for database mining: the searching and matching of gene or protein sequences in a database. Databases allow researchers easy access to information and provide methods for extracting only the information needed to answer a specific question (1). Gathering the information is not enough; it also has to be analyzed analytically and statistically. These analyses give biologists the information needed to draw conclusions from the data collected.

One example of the software used is Entrez, a data retrieval system developed by the National Center for Biotechnology Information. Entrez provides integrated access to data including literature, nucleotide and protein sequences, complete genomes, and three-dimensional structures, among others (7). Other examples of software simulate protein folding, analyze trees or other complex data structures, and interact with databases.

3.1.1 Algorithms: Database-mining

Database-mining is the process by which hypotheses are generated regarding the function or structure of a gene or protein by identifying similar sequences in other examples (8). Several examples of database-mining will be discussed: clustering, disease-specific genomic analysis, minimum spanning tree over k vertices (k-MST), and index-based similarity search. The k-MST approximation uses a highly refined clustering algorithm to approximate relationships in trees (9).

Clustering is a method of classifying objects of a similar kind into respective categories. It puts objects into groups based on similarities and traits (9). This method of categorizing information allows scientists to find similarities in gene and protein sequences so that diseases can be grouped together for treatment and diagnosis. Other methods, like disease-specific genomic analysis or k-MST, are improvements on a clustering algorithm to find specific information more efficiently (9).

The algorithms used give biologists the information needed to analyze the data and draw conclusions. Without the implementation of algorithms on computer systems, biologists would have limited means of analyzing gathered data.

3.1.1.1 Clustering

Clustering, also referred to as cluster analysis, is an array of algorithms for grouping objects of a similar kind into respective categories. This categorization, in theory, sorts different objects into groups such that the degree of association is high if they belong to the same group and low otherwise. The technique does not, however, provide any explanation, meaning, or interpretation of the categorization; it simply states that there is an association (10).

There are many applications for clustering. It is used in medicine to cluster diseases, cures for diseases, and symptoms of diseases. It is also used in other fields, such as psychiatry to diagnose clusters of symptoms, or archaeology to find clusters of tools and other artifacts (10).

Many types of cluster analysis exist. There are the joining method, two-way joining analysis, k-Means clustering, and expectation maximization clustering. Each follows different logic to link its clusters, and each has a varying level of efficiency.

The first is the joining, or tree clustering, method. It joins objects into larger clusters by some measure of similarity or distance, which can be the Euclidean distance, the squared Euclidean distance, the city-block distance, or others. After the distances have been calculated, the clusters are amalgamated. The process is then repeated until all the clusters have been linked to form a hierarchical tree (10).
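These distance measures are simple to compute. The sketch below is an illustrative Perl example only; the subroutine names and toy profiles are invented, not taken from the cited sources.

#!/usr/bin/perl
use strict;
use warnings;

# Squared Euclidean distance between two expression profiles (array refs).
sub squared_euclidean {
    my ($a, $b) = @_;
    my $sum = 0;
    $sum += ($a->[$_] - $b->[$_]) ** 2 for 0 .. $#{$a};
    return $sum;
}

# Euclidean distance is the square root of the squared distance.
sub euclidean { return sqrt(squared_euclidean(@_)); }

# City-block (Manhattan) distance sums the absolute coordinate differences.
sub city_block {
    my ($a, $b) = @_;
    my $sum = 0;
    $sum += abs($a->[$_] - $b->[$_]) for 0 .. $#{$a};
    return $sum;
}

my @gene1 = (2.0, 4.5, 1.0);
my @gene2 = (1.0, 4.0, 3.0);
printf "euclidean=%.3f  squared=%.3f  city-block=%.3f\n",
    euclidean(\@gene1, \@gene2),
    squared_euclidean(\@gene1, \@gene2),
    city_block(\@gene1, \@gene2);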

There are numerous methods for linking the clusters in the joining method. Some of the common ones are single linkage, complete linkage, unweighted pair-group average, weighted pair-group average, and Ward’s method. Single linkage connects a cluster to its nearest neighbor. Complete linkage connects a cluster to its farthest neighbor. The unweighted pair-group average method calculates the distance between two clusters as the average distance between all pairs of objects in the two clusters. Weighted pair-group average works the same way as the unweighted version but uses the sizes of the clusters as weights; it is a better method when the cluster sizes are uneven. Finally, Ward’s method can be used to amalgamate clusters. It minimizes the sum of squares of any two hypothetical clusters to do the linking. It is very efficient, but tends to produce clusters of small size (10). A sketch of the single, complete, and average linkage rules is shown below.
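The following sketch illustrates the single, complete, and unweighted average linkage rules. It is an invented example, not code from the cited sources.

#!/usr/bin/perl
use strict;
use warnings;

# Euclidean distance between two numeric profiles (array refs).
sub euclidean {
    my ($a, $b) = @_;
    my $sum = 0;
    $sum += ($a->[$_] - $b->[$_]) ** 2 for 0 .. $#{$a};
    return sqrt($sum);
}

# Distance between two clusters (array refs of profiles) under three linkage rules.
sub cluster_distance {
    my ($cluster_a, $cluster_b, $linkage) = @_;
    my @pairwise;
    for my $x (@$cluster_a) {
        push @pairwise, euclidean($x, $_) for @$cluster_b;
    }
    my @sorted = sort { $a <=> $b } @pairwise;
    return $sorted[0]  if $linkage eq 'single';    # nearest neighbor
    return $sorted[-1] if $linkage eq 'complete';  # farthest neighbor
    my $sum = 0;                                   # unweighted pair-group average
    $sum += $_ for @pairwise;
    return $sum / scalar @pairwise;
}

my @cluster1 = ([1.0, 2.0], [1.2, 1.8]);
my @cluster2 = ([4.0, 4.5], [3.8, 5.0]);
printf "%-8s %.3f\n", $_, cluster_distance(\@cluster1, \@cluster2, $_)
    for qw(single complete average);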

The second clustering method is two-way joining analysis. This method is identical in all ways to the joining method except that, instead of a single variable, two variables are linked to form clusters. Data rarely needs to be linked together in this way, and doing so can produce conflicting results. Because of the problems with the resulting clusters, it is the least used method (10).

Another method of clustering is k-Means. This method takes a hypothesis for the number of clusters, k, from the researcher; the best k is either known a priori or computed from the data. The algorithm then calculates exactly k clusters of greatest distinction. There are two major steps in the computation. It starts with k random clusters, then tries to first minimize variability within clusters and then maximize variability between clusters (10).
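A minimal one-dimensional k-Means sketch in Perl is shown below. It is an illustration only: the data, the number of clusters, and the simple initialization are invented, and real microarray profiles would be multidimensional and use a distance function such as the ones above.

#!/usr/bin/perl
use strict;
use warnings;

my @values = (1.1, 1.3, 0.9, 5.0, 5.2, 4.8, 9.7, 10.1);   # toy expression values
my $k = 3;
my @centers = ($values[0], $values[3], $values[6]);        # spread-out initial guesses

for my $iteration (1 .. 20) {
    # Assignment step: give each value to its nearest center.
    my @members = map { [] } 1 .. $k;
    for my $v (@values) {
        my ($best, $best_dist) = (0, abs($v - $centers[0]));
        for my $c (1 .. $k - 1) {
            my $d = abs($v - $centers[$c]);
            ($best, $best_dist) = ($c, $d) if $d < $best_dist;
        }
        push @{ $members[$best] }, $v;
    }
    # Update step: move each center to the mean of its members.
    for my $c (0 .. $k - 1) {
        next unless @{ $members[$c] };
        my $sum = 0;
        $sum += $_ for @{ $members[$c] };
        $centers[$c] = $sum / scalar @{ $members[$c] };
    }
}
print "centers: @centers\n";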

Expectation maximization (EM) clustering is similar to the k-Means algorithm. It differs from other clustering methods by using statistical methods to link clusters. It can be used with both continuous and categorical variables; other algorithms have to be modified to work with categorical variables. To make this accommodation, different weights are added to each category for each cluster. At each iteration, the probabilities are improved to maximize the likelihood of the data given the specified number of clusters (10).

The EM algorithm works with samples of distribution clusters, each of which can have a different mean and standard deviation. It works with the resulting distributions rather than with individuals. The EM algorithm approximates the observed distributions as mixtures of different distributions in different clusters. The distributions can be normal, log-normal, or Poisson, and the clusters can be a combination of any of them. It computes classification probabilities rather than hard assignments of observations to clusters, so each observation belongs to its corresponding cluster with a certain probability attached. The final result is viewed as an assignment of observations to clusters by the largest classification probability (10).
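As an illustration only, the sketch below runs EM for a hypothetical mixture of two one-dimensional normal distributions; all data and variable names are invented, and real implementations add safeguards against degenerate clusters.

#!/usr/bin/perl
use strict;
use warnings;

# Gaussian density used for the classification probabilities.
sub gauss {
    my ($x, $mean, $sd) = @_;
    return exp(-(($x - $mean) ** 2) / (2 * $sd ** 2)) / ($sd * sqrt(2 * 3.14159265358979));
}

my @x = (1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1);          # toy one-dimensional data
my ($w1, $m1, $s1) = (0.5, 0.0, 1.0);                      # initial guesses, component 1
my ($w2, $m2, $s2) = (0.5, 6.0, 1.0);                      # initial guesses, component 2

for my $iter (1 .. 50) {
    my (@r1, @r2);
    # E-step: classification probability of each point under each component.
    for my $v (@x) {
        my $p1 = $w1 * gauss($v, $m1, $s1);
        my $p2 = $w2 * gauss($v, $m2, $s2);
        push @r1, $p1 / ($p1 + $p2);
        push @r2, $p2 / ($p1 + $p2);
    }
    # M-step: re-estimate weights, means, and standard deviations.
    my ($n1, $n2) = (0, 0);
    $n1 += $_ for @r1;
    $n2 += $_ for @r2;
    ($w1, $w2) = ($n1 / @x, $n2 / @x);
    my ($sum1, $sum2) = (0, 0);
    $sum1 += $r1[$_] * $x[$_] for 0 .. $#x;
    $sum2 += $r2[$_] * $x[$_] for 0 .. $#x;
    ($m1, $m2) = ($sum1 / $n1, $sum2 / $n2);
    my ($v1, $v2) = (0, 0);
    $v1 += $r1[$_] * ($x[$_] - $m1) ** 2 for 0 .. $#x;
    $v2 += $r2[$_] * ($x[$_] - $m2) ** 2 for 0 .. $#x;
    ($s1, $s2) = (sqrt($v1 / $n1), sqrt($v2 / $n2));
}
printf "component 1: weight=%.2f mean=%.2f sd=%.2f\n", $w1, $m1, $s1;
printf "component 2: weight=%.2f mean=%.2f sd=%.2f\n", $w2, $m2, $s2;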

There are numerous methods for classifying data using clustering algorithms. Each works in a unique way and should be used in the correct context, depending on the problem.

3.1.1.2 Disease-Specific Genomic Analysis

A problem with gathering too much information is making sense of it. Too much information makes an experiment difficult to manage: the data is hard to analyze, categorize, and draw conclusions from. Genomic research has to deal with large amounts of information, but must still make sense of it so biologists can draw conclusions.

Disease-specific genomic analysis (DSGA) is a specialized case of clustering and class prediction. It measures how a diseased tissue deviates from a normal phenotype, and the deviating information is then isolated. The DSGA method has, in some experiments, outperformed standard clustering methods. There are many uses for disease-specific genomic analysis: it can be applied to microarrays, DNA, RNA, and any other high-dimensional genomic or proteomic data (11).


The method of disea
se specific
decomposition uses microarray data to analyze. The
method described is based on a decomposing
expression in diseased tissue. The decomposition is
defined by computing a linear model of disease tissue
expression onto normal expression data.

The data decomposition uses two sets of data: the diseased tissue microarray data and the normal tissue microarray data. The number of samples in each set does not have to be equal for the analysis to be completed. The normal set is used to fit the diseased tissue into a linear model. The decomposition is then Ti = Nc.Ti + Dc.Ti, where:

- Nc.Ti is the fit to the linear model (the normal component)
- Dc.Ti is the vector of residuals from the linear model (the disease component)

A simplified sketch of this decomposition follows.
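The sketch below is a deliberately simplified, hypothetical illustration of the idea: it fits the diseased profile to a single normal reference profile by ordinary least squares, rather than projecting onto a full normal expression space as described in (11). Data and names are invented.

#!/usr/bin/perl
use strict;
use warnings;

# Simplified illustration: the fitted values play the role of the normal
# component Nc.T and the residuals play the role of the disease component Dc.T.
my @normal   = (2.0, 3.0, 1.5, 4.0, 2.5);   # hypothetical normal expression
my @diseased = (2.2, 3.1, 1.4, 7.9, 2.6);   # hypothetical diseased expression

my $n = scalar @normal;
my ($sum_n, $sum_d) = (0, 0);
$sum_n += $_ for @normal;
$sum_d += $_ for @diseased;
my ($mean_n, $mean_d) = ($sum_n / $n, $sum_d / $n);

# Ordinary least squares fit of diseased expression onto normal expression.
my ($sxy, $sxx) = (0, 0);
for my $i (0 .. $n - 1) {
    $sxy += ($normal[$i] - $mean_n) * ($diseased[$i] - $mean_d);
    $sxx += ($normal[$i] - $mean_n) ** 2;
}
my $slope     = $sxy / $sxx;
my $intercept = $mean_d - $slope * $mean_n;

for my $i (0 .. $n - 1) {
    my $normal_component  = $intercept + $slope * $normal[$i];   # Nc.T_i
    my $disease_component = $diseased[$i] - $normal_component;   # Dc.T_i (residual)
    printf "gene %d: Nc=%.2f Dc=%+.2f\n", $i, $normal_component, $disease_component;
}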

The next step in DSGA is estimation of the normal expression space N. This is the process of reducing the dimension of the normal expression data. A modified principal component analysis (PCA) was used, although it had limitations because it needed more normal tissue samples. Even so, DSGA performed better than traditional methods of estimation (11).

DSGA addresses a certain series of biological characteristics (11):

1. Disease is a deviation of expression from a normal/healthy state.
2. The model N for the normal state includes biological diversity. The set of normal data can come from any number of sources, and there is natural fluctuation in conditions.
3. For testing, patients were not required to provide a normal tissue sample. This would be impossible where all or most of an organ was altered in some way by disease.
4. Each diseased sample is analyzed without reference to any other diseased tissues.

DSGA was compared to other methods of analysis on a gastric cancer dataset and a breast cancer dataset. In the gastric cancer dataset, DSGA performed better for both tumor site and tumor type distinctions. The results were similar in the breast cancer dataset, where DSGA outperformed both log-ratio data and zero-transformed data analyses (11).

The case for using disease-specific genomic analysis is compelling. With minimal alteration to current formulas, DSGA outperformed every other analysis method it was tested against.

3.1.1.3 Minimum spanning tree over k vertices (k-MST)

Another method of data analysis used in bioinformatics is graph mining. This technique puts information into trees and then uses specialized clustering algorithms to gather data and form hypotheses based on the relationships (11). This is a very specialized set of algorithms that needs to run efficiently and reliably. These algorithms make use of techniques like microarray analysis.

One algorithm used for this task is the approximation algorithm for the k-MST. k-MST is an algorithm that performs hierarchical clustering, finds the best subset of clusters having k vertices, connects them to the root vertex, and extracts the optimal k-MST from the resultant tree (11). The algorithm shows good running time, quality, and statistical guarantees.

The approximation algorithm for k-MST works in seven steps (11); also see the appendix. A minimal sketch of step 1 (shortest paths to the root) follows the list.

1. First, the shortest distance from each vertex of the graph to the root vertex is computed using Dijkstra’s algorithm.
2. Second, the optimal cost OPT is guessed. Steps 3-7 are repeated to find the best k-MST.
3. Next, Kruskal’s algorithm is used for single linkage. At each iteration, the next two minimum edges are added to form a super cluster. This is repeated until all clusters are connected.
4. The distance from each cluster to the root is calculated using recursion. The cost is the edge weights plus the distance to the root vertex.
5. The optimal subset of disjoint clusters C that has exactly k vertices is selected from the hierarchical clusters.
6. Each cluster is connected to the root vertex using the shortest path.
7. Finally, the optimal k-MST is extracted from the resultant tree.
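Step 1 is ordinary single-source shortest paths. The sketch below is an illustration only, using a hypothetical toy graph and a simple linear scan in place of a priority queue.

#!/usr/bin/perl
use strict;
use warnings;

# Minimal Dijkstra sketch for step 1: shortest distance from every vertex to
# the root. The graph is an invented undirected weighted adjacency hash.
my %graph = (
    root => { a => 2, b => 5 },
    a    => { root => 2, b => 1, c => 4 },
    b    => { root => 5, a => 1, c => 1 },
    c    => { a => 4, b => 1 },
);

my %dist = map { $_ => 9**9**9 } keys %graph;   # start every distance at "infinity"
$dist{root} = 0;
my %done;

while (keys %done < keys %graph) {
    # Pick the unfinished vertex with the smallest tentative distance.
    my ($u) = sort { $dist{$a} <=> $dist{$b} }
              grep { !$done{$_} } keys %graph;
    $done{$u} = 1;
    # Relax every edge leaving $u.
    for my $v (keys %{ $graph{$u} }) {
        my $candidate = $dist{$u} + $graph{$u}{$v};
        $dist{$v} = $candidate if $candidate < $dist{$v};
    }
}
printf "%-4s %s\n", $_, $dist{$_} for sort keys %dist;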

The total time for this algorithm is O(n log n + m log m log k + nk² log k) (11). The algorithm is scalable because k << n, so even for a large value of k it scales well.

An example is shown using the algorithm to find biological pathways in a yeast network. The graph contained 5,552 vertices and 34,000 edges. Even though the graph was missing edges between parts of the graph, the algorithm was able to circumvent the broken chain and find the rest of the vertices (11).

Graph mining is an important problem to solve in a timely manner. The approximation algorithm for k-MST is an efficient algorithm for finding substructures, and the results show it has a good time complexity and approximation ratio. Also see the appendix.

3.1.1.4 Index-based similarity search

Index-based similarity search is a method of finding similarities in protein structure databases. This extraction method improves the running time of VAST by 3 to 3.5 times while keeping sensitivity similar (12).

There are two methods of index-based similarity search. The first matches a pairwise alignment tool like VAST. First, vector features are extracted from triplets of Secondary Structure Elements (SSEs) of proteins. Next, the features are indexed by a multidimensional index structure, which finds proteins similar to a given query protein in a given dataset. After this, the proteins are aligned to the query using a pairwise alignment tool. The second method joins two protein datasets to find all-to-all similarity (12).

3.1.1.5 Entrez

Entrez is a tool developed by the National Center for Biotechnology Information. This software is a data retrieval system with access to many domains, including literature, nucleotide and protein sequences, complete genomes, three-dimensional structures, and many others. Some of the databases Entrez has access to are GenBank, the curated RefSeq database, nucleotide sequences from Protein Data Bank (PDB) records, and the Third-Party Annotation (TPA) database (7).


Many useful functions can be accomplished using Entrez. One can find representative sequences, retrieve associated literature and protein records, identify conserved domains, identify similar proteins and known mutations, find a resolved three-dimensional structure, and view the genomic context and download the sequence region (7). The amount of information and the power behind Entrez make it a very useful tool.
very useful tool.


Entrez has access to approximately thirty-five kinds of databases (8). Each type is then linked to numerous other databases for cross-referencing, matching, and finding many other characteristics.


Some of the most powerful searches through a database of biological information are unexpected. For example, when a query is done in Entrez with a specific name, sequence, or other identification, not only do the results for the sequence show, but the associated journals, abstracts, and other resources as well (8). Entrez is a compilation of tools: it uses algorithms like clustering for sequence comparisons, and others are used to link queries to reference information.


Entrez is an impressive example of software for bioinformatics. It combines a number of algorithms, search results, programming tools, and many customizations, and it has a staggering amount of data linked to its 30+ databases. Entrez is an enormous set of tools with many practical uses for biologists and those in the computer field.
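Entrez can also be queried programmatically through NCBI’s E-utilities web service. The sketch below is an illustration only; the query term, the number of records requested, and the crude XML parsing are assumptions of this example, and production code should follow NCBI’s usage guidelines.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use URI::Escape qw(uri_escape);

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';
my $term = uri_escape('BRCA1[gene] AND human[organism]');   # illustrative query

# ESearch returns the matching record identifiers as XML.
my $search_xml = get("$base/esearch.fcgi?db=protein&retmax=3&term=$term");
die "esearch request failed\n" unless defined $search_xml;
my @ids = $search_xml =~ m{<Id>(\d+)</Id>}g;
die "no records found\n" unless @ids;

# EFetch retrieves the actual records, here as FASTA sequences.
my $fasta = get("$base/efetch.fcgi?db=protein&rettype=fasta&retmode=text&id="
                . join(',', @ids));
print $fasta if defined $fasta;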




3.1.2 Microarray

The next topic of discussion is microarrays. Microarrays are small chips used to study gene expression. Microarray chips identify which genes are active at the time of testing, because not all genes are active at one time (3). Software captures, manages, and analyzes DNA microarray expressions (4). The future goal of microarray technology is treating diseases at a local level (1).

Microarray analysis works by having fragments of human DNA stuck to spots on the chip. Next, modified DNA fragments are added and stick to the previous ones where they match. Two sets, taken at different states and identified by fluorescent coloring, allow an image to be derived from the microarray. The colors are spots of green, yellow, and red (3). The complete process is as follows (1):

1. Prepare the DNA chip using the chosen target DNAs.
2. Generate a hybridization solution containing a mixture of fluorescently labeled cDNAs.
3. Incubate the hybridization mixture containing the cDNAs with the DNA chip.
4. Detect the bound cDNA using lasers and store the data in a computer.

This whole process is known as hybridization probing. Fluorescently labeled nucleic acid molecules called mobile probes identify complementary molecules, that is, sequences able to base-pair with one another. DNA is made of four different nucleotides: adenine, thymine, guanine, and cytosine. Adenine complements thymine, and guanine complements cytosine. When complementary sequences match, so that the immobilized target DNA and the mobile probe DNA, cDNA, or mRNA lock together, the process is known as hybridizing (1). A small sketch of this complementarity rule is shown below.
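The sketch below is an illustration only; the sequence is invented.

#!/usr/bin/perl
use strict;
use warnings;

# Watson-Crick pairing: A complements T, G complements C.
# tr/// rewrites each base to its complement; reversing the result gives the
# probe sequence that would hybridize to the target when both are read 5' to 3'.
sub reverse_complement {
    my ($seq) = @_;
    (my $comp = uc $seq) =~ tr/ACGT/TGCA/;
    return scalar reverse $comp;
}

my $target = 'ATGGCATTC';
my $probe  = reverse_complement($target);
print "target: $target\nprobe:  $probe\n";   # probe: GAATGCCAT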

When the hybridization is complete, the microarray can be scanned or read. Lasers, microscopes, and cameras are used for the job: the laser excites the fluorescent tags, and the microscope and camera create the digital image of the microarray (1). This process is only the beginning of the analysis. Also see the appendix.

The next part of the analysis is analytical and statistical. The microarray data goes through image analysis to quantify gene expression (3). “Some microarray experiments can contain up to 30,000 target spots (1).” There are multiple steps in the image analysis. First is semi-automatic grid construction, which defines the area where spots are expected. Next, either automatic or manual grid adjustments are made to ensure the grid spots are aligned. Spot intensities are calculated through either an integral of non-saturated pixels or spot medians, and the local background is subtracted from this intensity (4). After the intensity has been calculated, the data can be added to a database, Microsoft Excel®, or some other data management system. The analysis of the data gathered from microarray analysis is clustering with the data taken at different states (3); other methods include k-Means and k-MST. A minimal sketch of the intensity calculation is shown below.
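The sketch below is an illustration only of the quantification step: median spot intensity, background subtraction, and the log2 ratio of the two channels. The pixel values and background estimates are invented.

#!/usr/bin/perl
use strict;
use warnings;

# Median of a list of pixel values.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(@sorted / 2);
    return @sorted % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

my @red_pixels   = (812, 840, 795, 830, 820);   # treated/diseased channel
my @green_pixels = (410, 395, 420, 405, 400);   # reference channel
my ($red_bg, $green_bg) = (120, 115);           # local background estimates

my $red   = median(@red_pixels)   - $red_bg;    # background-subtracted intensities
my $green = median(@green_pixels) - $green_bg;
my $log_ratio = log($red / $green) / log(2);    # log2 expression ratio

printf "red=%.1f green=%.1f log2 ratio=%.2f\n", $red, $green, $log_ratio;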


A number of tools manage this information. One of them is TM4, a suite of software consisting of the Microarray Data Manager (MADAM), TIGR Spotfinder, the Microarray Data Analysis System (MIDAS), and the Multiexperiment Viewer (MeV), as well as a Minimal Information About a Microarray Experiment (MIAME)-compliant MySQL database. These tools are used for spotted two-color arrays, but there are others that will run single-color formats (4).


The National Center for Biotechnology Information (NCBI) is currently working on ways to manage the data from microarrays. One of its projects is the Gene Expression Omnibus (GEO), NCBI’s online resource for the storage and retrieval of gene expression data, which can come from organisms or artificial sources. Another project supported by NCBI is the Microarray Markup Language (MAML), which is being developed by the Microarray Gene Expression Database group. Its primary goal is to adopt standards for DNA-array experiments, covering annotation, data representation, experimental controls, and data normalization methods. The ultimate goal for the NCBI is to have data in many formats, including a new version of MAML called the MicroArray Gene Expression Markup Language (MAGEML) (1).

3.1.3 Protein modeling

Protein modeling is the act of taking a protein sequence and imitating the shape it would naturally take. It is a subset of proteomics. The shape a protein takes when it folds determines what it does. Proteins are remarkable because they start as a simple sequence of amino acids and then assemble themselves (3). Understanding protein folding is important for the discovery and treatment of diseases and other health issues. When a protein does not fold into its correct shape, there are numerous ramifications: diseases as well as other adverse effects are caused by incorrect folding (5). Some famous examples of diseases caused by misfolding are Alzheimer’s, Mad Cow, and Parkinson’s (5). Protein modeling will help scientists determine where problems occur in protein structures, and possibly suggest a method of treatment.

Protein modeling is only simulated using experimentally determined protein structures (1). Experimental methods such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy are more accurate ways of determining a protein’s structure; modeling gives researchers a starting point to confirm a structure by X-ray crystallography or NMR spectroscopy. There are four steps to model a protein (1):


1. Related proteins with known three-dimensional structures need to be found.
2. An alignment is done between the related three-dimensional structure and the target sequence.
3. A model is constructed based on the alignment with the related structures.
4. Statistics and algorithms are run to determine whether the model is acceptable.


Another aspect of protein folding is determining how different proteins interact with one another (5). Modeling can give biologists a way of determining where a problem is in a protein structure that causes disease.

The problem with protein modeling is that, in nature, proteins assemble themselves in as little as a millionth of a second (5). The speed at which proteins naturally assemble is difficult to simulate because they work on a timescale much faster than any processor available. In fact, it would take 30 CPU-years to calculate one result where a protein folded in 10,000 nanoseconds (5).

The results of protein modeling fit into the much larger section of bioinformatics, proteomics. The combination of searching, folding, active site determination, and protein interaction will give researchers much-needed information.

3.2 Hardware

Special hardware is needed to perform some of the processing done in bioinformatics. Because of the sheer volume of information, new hardware architectures are needed to run the sorting algorithms previously described. One of the newer architectures is the supercomputer-on-chip (SCoC), which uses a shared-memory multiprocessor to do calculations. Because of the availability of parts to build an SCoC, they are desirable where special-purpose hardware would normally be used (12). Another approach to advanced hardware is distributed processing: a collection of loosely coupled processors connected by a communication network. A major advantage of distributed processing is computation speedup (13).

Every hardware architecture has its own advantages and disadvantages. Supercomputers-on-chip are limited by their ability to work concurrently with existing hardware and are difficult to program for (13). Distributed processing is limited by the amount of information that can be sent through a network connection and by concurrency problems. These are the examples of hardware discussed in the upcoming sections.

3.2.1 Supercomputer on Chip

Overwhelming amounts of data, enormous jobs, and limited computer hardware are all reasons for new computer architectures. The supercomputer-on-chip (SCoC) is newer, specialized hardware with many benefits over single processors. A simple search that looks for an exact match between a query string and a sub-string of a database is a very computationally demanding task (12). A good search also needs to compensate for characters being mutated or deleted in a database. The need for more processing power becomes apparent as the searches and algorithms become more complex.

A supercomputer-on-chip is a large integrated circuit with many fairly small processors. A system bus allows data to pass between the processors and shared memory. For database items that are only read and written once, memory and cache can be small; for larger applications, more on-chip shared memory may be more efficient (12). With this simple architecture, SCoCs are cost-effective, easy to implement, and extremely fast.

One SCoC design uses ARM9 processors with 8 KB of instruction cache and 8 KB of data cache. It uses an asynchronous first-in-first-out (FIFO) pipeline design. The asynchronous FIFO has three stages: in the first, timing is constrained by the local clock of the sending synchronous block; the second stage decouples the two end stages; and in the last stage, timing is constrained by the local clock of the receiving synchronous block (12).

The results of an SCoC are staggering. In 2003, with this design using then-current 130 nm technology, 227 processors each rated at 250 MHz had a combined frequency of 57 GHz. By 2006, using smaller process technology, the combined frequency was 644 GHz. It is estimated that by 2016, using 22 nm technology with processors rated over 4,000 MHz, a combined 24,300 GHz can be obtained (12). This design is not limited to ARM processors and can be used with a variety of processor/bus combinations, depending on the application.

SCoC will be an effective way to have an enormously powerful architecture that can be custom made for an application. The design scales well for many implementations and can only improve with time.

3.2.2 Distributed processing

Distributed processing is the use of loosely coupled processors interconnected by a communication network. The architecture can be small or large, as can the processors; the hardware may include small microprocessors, workstations, and minicomputers, among others (14). The power behind distributed processing is resource sharing.

Resources are limited on any machine, regardless of its individual specifications. Resources may include memory, storage, processor speed, the number of processors, and the time a resource is available. The shortcomings of individual computers are overcome by distributed processing: the collective resources of a distributed system are much greater than those of any one individual system.

The extra resources add numerous benefits. First, there is computational speedup, due to the number of processors allocated to a job: the more processors there are, the faster the computation. Another benefit is reliability. If an individual system fails, the remaining systems can continue to operate, as long as there is enough redundancy in both hardware and data (14). There are, however, problems with reintegrating a site that goes down. A minimal sketch of splitting a job across processes is shown below.
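The sketch below is an illustration only of that speedup on a single machine: a job is split across forked worker processes and the parent waits for them. A real distributed system would ship the work to other machines rather than fork locally.

#!/usr/bin/perl
use strict;
use warnings;

my @work = (1 .. 1000);                  # stand-in workload
my $workers = 4;
my $slice = int(@work / $workers) + 1;

my @pids;
for my $w (0 .. $workers - 1) {
    my @chunk = grep { defined } @work[$w * $slice .. $w * $slice + $slice - 1];
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                     # child: process its own chunk
        my $sum = 0;
        $sum += $_ * $_ for @chunk;      # stand-in for the real computation
        print "worker $w processed ", scalar @chunk, " items (sum of squares $sum)\n";
        exit 0;
    }
    push @pids, $pid;                    # parent: remember the child
}
waitpid($_, 0) for @pids;                # wait for every worker to finish
print "all workers done\n";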

A number of mechanisms unique to distributed processing make it very efficient. First, there is computation migration. This approach transfers the computation rather than the data across the system (14). The result of the processing can be much smaller than the data needed to do the computation, which speeds up the process and makes better use of network resources. The next approach is process migration, an extension of computation migration where a process runs on a different site than where it was initiated (14). The reasons for this are numerous. First, there is load balancing: some sites in a distributed system may be overworked while others remain more or less idle, and process migration distributes the processes to even out the workload. Computational speedup can also be achieved by splitting a process into subprocesses and having them run concurrently on different sites. The distributed system can also make better use of individual hardware through process migration, since some hardware may be specialized to handle certain types of data better than others (14). Another facet of process migration is software. Certain software may exist on only a few sites in a distributed system, and having it run on only one machine can save money and time. Both approaches provide an architecture that individual computers cannot match.


The distributed processing model is most often seen in the various applications on the World Wide Web (14). This can be seen in numerous places in bioinformatics: databases have clients connect to them, and only the result is returned to the client; other distributed systems only provide data migration from server to client.


It is easy to see that distributed processing has many advantages over other methods of processing. It is relatively easy to implement, is very robust, and provides a high level of service. An example of distributed processing that focuses primarily on protein folding, misfolding, data aggregation, and related diseases is Folding@home (5). Folding@home is a distributed system with clients running its software around the world.

Folding@home is a project that started at Stanford University and is a primary example of distributed processing: the sharing of resources over a network. The client application simulates folding and misfolding, and the information can be used to find causes of diseases, create new drugs, and solve other biological problems (5).

This distributed processing system has achieved numerous awards and goals since the beginning of the project. In 2007, Folding@home reached a power of 1 petaflop, which is one quadrillion floating-point operations per second. Even if Folding@home had access to all the supercomputers in the world, fewer clock cycles would be available than with the distributed processing model (5).

A primer on protein folding is necessary to understand the need for a distributed system. Proteins are long chains of amino acids. Acting as enzymes, they are the driving force behind all biochemical reactions. The sequence is not nearly as important as the shape a protein takes to carry out its function; taking that shape is called folding. The remarkable thing about proteins is that they assemble themselves through folding before doing their work. Proteins that misfold, that is, proteins that do not assemble correctly, are known to have serious effects, including diseases (5).

As seen earlier, protein modeling is a very advanced function. The time it takes for a protein to fold can be as short as a millionth of a second; in comparison, it would take 10,000 CPU-days to simulate that folding. The distributed algorithms used by the project run on over 100,000 processors to reach the microsecond barrier. This processing power has allowed the Folding@home project to discover how proteins actually fold.

Distributed processing allows computers and biologists to solve problems that would be impossible by any other means. All the supercomputers in the world cannot compete with the processing power of the Folding@home project. On smaller scales, distributed processing is seen in web applications, like database interactions. Distributed processing has given computers the mainstream appeal they enjoy with the World Wide Web. No single computer in the world can compete with the potential power of a distributed processing system.




3.3 Perl

Programming languages play a major role in bioinformatics. Software runs databases; analyzes, stores, and retrieves data; provides modeling; and enables distributed systems to be established. Without programming languages, including the very low-level ones, bioinformatics would have no way of managing its information.

Many languages lend themselves well to bioinformatics. Java is used in many applications to update databases after processing has been done (7). Other languages, like PHP and Perl, provide a relatively easy starting place for someone in biology to get into coding. The ease of learning allows someone without a computer science background to develop custom applications easily, which also helps lessen the gap between computer people and biologists.

Numerous languages provide support to bioinformatics, and the goals of an application often determine the language that will be used. For lower-level work, languages like C or C++ can be used. Java is sometimes used for database interaction without human intervention. For direct human interaction with databases and results, scripting languages like Perl are very useful. Perl has both the power and the flexibility to apply itself to many bioinformatics applications.

The strengths of the Perl programming language make it an ideal fit for bioinformatics: it does pattern matching, web posting, and database interaction. A set of modules, Bioperl, was written for managing all aspects of bioinformatics. It is a set of tools, under a very unrestrictive license, for biologists to develop programs as they are needed (2). There is also a module for Perl called DataBase Independent, which gives Perl a common interface to many different relational database systems without rewriting code. Perl is a good fit because of its flexibility and ease of use in bioinformatics.

Modular programming is what makes Perl so adept at handling problems in bioinformatics. Modular programming is a way of organizing code into collections of interacting parts. Perl modules are the mechanism for defining object-oriented classes. Each module is a library file that uses a package declaration to create its own namespace (2). The use of modules gives Perl the flexibility to create large programs safely.

Perl modules use package declarations to create their own namespaces. Namespaces are tables containing the names of the variables and subroutines of a program (2). It is common for parts of a program to share variable names; when multiple variables have the same name, a namespace collision can occur. Modular programming prevents this by creating a separate namespace for each module (2). Other methods of preventing namespace collisions are my and use strict (2). These are less effective than modular programming, but should be used as good practice. The use of modules allows code reusability and a more logical way of programming. A minimal sketch of a module and its namespace is shown below.
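The sketch below is an illustration only; the class name and methods are invented.

#!/usr/bin/perl
use strict;
use warnings;

# A module would normally live in its own .pm file; the package declaration
# gives it a separate namespace, so its subroutine and variable names do not
# collide with names used elsewhere in the program.
package SequenceCounter;

sub new   { my ($class) = @_; return bless { count => 0 }, $class; }
sub add   { my ($self)  = @_; $self->{count}++; return $self; }
sub count { my ($self)  = @_; return $self->{count}; }

package main;

my $counter = SequenceCounter->new;
$counter->add for 1 .. 3;
print "counted ", $counter->count, " items\n";   # prints: counted 3 items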

There are many archives of modules available with unrestrictive copyrights and licenses. The Comprehensive Perl Archive Network (CPAN, http://www.cpan.org) is a collection of prewritten Perl code for nearly any use (2). The reuse of code is a major benefit of modular programming, and also a timesaver; many of the tasks programmers want to do are already written as modules and other code on CPAN. The top-level organization of the included modules is as follows (2):



- Development Support
- Operating System Interfaces
- Networking Devices IPC
- Data Type Utilities
- Database Interfaces
- User Interfaces
- Language Interfaces
- File Names Systems Locking
- Security and Encryption
- World Wide Web HTML HTTP CGI
- Server and Daemon Utilities
- Archiving and Compression
- Images Pixmaps Bitmaps
- Mail and Usenet News
- Control Flow Utilities
- File Handle Input Output
- Microsoft Windows Modules
- Miscellaneous Modules
- Commercial Software Interfaces
- Not in Modulelist

Using the CPAN archive can make any project easier through the use of existing code.

As seen earlier, databases play a major part in bioinformatics. Most bioinformaticians need to know the basics of how to use them; some specialize in them. The Perl database modules are an extension to Perl that saves a lot of code rewriting in case of a change in database management software (2).

There are two main modules written for Perl database interaction: DataBase Independent (DBI) and DataBase Dependent, or DataBase Driver (DBD). DBI handles the interaction from the program code and provides a common interface to many different relational database systems. DBD is responsible for handling the communication with the database management system. Both DBI and DBD are CPAN modules (2). A minimal sketch of their use is shown below.
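The sketch below is an illustration of DBI usage only; the driver, database file, table, and columns are invented, and a different DBD driver string would be used for another database system.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect through DBI; the DBD driver named in the connection string does the
# database-specific work.
my $dbh = DBI->connect('dbi:SQLite:dbname=sequences.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

my $sth = $dbh->prepare(
    'SELECT accession, description FROM proteins WHERE organism = ?'
);
$sth->execute('Homo sapiens');

while (my ($accession, $description) = $sth->fetchrow_array) {
    print "$accession\t$description\n";
}
$dbh->disconnect;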




Web programming has become the principal way of distributing programs to users. This is especially true for bioinformaticians, who need to collaborate, disseminate, and promote research. New installations of Perl all come with the CGI.pm Perl module for providing dynamic web content (2).

The CGI.pm Perl module is primarily used for creating interactive web pages. Each time a CGI script is called from a remote location, the CGI program is run and its output is displayed on the requesting computer (2). This can create a different web page each time the script is run. The Common Gateway Interface is not limited to Perl scripts, but Perl CGI was an early example of success on the web. Web programming is simply another example of how Perl is a useful tool for bioinformatics. A minimal CGI.pm sketch is shown below.
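The sketch below is an illustration only; the parameter name and page content are invented.

#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Echo a submitted sequence back as a dynamically generated page.
my $q   = CGI->new;
my $seq = $q->param('seq') || 'no sequence submitted';

print $q->header('text/html');
print $q->start_html('Sequence echo'),
      $q->h1('Sequence echo'),
      $q->p("You submitted: $seq"),
      $q->end_html;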

Another tool for Perl is a group of modules called Bioperl (www.bioperl.org). Bioperl contains a collection of over 500 Perl modules written explicitly for bioinformatics. The purpose of the Bioperl project is to give researchers a starting place for the programs they need to develop. When the project initially launched there was little documentation; as the project gained popularity and users, this improved.

Bioperl officially launched in 1995 with the release of Perl 5, the version of Perl that supports object-oriented programming (2). With the professional work done on the project, Bioperl became a piece of software widely accepted and used in the bioinformatics community.

There are numerous types of programs included in the Bioperl project, among them sequence, database, alignment, genomics, and proteomics modules. Given the enormity of the project, Bioperl is a mainstream project for many bioinformaticians.

Sequence programs allow researchers to input, output, run tests on, and manipulate gene and proteomic sequences. Some example modules are SeqIO, SeqStats, LiveSeq, and LargeSeq. SeqIO handles sequence file input and output. SeqStats does statistical analyses on sequences. LiveSeq stores sequence data but, unlike the others, addresses the problem of features whose locations change over time. The LargeSeq module applies a combination of the previous modules to very large sequences; the sequences can be over 100 MB but run on a system with less than half that amount of real memory (2). A minimal SeqIO sketch is shown below.
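The sketch below is an illustration of Bio::SeqIO usage only; the file name is invented.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Read a FASTA file and report basic statistics for each sequence.
my $in = Bio::SeqIO->new(-file => 'proteins.fasta', -format => 'fasta');

while (my $seq = $in->next_seq) {
    printf "%s\tlength=%d\n", $seq->display_id, $seq->length;
}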


Database programs give access to biological databases. There are modules for most major databases, such as Swiss-Prot, GenBank, BLAST, and others. The GenBank module provides access to the GenBank database. RemoteBlast runs a BLAST search remotely, while BPlite parses reports from BLAST (2).

Alignment programs handle alignments of sequences, like the pair-wise alignment discussed previously. There are many types of alignment algorithms included in the package. Some of these are UnivAln, LocatableSeq, pSW, BPbl2seq, AlignIO, Clustalw, and SeqDiff. UnivAln is a module that manipulates and displays multiple sequence alignments. The LocatableSeq module creates an object containing a start and end point for locating sequences relative to other sequences or alignments. pSW uses the Smith-Waterman algorithm to align two sequences. BPbl2seq is a parser for BLAST pair-wise alignment reports, and AlignIO also handles alignments from BLAST. Clustalw interfaces to the Clustal W multiple sequence alignment package. The SeqDiff module handles sets of mutations and variants of sequences (2).

The genomics and proteomics section contains a vast number of modules to search, write, and run statistics on sequences. Example modules include RestrictionEnzyme, Sigcleave, OddCode, SeqPattern, Split, Fuzzy, Genscan, Results, Exon, ESTScan, and MZEF. RestrictionEnzyme locates restriction sites on an enzyme. Sigcleave finds amino acid cleavage sites. OddCode rewrites amino acid sequences in an abbreviated code for statistical purposes; the rewritten code uses a hydrophilic/hydrophobic alphabet. Using SeqPattern, one can write regular expression descriptions of sequence patterns. Split is a useful module that gives location information for a sequence; the location may be multiple ranges, and possibly multiple sequences. The module Fuzzy gives inexact location information. Genscan, Results, Exon, ESTScan, and MZEF are all examples of interfaces to gene finding programs (2).

Perl has access to a massive set of modules for use in bioinformatics. DataBase Dependent and DataBase Independent add connections and interactions with databases. Bioperl is a preassembled collection of bioinformatics algorithms, database interactions, and standalone programs. These modules are unique to bioinformatics and create a foundation for any developer wanting to help the bioinformatics community. Unlike many other programming languages, Perl has many resources that are freely available for developers to use, and the reuse of existing code saves countless hours of development. Perl is not alone in its use in bioinformatics, yet it has the advantage of free source code, software, and other resources.

4. Conclusion

Bioinformatics is an ever-expanding field, and there are many challenges facing it. Bioinformaticians are limited in their ability to gather, represent, and analyze data. New software and hardware have to be developed to keep up with the increasing amount of data.

There are numerous software titles specialized for bioinformatics, and even more algorithms. Entrez is a data retrieval and search tool. BLAST is NCBI’s multiple-database interface. There are algorithms, like clustering, that group objects together based on similarities; however, the reason the objects are grouped has to be inferred by the researcher. The approximation algorithm for k-MST and the algorithm for disease-specific genomic analysis were made to improve upon previous clustering techniques.

Databases are the backbone of bioinformatics. Without the ability to store information effectively and efficiently, bioinformatics would have no way of distributing its information. There are numerous databases containing all the different varieties of information pertinent to bioinformatics. The distribution of information has allowed researchers to work together on problems that would have been impossible before.

Whenever a new technique for gathering biological data is found, there is a need for computers to help handle the data. Microarrays show that there is room for creating standards, new approaches to analysis, and unique languages (1). The need for supercomputers-on-chip arises when very specialized problems come up and a huge amount of processing power is needed (13). Folding@home had a problem that could only be solved with worldwide participation; without distributed processing, there would be no way Folding@home could simulate protein folding at a truly representative level (5). There are countless examples of new needs coming from new developments that are not specific to bioinformatics.

New software can improve performance as much as, if not more than, new hardware. Improvements to algorithms allow software to take advantage of added processing power by fixing bottlenecks in systems. There is always room for improvement in software.

Different hardware implementations have different advantages and disadvantages. Distributed processing on a massive scale requires participation from a wide audience, and there may not be enough people to support a project like Folding@home; however, the resources added when doing distributed processing greatly increase the efficiency of the work, and physical boundaries do not limit the system. Supercomputers-on-chip are a very specialized form of hardware. They require very low-level knowledge of a computer to implement on a current architecture. Using a group of tightly coupled processors creates enormous gains in processing time, which allows researchers to be more productive.

The Perl programming language is extremely beneficial to bioinformaticians. There are numerous resources available for use in bioinformatics. Some of these are specialized, like the modules in Bioperl; others are built-in features of the language, or modules that can be added under an unrestrictive license.

Continuous effort is needed to expand the field of bioinformatics. New software has to be developed, and that software needs to take advantage of existing algorithms as well as improve them. Hardware has to be developed to meet the needs of bioinformatics, because the techniques used always push the current hardware to its limits. Finally, programming languages need to be developed and matured into something useful for both computer scientists and biologists. If there is a gap in understanding between the two, new implementations cannot be developed. As software, hardware, and programming languages advance, so do the results of bioinformatics.

The evolution of bioinformatics has limitless possibilities for our future. New hardware, software, and algorithms will allow researchers to identify, diagnose, and potentially treat diseases in humans as well as other living organisms. New drugs can be engineered and developed. Biology and computers are the perfect marriage of seemingly unrelated fields.































Appendix

Sequence alignment example

Unaligned sequence

Gates likes cheese
|  |
Grated cheese

Aligned sequence

G-ates likes cheese
| |||        ||||||
Grated ----- cheese
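An alignment like the one above can be produced mechanically. The sketch below is an illustration only: a global alignment (Needleman-Wunsch) with unit match, mismatch, and gap scores, applied to simplified versions of the two phrases; it is not taken from the cited sources.

#!/usr/bin/perl
use strict;
use warnings;

my ($s, $t) = ('GATESLIKESCHEESE', 'GRATEDCHEESE');
my ($gap, $match, $mismatch) = (-1, 1, -1);

# Fill the dynamic-programming score matrix; @move remembers the best step.
my (@score, @move);
$score[$_][0] = $_ * $gap for 0 .. length $s;
$score[0][$_] = $_ * $gap for 0 .. length $t;

for my $i (1 .. length $s) {
    for my $j (1 .. length $t) {
        my $diag = $score[$i - 1][$j - 1]
                 + (substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? $match : $mismatch);
        my $up   = $score[$i - 1][$j] + $gap;
        my $left = $score[$i][$j - 1] + $gap;
        my ($best, $dir) = ($diag, 'd');
        ($best, $dir) = ($up,   'u') if $up   > $best;
        ($best, $dir) = ($left, 'l') if $left > $best;
        ($score[$i][$j], $move[$i][$j]) = ($best, $dir);
    }
}

# Trace back from the bottom-right corner, emitting '-' for gaps.
my ($i, $j, $a, $b) = (length $s, length $t, '', '');
while ($i > 0 || $j > 0) {
    my $m = $i == 0 ? 'l' : $j == 0 ? 'u' : $move[$i][$j];
    if    ($m eq 'd') { $a = substr($s, --$i, 1) . $a; $b = substr($t, --$j, 1) . $b; }
    elsif ($m eq 'u') { $a = substr($s, --$i, 1) . $a; $b = '-' . $b; }
    else              { $a = '-' . $a;                 $b = substr($t, --$j, 1) . $b; }
}
print "$a\n$b\n";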


k-MST

Notes on the algorithm (11)

1. Step 2 (guessing the bound of OPT) is used to guarantee the approximation ratio.
2. In step 5 (subset selection of clusters), the goal could be changed to finding the optimal subset of clusters having at least k vertices.
3. Step 7 extracts the best k-MST from the resultant tree. If one extracts a simple k-subtree instead, the approximation ratio is still preserved, and the time complexity of this step is reduced from O(nk²) to O(k). This could be useful for large k.

Notes on running time (11)

Step 1 takes O((m + n) log n) time (or O(m + n log n) if using a Fibonacci heap). Step 2 repeats log k times. Step 3 takes O(m log m) time. Step 4 takes O(n) time. Step 5 takes O(nk) time. Step 6 takes O(n) time. Step 7 takes O(nk²) time (or O(k) if extracting a simple k-subtree). Thus, the total time complexity is O(n log n + m log m log k + nk² log k).


Example of running time (11) [figure not reproduced]



Microarray image and data (3) [figure not reproduced]




Questions

What is bioinformatics?

What is clustering?

What are some improved
clustering algorithms?

What is an SCoC?

How does distributed processing work?

What makes Perl a good choice for use in
bioinformatics?





References

1. Bioinformatics. [Online] March 29, 2004. [Cited: January 30, 2008.] http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html.

2. Tisdall, James D. Mastering Perl for Bioinformatics. Sebastopol: O'Reilly & Associates, Inc., 2003.

3. Computational Biology & Bioinformatics: A Gentle Overview. Nair, Achuthsankar S. s.l.: Communications of the Computer Society of India, 2007.

4. TM4: A Free, Open-Source System for Microarray Data Management and Analysis. Saeed, A.I., et al. 2003, BioTechniques, pp. 374-378.

5. Stanford. Folding@home distributed computing. [Online] http://folding.stanford.edu/.

6. NMSU. Module 1: General Biocomputing Applications. SWBIC. [Online] NMSU, 2001. [Cited: February 15, 2008.] http://darwin.nmsu.edu/molb_resources/tutorials/comp-bio/genbioint.htm.

7. Entrez: Making use of its power. Briefings in Bioinformatics. s.l.: Henry Stewart Publications, June 2003.

8. NCBI. Bioinformatics. NCBI. [Online] March 29, 2004. [Cited: January 27, 2008.] http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html.

9. Cluster Analysis. s.l.: StatSoft, 2008.

10. Disease-Specific Genomic Analysis: Identifying the Signature of Pathologic Biology. Nicolau, Monica, Tibshirani, Robert, Borresen-Dale, Anne-Lise and Jeffrey, Stefanie. 2007, Bioinformatics Advance Access, p. 2.

11. TM4: A Free, Open-Source System for Microarray Data Management and Analysis. Saeed, A.I., et al. 34, s.l.: BioTechniques, 2003.

12. Bioinformatics Application of a Scalable Supercomputer-on-chip Architecture. Smith, Scott F. and Frenzel, James F. 2003, Parallel and Distributed Processing Techniques and Applications, pp. 1-2.

13. Silberschatz, Galvin and Gagne. Operating System Concepts with Java. s.l.: John Wiley & Sons.

14. Efficient Algorithms for Mining Significant Substructures in Graphs with Quality Guarantees. He, Huahai and Singh, Ambuj K. pp. 1, 10.

15. Towards Index-based Similarity Search for Protein Structure Databases. Camoglu, Orhan, Kahveci, Tamer and Singh, Ambuj K. 2003, IEEE Computer Society Bioinformatics Conference, p. 148.

16. What is GenBank? [Online] January 9, 2008. [Cited: January 30, 2008.] http://www.ncbi.nlm.nih.gov/Genbank/.

17. PubMed Central. [Online] April 16, 2007. [Cited: January 28, 2008.] http://www.pubmedcentral.nih.gov/.

18. What is the Human Genome Project? [Online] December 07, 2005. [Cited: January 24, 2008.] http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml.

19. Stanford MicroArray Database. [Online] 2008. [Cited: January 29, 2008.] http://genome-www5.stanford.edu/.

20. The Bioinformatics, Distributed Systems, and Databases Lab (DBL). [Online] 2008. [Cited: ] http://www.cs.ucsb.edu/~dbl/publications.php.