Bioinformatics -


Oct 1, 2013 (3 years and 6 months ago)




GenBank is a comprehensive sequence database maintained by the NCBI, part of the NIH in the USA, and is freely
accessible through Entrez. It integrates DNA, RNA and protein sequence data with genomic, mapping, taxonomy,
protein structure and literature information. Provisions for BLAST and hyperlinks to partnership databases are also
provided.

Pairwise and Multiple sequence alignment

MSA indicates the similarities between the sequences that have been compared and hence the relationships between the
organisms that carry those sequences. In addition, it is significant in providing a clue to the evolutionary status and
distances of the organisms being researched.

EMBL database

It is maintained by the European Bioinformatics Institute, which maintains a nucleotide database in collaboration with
GenBank and DDBJ. Together they form an international consortium that shares and updates nucleotide data on a daily
basis. It provides tools and links to various other reliable and useful databases. The tools include those for sequence
and structural analyses, protein localisation and protein-protein interaction, and also provide annotations and links to

4.Margaret Dayhoff and B

Margaret Dayhoff compiled the structures and amino acid sequences of all the known proteins available in the early
1960s and distributed the compilation to interested researchers. The huge interest in this work prompted her to widen
the database with more proteins and their available and deduced nucleotide sequences. This compilation was too
large to be distributed as written material, so she copied it onto computer tapes for easy distribution and reference.
She later got the help of the American military authorities to send the data through their network on an experimental
basis. This was the beginning of bioinformatics.


The sum total of all the proteins present in a particular cell at a specified time.


A microarray or chip is a device in which proteins, peptides or DNA sequences are coated in a predetermined spatial
order [array], allowing them to be used as probes or templates for capturing and detecting complementary molecules.

7. FASTA format

It is the first widely used algorithm for database similarity search. It looks for optimal local alignments by scanning
the sequence for small matches called words. The scores of segments in which there are multiple hits are calculated and
summed up to arrive at an overall score. Gaps are then fitted into the alignment, and the alignments with the best
scores are presented.
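
The word-matching step described above can be illustrated with a minimal sketch. It is not the FASTA program
itself: it only indexes the words of length k in a query, finds exact hits in a target and counts the hits on each
diagonal, which is the kind of quantity that later scoring builds on. The function names, the word length and the
example sequences are assumptions made for illustration.

```python
from collections import defaultdict


def word_hits_by_diagonal(query, target, k=2):
    """Count exact k-letter word matches on each diagonal (target_pos - query_pos)."""
    # Index every word of length k in the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the target; every shared word is one hit on diagonal j - i.
    hits = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query, target, k=2):
    """Return (diagonal, hit_count) for the diagonal with most word hits, or None."""
    hits = word_hits_by_diagonal(query, target, k)
    if not hits:
        return None
    return max(hits.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Illustrative sequences only.
    print(best_diagonal("ACGT", "TTACGT"))
```

A diagonal with many hits marks a region where the two sequences share several words at the same relative offset,
which is where a local alignment is most likely to be found.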

8.Primary and Secondary biological databases

Primary databases are archival and contain raw data. They also contain information on sequences or structure of
amino acids, proteins etc. Ex: DDBJ, PDB, GenBank, EMBL etc.

Secondary databases are databases containing specified and derived information from the primary databases, like
conserved sequences, domains, motifs, active sites, predicted structures etc. Ex

9.Proteomics and Metabolomics

Proteomics is the study of the complete set of proteins in a cell at a specified time. This is a dynamic situation which
keeps changing with time and in response to internal and external signals or changes. The normal proteome and the
abnormal proteome will be different. Knowledge of the difference between the normal and abnormal proteome gives an
idea of disease status, treatment processes, genetic influences etc.

Metabolome denotes the sum total of the metabolic activities in a cell or organism, representing metabolites, metabolic
products, secondary metabolites etc. Metabolomics is defined as the “study of unique chemical fingerprints of all the
cellular processes and their small molecule metabolic profile”. Such a study gives an idea of the activity status of the
cell and of its genetic and physiological capability in utilising, synthesising and processing biochemicals, including
drugs, toxins, pollutants etc. In the medical field, it can trace inborn and acquired errors of metabolism, and it can
help in assessing how a pharmaceutical drug is assimilated, digested, utilised, metabolised and excreted. It gives an
idea of how a person will respond to the drug or biochemical etc.

Pairwise alignment

Every possible way of aligning two sequences is considered by sliding one with respect to the other, inserting gaps as
needed and detecting the best possible score. Scores are arrived at by assigning values for matches, mismatches and
gaps.
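
As a hedged illustration, the best score over all possible gapped alignments of two sequences can be computed without
listing every alignment explicitly. The sketch below uses the standard Needleman-Wunsch dynamic programming method for
global alignment, which is a technique chosen here for illustration and is not named in these notes. The function name
and the score values (match +1, mismatch -1, gap -2) are assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap inserted in b
                score[i][j - 1] + gap,       # gap inserted in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    # Illustrative sequences only.
    print(global_alignment_score("ACGT", "AGT"))
```

Each cell of the table reuses the best scores of shorter prefixes, so every way of placing gaps is taken into account
while the work grows only with the product of the two sequence lengths.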


It is the freely accessible main nucleotide database maintained jointly by the USA, UK and Japan. It is a primary
uncurated database.

12.Query Sequence

The input sequence against which all the entries in a database are compared in order to retrieve information from the
database. It represents a question presented to the database in a predetermined format. Many DBMSs use SQL as a
standard query language.
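
The idea of a query as a question put to a database in a predetermined format can be sketched with Python's built-in
sqlite3 module. This is only an illustration: the table layout, the function name and the accession labels (DEMO1 and
so on) are hypothetical and do not correspond to any real sequence database.

```python
import sqlite3

# Hypothetical records: (accession, organism, residues).
DEMO_ROWS = [
    ("DEMO1", "Homo sapiens", "ACGTAC"),
    ("DEMO2", "Mus musculus", "ACGTTC"),
    ("DEMO3", "Homo sapiens", "GGGTAC"),
]


def find_by_organism(rows, organism):
    """Load rows into an in-memory table and return accessions for one organism."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sequences (accession TEXT, organism TEXT, residues TEXT)"
        )
        conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)
        # The SQL statement is the question; the placeholder carries the search term.
        cursor = conn.execute(
            "SELECT accession FROM sequences WHERE organism = ? ORDER BY accession",
            (organism,),
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


if __name__ == "__main__":
    print(find_by_organism(DEMO_ROWS, "Homo sapiens"))
```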

13.Proteome and Genome


Proteome: the full and complete set of proteins that is, or can be, present in a particular cell of an organism at a
particular time.


Genome: the complete genetic information of an organism, comprising the full set of nucleotides in one haploid set of
chromosomes in its cell.


It is an annotated protein sequence database, with its knowledge base consisting of sequence entries. For

purposes, it closely follows that of the EMBL Nucleotide sequence database. The entries are structured
so as to be readable both by humans and by computer programs. Each sequence entry is composed of lines of
different types, each with its own format, which record the various data that make up the entry.

Amino acid Sequence Alignment

From a multiple alignment of three or more protein sequences, the highly conserved residues that define structural and
functional domains in protein families can be identified. New members of such families can then be found by searching
sequence databases for other sequences with these same domains.
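
A minimal sketch of how conserved residues might be read off a finished multiple alignment is given below. It assumes
the sequences are already aligned to equal length with "-" marking gaps, and it treats a column as conserved only when
every sequence has the same residue there. The function name and the example rows are assumptions for illustration;
real tools usually apply softer conservation measures.

```python
def conserved_columns(aligned):
    """Return (column_index, residue) for columns identical in every aligned sequence."""
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must all have the same length")

    conserved = []
    # zip(*aligned) walks through the alignment one column at a time.
    for index, column in enumerate(zip(*aligned)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append((index, column[0]))
    return conserved


if __name__ == "__main__":
    # Illustrative aligned protein fragments only.
    rows = ["MKV-LA", "MKI-LA", "MRVGLA"]
    print(conserved_columns(rows))
```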

Multiple Sequence Alignment (MSA)

Sequence alignment may be local or global. In both, specific query sequences are used to search for similarities in a
target using specialised software, and the results are given in the form of a score which indicates the level of
similarity (or dissimilarity) between the sequences studied. Each program has its own scoring matrices, on which the
output depends. In general, however, scoring matrices work by assigning values for matches and mismatches and
penalties for gaps in the aligned sequences.
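
How a scoring scheme turns an aligned pair into a single number can be sketched as follows. The values are
illustrative assumptions, not those of any particular program: identical nucleotides score +2, a purine-purine or
pyrimidine-pyrimidine mismatch scores -1, any other mismatch scores -2, and each gap position costs -3. The function
names are also hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y):
    """Score one aligned nucleotide pair under the assumed toy scheme."""
    if x == y:
        return 2
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return -1 if same_class else -2


def score_alignment(row_a, row_b, gap=-3):
    """Sum the pair scores and gap penalties over two equal-length aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap  # penalty for a gap position
        else:
            total += substitution_score(x, y)
    return total


if __name__ == "__main__":
    # Illustrative aligned rows only.
    print(score_alignment("ACGT", "A-GT"))
```

Changing the values in the scheme changes which alignment scores best, which is why results from different programs
are not directly comparable.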

The most popular programs for this are CLUSTALW and PSI-BLAST. The appearance of sequences that are distantly related
can be avoided by setting cut-off values. The results obtained can be tested for accuracy by setting E-values and
comparing known sequences.

An MSA is one that contains and compares more than two sequences. It improves the accuracy of alignment between
sequence pairs and gives a single-window comparison and an idea of the evolutionary features of the sequences in
question. It gives a quantitative measurement of how closely related two sequences are to one another. It also
quantifies the changes that occur as two sequences diverge over evolutionary time, taking into account the effect of
substitutions, insertions, deletions or gaps. It also gives information on the presence of specific domains and motifs
from different sources of the sequences.

17. Entrez

It is a database query page used to search for sequences using author last names, text words or other key words. It is
integrated with NCBI’s databases and compiles cross-referenced data from several related databases.


Basic Local Alignment Search Tool. It is used to detect similarity between sequences of interest and to compare the


of different organisms.

19.Domain and Motif


Domain describes the part of a protein that can fold into two or more compact globular clusters and carry out a
function independently; it has a specific geometrical or spatial arrangement that enables it to execute a specific
function. Domains are polypeptide chains of more than 200 amino acids. The overall function of a protein is
determined by the combination of its domains.

Motif is a recurring thematic element, i.e., it is found in many molecules and denotes the common group of secondary
structures that influences protein structure. It is usually a dominant or ‘central’ theme.

20. Computer memory

Memory in a computer is described in terms of binary words, and each word can be any number of bits. The smallest
unit of memory is called a register, which consists of a single binary word. The number of registers and the number of
bits that a register holds depend on the processor. The main memory or primary memory that is readily accessible to the
processor is the RAM, and its capacity depends on the size and sophistication of the system. It is used for storing
programs as well as data. The memory is arranged as sets of cells which contain the information being stored. In order
to access the cells, each cell is given a unique identifying address and location. It temporarily holds not only all
the data and instructions but also the results, and it is shared by the CPU and the I/O devices. Other categories of
main memory are ROM, cache and registers.

Secondary memories are devices which store large volumes of information and have the advantage that the
memory is not volatile. They supplement the main memory. Examples are HDDs, CDs, DVDs, Blu-ray discs, thumb drives, etc.

Biological databases

A biological database is a collection of files containing records of biological data in machine-readable form,
arranged in fields, which can be accessed, added to, retrieved, manipulated and modified. They may be raw databases or
curated databases, depending on the types of data they contain. Raw databases accept and store input in simple FASTA
format, usually for DNA, RNA or amino acid sequences. Data input to and stored in curated databases are supervised
before they are made available for public access. Curated databases usually contain data and information that have been
obtained by processing the raw data in archived databases.
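
Since raw databases are described above as accepting simple FASTA format, a minimal sketch of reading such input may
help. In FASTA format each record starts with a header line beginning with ">", followed by one or more lines of
sequence. The function name and the example records below are assumptions for illustration, and the sketch does no
validation of the residues themselves.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:].strip()  # keep the full description line
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    # Illustrative records only.
    example = ">seq1 demo\nACGT\nAC\n\n>seq2\nMKV\n"
    print(parse_fasta(example))
```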