Bioinformatics and Database Integration - Department of Computer ...

Oct 2, 2013

Bioinformatics Data Representation and Integration


By Ngozi Oleleh

Table of Contents


Introduction to Bioinformatics


Proteins and Sequences


Bioinformatics Tools


The databases


Blast Functions


Bioindexing


Conclusion


What is Bioinformatics?


Bioinformatics is the use of computers to study and
handle biological information.


Bioinformatics can be viewed as an integration of
computer science and biology that enhances the
study of biological data, which has proven to be
very extensive.


The role of computer science in this interdisciplinary field is
to store the data (via databases) for future analysis with
biological tools.


The field includes, but is not limited to, the study
of genes, DNA sequences and protein structures.



Proteins and Sequences


Biological proteins are made up of 20 amino acids:


Amino acid       3-letter   1-letter
Alanine          Ala        A
Arginine         Arg        R
Asparagine       Asn        N
Aspartic acid    Asp        D
Cysteine         Cys        C
Glutamine        Gln        Q
Glutamic acid    Glu        E
Glycine          Gly        G
Histidine        His        H
Isoleucine       Ile        I
Leucine          Leu        L
Lysine           Lys        K
Methionine       Met        M
Phenylalanine    Phe        F
Proline          Pro        P
Serine           Ser        S
Threonine        Thr        T
Tryptophan       Trp        W
Tyrosine         Tyr        Y
Valine           Val        V


Proteins and Sequences


The combination of these amino acids makes up protein
structures and sequences.


The PDB database contains numerous protein structures
that can be related by sequence alignment or fold
recognition.


Bioinformatics studies the differences and similarities of
these protein structures based on sequence similarity.


A sequence is a combination of amino acids.


These sequences carry biological data that can be
used to denote information about families of proteins.
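As a small illustration of a sequence as a string of amino acid codes, the sketch below converts three-letter codes to the one-letter form and computes a naive percent identity between two sequences of equal length. The function names are hypothetical, and the identity measure is a deliberately simple stand-in for the alignment-based comparisons used by real tools.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter sequence string."""
    return "".join(THREE_TO_ONE[r.lower()] for r in residues)


def percent_identity(a, b):
    """Percentage of positions holding the same residue (equal-length input only)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / len(a)
```

For example, `to_one_letter(["Thr", "Ser", "Pro", "Asp"])` gives `"TSPD"`, the first four residues of the query sequence used in the BLAST example later in the deck.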


Bioinformatics Tools


Mage: used to display protein singular structures


RasMol: used to display protein 3D structures


LALIGN: for pairwise sequence alignment


ClustalW: used for multiple sequence alignment


AMMP: molecular modeling


Sequence Alignment Tools


FASTA


BLAST (will be looked at extensively)



Biological Databases


There are over 5000 public biological databases.


These databases contain genomic, proteomic and
microarray data.


This data is made up of gene sequences or the
amino acid sequences of proteins.


Biological databases have become very useful to
scientists. They are important in understanding and
explaining a host of biological phenomena, from the
structure of biomolecules and their interactions, to the
whole metabolism of organisms, to the evolution of species.


This knowledge helps in the fight against
diseases, assists in the development of
medications and helps in discovering basic relationships
among species in the history of life.


Biological knowledge is distributed among
many different general and specialized databases.
This sometimes makes it difficult to ensure the
consistency of information.


Biological databases cross-reference other
databases by accession number as one way of
linking related knowledge together.




Bioinformatics databases fall into two
groups: generalized databases and specialized
databases.


Generalized databases


Primary sequence databases (EMBL, GenBank, DDBJ)


Protein sequence databases (Swiss-Prot, UniProt, UniRef)


Carbohydrate databases (CarbBank)


3D structure databases (PDB, EBI-MSD, NDB)


Specialized Databases


Specialized sequence databases


Genome databases


Specialized protein sequence databases


Specialized structure databases


Microarray databases


The main focus here is the generalized databases.


Primary Sequence Databases


EMBL (European Molecular Biology Laboratory
nucleotide sequence database at EBI, Hinxton, UK)


GenBank (at the National Center for Biotechnology
Information, NCBI, Bethesda, MD, USA)


DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)



Protein Sequence Databases


SWISS-PROT (Swiss Institute of Bioinformatics, SIB, Geneva, CH)


TrEMBL (Translated EMBL: computer-annotated protein
sequence database at EBI, UK)


PIR-PSD (PIR-International Protein Sequence Database,
annotated protein database by PIR, MIPS and JIPID at NBRF,
Georgetown University, USA)


UniProt (joined data from Swiss-Prot, TrEMBL and PIR)


UniRef (UniProt NREF (Non-redundant REFerence) database at
EBI, UK)


IPI (International Protein Index; human, rat and mouse
proteome database at EBI, UK)


Other Databases


Carbohydrate databases


CarbBank (former complex carbohydrate structure
database)


3D structure databases


PDB (Protein Data Bank, curated by RCSB, USA)


EBI-MSD (Macromolecular Structure Database at EBI, UK)


NDB (Nucleic Acid structure Database at Rutgers, the State
University of New Jersey, USA)


BLAST


BLAST is a heuristic algorithm for detecting sequence
similarity and is optimized for speed. It is suitable
for large-scale analysis.


BLAST matches a query sequence to
specific positions in database sequences.
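The idea of matching a query to positions in a database sequence can be sketched with exact word matching. This is only a toy version of the seeding step: real BLAST also accepts similar (not just identical) words and then extends each seed into a scored alignment. The function names are hypothetical, and the word size of 3 mirrors the protein default mentioned later in the deck.

```python
from collections import defaultdict


def index_words(seq, word_size=3):
    """Map every word (substring of length word_size) to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - word_size + 1):
        positions[seq[i:i + word_size]].append(i)
    return positions


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where an identical word occurs in both."""
    subject_index = index_words(subject, word_size)
    seeds = []
    for q in range(len(query) - word_size + 1):
        word = query[q:q + word_size]
        for s in subject_index.get(word, []):
            seeds.append((q, s))
    return seeds
```

Indexing the subject once and looking up each query word is what makes this kind of search fast compared with comparing every position against every other position.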







Quick Diversion


BLAST Example


Sequence to be queried:


TSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQ


Sequences producing significant alignments:                         Score (Bits)  E Value

pdb|2FXP|A  Chain A, Solution Structure Of The Sars-Coronaviru...   82.4          3e-17
pdb|2BEZ|F  Chain F, Structure Of A Proteolitically Resistant ...   81.6          5e-17
pdb|1WNC|A  Chain A, Crystal Structure Of The Sars-Cov Spike P...   77.8          7e-16
pdb|1WYY|A  Chain A, Post-Fusion Hairpin Conformation Of The S...   76.6          1e-15
pdb|2BEQ|D  Chain D, Structure Of A Proteolytically Resistant ...   69.7          2e-13
pdb|1ZVA|A  Chain A, A Structure-Based Mechanism Of Sars Virus...   68.6          5e-13
pdb|1ZV7|A  Chain A, A Structure-Based Mechanism Of Sars Virus...   65.9          3e-12
pdb|1ZV8|B  Chain B, A Structure-Based Mechanism Of Sars Virus...   65.5          4e-12
pdb|1WDG|A  Chain A, Crystal Structure Of Mhv Spike Protein Fu...   25.4          4.7
pdb|2A11|A  Chain A, Crystal Structure Of Nuclease Domain Of R...   24.3          9.1
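The E value is the number of hits with at least that score expected by chance, so smaller values indicate more significant matches. The sketch below keeps only the hits from the list above whose E value falls under a cutoff. The cutoff of 0.01 and the function name are assumptions chosen for illustration, not part of the BLAST output.

```python
# (identifier, bit score, E value) rows copied from the result list above.
HITS = [
    ("pdb|2FXP|A", 82.4, 3e-17),
    ("pdb|2BEZ|F", 81.6, 5e-17),
    ("pdb|1WNC|A", 77.8, 7e-16),
    ("pdb|1WYY|A", 76.6, 1e-15),
    ("pdb|2BEQ|D", 69.7, 2e-13),
    ("pdb|1ZVA|A", 68.6, 5e-13),
    ("pdb|1ZV7|A", 65.9, 3e-12),
    ("pdb|1ZV8|B", 65.5, 4e-12),
    ("pdb|1WDG|A", 25.4, 4.7),
    ("pdb|2A11|A", 24.3, 9.1),
]


def significant_hits(hits, max_evalue=0.01):
    """Keep hits whose E value is at or below the (assumed) cutoff."""
    return [hit for hit in hits if hit[2] <= max_evalue]
```

With this cutoff the two last rows (E values 4.7 and 9.1) are dropped, which matches the visible gap in bit score between the first eight hits and the rest.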


BLAST Functions in Databases


BLAST is one of the most heavily used data analysis
tools available, so large-scale data analysis systems
need to support BLAST functions.


BLAST support is achieved by defining a set of user-defined
functions that return BLAST results as a table.


Many databases support BLAST functions.


The two major BLAST functions are


BLAST_MATCH


BLAST_ALIGN

The BLAST Functions


function BLASTP_MATCH (
    query_seq              CLOB,
    seqdb_cursor           REF CURSOR,
    subsequence_from       NUMBER default 1,
    subsequence_to         NUMBER default -1,
    filter_low_complexity  BOOLEAN default false,
    mask_lower_case        BOOLEAN default false,
    sub_matrix             VARCHAR2 default 'BLOSUM62',
    expect_value           NUMBER default 10,
    open_gap_cost          NUMBER default 11,
    extend_gap_cost        NUMBER default 1,
    word_size              NUMBER default 3,
    x_dropoff              NUMBER default 15,
    final_x_dropoff        NUMBER default 25)
return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);

Parameter Description


query_seq: The query sequence to search. A sequence is just lines of
sequence data. Blank lines are not allowed in the middle of bare sequence input.


seqdb_cursor: The cursor parameter supplied by the user when calling the function. It should return two
columns in each row: the sequence identifier and the sequence string.


subsequence_from: Start position of a region of the query sequence to be used for
the search. The default is 1.


subsequence_to: End position of a region of the query sequence to be used for
the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.


filter_low_complexity: TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that
have low compositional complexity. Filtering can eliminate statistically significant but biologically
uninteresting regions, leaving the more biologically interesting regions of the query sequence available for
specific matching against database sequences. Filtering is only applied to the query sequence. The default
value is FALSE.


mask_lower_case: TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the
query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from
the sequence. The default value is FALSE.


sub_matrix: Specifies the substitution matrix used to assign a score for aligning any possible
pair of residues. The different options are PAM30, PAM70, BLOSUM80, BLOSUM62, and
BLOSUM45. The default is BLOSUM62.


expect_value: The statistical significance threshold for reporting matches against database
sequences. The default value is 10. Specifying 0 invokes default behavior.


open_gap_cost: The cost of opening a gap. The default value is 11. Specifying 0 invokes
default behavior.


extend_gap_cost: The cost of extending a gap. The default value is 1. Specifying 0 invokes
default behavior.


word_size: The word size used for dividing the query sequence into subsequences during the
search. The default value is 3. Specifying 0 invokes default behavior.


x_dropoff: Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes
default behavior.


final_x_dropoff: The final X dropoff value for gapped alignments in bits. The default value is
25. Specifying 0 invokes default behavior.


t_seq_id: The sequence identifier of the returned match.


score: The score of the returned match.


expect: The expect value of the returned match.
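The subsequence_from and subsequence_to defaults can be read as "search the whole query". The sketch below shows one way that range could be resolved. It assumes positions are 1-based and inclusive, which the description above does not state explicitly, and the function name is hypothetical.

```python
def query_region(query_seq, subsequence_from=1, subsequence_to=-1):
    """Return the part of query_seq that would be searched.

    Assumes 1-based, inclusive positions. A subsequence_to of -1 means
    "use the sequence length", as in the parameter description.
    """
    if subsequence_to == -1:
        subsequence_to = len(query_seq)
    if not 1 <= subsequence_from <= subsequence_to <= len(query_seq):
        raise ValueError("invalid subsequence range")
    return query_seq[subsequence_from - 1:subsequence_to]
```

Under these assumptions, `query_region("TSPDVDLG", 3, 5)` selects the three residues at positions 3 to 5.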

How the whole system works


Sequences that need to be searched are
inserted into a query table:


INSERT INTO query_db VALUES ('1',
'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT');


How does it work


SELECT t_seq_id, score, expect AS evalue
FROM TABLE(BLASTP_MATCH(
    (SELECT sequence FROM query_db),               -- query_seq
    CURSOR(SELECT seq_id, seq_data
           FROM swissprot
           WHERE organism = 'Homo sapiens (Human)'),  -- seqdb_cursor
    1,            -- subsequence_from
    -1,           -- subsequence_to
    0,            -- filter_low_complexity
    0,            -- mask_lower_case
    'BLOSUM62',   -- sub_matrix
    10,           -- expect_value
    0,            -- open_gap_cost
    0,            -- extend_gap_cost
    0,            -- word_size
    0,            -- x_dropoff
    0)) t         -- final_x_dropoff
WHERE t.score > 25;

The Search Procedure


SELECT t.t_seq_id, t.score, t.expect, p.name
FROM PROT_DB p, TABLE(
    BLASTP_MATCH(
        (SELECT sequence FROM query_db WHERE sequence_id = '2'),
        CURSOR(SELECT seq_id, sequence FROM PROT_DB),
        1,
        -1,
        0,
        0,
        'BLOSUM62',
        10,
        0,
        0,
        0,
        0,
        0)
    ) t
WHERE t.t_seq_id = p.seq_id
  AND t.score > 25
ORDER BY t.expect;


Output Results


SEQ_ID    SCORE       EVALUE
--------  ----------  ----------
P31946    205         5.8977E-18
Q04917    198         3.8228E-17
P31947    169         8.8130E-14
P27348    198         3.8228E-17
P58107    49          7.24297332

The Databases and Why


The ability to perform genome-wide and cross-genome data
analysis can reduce the time required for new biological
discoveries.


Since traditional databases are not built to support location
datatypes, researchers have had to find ways for these
databases to manage biological information so that it
can be queried with a modern database system.


This research has led to a concept called bioindexing.



Bioindexing


An index in this context is a way of providing a
mapping between information entities.


In a traditional database, an index is an auxiliary structure
that speeds up data retrieval by providing a
mapping between a record key and the physical disk
address of the records containing that key.
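The mapping from record key to storage address can be sketched in a few lines. In this toy version a Python list stands in for the table on disk and list positions stand in for physical disk addresses. The record payloads and function names are hypothetical. The keys are accession numbers from the output table shown earlier.

```python
def build_index(records):
    """Map each record key to the list of positions where it is stored."""
    index = {}
    for position, (key, _payload) in enumerate(records):
        index.setdefault(key, []).append(position)
    return index


def lookup(records, index, key):
    """Fetch matching records through the index instead of scanning every record."""
    return [records[position] for position in index.get(key, [])]


# Hypothetical payloads; only the keys come from the deck.
RECORDS = [
    ("P31946", "payload 0"),
    ("Q04917", "payload 1"),
    ("P31947", "payload 2"),
    ("Q04917", "payload 3"),
]
INDEX = build_index(RECORDS)
```

A lookup such as `lookup(RECORDS, INDEX, "Q04917")` touches only the two stored positions for that key, which is the retrieval speed-up the index provides.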


Bioindexing provides functionality similar to a database
index but also facilitates data integration.


Biological features are generally attached to locations, and
locations are also the basis for maps (a map in this context
is an association of features with a sequence alignment),
alignments (relationships between two genomic sequence
segments) and other complex relationships.


The BLAST Database and Bioindexing


Bioindexing is essentially an infrastructure for
representing and managing biological knowledge
in a large-scale database system using index
constructs.


Bioindexing uses a “location” datatype and “BLAST
joins” to efficiently handle and query the large
amount of data.


Bioindexing is essentially a scheme for connecting
and querying information in modern database
systems with the use of indexes.

Types of Indexing


Intrinsic indexing: indexable bioinformatics
datatypes. Intrinsic indexing permits both the
representation and management of biological
mappings.


Extrinsic indexing: an efficient way of
integrating data from heterogeneous
sources such as relational tables, XML files,
standard sequence formats and other sources.


Extrinsic indexing concerns the functions and
algorithms used to access and connect this
information, even when it is not stored locally.



Location (How it is represented)


Without proper abstraction, users have to implement
their own code to handle location operations.


A location consists of a sequence identifier and an
interval range.


Integer intervals are modeled as a [lower, upper] structure.


Identifiers are character strings or accession numbers
that denote a particular sequence, and the interval
range is a pair of positive integers that
denotes the sub-range within the given sequence.
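A location as described here (identifier plus a [lower, upper] pair of positive integers) can be sketched as a small value type. The class and method names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A sequence identifier plus an inclusive [lower, upper] interval (assumed)."""
    seq_id: str
    lower: int
    upper: int

    def __post_init__(self):
        if self.lower < 1 or self.upper < self.lower:
            raise ValueError("expected positive integers with lower <= upper")

    def length(self):
        """Number of positions covered, counting both bounds."""
        return self.upper - self.lower + 1

    def overlaps(self, other):
        """True when both locations are on the same sequence and share a position."""
        return (self.seq_id == other.seq_id
                and self.lower <= other.upper
                and other.lower <= self.upper)
```

Putting the overlap test inside the type is the kind of abstraction that spares users from rewriting interval comparisons in every query.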


Complexity (WHERE clauses) without location datatypes


EST sequences need to be grouped over consecutive overlapping EST fragments:




SELECT DISTINCT A.id, A.lower, B.upper
FROM ESTs AS A, ESTs AS B
WHERE A.unigene_clusterid = B.unigene_clusterid
  AND A.lower < B.upper
  AND NOT EXISTS
      (SELECT *
       FROM ESTs AS C
       WHERE C.unigene_clusterid = A.unigene_clusterid
         AND A.lower < C.lower AND C.lower < B.upper
         AND NOT EXISTS
             (SELECT * FROM ESTs AS D
              WHERE D.unigene_clusterid = A.unigene_clusterid
                AND D.lower < C.lower AND C.lower <= D.upper))
  AND NOT EXISTS
      (SELECT *
       FROM ESTs AS E
       WHERE E.unigene_clusterid = A.unigene_clusterid
         AND ((E.lower < A.lower AND A.lower <= E.upper) OR
              (E.lower < B.upper AND B.upper < E.upper)))
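For comparison, the grouping that the query aims at can be written procedurally as a sort followed by one pass per cluster. This is a sketch of the general idea of merging overlapping fragments. It is not claimed to return exactly the same rows as the SQL, and the cluster identifiers in the usage note are hypothetical.

```python
from collections import defaultdict


def merge_overlapping(ests):
    """Merge overlapping (cluster_id, lower, upper) fragments within each cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, lower, upper in ests:
        by_cluster[cluster_id].append((lower, upper))

    merged = []
    for cluster_id in sorted(by_cluster):
        current = None  # the (lower, upper) run being extended
        for lower, upper in sorted(by_cluster[cluster_id]):
            if current is None:
                current = (lower, upper)
            elif lower <= current[1]:
                # Fragment starts inside the current run: extend the run.
                current = (current[0], max(current[1], upper))
            else:
                merged.append((cluster_id,) + current)
                current = (lower, upper)
        if current is not None:
            merged.append((cluster_id,) + current)
    return merged
```

Given fragments `("Hs.1", 1, 100)`, `("Hs.1", 50, 200)` and `("Hs.1", 300, 400)`, the first two collapse into one run from 1 to 200 and the third stays separate.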

Location Datatype


A straightforward representation of a location is
a sequence identifier as a character string and the
location interval as a (start, end) pair of integers.


Other representations are possible, such as
integer codes for sequence identifiers and/or a
(start, length) interval representation.


Most databases use the sequence identifier with a
(start, end) pair of integers because of its
simplicity.
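The two interval representations carry the same information, as the conversion sketch below shows. It assumes inclusive end positions, so an interval from 1 to 10 has length 10. The function names are hypothetical.

```python
def end_to_length(start, end):
    """Convert an inclusive (start, end) pair to (start, length)."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return start, end - start + 1


def length_to_end(start, length):
    """Convert (start, length) back to an inclusive (start, end) pair."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return start, start + length - 1
```

The (start, end) form makes overlap and containment tests direct comparisons of stored values, which is one reading of the simplicity argument.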

Simplicity using the Location Datatype


Creation and insertion:


CREATE TABLE features ( location loc, description text );


-- The Prader-Willi/Angelman syndrome region on chromosome 15


INSERT INTO features VALUES ( 'NG_002690[1..755217]', 'Prader-Willi/Angelman syndrome region' );


INSERT INTO features VALUES ( 'NG_002690[1..174707]', 'AC090602.16' );


INSERT INTO features VALUES ( 'NG_002690[174707..324834]', 'AC124312.5' );


INSERT INTO features VALUES ( 'NG_002690[324835..478258]', 'AC124303.5' );


INSERT INTO features VALUES ( 'NG_002690[478259..606120]', 'AC100774.2' );


INSERT INTO features VALUES ( 'NG_002690[606121..755217]', 'AC124997.4' );
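The location literals in these statements follow the pattern identifier[lower..upper]. A minimal parser for that textual form could look like the sketch below. The regular expression is inferred from the examples above and is an assumption about the accepted syntax, not a specification of the datatype's input function.

```python
import re

# identifier, then [lower..upper] with unsigned integers (pattern inferred from the examples)
LOCATION_PATTERN = re.compile(
    r"^(?P<seq_id>[^\[\]]+)\[(?P<lower>\d+)\.\.(?P<upper>\d+)\]$"
)


def parse_location(text):
    """Split a literal such as 'NG_002690[1..755217]' into (seq_id, lower, upper)."""
    match = LOCATION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a location literal: {text!r}")
    lower, upper = int(match["lower"]), int(match["upper"])
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return match["seq_id"], lower, upper
```

For instance, `parse_location("NG_002690[324835..478258]")` yields the identifier together with the two integer bounds.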







The introduction of the location datatype not only provides
a natural and intuitive way to represent biological
information, but also boosts system performance.


An additional performance increase can be achieved by
supporting a location index scheme.


Support for indexing schemes in traditional relational
database systems is very limited and inflexible.


It is limited to a few well-known index
structures, such as B+-tree, hash and R-tree, which can
be used only for a limited set of native datatypes and for
(in)equality and range queries.


The location datatype supports a number of operations and functions.


A major proportion of these functions are related to
interval operations.


More than 30 interval operations are defined, including
Allen's interval logic [15] (which includes after, before,
contains, during, equals, overlaps, overlapped by,
finishes, finished by, meets, met by, starts and started
by).


Optimization information (such as ordering,
commutativity or negation) is also provided to permit
optimization of important operations like merge-join,
hash-join or general theta-join.
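The thirteen relations of Allen's interval logic listed above can be distinguished with a handful of endpoint comparisons. The sketch below classifies a pair of intervals given as (lower, upper) tuples with lower < upper. It follows the textbook definitions for continuous intervals, so how "meets" should be read for inclusive integer positions is left as an assumption, and the function name is hypothetical.

```python
def allen_relation(a, b):
    """Name the Allen relation of interval a to interval b.

    Both intervals are (lower, upper) tuples with lower < upper.
    """
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped by"
```

Each relation has an inverse (before/after, meets/met by, and so on), which is the kind of commutativity information an optimizer can use when reordering a join.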


Why the location datatype is needed


Here is a simple example to demonstrate the
power of location datatype support. This
example shows a session that painfully
attempts to locate alternatively spliced
exon intervals which intersect with known
homology intervals and associate them with
known protein features from the Pfam and
Swiss-Prot databases.


Complexity without locations


CREATE TABLE alt_splice_homology_map AS
SELECT o.*, d.swiss_id, d.query_start, d.query_end,
       d.hit_start + (o.seq_start - d.query_start)/3 AS hit_start,
       d.hit_start + (o.seq_end - d.query_start)/3 AS hit_end
FROM alt_splice_exon_obs o, alt_splice_homology d
WHERE o.ug_id = d.ug_id
  AND o.seq_start > d.query_start
  AND o.seq_start < d.query_end
  AND d.e_value < 0.01
GROUP BY o.ug_id, o.seq_start;


SELECT o.*, f.type, f.start, f.end
FROM alt_splice_homology_map o, swiss_feature f
WHERE o.swiss_id = f.swiss_id
  AND o.hit_end >= f.start
  AND o.hit_end <= f.end;
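The division by 3 in the first statement converts a nucleotide offset into a protein offset, since three nucleotides encode one amino acid. The sketch below spells out that projection and the final range test. It uses integer division, which is an assumption (the SQL result depends on the column types), and all function names are hypothetical.

```python
def map_to_hit(seq_pos, query_start, hit_start):
    """Project a nucleotide position onto the protein hit (3 nucleotides per residue)."""
    return hit_start + (seq_pos - query_start) // 3


def exon_on_hit(seq_start, seq_end, query_start, hit_start):
    """Project both ends of an exon interval onto protein coordinates."""
    return (map_to_hit(seq_start, query_start, hit_start),
            map_to_hit(seq_end, query_start, hit_start))


def end_within_feature(hit_end, feature_start, feature_end):
    """Mirror of the last WHERE clause: the projected end falls inside the feature."""
    return feature_start <= hit_end <= feature_end
```

Every one of these comparisons has to be written by hand in the query, which is the bookkeeping a location datatype is meant to absorb.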

Simplicity using locations


CREATE TABLE alt_splice_homology_map AS
SELECT o.*, d.location,
       range_start(d.query) + (o.location - range_start(d.hit))/3
FROM alt_splice_exon_obs o, alt_splice_homology d
WHERE o.location @ d.location   -- contained
  AND d.e_value < 0.01
GROUP BY o;


SELECT o.*, f.type, f.location
FROM alt_splice_homology_map o, swiss_feature f
WHERE o.location &< f.location;   -- left overlap
Location Support


Supporting location indexing in a traditional
database implies the need to support interval
indexing.


However, interval indexing is not supported in
traditional databases, and standard join
operations cannot handle intervals
efficiently. This has led to extensive research
on interval indexing.


This is where GiST comes in.


GiST


GiST (Generalized Search Tree) is an efficient solution to the problem of
ineffective interval indexing in traditional databases.


GiST is a balanced search tree in which keys
are maintained hierarchically. A search
key may be an arbitrary predicate, provided
that the predicate holds for all data stored
below the key.


GiST searches the tree depth-first. If the query
predicate is consistent with a given search key, the
search continues into the subtree below that key;
otherwise that subtree is skipped.


GiST Implementation


GiST is implemented using bounding intervals that
cover the range of


identifier integers (id_lower, id_upper)


and


intervals in the subtree (lower, upper).


Under the GiST architecture, interval predicates such
as left, right overlap, overleft, overright,
contains, contained and equal are all supported.
What gist location does


Conclusion


Bioinformatics databases are being modeled and queried
using functions (as seen in Oracle and IBM DB2).


An efficient way of modeling these databases is
bioindexing (as seen in the PostgreSQL database).


The use of an index structure as in bioindexing, where
locations are organized in a tree structure searched depth-first, leads to
less complexity.


This location index structure leads to faster searching of
the databases.


Speed is very important in bioinformatics.


Using a GiST architecture leads to less complex queries and a
more confined search space for query information.


References


The Index as a First-Class Construct in Relational Database Systems.
D. Stott Parker, Edwin Mach


Algorithms and Databases in Bioinformatics: Towards a Proteomic Ontology.
Mario Cannataro, Pietro Hiram Guzzi, Tommaso Mazza, Giuseppe Tradigo and
Pierangelo Veltri


Oracle® Data Mining


Mobile Access to Biological Databases on the Internet.
Pentti Riikonen, Jorma Boberg, Tapio Salakoski, and Mauno Vihinen


Utilizing Multiple Bioinformatics Information Sources: An XML Database Approach.
Raymond K. Wong, William M. Shui


Support for BioIndexing in BLASTgres.
Ruey-Lung Hsiao, D. Stott Parker, and Hung-chih Yang