BINF 634 Bioinformatics Programming

weinerthreeforksΒιοτεχνολογία

2 Οκτ 2013 (πριν από 4 χρόνια και 1 μήνα)

162 εμφανίσεις

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

1

Topics


Advanced SQL for Multiple Table Databases


JOINS


Objected Oriented Perl


BioPerl


PDB


BLAST



Some of the slides are based on previous material by J. Grefenstette

Multiple Table Databases

Suppose we want to store info such as:

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

| name | city | country | city size | population |

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

| Wu | Shanghai | China | 17500000 | 1321851888 |

| Lakshmi | Bombay | India | 21600000 | 1129866154 |

| Prabhu | Delhi | India | 21500000 | 1129866154 |

| Susan | New York | United States | 21900000 | 301139947 |

| Richard | Chicago | United States | 9800000 | 301139947 |

| Robert | London | United Kingdom | 12000000 | 60776238 |

| Ali | London | United Kingdom | 12000000 | 60776238 |

| Sohyoung | Seoul | South Korea | 23400000 | 49044790 |

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

Problem:


When population of city or country changes, we have to update many records


data may become
inconsistent

if we update, say, the city size for one person!

Solution:


Design Database so that information is only stored in one place


called
Normalizing the Database


Uses several tables to eliminate many
-
to
-
one relationships within a table


Queries can join several tables together

JOINS

2

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Multiple Table Databases

The PERSON table:


mysql> select * from person;

+
----
+
----------
+
----------
+

| id | name | city |

+
----
+
----------
+
----------
+

| 1 | Susan | New York |

| 2 | Robert | London |

| 3 | Ali | London |

| 4 | Richard | Chicago |

| 5 | Wu | Shanghai |

| 6 | Lakshmi | Bombay |

| 7 | Sohyoung | Seoul |

| 8 | Prabhu | Delhi |

+
----
+
----------
+
----------
+

8 rows in set (0.00 sec)


JOINS

3

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Multiple Table Databases

The CITY table:


mysql> select * from city limit 10;

+
-------------
+
---------------
+
------------
+

| name | country | population |

+
-------------
+
---------------
+
------------
+

| Tokyo | Japan | 33600000 |

| Seoul | South Korea | 23400000 |

| Mexico City | Mexico | 22400000 |

| New York | United States | 21900000 |

| Bombay | India | 21600000 |

| Delhi | India | 21500000 |

| Sao Paulo | Brazil | 20600000 |

| Los Angeles | United States | 18000000 |

| Shanghai | China | 17500000 |

| Osaka | Japan | 16700000 |

+
-------------
+
---------------
+
------------
+

10 rows in set (0.00 sec)


JOINS

4

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Multiple Table Databases

The COUNTRY table:


mysql> select * from country limit 10;

+
---------------
+
------------
+

| name | population |

+
---------------
+
------------
+

| China | 1321851888 |

| India | 1129866154 |

| United States | 301139947 |

| Indonesia | 234693997 |

| Brazil | 190010647 |

| Pakistan | 169270617 |

| Bangladesh | 150448339 |

| Russia | 141377752 |

| Nigeria | 135031164 |

| Japan | 127467972 |

+
---------------
+
------------
+

10 rows in set (0.01 sec)


JOINS

5

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

6

Wikipedia Definition of a JOIN


An
SQL

JOIN clause combines records from
two or more
tables

in a database. It creates a
set that can be saved as a table or used as is.
A JOIN is a means for combining fields from
two tables by using values common to each.
ANSI standard SQL specifies four types of
JOINs: INNER, OUTER, LEFT, and RIGHT. In
special cases, a table (base table, view, or
joined table) can JOIN to itself in a
self
-
join
.

JOINS

JOINing Two Tables

To JOIN two tables, use:

SELECT
columns

From
table1, table2

WHERE
join conditions



mysql> select *
from person, city

where
person.city = city.name
;

+
----
+
----------
+
----------
+
----------
+
----------------
+
------------
+

| id | name | city | name | country | population |

+
----
+
----------
+
----------
+
----------
+
----------------
+
------------
+

| 1 | Susan | New York | New York | United States | 21900000 |

| 2 | Robert | London | London | United Kingdom | 12000000 |

| 3 | Ali | London | London | United Kingdom | 12000000 |

| 4 | Richard | Chicago | Chicago | United States | 9800000 |

| 5 | Wu | Shanghai | Shanghai | China | 17500000 |

| 6 | Lakshmi | Bombay | Bombay | India | 21600000 |

| 7 | Sohyoung | Seoul | Seoul | South Korea | 23400000 |

| 8 | Prabhu | Delhi | Delhi | India

| 21500000 |

+
----
+
----------
+
----------
+
----------
+
----------------
+
------------
+

8 rows in set (0.01 sec)


JOINS

7

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Ambiguous Column Names

To name a field, use
table.column
if column name
appears in multiple tables:


mysql> select
person.name
,
person.city
,
country
,
population


from person, city


where person.city = city.name;

+
----------
+
----------
+
----------------
+
------------
+

| name | city | country | population |

+
----------
+
----------
+
----------------
+
------------
+

| Susan | New York | United States | 21900000 |

| Robert | London | United Kingdom | 12000000 |

| Ali | London | United Kingdom | 12000000 |

| Richard | Chicago | United States | 9800000 |

| Wu | Shanghai | China | 17500000 |

| Lakshmi | Bombay | India | 21600000 |

| Sohyoung | Seoul | South Korea | 23400000 |

| Prabhu | Delhi | India | 21500000 |

+
----------
+
----------
+
----------------
+
------------
+

8 rows in set (0.00 sec)

JOINS

8

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Aliases for Column Tables

You can provide an ALIAS for any column name:


mysql> select person.name, person.city, city.country,


city.population as 'city size'


from person, city


where person.city = city.name;

+
----------
+
----------
+
----------------
+
-----------
+

| name | city | country |
city size

|

+
----------
+
----------
+
----------------
+
-----------
+

| Susan | New York | United States | 21900000 |

| Robert | London | United Kingdom | 12000000 |

| Ali | London | United Kingdom | 12000000 |

| Richard | Chicago | United States | 9800000 |

| Wu | Shanghai | China | 17500000 |

| Lakshmi | Bombay | India | 21600000 |

| Sohyoung | Seoul | South Korea | 23400000 |

| Prabhu | Delhi | India | 21500000 |

+
----------
+
----------
+
----------------
+
-----------
+

8 rows in set (0.00 sec)


JOINS

9

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

WHERE conditions in JOINS

Use WHERE conditions to limit which records are
returned.

Example: show date for people from India


mysql> select person.name, person.city, city.country,


city.population as 'city size'


from person, city


where person.city = city.name


and city.country = 'India'
;

+
---------
+
--------
+
---------
+
-----------
+

| name | city | country | city size |

+
---------
+
--------
+
---------
+
-----------
+

| Lakshmi | Bombay | India | 21600000 |

| Prabhu | Delhi | India | 21500000 |

+
---------
+
--------
+
---------
+
-----------
+

2 rows in set (0.00 sec)


JOINS

10

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

JOIN Three Tables

Can JOIN three of more tables.

Example: Show both country and city population for each person:


mysql> select person.name, person.city, city.country,


city.population as 'city size', country.population


from
person, city, country



where person.city = city.name


and city.country = country.name
;

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

| name | city | country | city size | population |

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

| Susan | New York | United States | 21900000 | 301139947 |

| Robert | London | United Kingdom | 12000000 | 60776238 |

| Ali | London | United Kingdom | 12000000 | 60776238 |

| Richard | Chicago | United States | 9800000 | 301139947 |

| Wu | Shanghai | China | 17500000 | 1321851888 |

| Lakshmi | Bombay | India | 21600000 | 1129866154 |

| Sohyoung | Seoul | South Korea | 23400000 | 49044790 |

| Prabhu | Delhi | India | 21500000 | 1129866154 |

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

8 rows in set (0.00 sec)

JOINS

11

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

ORDER BY

Same as above, by sort results by country
population:


mysql> select person.name, person.city, city.country,


city.population as 'city size', country.population


from person, city, country


where person.city = city.name


and city.country = country.name


order by country.population desc
;

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

| name | city | country | city size | population |

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

| Wu | Shanghai | China | 17500000 | 1321851888 |

| Lakshmi | Bombay | India | 21600000 | 1129866154 |

| Prabhu | Delhi | India | 21500000 | 1129866154 |

| Susan | New York | United States | 21900000 | 301139947 |

| Richard | Chicago | United States | 9800000 | 301139947 |

| Robert | London | United Kingdom | 12000000 | 60776238 |

| Ali | London | United Kingdom | 12000000 | 60776238 |

| Sohyoung | Seoul | South Korea | 23400000 | 49044790 |

+
----------
+
----------
+
----------------
+
-----------
+
------------
+

8 rows in set (0.00 sec)

JOINS

12

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

13

BioPerl


Object Oriented Programming (OOP)


Intro to BioPerl


Bio::Seq


Bio::SeqIO


Bio::DB


Bio::Perl


Others


BioPerl OOP

Perl Modules


A Module is a file containing Perl code


The file should have a .pm extension



Code within a module is brought into a program with the use
statement:


use moduledir::module;


where moduledir is the directory in which the module file exists.


the moduledir can be omitted if module is in current directory or in
the standard Perl directories (specified by the @INC list)



You can access data and subroutines by using “full name”

use myModule;

$myModule::counter = 3;

myModule::translate($dna);

BioPerl OOP

14

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)


Object
-
oriented programming (OOP) is a software engineering
technique for modularizing code



Think about a program as a set of interacting
objects



Objects have
attributes


data that describes the object



Objects also have
actions

(or
methods
)


code that controls how an object interacts with other objects



Main ideas:


Hide the details of an object’s data


so it can be changed easily (by the developer)


users don’t need to know excess details


All interactions are performed via the object’s methods


BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

15

Object Oriented Programming

BioPerl OOP

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

16

Basics of Perl Objects


In
Procedural Programming
, data and functions to operate on data
that are kept separate. Example:

$reverse_complement = revcom( $sequence );



In
Object
-
Oriented Programming
, both data and functions are
encapsulated into the same construct (called objects). Example:

$reversed_obj = $seq_obj
-
>revcom();





$seq_obj

is called an
object
, and
revcom
is called a
method



In other words, the object
$seq_obj

“knows” how to calculate and
return its own reverse complement.

BioPerl OOP

Basics of Perl Objects


Objects are separated into types called
classes


The class of an object defines both the data that it can hold
and the methods that it knows.



A specific object that has a defined class is referred to as an
instance

of that class.



Perl uses the term "module" rather than "class".



An object in Perl is a special kind of hash reference.



In Bioperl the data that the object contains is stored in a
single, complex hash and the object, like
$seq_obj
, is a
reference to this hash. In addition, the methods that the
object can use are also stored in this hash as particular kinds
of references.

BioPerl OOP

17

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

18

BioPerl

(www.bioperl.org)


A collection of modules that facilitates the development of Perl scripts
for bioinformatics applications


Some Simple Procedural Functions


Bio::Perl



Sequences


Bio::Seq


Bio:SeqIO


Bio::PrimarySeq


Multiple Sequence Alignments


Bio::SimpleAlign


Access to Databases


Bio::DB::GenBank


Access to Tools


Bio::Tools::Blast


Bio::Tools::BPlite

See Table 12.1 of Tisdall

BioPerl

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

19


# these methods return strings, and accept strings in some cases:


$seqobj
-
>seq(); # string of sequence

$seqobj
-
>subseq(5,10); # part of the sequence as a string

$seqobj
-
>accession_number(); # when there, the accession number

$seqobj
-
>alphabet(); # one of 'dna','rna',or 'protein'

$seqobj
-
>seq_version() # when there, the version

$seqobj
-
>keywords(); # when there, the Keywords line

$seqobj
-
>length() # length

$seqobj
-
>desc(); # description

$seqobj
-
>primary_id(); # a unique id for this sequence regardless


# of its display_id or accession number

$seqobj
-
>display_id(); # the human readable id of the sequence


Most of these features are included in GenBank records.

Methods on Bio::Seq Objects

BioPerl

Using Bio::Seq


Each Class contains a method called
new

that returns an new object of the class


The
new

function takes arguments that initialize the data belonging to the new object


Once an object is created, you can access all the methods associated with its class by
calling its attached functions


use Bio::Seq; # module to handle sequence data


# create a new sequence object

$seq_obj = Bio::Seq
-
>new(
-
seq => 'ATGGGGGTGGTGGTACCCT',


-
id => 'human_id',


-
accession_number => 'AL000012',


-
alphabet => 'dna');


# the seq() method returns the object's sequence as a string

print "DNA: ", $seq_obj
-
>seq(), "
\
n";



# create a new sequence object contain protein sequence

$protein_obj = $seq_obj
-
>translate();

print "Protein: ", $protein_obj
-
>seq(), "
\
n";



OUTPUT:


DNA: ATGGGGGTGGTGGTACCCT


Protein: MGVVVP

BioPerl

20

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

Using Bio::Seq and Bio::SeqIO

#!/usr/bin/perl

# File: bioseqio.pl

use strict;

use warnings;

use Bio::Seq;

use Bio::SeqIO; # module to read/write many sequence formats


# create a new sequence object

my $seq_obj = Bio::Seq
-
>new(
-
seq => 'ATGGGGGTGGTGGTACCCT',


-
id => 'human_id',


-
accession_number => 'AL000012',


-
alphabet => 'dna');


# create a new FASTA file handler object (NOTE: ">" for writing)

my $seqio_obj = Bio::SeqIO
-
>new(
-
file=>">myseq.fsa",'
-
format'=>'fasta');


# tell the file handler to print the sequence object


$seqio_obj
-
>write_seq($seq_obj);


% cat myseq.fsa

>human_id

ATGGGGGTGGTGGTACCCT

BioPerl

21

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

Reading FASTA file with Bio::SeqIO

#!/usr/bin/perl

# File: read_fasta.pl

use strict;

use warnings;

use Bio::Seq;

use Bio::SeqIO; # module to read/write many sequence formats


my $file = "seq3.fsa"; #this is a FASTA file with 3 seqs


# create a new FASTA file handler object
(no ">" for reading!)

my $seqio_obj = Bio::SeqIO
-
>new(
-
file=> $file,
-
format=>"fasta");


# read sequences from the FASTA file

while (my $seq_obj =
$seqio_obj
-
>next_seq()
) {


my $seq_string = $seq_obj
-
>seq();


print "$seq_string
\
n";

}

OUTPUT:

ATTGGGTGCGCGTGCNCCTTCC

aaaaaatcatactacatgtagggtaca

aaaaaatcatact


BioPerl

22

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

A Sample GenBank Record
-

I

LOCUS NM_007553 1541 bp mRNA
linear ROD 07
-
JAN
-
2002

DEFINITION Mus musculus bone morphogenetic protein 2
(Bmp2), mRNA.

ACCESSION NM_007553

VERSION NM_007553.1 GI:6680793

KEYWORDS .

SOURCE house mouse.


ORGANISM Mus musculus


Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi;


Mammalia; Eutheria; Rodentia; Sciurognathi;
Muridae; Murinae; Mus.

REFERENCE 1 (bases 1 to 1541)


AUTHORS Feng,J.Q., Harris,M.A., Ghosh
-
Choudhury,N.,
Feng,M., Mundy,G.R. and


Harris,S.E.


TITLE Structure and sequence of mouse bone
morphogenetic protein
-
2 gene


(BMP
-
2): comparison of the structures and
promoter regions of BMP
-
2


and BMP
-
4 genes


JOURNAL Biochim. Biophys. Acta 1218 (2), 221
-
224
(1994)


MEDLINE 94289485

REFERENCE 2 (bases 1 to 1541)


AUTHORS Heller,L.C., Li,Y., Abrams,K.L. and
Rogers,M.B.


TITLE Transcriptional regulation of the Bmp2
gene. Retinoic acid


induction in F9 embryonal carcinoma cells
and Saccharomyces


cerevisiae


JOURNAL J. Biol. Chem. 274 (3), 1394
-
1400 (1999)


MEDLINE 99098878


PUBMED 9880512

COMMENT PROVISIONAL REFSEQ: This record has not yet
been subject to final


NCBI review. The reference sequence was
derived from L25602.1.




FEATURES Location/Qualifiers


source 1..1541


/organism="Mus musculus"


/db_xref="taxon:10090"


/chromosome="2"


/map="2 76.1 cM"


/cell_type="spleen"


gene 1..1541


/gene="Bmp2"


/db_xref="LocusID:12156"


/db_xref="MGD:88177"


CDS 357..1541


/gene="Bmp2"


/codon_start=1


/db_xref="LocusID:12156"


/db_xref="MGD:88177"


/product="bone
morphogenetic protein 2"


/protein_id="NP_031579.1"


/db_xref="GI:6680794"


/translation="MVAGTRCLLVLLLPQVLLGGAAGLIPELG
RKKFAAASSRPLSRP


SEDVLSEFELRLLSMFGLKQRPTPSKDVVVPPYMLDLYRRHSG
QPGAPAPDHRLERAA


SRANTVRTFHQLEAVEELPEMSGKTARRFFFNLSSVPSDEFLT
SAELQIFREQIQEAL


GNSSFQHRINIYEIIKPAAANLKFPVTRLLDTRLVNQNTSQWE
SFDVTPAVMRWTTQG


HTNHGFVVEVAHLEENPGVSKRHVRISRSLHQDEHSWSQIRPL
LVTFGHDGKGHPLHK


REKRQAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYH
AFYCHGECPFPLADH


LNSTNHAIVQTLVNSVNSKIPKACCVPTELSAISMLYLDENEK
VVLKNYQDMVVEGCG


CR"



BioPerl

23

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

24

A Sample GenBank Record
-

II

misc_feature 492..1154


/note="TGFb_propeptide; Region:
TGF
-
beta propeptide"


misc_feature 1236..1538


/note="TGFB; Region:
Transforming growth factor
-
beta


(TGF
-
beta) family"


misc_feature 1236..1538


/note="TGF
-
beta; Region:
Transforming growth factor beta


like domain"

BASE COUNT 361 a 444 c 414 g 322 t

ORIGIN


1 cgcactcctc cccctgctcg aggctgtgtg tcagcacttg
gctggagact tcttgaactt


61 gccgggagag tgacttgggc tccccacttc gcgccggtgt
cctcgcccgg cggatccagt


121 cttgccgcct ccagcccgat cacctctctt cctcagcccg
ctggcccacc ccaagacaca


181 gttccctaca gggagaacac ccggagaagg aggaggaggc
gaagaaaagc aacagaagcc


241 cagttgctgc tccaggtccc tcggacagag ctttttccat
gtggagactc tctcaatgga


301 cgtgccccct agtgcttctt agacggactg cggtctccta
aagccgcagg tcgaccatgg


361 tggccgggac ccgctgtctt ctagtgttgc tgcttcccca
ggtcctcctg ggcggcgcgg


421 ccggcctcat tccagagctg ggccgcaaga agttcgccgc
ggcatccagc cgacccttgt


481 cccggccttc ggaagacgtc ctcagcgaat ttgagttgag
gctgctcagc atgtttggcc


541 tgaagcagag acccaccccc agcaaggacg tcgtggtgcc
cccctatatg ctagatctgt


601 accgcaggca ctcaggccag ccaggagcgc ccgccccaga
ccaccggctg gagagggcag


661 ccagccgcgc caacaccgtg cgcacgttcc atcaactaga
agccgtggag gaacttccag


721 agatgagtgg gaaaacggcc cggcgcttct
tcttcaattt aagttctgtc cccagtgacg


781 agtttctcac atctgcagaa ctccagatct
tccgggaaca gatacaggaa gctttgggaa


841 acagtagttt ccagcaccga attaatattt
atgaaattat aaagcctgca gcagccaact


901 tgaaatttcc tgtgaccaga ctattggaca
ccaggttagt gaatcagaac acaagtcagt


961 gggagagctt cgacgtcacc ccagctgtga
tgcggtggac cacacaggga cacaccaacc


1021 atgggtttgt ggtggaagtg gcccatttag
aggagaaccc aggtgtctcc aagagacatg


1081 tgaggattag caggtctttg caccaagatg
aacacagctg gtcacagata aggccattgc


1141 tagtgacttt tggacatgat ggaaaaggac
atccgctcca caaacgagaa aagcgtcaag


1201 ccaaacacaa acagcggaag cgcctcaagt
ccagctgcaa gagacaccct ttgtatgtgg


1261 acttcagtga tgtggggtgg aatgactgga
tcgtggcacc tccgggctat catgcctttt


1321 actgccatgg ggagtgtcct tttccccttg
ctgaccacct gaactccact aaccatgcca


1381 tagtgcagac tctggtgaac tctgtgaatt
ccaaaatccc taaggcatgc tgtgtcccca

1441 cagagctcag cgcaatctcc atgttgtacc
tagatgaaaa tgaaaaggtt gtgctaaaaa


1501 attatcagga catggttgtg gagggctgcg
ggtgtcgtta g

//




BioPerl

#!/usr/bin/perl

#File: readgb.pl


use strict;

use warnings;

use Bio::Seq;

use Bio::SeqIO;

my $inputfile = $ARGV[0];


# create file input object

my $in = Bio::SeqIO
-
>new(
-
file=> $inputfile,
-
format=> 'GenBank'
);



while (my $seq = $in
-
>next_seq()){


# print out parts of the sequence


print "Sequence ",
$seq
-
>id()
,


" with accession ",
$seq
-
>accession_number
,


" and description:
\
n",
$seq
-
>desc
, "
\
n
\
n";

}


Reading GenBank files with Bio::SeqIO


Bio::SeqIO provides methods for reading and printing files in
many file formats: FASTA, GenBank, SwissProt, GCG, EMBL, etc

BioPerl

25

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

26

[binf:~/binf634/bioperl] jsolka% ./readgb.pl genbank_sample.txt

Sequence NM_007553 with accession NM_007553 and description:

Mus musculus bone morphogenetic protein 2 (Bmp2), mRNA.


The Program in Action

BioPerl

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

27

Direct Access to Databases


You can create new Bio::Seq objects by
importing them directly from a remote
database using the Bio::DB family of modules


GenBank:

Bio::DB::GenBank


GenPept

Bio::DB::GenPept


SwissProt

Bio::DB::SwissProt


GDB


Bio::DB::GDB


ACEDB


Bio::DB:Ace

BioPerl

Fetching a Sequence from GenBank

#!/usr/bin/perl

use strict;

use warnings;

use Bio::Seq;

use Bio::DB::GenBank;

# File: getgb.pl


# get GI number from command line

my ($id) = @ARGV;


# create a new GenBank handle object

my $gb = new Bio::DB::GenBank();


# Fetch the sequence from GenBank, returning a seq_object

my $seq_obj = $gb
-
>get_Seq_by_id($id);


# print out parts of the sequence

print "Id = ", $id, "
\
n",


"Accession = ", $seq_obj
-
>accession_number, "
\
n",


"Description = ", $seq_obj
-
>desc, "
\
n",


"Sequence = ", $seq_obj
-
>seq, "
\
n";

BioPerl

28

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

29


[binf:~/binf634/bioperl] jsolka% getgb.pl 1293614

Id = 1293614

Accession = AAA98665

Description = TCP1
-
beta [Saccharomyces cerevisiae].

Sequence =
SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM






Check out this site for more information


http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html


The Program in Action

BioPerl

Bio::Perl has a number of easy to use functions, including



get_sequence
-

gets sequence from standard internet databases


read_sequence
-

reads a sequence from a file


read_all_sequences
-

reads all sequences from a file


new_sequence
-

makes a bioperl sequence from a string


write_sequence
-

writes one or more sequences to a file


translate
-

provides a translation of a sequence


translate_as_string
-

provides translation of a sequence as string


blast_sequence
-

BLASTs sequence against databases at NCBI


write_blast
-

writes a blast report out to a file



Bio::Perl

BioPerl

30

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

The Bio::Perl module provides some simple access functions, for example, this
script will retrieve a GenBank sequence and write it out in two different formats


#!/usr/bin/perl

# File: getgenbank.pl

use strict;

use warnings;

use Bio::Perl;


# this script will only work with an internet connection

# on the computer it is run on

my $seq_object = get_sequence('genbank',"ROA1_HUMAN");


# write the sequence in FASTA format

write_sequence(">roa1.fasta",'fasta',$seq_object);


# write the sequence in GenBank format

write_sequence(">roa1.genbank",'genbank',$seq_object);


exit;

Bio::Perl

BioPerl

31

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

% cat roa1.fasta

>ROA1_HUMAN Heterogeneous nuclear ribonucleoprotein A1 (Helix
-
destabilizing protein

SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVT

YATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHL

RDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKAL

SKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSG

DGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGG

SGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS

SSSSYGSGRRF


% cat roa1.genbank

LOCUS ROA1_HUMAN 371 aa linear HUMAN 01
-
MAR
-
1989

DEFINITION Heterogeneous nuclear ribonucleoprotein A1 (Helix
-
destabilizing


protein) (Single
-
strand binding protein) (hnRNP core protein A1).

ACCESSION P09651

KEYWORDS Nuclear protein; RNA
-
binding; Repeat; Ribonucleoprotein; Methylation;


Phosphorylation; Alternative splicing; 3D
-
structure; Polymorphism.

SOURCE Human


ORGANISM Homo sapiens


Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;


Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

ORIGIN


1 sksespkepe qlrklfiggl sfettdeslr shfeqwgtlt dcvvmrdpnt krsrgfgfvt


61 yatveevdaa mnarphkvdg rvvepkravs redsqrpgah ltvkkifvgg ikedteehhl


121 rdyfeqygki evieimtdrg sgkkrgfafv tfddhdsvdk iviqkyhtvn ghncevrkal


181 skqemasass sqrgrsgsgn fgggrgggfg gndnfgrggn fsgrggfggs rggggyggsg


241 dgyngfgndg gyggggpgys ggsrgygsgg qgygnqgsgy ggsgsydsyn ngggrgfggg


301 sgsnfggggs yndfgnynnq ssnfgpmkgg nfggrssgpy ggggqyfakp rnqggyggss


361 ssssygsgrr f

//

BioPerl

32

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

33

LWP: fetch WWW documents


To automate WWW access


LWP::Simple
-

procedural interface to LWP


Example of usage:

use LWP::Simple;

$url = "http://www.whatever.com/data.html";

$page = get($url);

if ($page)

{ # do something }

else { print "Problems getting $url"; }

BioPerl

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

34

Protein Data Bank


The Protein Data Bank (PDB) is a
repository of protein data structure


http://www.rcsb.org/pdb/home/hom
e.do



The nature of proteins


Primary sequences


Secondary structures


Alpha helices


Beta strands


Turns


Tertiary structures


Beta sheets


Alpha
-
alpha units



Quaternary structures


Enzyme


Virus

PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

35

Overview of PDB


ASCII flat files that are human readable



One structure per file



Problems with consistency of the formats


Code you create today may not work tomorrow


BioPerl

PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

36

Files and Folders


PDB is distributed as files within directories



There are in
-
depth discussions of how to
handle these directory structures within
Chapter 11 of Tisdall


We will talk a little about recursion next week

PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

37

An Example PDB File
-

I

HEADER IMMUNOGLOBULIN 10
-
MAR
-
99
43C9

TITLE CRYSTALLOGRAPHIC STRUCTURE OF THE ESTEROLYTIC AND

TITLE 2 AMIDOLYTIC 43C9 ANTIBODY

COMPND MOL_ID: 1;

COMPND 2 MOLECULE: IMMUNOGLOBULIN (LIGHT CHAIN);

COMPND 3 CHAIN: A, C, E, G;

COMPND 4 FRAGMENT: FV;

COMPND 5 ENGINEERED: YES;

COMPND 6 MOL_ID: 2;

COMPND 7 MOLECULE: IMMUNOGLOBULIN (HEAVY CHAIN);

COMPND 8 CHAIN: B, D, F, H;

COMPND 9 FRAGMENT: FV;

COMPND 10 ENGINEERED: YES

SOURCE MOL_ID: 1;

SOURCE 2 ORGANISM_SCIENTIFIC: MUS MUSCULUS;

SOURCE 3 ORGANISM_COMMON: MOUSE;

SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI;

SOURCE 5 EXPRESSION_SYSTEM_CELL_LINE: BL21(DE3);

SOURCE 6 MOL_ID: 2;

SOURCE 7 ORGANISM_SCIENTIFIC: MUS MUSCULUS;

SOURCE 8 ORGANISM_COMMON: MOUSE;

SOURCE 9 EXPRESSION_SYSTEM: ESCHERICHIA COLI;

SOURCE 10 EXPRESSION_SYSTEM_CELL_LINE: BL21(DE3)

KEYWDS IMMUNOGLOBULIN

EXPDTA X
-
RAY DIFFRACTION


AUTHOR M.M.THAYER,E.D.GETZOFF,V.A.ROBERTS

REVDAT 1 18
-
AUG
-
99 43C9 0

JRNL AUTH
M.M.THAYER,E.H.OLENDER,A.S.ARVAI,C.K.KOIKE,

JRNL AUTH 2
I.L.CANESTRELLI,J.D.STEWART,S.J.BENKOVIC,

JRNL AUTH 3 E.D.GETZOFF,V.A.ROBERTS

JRNL TITL STRUCTURAL BASIS FOR AMIDE HYDROLYSIS
CATALYZED BY

JRNL TITL 2 THE 43C9 ANTIBODY

JRNL REF J.MOL.BIOL. V. 291
329 1999

JRNL REFN ASTM JMOBAK UK ISSN 0022
-
2836
0070

REMARK 1

REMARK 1 REFERENCE 1

REMARK 1 AUTH
V.A.ROBERTS,J.STEWART,S.J.BENKOVIC,E.D.GETZOFF

REMARK 1 TITL CATALYTIC ANTIBODY MODEL AND
MUTAGENESIS IMPLICATE

REMARK 1 TITL 2 ARGININE IN TRANSITION
-

STATE
STABILIZATION

REMARK 1 REF J.MOL.BIOL. V. 235
1098 1994

REMARK 1 REFN ASTM JMOBAK UK ISSN 0022
-
2836
0070

REMARK 1 REFERENCE 2

.

.

.


PDB

An Example PDB File
-

II

REMARK 800

REMARK 800 HIS 91 IN THE VARIABLE LIGHT CHAIN (A,C,E, OR G)
-


REMARK 800 NUCLEOPHILE THAT FORMS A COVALENT BOND TO THE

REMARK 800 SUBSTRATE

REMARK 800

REMARK 800 ARG 96 IN THE VARIABLE LIGHT CHAIN (A,C,E, OR G)
-


REMARK 800 SIDE CHAIN STABILIZES NEGATIVE CHARGE FORMED IN

REMARK 800 THE TRANSITION STATES OF THE REACTION

REMARK 800

REMARK 900

REMARK 900 RELATED ENTRIES

REMARK 900 RELATED ID: 43CA RELATED DB: PDB

DBREF 43C9 A 1 107 GB 207840 M68968 1 113

DBREF 43C9 B 0 112 GB 207840 M68968 127 244

DBREF 43C9 C 1 107 GB 207840 M68968 1 113

DBREF 43C9 D 0 112 GB 207840 M68968 127 244

DBREF 43C9 E 1 107 GB 207840 M68968 1 113

DBREF 43C9 F 0 112 GB 207840 M68968 127 244

DBREF 43C9 G 1 107 GB 207840 M68968 1 113

DBREF 43C9 H 0 112 GB 207840 M68968 127 244

SEQRES 1 A 113 ASP VAL VAL MET THR GLN THR PRO SER SER LEU ALA MET

SEQRES 2 A 113 SER VAL GLY GLN LYS VAL THR MET SER CYS LYS SER SER

SEQRES 3 A 113 GLN SER LEU LEU ASN ILE SER ASN GLN LYS ASN TYR LEU

SEQRES 4 A 113 ALA TRP TYR GLN GLN LYS PRO GLY GLN SER PRO LYS LEU

SEQRES 5 A 113 LEU VAL TYR PHE ALA SER THR ARG GLU SER GLY VAL PRO

SEQRES 6 A 113 ASP ARG PHE ILE GLY SER GLY SER GLY THR ASP PHE THR

SEQRES 7 A 113 LEU THR ILE SER SER VAL GLN ALA GLU ASP GLN ALA ASP

SEQRES 8 A 113 TYR PHE CYS GLN GLN HIS TYR ARG ALA PRO ARG THR PHE

SEQRES 9 A 113 GLY GLY GLY THR LYS LEU GLU ILE LYS

SEQRES 1 B 118 GLY GLN VAL GLN LEU VAL GLU SER GLY PRO GLY LEU
VAL

SEQRES 2 B 118 ALA PRO SER GLN SER LEU SER ILE THR CYS THR VAL
SER

SEQRES 3 B 118 GLY ILE SER LEU SER ARG TYR ASN VAL HIS TRP VAL
ARG

SEQRES 4 B 118 GLN SER PRO GLY LYS GLY LEU GLU TRP LEU GLY MET
ILE

SEQRES 5 B 118 TRP GLY GLY GLY SER ILE GLU TYR ASN PRO ALA LEU
LYS

SEQRES 6 B 118 SER ARG LEU SER ILE SER LYS ASP ASN SER LYS SER
GLN

SEQRES 7 B 118 ILE PHE LEU LYS MET ASN SER LEU GLN THR ASP ASP
SER

SEQRES 8 B 118 ALA MET TYR TYR CYS VAL SER TYR GLY TYR GLY GLY
ASP

SEQRES 9 B 118 ARG PHE SER TYR TRP GLY GLN GLY THR LEU VAL THR
VAL

SEQRES 10 B 118 SER

.

.

.

FORMUL 9 HOH *162(H2 O1)

SHEET 1 A 4 MET A 4 THR A 7 0

SHEET 2 A 4 VAL A 19 LYS A 24
-
1 N LYS A 24 O THR A
5

SHEET 3 A 4 ASP A 70 ILE A 75
-
1 N ILE A 75 O VAL A
19

SHEET 4 A 4 PHE A 62 SER A 67
-
1 N SER A 67 O ASP A
70

SHEET 1 B 5 SER A 10 SER A 14 0

SHEET 2 B 5 THR A 102 LYS A 107 1 N LYS A 103 O LEU A
11

SHEET 3 B 5 ALA A 84 GLN A 90
-
1 N TYR A 86 O THR A
102

SHEET 4 B 5 LEU A 33 GLN A 38
-
1 N GLN A 38 O ASP A
85

SHEET 5 B 5 PRO A 44 VAL A 48
-
1 N VAL A 48 O TRP A
35

SHEET 1 C 4 GLN B 3 SER B 7 0

SHEET 2 C 4 LEU B 18 SER B 25
-
1 N SER B 25 O GLN B
3

SHEET 3 C 4 GLN B 77 MET B 82
-
1 N MET B 82 O LEU B
18

.

.

.

PDB

38

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

39

An Example PDB File
-
III

ATOM 1 N ASP A 1 7.422 50.761 33.806
1.00 64.60 N

ATOM 2 CA ASP A 1 8.287 50.204 34.870 1.00 77.57 C

ATOM 3 C ASP A 1 8.848 48.834 34.503 1.00 77.32 C

ATOM 4 O ASP A 1 9.752 48.357 35.183 1.00 82.14 O

ATOM 5 CB ASP A 1 7.507 50.107 36.205 1.00 86.78 C

ATOM 6 CG ASP A 1 8.392 49.691 37.423 1.00 92.57 C

ATOM 7 OD1 ASP A 1 9.630 49.891 37.411 1.00 93.28 O

ATOM 8 OD2 ASP A 1 7.826 49.183 38.422 1.00 96.45 O

ATOM 9 N VAL A 2 8.333 48.184 33.458 1.00 70.66 N

ATOM 10 CA VAL A 2 8.864 46.859 33.118 1.00 63.77 C

ATOM 11 C VAL A 2 10.397 47.012 32.964 1.00 58.68 C

ATOM 12 O VAL A 2 10.879 47.598 32.010 1.00 62.61 O

ATOM 13 CB VAL A 2 8.149 46.194 31.855 1.00 62.22 C

.

.

.

CONECT 6928 6353

MASTER 381 0 0 13 80 0 4 24 7254 8 16 76

END

PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

40

PDB File Format
-

I


In
-
depth discussions are available here


http://www.wwpdb.org/docs.html


PDB files are composed of 80 columns that begin
with one of several record names and end with a
newline


Blank columns are padded with spaces


A record type is one or more lines with the same
record name


Each record type has a different number of fields
with them


PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

41

PDB File Formats
-
II


Record Types


DBREF


Reference to the entry in the sequence database


SEQADV


Identification of conflicts between pdb and the named
sequence database


SEQRES


Primary sequence of backbone residues


MODRES


Identification of modification to standard residues

PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

42

Our Example PDB File


The
DBREF Section


DBREF 43C9 A 1 107 GB 207840 M68968 1 113

DBREF 43C9 B 0 112 GB 207840 M68968 127 244

DBREF 43C9 C 1 107 GB 207840 M68968 1 113

DBREF 43C9 D 0 112 GB 207840 M68968 127 244

DBREF 43C9 E 1 107 GB 207840 M68968 1 113

DBREF 43C9 F 0 112 GB 207840 M68968 127 244

DBREF 43C9 G 1 107 GB 207840 M68968 1 113

DBREF 43C9 H 0 112 GB 207840 M68968 127 244


This tells us there is a PDB file called pdb43cp.ent and that the
residues in chain A are labeled from 1 to 107 in the original GenBank
database and the ID number is 207840 and the name m68968 are
assigned to that database and the residues are numbered 1 to 113 in
PDB


PDB

Our Example PDB File


The SEQRES Section

SEQRES 1 A 113 ASP VAL VAL MET THR GLN THR PRO SER SER LEU ALA MET

SEQRES 2 A 113 SER VAL GLY GLN LYS VAL THR MET SER CYS LYS SER SER

SEQRES 3 A 113 GLN SER LEU LEU ASN ILE SER ASN GLN LYS ASN TYR LEU

SEQRES 4 A 113 ALA TRP TYR GLN GLN LYS PRO GLY GLN SER PRO LYS LEU

SEQRES 5 A 113 LEU VAL TYR PHE ALA SER THR ARG GLU SER GLY VAL PRO

SEQRES 6 A 113 ASP ARG PHE ILE GLY SER GLY SER GLY THR ASP PHE THR

SEQRES 7 A 113 LEU THR ILE SER SER VAL GLN ALA GLU ASP GLN ALA ASP

SEQRES 8 A 113 TYR PHE CYS GLN GLN HIS TYR ARG ALA PRO ARG THR PHE

SEQRES 9 A 113 GLY GLY GLY THR LYS LEU GLU ILE LYS






Col

Data Type

Field

Def

1
-
6

Record name

“SEQRES”

9
-
10

Integer

serNum

Serial # of the SEQRES rec
for the current chain. Starts
at 1 and increments by 1.
Resets for a new chain.

12

Character

Chain ID

Chain identifier may be a
single character including a
blank.

14
-
17

Integer

numRes

Number of residues in the
chain. Repeated in every
record.

20
-
22

Residue name

resName

Residue name

24
-
26

Residue name

resName

Residue name

PDB

43

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

44

Using BioPerl to Parse a pdb File

#!/usr/local/bin/perl
-
w


# Parsing PDB reports with BioPerl's
Bio::Structure:IO module


use strict;

use Bio::Structure::IO;


my $inFile;


# Prompt the user for the file name if it's
not an argument

# NOTE: PDB file must be in text (not html)
format

if (! $ARGV[0])

{


print "What is the PDB file to parse? ";



# Get input and remove the newline
character at the end


chomp ($inFile = <STDIN>);

}

else

{


$inFile = $ARGV[0];

}


my $structio = Bio::Structure::IO
-
>new(
-
file
=> $inFile);

my $struc = $structio
-
>next_structure;



for my $chain ($struc
-
>get_chains) {


my $chainid = $chain
-
>id;


# one
-
letter chaincode if present,
'default' otherwise


for my $res ($struc
-
>get_residues($chain)) {


my $resid = $res
-
>id;


# format is 3
-
lettercode
-

dash
-

residue number, e.g. PHE
-
20


my $atoms = $struc
-
>get_atoms($res);


# actually a list of atom objects,
used here to get a count


print join "
\
t",
$chainid,$resid,$atoms,"
\
n";


}

}



BioPerl and PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

45

The Code in Action


./pdb_parse_0.pl pdb43c9.ent >
pdb_parse_0.out

A

ASP
-
1

8


A

VAL
-
2

7


A

VAL
-
3

7


A

MET
-
4

8


A

THR
-
5

7


A

GLN
-
6

9


A

THR
-
7

7


A

PRO
-
8

7


A

SER
-
9

6


A

SER
-
10 6


A

LEU
-
11 8


A

ALA
-
12 5

.

.

.

BioPerl and PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

46

The Code in Action (cont.)

H

GLY
-
106

4


H

THR
-
107

7


H

LEU
-
108

8


H

VAL
-
109

7


H

THR
-
110

7


H

VAL
-
111

7


H

SER
-
112

7


default

HOH
-
1

1


default

HOH
-
2

1


default

HOH
-
3

1


default

HOH
-
4

1


default

HOH
-
5

1

.

.

.

default

HOH
-
440

1


default

HOH
-
449

1


default

HOH
-
454

1


default

HOH
-
456

1


default

HOH
-
463

1


default

HOH
-
469

1


default

HOH
-
472

1


BioPerl and PDB

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

47

BLAST
(http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=W
eb&PAGE_TYPE=BlastHome)


The Basic Local Alignment
Search Tool (BLAST)


Tests a query against a
library of known sequences
to find similarity


Collection of programs


query to database pairs


nucleotide
-
nucleotide


protein
-
nucleotide


protein
-
protein


nucleotid protein



BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

48

BLAST Advantages of a Local
Installation Versus Public Server


Using public server


Databases used by BLAST such as GenBank are
always up to date



Using local server


Proprietary sequences


Sequences that are run time and time again

BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

49

String Matching and Homology


BLAST is basically a string matching program


Biological string matching looks for similarity as an indication of
homology


Similarity between the query and the sequences in the database
may be measured by the percent identity, or the number of
bases in the query that exactly match a corresponding region of
a sequence from the database.


It may be measured by the degree of conservation , which finds
matches between equivalent (redundant) codons or between
amino acid residues with similar properties that don’t alter the
function of a protein


Two sequences are or are not homologous, there’s no degree of
homology



Beginning PERL FOR Bioinformatics, Tisdall


BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

50

Homology


It is difficult to find a single consistent definition of homology



“In genetics, the term "homolog" is used both to refer to a homologous protein,
and to the gene (DNA sequence) encoding it.”


--

Wikipedia



In
-
depth philosophical discussions can be found in



G. P. Wagner, “The Biological Homology Concept,”
Annu. Rev. Ecol. Syst.

1989. 20:51~9.

BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

51

String Matching References


Algorithms on String, Trees, and Sequences:
Computer Science and Computational Biology
, Dan
Gusfield: Cambridge University Press, 1997.


Defacto Standard



R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.
Biological Sequence Analysis: Probabilistic Models of
Proteins and Nucleic Acids.

Cambridge University
Press, 1998.



These two complement each other nicely



BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

52

Outputs of a BLAST Search
-

I


Standard outputs


A set of scores and statistics on the matches based on the
raw score S, various parameters of the scoring algorithm,
and properties of the query and database


The raw score S


A measure of similarity and the size of the match


BLAST lists hits by their E value


The E value (expect) value of a match estimates the chance
that the string match (allowing for gaps) occurs in a
randomly generate database of the same size and
composition.


BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

53

Outputs of a BLAST Search
-

II


The closer the E value is to 0 the less likely
that the match occurred by chance


The better the match



For BLASTN


E value less than 1 may be a solid hit


E value less than 10 may be worth looking at

BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

54

Screen Capture of Example BLAST File Using First 8
Lines (400 Bases) of the File sample.dna From Chapter
8 of Tisdall as a Query Against the nt BLAST Database

BLAST

Some additional tutorial information can be found at

http://www.ncbi.nlm.nih.gov/Class/minicourses/

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

55

A Closer Look at This File

BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

56

Output from Clicking on the First
Line

BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

57

A Little Further Down in the File

BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

58

Another Example BLAST File (Text Version)
-

I

BLASTN 2.2.2 [Jan
-
08
-
2002]



Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro
A. Schaffer,

Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997),

"Gapped BLAST and PSI
-
BLAST: a new generation of protein
database search

programs", Nucleic Acids Res. 25:3389
-
3402.


Query= gi|16972687|gb|BM083928.1|BM083928

imageqc_2_2001/smj416bdff41.x1 Soares_NFL_T_GBC_S1 Homo
sapiens cDNA

clone IMAGE:4597167 3', mRNA sequence


(505 letters)


Database: Non
-
redundant nucleotide
-

Mon Feb 18 09:50:08 EST
2002


1,171,752 sequences; 566,364,212 total letters


Searching..................................................do
ne



Score E

Sequences producing significant alignments:
(bits) Value


ref|NM_017857.1| Homo sapiens slingshot 3 (SSH
-
3), mRNA
957 0.0

ref|XM_040057.1| Homo sapiens slingshot 3 (SSH
-
3), mRNA
957 0.0

dbj|AK000522.1|AK000522 Homo sapiens cDNA FLJ20515 fis, clone
KA... 957 0.0

gb|BC004210.1|BC004210 Homo sapiens, Similar to hypothetical
pro... 954 0.0

gb|BC004176.1|BC004176 Homo sapiens, Similar to hypothetical
pro... 954 0.0

gb|AF085851.1|HUMYI53C10 Homo sapiens full length insert cDNA
cl... 944 0.0

ref|NM_018276.1| Homo sapiens slingshot 3 (SSH
-
3), mRNA
615 e
-
174

dbj|AK001790.1|AK001790 Homo sapiens cDNA FLJ10928 fis, clone
OV... 615 e
-
174

emb|AL353811.12|AL353811 Human DNA sequence from clone RP11
-
482I... 46 0.004

emb|AL035531.17|HS483L3 Human DNA sequence from clone RP3
-
483L3 ... 44 0.014


>ref|NM_017857.1| Homo sapiens slingshot 3 (SSH
-
3), mRNA


Length = 2064



Score = 957 bits (483), Expect = 0.0


Identities = 492/494 (99%), Gaps = 1/494 (0%)


Strand = Plus / Minus




Query: 4
aataaccaaatatgaaaatgtgttttatttctcagtacaaagccagatactgtaagg
cta 63


|||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|||

Sbjct: 2046
aataaccaaatatgaaaatgtgttttatttctcagtacaaagccagatactgtaagg
cta 1987


BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

59

Another Example BLAST File (Text Version)
-

II



Matrix: blastn matrix:1
-
3

Gap Penalties: Existence: 5, Extension: 2

Number of Hits to DB: 1,627,832

Number of Sequences: 1171752

Number of extensions: 1627832

Number of successful extensions: 453460

Number of sequences better than 5.0e
-
02: 10

length of query: 505

length of database: 566,364,212

effective HSP length: 19

effective length of query: 486

effective length of database: 544,100,924

effective search space: 264433049064

effective search space used: 264433049064

T: 0

A: 40

X1: 6 (11.9 bits)

X2: 15 (29.7 bits)

S1: 12 (24.3 bits)

S2: 22 (44.1 bits

BLAST

Parsing BLAST reports with SearchIO

#!/usr/local/bin/perl
-
w


# Parsing BLAST reports with BioPerl's Bio::SearchIO module

# WI Biocomputing course
-

Bioinformatics for Biologists
-

October 2003


# See help at
http://www.bioperl.org/wiki/HOWTO:SearchIO

for
all data that can be extracted


use Bio::SearchIO;


# Prompt the user for the file name if it's not an argument

# NOTE: BLAST file must be in text (not html) format

if (! $ARGV[0])

{


print "What is the BLAST file to parse? ";



# Get input and remove the newline character at the end


chomp ($inFile = <STDIN>);

}

else

{


$inFile = $ARGV[0];

}


$report = new Bio::SearchIO(


-
file=>"$inFile",


-
format => "blast");


print
"QueryAcc
\
tHitDesc
\
tHitSignif
\
tHSP_rank
\
t
\
%ID
\
teValue
\
t
HSP_length
\
n";



# Go through BLAST reports one by one

while($result = $report
-
>next_result)

{


# Go through each each matching sequence


while($hit = $result
-
>next_hit)


{


# Go through each each HSP for this
sequence


while ($hsp = $hit
-
>next_hsp)


{


# Print some tab
-
delimited data
about this HSP




print $result
-
>query_accession,
"
\
t";


print $hit
-
>description, "
\
t";


print $hit
-
>significance, "
\
t";


print $hsp
-
>rank, "
\
t";


print $hsp
-
>percent_identity, "
\
t";


print $hsp
-
>evalue, "
\
t";


print $hsp
-
>hsp_length, "
\
n";


}


}

}


BioPerl and BLAST

60

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

61

Output from blast_parse_0.pl

QueryAcc

HitDesc

HitSignif

HSP_rank

%ID

eValue

HSP_length

BM083928.1

Homo sapiens slingshot 3 (SSH
-
3), mRNA

0.0

1

99.5951417004049

0.0

494

BM083928.1

Homo sapiens slingshot 3 (SSH
-
3), mRNA

0.0

1

99.5951417004049

0.0

494

BM083928.1

Homo sapiens cDNA FLJ20515 fis, clone KAT09889

0.0

1

99.5951417004049

0.0

494

BM083928.1

Homo sapiens, Similar to hypothetical protein FLJ10928, clone MGC:4436 IMAGE:2958967, mRNA,
complete cds

0.0

1

99.5934959349593

0.0

492

BM083928.1

Homo sapiens, Similar to hypothetical protein FLJ10928, clone MGC:2772 IMAGE:2958967, mRNA,
complete cds

0.0

1

99.5934959349593

0.0

492

BM083928.1

Homo sapiens full length insert cDNA clone YI53C10

0.0

1

99.5893223819302

0.0

487

BM083928.1

Homo sapiens slingshot 3 (SSH
-
3), mRNA

1e
-
174

1

99.685534591195

1e
-
174

318

BM083928.1

Homo sapiens slingshot 3 (SSH
-
3), mRNA

1e
-
174

2

100

6e
-
48

98

BM083928.1

Homo sapiens slingshot 3 (SSH
-
3), mRNA

1e
-
174

3

98.1132075471698

1e
-
18

53

BM083928.1

Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN
PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16)

1e
-
174

1

99.685534591195

1e
-
174

318

BM083928.1

Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN
PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16)

1e
-
174

2

100

6e
-
48

98

BM083928.1

Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN
PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16)

1e
-
174

3

98.1132075471698

1e
-
18

53

BM083928.1

Human DNA sequence from clone RP11
-
482I10 on chromosome 9 Contains STSs and GSSs, complete
sequence [Homo sapiens]

0.004

1

100

0.004

23

BM083928.1

Human DNA sequence from clone RP3
-
483L3 on chromosome 6p25.1
-
25.3 Contains STSs and GSSs,
complete sequence [Homo sapiens]

0.014

1

100

0.014

22


BioPerl and BLAST

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

62

Summary


BioPerl offer several modules that make processing sequence data
convenient



Can fetch sequence data and GenBank record directly from GenBank



More info


Online documentation:

% perldoc Bio::Perl

% perldoc Bio::DB::GenBank

% perldoc Bio::SeqIO


http://www.bioperl.org/wiki


http://www.bioperl.org/wiki/HOWTOs

BioPerl

BINF 634 FALL 2011
-

Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)

63

Homework


Read Chapter 12


Exercise 12.4


Exercise 12.6