Ensembl Compara Perl API


13 Dec 2013



Stephen Fitzgerald

http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/

EBI - Wellcome Trust Genome Campus, UK



What is Ensembl Compara?

A single database which contains precalculated comparative genomics data

Access via the Perl API and MySQL

A production system for generating that database (not in this presentation)


Compara data

Protein sequences

Raw genomic sequence

Whole genome alignments (tBLAT, BlastZ-net, PECAN)

46 species in Ensembl release 52

Syntenic regions (based on BlastZ-net)

Raw protein alignments

Protein family clusters

Protein trees

Gene orthology / paralogy predictions

Compara database & the Ensembl
core databases

Since there is minimal primary data inside Compara, external links with the core DBs must be re-established to gain full access to the data

Example: compara_52 must be linked with the Ensembl core_52 databases

Proper REGISTRY configuration is critical

load_registry_from_db is probably the best choice here


The Compara Perl API

Written in Object-Oriented Perl

Used to retrieve data from and store data into the ensembl-compara database

Generalized to extend to non-Ensembl genomic data (Uniprot)

Follows the same 'Data Object' & 'Object Adaptor' DBAdaptor design as the other Ensembl APIs

Compara object model overview

[Diagram: Compara classes grouped as PRIMARY DATA, ANALYSIS, and RESULTS. Classes shown: NCBITaxon, GenomeDB, DnaFrag, Member, MethodLinkSpeciesSet, GenomicAlign, GenomicAlignBlock, SyntenyRegion, DnaFragRegion, Homology, Family, Attribute, ProteinTree, AlignedMember]

Primary data



GenomeDB: relates to a particular Ensembl core DB
  name(), assembly(), genebuild(), taxon()
  fetch_by_name_assembly(), fetch_by_registry_name(), fetch_by_Slice(), fetch_all()

DnaFrag: represents a "top level" SeqRegion
  name(), length(), genome_db(), slice(), coord_system_name()
  fetch_by_Slice(), fetch_by_GenomeDB_and_name()

Member: lists all Ensembl genes + SwissProt + SPTrEMBL
  source_name(), stable_id(), genome_db(), taxon(), sequence(), get_all_peptide_Members(), get_longest_peptide_Member(), gene_member()
  fetch_by_source_stable_id()

Analysis


MethodLinkSpeciesSet provides a handle to isolate specific data from the shared tables (homology, genomic_align_block)

MethodLink: each individual analysis in Compara is tagged with a unique name called a method_link_type
  BLASTZ_NET, TRANSLATED_BLAT, PECAN, SYNTENY, FAMILY, ENSEMBL_ORTHOLOGUES, ENSEMBL_PARALOGUES, PROTEIN_TREES

SpeciesSet: the set of species as (a ref. to) an array of GenomeDBs

  fetch_by_method_link_type_GenomeDBs(), fetch_by_method_link_type_registry_aliases()
  name(), method_link_type(), species_set(), source()

Exercises

http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html

GenomeDB
1. Find out the versions of the human and mouse genomes in the database
2. Print the names of all the GenomeDBs in the database

DnaFrag
1. Get the DnaFrag for chromosome 1 of the macaque genome (using a genome_db object as an argument)
2. Get the DnaFrag for chromosome X of the mouse genome (using a core slice object as an argument)

MethodLinkSpeciesSet
1. Find out how many analyses are stored in the database
2. Get the name of the MethodLinkSpeciesSet corresponding to the BlastZ-net analysis for human and mouse
3. Get the names of all the species using the mlss corresponding to the Pecan analyses


GenomeDB example code

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $reg = "Bio::EnsEMBL::Registry";

# Connect to the public Ensembl server and register the available databases
$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

# The compara database is registered under the "Multi" species name
my $genome_db_adaptor = $reg->get_adaptor(
    "Multi", "compara", "GenomeDB");

my $genome_db = $genome_db_adaptor->
    fetch_by_registry_name("human");

print "Name :", $genome_db->name, "\n";
print "Assembly :", $genome_db->assembly, "\n";
print "GeneBuild :", $genome_db->genebuild, "\n";

GenomeDB example code

$> perl genome_db1.pl


Homo sapiens NCBI36 2006-08-Ensembl
Mus musculus NCBIM36 2006-04-Ensembl


DnaFrag example code

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $reg = "Bio::EnsEMBL::Registry";

$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

# A GenomeDB is needed to say which genome the DnaFrag belongs to
my $genome_db_adaptor = $reg->get_adaptor(
    "Multi", "compara", "GenomeDB");

my $genome_db = $genome_db_adaptor->
    fetch_by_registry_name("human");

my $dnafrag_adaptor = $reg->get_adaptor(
    "Multi", "compara", "DnaFrag");

# Fetch human chromosome 13
my $dnafrag = $dnafrag_adaptor->
    fetch_by_GenomeDB_and_name($genome_db, "13");

print "Name :", $dnafrag->name, "\n";
print "Length :", $dnafrag->length, "\n";
print "CoordSystem :", $dnafrag->coord_system_name, "\n";

DnaFrag example code

$> perl test1.pl

Name :13

Length :114142980

CoordSystem :chromosome


MethodLinkSpeciesSet example code

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $reg = "Bio::EnsEMBL::Registry";

$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

my $mlssa = $reg->get_adaptor("Multi", "compara",
    "MethodLinkSpeciesSet");

# Select the BlastZ-net analysis for the human/mouse pair
my $mlss = $mlssa->
    fetch_by_method_link_type_registry_aliases(
        "BLASTZ_NET", ["human", "mouse"]);

print $mlss->name, "\n";
print "type: ", $mlss->method_link_type, "\n";

# species_set() returns a reference to an array of GenomeDB objects
my $species_set = $mlss->species_set();

foreach my $this_genome_db (@$species_set) {
    print $this_genome_db->name(), "\n";
}

MethodLinkSpeciesSet example code

$> perl method_link_species_set.pl

H.sap-M.mus blastz-net (on H.sap)

Genomic Alignments


BlastZ-Net
  used to compare closely related pairs of species
  BlastZ-raw -> BlastZ-chain -> BlastZ-net

Translated BLAT
  used to compare more distant pairs of species

Pecan
  multiple global alignments
  all vs all coding exons wublastp -> Mercator -> Pecan on each syntenic block

GenomicAlignBlock


GenomicAlignBlock
  represents a genomic alignment
  contains 1 GenomicAlign per sequence
  fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice)
  Methods: method_link_species_set(), score(), length(), perc_id(), get_all_GenomicAligns(), get_SimpleAlign()

GenomicAlign
  dnafrag(), genome_db(), get_Slice(), dnafrag_start(), dnafrag_end(), dnafrag_strand(), aligned_sequence()

GenomicAlignBlock


$all_GAlign = $GABlock->get_all_GenomicAligns()    -> $arrayref
$Simplealign = $GABlock->get_SimpleAlign()         -> $object

$Simplealign: a BioPerl object which contains the whole alignment; it can be printed in various formats using BioPerl modules

$Galign: an object which represents only one of the sequences in the alignment

Hsap.X.1223-1230: ACCTTC-A   <- $ga
Cfam.X.1390-1395: ACC--CGA   <- $ga
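The relationship between an aligned sequence and its original sequence can be illustrated without the API. The following is a minimal sketch in plain Perl, assuming the gapped strings are the kind of value aligned_sequence() returns; the helper names ungap and compare_aligned are hypothetical and not part of the Compara API. It strips the gap characters to recover the original sequence and counts identical, mismatched, and gapped columns for the two example rows shown on the slide.

```perl
use strict;
use warnings;

# Remove gap characters to recover the original (unaligned) sequence.
# Hypothetical helper, not an Ensembl API method.
sub ungap {
    my ($aligned) = @_;
    (my $original = $aligned) =~ tr/-//d;
    return $original;
}

# Compare two gapped strings column by column.
# Returns a hash ref with counts of identical, mismatched and gapped columns.
sub compare_aligned {
    my ($seq_a, $seq_b) = @_;
    die "aligned sequences must have equal length\n"
        unless length($seq_a) == length($seq_b);

    my %count = (identical => 0, mismatch => 0, gapped => 0);
    for my $i (0 .. length($seq_a) - 1) {
        my $base_a = substr($seq_a, $i, 1);
        my $base_b = substr($seq_b, $i, 1);
        if ($base_a eq '-' or $base_b eq '-') {
            $count{gapped}++;
        }
        elsif (uc($base_a) eq uc($base_b)) {
            $count{identical}++;
        }
        else {
            $count{mismatch}++;
        }
    }
    return \%count;
}

# The two example rows from the slide
my ($human, $dog) = ("ACCTTC-A", "ACC--CGA");
my $stats = compare_aligned($human, $dog);

print "human original: ", ungap($human), "\n";
print "dog original:   ", ungap($dog), "\n";
printf "identical=%d mismatch=%d gapped=%d\n",
    @{$stats}{qw(identical mismatch gapped)};
```

With real data, the two strings would come from calling aligned_sequence() on the GenomicAlign objects of one block.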


Synteny


Based on BlastZ-net alignments

SyntenyRegionAdaptor
  fetch_all_by_MethodLinkSpeciesSet_Slice(), fetch_all_by_MethodLinkSpeciesSet_DnaFrag()
  Methods: get_all_DnaFragRegions(), method_link_species_set()

DnaFragRegion
  slice(), dnafrag(), dnafrag_start(), dnafrag_end(), dnafrag_strand()

Exercises

http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html

GenomicAlignBlock
1. Fetch all the BLASTZ_NET alignments between the first 130K nucleotides of the human chromosome X and the mouse genome.
2. Print the exact location of the alignment blocks.
3. Compare the original and the aligned sequences.
4. Find the BLASTZ_NET alignments between human gene BRCA2 and the mouse genome.
5. Print the BLASTZ_NET alignments between the rat gene ECSIT and the mouse genome.
6. Print the PECAN multiple alignments between the rat gene ECSIT and 11 other amniote vertebrates.
7. Print the constrained-element alignments within the rat ECSIT locus (use the constrained elements generated from the 12-way alignments).

Synteny
1. Get the human-mouse syntenic map for human chromosome X.

GenomicAlignBlock example code

[...]

# Region of interest: human chromosome 12, positions 10,000 to 20,000
my $slice_adaptor = $reg->get_adaptor(
    "human", "core", "Slice");
my $slice = $slice_adaptor->
    fetch_by_region("chromosome", "12", 1e4, 2e4);

my $gaba = $reg->get_adaptor("Multi", "compara",
    "GenomicAlignBlock");

# $method_link_species_set is assumed to be fetched in the elided part above
my $genomic_align_blocks = $gaba->
    fetch_all_by_MethodLinkSpeciesSet_Slice(
        $method_link_species_set, $slice);

foreach my $this_gab (@$genomic_align_blocks) {
    # One GenomicAlign per aligned sequence in the block
    my $all_gas = $this_gab->get_all_GenomicAligns();
    foreach my $this_ga (@$all_gas) {
        print $this_ga->genome_db->name(),
            ":", $this_ga->get_Slice()->name(), "\n";
        print $this_ga->aligned_sequence(), "\n";
    }
    print "\n";
}

GenomicAlignBlock example code

$>perl gab.pl

Mus musculus:chromosome:NCBIM37:6:121449987:121450302:-1
CCTCTTAATAAACATTATTGTCAA[…]

Homo sapiens:chromosome:NCBI36:12:19128:19507:1
CCTCTTAATAAGCACACATATCCT[..]
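Each location in this output is a slice name of the form coord_system:version:seq_region:start:end:strand. As a rough sketch of how such a name can be taken apart to report the exact location and length of an aligned region, the plain-Perl example below parses the two names printed above. The helper parse_slice_name is hypothetical; with the API itself, the same values would normally be read from the Slice or GenomicAlign accessors rather than parsed from text.

```perl
use strict;
use warnings;

# Split a slice name into its six colon-separated fields.
# Hypothetical helper, not an Ensembl API method.
sub parse_slice_name {
    my ($name) = @_;
    my @fields = split /:/, $name, -1;
    die "unexpected slice name: $name\n" unless @fields == 6;

    my %loc;
    @loc{qw(coord_system version seq_region start end strand)} = @fields;

    # Coordinates are inclusive, so the length is end - start + 1
    $loc{length} = $loc{end} - $loc{start} + 1;
    return \%loc;
}

# The two slice names from the example output
for my $name ("chromosome:NCBIM37:6:121449987:121450302:-1",
              "chromosome:NCBI36:12:19128:19507:1") {
    my $loc = parse_slice_name($name);
    printf "%s %s:%d-%d (strand %s, %d bp)\n",
        $loc->{version}, $loc->{seq_region},
        $loc->{start}, $loc->{end},
        $loc->{strand}, $loc->{length};
}
```

The mouse region is on the reverse strand (-1), so its aligned sequence is the reverse complement of the forward-strand sequence at those coordinates.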


Synteny example code

[...]

my $synteny_region_adaptor = $reg->get_adaptor(
    "Multi", "compara", "SyntenyRegion");

# The mlss and the slice are assumed to be fetched in the elided part above
my $synteny_regions = $synteny_region_adaptor->
    fetch_all_by_MethodLinkSpeciesSet_Slice(
        $human_mouse_synteny_method_link_species_set,
        $human_slice);

foreach my $this_synteny_region (@$synteny_regions) {
    # One DnaFragRegion per genome in the syntenic block
    my $these_dnafrag_regions =
        $this_synteny_region->get_all_DnaFragRegions();

    foreach my $this_dnafrag_region (@$these_dnafrag_regions) {
        print $this_dnafrag_region->dnafrag->genome_db->name, ": ",
            $this_dnafrag_region->slice->name, "\n";
    }
    print "\n";
}


Homology


(e! 38):
  Orthologue predictions based on 'best reciprocal blast hits'
  Paralogues for a selected set of species
  No global view of the evolutionary history of the gene considered

e! 39+:
  Orthologues and paralogues are inferred from protein trees


Phylogeny: Orthology/Paralogy in one go

BSR: Blast Score Ratio. When 2 proteins P1 and P2 are compared, BSR = scoreP1P2 / max(self-scoreP1 or self-scoreP2). The default threshold used in the initial clustering step is 0.33.
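The ratio is simple enough to sketch in a few lines of Perl. This is an illustration of the formula only, not the Compara pipeline code: the function name blast_score_ratio and the bit scores are hypothetical, and treating a pair as kept when its BSR is greater than or equal to the threshold (rather than strictly greater) is an assumption.

```perl
use strict;
use warnings;

# BSR = score(P1,P2) / max(self-score(P1), self-score(P2))
# Hypothetical helper illustrating the formula from the slide.
sub blast_score_ratio {
    my ($pair_score, $self_score_1, $self_score_2) = @_;
    my $max_self = $self_score_1 > $self_score_2
        ? $self_score_1
        : $self_score_2;
    die "self scores must be positive\n" unless $max_self > 0;
    return $pair_score / $max_self;
}

# Default threshold for the initial clustering step, as stated on the slide
my $THRESHOLD = 0.33;

# Made-up scores, for illustration only
my $bsr = blast_score_ratio(150, 400, 500);

# Assumption: a pair is kept when BSR >= threshold
printf "BSR = %.2f, %s\n", $bsr, ($bsr >= $THRESHOLD ? "kept" : "dropped");
```

Dividing by the larger of the two self-scores normalises the pairwise score by the best score either protein could achieve, so a short protein matching part of a long one yields a low ratio.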

Homology types

Homology


Homology object
  contains 1 pair of Member/Attribute per gene/protein
  fetch_all_by_Member(), fetch_all_by_MethodLinkSpeciesSet(), fetch_all_by_Member_MethodLinkSpeciesSet()
  Methods: method_link_species_set(), description(), subtype(), perc_id(), get_all_Member_Attribute(), get_SimpleAlign()

Family


Compara computes gene family clusters

Runs on all Ensembl transcripts plus all Uniprot/SWISSPROT and Uniprot/SPTREMBL metazoan proteins

The algorithm is based on:
  All vs all blastp
  MCL clustering
  Muscle multiple aligner

Results are stored in the family and family_member tables


Family


Family object
  contains 1 pair of Member/Attribute per gene/protein
  fetch_all_by_Member()
  Methods: method_link_species_set(), description(), description_score(), get_all_Member_Attribute(), get_SimpleAlign()

Exercises

http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html

Members
1. Find the Member corresponding to SwissProt protein O93279
2. Find the Member for the human gene BRCA2
3. Find all the peptide Members corresponding to the human gene CTDP1

Homology
1. Get all the predicted homologues for the human gene BRCA2
2. Get all the mouse orthologues predicted for the human gene CTDP1

Family
1. Get the family predicted for the human gene BRCA2
2. Get the alignments corresponding to the family of the human gene HBEGF

Member example code

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $reg = "Bio::EnsEMBL::Registry";

$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

my $member_adaptor = $reg->get_adaptor(
    "Multi", "compara", "Member");

# Fetch the gene Member by its source name and Ensembl stable ID
my $member = $member_adaptor->
    fetch_by_source_stable_id(
        "ENSEMBLGENE", "ENSG00000000971");

print "All proteins:\n";
my $all_peptide_members = $member->
    get_all_peptide_Members();

foreach my $this_peptide (@$all_peptide_members) {
    print $this_peptide->stable_id(), "\n";
}


Member example code

$> perl test2.pl

All proteins:

ENSP00000356399

ENSP00000356398

ENSP00000352658


Homology example code

[...]

my $ma = $reg->get_adaptor(
    "Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id(
    "ENSEMBLGENE", "ENSG00000000971");

my $homology_adaptor = $reg->get_adaptor(
    "Multi", "compara", "Homology");

my $homologies = $homology_adaptor->
    fetch_all_by_Member($member);

foreach my $this_homology (@$homologies) {
    print $this_homology->description, "\n";

    # Each entry is a [Member, Attribute] pair
    my $member_attributes = $this_homology->
        get_all_Member_Attribute();

    foreach my $this_mem_attr (@$member_attributes) {
        my ($this_member, $this_attribute) =
            @$this_mem_attr;

        print $this_member->genome_db->name, " ",
            $this_member->source_name, " ",
            $this_member->stable_id, "\n";
    }
    print "\n";
}

Family example code

[...]

my $ma = $reg->get_adaptor(
    "Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id(
    "ENSEMBLGENE", "ENSG00000000971");

my $family_adaptor = $reg->get_adaptor(
    "Multi", "compara", "Family");
my $families = $family_adaptor->
    fetch_all_by_Member($member);

foreach my $this_family (@$families) {
    print $this_family->description, "\n";

    # Each entry is a [Member, Attribute] pair
    my $member_attributes = $this_family->
        get_all_Member_Attribute();

    foreach my $this_mem_attr (@$member_attributes) {
        my ($this_member, $this_attribute) =
            @$this_mem_attr;

        # Families include Uniprot members, so the taxon is used for the species name
        print $this_member->taxon->binomial, " ",
            $this_member->source_name, " ",
            $this_member->stable_id, "\n";
    }
    print "\n";
}

Getting More Information


perldoc
  Viewer for inline API documentation.
  shell> perldoc Bio::EnsEMBL::Compara::GenomeDB
  shell> perldoc Bio::EnsEMBL::Compara::DBSQL::MemberAdaptor
  online at: http://www.ensembl.org/

Tutorial document:
  cvs: ensembl-compara/docs/ComparaTutorial.pdf

ensembl-dev mailing list:
  ensembl-dev@ebi.ac.uk

Exercise solutions:
  http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/solutions.html

Ensembl-dev mailing list and HelpDesk

The ensembl-dev mailing list is great for questions about the API and the DB

HelpDesk is very helpful

Give detailed info on what you are trying to do

Check that you have the modules installed ($PERL5LIB pointing to them)
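One quick way to confirm that the modules are visible to Perl is to try loading them. The sketch below is a small stand-alone check; the helper name module_available is hypothetical, and the two module names are simply the ones used earlier in these slides. It reports whether each module can be found on the current search path, which includes the directories listed in $PERL5LIB.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded from the current @INC, else 0.
# Hypothetical helper; the name check keeps the string eval safe.
sub module_available {
    my ($module) = @_;
    die "invalid module name: $module\n"
        unless $module =~ /\A\w+(?:::\w+)*\z/;
    my $ok = eval("require $module; 1");
    return $ok ? 1 : 0;
}

for my $module (qw(Bio::EnsEMBL::Registry Bio::EnsEMBL::Compara::GenomeDB)) {
    printf "%-35s %s\n", $module,
        module_available($module) ? "found" : "NOT found (check PERL5LIB)";
}
```

Including the output of a check like this in a mailing-list or HelpDesk question helps rule out installation problems early.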


Ensembl Team

Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard

Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios

Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)

Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino

VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy

Zebrafish Annotation: Kerstin Jekosch, Mario Caccamo, Ian Sealy

Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerk Vogel, Kevin Howe, Felix Kokocinski, Stephen Rice, Simon White

Comparative Genomics: Javier Herrero, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Albert Vilella, Leo Gordon

Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA)

Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster

Distributed Annotation System (DAS): Eugene Kulesha

BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar

Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl, Andreas Kähäri

A special case of ortholog