Decoding the genome with

whooploafSoftware and s/w Development

Dec 13, 2013 (3 years and 10 months ago)

94 views

Decoding the genome with
perl, XML and SQL

Chris Mungall

Howard Hughes Medical Institute

UC Berkeley

Lawrence Berkeley Lab

Outline


hacking the genome


Data::Stag


nested tag
-
value data and XML


DBIx::DBStag


Relational to XML mapping

CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT

TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT

GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA

TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA

TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG

CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG

CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT

TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT

TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA

ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG

CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG

AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAA
ATG

GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA

TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA

GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA

TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG

GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC

CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT

TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT

TTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGAT
CG

CGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAAC

CCAAAATGGT
GGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTA

ATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGC

GTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCT

CTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGA

GAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAG

AGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTG

TCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGA

ACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAA

TATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTT

TTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGT

TGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATAT

ATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCT

CTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACA

MVADERDDNIFKLPLIRNKFIATLNTAQYM

IAYARVVPTYCANECLSFSVLYYRKPKKTI

HDREREQRRYLDCLLNMIAYARVVPTYCAL

YIMTASHSVLFYRKPKSTMHDRGSRTDI..

Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S

The generic genome browser: a building block for a model organism system database
.

Genome Research

2002
12
:1599
-
1610.

Organism

Genome Size (Bases)

Estimated Genes

Human (
Homo sapiens
)

3 billion

30,000

Laboratory mouse (
M. musculus
)

2.6 billion

30,000

Mustard weed (
A. thaliana
)

100 million

25,000

Roundworm (
C. elegans
)

97 million

19,000

Fruitfly (D. melanogaster)

137 million

14,000

Yeast (
S. cerevisiae
)

12.1 million

6,000

Bacterium (
E. coli
)


4.6 million

3,200

Human immunodeficiency virus (HIV)


9700

9

Use case: a simple object
model for genes

A simple object model for
genes

package Bio::Gene;

use base qw(Bio::Entity);


=head2 symbol



Usage: $gene
-
>symbol(“HGNC”);


$symbol = $gene
-
>symbol;


=cut


sub symbol {


my $self = shift;


$self
-
>{_symbol} = shift if @_;


return $self
-
>{_symbol};

}



package Bio::Gene;

use base qw(Bio::Entity);

use Class::MethodMaker


[


scalar => [qw/symbol taxon/],


array => [qw/phenotypes functions positions/]


];



package Bio::Taxon;

use base qw(Bio::Entity);

use Class::MethodMaker


[


scalar => [qw/id common_name genus species


parent_node/],


];




Perl meta
-
object systems


eg Class::MethodMaker


fun!


Fine for ‘hermetic’ perl


but how do we interoperate with


other languages


databases


XML


No way of “querying” objects

Data::Stag


Objects and XML


perl objects are D(A)G
-
ish


XML is tree like


Why not merge the two?


YAML


alternate exchange format with perl
-
ish semantics


Stag


nested tag
-
value pairs


perl objects with XMLish semantics


XML
-
centric

<dataset>


<gene_set>


<gene>


<symbol>
HGNC
</symbol>


<tax_id>
9606
</tax_id>


<phenotype>
Hemochromatosis
</phenotype>


<phenotype>
Porphyria variegata
</phenotype>


<mol_function>
iron homeostasis
</mol_function>


<position>


<start>
10000
</start>


<end>
5000
</end>


<chromosome>
6
</chromosome>


</position>


</gene>




my $dataset = Data::Stag
-
>parse(“genes.xml”);

my $gene_set = $dataset
-
>get_
gene_set
;

foreach my $gene ($gene_set
-
>get_
gene
) {


next unless $gene
-
>sget_
tax_id

== 9606;


printf “%s
\
n”,


$gene
-
>sget_
symbol
;


printf “ %s
\
n”, foreach $gene
-
>get_
phenotype
;

}

<dataset>


<gene_set>


<gene>


<symbol>
HGNC
</symbol>


<tax_id>
9606
</tax_id>


<phenotype>
Hemochromatosis
</phenotype>


<phenotype>
Porphyria variegata
</phenotype>


<mol_function>
iron homeostasis
</mol_function>


<position>


<start>
10000
</start>


<end>
5000
</end>


<chromosome>
6
</chromosome>


</position>


</gene>

Stag trees


Recursive data structure


node
-
labeled tree


Implemented as nested lists


$node = [$name,$value]


$value = DATA
or

list of $nodes


Equivalent to element
-
only XML

Stag accessors


Tag
-
value pairs


$fooval = $obj
-
>get_FOO; # wantarray=false


$fooval = $obj
-
>sget_FOO;


@foovals = $obj
-
>get_FOO; # wantarray=true


$obj
-
>set_FOO($fooval)


No distinction between single and multi valued
tags


ordered (no hashes)

my $dataset = Data::Stag
-
>parse(“genes.xml”);

my @symbols =


$dataset
-
>find_
symbol
;

<dataset>


<gene_set>


<gene>


<symbol>
HGNC
</symbol>


<tax_id>
9606
</tax_id>


<phenotype>
Hemochromatosis
</phenotype>


<phenotype>
Porphyria variegata
</phenotype>


<mol_function>
iron homeostasis
</mol_function>


<position>


<start>
10000
</start>


<end>
5000
</end>


<chromosome>
6
</chromosome>


</position>


</gene>

my $dataset = Data::Stag
-
>parse(“genes.xml”);

my @symbols =


$dataset
-
>get(‘
gene_set/gene/symbol
’);

<dataset>


<gene_set>


<gene>


<symbol>
HGNC
</symbol>


<tax_id>
9606
</tax_id>


<phenotype>
Hemochromatosis
</phenotype>


<phenotype>
Porphyria variegata
</phenotype>


<mol_function>
iron homeostasis
</mol_function>


<position>


<start>
10000
</start>


<end>
5000
</end>


<chromosome>
6
</chromosome>


</position>


</gene>

my $dataset = Data::Stag
-
>parse(“genes.xml”);

my @genes =


$dataset
-
>where(‘
gene’
,


sub { shift
-
>sget(‘
position/start’
) < 5000});






<dataset>


<gene_set>


<gene>


<symbol>
HGNC
</symbol>


<tax_id>
9606
</tax_id>


<phenotype>
Hemochromatosis
</phenotype>


<phenotype>
Porphyria variegata
</phenotype>


<mol_function>
iron homeostasis
</mol_function>


<position>


<start>
10000
</start>


<end>
5000
</end>


<chromosome>
6
</chromosome>


</position>


</gene>

Attributes and mixed elements

<gene id=“HGNC”>


<desc>


implicated in <b>Hemochromatosis</b>


</desc>

</gene>

<gene>


<@>


<id>HGNC</id>


</@>


<desc>


<.>Implicated in</.>


<b>Hemochromatosis</b>


</desc>

</gene>

my $id =


$gene
-
>get(‘@/id’);


Stag alternate syntaxes

dataset:


gene_set:


gene:


symbol: HGNC


tax_id: 9606


phenotype: Hemochromatosis


phenotype: Porphyria variegata


mol_function: iron homeostasis


position:


start: 10000


end: 5000


chromosome: 6

(dataset


(gene_set


(gene


(symbol "HGNC")


(tax_id "9606")


(phenotype "Hemochromatosis")


(phenotype "Porphyria variegata")


(mol_function "iron homeostasis")


(position


(start "10000")


(end "5000")


(chromosome "6")))))



Stag event model


Like SAX, but more perlish


Data::Stag::BaseGenerator


Data::Stag::BaseHandler


Good for implementing parsers


compatible with SAX


slow!

Data::Stag features


Stag can be used with or without XML


TODO: “strict” mode


Speed issues


Stag queries vs XPATH?

SQL Databases


Relational databases are indispensable
in bioinformatics


scalable with large datasets


expressive modeling


queries


Impedance mismatch problem


relational model and imperative
programming languages are not well
-
matched

Object Relational Mapping


Tangram, Alzabo, Class::DBI, …


Object
-
centric


DBStag is SQL
-
centric

Why perl programmers don’t
like SQL

SELECT *

FROM gene

NATURAL JOIN gene_phenotype AS gp

NATURAL JOIN taxon

WHERE …






gene.
gene_i
d

gene.
symbol

gene.
tax_id

taxon.
name

taxon.
genus

taxon.
species

gp.
gene_i
d

gp. phenotype

1

HGNC

9606

huma
n

Homo

sapiens

1

Hemo
-
chromatosis

1

HGNC

9606

huma
n

Homo

sapiens

1

Porphyria

2

amontillado

7227

fruitfly

Drosophila

melanogast
er

2

abnormal
hatching

2

amontillado

7227

fruitfly

Drosophila

melanogast
er

2

hypoactive


<gene_set>


<gene>


<symbol>
HGNC
</symbol>


<tax_id>
9606
</tax_id>


<taxon>


<name>
human
</name>


<genus>
Homo
</genus>


<species>
sapiens
</genus>


</taxon>


<gene_phenotype>


<phenotype>
Hemochromatosis
</phenotype>


</gene_phenotype>


<gene_phenotype>


<phenotype>
Porphyria variegata
</phenotype>


</gene_phenotype>


</gene>

SELECT * FROM gene NATURAL JOIN gene_phenotype

NATURAL JOIN taxon

USE NESTING (gene_set(gene(taxon)(gene_phenotype)))






relational
-
to
-
XML mappings


Tables mapped to non
-
terminal nodes


Columns mapped to terminal node
(PCDATA)


Table nodes are nested according to:


order in SQL query


USE NESTING clause


Tables aliases add extra level of nesting

Can produce ‘unusual’ XML


<gene_set>


<gene>


<symbol>
HGNC
</symbol>


<tax_id>
9606
</tax_id>


<taxon>


<name>
human
</name>


<genus>
Homo
</genus>


<species>
sapiens
</genus>


</taxon>


<gene_phenotype>


<phenotype>
Hemochromatosis
</phenotype>


</gene_phenotype>


<gene_phenotype>


<phenotype>
Porphyria variegata
</phenotype>


</gene_phenotype>


</gene>

Desired XML?


<gene_set>


<gene>


<symbol>
HGNC
</symbol>


<tax_id>
9606
</tax_id>


<taxon>


<name>
human
</name>


<genus>
Homo
</genus>


<species>
sapiens
</genus>


</taxon>


<phenotype>
Hemochromatosis
</phenotype>


<phenotype>
Porphyria variegata
</phenotype>


</gene>

use DBIx::DBStag;

my $dbh = DBIx::DBStag
-
>connect(“mygenedb”);

my $gene_set = $dbh
-
>selectall_stag(q[

SELECT * FROM gene NATURAL JOIN gene_phenotype

NATURAL JOIN taxon

USE NESTING (gene_set(gene(taxon)(gene_phenotype)))

]);

foreach my $gene (@$gene_set) {


printf “Gene: %s [in %s]
\
n”,


$gene
-
>sget_
symbol
,


$gene
-
>sget(‘
taxon/name’
);


foreach my $gp ($gene
-
>get_
gene_phenotype
) {


printf “ %s
\
n”, $gp
-
>sget_
phenotype
;


}

}

Gene: HGNC [in human]


Hemochromatosis


Poryphria varietagata

Gene: amontillado [in fruitfly]


Abnormal hatching


hypoactive


DBStag Templates

SELECT * FROM gene


NATURAL JOIN gene_phenotype


NATURAL JOIN taxon


NATURAL JOIN gene_function

WHERE


[gene.symbol => %
gene_symbol
%]


[gene.gene_id IN


(SELECT gene_id FROM gene_phenotype


WHERE phenotype => %
phenotype
%)]


[gene.gene_id IN


(SELECT gene_id FROM gene_function


WHERE desc => %
function
%)]

USE NESTING


(gene_set(gene(taxon)


(gene_phenotype)


(gene_function)))





gene_query.stg:

Using templates

use DBIx::DBStag;

my $gene_set =


DBIx::DBStag
-
>selectall_stag(


-
template=>”
gene_query
”,


-
bind=>{function=>”*iron*”,


phenotype=>”*hemo*”});

unix> selectall_xml.pl
-
d mygenedb /
gene_query

\


function=‘*iron*’ phenotype=‘*hemo*’ >
\


genes.xml

Storing data

$gene_set = Data::Stag
-
>parse(“genes.xml”);

$dbh
-
>storenode($gene_set);

unix> stag
-
storenode.pl
-
d mygenedb genes.xml

additional metadata required for


-

many
-
to
-
many relations


-

foreign keys to tables with non
-
matching names

XORT mode


<gene_set>


<taxon op=“lookup” id=“
hsa
”>


<genus>
Homo
</genus>


<species>
sapiens
</genus>


</taxon>


<gene>


<symbol op=“update”>
HGNC
</symbol>


<taxon_id>
hsa
</taxon_id>


<gene_phenotype>


<phenotype>
Hemochromatosis
</phenotype>


</gene_phenotype>


<gene_phenotype>


<phenotype>
Porphyria variegata
</phenotype>


</gene_phenotype>


</gene>

Automatic schema building
from XML dataset


Steps:


pre
-
process XML


run
stag
-
autoddl.pl

on XML


run
stag
-
storenode.pl


see
DBIx::DBStag::Cookbook


When to use:


rapid development


ad
-
hoc analyses

TODO


SOAP bindings


Combinable templates


Address speed issues


Should you use it?


Data::Stag


when you don’t have an existing object model


you prefer to separate behaviour from the data
model


if you like element
-
y XML


DBIx::DBStag


your data naturally ‘nests’


you need access to complex queries in SQL


little or no inheritance

DBStag and web development


SQL Templates separate query logic
from display logic


Rendering HTML:


XSLT


HTML Templating toolkit and Stag objects

DBStag in use


Analysis of gene structure in 12 insects


Building of a data warehouse for the
Gene Ontology


Database of human genetic variation
and disease


Library of templates for many common
bio
-
schemas

Learning more


http://stag.sourceforget.net