XML and Bioinformatics


Oct 2, 2013







Rajvi Shah

What is XML ?



XML stands for eXtensible Markup Language

XML is a markup language, much like HTML

XML was designed to describe data and to focus on what data is.

Features Of XML



XML is an easy, automatically parseable way to describe data

XML allows more flexible and adaptable identification of information

XML is extensible

XML lets us design our own customized markup language



Why XML ?


Data in incompatible formats



Difficulties in Exchanging data


Software- and hardware-independent way of sharing data

XML can be used to store and display data

With XML, data is available to more users


Databases and XML


Database content can be presented in XML

An XML processor can access a DBMS or file system and convert data to XML

A web server can serve content as either XML or HTML
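
The database-to-XML step can be sketched in Python. This is only a minimal illustration using an in-memory SQLite database; the table name `sequences`, its columns, and the helper `rows_to_xml` are hypothetical and do not come from the slides.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, table):
    """Convert every row of `table` into a <row> element under one root."""
    # The table name is interpolated directly, so it must be trusted input.
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    root = ET.Element(table)
    for row in cur:
        row_el = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            # One child element per column, named after the column.
            ET.SubElement(row_el, col).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (id TEXT, type TEXT, residues TEXT)")
conn.execute("INSERT INTO sequences VALUES ('seq1', 'dna', 'ACGT')")
xml_text = rows_to_xml(conn, "sequences")
print(xml_text)
```

A web server could return `xml_text` directly, or transform it to HTML before serving it.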

Why XML For Bioinformatics ?


Biology is a complex discipline


Wide variety of data resources and
repositories


Biological data is represented in multiple formats, e.g. FASTA, AGP, GFF, etc.


No standard protocol exists to interrogate
biological data stores

Why XML for bioinformatics


No standard nomenclature for genomic, proteomic, cheminformatics and other biological data


No standard data format exists to
exchange biological data.


No standard data model exists.


Difficulties in using and exchanging data



Large number of sources

XML Syntax: Elements & Attributes


<?xml version="1.0" encoding="ISO-8859-1"?>
<note date="12/11/2002">
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
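
To make the distinction between elements and attributes concrete, the note document can be read with Python's standard `xml.etree.ElementTree` module. This is a small sketch rather than part of the original slides; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the note document, passed as bytes so that the
# parser honours the encoding named in the XML declaration.
note_xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<note date="12/11/2002">
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

root = ET.fromstring(note_xml)

# Attributes live on the start tag and are read from a dictionary.
note_date = root.attrib["date"]

# Child elements are reached by iterating over the parent element.
fields = {child.tag: child.text for child in root}

print(root.tag, note_date)
print(fields)
```

Here `date` is an attribute of the `note` element, while `to`, `from`, `heading`, and `body` are child elements.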

XML DTD


A file containing a formal definition of the permitted structure of a document


A DTD describes:


What names can be used for element types


Where element types can occur


How element types fit together


The attributes of any element


An Example XML DTD

<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE seq [
  <!ELEMENT seq (dbxref*, residues?) >
  <!ATTLIST seq
    id      ID     #REQUIRED
    name    CDATA  #IMPLIED
    length  CDATA  #IMPLIED >
  <!ELEMENT residues (#PCDATA)>
  <!ATTLIST residues type (dna | rna | aa) #REQUIRED>
  ...
]>
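
The rules this DTD expresses can be illustrated with a hand-written check in Python. Python's standard `xml.etree.ElementTree` parser does not validate against a DTD, so the sketch below is not DTD validation; it only mirrors some of the declarations visible above (required `id`, at most one `residues`, and `type` limited to `dna`, `rna`, or `aa`). The function `check_seq` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Enumerated values from <!ATTLIST residues type (dna | rna | aa) #REQUIRED>
ALLOWED_TYPES = {"dna", "rna", "aa"}


def check_seq(xml_text):
    """Return a list of violations of the rules shown in the example DTD."""
    problems = []
    seq = ET.fromstring(xml_text)
    if seq.tag != "seq":
        problems.append("root element must be <seq>")
    if "id" not in seq.attrib:  # id is #REQUIRED
        problems.append("<seq> is missing required attribute 'id'")
    residues = seq.findall("residues")
    if len(residues) > 1:  # residues? means zero or one
        problems.append("<seq> may contain at most one <residues>")
    for res in residues:
        # A missing type yields None, which is also rejected (#REQUIRED).
        if res.get("type") not in ALLOWED_TYPES:
            problems.append("<residues> type must be dna, rna, or aa")
    return problems


good = '<seq id="s1"><residues type="dna">ACGT</residues></seq>'
bad = '<seq><residues type="protein">MK</residues></seq>'
print(check_seq(good))
print(check_seq(bad))
```

A validating parser would apply the full DTD, including element order and the declarations elided by `...`, which this sketch does not attempt.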

XML Schema (A Better DTD)


Some developers are dissatisfied with XML DTDs


The description of a document’s structure should itself be an XML document, not have its own special syntax


Schemas could then be manipulated with regular XML editing tools


XML DTDs don’t impose enough constraints on data
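
As an illustration of the missing constraints: the example DTD can declare the `length` attribute only as `CDATA`, so any string is accepted there. A schema language with data types can require a number instead. The sketch below shows that kind of check in plain Python, purely to illustrate the gap; `length_is_valid` is a hypothetical helper, not a schema validator.

```python
import xml.etree.ElementTree as ET


def length_is_valid(xml_text):
    """True if <seq length="..."> is absent or a non-negative integer."""
    seq = ET.fromstring(xml_text)
    raw = seq.get("length")
    if raw is None:  # length is #IMPLIED (optional) in the example DTD
        return True
    # Accept ASCII digits only, e.g. "4" but not "four" or "-4".
    return raw.isascii() and raw.isdigit()


# Both documents satisfy the CDATA declaration in the DTD,
# but only the first carries a usable length.
ok_doc = '<seq id="s1" length="4"/>'
bad_doc = '<seq id="s1" length="four"/>'
print(length_is_valid(ok_doc))
print(length_is_valid(bad_doc))
```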

Case Study: BigLab


BigLab is the research department of BigPharma


Business Requirements


Get Data


Align and Analyze sequences


Send to BigPharma’s headquarters


A Piece of XML (an instance of the seq DTD)

<seq id="my_seq" name="NUCLEAR RIBONUCLEOPROTEIN">
  <dbxref>
    <database>SWISS-PROT</database>
    <unique_id>P09651</unique_id>
  </dbxref>
  <residues type="aa">
SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRS
RGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKI
FVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVD
KIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGN
DNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRG
YGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGGSGSNFGGGGSYNDFGNYN
NQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF
  </residues>
</seq>
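
A record in this form is straightforward to convert into other formats. The sketch below reads a `seq` record with Python's standard library and writes it as FASTA. The residues are truncated to the first two lines of the example, and the FASTA header layout (accession followed by name) and the function `seq_to_fasta` are assumptions made for illustration.

```python
import textwrap
import xml.etree.ElementTree as ET

# Shortened copy of the record; only the first two residue lines are kept.
record = """<seq id="my_seq" name="NUCLEAR RIBONUCLEOPROTEIN">
  <dbxref>
    <database>SWISS-PROT</database>
    <unique_id>P09651</unique_id>
  </dbxref>
  <residues type="aa">
    SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRS
    RGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKI
  </residues>
</seq>"""


def seq_to_fasta(xml_text, width=60):
    """Render a <seq> record as FASTA text with lines of `width` residues."""
    seq = ET.fromstring(xml_text)
    # Remove the line breaks and indentation inside <residues>.
    residues = "".join(seq.findtext("residues", default="").split())
    # Prefer the database accession; fall back to the seq id attribute.
    accession = seq.findtext("dbxref/unique_id", default=seq.get("id"))
    header = f">{accession} {seq.get('name', '')}".rstrip()
    return "\n".join([header] + textwrap.wrap(residues, width))


fasta = seq_to_fasta(record)
print(fasta)
```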


Biological XML


Several DTDs have been publicly proposed as XML formats for biological data


GAME: Drosophila Genome Project/Celera


BIOML: ProteoMetrics


BSML: VisualGenomics


CML: OMF


GEML: Gene Expression Data


Summary



1. XML is highly flexible

- It is simple to modify a DTD. XML and DTD files are human-readable and can easily be edited by people with only limited computer skills.


2. XML is Internet-oriented and has very rich capabilities for linking data

- This can be used for interconnecting databases


3. XML provides an open framework for defining standard specifications.

- This is an important point because bioinformatics clearly lacks standardization