PSI-MI minutes

Minutes of the MI track of the PSI meeting

University of Liverpool

15-17th April 2013


PSICQUIC specification 1.3 and SOLR reference implementation 1.3.9 (Marine Dumousseau, EMBL-EBI)

A highly successful Hackathon had been held at the EBI from 28th May to 1st June 2012, involving 10 developers from 7 different groups (BioJS, Cytoscape, DIP, InnateDB, IntAct, MatrixDB, MINT, MPIDB). During this week, the group achieved the indexing of MITAB 2.5, 2.6 and 2.7 using SOLR, developed MIQL 2.7 and discussed the indexing of PSI-MI XML2.5. Since then the PSICQUIC reference implementation has been updated and released with indexing in both SOLR and Lucene. The Lucene reference implementation (1.2.3) should be used even by resources wishing to restrict themselves to MITAB2.5, as this enables exports in multiple formats.

The SOLR reference implementation currently only indexes sequentially, although this could be made a parallel process. The SOAP and REST services can convert MITAB 2.6/2.7 to PSI-XML2.5. A SOLR administrative interface is available to any interested developer.


The use of SOLR enables faceting: the breaking up of search results into multiple categories which can then be counted and enable users to restrict/filter searches. MIQL2.7 fields are indexed but not stored; details of the default fields need to be documented. The indexing ignores parentheses, is case insensitive and discards common English words and empty spaces before and after a word. The search matches an exact word with a white space tokeniser. Missing fields are automatically replaced with ‘-’. Fuzzy searches should be enabled in alias names and annotation fields in the future.
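
The indexing rules listed above can be illustrated with a minimal Python sketch. This is not the SOLR analyser configuration of the reference implementation: the stop-word list and the function names normalise, field_value and matches are hypothetical, chosen only to show the steps described (ignoring parentheses, case insensitivity, whitespace tokenisation, dropping common English words, and writing ‘-’ for missing fields).

```python
# Illustrative only: mimics the indexing rules described in the minutes,
# not the actual SOLR analyser chain of the reference implementation.

# Hypothetical stop-word list; the real list is not given in the minutes.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to"}


def normalise(text):
    """Return the index tokens for a field value."""
    # Ignore parentheses and make matching case insensitive.
    cleaned = text.replace("(", " ").replace(")", " ").lower()
    # str.split() with no argument tokenises on whitespace and also
    # discards empty space before and after each word.
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def field_value(value):
    """Missing fields are written as '-'."""
    return value if value and value.strip() else "-"


def matches(query, text):
    """Exact-word match: every query token must equal an indexed token."""
    indexed = set(normalise(text))
    return all(tok in indexed for tok in normalise(query))


print(normalise("  The Breast Cancer (BRCA2) protein "))
print(field_value(None))
print(matches("brca2", "Breast cancer (BRCA2)"))  # whole word: matches
print(matches("brca", "Breast cancer (BRCA2)"))   # prefix only: no match
```

Because matching is on whole tokens, a prefix such as "brca" does not find "BRCA2"; this is the gap that the proposed fuzzy searching on alias and annotation fields would address.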


Future plans include adding a Sort, enabling faceting, and resolving discrepancies between the SOAP and REST services.


The usage of the additional columns now available in MITAB2.7 was discussed, in an attempt to ensure that this was consistent amongst participating databases. The use of columns 23 and 24 for cross-references, rather than placing these as aliases in the columns reserved for identifiers, was emphasised.
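
As a small illustration of where a consumer would look for these cross-references, the Python sketch below reads columns 23 and 24 from a tab-separated MITAB 2.7 line. The helper names and the example line are made up for illustration, and the naive split on ‘|’ is an assumption that ignores quoted values that may themselves contain that character.

```python
# Sketch: read the cross-reference columns of a MITAB 2.7 line.
# Column numbers are 1-based, as in the minutes.

XREF_A_COL = 23  # cross-references for interactor A
XREF_B_COL = 24  # cross-references for interactor B


def column(fields, number):
    """Return the values of a 1-based MITAB column as a list."""
    raw = fields[number - 1] if number <= len(fields) else "-"
    # '-' marks an empty column; multiple values are separated by '|'.
    # ASSUMPTION: no quoted value contains '|' (a real parser must check).
    return [] if raw == "-" else raw.split("|")


def xrefs(mitab_line):
    """Return (xrefs of interactor A, xrefs of interactor B)."""
    fields = mitab_line.rstrip("\n").split("\t")
    return column(fields, XREF_A_COL), column(fields, XREF_B_COL)


# Made-up 42-column line with placeholder identifiers:
# primary identifiers in columns 1-2, cross-references in columns 23-24.
example = ["-"] * 42
example[0] = "uniprotkb:P11111"
example[1] = "uniprotkb:P22222"
example[22] = "go:GO:0000001|interpro:IPR000001"
example[23] = "go:GO:0000001"
line = "\t".join(example)

print(xrefs(line))
```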


PSICQUIC view and PSICQUIC registry


Users are currently presented with a long list of databases, with no indication of their respective content; it was agreed it would be more useful to arrange them by content, i.e.

- The non-redundant IMEx set
- Internally-curated
- Imported
- Predicted/Text-mining


REST URLs should be added to each entry in the PSICQUIC registry, along with a ‘download all’ button for each entry. A separate download would be required for IMEx records; keeping this in a distinct ftp site should make this possible. A request was made that the reference query made to check that resources are active be filtered out from the count of hits for each PSICQUIC resource; this can be done either by filtering out the referral URL or by ignoring the query ‘count’.

When a user asks the registry to return all the services in the txt format (http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS&format=txt), only the SOAP URL is provided and the user may want to get the REST URL without having to parse the XML results of the registry. Therefore, a new parameter will be available (protocol=SOAP and protocol=REST, default = SOAP for backward compatibility) so when a user requests the txt format, they can ask for the REST URL (Example: http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS&format=txt&protocol=REST).
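
A client-side sketch of the proposed parameter, in Python, may make the intended use clearer. The function names status_url and parse_txt are hypothetical, and the "name=url" layout assumed for the txt output is an assumption rather than something stated in these minutes; only the URL and the parameter values come from the text above.

```python
from urllib.parse import urlencode

REGISTRY = "http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry"


def status_url(protocol=None):
    """Build the registry STATUS request in txt format.

    protocol: None (server default, i.e. SOAP), "SOAP" or "REST".
    """
    params = [("action", "STATUS"), ("format", "txt")]
    if protocol is not None:
        if protocol not in ("SOAP", "REST"):
            raise ValueError("protocol must be 'SOAP' or 'REST'")
        params.append(("protocol", protocol))
    return REGISTRY + "?" + urlencode(params)


def parse_txt(body):
    """Parse 'name=url' lines into a dict (ASSUMED layout of the txt output)."""
    services = {}
    for line in body.splitlines():
        # Split on the first '=' only, so '=' inside the URL is preserved.
        name, sep, url = line.strip().partition("=")
        if sep and name and url:
            services[name] = url
    return services


print(status_url())        # old-style request, SOAP URLs by default
print(status_url("REST"))  # request with the new protocol parameter
```

Omitting the parameter reproduces the existing request unchanged, which is the backward-compatible behaviour described above.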


Clustering MITAB2.6/2.7

The term clustering may be misleading as it is used differently in other fields of biology; alternative names were discussed but no decision was made. The definition of clustering (as given in Marine’s presentation) should be added to the website. It was agreed that it would be sufficient to continue to cluster using the fields agreed for MITAB2.5; the additional fields do not hold data which it would be appropriate to cluster on. The current method relies on the primary identifier for identifying the molecule, so it is important that the primary identifiers are located in the correct columns and not stored as aliases (see http://code.google.com/p/psicquic/wiki/DataDistributionBestPractices). This will however mean that isoforms and post-translationally cleaved chains do not cluster with the canonical sequences (subsequent note: it was agreed that users should also be given the option to cluster at the gene level in future implementations).
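
The consequence for isoforms can be seen in a minimal Python sketch of clustering on primary identifiers. This is a simplification written for these minutes, not the actual clustering code: it assumes that the first value of MITAB columns 1 and 2 is the primary identifier of each interactor, the function names are hypothetical, and the accessions are placeholders.

```python
from collections import defaultdict


def cluster_key(mitab_line):
    """Unordered pair of primary identifiers (MITAB columns 1 and 2)."""
    fields = mitab_line.rstrip("\n").split("\t")
    # ASSUMPTION: the first '|'-separated value is the primary identifier.
    id_a = fields[0].split("|")[0]
    id_b = fields[1].split("|")[0]
    # Sorting makes A-B and B-A fall into the same cluster.
    return tuple(sorted((id_a, id_b)))


def cluster(mitab_lines):
    """Group binary interactions that share the same identifier pair."""
    clusters = defaultdict(list)
    for line in mitab_lines:
        if line.strip() and not line.startswith("#"):
            clusters[cluster_key(line)].append(line)
    return dict(clusters)


lines = [
    "uniprotkb:P11111\tuniprotkb:P22222\t-\t-",
    "uniprotkb:P22222\tuniprotkb:P11111\t-\t-",    # same pair, reversed
    "uniprotkb:P11111-2\tuniprotkb:P22222\t-\t-",  # isoform: own cluster
]
for key, members in cluster(lines).items():
    print(key, len(members))
```

The isoform accession forms its own cluster because the key is the identifier string itself; a gene-level option would need to map each identifier to a gene before building the key.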

We may need in the future to use some of the MITAB 2.7 fields when clustering binary interactions but it does not make sense to store the results in MITAB 2.7 as the hierarchy would be lost, which is very important to understand the content of the columns and not confuse the user. If we decide to cluster some information from MITAB 2.7, we have to store the results in XML or another format that maintains relationships.


MI Clustering (José M. Villaveces)


A tool is being developed that will enable users to cluster data on-the-fly, rather than rely on the static data in resources such as IRefIndex. The MITAB results of a PSICQUIC query are clustered, following a MIQL query via a REST interface. Multiple services can be clustered and downloaded following the same query, with the job currently being kept on the server for 1 week. There is currently a 5MB upload limit, and non-responsive jobs are cancelled after 15 minutes. The job is scored using MIScore and a visualisation tool will be developed in BioJS. Future plans include improving the library to deal with very large amounts of data, developing both a graphical and a programmatic query interface, and enriching the data with pathway information.
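
To show what sending the same MIQL query to several services could look like, here is a hedged Python sketch. It does not describe the clustering tool's own API, which is not given in these minutes. The URL pattern (service base followed by query/, the MIQL string and a format parameter) is an assumption about how PSICQUIC REST services are addressed, and the service names and example.org URLs are placeholders.

```python
from urllib.parse import quote


def query_url(rest_base, miql, fmt="tab25"):
    """Build a REST query URL for one service.

    ASSUMPTION: services answer at <base>query/<MIQL>?format=<fmt>.
    """
    base = rest_base if rest_base.endswith("/") else rest_base + "/"
    # Encode everything, including ':' and spaces, in the MIQL string.
    return base + "query/" + quote(miql, safe="") + "?format=" + fmt


# Placeholder service URLs; real ones would come from the PSICQUIC registry.
SERVICES = {
    "service-one": "http://example.org/one/webservices/current/search/",
    "service-two": "http://example.org/two/webservices/current/search",
}


def urls_for_all(miql, fmt="tab25"):
    """Same MIQL query, one URL per service."""
    return {name: query_url(base, miql, fmt) for name, base in SERVICES.items()}


for name, url in urls_for_all("identifier:P11111 AND species:9606").items():
    print(name, url)
```

The MITAB returned by each URL would then be concatenated and clustered in one job.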


Experiences handling MI files (Manuel Bernal Llinares)

To merge the data, the Teague group have developed a data layer by downloading the ontology into their database, have an alpha version of a JAVA command line tool to import data and have built a MITAB2.7 data mapping algorithm. The need for improved documentation, code samples and the development of critical tools, particularly around the XML format, was highlighted, and it was agreed this needed to be tackled.


JAMI

JAMI is a single JAVA API which operates over both MITAB and XML and is designed to end the current requirement for redundant development over the two formats. Initial test-cases of tool development using JAMI are under development; these are a single syntax checker and a file enricher using web-services. JAMI code is available for review and input (https://code.google.com/p/psimi/source/browse/trunk/psi-jami/).


PSICQUIC XML (Lukasz Salwinski, UCLA)

XPSQ is an extensible PSICQUIC server in which the user directly queries the record store, minimizing lossy transformation stages as transformation is done on indexing rather than on the query result. XPSQ is run time configurable and accepts external XSLT transformations. XML2.5.4 in both compact and expanded form can be indexed, as can compressed files (.zip, .gz). XPSQ is already able to download data in all PSI formats, but additional formats (XGMML, RDF) need to be added. The system requires extensive testing, and statistics on query time and indexing have yet to be gathered, but the ability of XML to deal with multi-protein complexes is an obvious advantage of this system. The code is available for testing and input (https://code.google.com/p/psicquic/source/browse/trunk/psicquic-solr-server).

This model will support faceting but we need to agree on what PSICQUIC method to add in the specification and in which format the faceting should be returned. Once agreed, this will be a PSICQUIC interface specification and all implementations for this specification should provide faceting with this method/format.
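
Since neither the method nor the return format has been agreed, the following Python sketch is only one possible shape, given to make the discussion concrete. The function names facet_counts and restrict, the record fields and the plain dict of counts are all hypothetical; it simply counts results per category and then filters on one of them, as faceting is described in these minutes.

```python
from collections import Counter


def facet_counts(records, field):
    """Count results per value of one field (a missing value counts as '-')."""
    return dict(Counter(rec.get(field, "-") for rec in records))


def restrict(records, field, value):
    """Filter the results to a single facet value."""
    return [rec for rec in records if rec.get(field, "-") == value]


# Made-up search results; field names are illustrative only.
records = [
    {"id": "1", "detmethod": "two hybrid", "type": "physical association"},
    {"id": "2", "detmethod": "pull down", "type": "direct interaction"},
    {"id": "3", "detmethod": "two hybrid", "type": "physical association"},
]

print(facet_counts(records, "detmethod"))
print([rec["id"] for rec in restrict(records, "detmethod", "two hybrid")])
```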


MatrixDB (Sylvie Ricard-Blum)

The database will release a new website in July 2013, and will contain both additional curated data and additional IMEx data accessed through PSICQUIC. mRNA data will be used to generate context-specific networks which are closer to biological reality. A paper has been published (PMID: 23199376) which contains a list of around 300 proteins designated the “core matrisome” and a further approximately 800 “matrisome-associated” proteins.


Curating Complexes

Complex curation at the EBI (Birgit Meldal, EMBL-EBI)

The EBI is establishing an encyclopedia of stable protein complexes, in collaboration with other interested parties such as UniProtKB, Reactome, GO and PDB. In brief, data will be imported from established databases such as Reactome, Gramene, Microme, PDB, ChEMBL and the MIPS databases for human and yeast. Issues include developing a systematic naming of complexes and providing a hierarchical structure for protein complexes in GO, dealing with (open) sets of proteins where the true participant of the complex is one of many possibilities, curation of transient complexes, determination of confidence scores, visualisation of complexes (rather than protein networks) and selection of search terms and filters. A major issue is how to curate complexes consisting of several sub-complexes. The curation manual will be made available on the IntAct website.


Complexes in MatrixDB (Sylvie Ricard-Blum, University of Lyon)

Stable extracellular permanent protein complexes are curated in IntAct and given an EBI-xxxxxxx complex ID, which is then incorporated into MatrixDB as a participant of an interaction of this complex with another biological entity. The ECM complexes and a curation toolbox will be part of the new MatrixDB release (due July 2013). Work is ongoing with Reactome to review pathways on the assembly of collagen fibrils and other multimeric structures and the definition of collagen complexes. PTMs are recorded only when essential for complex formation. All remaining ECM complexes will be mined from UniProtKB and submitted to IntAct for curation.


Cooperative Interactions (Kim de Roey, EMBL-Heidelberg)

The SyBOSS project are working to enable the curation and visualisation of cooperative interactions, which are either allosteric changes or pre-assembly interactions (http://psi-mi-cooperativeinteractions.embl.de). These include allosteric interactions, in which an interactor changes the conformation of the pre-existing protein or complex (e.g. opens a secondary binding site in a different part of the molecule to where the interactor bound), and pre-assembly complexes, which are often only transiently present as they describe the intermediate topology of a larger structure as it is being assembled. The PSI-MI XML2.5.4 schema was extended to describe the relationship of complexes as participants of an interaction or larger complex. There are still some shortcomings with identifying which participants in a complex are interacting with the new participant of a larger complex. This can be achieved if each complex is described de novo within an interaction, but in the longer term it would be preferable to describe each complex as an element that can be an interactor, which can then be recalled from the database each time this complex is part of an interaction, i.e. recalling both the complex as an interactor AND all the information on its topology.



Viewing complex interactions (Colin Combe, Rappsilber Lab, Edinburgh University)

The Rappsilber lab have developed a tool to display crosslinking data in a protein interaction network. The tool is still under development and does not yet input PSI-XML data. It colour-coordinates domains and has residue resolution but can also deal with amino acid ranges as interacting sites. Uncertain interactions are indicated with dashed edges. It has a zoom-in/out function to go from bubble diagram to protein sequence for each interactor. PDB data can be integrated, and observed crosslinks are compared to the distances calculated from the crystal structure to estimate the likelihood of the existence of this crosslink in vivo.
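
The distance comparison can be sketched in a few lines of Python. This is a simplification and not the Rappsilber lab's method: it reduces the likelihood estimate to a yes/no check against a maximum span, the 30 angstrom threshold is a hypothetical value, and the coordinates are made up rather than taken from a PDB entry.

```python
import math

# ASSUMPTION: hypothetical maximum distance (in angstrom) that the
# cross-linker could bridge; the real value depends on the reagent.
MAX_SPAN = 30.0


def crosslink_feasible(coord_a, coord_b, max_span=MAX_SPAN):
    """Return (distance, True if the crosslink fits the structure)."""
    # Euclidean distance between the two residue positions (x, y, z).
    distance = math.dist(coord_a, coord_b)
    return distance, distance <= max_span


# Made-up residue coordinates, as would be read from a crystal structure.
print(crosslink_feasible((0.0, 0.0, 0.0), (3.0, 4.0, 12.0)))   # (13.0, True)
print(crosslink_feasible((0.0, 0.0, 0.0), (30.0, 40.0, 0.0)))  # (50.0, False)
```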


XML3.0

XML2.5.4 does not deal adequately with inferred knowledge; the absolute requirement for MIMIx fields which was built into this model is not appropriate for this data type. There is now a clear use case for XML3.0; the opportunity can also be taken to clean up the 2.5 schema whilst still aiming to ensure backward compatibility.

a. Complexes: no real format to export
    i. Interaction ref is required to define the experiment. Add complexes at interactor level as an additional node, removing the requirement for an experiment. Could make exp-list optional and remove availability/negative/intramolecular.
    ii. Indicate as complex by adding attribute complex=true?
    iii. Add cooperative elements: need interaction ref., add cooperative node.

b. Need to capture protein groups for complexes and APMS expts. New element type ‘protein group’ (or ‘interactor group’), no associated experiment. Interaction type - Interaction group type, interactor ref - interactor group ref.

c. Confidence unit: used to describe confidence type (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI&termId=MI%3A1064&termName=interaction%20confidence) and not a unit, which does not make sense for a confidence. The name will be retained for backward compatibility but the documentation should be updated.

d. Experiment-confidence: the experiment confidence is not actively used but there may be one possible use case in which the detection method remains the same but the author gives a confidence to the experiment because some conditions changed; however there is no known example of this.

e. Publication element: the publication description is at the level of the bibRef element. It can accept either Xref OR attribute but not both, so we currently add publication annotations to experiment when we have a pubmed id. It would be good to add the publication annotations in the attributes of bibRef and for that we need the bibRef element to accept both Xref AND attributes.

f. Compact vs expanded: usage of compact or expanded cannot be enforced at the schema level. Both versions of the schema are in use.

g. Expt-ref: appears throughout schema but is never used and forces reading of entire file (expanded format). Could be first deprecated to see if anyone complains.

h. Participant experimental interactor: never used, could be marked as deprecated.

i. Mutation: position and effect can be systematically captured but not the actual change. Could use an attribute with a CV term ‘resulting sequence’ and optional cross-reference list or an optional element within feature range.

j. Stoichiometry: currently a comment. Should become an attribute on Participant. Currently a positive integer; decimals and a range should be allowed, or minimum value and maximum value.

k. Can feature detection become a list (use case: PTM identified by mass spec, confirmed by western blot)? Issue with backwards compatibility.

l. Need to capture results of an interaction: feature attribute.
    a. Add annotation <feature attribute name> - this is not really an improvement on current method (annotation topic: resultant PTM)
    b. Add “resulting feature list”: change to schema but would not break parser.

m. Feature ranges: negative positions need to be allowed to enable description of promoter regions.
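
Item j above asks for stoichiometry to accept decimals and ranges as well as positive integers. A minimal Python sketch of one way a parser could model this follows; the Stoichiometry class, the parse_stoichiometry function and the "2-4" text notation for a range are hypothetical and not part of any agreed schema. Every value is stored as a minimum and a maximum, so a single number is simply a range whose two ends are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stoichiometry:
    minimum: float
    maximum: float


def parse_stoichiometry(text):
    """Parse '2', '0.5' or '2-4' (hypothetical notation) into a range."""
    parts = text.strip().split("-")
    if len(parts) == 1:
        low = high = float(parts[0])
    elif len(parts) == 2:
        # float('') raises ValueError, so a leading '-' is rejected here.
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError("expected a number or a 'min-max' range")
    if low < 0 or high < low:
        raise ValueError("stoichiometry must satisfy 0 <= min <= max")
    return Stoichiometry(low, high)


for value in ("2", "0.5", "2-4"):
    print(value, parse_stoichiometry(value))
```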