PSI-MI minutes


Dec 14, 2013


Minutes of the MI track of the PSI meeting

University of Liverpool


April 2013

PSICQUIC specification 1.3 and SOLR reference implementation 1.3.9

(Marine Dumousseau, EMBL)

A highly successful Hackathon had been held at the EBI from 28th May to 1st June 2012, involving 10
developers from 7 different groups (BioJS, Cytoscape, DIP, InnateDB, IntAct, MatrixDB, MINT,
MPIDB). During this week, the group achieved the indexing of MITAB 2.5, 2.6 and 2.7 using SOLR,
developed MIQL 2.7 and discussed the indexing of PSI-MI XML2.5. Since then the PSICQUIC
reference implementation has been updated and released [...] indexed using both SOLR and Lucene.
The Lucene reference implementation (1.2.3) should be used even by resources wishing to restrict
themselves to MITAB2.5 as this enables exports in multiple formats.

The SOLR reference implementation currently only indexes sequentially although this could be made a
parallel process. [...] and REST can convert MITAB 2.6/2.7 to XML2.5. A SOLR administrative
interface is available to any interested developer.

The use of SOLR enables faceting: the breaking up of search results into multiple categories which
can then be counted and enable users to restrict/filter searches. MIQL2.7 fields are indexed but not
[...]; details of the default fields need to be documented. The indexing ignores parentheses, is
case insensitive and discards common English words and empty spaces before and after a word. The
search matches an exact word with a white space tokeniser. Missing fields are automatically
replaced with ‘-’. Fuzzy searches should be enabled in alias names and annotation fields in the
future.
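
As a rough illustration of the indexing behaviour and faceting described above, here is a minimal
Python sketch. It is not the SOLR configuration used by the reference implementation: the stop-word
list, the function names and the sample records are assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop-word list; the list actually used by the index is not given in these minutes.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}


def normalise(value):
    """Return the list of index tokens for one field value."""
    if value is None or value.strip() == "":
        return ["-"]  # missing fields are replaced with '-'
    # Parentheses are ignored and matching is case insensitive.
    cleaned = value.replace("(", " ").replace(")", " ").lower()
    # str.split() with no argument splits on white space and drops leading and
    # trailing blanks, which mimics a white space tokeniser.
    return [token for token in cleaned.split() if token not in STOP_WORDS]


def facet_counts(records, field):
    """Count how many records fall into each category of `field`."""
    return Counter(record.get(field) or "-" for record in records)


# Invented sample records, for illustration only.
records = [
    {"source": "IntAct", "type": "physical association"},
    {"source": "MINT", "type": "direct interaction"},
    {"source": "IntAct", "type": "direct interaction"},
]
print(normalise("  Regulation of (BRCA2) "))
print(facet_counts(records, "source"))
```

The facet counts are what a user interface would display next to each category so that a search
can be restricted to, for example, a single source database.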

Future plans include adding a Sort, enabling faceting, and resolving discrepancies between the SOAP
and REST services.

The usage of the additional columns now available in MITAB2.7 was discussed, in an attempt to
ensure that this was consistent amongst participating databases. The use of columns 23 and 24 for
references, rather than placing these in the columns reserved for identifiers and aliases, was [...]

PSICQUIC view and PSICQUIC registry

Users are currently presented with a long list of databases, with no indication of their respective
content. It was agreed it would be more useful to arrange by content, i.e.

The non-redundant IMEx set

REST URLs should be added to each entry in the PSICQUIC registry, along with a ‘download all’ button
for each entry. A separate download would be required for IMEx records; keeping this in a distinct
ftp site should make this possible. A request was made that the reference query made to check that
resources are active be filtered out from the count of hits for each PSICQUIC resource; this can be
done either by filtering out the referral URL or ignoring the query ‘count’.

When a user asks the registry to return all the services in the txt format, only the SOAP URL is
provided, and the user may want to get the REST URL without having to parse the XML results of the
registry. Therefore, a new parameter will be available (protocol=SOAP and protocol=REST, default =
SOAP for backward compatibility) so that when a user requests the txt format, they can ask for the
REST URL (Example: [...]).
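
A minimal sketch of how a client might use the proposed parameter, assuming it is passed as an
ordinary query-string argument. The base URL and the `action` and `format` parameter names are
assumptions that should be checked against the registry documentation, and the `name=url` layout of
the txt output is likewise assumed.

```python
from urllib.parse import urlencode

# Assumed base URL of the PSICQUIC registry; verify before use.
REGISTRY_URL = "http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry"


def registry_url(protocol="SOAP", action="ACTIVE", fmt="txt"):
    """Build a registry request; SOAP stays the default for backward compatibility."""
    if protocol not in ("SOAP", "REST"):
        raise ValueError("protocol must be 'SOAP' or 'REST'")
    query = urlencode({"action": action, "format": fmt, "protocol": protocol})
    return f"{REGISTRY_URL}?{query}"


def parse_txt_listing(text):
    """Parse an assumed 'name=url' per-line txt listing into a dictionary."""
    services = {}
    for line in text.splitlines():
        if "=" in line:
            # Split on the first '=' only, since URLs may contain '=' themselves.
            name, url = line.split("=", 1)
            services[name.strip()] = url.strip()
    return services


print(registry_url("REST"))
```

Keeping SOAP as the default means that existing clients which do not send the parameter continue
to receive the same listing as before.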

Clustering MITAB2.6/2.7

The term clustering may be misleading as it is used differently in other fields of biology;
alternative names were discussed but no decision made. The definition of clustering (as given in
Marine’s presentation) should be added to the website. It was agreed that it would be sufficient to
continue to cluster using the fields agreed for MITAB2.5; the additional fields do not hold data
which it would be appropriate to cluster on. The current method relies on the primary identifier for
identifying the molecule so it is important that the primary identifiers are located in the correct
columns, and not stored as aliases (see [...]). This will however mean that isoforms and
post-translationally cleaved chains do not cluster with the canonical sequences (subsequent note:
it was agreed that users should also be given the option to cluster at the gene level in future
implementations).
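
A minimal sketch of clustering on primary identifiers, assuming tab-separated MITAB lines in which
columns 1 and 2 hold the primary identifiers of interactors A and B. It is not the clustering code
discussed at the meeting; the function name and the shortened sample lines are made up for
illustration.

```python
from collections import defaultdict


def cluster_by_primary_id(mitab_lines):
    """Group binary interactions sharing the same unordered pair of primary identifiers."""
    clusters = defaultdict(list)
    for line in mitab_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        columns = line.split("\t")
        if len(columns) < 2:
            continue  # not a usable interaction line
        # Sorting makes A-B and B-A fall into the same cluster.
        key = tuple(sorted((columns[0], columns[1])))
        clusters[key].append(line)
    return dict(clusters)


# Invented, truncated lines: real MITAB rows carry many more columns.
lines = [
    "uniprotkb:P12345\tuniprotkb:Q67890\tintact:EBI-1",
    "uniprotkb:Q67890\tuniprotkb:P12345\tmint:MINT-2",
    "uniprotkb:P12345-2\tuniprotkb:Q67890\tintact:EBI-3",  # isoform identifier
]
for pair, members in cluster_by_primary_id(lines).items():
    print(pair, len(members))
```

Because the key is the identifier string itself, the isoform line ends up in its own cluster, which
is the limitation noted above and the reason a gene-level option was requested.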

We may need in the future to use some of the MITAB 2.7 fields when clustering binary interactions
but it does not make sense to store the results in MITAB 2.7 as the hierarchy would be lost, which is
very important to understand the content of the columns and not confuse the user. If we decide to
cluster some information from MITAB 2.7, we have to store the results in XML or another format
that maintains relationships.

MI Clustering (José M. Villaveces)

A tool is being developed that will enable users to cluster data on the fly, rather than rely on the
static data in resources such as iRefIndex. The MITAB results of a query are clustered, following a
MIQL query via a REST interface. Multiple services can be clustered and downloaded following the
same query, with the job currently being kept on the server for 1 week. There is a current 5MB
upload limit, and non-responsive jobs are cancelled after 15 minutes. The job is scored using
MIScore and a visualisation tool will be developed in BioJS. Future plans include improving the
library to deal with very large amounts of data, developing both a graphical and a programmatic
query interface, and enriching the data with pathway information.
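
The job limits quoted above could be expressed as simple checks, sketched here in Python. The
constant and function names are hypothetical, and reading 5MB as 5 x 1024 x 1024 bytes is an
assumption; this is not taken from the tool's code.

```python
from datetime import datetime, timedelta

MAX_UPLOAD_BYTES = 5 * 1024 * 1024    # 5MB upload limit (1MB = 1024 * 1024 bytes assumed)
JOB_TIMEOUT = timedelta(minutes=15)   # non-responsive jobs are cancelled after this
JOB_RETENTION = timedelta(weeks=1)    # jobs are kept on the server this long


def upload_allowed(size_bytes):
    """True if an uploaded file is within the size limit."""
    return size_bytes <= MAX_UPLOAD_BYTES


def should_cancel(last_response, now):
    """True if a job has not responded for longer than the timeout."""
    return now - last_response > JOB_TIMEOUT


def is_expired(created, now):
    """True if a job has been kept longer than the retention period."""
    return now - created > JOB_RETENTION


now = datetime(2013, 4, 15, 12, 0)
print(upload_allowed(3 * 1024 * 1024))
print(should_cancel(now - timedelta(minutes=20), now))
print(is_expired(now - timedelta(days=3), now))
```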

Experiences handling MI files (Manuel Bernal Llinares)

To merge the data, the Teague group have developed a data layer by downloading the ontology into
their database, have an alpha version of a JAVA command line tool to import data and have built a
MITAB2.7 data mapping algorithm. The need for improved documentation, code samples and the
development of critical tools, particularly around the XML format, was highlighted, and it was
agreed this needed to be tackled.


JAMI is a single JAVA API which operates over both MITAB and XML and is designed to [...]
requirement for redundant development over the two formats. Initial test cases of tool development
using JAMI are under way; these are a single syntax checker and enricher using web [...]. JAMI
code is available for review and input.



(Lukasz Salwinski, UCLA)

XPSQ is an extensible P[...] in which the user directly queries the record store, minimizing [...]
lossy transformation stages as transformation is done on indexing rather than on the query result.
XPSQ is run-time configurable and accepts external XSLT transformations. XML2.5.4 in both compact
and expanded form can be indexed, as can compressed files (.zip, .gz). XPSQ is already able to
download data in all PSI formats, but additional formats (XGMML, RDF) need to be added. The
system requires extensive [...], and statistics on query time and indexing have yet to be gathered,
but the ability of XML to deal with multi-protein complexes is an obvious advantage of this system.
The code is available for testing and input.

This model will support faceting but we need to agree on what PSICQUIC method to add in the
specification and in which format the faceting should be returned. Once agreed, this will be a
PSICQUIC interface specification and all implementations for this specification should provide
faceting with this [...]

MatrixDB (Sylvie Ricard-Blum, University of Lyon)

The database will release a new website in July 2013, and contain both additional curated data and
additional IMEx data accessed through PSICQUIC. mRNA data will be used to generate
context-specific networks to give networks which are closer to biological reality. A paper has been
published (PMID: 23199376) which contains a list of around 300 proteins designated the “core
matrisome” and a further approximately 800 “matrisome-associated” proteins.

Curating Complexes

Complex curation at the EBI (Birgit Meldal)

The EBI is establishing an encyclopedia of stable protein complexes, in collaboration with other
interested parties such as UniProtKB, Reactome, GO and PDB. In brief, data will be imported from
established databases such as Reactome, Gramene, Microme, PDB, ChEMBL and the MIPS databases
for human and yeast. Issues include developing a systematic naming of complexes and providing a
hierarchical structure for protein complexes in GO, dealing with (open) sets of proteins where the
true participant of the complex is one of many possibilities, curation of transient complexes,
determination of confidence scores, visualisation of complexes (rather than protein networks) and
selection of search terms and filters. A major issue is how to curate complexes consisting of several
complexes. The curation manual will be made available on the IntAct website.

Complexes in MatrixDB (Sylvie Ricard-Blum, University of Lyon)

Stable extracellular permanent protein complexes are curated in IntAct and given an EBI complex ID,
which is then incorporated into MatrixDB as a participant of an interaction of this complex with
another biological entity. The ECM complexes and a curation toolbox will be part of the new MatrixDB
release (due July 2013). Work is ongoing with Reactome to review pathway [...] the assembly of
collagen fibrils and other multimeric structures and the definition of [...] complexes. PTMs are
recorded only when essential for complex formation. [...] will be mined from UniProtKB and submitted
to IntAct for curation.

Cooperative Interactions (Kim de Roey, EMBL)

The SyBOSS project are working to enable the curation and visualisation of cooperative interactions,
which are either allosteric changes or pre-assembly interactions. These include allosteric
interactions in which an interactor changes the conformation of the pre-existing protein or complex,
e.g. opens a secondary binding site, in a different part of the molecule to where the interactor
bound and [...]. Pre-assembly complexes are often only transiently present as they describe the
intermediate topology of a larger structure as it is being assembled. The PSI-MI XML2.5.4 schema was
extended to describe the relationship of complexes as participants of an interaction or larger
complex. There are still some shortcomings with identifying which participants in a complex are
interacting with the new participant of a larger complex. This can be achieved if each complex is
described de novo within an interaction, but in the longer term it would be preferable to describe
each complex as an element that can be an interactor which can then be recalled from the database
each time this complex is part of an interaction, i.e. recalling both the complex as an interactor
AND all the information on its topology.

Viewing complex interactions (Colin Combe, Rappsilber Lab, Edinburgh University)

The Rappsilber lab have developed a tool to display crosslinking data in a protein interaction
network. The tool is still under development and does not yet input PSI-XML data. It
colour-coordinates domains and has residue resolution but can also deal with amino acid ranges as
interacting sites. Uncertain interactions are indicated with dashed edges. It has a zoom function to
go from bubble diagram to protein sequence for each interactor. PDB data can be integrated and
observed crosslinks are compared to the distances calculated from the crystal to estimate the
likelihood of the existence of this crosslink in vivo.
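
The comparison of observed crosslinks with crystal-structure distances can be sketched as follows.
This is only an illustration, not the Rappsilber lab's tool: the residue labels and coordinates are
invented, and the 30 angstrom cut-off is an assumed placeholder for whatever maximum span the
crosslinker in use allows.

```python
import math

# Assumed cut-off in angstroms; the appropriate value depends on the crosslinker.
MAX_CROSSLINK_DISTANCE = 30.0


def classify_crosslinks(crosslinks, coords, cutoff=MAX_CROSSLINK_DISTANCE):
    """Return {(res_a, res_b): (distance, compatible)} for each observed crosslink."""
    result = {}
    for res_a, res_b in crosslinks:
        # Euclidean distance between the two residue coordinates (x, y, z).
        d = math.dist(coords[res_a], coords[res_b])
        result[(res_a, res_b)] = (d, d <= cutoff)
    return result


# Invented coordinates, for illustration only.
coords = {
    "A:K10": (0.0, 0.0, 0.0),
    "B:K55": (3.0, 4.0, 0.0),
    "B:K90": (0.0, 0.0, 40.0),
}
observed = [("A:K10", "B:K55"), ("A:K10", "B:K90")]
for pair, (d, ok) in classify_crosslinks(observed, coords).items():
    print(pair, round(d, 1), "compatible" if ok else "too far")
```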


XML2.5.4 does not deal adequately with inferred knowledge; the absolute requirement for MIMIx
fields which was built into this model is not appropriate for this data type. There is now a
clear use case for XML3.0; the opportunity can also be taken to clean up the 2.5 schema whilst still
aiming to ensure backward compatibility.



no real format to export


Interaction ref is required to define the experiment. Add complexes at interactor level as an
additional node, removing the requirement for an [...]; make exp list optional and remove [...]


Indicate as complex by adding attribute complex=true?



Cooperative elements: need interaction ref., add cooperative node.


Need to capture protein groups for complexes and APMS expts. New element type ‘protein group’ (or
‘interactor group’), no associated experiment. Interaction type -> interaction group type,
interactor ref -> interactor group ref.


Confidence unit: used to describe confidence type ([...]) and not a unit, which does not make sense
for a confidence. The name will be retained for backward compatibility but the documentation should
be updated.



The experiment confidence is not actively used but there may be one possible use case in which the
detection method remains the same but the author gives a confidence to the experiment because some
conditions changed; however, there is no known example of this.


Publication element: the publication description is at the level of the bibRef element. It can
accept either Xref OR attribute but not both, so we currently add publication annotations
to experiment when we have a pubmed id. It would be good to add the publication
annotations in the attributes of bibRef and for that we need the bibRef element to accept
both Xref AND attributes.
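
To make the proposal concrete, the following Python sketch builds a bibref element that carries
both an xref and an attributeList. The element names follow the existing PSI-MI XML 2.5 usage; the
PubMed id, the annotation name `journal` and its value are placeholders, and the final schema may
differ from this sketch.

```python
import xml.etree.ElementTree as ET


def build_bibref(pubmed_id, annotations):
    """Build a bibref element holding both an xref and an attributeList."""
    bibref = ET.Element("bibref")
    xref = ET.SubElement(bibref, "xref")
    ET.SubElement(xref, "primaryRef", {"db": "pubmed", "id": pubmed_id})
    attribute_list = ET.SubElement(bibref, "attributeList")
    for name, value in annotations.items():
        attribute = ET.SubElement(attribute_list, "attribute", {"name": name})
        attribute.text = value
    return bibref


# Placeholder identifier and annotation, for illustration only.
bibref = build_bibref("12345678", {"journal": "Example Journal"})
print(ET.tostring(bibref, encoding="unicode"))
```

With such a structure the publication annotations would travel with the publication itself rather
than being attached to the experiment.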


Compact vs expanded: usage of compact or expanded cannot be enforced at the
schema level. Both versions of the schema are in use.



[...] appears throughout the schema but is never used and forces reading of the entire file
(expanded format). Could be first deprecated to see if anyone complains.


Participant experimental interactor: never used, could be marked as deprecated.



Position and effect can be systematically captured but not the actual change.
Could use an attribute with a CV term ‘resulting sequence’ and optional cross [...]
list or an optional element within feature range.



[...]: currently a comment. Should become an attribute on Participant.
Currently a positive integer; decimals and a range should be allowed, or minimum value
and maximum value.


Can feature detection become a list (use case: PTM identified by mass spec, confirmed
by western blot)? Issue with backwards compatibility.


Need to capture results of an interaction: feature attribute.


Add annotation <feature attribute name>: this is not really an improvement on the
current method (annotation topic: resultant PTM).


Add “resulting feature list”: change to schema but would not break parser.


Feature range: needs to allow negative values to enable description of promoter regions.