Web servicing the biological office




Vol. 21 Suppl. 2 2005, pages ii268–ii269
Text Mining
Martin Szugat, Daniel Güttler, Katrin Fundel, Florian Sohler and Ralf Zimmer
Department of Informatics, Ludwig-Maximilians-Universität München, 80333 München, Germany
Summary: Biologists routinely use Microsoft Office applications for standard analysis tasks. Despite ubiquitous internet resources, information needed for everyday work is often not directly and seamlessly available. Here we describe a very simple and easily extendable mechanism using Web Services to enrich standard MS Office applications with internet resources. We demonstrate its capabilities by providing a Web-based thesaurus for biological objects, which maps names to database identifiers and vice versa via an appropriate synonym list. The client application ProTag makes these features available in MS Office applications using Smart Tags and Add-Ins.
The programs of the Microsoft Office suite are widely adopted in biology: MS Excel is used for processing tabular data, such as expression values measured with DNA chips. MS Word is used for writing scientific publications. In both cases, the documents reference biological objects, such as genes and proteins, using names, descriptions or unique identifiers. If the user needs additional information on these objects, the identifiers or names have to be copied into a browser and searched for in public internet databases. This process is inefficient and time consuming.
Another problem is that some identifiers are not unique across different databases and organisms. Therefore, the user has to know which database to query. The problem is even harder when using names instead of identifiers. One protein usually has several textual descriptions or different spellings, and different proteins can have similar names, making it very difficult to find a name that uniquely describes the object being addressed.
We present a simple solution to these problems in the form of a Biological Name Service called ProThesaurus and a client application called ProTag. ProThesaurus is implemented as a Web Service and can be used to map protein (database) identifiers to protein names or descriptions (called synonyms) and vice versa. The Web Service interface of LiMB (D. Güttler, R. Zimmer and J. Apostolaski, manuscript in preparation) is used for tagging synonyms in free text; therefore, LiMB acts as a Biological Mark-up Service. The client application integrates seamlessly into the MS Office programs, making it easy for the user to identify biological objects and perform different actions on them during everyday work. It also offers a practical way to carry out the much-needed correction and extension of biological object name and synonym lists. The goal is not to establish a new ontology for protein names, but to give biologists practical help with the handling of correct protein

To whom correspondence should be addressed.
Web Services are a method to call remote programs with standardized public interfaces by sending a corresponding XML string to a Web server using the SOAP specification (http://www.w3.org/TR/SOAP/, Box et al., 2000).
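As a rough illustration of what such an XML string looks like, the following Python sketch builds a SOAP 1.1 envelope for a single remote call. Only the envelope namespace comes from the SOAP 1.1 note; the service namespace, the operation name `GetSynonyms`, the parameter name `identifier` and the sample value are assumptions made for this example and are not taken from the ProThesaurus interface.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:prothesaurus"  # hypothetical service namespace


def build_soap_request(operation: str, params: dict) -> bytes:
    """Serialize one remote call as a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element carries one child element per parameter.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# Hypothetical operation and parameter names, for illustration only.
request = build_soap_request("GetSynonyms", {"identifier": "P04637"})
print(request.decode("utf-8"))
```

In practice the resulting bytes would be sent in an HTTP POST to the service endpoint, and the response would be another envelope whose body holds the result.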
To realize the biological name service, we used previously collected synonym lists (Hanisch et al., 2003; Fundel et al., 2005) from a number of public databases: Entrez Gene, HUGO (Wain et al., 2004) and SWISSPROT (Boeckmann et al., 2003) as well as the organism-specific Rat, Mouse and Saccharomyces genome databases. Reduced synonym lists for yeast, mouse and fly have been evaluated in BioCreative Task 1B (Fundel et al., 2005) with good results (recall: 88, 80 and 74%, respectively), which was largely owing to extensive curation. The synonym lists in ProThesaurus and LiMB cover further organisms and include synonyms from more databases and thus are significantly larger.
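The two-way mapping between identifiers and synonyms can be pictured with a small in-memory structure. The sketch below is an assumption-laden toy, not the ProThesaurus data model: the class name `Thesaurus`, the normalization rule (lower-casing and dropping non-alphanumeric characters) and the identifiers `DB1:0001` and `DB2:0042` are all invented for illustration.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Fold case and drop separators so spelling variants compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


class Thesaurus:
    """Toy two-way map between identifiers and synonyms (illustrative only)."""

    def __init__(self):
        self._synonyms = defaultdict(set)  # identifier -> synonyms as written
        self._ids = defaultdict(set)       # normalized synonym -> identifiers

    def add(self, identifier, synonyms):
        for synonym in synonyms:
            self._synonyms[identifier].add(synonym)
            self._ids[normalize(synonym)].add(identifier)

    def synonyms_for(self, identifier):
        return sorted(self._synonyms.get(identifier, ()))

    def identifiers_for(self, name):
        return sorted(self._ids.get(normalize(name), ()))


thesaurus = Thesaurus()
thesaurus.add("DB1:0001", ["Abc-1", "ABC1", "abc transporter 1"])
thesaurus.add("DB2:0042", ["ABC 1"])  # same name, entry in another database

print(thesaurus.identifiers_for("abc1"))
print(thesaurus.synonyms_for("DB1:0001"))
```

The lookup for `abc1` returns both identifiers, which mirrors the ambiguity described earlier: a name alone does not always determine a unique database entry.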
Smart Tags are a feature of MS Office programs which annotate texts in a document with information (tags) while the document is being edited or viewed. Such a tag may offer different actions (through a context menu) on that text, e.g. replacing a protein name with a unique protein identifier or retrieving additional information on the respective object from the web.
The ProTag implementation automatically (i.e. without user interaction) tags regions of text that represent a biological object inside an Excel sheet, a Word document or a PowerPoint presentation. The text can be an identifier or a sequence of words that matches one or more synonyms. Once the user pauses editing or stops scrolling within the document, the MS Office application passes the currently visible text from the document to ProTag (Fig. 1, step 1). In order to identify biological objects within the text, the Mark-up Services (LiMB as default) are contacted (step 2). If the text can be matched against some identifiers or synonyms, a list of unique keys is returned to the client and a smart tag is added to the respective region of text (step 3); this is indicated by small marks throughout the document.
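The mark-up step (steps 2 and 3) can be approximated by a dictionary lookup over the visible text. The sketch below is a simplified stand-in for a Mark-up Service and does not describe how LiMB works internally: the function `tag_text`, the case-insensitive whole-word matching and the keys `KEY-1` to `KEY-3` are assumptions chosen for the example.

```python
import re


def tag_text(text, synonym_keys):
    """Return (start, end, keys) for every synonym occurrence in text.

    synonym_keys maps a synonym to the list of unique keys it stands for.
    """
    if not synonym_keys:
        return []
    # Longest synonyms first, so multi-word names win over their prefixes.
    names = sorted(synonym_keys, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {name.lower(): keys for name, keys in synonym_keys.items()}
    return [
        (m.start(), m.end(), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Toy synonym list with invented keys.
synonym_keys = {
    "ABC1": ["KEY-1"],
    "Abc transporter 2": ["KEY-2", "KEY-3"],
}
text = "Expression of abc1 and Abc transporter 2 was measured."
for start, end, keys in tag_text(text, synonym_keys):
    print(text[start:end], keys)
```

Each returned span corresponds to one smart tag; a span with several keys would lead to a context menu with one item per matching object.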
A mouse click on such a mark displays a context menu with one item for each biological object that matches the identifier or synonym. Each menu item provides several actions that can be performed on the corresponding smart tag (step 4), e.g. retrieving the synonyms for a biological object and inserting them into the document as a comment (step 5). This requires the Biological Name Services (ProThesaurus as default) to be contacted (step 6).
If a supposed identifier or a synonym is not recognized by ProTag, the user can also perform a manual and non-exact search through an additional command bar within the MS Office application.
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Fig. 1. Architecture overview of the ProThesaurus Web Service and the ProTag MS Office Smart Tag client.
For non-MS Office applications (e.g. Internet Explorer or Adobe Acrobat Reader) a standalone version of ProTag is offered. This software runs in the background, accepting user input from the clipboard to perform the search. The matches are displayed in a dialog and the user can copy and paste one or more synonyms or identifiers to the application.
ProTag can also be used to improve the Biological Name and Mark-up Services. Once a region of text is tagged as a protein identifier or name, the user is allowed to update the list of associated synonyms obtained from the Web Service, and may commit the updates to the ProThesaurus database.
Letting biological experts update the synonyms of biological objects has two benefits: first, the use of standard names is encouraged, and second, the current synonym list is expected to be enhanced and corrected through use. Collected changes are inspected and cross-checked in order to identify meaningful corrections and extensions. These changes are then incorporated into the next version of the database. In the meantime, the proposed synonyms and identifiers can be accessed through a special Biological Name Service called BeThesaurus (Beta Thesaurus), which can be integrated into ProTag by selecting a checkbox.
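One way to picture the optional use of proposed synonyms is a lookup that consults the released list and, only on request, the pending one. This is a minimal sketch under stated assumptions: the function `lookup_synonyms`, the flag `include_beta` and the sample entries are invented here, and the way ProTag actually combines the two services is not described in this note.

```python
def lookup_synonyms(identifier, released, proposed, include_beta=False):
    """Return released synonyms, optionally merged with unreviewed proposals."""
    result = set(released.get(identifier, ()))
    if include_beta:
        # Proposed entries have not yet been inspected or cross-checked.
        result |= set(proposed.get(identifier, ()))
    return sorted(result)


# Invented sample data standing in for the two services.
released = {"DB1:0001": ["ABC1", "Abc-1"]}
proposed = {"DB1:0001": ["Abc1p"], "DB1:0007": ["Xyz7"]}

print(lookup_synonyms("DB1:0001", released, proposed))
print(lookup_synonyms("DB1:0001", released, proposed, include_beta=True))
```

Keeping the proposals in a separate source means that switching the flag off restores the curated view without any clean-up.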
Finally, the ProTag Add-In for Excel makes it easy to fill arbitrary table entries with information requested from the ProThesaurus Web Service, using the values of other table entries as parameters for the Web Service call.
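The idea of filling one table column from another can be sketched independently of Excel. In the following Python example a table is a list of rows, and a lookup function stands in for the Web Service call; the function names, column names and sample data are assumptions for illustration and do not reflect the Add-In's actual interface.

```python
def fill_column(rows, source, target, lookup):
    """Fill empty target cells with lookup(value of the source cell)."""
    for row in rows:
        if not row.get(target):  # leave cells that already hold a value
            row[target] = lookup(row[source])
    return rows


# Local stub standing in for a remote name-service call (invented data).
TOY_SYNONYMS = {"DB1:0001": ["ABC1", "Abc-1"]}


def synonyms_as_text(identifier):
    return "; ".join(TOY_SYNONYMS.get(identifier, []))


table = [
    {"id": "DB1:0001", "synonyms": ""},
    {"id": "DB9:9999", "synonyms": ""},              # unknown identifier
    {"id": "DB1:0002", "synonyms": "kept by user"},  # already filled
]
for row in fill_column(table, "id", "synonyms", synonyms_as_text):
    print(row)
```

Passing the lookup as a parameter keeps the table logic separate from the service being queried, so a different service could be substituted without changing `fill_column`.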
Further information on usage of ProTag is provided in the setup program and as a Readme file.
The simple mechanism described in this note can easily be extended: custom actions can be integrated into ProTag and additional Web Services can be added as a Biological Name or Mark-up Service.
We believe that the implementation proposed here has obvious and immediate benefits, especially for users working with MS Office applications. First, the installation, activation and configuration of the tools are particularly simple. Second, many users are familiar with the MS Office programs. Third, the added value (information on proteins) is available seamlessly and in a standard, easy-to-handle form. Last but not least, the information obtained is likely to be useful, as it is straightforward to maintain up-to-date data on the Web server, and once a database identifier has been identified for a protein name through the described mechanism, the whole range of annotations is easily accessible via ProThesaurus from within Office
We thank the HOBIT partners for support regarding Web Service development and members of the LMU bioinformatics group for valuable discussions. The HOBIT project at LMU is funded by the Helmholtz-Gemeinschaft (http://www.helmholtz.de/). D.G. and K.F. acknowledge funding by the BMBF project MAMS/ProBio (grant 0312708C), F.S. by the BMBF Leitprojekt Osteoarthritis (grant
Conflict of Interest: none declared.
Boeckmann, B. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Thatte, S. and Winer, D. (2000) Simple Object Access Protocol (SOAP) 1.1.
Fundel, K. et al. (2005) A simple approach for protein name identification: prospects and limits. BMC Bioinformatics, 6 (Suppl 1), S15.
Hanisch, D. et al. (2005) ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics, 6 (Suppl 1), S14.
Hanisch, D. et al. (2003) Playing biology’s name game: identifying protein names in scientific text. Pac. Symp. Biocomput., 8, 403–414.
Pontius, J.U., Wagner, L. and Schuler, G.D. (2003) UniGene: a unified view of the transcriptome. The NCBI Handbook, 21, 1–21.
Wain, H.M. et al. (2004) Genew: the human gene nomenclature database, 2004 updates. Nucleic Acids Res., 32, D255–D257.