D - Protein Information Resource - Georgetown University

jamaicacooperativeΤεχνίτη Νοημοσύνη και Ρομποτική

17 Οκτ 2013 (πριν από 4 χρόνια και 2 μήνες)

84 εμφανίσεις

Text Mining and Ontology Development at the Protein Information Resource (PIR)


Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology

Georgetown University Medical Center, Washington, DC 20007


The Protein Information Re
source (PIR
,
http://pir.georgetown.edu/
) is an integrated protein
bioinformatics
resource for genomic and proteomic research.
For the past two decades,
PIR

has
focused its activities
on protein
and proteomics

inf
ormatics

research and development

by providing protein databases

(e.g. UniProt)
, functional
data integration and analysis tools
,

and
other bioinformatics infrastructure
. In the past few years,
PIR
’s work

is

expand
ed

to emerging areas such as biomedical
tex
t

mining
,

cancer biomedical informatics grid development,
and

medical informatics

for translational research.
Our interest
s in text mining are

primarily due to the fact that
literature
-
based
manual
curation

of protein databases
, while critical to the
datab
ase
quality, cannot keep up with
fas
t
-
growing literature
as well as sequence data
.
C
ollaborat
in
g

with other groups
,

PIR has been

working

in

several
text mining

areas
, including
entity recognition
, i
nformation retrieval and extraction
,

and o
ntology developm
ent.


I.
Development of
literature mining resource
for NLP research and tool development

PIR developed a

literature mining resource


iProLINK

(integrated Protein Literature INformation and
Knowledge) (
ht
tp://pir.georgetown.edu/iprolink/
)

aiming to

provide annotated literature data sets for developing
new
text
mining algorithms, such as p
rotein named entity recognition
, text categorization, and protein annotation
extraction,
and

protein

ontology developme
nt. iProLINK also provides text mining tools developed using its
annotated literature data sets
, such as protein phosphorylation information extraction tool

RLIMS
-
P.

iProLINK
thus
serves as a knowledge link bridging protein databases and literature databas
e such as PubMed.


II.
Protein n
amed e
ntity
recognition

Due to the lack of name standardization, gene/protein names

described

in literature and annotated in bioinformatics
databases
may

vary greatly.
As part of entity tagging program (BioTagger), w
e deve
loped a gene/protein name
thesaurus
,

BioThesaurus
,

which maps a comprehensive collection of gene/protein names from over 35 biological
databases
,

including the major gene (e.g., Entrez Gene) and protein databases (e.g. UniProt), and the model
organism data
bases (e.g. MGD, SGD).

BioThesaurus currently contains 5.8 million gene/protein names and are
associated
with over
4.6

million protein entries in UniProt Knowledgebase (UniProtKB). A
n

online

BioThesaurus
is developed to provide online query for finding gen
e/protein synonyms and for solving name ambiguities
for
name
s

shared by more than one protein entry (
http://pir.georgetown.edu/iprolink/biothesaurus/
).


I
II.
Information
r
etrieval

and
e
xtract
ion

Biological document classification based on specific
topics
, such as protein
-
protein interactions, protein
modifications
, or pathogenesis
-
related microbial proteins, is
important
in analyzing large
-
scale biomedical
literature

and
for
database curation
.

A new
text classification
algorithm using discriminativ
e substrings as attributes
was developed based
on
literature

data

set
s about five types of protein post
-
translational modifications,
glycosylation, phosphorylation, acetylation, methylation and hydrox
ylation
, provided in iProLINK
. The
algorithm
showed that Naive Bayes and Support Vector Machine classifiers perform consistently better (
AUC

0.92
-
0.97)
when using the

substring
-
based

attribute selection than when using attributes obtained by the Porter ste
mmer
algorithm (AUC

0.86
-
0.93).
The algorithm
is highly effective even when

the training data set is

relatively small
.



Protein phosphorylation is a fundamental molecular event essential to cellular processes. We developed and
benchmarked a rule
-
based tex
t mining system,
RLIMS
-
P
, for extracting protein phosphorylation information from
MEDLINE abstracts
regarding three objects

the protein kinase, the protein substrate
, and the phosphorylation
sites, with high precision and recall. A

web
-
based version of the

RLIMS
-
P for mining of protein phosphorylation
information from MEDLINE abstracts (
http://pir.georgetown.edu/iprolink/rlimsp/
)

was recently developed
. The
online
RLIMS
-
P is designed for easy access
ibility and with enhanced functionality
, including
evidence

tagging

for
extracted objects
in abstracts
,
mapping of pho
sphorylated proteins to

t
he UniProtKB
entry
based on PubMed ID
and/or protein names using BioThesaurus.
O
nline RLIMS
-
P facilitates biologi
cal studies of phosphorylated
proteins and the database annotation of phosphorylation features.


IV.
Protein O
ntology
d
evelopment

We
recently
designed
a

PRotein Ontology (PRO)
framework
to facilitate protein annotation and to guide new
experiment
al design

(
http://pir.georgetown.edu/pro/
).

PRO

is an ontology of protein entities and relationships in
the OBO Foundry framework. Its components extend from the evolutionary relationships of protein classes of
structu
ral domains, sequence domains, and whole proteins (ProEvo

Protein Evolution ontology), to the
representation of multiple protein forms of genes generated by genetic variation, alternative splicing, proteolytic
cleavage, and other post
-
translational modific
ations (ProForm

Protein Form ontology).

PRO is designed to assist
assignment of protein annotations (properties such as molecular functions) to specific protein forms of a gene
product, and to protein classes at different evolutionary levels in a formal on
tological framework. Thus, PRO
allows for accurate association of its ProForm nodes to specific objects in other databases (e.g. Reactome pathway)
or ontologies (e.g. Sequence Ontology). PRO also facilitates the experimental design based on knowledge trans
fer
between, for example, a mouse protein form and its human ortholog (or even to suggest that such an ortholog might
exist) via the connection between Pro
Evo and ProForm.


V.
Ongoing TATRC funded work

on Pathogen Mini
ng
:
Literature Mining of Pathogenesis
-
Related Proteins in
Human Pathogens for Database

text mining

(2007
-
2009)

The
objective of this pro
ject

is to develop an automated text mining system to facilitate literature
-
based curation of
pathogenesis
-
related proteins in pathogens of military relevance
, including pathogens of importance for diseases of
the developing world and biodefense.
We will
develop a text
-
mining system to identify papers and extract
information on pathogenicity and host
-
pathogen interactions, and to integrate text
-
mining into the
UniProt
Knowledge curation system. This project will build upon our technological framework that includes text mining
algorithms, bibliography mapping, and database curation platform.
Coupling the curated pathogenesis information
from the scientific litera
ture with the rich protein information and integrated genomics and proteomics data in
UniProt will greatly improve our basic understanding of virulence and pathogenicity factors and host
-
interacting
proteins of the human pathogens. Such knowledge will faci
litate the development of preventative and therapeutic
strategies against human pathogens.

LITERATURE MINING RELATED REFERENCES:


Named entity recognition:

1.

Torii M, Hu ZZ, Song M, Wu CH, Liu H (2007) A Comparison Study on Algorithms for Detecting Biomedic
al
Short Form Definition
.

BMC Bioinformatics
(in press)

2.

Liu H, Torii M, Hu ZZ, and Wu CH (2007) Gene Mention and Gene Normalization Based on Machine
Learning and Online Resources.
In Proc. of BioCreAtIvE II


3.

Liu H,
Hu ZZ
, Wu CH (2006). BioThesaurus: a web
-
based thesaurus of protein and gene names.
Bioinformatics

22:
103
-
105.

4.

Liu H,
Hu ZZ
, Torii M, Wu C, Friedman C. (2006) Quantitative Assessment of Dictionary
-
based Protein
Named Entity Tagging.
J Am Med Inform Assoc

13:497
-
507.

5.

Mani I,
Hu Z
, Jang SB, Samuel

K, Krause M, Phillips J, Wu CH (2005). Protein name tagging guidelines:
lessons learned.
Comparative and Functional Genomics

6(1
-
2): 72
-
76.


Information retrieval and extraction:

6.

Han B, Obradovic Z,
Hu ZZ
, Wu CH, Vucetic S.

(
2006
)

Substring selection for
biomedical document
classification.
Bioinformatics

22
:2136
-
42

7.

Yuan X,
Hu ZZ
, Wu HT, Torii M, Narayanaswamy M, Ravikumar KE, Vijay
-
Shanker K, and Wu CH. (2006)
An Online Literature Mining Tool for Protein Phosphorylation.
Bioinformatics

22:1668
-
9.

8.

Hu ZZ
, Na
rayanaswamy M, Ravikumar KE, Vijay
-
Shanker K, Wu CH (2005). Literature Mining and Database
Annotation of Protein Phosphorylation Using a Rule
-
Based System.
Bioinformatics

21(11): 2759
-
2765.


Ontology development:

9.

Natale DA, Arighi CN, Barker W, Blake J, Ch
ang T, Hu ZZ, Liu H, Smith B, Wu CH.
(
2007
)
Framework for a
Protein Ontology
.

BMC Bioinformatics
,
8
(Suppl 9)
:
S1


10.

H. Liu
, Z. Hu, and C Wu (2005) DynGO: A tool for navigation and visualization of Gene Ontology resources.
BMC Bioinformatics
, BioMed Central Lt
d, 2005, 6:201 (online journal).


Annotated literature resource:

11.

Hu ZZ
, Mani I, Hermoso V, Liu H and Wu CH (2004). iProLINK: an integrated protein resource for literature
mining.
Comput Biol Chem

28: 409
-
416.