A Pilot Study Experience on Generating Complex Ontology Instances from Scientific Bibliographies on Real Biological Domains



José F. Aldana-Montes (1), Rafael Berlanga-Llavorí (2), Roxana Danger (2), Raul Montañés-Martínez (3), Mª del Mar Rojano-Muñoz (1), Francisca Sánchez-Jiménez (3)

(1) Khaos Research Group. Dpt. of Computer Languages and Computing Science, University of Malaga, Campus de Teatinos. 29071 Malaga, Spain
http://Khaos.uma.es, E-mail: jfam@lcc.uma.es

(2) Temporal Knowledge Bases Group. Dpt. of Computer Languages and Computer Systems, University Jaume I, Castellón, Spain
http://www3.uji.es/~berlanga/; E-mail: {berlanga, Roxana.Danger}@uji.es

(3) ProCel Lab. Department of Molecular Biology and Biochemistry. Faculty of Sciences, University of Málaga, Campus de Teatinos. 29071 Malaga, Spain
http://www.bmbq.uma.es/procel, E-mail: kika@uma.es


Abstract

In this paper we present a first insight into the generation of complex ontology instances from scientific bibliographies, such as the one in PubMed/PubChem, on a real biological domain: polyamines and histamine. There is evidence for the involvement of both in cancer and other inflammation- and/or angiogenesis-dependent diseases, but multiple questions concerning the molecular processes behind these effects remain to be solved.

We address the problem of automatically generating ontology instances starting from a collection of PDF documents stored in a bibliographic database. We have developed a domain ontology that models and describes what we are searching for. The structure of each document is extracted in order to generate a mapping between the ontology and the document text. Using this mapping, the ontology is populated with the extracted knowledge.

1. Introduction

The Semantic Web (SW) is intended to enrich the current Web by creating knowledge objects that allow both users and programs to better exploit Web resources [1]. The cornerstone of the Semantic Web is the definition of an ontology that conceptualizes the knowledge within the resources of a particular domain [2]. Thus, the contents of the Web will in the future be described by means of a large collection of semantically tagged resources that must be reliable and meaningful.

Nowadays most of the information on the Web consists of text documents with little or no structure at all, which makes their manual annotation impracticable [3].

In automatic document classification, several methods have been proposed to classify documents according to a given taxonomy of concepts (e.g. [4]). However, these methods require a large amount of training data, and they do not generate instances of the ontology. In the Information Extraction (IE) area, much work has focused on recognizing predefined entities (e.g. dates, locations, names), as well as on extracting relevant relations between them by using natural language processing. However, current IE systems are restricted to extracting flat and very simple relations, mainly to feed a relational database. Ontology relationships are more complex, as they can involve nested concepts and have associated inference rules.

For these reasons, we address the problem of annotating texts according to an ontology as a mapping problem between text fragments and ontology subgraphs. As in IE systems, we also recognize named entities (e.g. protein names) that can be directly associated with ontology concepts. However, unlike IE systems, we use neither syntactic nor semantic analysis to extract the relations appearing in the text. As a result, we can extract complex instances efficiently and effectively.


In this paper, we apply this technique to extract instances from the biogenic amines domain and all its related processes.

2. Generation of ontology instances

In this section we introduce the main formal aspects of the proposed extraction technique, presented in [2]. We adopt the formal definition of ontology from [6]. This definition distinguishes between the conceptual schema of the ontology, which consists of a set of concepts and their relationships, and its associated resource descriptions, which are called instances. In our approach, we represent both parts as oriented graphs, over which the ontology inference rules are applied to extract and validate complex instances.

Table 1 presents the definitions of the concepts used when generating partial and complete instances from the words and entities appearing within a text fragment.


Definition / Formal description / Description

Def. 1.
Formal description: An ontology is a labeled oriented graph G = (N, A), where the set of nodes N contains both the concept names and the instance identifiers, and A is a set of labeled arcs between them representing their relationships. Additionally, over the graph we introduce the following restrictions according to [6]:
- The set of the ontology concepts C is the subset of nodes in N that are not pointed by any arc labeled as "instance_of".
- The taxonomy of concepts C consists of the subgraph that only contains the is_a arcs. It defines a partial order over the elements of C, denoted with ≤_C.
- The relation signature is a function σ: R → C × C, which contains the subsets of arc labels that involve only concepts.
- The function dom: R → C with dom(r) = Π_1(σ(r)) gives the domain of a relation r ∈ R, whereas range: R → C with range(r) = Π_2(σ(r)) gives its range.
- The function dom_C: C → 2^[…] gives the domain of definition of a concept.
- The function […] expresses the proximity of two concepts c and c′, where d_T(c, c′) and d_R(c, c′) are the distances between the two concepts c and c′.

Def. 2.
Formal description: We denote with […] an instance related to the object o of the class c, or simply an instance, as the subgraph of the ontology that relates the node o with any other, and there exists an arc (o, instance_of, c) ∈ A.
Description: Notice that the arc (o, instance_of, c) is not included in the instance subgraph. An instance is empty if the object o only participates in the arc (o, instance_of, c), which is represented as the set {(o, *, *)}.

Def. 3.
Formal description: We call a specialization of the instance […] to the class c′, c […] c′, denoted with […], the following instance: […]
Description: Basically, this definition says that an instance can be specialized by simply renaming the object with a name from the target class in all the instance triples that do not represent an overridden property.

Def. 4.
Formal description: We call an abstraction of the instance […] to the class c′, […], denoted with […], the instance […]
Description: Similarly to the previous definition, we can obtain an abstract instance by selecting all the triples whose relation name can be abstracted to a relation of the target class c′ and renaming the instance object accordingly.

Def. 5.
Formal description: We call union of the instances […] and […], and denote it by […], the set: […]
Description: Definitions 5-7 are necessary to define the unification and aggregation of simple instances into complex ones.

Def. 6.
Formal description: We call difference of the instances […] and […], denoted with […], the following set: […]

Def. 7.
Formal description: We call symmetric difference of the instances […] and […], denoted with […], the result of […].

Def. 8.
Formal description: We say that two instances […] and […] related to the objects o and o′ of classes c and c′, respectively, c […]_C c′, are complementary if they satisfy at least one of the following two conditions:
1. ([…] (o′, r, x), (o′, r, x′) […], x […] x′, r is bijective) and ([…] (o, r, x), (o′, r′, x′) […], r […]_R r′, r […] r′, x […] x′), or
2. at least one of the instances […] or […] is empty.
Description: Two instances are complementary if contradictions do not exist between the values of their similar relations.

Def. 9.
Formal description: We say that two complementary instances […] and […] are unifiable in […], c […]_C c′.

Def. 10.
Formal description: We say that two instances, […] and […], are aggregable in […], if […] r ∈ R, σ(r) = (c*, c″), c […]_C c*, c″ […]_C c′, and o″ is the name of the instance […].


Table 1. Definitions of the extraction method, taken from [2]
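As an informal illustration of Def. 2, the following minimal Python sketch stores the ontology graph as a set of (source, label, target) triples and extracts the instance subgraph of an object. The node names, the relation names and the helper names `instance_subgraph` and `is_empty` are hypothetical and chosen only for illustration; the sketch also assumes that "relates the node o with any other" means every arc in which o is the source or the target.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (source node, arc label, target node)

# Hypothetical toy graph; all names below are illustrative assumptions.
ARCS: Set[Triple] = {
    ("HDC", "is_a", "Enzyme"),
    ("hdc_1", "instance_of", "HDC"),
    ("hdc_1", "has_weight", "74-kDa"),
    ("hdc_1", "expressed_in", "RAW 264.7"),
    ("hdc_2", "instance_of", "HDC"),
}


def instance_subgraph(arcs: Set[Triple], obj: str) -> Set[Triple]:
    """Arcs that involve `obj`, excluding its own instance_of arc (Def. 2)."""
    return {
        (s, label, t)
        for (s, label, t) in arcs
        if obj in (s, t) and not (s == obj and label == "instance_of")
    }


def is_empty(instance: Set[Triple]) -> bool:
    """An instance is empty when the object only takes part in instance_of."""
    return len(instance) == 0


print(sorted(instance_subgraph(ARCS, "hdc_1")))
print(is_empty(instance_subgraph(ARCS, "hdc_2")))
```

Here the empty instance is modelled as an empty Python set rather than as the set {(o, *, *)} used in Table 1; this is a convenience of the sketch, not part of the formal definition.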


Taking into account the definitions above, the system generates the set of instances associated with each text fragment as follows. First, it constructs all the possible partial instances with the concept, entity and relation names that appear in the text fragment. Then it tries to unify partial instances, substituting them with the unified ones. Finally, instances are aggregated to form complex instances. The following algorithm sums up the process. Further details of this process are given in [2].
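The three steps described above (construction of partial instances, unification, aggregation) can be sketched in Python as follows. This is a simplified illustration and not the algorithm of [2]: the lexicon, the relation signature, the `Instance` class and all function names are assumptions made for the example, unification is reduced to "same concept and no conflicting property values", and aggregation links any two instances whose concepts match a relation signature.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical lexicon: surface term -> (concept, property, value).
LEXICON: Dict[str, Tuple[str, str, str]] = {
    "histidine decarboxylase": ("Enzyme", "name", "HDC"),
    "74-kDa": ("Enzyme", "weight", "74-kDa"),
    "RAW 264.7": ("CellLine", "name", "RAW 264.7"),
}
# Hypothetical relation signatures: relation -> (domain concept, range concept).
SIGNATURES: Dict[str, Tuple[str, str]] = {"expressed_in": ("Enzyme", "CellLine")}


@dataclass
class Instance:
    concept: str
    props: Dict[str, str] = field(default_factory=dict)
    links: List[Tuple[str, "Instance"]] = field(default_factory=list)


def partial_instances(fragment: str) -> List[Instance]:
    """Step 1: one partial instance per lexicon term found in the fragment."""
    text = fragment.lower()
    return [
        Instance(concept, {prop: value})
        for term, (concept, prop, value) in LEXICON.items()
        if term.lower() in text
    ]


def unify(instances: List[Instance]) -> List[Instance]:
    """Step 2: merge instances of the same concept with no conflicting values."""
    merged: List[Instance] = []
    for inst in instances:
        for target in merged:
            same = target.concept == inst.concept
            clash = any(target.props.get(k, v) != v for k, v in inst.props.items())
            if same and not clash:
                target.props.update(inst.props)
                break
        else:  # no compatible instance found: keep it as a new one
            merged.append(inst)
    return merged


def aggregate(instances: List[Instance]) -> List[Instance]:
    """Step 3: link instances whose concepts match a relation signature."""
    for relation, (domain, rng) in SIGNATURES.items():
        for a in instances:
            for b in instances:
                if a is not b and a.concept == domain and b.concept == rng:
                    a.links.append((relation, b))
    return instances


def generate(fragment: str) -> List[Instance]:
    return aggregate(unify(partial_instances(fragment)))
```

Applied to the title used in sample 004 of the benchmark, `generate` would yield one enzyme instance carrying both the name and the weight, linked through the assumed `expressed_in` relation to one cell-line instance. No syntactic or semantic analysis is involved, which matches the spirit of the approach, although the real method relies on the formal notions of complementary, unifiable and aggregable instances from Table 1.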


3. Generating Complex Ontology Instances from Scientific Bibliographies on the Domain of the Biogenic Amines

3.1 The Domain of the Biogenic Amines Metabolism


From a molecular point of view, Biology is an informational science with three major types of interconnected information: i) the structure of genes, along with the three-dimensional structures of proteins, ii) the molecular machines of life, and iii) biological systems with their emergent behaviour. Validation of new biocomputational tools for biological data mining and integration requires experience on a wide body of biological knowledge, which is essential: i) to select a proper and interesting problem, ii) to build the underlying ontology, and iii) to purge any automatic search result and to build new applications (i.e. hypotheses and conclusions) on the obtained data. In this project, we have tried to meet all of the above-mentioned requirements in order to build a tool for locating molecular information by text mining in both web pages and PDF documents. Our group combines experts in computer science and molecular biology who have been involved in interdisciplinary projects for the last 4 years. As a pilot topic for validation, we chose a module of the secondary metabolism of living cells: the biogenic amine metabolism.

Biogenic amines are derived from amino acids. Polyamines play a relevant regulatory role in macromolecular synthesis and cell proliferation rates [7-9]. On the other hand, histamine is a biogenic amine synthesized by the enzyme histidine decarboxylase (HDC), with a major role in immune response, inflammation, and multiple other physiological activities [10-11]. Altogether, biogenic amines are involved in many of the most important and widespread diseases, such as neurological disorders, cancer, and immune pathologies. Traditionally, histamine and polyamines have been the subject of much interest from many biomedical research groups, without much communication among them. However, they share similar chemical and metabolic features and physiopathological missions [8-11]. Nevertheless, multiple questions concerning the molecular processes behind these effects remain to be solved. This may be due to the great dispersion of the available data across specialized literature from very different areas of biomedicine. For all of these reasons, this field seems worthy of selection as an interesting conceptual benchmark. In silico predictive models at different levels have been developed and published by the group; they help to explain many of the previous experimental observations and give rise to new findings and hypotheses [10,12,13]. The development of new text mining tools is essential for the efficient progression of these projects.

3.2 Domain Ontology


Enzymes (mainly proteins) are the catalysts of the biological reactions that take part in any metabolic pathway. The half-life of a given enzyme is determined by the balance between its synthesis and degradation rates. Protein degradation involves many different mechanisms (proteinase activities) specific for each enzyme. Many of the regulatory mechanisms of amine metabolic pathways involve enzyme destruction after the biological mission of the respective amine has been carried out, as a mechanism to ensure an effective switch-off of the process. In addition, several amine-related enzymes are synthesized in an inactive form that needs to be cleaved by proteinases to reach its active stage (processing/maturation). This is the case for mammalian histidine decarboxylase, a very unstable enzyme that requires a proteolytic mechanism for both maturation and regulation of its turnover. A lot of information on the enzyme degradation/processing mechanisms involved in amine metabolism has been generated in different text formats. An important percentage of this information has been generated by collaborator groups and our own group, especially in the case of mammalian histidine decarboxylase [14-17]. This seems to be a good test bank for our pilot project on text mining. Thus, we adopted HDC as the pilot molecule against which to test our data mining efforts to develop an automated search tool for published PDFs and other document formats. Nevertheless, due to the highly consistent nature of protein metabolism, once the data mining process is validated on this enzyme, it could easily be scaled up to any other enzyme, related or not to amine metabolism. This ontology has been developed in OWL [18].




Fig. 2. Ontology

3.3 Benchmark


The benchmark has been chosen from an extensive pool of scientific works coming from a world-wide group of experts on degradation and maturation of histidine decarboxylase (HDC). This benchmark compiles 67 documents in PDF format and eight web pages; 50 percent of them contain the pertinent information (highly related to HDC degradation/maturation). Both collateral information (related to other facts of histamine metabolism) and unrelated information, which were included as controls, are essential for a correct and accurate validation of both the analyzers and the ontology during this pilot experience. Document selection was done on the basis of the impact and availability of the publication source. The selected documents contain dispersed (and in some cases contradictory) experimental results developed in the field of molecular biology and biochemistry.


Text description and figure of instances


<Sample id="004" source="22.pdf" section="Title and Abstract">

<P>Expression of 74-kDa histidine decarboxylase protein in a macrophage-like
cell line RAW 264.7 and inhibition by dexamethasone</P>


Total words    Instances    Correct Instances    Failed Instances
17             5            4                    1





<Sample id="010" source="9.pdf" section="2.1.1">

<P>
The human H1 receptor contains 486, 488 or 487 amino acids in rat, mouse and
human respectively. It shares the typical features of GPCR, namely: seven
transmembrane domains of 20-25 amino-acids predicted to form an a-helice that
spans the plasma membrane and an extracellular NH2 terminal (polypeptide)
domain with glycosylation site. It is encoded by a single exon gene located on
the distal short arm of chromosome 3p25 in humans and chromosome 6 in mice.
Histamine binds to aspartate residues in the transmembrane domain (motif) 3 of
the receptor and to asparagine + lysine residues within the transmembrane
domain 5. H1 receptors are involved in the pathological process of allergy such
as allergenic rhinitis, conjunctivitis, atopic dermatitis, urticaria, asthma
and anaphylaxis. In the lung, it mediates bronchoconstriction and increased
vascular permeability. The H1R is expressed in numerous cell types, including
airway and vascular smooth muscle cells, hepatocytes, chondrocytes, nerve
cells, endothelial cells, neutrophils, monocytes, dendritic cells, as well as T
and B lymphocytes, in which it mediates the various biological manifestations
of allergic responses [3-5]. H1R is a Gaq/11-coupled protein (polypeptide)
with a very large third intracellular loop and a relatively short C-terminal
(polypeptide) tail (Fig. 1). The main signal induced by ligand binding is the
activation of phospholipase C-generating inositol 1,4,5-triphosphate (Ins
(1,4,5) P3) and 1,2-diacylglycerol leading to increased cytosolic (cellular
lication) Ca2+ [4]. This rise in intra-cellular calcium levels seems to account
for the various pharmacological activities promoted by the receptor, such as
nitric oxide production, vasodilatation, liberation of arachidonic acid from
phospholipids and increased cyclic AMP.H1R also activates NF kB through Gaq11
and Gbg upon agonist binding, while constitutive activation of NF kB occurs
through Gbg only [6].
</P>


Total words    Instances    Correct Instances    Failed Instances
283            14           10                   4


Table 2. Correct instances (orange diamonds in the figure) and failed instances (gray diamonds) extracted from each text sample.
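The counts in Table 2 can be turned into a fraction of correct instances per sample. The short Python sketch below does this arithmetic with the figures copied from the table; the names `SAMPLES` and `correct_fraction` are introduced here only for illustration, and two samples are of course too few to be read as an evaluation of the system.

```python
# Counts copied from Table 2: (sample id, instances, correct, failed).
SAMPLES = [("004", 5, 4, 1), ("010", 14, 10, 4)]


def correct_fraction(total: int, correct: int) -> float:
    """Fraction of the extracted instances that were judged correct."""
    return correct / total


for sample_id, total, correct, failed in SAMPLES:
    assert correct + failed == total  # sanity check on the table rows
    print(sample_id, round(correct_fraction(total, correct), 3))
# prints: 004 0.8
# prints: 010 0.714

all_instances = sum(row[1] for row in SAMPLES)
all_correct = sum(row[2] for row in SAMPLES)
print("both samples", round(correct_fraction(all_instances, all_correct), 3))
# prints: both samples 0.737
```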

Concepts, links and instances were defined on the basis of scientific criteria, without taking into account the precise information contained in the benchmark or its semantic style. One of the major difficulties detected throughout the work was the semantic dispersion in the expression of concepts and results in the biochemical information, which makes the accurate definition of the concepts and instances for the mining difficult and obliged us to establish an enriched list of synonyms. Working on this test bank, we obtained a set of results such as the examples shown in Table 2.
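A synonym list of this kind can be pictured as a mapping from surface forms to a canonical concept name. The following Python sketch is only an assumption about how such a list might be applied; the entries are taken from terms that occur in this paper and in the benchmark samples, while the canonical labels and the function name `normalize` are hypothetical.

```python
# Hypothetical synonym list: lower-cased surface form -> canonical name.
SYNONYMS = {
    "histidine decarboxylase": "HDC",
    "hdc": "HDC",
    "h1 receptor": "H1R",
    "h1 receptors": "H1R",
    "h1r": "H1R",
}


def normalize(term: str) -> str:
    """Map a term to its canonical name; unknown terms are returned unchanged."""
    key = " ".join(term.lower().split())  # ignore case and repeated spaces
    return SYNONYMS.get(key, term)


print(normalize("Histidine  decarboxylase"))
print(normalize("H1 receptors"))
print(normalize("dexamethasone"))
```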

The results indicate that the concepts contained in the text are properly detected. Although the system fails to detect the appropriate context, the results are not discouraging at all. However, further efforts will be necessary to restrict the possibilities of establishing long-distance relationships (with respect to the position of the concepts in the text) and to solve problems with the ambiguity of biological terms (for instance, in the literature, H1 could mean a histidine residue located at position 1 as well as the abbreviation for a histamine receptor subtype). These changes would increase the stringency of the search criteria and consequently lead to the loss of some information, although accuracy should improve.

The extraction algorithm captures the main relationships between correctly detected entities, but it mainly fails in the correct identification of some basic entities. This is principally because we have used preliminary regular expressions to detect named entities instead of a good thesaurus. Another problem we identified is the correct tokenization of text fragments, due to the large number of chemical formulas and abbreviations that are included in biomedical texts. Future work must therefore focus on improving the most basic mechanisms for detecting named entities and concept boundaries in the texts. Finally, we have also noticed that the level of detail of the ontology affects the quality of the extracted instances: the more detailed the ontology is, the likelier and better are the relations found in text fragments. In this respect, much more work must be done to improve the details of the target ontology.
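To make the two problems concrete, the Python sketch below shows regular-expression entity detection and the effect of a naive tokenizer on chemical notation. The patterns are hypothetical, since the expressions actually used are not given in the text, and they cover only a few forms that appear in the benchmark samples.

```python
import re
from typing import List, Tuple

# Hypothetical patterns, written for this example only.
PATTERNS = {
    "MolecularWeight": re.compile(r"\b\d+(?:\.\d+)?-kDa\b"),
    "Receptor": re.compile(r"\bH[1-4]R\b|\bH[1-4] receptors?\b"),
    "Ion": re.compile(r"\b[A-Z][a-z]?\d?\+"),
}


def find_entities(text: str) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits += [(m.start(), label, m.group()) for m in pattern.finditer(text)]
    return [(label, matched) for _, label, matched in sorted(hits)]


def naive_tokens(text: str) -> List[str]:
    """Split on anything that is not a letter, digit or underscore."""
    return re.findall(r"\w+", text)


print(find_entities("H1R raises cytosolic Ca2+"))
# prints: [('Receptor', 'H1R'), ('Ion', 'Ca2+')]
print(naive_tokens("Ins(1,4,5)P3"))
# prints: ['Ins', '1', '4', '5', 'P3']
```

The naive tokenizer breaks the formula into five pieces, so no later step can recognize it as one entity. The receptor pattern, in turn, has no way of telling a histamine receptor subtype from a histidine residue at position 1, which is where a thesaurus with context information would be expected to help.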

4. Conclusion

We have presented a pilot project on the generation of complex ontology instances in the very real (and thus hard) domain of the biogenic amines metabolism (BAM). An ontology that models the domain of discourse of the BAM has been developed in OWL. We have built a benchmark from the real bibliography of the BAM domain (extracted from several public domain bibliographic databases), and we are going to use this pilot experience to evaluate, in a well-known controlled environment, the quality of the results produced by our system. The first results are not discouraging, but further work must be done on the description of the ontology.


Acknowledgements

This work was supported by Grants CVI-267, CVI-657, TIN-09098-C05-01 and TIN2005-09098-C05-04.

References

[1] Berners-Lee, T., Hendler, J., Lassila, O. "The Semantic Web". Scientific American, 2001.

[2] Danger, R., Sanz, I., Berlanga-Llavori, R., Ruiz-Shulcloper, J. "A Proposal for the Automatic Generation of Instances from Unstructured Text", Lecture Notes in Computer Science, Volume 3287, pp. 462-469, 2004.

[3] Gruber, T.R. "Towards Principles for the Design of Ontologies used for Knowledge Sharing", International Journal of Human-Computer Studies Vol. 43, pp. 907-928, 1995.

[4] Forno, F., Farinetti, L., Mehan, S. "Can Data Mining Techniques Ease The Semantic Tagging Burden?", SWDB 2003, pp. 277-292, 2003.

[5] Doan, A. et al. "Learning to match ontologies on the Semantic Web". VLDB Journal 12(4), pp. 303-319, 2003.

[6] Maedche, A., Neumann, G. and Staab, S. "Bootstrapping an Ontology based Information Extraction System". Studies in Fuzziness and Soft Computing, Springer, 2001.

[7] Cohen SS. A guide to polyamines. Oxford University Press, New York; 1998.

[8] Thomas T, Thomas TJ. Polyamines in cell growth and cell death: molecular mechanisms and therapeutic applications. Cell Mol Life Sci. 2001; 58: 244-58.

[9] Medina MA, Urdiales JL, Rodriguez-Caso C, Ramirez FJ, Sanchez-Jimenez F. Biogenic amines and polyamines: similar biochemistry for different physiological missions and biomedical applications. Crit Rev Biochem Mol Biol. 2003; 38: 23-59.

[10] Moya-Garcia AA, Medina MA, Sanchez-Jimenez F. Mammalian histidine decarboxylase: from structure to function. Bioessays 2005; 27: 57-63.

[11] Medina MA, Quesada AR, Nunez de Castro I, Sanchez-Jimenez F. Histamine, polyamines, and cancer. Biochem Pharmacol. 1999; 57: 1341-4.

[12] Medina M.A., Correa-Fiz F., Rodríguez-Caso C., Sánchez-Jiménez F. A comprehensive view of polyamine and histamine metabolism to the light of new technologies. J. Cell. Mol. Med. 2005. Vol. 9. 854-864.

[13] Rodríguez-Caso C., Montañez R., Cascante M., Sánchez-Jiménez F., Medina M.A. Mathematical modeling of polyamine metabolism in mammals. J. Biol. Chem., in press (May 2006), on-line PMID: 16709566.

[14] Ángel N., Olmo M.T., Coleman C.S., Medina M.A., Pegg A.E., Sánchez-Jiménez F. Experimental evidence for structure/activity features in common between mammalian. Biochem. J. 1996. Vol. 320. 365-368.

[15] Matés J.M., del Valle A.E., Urdiales J.L., Coleman C.S., Feith D., Pegg A.E., Sánchez-Jiménez F. Structure/function relationship studies on the T/S residues 173-177 of rat ODC. Biochem. Biophys. Acta 1998. Vol. 1386. 113-120.

[16] Olmo M.T., Urdiales J.L., Pegg A.E., Medina M.A., Sánchez-Jiménez F. Eur. J. Biochem. 2000. Vol. 267. 1527-1531.

[17] Olmo M.T., Sánchez-Jiménez F., Medina M.A., Hayashi H. Spectroscopic analysis of recombinant rat histidine decarboxylase. 2002. Vol. 132. 433-439.

[18] OWL Web Ontology Language 1.0 Reference. W3C Candidate Recommendation 15 December 2003. http://www.w3.org/TR/owl-ref.