24 Oct 2013

Unsupervised Ontology Acquisition from plain texts: The OntoGain method

Efthymios Drymonas, Kalliopi Zervanou, Euripides G.M. Petrakis

Intelligent Systems Laboratory
http://www.intelligence.tuc.gr
Technical University of Crete (TUC), Chania, Greece

OntoGain

A platform for unsupervised ontology acquisition from text

Application independent

Ontology of multi-word term concepts

Adjusts existing methods for taxonomy & relation acquisition to handle multi-word concepts

Outputs ontology in OWL

Good results on Medical, Computer science corpora

Why multi-word term concepts?

Majority of terminological expressions

Convey classificatory information, expressed as modifiers
  e.g. “carotid artery disease” denotes a type of “artery disease”, which is a type of “disease”

Leads to more expressive and compact ontology lexicon

Ontology Learning Steps

Concept Extraction
  C/NC-Value

Taxonomy Induction
  Clustering, Formal Concept Analysis

Non-taxonomic Relations
  Association Rules, Probabilistic algorithm

The C/NC-Value method [Frantzi et al., 2000]

Identifies multi-word term phrases denoting domain concepts

Noun phrases are extracted first
  ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun

C-Value: Term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms

NC-Value: Uses context information (valid terms tend to appear in specific context and co-occur with other terms)
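
The noun-phrase filter above can be sketched as a regular expression over part-of-speech tags. This is a minimal sketch, not the OntoGain implementation: it assumes Penn Treebank tags (JJ*, NN*, IN), and the names `tag_letter`, `NP_PATTERN` and `candidate_terms` are hypothetical. Each token is collapsed to one letter so the slide's pattern can be matched directly.

```python
import re

def tag_letter(pos):
    """Collapse a Penn Treebank tag (assumed tag set) into one letter."""
    if pos.startswith("JJ"):
        return "A"  # adjective
    if pos.startswith("NN"):
        return "N"  # noun
    if pos == "IN":
        return "P"  # preposition (IN also covers subordinating conjunctions)
    return "O"      # any other tag ends a candidate

# ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
NP_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")

def candidate_terms(tagged_tokens):
    """Return multi-word candidate terms from a list of (word, tag) pairs."""
    letters = "".join(tag_letter(pos) for _, pos in tagged_tokens)
    terms = []
    for match in NP_PATTERN.finditer(letters):
        if match.end() - match.start() >= 2:  # keep multi-word candidates only
            words = [w for w, _ in tagged_tokens[match.start():match.end()]]
            terms.append(" ".join(words))
    return terms

# Hand-tagged toy sentence (tags are supplied, not produced by a tagger).
sentence = [("the", "DT"), ("carotid", "JJ"), ("artery", "NN"),
            ("disease", "NN"), ("affects", "VBZ"), ("quality", "NN"),
            ("of", "IN"), ("life", "NN")]
terms = candidate_terms(sentence)
```

Matching on a string of tag letters keeps the pattern readable and lets the greedy regex return the longest candidate at each position; shorter nested candidates would need a separate pass.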


C-Value: Statistical Part

For candidate term a
  f(a): Total frequency of occurrence
  f(b): Frequency of a as part of longer terms
  P(T_a): number of these longer terms
  |a|: The length of the candidate string
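
The quantities listed above combine into the C-value score as defined in [Frantzi et al., 2000]: log2|a| · f(a) for a term that is not nested, and log2|a| · (f(a) − (1/P(T_a)) · Σ f(b)) when a appears inside longer candidate terms b. The sketch below is a simplified, non-incremental reading of that formula; the function names and the toy frequencies are assumptions made for illustration.

```python
import math

def contains(longer, shorter):
    """True if word list `shorter` occurs contiguously inside the longer list."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_values(freq):
    """Map each candidate term to its C-value, given term -> frequency."""
    scores = {}
    for a, f_a in freq.items():
        words = a.split()
        # Frequencies f(b) of the longer candidate terms that contain a.
        nested_in = [f_b for b, f_b in freq.items() if contains(b.split(), words)]
        if nested_in:
            adjusted = f_a - sum(nested_in) / len(nested_in)  # len = P(T_a)
        else:
            adjusted = f_a
        scores[a] = math.log2(len(words)) * adjusted
    return scores

# Toy frequencies, invented for the example.
scores = c_values({
    "artery disease": 10,
    "carotid artery disease": 4,
    "coronary artery disease": 2,
})
```

With this definition a single-word candidate always scores zero, because log2(1) = 0, which fits the method's focus on multi-word terms.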

Concept Extraction

C/NC-Value sample results

output term                     c-nc value
web page                        1740.11
information retrieval           1274.14
search engine                   1103.99
machine learning                727.70
computer science                723.82
experimental result             655.125
text mining                     645.57
natural language processing     582.83
world wide web                  557.33
large number                    530.67
artificial intelligence         515.73
relevant document               468.22
similarity measure              464.64
information extraction          443.29
knowledge discovery             435.79

Ontology Learning Steps

Preprocessing

Concept Extraction

Taxonomy Induction

Non-taxonomic Relations

Taxonomy Induction

Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms

Two methods in OntoGain
  Agglomerative clustering
  Formal Concept Analysis (FCA)

Agglomerative Clustering

Proceeds bottom-up: at each step, the most similar clusters are merged

Initially each term is considered a cluster

Similarity between all pairs of clusters is computed

The most similar clusters are merged as long as they share terms with common heads

Group average for clusters, Dice-like formula for terms
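
The merging procedure described above can be sketched as follows. The slide does not give the exact Dice-like formula, so this sketch assumes the plain Dice coefficient over the word sets of two terms, takes the last word of a term as its head, and stops when no two clusters share a head. All function names and the toy term list are hypothetical.

```python
from itertools import combinations

def head(term):
    """Head of a multi-word term, assumed to be its last word."""
    return term.split()[-1]

def dice(t1, t2):
    """Dice coefficient over word sets (assumed form of the Dice-like formula)."""
    a, b = set(t1.split()), set(t2.split())
    return 2 * len(a & b) / (len(a) + len(b))

def group_average(c1, c2):
    """Average pairwise term similarity between two clusters."""
    return sum(dice(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def share_head(c1, c2):
    """True if some term in c1 has the same head as some term in c2."""
    return bool({head(t) for t in c1} & {head(t) for t in c2})

def agglomerate(terms):
    """Merge clusters bottom-up; return the final clusters and the merge log."""
    clusters = [(t,) for t in terms]  # initially each term is a cluster
    merges = []
    while True:
        pairs = [(group_average(a, b), a, b)
                 for a, b in combinations(clusters, 2) if share_head(a, b)]
        if not pairs:
            return clusters, merges
        sim, a, b = max(pairs, key=lambda p: p[0])  # most similar pair
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, sim))

clusters, merges = agglomerate([
    "hierarchical method", "agglomerative method", "clustering method",
    "web page", "home page",
])
```

The merge log records the order and similarity of each merge, which is the information a taxonomy builder would need to turn the clusters into a hierarchy.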





Formal Concept Analysis (FCA) [Ganter et al., 1999]

FCA relies on the idea that the objects (terms) are associated with their attributes (verbs)

Finds common attributes (verbs) between objects and forms object clusters that share common attributes

Formal concepts are connected with the sub-concept relationship

FCA Example

Takes as input a matrix showing associations between terms (concepts) and attributes (verbs)

Attributes (verbs): submit, test, describe, print, compute, search

Html form                  * * *
Hierarchical clustering    * *
Text retrieval             *
Root node                  * * * *
Single cluster             * * *
Web page                   * *

FCA Taxonomy

Formal concepts
  ({hierarchical clustering, root node, single cluster}, {compute, search})
  ({html form, web page}, {print, search})

Not all dependencies c, v are interesting
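
The formal concepts listed above can be derived from a term/verb context by closing every attribute subset. The sketch below is a brute-force illustration, not the algorithm used in OntoGain. The context is an assumption: only the cells implied by the two formal concepts on the slide are grounded in the source, and the remaining cells are filled in arbitrarily so that each term has the right number of marks.

```python
from itertools import combinations

def formal_concepts(context):
    """Return all (extent, intent) pairs of a context {object: set of attributes}."""
    objects = set(context)
    attributes = set().union(*context.values())

    def extent(attrs):
        # Objects that carry every attribute in attrs.
        return frozenset(o for o in objects if attrs <= context[o])

    def intent(objs):
        # Attributes shared by every object in objs.
        if not objs:
            return frozenset(attributes)
        return frozenset(set.intersection(*(context[o] for o in objs)))

    concepts = set()
    for size in range(len(attributes) + 1):
        for attrs in combinations(sorted(attributes), size):
            ext = extent(set(attrs))
            concepts.add((ext, intent(ext)))  # closure of the attribute set
    return concepts

# Hypothetical context: cell positions are partly guessed (see lead-in).
context = {
    "html form": {"submit", "print", "search"},
    "hierarchical clustering": {"compute", "search"},
    "text retrieval": {"describe"},
    "root node": {"test", "describe", "compute", "search"},
    "single cluster": {"test", "compute", "search"},
    "web page": {"print", "search"},
}
concepts = formal_concepts(context)
```

A concept (A1, B1) is a sub-concept of (A2, B2) when A1 is a subset of A2, so ordering the returned pairs by extent inclusion yields the taxonomy lattice. Enumerating all attribute subsets is exponential and only suitable for small contexts like this one.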






Non-Taxonomic Relations extraction phase

Concept Extraction

Taxonomy Induction

Non-Taxonomic Relations

Non-Taxonomic Relations

Concepts are also characterized by attributes and relations to other concepts in the hierarchy

Typically expressed by a verb relating a pair of concepts

Two approaches
  Association rules
  Probabilistic

Association Rules [Agrawal et al., 1993]

Introduced to predict the purchase behavior of customers

Extract terms connected with some relation subject-verb-object

Enhance with general terms from the taxonomy

Eliminate redundant relations:
  predictive accuracy < t
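
The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the OntoGain code: predictive accuracy is approximated by rule confidence, the taxonomy is a simple child-to-parent map without cycles, and the triples and counts are invented toy data. The names `ancestors` and `mine_relations` are hypothetical.

```python
from collections import Counter

def ancestors(concept, parent):
    """Return the concept followed by its chain of broader terms."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def mine_relations(triples, parent, min_conf):
    """Keep (domain, range, verb) rules whose confidence reaches min_conf."""
    pair_count, domain_count = Counter(), Counter()
    for subj, verb, obj in triples:
        # Enhance each triple with the general terms from the taxonomy.
        for d in ancestors(subj, parent):
            domain_count[d] += 1
            for r in ancestors(obj, parent):
                pair_count[(d, r, verb)] += 1
    rules = {}
    for (d, r, verb), n in pair_count.items():
        conf = n / domain_count[d]  # confidence of domain => (verb, range)
        if conf >= min_conf:        # rules below the threshold are eliminated
            rules[(d, r, verb)] = conf
    return rules

# Toy subject-verb-object triples, invented for the example.
triples = [
    ("chronic fatigue syndrome", "affect", "cardiac function"),
    ("chronic fatigue syndrome", "affect", "cardiac function"),
    ("chronic fatigue syndrome", "follow", "infection"),
]
parent = {"chronic fatigue syndrome": "syndrome"}
rules = mine_relations(triples, parent, min_conf=0.5)
```

Here the "affect" relation survives for both the specific term and its broader term "syndrome", while the single "follow" occurrence falls below the threshold.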

Association Rules: Example

Domain                          Range                           Label
chiasmal syndrome               pituitary disproportion         cause by
medial collateral ligament      surgical treatment              need
blood transfusion               antibiotic prophylaxis          result
lipid peroxidation              cardiopulmonary bypass          lead to
prostate specific antigen       prostatectomy                   follow
chronic fatigue syndrome        cardiac function                yield
right ventricular infraction    radionuclide ventriculography   analyze by
creatinine clearance            arteriovenous hemofiltration    achieve
cardioplegic solution           superoxide dismutase            give
bacterial translocation         antibiotic prophylaxis          decrease
accurate diagnosis              clinical suspicion              depend
ultrasound examination          clinical suspicion              give
total body oxygen consumption   epidural analgesia              attenuate by
coronary arteriography          physician                       perform by

Probabilistic approach [Cimiano et al., 2006]

Collect verbal relations from the corpus

Find the most general relation with respect to the verb, using frequency of occurrence
  Suffer_from(man, head_ache)
  Suffer_from(woman, stomach_ache)
  Suffer_from(patient, ache)

Select relationships satisfying a conditional probability measure

Associations > t become accepted
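
The generalization step in the Suffer_from example can be sketched as follows. This is a simplified reading, not the measure from [Cimiano et al., 2006]: it assumes that counts are propagated to broader concepts, that the conditional probability of a concept given the verb slot is its share of the observed fillers, and that the most specific concept above the threshold t is selected. The taxonomy map and the function names are hypothetical.

```python
from collections import Counter

def ancestors(concept, parent):
    """Return the concept followed by its chain of broader terms."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def generalize_slot(fillers, parent, t):
    """Pick the most specific concept c with P(c | verb slot) > t, or None."""
    counts = Counter()
    for filler in fillers:
        for concept in ancestors(filler, parent):
            counts[concept] += 1  # a filler also counts for its broader terms
    accepted = [c for c, n in counts.items() if n / len(fillers) > t]
    # A longer ancestor chain means a more specific concept.
    return max(accepted, key=lambda c: len(ancestors(c, parent)), default=None)

def generalize_relation(verb, pairs, parent, t):
    """Generalize the subject and object slots of one verb independently."""
    subjects = [s for s, _ in pairs]
    objects = [o for _, o in pairs]
    return (verb, generalize_slot(subjects, parent, t),
            generalize_slot(objects, parent, t))

# Assumed taxonomy fragment matching the slide's example.
parent = {"man": "patient", "woman": "patient",
          "head_ache": "ache", "stomach_ache": "ache"}
pairs = [("man", "head_ache"), ("woman", "stomach_ache")]
relation = generalize_relation("suffer_from", pairs, parent, t=0.6)
```

With t = 0.6 neither "man" nor "woman" covers enough of the observations (0.5 each), so the relation is lifted to "patient" and "ache", as on the slide.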


Evaluation

Relevance judgments are provided by humans

Precision - Recall

We examined the 200 top-ranked concepts and their respective relations in 500 lines

Results from OhsuMed & Computer Science corpus
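
Precision and recall against human relevance judgments can be computed as below. The function name and the toy term lists are assumptions for illustration and are not taken from the evaluation data.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) of extracted items against judged-relevant items."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy lists, invented for the example.
system_output = ["web page", "large number", "search engine", "text mining"]
judged_relevant = ["web page", "search engine", "text mining",
                   "machine learning", "information retrieval"]
precision, recall = precision_recall(system_output, judged_relevant)
```

In this toy case three of the four extracted terms are relevant, and three of the five relevant terms are found.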

Results

Processing Layer          Method                    Precision (OhsuMed)   Recall (OhsuMed)   Precision (Comp. Science)   Recall (Comp. Science)
Concept Extraction        C/NC-Value                89.7%                 91.4%              86.7%                       89.6%
Taxonomic Relations       Formal Concept Analysis   47.1%                 41.6%              44.2%                       48.6%
Taxonomic Relations       Hierarchical Clustering   71.2%                 67.3%              71.3%                       62.7%
Non-Taxonomic Relations   Association Rules         71.8%                 67.7%              72.8%                       61.7%
Non-Taxonomic Relations   Probabilistic             62.7%                 55.9%              61.6%                       49.4%

Comparison with Text2Onto [Cimiano & Volker, 2005]

Huge lists of plain single-word terms, and relations lacking semantic meaning

Text2Onto cannot work with big texts

Cannot export results in OWL

Conclusions

OntoGain
  Multi-word term concepts
  Exports ontology in OWL
  Domain independent

Results
  C/NC-Value yields good results
  Clustering outperforms FCA
  Association Rules perform better than Verbal Expressions

Future Work

Explore more methods / combinations
  e.g., clustering, FCA

Hearst patterns for discovering additional relation types (Part-of)

Discover attributes and cardinality constraints

Incorporate term similarity information from WordNet, MeSH

Resolve term ambiguities

Thank you!

Questions?

Preprocessing

Tokenization, POS tagging, Shallow parsing (OpenNLP suite)

Lemmatization (WordNet Java Library)

Applied to all steps of OntoGain

Shallow parsing is used in relations acquisition for the detection of verbal dependencies


Terms sharing a head tend to be similar
  e.g. hierarchical method and agglomerative method are both methods

Nested terms are related to each other
  e.g. agglomerative clustering method and clustering method should be associated