
JRC 2005/05/10

Automatic event extraction from text
on the basis of linguistic and semantic annotation


Thierry Declerck

DFKI


Language Technology Lab


Events …


Involve entities and relations between them


Imply a change of state


Example: The striker of Liverpool shot a wonderful goal in the 87th minute.


1 event (goal shot)


2 entities (person and team)


1 change of state (the scoring)


Events in textual documents


Various types of text


Structured (Example and Example_2): processing requires pattern matching techniques; very little linguistic knowledge is needed


Semi-structured (Example): requires a mixture of pattern matching and more linguistic knowledge


Unstructured (Example): requires a mixture of layout analysis and linguistic knowledge


All types of text require a domain-specific knowledge base (ontology) for event extraction


Domain Knowledge


Domain knowledge can be organised in terminologies, thesauri, taxonomies or ontologies. Example of a (non-formal) multilingual ontology for the soccer domain.


More on ontology engineering in the talk by Borislav


Automatic Event Extraction from Text is


A combination of human language technology (HLT) and semantic web technologies (ontologies)


It can also be done on the basis of purely statistical means (with minimal linguistic knowledge), but we concentrate here on the HLT-based approach


What is Human Language Technology?






Linguistic Analysis

Language technology tools are needed to support the upgrade of the current web to the Semantic Web (SW) by providing an automatic analysis of the linguistic structure of textual documents. Free-text documents that undergo linguistic analysis become available as semi-structured documents, from which meaningful units can be extracted automatically (information extraction) and organized through clustering or classification (text mining). Here we focus on the following linguistic analysis steps that underlie the extraction tasks: tokenization, morphological analysis, part-of-speech tagging, chunking, dependency structure analysis, semantic tagging.






Tokenisation

Tokenisation deals with the detection of the word units in a text and with the detection of sentence boundaries.


The markets acknowledge the measures taken on the 24th of September by the CEO of XYZ Corp.
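The two tasks named above, word unit detection and sentence boundary detection, can be sketched as follows. This is a minimal illustration, not the tokeniser used in the talk: the regular expression and the abbreviation list `ABBREVIATIONS` are assumptions made for the example.

```python
import re

# Hypothetical abbreviation list; a real tokeniser would need a much larger one.
ABBREVIATIONS = {"Corp.", "Inc.", "Dr.", "Mr."}

# A word (optionally hyphenated, optionally followed by a period)
# or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*\.?|[^\w\s]")


def tokenize(text):
    """Split text into word units, detaching periods except on known abbreviations."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if len(token) > 1 and token.endswith(".") and token not in ABBREVIATIONS:
            tokens.extend([token[:-1], "."])
        else:
            tokens.append(token)
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final punctuation token
        sentences.append(current)
    return sentences
```

On the example sentence, "Corp." stays a single token because it is in the abbreviation list, which also shows why sentence boundary detection is not trivial: the same period marks both the abbreviation and the end of the sentence.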





Morphological Analysis

Morphological analysis is concerned with the inflectional, derivational, and compounding processes in word formation in order to determine properties such as stem and inflectional information. Together with part-of-speech (PoS) information this process delivers the morpho-syntactic properties of a word.


For the German word Häusern (dative plural of Haus, "house"), the analysis should deliver the following morphological information:


[PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]
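A feature structure like the one above can be produced by a simple full-form lexicon lookup. The sketch below is only an illustration under that assumption: the `LEXICON` table is hand-written for this example, whereas real analysers typically derive such readings from stems and inflection rules. It returns a list because a word form can have several readings.

```python
# Hand-written toy lexicon (an assumption for this example).
LEXICON = {
    "häusern": [
        {"PoS": "N", "NUM": "PL", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
    ],
    # "Häuser" is ambiguous between nominative, genitive and accusative plural.
    "häuser": [
        {"PoS": "N", "NUM": "PL", "CASE": case, "GEN": "NEUT", "STEM": "HAUS"}
        for case in ("NOM", "GEN", "ACC")
    ],
}


def analyse(word):
    """Return all morphological readings of a word form (empty list if unknown)."""
    return LEXICON.get(word.lower(), [])


def format_reading(reading):
    """Render one reading in the bracketed attribute=value notation."""
    return "[" + " ".join(f"{key}={value}" for key, value in reading.items()) + "]"


for reading in analyse("Häusern"):
    print(format_reading(reading))
```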





Part-of-Speech Tagging

Part-of-Speech (PoS) tagging is the process of determining the correct syntactic class (a part of speech, e.g. noun, verb, etc.) for a particular word given its current context. The word “works” in the following sentences is either a verb or a noun (candidate tags in brackets):


He works [N, V; correct: V] the whole day for nothing.


His works [N, V; correct: N] have all been sold abroad.


PoS tagging involves disambiguating between multiple candidate part-of-speech tags, as well as guessing the correct part-of-speech tag for unknown words on the basis of context information.
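The disambiguation of "works" can be illustrated with a few hand-written context rules. This is a deliberately small sketch: the lexicon, the tag names `PRON`, `DET` and `ADJ`, and the rules themselves are assumptions for the example, and practical taggers usually learn such preferences from annotated corpora.

```python
# Toy lexicon with candidate tags per word (an assumption for this example).
LEXICON = {
    "he": ["PRON"],
    "his": ["DET"],
    "the": ["DET"],
    "whole": ["ADJ"],
    "day": ["N"],
    "works": ["N", "V"],
}


def tag(tokens):
    """Assign one tag per token, using the previous tag as context."""
    tags = []
    for token in tokens:
        candidates = LEXICON.get(token.lower())
        previous = tags[-1] if tags else None
        if candidates is None:
            # Unknown word: guess from context, defaulting to noun.
            choice = "V" if previous == "PRON" else "N"
        elif len(candidates) == 1:
            choice = candidates[0]
        elif previous == "PRON" and "V" in candidates:
            choice = "V"  # a verb is expected after a subject pronoun
        elif previous in {"DET", "ADJ"} and "N" in candidates:
            choice = "N"  # a noun is expected after a determiner or adjective
        else:
            choice = candidates[0]
        tags.append(choice)
    return list(zip(tokens, tags))


print(tag(["He", "works"]))
print(tag(["His", "works"]))
```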





Chunking

Chunks are sequences of words which are grouped on the basis of linguistic properties, such as nominal, prepositional, adjectival and adverbial phrases and verb groups.


[NP His works] [VG have] [NP all] [VG been sold] [AdvP abroad].
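A very simple way to obtain such chunks is to group adjacent words whose PoS tags belong to the same phrase type. The sketch below assumes PoS-tagged input and a hypothetical tag-to-chunk mapping `CHUNK_OF`. It would wrongly merge two adjacent noun phrases, so it only illustrates the idea.

```python
from itertools import groupby

# Hypothetical mapping from PoS tags to chunk types.
CHUNK_OF = {
    "DET": "NP", "ADJ": "NP", "N": "NP", "PRON": "NP",
    "AUX": "VG", "V": "VG",
    "ADV": "AdvP",
}


def chunk(tagged):
    """Group consecutive (word, tag) pairs that map to the same chunk type."""
    chunks = []
    for label, group in groupby(tagged, key=lambda pair: CHUNK_OF.get(pair[1], "O")):
        chunks.append((label, [word for word, _ in group]))
    return chunks


def render(chunks):
    """Render chunks in bracket notation."""
    return " ".join(f"[{label} {' '.join(words)}]" for label, words in chunks)


tagged = [("His", "DET"), ("works", "N"), ("have", "AUX"), ("all", "PRON"),
          ("been", "AUX"), ("sold", "V"), ("abroad", "ADV")]
print(render(chunk(tagged)))
```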






Named Entities detection

Related to chunking is the recognition of so-called named entities (names of institutions and companies, date expressions, etc.). The extraction of named entities is mostly based on a strategy that combines look-up in gazetteers (lists of companies, cities, etc.) with the definition of regular expression patterns. Named entity recognition can be included as part of the linguistic chunking procedure, and the following sentence fragment:


“…the secretary-general of the United Nations, Kofi Annan,…”

will be annotated as a nominal phrase, including two named entities: United Nations with named entity class organization, and Kofi Annan with named entity class person.
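The combination of gazetteer look-up and regular expression patterns can be sketched as follows. The gazetteer contents and the date pattern are assumptions for this example; overlapping or nested names are not handled.

```python
import re

# Tiny hypothetical gazetteer: entity name -> named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Date expressions such as "24th of September" (an assumed pattern).
DATE_RE = re.compile(rf"\b\d{{1,2}}(?:st|nd|rd|th)? of (?:{MONTHS})\b")


def find_entities(text):
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    for name, entity_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), entity_class))
    for match in DATE_RE.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(entity, entity_class) for _, entity, entity_class in sorted(found)]


print(find_entities("the secretary-general of the United Nations, Kofi Annan,"))
```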






Dependency Structure Analysis

A dependency structure consists of two or more linguistic units, one of which immediately dominates the others in a syntax tree. The detection of such structures is generally not provided by chunking but builds on top of it.

There are two main types of dependencies that are relevant for our purposes: on the one hand, the internal dependency structure of phrasal units or chunks, and on the other hand the so-called grammatical functions (like subject and direct object).





Internal Dependency Structure

In linguistic analysis, for this we use the terms head, complements and modifiers, where the head is the dominating node in the syntax tree of a phrase (chunk), complements are necessary qualifiers thereof, and modifiers are optional qualifiers. Consider the following example:


“The shot by Christian Ziege goes over the goal.”


The prepositional phrase “by Christian Ziege” (containing the named entity Christian Ziege) depends on (and modifies) the head noun “shot”.





Grammatical Functions

Grammatical functions determine the role (function) of each linguistic chunk in the sentence and make it possible to identify the actors involved in certain events. For example, in the following sentence the syntactic (and also the semantic) subject is the NP constituent “The shot by Christian Ziege”:


“The shot by Christian Ziege goes over the goal.”


This nominal phrase depends on (and complements) the verb “goes”, whereas the noun “shot” is the head of the NP (it is the shot that goes over the goal, not Christian Ziege!).
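The analysis of the example sentence can be written down as a small set of head-dependent relations. The following sketch is a hand-coded illustration, not parser output. The class and function names are hypothetical, and the label for "over the goal" is an assumption, since the slides do not classify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    head: str
    dependent: str
    relation: str


# Hand-written analysis of "The shot by Christian Ziege goes over the goal."
DEPENDENCIES = [
    Dependency("goes", "shot", "subject"),
    Dependency("shot", "by Christian Ziege", "modifier"),
    Dependency("goes", "over the goal", "complement"),  # label is an assumption
]


def dependents(head, relation=None):
    """Return the dependents of a head, optionally filtered by relation."""
    return [d.dependent for d in DEPENDENCIES
            if d.head == head and (relation is None or d.relation == relation)]


def subject_head_of(verb):
    """Return the head word of the verb's subject, or None if there is none."""
    subjects = dependents(verb, "subject")
    return subjects[0] if subjects else None


print(subject_head_of("goes"))         # the head noun of the subject NP
print(dependents("shot", "modifier"))  # the PP containing the named entity
```

Keeping the named entity inside a modifier of "shot" is what prevents an extraction rule from treating Christian Ziege as the thing that goes over the goal.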








Semantic Tagging

Automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction, which require a certain level of semantic analysis. Semantic tagging consists in the annotation of each content word in a document with a semantic category. Semantic categories are assigned on the basis of semantic resources such as WordNet for English, or EuroWordNet, which links words across many European languages through a common inter-lingua of concepts.






Semantic Resources

Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine.

They can be roughly distinguished into the following three groups:


Thesauri: Semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc. (like Roget)


Semantic Lexicons: Semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet)


Semantic Networks: Semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain)





The MeSH Thesaurus

MeSH (Medical Subject Headings) is a thesaurus for indexing articles and books in the medical domain, which may then be used for searching MeSH-indexed databases. MeSH provides for each term a number of term variants that refer to the same concept. It currently includes a vocabulary of over 250,000 terms. The following is a sample entry for the term gene library (MH is the term itself, ENTRY are term variants):


MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
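For indexing and search, such an entry is typically used to map every term variant back to its main heading. A minimal sketch of that look-up, using only the variants listed in the sample entry; the function names and the case-insensitive matching are assumptions for this example.

```python
# The sample entry from above: main heading (MH) -> term variants (ENTRY).
MESH_ENTRIES = {
    "Gene Library": ["Bank, Gene", "Banks, Gene", "DNA Libraries", "Gene Bank"],
}


def build_variant_index(entries):
    """Map each heading and each of its variants (lower-cased) to the heading."""
    index = {}
    for heading, variants in entries.items():
        for term in [heading, *variants]:
            index[term.lower()] = heading
    return index


VARIANT_INDEX = build_variant_index(MESH_ENTRIES)


def main_heading(term):
    """Return the main heading for a term or variant, or None if unknown."""
    return VARIANT_INDEX.get(term.strip().lower())


print(main_heading("gene bank"))
```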






The WordNet Semantic Lexicon

WordNet has primarily been designed as a computational account of the human capacity of linguistic categorization and covers an extensive set of semantic classes (called synsets). Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. Synsets are actually not made up of lexical items, but rather of lexical meanings (i.e. senses).






WordNet: An Example


The word 'tree' has two meanings that roughly correspond to the classes of plants and that of diagrams, each with their own hierarchy of classes that are included in more general super-classes:


09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378438 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0

10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
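The two hierarchies can be represented as synsets linked to their super-class, which is enough to look up the senses of a word and walk up to the most general class. The sketch below copies the identifiers and members from the listing above; the data layout and function names are assumptions for this example and not the WordNet database format.

```python
# Synset id -> members, copied from the listing above.
SYNSETS = {
    "09396070": ["tree"],
    "09395329": ["woody_plant", "ligneous_plant"],
    "09378438": ["vascular_plant", "tracheophyte"],
    "00008864": ["plant", "flora", "plant_life"],
    "00002086": ["life_form", "organism", "being", "living_thing"],
    "00001740": ["entity", "something"],
    "10025462": ["tree", "tree_diagram"],
    "09987563": ["plane_figure", "two-dimensional_figure"],
    "09987377": ["figure"],
    "00015185": ["shape", "form"],
    "00018604": ["attribute"],
    "00013018": ["abstraction"],
}

# Synset id -> id of its immediate super-class.
HYPERNYM = {
    "09396070": "09395329", "09395329": "09378438", "09378438": "00008864",
    "00008864": "00002086", "00002086": "00001740",
    "10025462": "09987563", "09987563": "09987377", "09987377": "00015185",
    "00015185": "00018604", "00018604": "00013018",
}


def senses(word):
    """Return the ids of all synsets that contain the word."""
    return [sid for sid, members in SYNSETS.items() if word in members]


def hypernym_path(synset_id):
    """Follow super-class links from a synset up to its top-level class."""
    path = [synset_id]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


for sense in senses("tree"):
    print(" < ".join(SYNSETS[sid][0] for sid in hypernym_path(sense)))
```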


What is the Semantic Web


“The Semantic Web is a new initiative to transform the web into a structure that supports more intelligent querying and browsing, both by machines and by humans. This transformation is to be supported through the generation and use of metadata constructed via web annotation tools using user-defined ontologies that can be related to one another.”


Somewhere on the web



Semantic Web


[Figure: Semantic Web components: Web-Page Annotation Tool, Ontology Construction Tool, Ontology Articulation Toolkit, Ontologies, Annotated Web Pages, Metadata Repository, Inference Engine, Community Portal, Agents, End User. Based on www.semanticweb.org]


Extracting Events from Structured Documents


Detecting Metadata in our Example:


Type of game: N/A


Teams involved: England - Deutschland


Players: Deutschland: Kahn (2) - Matthaeus (3) - Babbel (3,5),


Final (and intermediate) score: 1:0 (0:0)


Referee: Schiedsrichter: Collina, Pierluigi (Viareggio)


Date: N/A


Etc…


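Extracting Events from Structured Documents (2)


Detecting Events in our Example:


Substitution: Eingewechselt: 61. Gerrard fuer Owen,


Goal: Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)


Cards: Gelbe Karten: Beckham - Babbel, Jeremies


(German source strings: Eingewechselt = substituted in, fuer = for, Tore = goals, Kopfball = header, Vorarbeit = assist, Gelbe Karten = yellow cards.)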
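Because the report is structured, these events can be picked out with plain pattern matching. The sketch below applies regular expressions to the three German source lines shown above. The patterns and the output field names are assumptions for this example, not the patterns used in the project.

```python
import re

# Patterns for the German match report lines (assumed for this example).
SUB_RE = re.compile(r"(\d+)\.\s+(\w+)\s+fuer\s+(\w+)")
GOAL_RE = re.compile(r"(\d+):(\d+)\s+(\w+)\s+\((\d+)\.,\s*(\w+),\s*Vorarbeit\s+(\w+)\)")
NAME_RE = re.compile(r"[A-Z]\w+")


def extract_events(lines):
    """Turn labelled report lines into a list of event dictionaries."""
    events = []
    for line in lines:
        label, _, rest = line.partition(":")
        label = label.strip()
        if label == "Eingewechselt":
            for minute, player_in, player_out in SUB_RE.findall(rest):
                events.append({"type": "substitution", "minute": int(minute),
                               "in": player_in, "out": player_out})
        elif label == "Tore":
            for home, away, scorer, minute, how, assist in GOAL_RE.findall(rest):
                events.append({"type": "goal", "score": f"{home}:{away}",
                               "scorer": scorer, "minute": int(minute),
                               "how": how, "assist": assist})
        elif label == "Gelbe Karten":
            for player in NAME_RE.findall(rest):
                events.append({"type": "yellow_card", "player": player})
    return events


report = [
    "Eingewechselt: 61. Gerrard fuer Owen,",
    "Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)",
    "Gelbe Karten: Beckham - Babbel, Jeremies",
]
for event in extract_events(report):
    print(event)
```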



Results in XML


Events (and entities and relations) are extracted automatically from structured text, on the basis of patterns (DTD) of typical expressions and the soccer ontology. Example and Example_2


Since the various results are available in XML files, they can be merged automatically, guided by the ontology. Example. This supports an incremental and dynamic extraction.
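Merging two XML result files can be sketched with the standard library. In this illustration the element and attribute names (`match`, `event`, `type`, `minute`) are hypothetical, and the rule that two events are the same if type and minute agree stands in for the identity criteria that the ontology would provide.

```python
import xml.etree.ElementTree as ET


def merge_events(xml_a, xml_b, key_attrs=("type", "minute")):
    """Merge <event> elements from two XML strings, unifying events with equal keys."""
    merged = {}
    for source in (xml_a, xml_b):
        for event in ET.fromstring(source).iter("event"):
            key = tuple(event.get(attr) for attr in key_attrs)
            # Attributes found in a later source extend the earlier ones.
            merged.setdefault(key, {}).update(event.attrib)
    root = ET.Element("match")
    for attrs in merged.values():
        ET.SubElement(root, "event", attrs)
    return root


first = '<match><event type="goal" minute="53" scorer="Shearer"/></match>'
second = ('<match><event type="goal" minute="53" assist="Beckham"/>'
          '<event type="substitution" minute="61" player_in="Gerrard"/></match>')
print(ET.tostring(merge_events(first, second), encoding="unicode"))
```

Each new document can only add attributes or events to the merged result, which is one way to read "incremental" extraction.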


Extracting Events from Semi-Structured Documents


Linguistic processing is needed to provide a basic structure of the document, which allows the domain-specific annotation. Example.


Extracting Events from Semi-Structured Documents (2)


The results from the semantic annotation of the structured documents are used as well, supporting incremental extraction: Example.


Current Development


Extracting information from multilingual balance sheets (WINS eTen project), extending this to unstructured text and extracting relations and events from annexes to balance sheets (upcoming project MUSING).


Detecting positive/negative mentions of entities in news documents (project Direct-Info on Media Monitoring). Example.


Further Challenge for HLT


Use HLT not only for the semantic annotation of web pages (or other documents), but also for supporting ontology extraction/learning from the web (or other documents)


Example of semantic relation extraction in bio-medicine


[Rheumatoid arthritis] [is characterized] [by progressive synovial inflammation and joint destruction] [.]
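From a chunked sentence of this kind, a relation can be read off by matching the verb group between two chunks. The sketch below handles only the "is characterized by" pattern of this example; the relation name `characterized_by` and the splitting of the coordinated object on "and" are simplifying assumptions.

```python
import re

CHUNK_RE = re.compile(r"\[([^\]]+)\]")


def extract_relations(chunked):
    """Extract (subject, relation, object) triples from a bracketed chunk string."""
    chunks = [chunk.strip() for chunk in CHUNK_RE.findall(chunked)]
    relations = []
    for left, verb, right in zip(chunks, chunks[1:], chunks[2:]):
        if verb == "is characterized" and right.startswith("by "):
            # Naive split of the coordination into separate objects.
            for obj in right[3:].split(" and "):
                relations.append((left, "characterized_by", obj.strip()))
    return relations


sentence = ("[Rheumatoid arthritis] [is characterized] "
            "[by progressive synovial inflammation and joint destruction] [.]")
for triple in extract_relations(sentence):
    print(triple)
```

Triples of this form are candidate material for ontology learning, since they link a disease term to the terms that characterize it.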





Open issues for HLT and SW


To achieve better coordination for improving semantic annotation results


Development and use of standards for interrelated linguistic and semantic annotation (see the eContent project LIRICS for standards for language resources)


Interoperable Standards?


Thank you!