Papers for today
• Collaboratively built semi-structured content and Artificial Intelligence: The story so far
– Hovy, Navigli, Ponzetto
• YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia
– Hoffart, Suchanek, Berberich, Weikum
Collaboratively built semi-structured content
• Main characteristics of collaborative resources that make them attractive for AI and NLP research
• Semi-structured resources enable a renaissance of knowledge-rich AI techniques
Unstructured, structured and semi-structured resources
• Unstructured
– Strengths: easy to harvest at very large scale, many domains, many styles, many languages…
– Limitations: knowledge acquisition bottleneck (for complex inference chains), degree and quality of ontologization
• Structured (e.g. ontologies…)
– Strengths: high quality, beneficial for all kinds of intelligent applications
– Limitations: creation and maintenance effort, low coverage, keeping information up to date, the language barrier
• Semi-structured
– Strengths: high quality and coverage, up-to-date and multilingual
Semi-structured resources
• Wikipedia, Wiktionary, Twitter, Yahoo! Answers
• Wikipedia
– relies on large amounts of manually input knowledge
– provided via massive online collaboration
– on the basis of semi-structured (i.e., free-form markup) content
• Structure given by redirection pages, internal hyperlinks, interlanguage links, category pages, infoboxes
– Markup annotations indirectly encode semantic content and, thus, world and linguistic knowledge manually input by human editors
Filling the knowledge gap
• Transforming semi-structured content into machine-readable knowledge
• Generating semantics by exploiting the shallow structure found in Wikipedia
• Acquiring related terms: thesaurus extraction
• Is-a relation: taxonomy induction
• Relation extraction
– sentence processing combined with hyperlink information, use of infoboxes
Filling the knowledge gap
• Ontologization: building and enriching ontologies (YAGO2)
– More relations (meronomy, domain-specific…)
– Exploiting structure. Some of the methods quantify semantic distances using a relatedness measure computed on the Wikipedia hyperlink graph
• A heuristic renaissance: High-quality, semi-structured content enables the acquisition of machine-readable knowledge on a large scale by means of heuristic methods which essentially leverage regularities found within its shallow structure.
– Lightweight and scalable rule-based approaches can be devised to exploit the conventions governing the editorial base of collaboratively generated resources, and capture large amounts of semantic information hidden within them.
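As a rough illustration of a relatedness measure computed on a hyperlink graph, the sketch below scores two articles by the overlap of the pages that link to them. The Jaccard overlap is a stand-in chosen for simplicity, not the measure used by any particular paper; the link data and the name `link_relatedness` are invented for the example.

```python
# Toy link graph: article title -> set of articles that link TO it (inlinks).
# All links here are made up for illustration.
INLINKS = {
    "Bob Dylan": {"Folk music", "Grammy Award", "Duluth, Minnesota"},
    "Leonard Cohen": {"Folk music", "Grammy Award", "Montreal"},
    "Louvre": {"Paris", "Mona Lisa"},
}


def link_relatedness(a: str, b: str) -> float:
    """Jaccard overlap of the two articles' inlink sets, in [0, 1]."""
    links_a = INLINKS.get(a, set())
    links_b = INLINKS.get(b, set())
    union = links_a | links_b
    if not union:
        return 0.0
    return len(links_a & links_b) / len(union)


print(link_relatedness("Bob Dylan", "Leonard Cohen"))  # 2 shared inlinks out of 4
print(link_relatedness("Bob Dylan", "Louvre"))         # no shared inlinks
```

Articles that are linked from many of the same pages score close to 1; articles with disjoint inlink sets score 0.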
Filling the knowledge gap
• Named Entity Recognition
• Named Entity Disambiguation (associate name with appropriate reference)
• Word Sense Disambiguation
• Wikification: bringing Entity and Word Sense Disambiguation together
– keyword extraction combined with lexical disambiguation: given an input document, a wikification system identifies the most important terms in the document and links (i.e., disambiguates) them to their appropriate entries within an external encyclopedic resource, typically Wikipedia.
Filling the knowledge gap
• Computing semantic relatedness: quantifying the strength of association between words.
• And beyond the sentence level:
• Document clustering and text categorization
• Question Answering
– YAGO2 includes an extrinsic evaluation of the quality of Wikipedia on the task of answering spatio-temporal questions
Filling the knowledge gap
• Information Retrieval
– The repository of disambiguated concepts found in Wikipedia (i.e., its articles) provides a semantic space into which documents and queries can be projected in order to perform semantic retrieval beyond the simple bag-of-words model
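A minimal sketch of projecting text into a space of concepts, assuming each concept is represented by the words of its article. The two toy "articles", the overlap-count weighting, and the names `project` and `cosine` are illustrative assumptions; real systems use full Wikipedia articles and weighted term statistics.

```python
import math
from collections import Counter

# Toy "articles": concept name -> article text. Contents are invented.
CONCEPTS = {
    "Music": "song album band guitar singer concert",
    "Geography": "river mountain country city capital border",
}


def project(text: str) -> list[float]:
    """Map a text to one weight per concept: how many of its tokens occur in the article."""
    tokens = Counter(text.lower().split())
    return [
        float(sum(tokens[word] for word in set(article.split())))
        for article in CONCEPTS.values()
    ]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two concept vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query = project("guitar concert")
doc_a = project("the singer released a new album")  # shares no word with the query
doc_b = project("the river crosses the country")

print(cosine(query, doc_a), cosine(query, doc_b))
```

The query and `doc_a` have no word in common, so a bag-of-words match would miss the pair, yet both project onto the same concept.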
Exploiting updated content from revision history
• Language generation
• Leveraging Wikipedia’s revision history as a source of data in order to automatically acquire sentence rewriting models.
Exploiting updated content from revision history
• Rewriting tasks: sentence compression, text simplification and targeted paraphrasing
• Summarization
– ??
The tower of Babel: multilingual resources and applications
• Wikipedia’s multilinguality (namely, the availability of interlinked Wikipedias in different languages) enables the acquisition of very large, wide-coverage repositories of multilingual knowledge.
• Multilingual taxonomies and ontologies
• Parallel corpora and thesauri
Some Questions
• Tease out the collaborative vs. semi-structured aspects
• Collaborative
– Over the past decade, a variety of proposals (MindPixel and Open Mind) have tried to make manual knowledge acquisition feasible by collecting input from volunteers. See also the work of von Ahn, which aims at acquiring knowledge from users by means of online games. However, none of these efforts, to date, has succeeded in producing truly wide-coverage resources able to compete with standard manual resources.
– Why? Why do people like to collaborate on Wikipedia and not (as much) on other projects?
– What makes Wikipedia so attractive and how can one try to “copy” from it to encourage other collaborative efforts?
Some Questions
• Semi-structured
• Wikipedia, Wiktionary, Twitter, Yahoo! Answers
– What aspects of the structures are most important?
• Other resources that have similar structure, if not the collaborative aspects?
– Newspapers?
– Forums?
• Use revision history to discover something about the contributors?
YAGO2
• Knowledge base in which entities, facts, and events are anchored in both time and space.
• YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet.
• It contains 447 million facts about 9.8 million entities.
• The paper describes the extraction methodology, the integration of the spatio-temporal dimension, and SPOTL, a knowledge representation that includes time and space
Time and space
• The goal is to know not only that a fact is true, but also when and where it was true.
– Presidents of countries or CEOs of companies change. Even capitals of countries or spouses are not necessarily forever…
– The geographical location is a crucial property not just of physical entities such as countries, mountains, or rivers, but also of organization headquarters, or events such as battles, fairs, or people’s births.
Contributions
• Integrate entity-relationship-oriented facts with the spatial and temporal dimensions.
• Extensible framework for fact extraction (from Wikipedia and other sources) that can tap into infoboxes, lists, tables, categories, and regular patterns in free text, and allows fast and easy specification of new extraction rules
• Knowledge representation model tailored to capture time and space, as well as rules for propagating time and location information to all relevant facts
• New representation model, SPOTL tuples (SPO + Time + Location), with expressive and easy-to-use querying
– SPO triples: subject-property-object triples
YAGO
• The YAGO knowledge base is automatically constructed from Wikipedia.
• Each article in Wikipedia becomes an entity in the knowledge base (e.g., since Leonard Cohen has an article in Wikipedia, LeonardCohen becomes an entity in YAGO)
• 100 manually defined relations (wasBornOnDate, locatedIn…)
• 2 million entities and 20 million facts.
• Facts: triples of an entity, a relation, and another entity (wasBornIn(LeonardCohen, Montreal))
– SPO triples of subject (S), predicate (P), and object (O), compatible with the RDF data model (Resource Description Framework)
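To make the triple representation concrete, here is a small sketch of facts stored as subject-predicate-object tuples with a simple lookup. The `Fact` type and `objects_of` helper are hypothetical names for illustration; YAGO itself is distributed as data files, not as this Python structure.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One SPO triple: subject (S), predicate (P), object (O)."""
    subject: str
    predicate: str
    obj: str


# The example fact from the slide: wasBornIn(LeonardCohen, Montreal).
FACTS = [
    Fact("LeonardCohen", "wasBornIn", "Montreal"),
]


def objects_of(subject: str, predicate: str) -> list[str]:
    """Return every object O such that (subject, predicate, O) is a known fact."""
    return [f.obj for f in FACTS if f.subject == subject and f.predicate == predicate]


print(objects_of("LeonardCohen", "wasBornIn"))
```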
YAGO2 Extraction Architecture
• The YAGO2 architecture is based on declarative rules that are stored in text files.
• The rules take the form of subject-predicate-object triples, so that they are basically additional YAGO2 facts.
– Extraction rules say that if a part of the source text matches a specified regular expression, a sequence of facts shall be generated.
• The rules apply to Wikipedia infoboxes, but also to Wikipedia categories, article titles, headings, links, or references.
• The extraction rules cover some 200 infobox patterns, some 90 category patterns, and around a dozen patterns for dealing with disambiguation pages.
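The idea of "regular expression matches, so facts are generated" can be sketched as follows. This is only an illustration under simplifying assumptions: the rule format, the two patterns, the `extract` function, and the simplified infobox text are all made up here and are not YAGO2's actual rule syntax.

```python
import re

# Hypothetical rules: (pattern over infobox text, predicate of the fact to emit).
RULES = [
    (re.compile(r"\|\s*birth_place\s*=\s*\[\[([^\]|]+)"), "wasBornIn"),
    (re.compile(r"\|\s*birth_date\s*=\s*(\d{4}-\d{2}-\d{2})"), "wasBornOnDate"),
]


def extract(entity: str, infobox_text: str) -> list[tuple[str, str, str]]:
    """Apply every rule; each regex match yields one (entity, predicate, value) fact."""
    facts = []
    for pattern, predicate in RULES:
        for match in pattern.finditer(infobox_text):
            facts.append((entity, predicate, match.group(1).strip()))
    return facts


# Simplified infobox text (real infoboxes use richer templates for dates).
INFOBOX = """{{Infobox person
| name = Elvis Presley
| birth_date = 1935-01-08
| birth_place = [[Tupelo]], Mississippi
}}"""

for fact in extract("ElvisPresley", INFOBOX):
    print(fact)
```

Keeping the rules as data rather than code mirrors the slide's point that new extraction rules can be specified quickly without touching the extractor.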
Time in YAGO2
• YAGO2 contains a data type yagoDate that denotes time points, typically with a resolution of days but sometimes with cruder resolution like years.
• YAGO2 assigns begin and/or end of time spans to all entities, to all facts, and to all events, if they have a known start point or a known end point.
Entities and Time
• Entities are assigned a time span to denote their existence in time. Four major entity types:
• People
– relations wasBornOnDate and diedOnDate demarcate their existence times
– Elvis Presley is associated with 1935-01-08 as his birthdate and 1977-08-16 as his time of death. Bob Dylan is associated only with the time of birth, 1941-05-24
• Groups such as music bands, football clubs, universities, or companies
– the relations wasCreatedOnDate and wasDestroyedOnDate demarcate their existence times
• Artifacts such as buildings, paintings, books, music songs or albums
– wasCreatedOnDate and wasDestroyedOnDate (e.g., for buildings or sculptures)
• Events such as wars, sports competitions like Olympics or world championship tournaments, or named epochs like the “German autumn”
– startedOnDate and endedOnDate demarcate their existence times
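The mapping from entity type to the pair of relations that bound its existence time can be sketched as a lookup table. The relation names and the dates come from the slide; the dictionary layout and the `existence_span` helper are illustrative assumptions.

```python
from typing import Optional

# Begin/end relations per entity type, as listed on the slide.
EXISTENCE_RELATIONS = {
    "person": ("wasBornOnDate", "diedOnDate"),
    "group": ("wasCreatedOnDate", "wasDestroyedOnDate"),
    "artifact": ("wasCreatedOnDate", "wasDestroyedOnDate"),
    "event": ("startedOnDate", "endedOnDate"),
}

# (entity, relation) -> date, using the two examples from the slide.
DATES = {
    ("ElvisPresley", "wasBornOnDate"): "1935-01-08",
    ("ElvisPresley", "diedOnDate"): "1977-08-16",
    ("BobDylan", "wasBornOnDate"): "1941-05-24",
}


def existence_span(entity: str, entity_type: str) -> tuple[Optional[str], Optional[str]]:
    """Return (begin, end) of the entity's existence; None where no date is known."""
    begin_rel, end_rel = EXISTENCE_RELATIONS[entity_type]
    return DATES.get((entity, begin_rel)), DATES.get((entity, end_rel))


print(existence_span("ElvisPresley", "person"))
print(existence_span("BobDylan", "person"))  # open-ended: only the begin is known
```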
Facts and Time
• The YAGO2 extractors can find occurrence times of facts from the Wikipedia infoboxes.
• Example: BobDylan wasBornIn Duluth is an event that happened in 1941
• Two new relations, occursSince and occursUntil
• If the same fact occurs more than once, then YAGO2 will contain it multiple times with different ids. For example, since Bob Dylan has won two Grammy awards, we would have #1: BobDylan hasWonPrize GrammyAward with #1 occursOnDate 1973, and a second #2: BobDylan hasWonPrize GrammyAward (with a different id) and the associated fact #2 occursOnDate 1979.
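The fact-id mechanism in the Grammy example can be sketched as two tables: facts keyed by id, and meta-facts whose subject is a fact id. The data mirrors the slide; the container layout and the `occurrence_dates` helper are assumptions made for illustration.

```python
# Facts keyed by id: the same triple appears twice, once per occurrence.
FACTS = {
    1: ("BobDylan", "hasWonPrize", "GrammyAward"),
    2: ("BobDylan", "hasWonPrize", "GrammyAward"),
}

# Meta-facts: the subject is a fact id rather than an entity.
META_FACTS = [
    (1, "occursOnDate", "1973"),
    (2, "occursOnDate", "1979"),
]


def occurrence_dates(subject: str, predicate: str, obj: str) -> list[str]:
    """Collect the occursOnDate values of every fact id carrying this triple."""
    ids = {fid for fid, triple in FACTS.items() if triple == (subject, predicate, obj)}
    return sorted(
        date for fid, rel, date in META_FACTS if fid in ids and rel == "occursOnDate"
    )


print(occurrence_dates("BobDylan", "hasWonPrize", "GrammyAward"))
```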
Space
• All physical objects have a location in space.
• YAGO2 is concerned with entities that have a permanent spatial extent on Earth, for example countries, cities, mountains, and rivers.
• New class yagoGeoEntity, which groups together all geo-entities
– Subclasses of yagoGeoEntity are: location, body of water, geological formation, real property, facility, excavation, structure, track…
– The position of a geo-entity can be described by geographical coordinates, latitude and longitude
• YAGO2 harvests geo-entities from two sources: Wikipedia and GeoNames
– GeoNames has information on location hierarchies (partOf), e.g. Berlin is located in Germany, which is located in Europe, and provides alternate names for each location, as well as neighboring countries
Entities and Location
• Events
– Can take place at a specific location, such as battles or sports competitions, where the relation happenedIn holds the place where it happened.
• Groups or organizations
– Can have a venue, such as the headquarters of a company or the campus of a university. The location for such entities is given by the isLocatedIn relation.
• Artifacts that are physically located somewhere
– E.g., the Mona Lisa in the Louvre, where the location is again given by isLocatedIn.
SPOTL(X)-View Model
• SPOTLX 6-tuples
– SPO triples augmented by Time and Location and keywords or key phrases from the conteXt of sources where the original SPO fact occurs
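A 6-tuple of this shape can be sketched as a record with optional time, location, and context fields, plus a filter that treats unspecified conditions as wildcards. The field names, the context keywords, and the `query` function are hypothetical; this is not the SPOTLX query language itself.

```python
from typing import NamedTuple, Optional


class SPOTLX(NamedTuple):
    s: str                   # subject
    p: str                   # predicate
    o: str                   # object
    time: Optional[str]      # T: when the fact holds, if known
    location: Optional[str]  # L: where the fact holds, if known
    context: frozenset       # X: keywords from the source context


# Context keywords below are invented for illustration.
TUPLES = [
    SPOTLX("BobDylan", "wasBornIn", "Duluth", "1941", "Duluth", frozenset({"singer"})),
    SPOTLX("MonaLisa", "isLocatedIn", "Louvre", None, "Louvre", frozenset({"painting"})),
]


def query(tuples, *, p=None, time=None, location=None, keyword=None):
    """Return tuples matching every given condition; None means 'any value'."""
    return [
        t for t in tuples
        if (p is None or t.p == p)
        and (time is None or t.time == time)
        and (location is None or t.location == location)
        and (keyword is None or keyword in t.context)
    ]


print(query(TUPLES, location="Louvre"))
print(query(TUPLES, time="1941", keyword="singer"))
```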
Size of YAGO2: entities
Size of YAGO2: facts
Evaluation
• Of extraction of facts from Wikipedia
Task-Based Evaluation
• Answering Spatio-Temporal Questions
• 15 questions of the GeoCLEF 2008 GiKiP Pilot
– The original intent of the GeoCLEF GiKiP Pilot is: “Find Wikipedia entries / articles that answer a particular information need which requires geographical reasoning of some sort.”
• 4 questions working perfectly;
• 3 questions working when relaxing a geographical condition from structural to keyword conditions, resulting in a less precise but still useful result set;
• 6 questions that could be well formulated as SPOTLX queries but did not return any good result because of the limited coverage of the knowledge base;
• 2 questions that could not be properly formulated at all.
• A sample of temporal and spatial question blocks from Jeopardy!
Evaluation on Jeopardy
Improving Named Entity Disambiguation by Spatio-Temporal Knowledge
• “Dylan performed Hurricane about the black fighter Carter, from his album Desire. That album also contains a duet with Harris in the song Joey.”
– Here, the tokens “song”, “album”, and “performed” are strong cues for Joey (Bob Dylan song) instead of the TV series
Spatial Coherence
• Two entities that are geographically close to each other are a coherent pair, based on the intuition that texts or text passages (news, blog postings, etc.) usually talk about a single geographic region.
• Spatial Coherence is defined between two entities e1, e2 ∈ E with geo-coordinates, where E is the set of all candidates for mapping mentions in a text to canonical entities
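The slide does not show the spatial coherence formula, so the following is only a sketch of one plausible form: one minus the great-circle distance between the two entities, normalized by the largest distance between any two candidates. That normalization is an assumption made by analogy with the description of temporal coherence; the function names and the approximate city coordinates are illustrative.

```python
import math
from itertools import combinations


def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def spatial_coherence(e1: str, e2: str, coords: dict[str, tuple[float, float]]) -> float:
    """ASSUMED form: 1 - dist(e1, e2) / max distance over all candidate pairs."""
    max_dist = max(
        (haversine_km(coords[a], coords[b]) for a, b in combinations(coords, 2)),
        default=0.0,
    )
    if max_dist == 0.0:
        return 1.0  # a single candidate, or all candidates at the same place
    return 1.0 - haversine_km(coords[e1], coords[e2]) / max_dist


# Candidate set E with approximate coordinates.
E = {
    "Berlin": (52.52, 13.405),
    "Paris": (48.8566, 2.3522),
    "Sydney": (-33.8688, 151.2093),
}

print(spatial_coherence("Berlin", "Paris", E))   # nearby pair: close to 1
print(spatial_coherence("Berlin", "Sydney", E))  # distant pair: close to 0
```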
Temporal Coherence
• Defined between two entities e1, e2 ∈ E with existence time, in terms of cet(), the center of an entity’s existence time interval; the denominator of the formula normalizes the distance by the maximum distance of any two entities in the current set of entity candidates, ei, ej ∈ E.
• The intuition is that a text usually mentions entities that are clustered around a single or a few points in time
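The formula itself is not present in the slide text. One form consistent with the description above (distance between interval centers, normalized by the maximum such distance over the candidate set) would be the following; the name tc and the leading "1 −", which turns a normalized distance into a coherence score, are assumptions rather than something the slide states.

```latex
\mathrm{tc}(e_1, e_2) \;=\; 1 \;-\;
  \frac{\lvert \mathrm{cet}(e_1) - \mathrm{cet}(e_2) \rvert}
       {\max_{e_i, e_j \in E} \lvert \mathrm{cet}(e_i) - \mathrm{cet}(e_j) \rvert}
```

Under this reading, two entities whose existence intervals are centered on the same date score 1, and the most distant pair in the candidate set scores 0.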
Named Entity Disambiguation
• Calculate Spatial and Temporal Coherence between the mention in the input text and all candidate entities in the knowledge base
• Use both in the weighted formula for the entities’ relatedness