Papers for today

italiansaucy · Software & software engineering · 13 Dec 2013

Papers for today


Collaboratively built semi-structured content and Artificial Intelligence: The story so far


Hovy, Navigli, Ponzetto


YAGO2: A Spatially and Temporally Enhanced
Knowledge Base from
Wikipedia


Hoffart, Suchanek, Berberich, Weikum



Collaboratively built semi-structured content



Main characteristics of collaborative resources that make them attractive for AI and NLP research


Semi-structured resources enable a renaissance of knowledge-rich AI techniques

Unstructured, structured and semi-structured resources


Unstructured


Strengths: easy to harvest at very large scale, many
domains, many styles, many languages…


Limitations: knowledge acquisition bottleneck (for complex inference chains), degree and quality of ontologization



Structured (e.g. ontologies…)


Strengths: high quality, beneficial for all kinds of intelligent applications.


Limitations: creation and maintenance effort, low coverage, keeping information up to date, the language barrier


Semi-structured


Strengths: high quality and coverage, up-to-date and multilingual


Semi-structured resources


Wikipedia, Wiktionary, Twitter, Yahoo! Answers


Wikipedia


relies on large amounts of manually input knowledge


provided
via massive online
collaboration


on the basis of semi-structured (i.e., free-form markup) content


Structure given by redirection pages, internal hyperlinks, interlanguage links, category pages, infoboxes


Markup annotations indirectly encode semantic
content and, thus, world and linguistic knowledge
manually input by human editors



Filling the knowledge gap


Transforming semi-structured content into machine-readable knowledge


Generating semantics by exploiting the shallow structure found in Wikipedia


Acquiring related terms: thesaurus extraction


Is-a relation: taxonomy induction


Relation extraction


sentence processing combined with hyperlink information, use of infoboxes

Filling the knowledge gap


Ontologization: building and enriching ontologies (YAGO2)


More relations (meronymy, domain-specific…)


Exploiting structure.
Some of the methods
quantify semantic
distances using a relatedness measure computed on the
Wikipedia hyperlink graph
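Such a relatedness measure over the hyperlink graph can be sketched as follows. This is a hedged illustration in the spirit of the Wikipedia Link-based Measure (Milne & Witten), not the exact measure any one system uses, and the inlink sets of article IDs are invented:

```python
import math

def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """Normalized-Google-Distance-style relatedness over the sets of
    articles linking to a and b (higher = more related)."""
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0  # no shared inlinks: treat as unrelated
    num = math.log(max(len(a), len(b))) - math.log(len(common))
    den = math.log(total_articles) - math.log(min(len(a), len(b)))
    return max(0.0, 1.0 - num / den)
```

Articles sharing many inlinks (e.g., two music-related pages) score close to 1; articles with disjoint inlink sets score 0.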



A heuristic renaissance: high-quality, semi-structured content enables the acquisition of machine-readable knowledge on a large scale by means of heuristic methods which essentially leverage regularities found within its shallow structure.


Lightweight and scalable rule-based approaches can be devised to exploit the conventions governing the editorial base of collaboratively generated resources, and capture large amounts of semantic information hidden within them.



Filling the knowledge gap


Named Entity Recognition


Named Entity Disambiguation (associating a name with its appropriate referent)


Word Sense Disambiguation


Wikification: bringing Entity and Word Sense Disambiguation together


keyword extraction combined with lexical disambiguation: given an input document, a wikification system identifies the most important terms in the document and links (i.e., disambiguates) them to their appropriate entries within an external encyclopedic resource, typically Wikipedia.
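A toy sketch of the linking step of such a pipeline. The anchor dictionary and its priors are invented, and a real wikifier would also use the surrounding context rather than only the most probable sense:

```python
# Assumption: anchor_dict maps surface forms to candidate Wikipedia
# titles with prior probabilities mined from anchor texts (invented here).
anchor_dict = {
    "jaguar": {"Jaguar_Cars": 0.6, "Jaguar": 0.4},
    "apple": {"Apple_Inc.": 0.7, "Apple": 0.3},
}

def wikify(tokens):
    """Link each known term to its most probable Wikipedia entry."""
    return {t: max(anchor_dict[t], key=anchor_dict[t].get)
            for t in tokens if t in anchor_dict}
```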



Filling the knowledge gap


Computing semantic relatedness: quantifying the strength of association between words.


And beyond the sentence level:


Document clustering and text
categorization


Question
Answering


YAGO2 includes an extrinsic evaluation of the quality of Wikipedia on the task of answering spatio-temporal questions

Filling the knowledge gap


Information
Retrieval


The repository of disambiguated concepts found in Wikipedia (i.e., its articles) provides a semantic space into which documents and queries can be projected in order to perform semantic retrieval beyond the simple bag-of-words model
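The projection into a Wikipedia-concept space can be sketched roughly as follows, in the spirit of Explicit Semantic Analysis; the tiny term-concept index is invented for illustration:

```python
# Hypothetical term -> {Wikipedia concept: weight} index.
INDEX = {
    "guitar": {"Music": 0.9, "Rock_music": 0.7},
    "drums": {"Music": 0.8},
    "election": {"Politics": 0.9},
}

def project(tokens):
    """Sum the concept weights of every known token into one vector."""
    vec = {}
    for t in tokens:
        for concept, w in INDEX.get(t, {}).items():
            vec[concept] = vec.get(concept, 0.0) + w
    return vec

def sim(v1, v2):
    """Unnormalized dot product between two concept vectors."""
    return sum(w * v2.get(c, 0.0) for c, w in v1.items())
```

Documents and queries projected this way can match on shared concepts even when they share no words.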


Exploiting
updated content from
revision history


Language generation


Leveraging Wikipedia's revision history as a source of data in order to automatically acquire sentence rewriting models.


Exploiting
updated content from
revision history


Rewriting
tasks: sentence compression, text
simplification and targeted
paraphrasing


Summarization


??


The tower of Babel: multilingual
resources and applications


Wikipedia’s
multilinguality



namely, the availability of interlinked Wikipedias in different languages


enables the acquisition of very large, wide-coverage repositories of multilingual knowledge.


Multilingual taxonomies and ontologies


Parallel
corpora and thesauri

Some Questions


Tease out the collaborative vs. semi-structured aspects


Collaborative


Over the past decade, a variety of proposals -- MindPixel and Open Mind -- have tried to make manual knowledge acquisition feasible by collecting input from volunteers. See also the work of von Ahn, which aims at acquiring knowledge from users by means of online games. However, none of these efforts, to date, has succeeded in producing truly wide-coverage resources able to compete with standard manual resources.


Why? Why do people like to collaborate on Wikipedia and not -- as much -- on other projects?


What makes Wikipedia so attractive, and how can one try to “copy” from it to encourage other collaborative efforts?



Some Questions


Semi-structured


Wikipedia, Wiktionary, Twitter, Yahoo! Answers


What aspects of the structures are most important?


Other resources that have similar structure, if not the collaborative aspects?


Newspapers?


Forums?


Use
revision
history to discover something about
the contributors?


Papers for today


Collaboratively built semi-structured content and Artificial Intelligence: The story so far


Hovy, Navigli, Ponzetto


YAGO2: A Spatially and Temporally Enhanced
Knowledge Base from
Wikipedia


Hoffart, Suchanek, Berberich, Weikum



YAGO2


Knowledge base in which entities, facts, and events are anchored in both time and space.


YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet.


It contains 447 million facts about 9.8 million entities.


The paper describes the extraction methodology, the integration of the spatio-temporal dimension, and the SPOTL knowledge representation that includes time and space

Time and space


To know
not only that a fact is true, but also
when and where it was true.


Presidents
of countries or CEOs of companies
change. Even capitals of countries or spouses are
not necessarily
forever….


The geographical location is a crucial property not
just of physical entities such as countries,
mountains, or rivers, but also of organization
headquarters, or events such as battles, fairs, or
people’s births.





Contributions


Integrate entity-relationship-oriented facts with the spatial and temporal dimensions.


Extensible framework for fact extraction (from Wikipedia and other sources) that can tap into infoboxes, lists, tables, categories, and regular patterns in free text, and allows fast and easy specification of new extraction rules



Knowledge representation model tailored to capture time and space, as well as rules for propagating time and location information to all relevant facts


New representation model, SPOTL tuples (SPO + Time + Location), with expressive and easy-to-use querying


SPO triples: subject-property-object triples


YAGO


The YAGO knowledge base is
automatically
constructed
from Wikipedia.


Each article in Wikipedia becomes an entity in the knowledge base (e.g., since Leonard Cohen has an article in Wikipedia, LeonardCohen becomes an entity in YAGO)


100 manually defined relations (wasBornOnDate, locatedIn…)


2 million entities and 20 million facts.


Facts: triples of an entity, a relation, and another entity (wasBornIn(LeonardCohen, Montreal))


SPO triples of subject (S), predicate (P), and object (O), compatible with the RDF (Resource Description Framework) data model
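A minimal sketch of facts as SPO triples, assuming a simple in-memory list; the Fact type and the objects helper are illustrative, not YAGO code:

```python
from typing import NamedTuple

class Fact(NamedTuple):
    """An SPO triple in the RDF sense."""
    s: str  # subject
    p: str  # predicate (relation)
    o: str  # object

facts = [
    Fact("LeonardCohen", "wasBornIn", "Montreal"),
    Fact("LeonardCohen", "type", "singer"),
]

def objects(facts, s, p):
    """All objects o such that (s, p, o) is a known fact."""
    return [f.o for f in facts if f.s == s and f.p == p]
```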




YAGO2 Extraction
Architecture


The
YAGO2
architecture is based on declarative rules
that are stored in text
files.


The rules take the form of subject-predicate-object triples, so that they are basically additional YAGO2 facts.


Extraction
rules say that if a part of the source text
matches a
specified
regular expression, a sequence of facts
shall be generated.


The rules apply not only to Wikipedia infoboxes, but also to Wikipedia categories, article titles, headings, links, or references.


The extraction rules cover some 200 infobox patterns, some 90 category patterns, and around a dozen patterns for dealing with disambiguation pages.
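The regex-driven style of extraction can be sketched like this; the rule table and the infobox snippet are illustrative, not YAGO2's actual rule syntax:

```python
import re

# Each rule pairs a pattern over the source text with the relation whose
# facts it generates (hypothetical rules for a wiki-style infobox).
RULES = [
    (re.compile(r"\|\s*birth_date\s*=\s*(\d{4}-\d{2}-\d{2})"), "wasBornOnDate"),
    (re.compile(r"\|\s*birth_place\s*=\s*(\w+)"), "wasBornIn"),
]

def extract(entity, source_text):
    """Generate (subject, relation, object) facts for every rule match."""
    facts = []
    for pattern, relation in RULES:
        for m in pattern.finditer(source_text):
            facts.append((entity, relation, m.group(1)))
    return facts
```

Adding a new extraction rule is then just adding one (pattern, relation) pair, which is the "fast and easy specification" the paper claims for its declarative design.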



Time in YAGO2


YAGO2 contains a data type yagoDate that denotes time points, typically with a resolution of days but sometimes with cruder resolution like years.


YAGO2 assigns begin and/or end of time spans
to all entities, to all facts, and to all events, if
they have a known start point or a known end
point.


Entities and Time


Entities are assigned a time span to denote their existence in time. Four major entity types:


People


relations wasBornOnDate and diedOnDate demarcate their existence times


Elvis Presley is associated with 1935-01-08 as his birthdate and 1977-08-16 as his time of death. Bob Dylan is associated only with the time of birth, 1941-05-24



Groups, such as music bands, football clubs, universities, or companies


the relations wasCreatedOnDate and wasDestroyedOnDate demarcate their existence times


Artifacts, such as buildings, paintings, books, music songs or albums


wasCreatedOnDate and wasDestroyedOnDate (e.g., for buildings or sculptures)


Events, such as wars, sports competitions like Olympics or world championship tournaments, or named epochs like the “German autumn”


startedOnDate and endedOnDate demarcate their existence times

Facts and
Time


The YAGO2 extractors can find occurrence times of facts from the Wikipedia infoboxes.


Example: BobDylan wasBornIn Duluth is an event that happened in 1941


Two new relations, occursSince and occursUntil


If the same fact occurs more than once, then YAGO2 will contain it multiple times with different ids. For example, since Bob Dylan has won two Grammy awards, we would have #1: BobDylan hasWonPrize GrammyAward with #1 occursOnDate 1973, and a second #2: BobDylan hasWonPrize GrammyAward (with a different id) and the associated fact #2 occursOnDate 1979.
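The reification described above (fact ids carrying their own occursOnDate meta-facts) can be sketched as:

```python
# Each fact copy gets an id so that meta-facts such as occurrence dates
# can attach to it (data mirrors the Bob Dylan example above).
facts = {
    "#1": ("BobDylan", "hasWonPrize", "GrammyAward"),
    "#2": ("BobDylan", "hasWonPrize", "GrammyAward"),
}
meta = [
    ("#1", "occursOnDate", "1973"),
    ("#2", "occursOnDate", "1979"),
]

def dates_of(s, p, o):
    """All occurrence dates recorded for copies of the fact (s, p, o)."""
    ids = {fid for fid, f in facts.items() if f == (s, p, o)}
    return sorted(d for fid, rel, d in meta
                  if fid in ids and rel == "occursOnDate")
```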

Space


All physical objects have a location in space.


YAGO2 is concerned
with entities that have a permanent spatial
extent on Earth


for example countries, cities, mountains, and rivers.


New class yagoGeoEntity, which groups together all geo-entities


Subclasses of yagoGeoEntity are: location, body of water, geological formation, real property, facility, excavation, structure, track



The position of a geo-entity can be described by geographical coordinates, latitude and longitude


YAGO2 harvests geo-entities from two sources: Wikipedia and GeoNames


(GeoNames has information on location hierarchies (partOf), e.g. Berlin is located in Germany, which is located in Europe, and provides alternate names for each location, as well as neighboring countries)

Entities and Location


Events


Can take place at a specific location, such as battles or sports competitions, where the relation happenedIn holds the place where it happened.


Groups or
organizations


Can have a venue, such as the headquarters of a company or the campus of a university. The location for such entities is given by the isLocatedIn relation.


Artifacts
that are
physically located
somewhere


E.g., the Mona Lisa in the Louvre, where the location is again given by isLocatedIn.


SPOTL(X)-View Model


SPOTLX 6-tuples


SPO triples augmented by Time and Location and keywords or key phrases from the conteXt of the sources where the original SPO fact occurs
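A SPOTLX 6-tuple can be modeled as a simple record; the field names and example values are illustrative:

```python
from typing import NamedTuple, Optional, Tuple

class SPOTLX(NamedTuple):
    """A SPOTLX 6-tuple: SPO plus Time, Location, and conteXt keywords."""
    s: str                 # subject
    p: str                 # predicate
    o: str                 # object
    t: Optional[str]       # time point or span, e.g. "1941-05-24"
    l: Optional[str]       # location entity
    x: Tuple[str, ...]     # context keywords from the source

fact = SPOTLX("BobDylan", "wasBornIn", "Duluth",
              "1941-05-24", "Duluth", ("singer", "songwriter"))
```

Queries can then filter on any of the six components, which is what makes the model "expressive and easy to use" for spatio-temporal questions.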


Size of YAGO2: entities

Size of YAGO2:
facts

Evaluation


Of extraction of facts from Wikipedia

Task-Based Evaluation


Answering Spatio-Temporal Questions


15 questions of the GeoCLEF 2008 GiKiP Pilot


The original intent of the GeoCLEF GiKiP Pilot is: “Find Wikipedia entries / articles that answer a particular information need which requires geographical reasoning of some sort.”


4 questions working perfectly;


3 questions working when relaxing a geographical condition from
structural
to keyword conditions


resulting in a less precise but still
useful result set;


6 questions that could be well formulated as SPOTLX queries but did not return any good result due to the limited coverage of the knowledge base;


2 questions that could not be properly formulated at all.



A sample of temporal and spatial question blocks from Jeopardy!

Evaluation on Jeopardy

Improving Named Entity Disambiguation by Spatio-Temporal Knowledge


“Dylan performed Hurricane about the black
fighter Carter, from his album Desire. That
album also contains a duet with Harris in the
song
Joey.”


Here, the tokens “song”, “album”, and “performed” are strong cues for Joey (Bob Dylan song) instead of the TV series



Spatial Coherence


Two entities that are geographically close to each other are a coherent pair, based on the intuition that texts or text passages (news, blog postings, etc.) usually talk about a single geographic region.


Spatial coherence is defined between two entities e1, e2 ∈ E with geo-coordinates, where E is the set of all candidates for mapping mentions in a text to canonical entities
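A hedged sketch of such a spatial coherence score: candidate pairs that are geographically close score near 1. The normalization by a fixed maximum distance is a simplification (the paper normalizes over the candidate set), and the coordinates are illustrative:

```python
import math

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def spatial_coherence(c1, c2, coords, dmax_km=20000):
    """1 for co-located candidates, decreasing with distance."""
    return 1.0 - haversine_km(coords[c1], coords[c2]) / dmax_km

coords = {"Berlin": (52.52, 13.40), "Potsdam": (52.40, 13.07),
          "Sydney": (-33.87, 151.21)}
```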



Temporal Coherence


Defined between two entities e1, e2 ∈ E with existence time:

tempCoh(e1, e2) = 1 − |cet(e1) − cet(e2)| / max_{ei, ej ∈ E} |cet(ei) − cet(ej)|

where cet() is the center of an entity’s existence time interval, and the denominator normalizes the distance by the maximum distance of any two entities in the current set of entity candidates, ei, ej ∈ E.



The intuition is that a text usually mentions entities that are clustered around a single or a few points in time
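The temporal coherence above can be sketched with cet() collapsed to a single year per entity; the candidate entities and years are invented for illustration:

```python
def temporal_coherence(e1, e2, candidates, cet):
    """1 minus the cet-distance of e1, e2, normalized by the maximum
    cet-distance over the whole candidate set."""
    dmax = max(abs(cet[a] - cet[b]) for a in candidates for b in candidates)
    if dmax == 0:
        return 1.0
    return 1.0 - abs(cet[e1] - cet[e2]) / dmax

# Hypothetical candidate set for the "Joey" mention in the Dylan example.
cet = {"Joey_(song)": 1976, "Desire_(album)": 1976, "Joey_(TV_series)": 2004}
E = list(cet)
```

The 1976 song coheres perfectly with the 1976 album, while the 2004 TV series gets the minimum score, matching the disambiguation intuition above.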



Named Entity Disambiguation


Calculate Spatial and Temporal Coherence between the mention in the input text and all candidate entities in the knowledge base


These coherence scores are then included, with weights, in the formula for entity relatedness used for disambiguation
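A hedged sketch of plugging the coherence scores into a weighted relatedness; the linear combination and the weights are illustrative, not the paper's exact scheme:

```python
def combined_relatedness(link_sim, temporal_coh, spatial_coh,
                         weights=(0.6, 0.2, 0.2)):
    """Weighted sum of link-based similarity and the two coherence
    scores; all inputs are assumed to lie in [0, 1]."""
    w_link, w_time, w_space = weights
    return w_link * link_sim + w_time * temporal_coh + w_space * spatial_coh
```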