Course 6
29 March 2012
Diana Trandabăț
dtrandabat@info.uaic.ro


- What is Named Entity Recognition
- Corpora, annotation
- Evaluation and testing
- Preprocessing
- Approaches to NE:
  - Baseline
  - Rule-based approaches
  - Learning-based approaches
- Multilinguality
- Applications


Information Extraction (IE) proposes techniques to extract relevant information from non-structured or semi-structured texts. Extracted information is transformed so that it can be represented in a fixed (computer-readable) format.


Named Entity Recognition (NER) is an IE task that seeks to locate and classify text segments into predefined classes (for example Person, Location, Time expression).

Example:
We are proud to announce that Friday, February 17, we will have two sessions in the Education Seminar. At 12:30 pm, at the Student Center Room 207, Joe Mertz will present "Using a Cognitive Architecture to Design Instructions". His session ends at 1 pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present "Information Extraction: how to automatically learn new models". This session ends around 15h.
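To make the task concrete, here is a minimal rule-based sketch that tags only the Time expressions in the announcement above. The regular expression is illustrative and covers just the formats that actually occur in this text ("12:30 pm", "1 pm", "14:00", "15h"); it is not a general time tagger.

```python
import re

# Illustrative pattern for the time formats in the example announcement.
TIME_RE = re.compile(r'\d{1,2}:\d{2}(?:\s?[ap]m)?|\d{1,2}\s?[ap]m|\d{1,2}h\b')

text = ("At 12:30 pm, Joe Mertz will present. His session ends at 1 pm. "
        "At 14:00 we meet again. This session ends around 15h.")
print(TIME_RE.findall(text))  # ['12:30 pm', '1 pm', '14:00', '15h']
```

Note that room numbers such as "207" are not matched, because none of the three alternatives fires on a bare number.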




NER involves two sub-tasks:
- Identification of proper names in texts (Named Entity Identification - NEI)
- Classification into a set of predefined categories of interest (Named Entity Classification - NEC)


Usual categories: Person names, Organizations (companies, government organisations, committees, etc.), Locations (cities, countries, rivers, etc.), Date and time expressions.

Other common types: measures (percent, money, weight, etc.), email addresses, Web addresses, street addresses, etc.

Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references, etc.


Variation of NEs, e.g. John Smith, Mr Smith, John.

Ambiguity of NE types:
- John Smith (company vs. person)
- May (person vs. month)
- Washington (person vs. location)
- 1945 (date vs. time)

Ambiguity with common words, e.g. "may"


Issues of style, structure, domain, genre etc.

Punctuation, spelling, spacing, formatting, ... all have an impact:

  Dept. of Computing and Maths
  Manchester Metropolitan University
  Manchester
  United Kingdom

  Tell me more about Leonardo
  Da Vinci


- MUC (Message Understanding Conference): MUC-6 and MUC-7 corpora - English
- CONLL shared task corpora:
  http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
  http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
- TIDES surprise language exercise (NEs in Cebuano and Hindi)
- ACE (Automatic Content Extraction) - English
  http://www.ldc.upenn.edu/Projects/ACE/


- 100 documents in SGML
- News domain
- 1880 Organizations (46%)
- 1324 Locations (32%)
- 887 Persons (22%)
- Inter-annotator agreement very high (~97%)

http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf

<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.

<p>

Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.
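Annotations in this MUC style are easy to read back programmatically. The sketch below pulls (text, TYPE) pairs out of ENAMEX/TIMEX/NUMEX markup with a single regular expression; it assumes well-nested, non-overlapping tags, which holds for the MUC corpora.

```python
import re

# Backreference \1 ensures the closing tag matches the opening one.
MUC_TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([^"]+)">(.*?)</\1>')

def muc_entities(sgml):
    """Return (entity text, TYPE) pairs in document order."""
    return [(m.group(3), m.group(2)) for m in MUC_TAG.finditer(sgml)]

sample = ('<ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> readied Endeavour for '
          'launch on <TIMEX TYPE="DATE">Thursday</TIMEX>.')
print(muc_entities(sample))  # [('NASA', 'ORGANIZATION'), ('Thursday', 'DATE')]
```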



- Format detection
- Word segmentation (for languages like Chinese)
- Tokenisation
- Sentence splitting
- POS tagging


NER systems have been created that use linguistic grammar-based techniques as well as statistical methods.

Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists.

Statistical NER systems typically require a large amount of manually annotated training data.


- Corpora are typically divided into a training and a testing portion
- Rules/learning algorithms are trained on the training part
- Tuned on the testing portion in order to optimise:
  - rule priorities, rule effectiveness, etc.
  - parameters of the learning algorithm and the features used
- Evaluation set: the best system configuration is run on this data and the system performance is obtained
- No further tuning once the evaluation set is used!

Knowledge Engineering:
- rule based
- developed by experienced language engineers
- makes use of human intuition
- requires only a small amount of training data
- development can be very time consuming
- some changes may be hard to accommodate

Learning Systems:
- use statistics or other machine learning
- developers do not need advanced language engineering expertise
- require large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus


Baseline: a system that recognises only entities stored in its lists (gazetteers).
- Advantages: simple, fast, language independent, easy to retarget (just create lists)
- Disadvantages: impossible to enumerate all names; collection and maintenance of lists; cannot deal with name variants; cannot resolve ambiguity
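The gazetteer baseline amounts to a string lookup. A minimal sketch, with tiny stand-in lists (real gazetteers hold thousands of entries):

```python
# Hypothetical gazetteer lists for illustration only.
GAZETTEERS = {
    "PERSON": {"Joe Mertz", "Brian McKenzie"},
    "LOCATION": {"Manchester", "United Kingdom"},
}

def gazetteer_tag(text):
    """Return (entry, type) pairs for every gazetteer entry present in the text."""
    found = []
    for etype in sorted(GAZETTEERS):
        for name in sorted(GAZETTEERS[etype]):
            if name in text:
                found.append((name, etype))
    return found

print(gazetteer_tag("Joe Mertz will speak in Manchester"))
# [('Manchester', 'LOCATION'), ('Joe Mertz', 'PERSON')]
```

The disadvantages listed above show up immediately: "Mr Mertz" or an ambiguous "Manchester United" would be missed or mis-tagged by pure lookup.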


- Online phone directories and yellow pages for person and organisation names
- Location lists:
  http://ro.wikipedia.org/wiki/Format:Listele_localit%C4%83%C8%9Bilor_din_Rom%C3%A2nia_pe_jude%C8%9Be
- Name lists:
  http://ro.wikipedia.org/wiki/List%C4%83_de_nume_rom%C3%A2ne%C8%99ti
- Automatic collection from annotated training data


Internal evidence: names often have internal structure. These components can be either stored or guessed, e.g. location:
- Cap. Word + {City, Forest, Center, River}, e.g. Sherwood Forest
- Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
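The "Cap. Word + keyword" rules above translate directly into a regular expression. A sketch, restricted to the keyword sets shown on this slide:

```python
import re

# Guess locations from internal structure: a capitalised word followed by
# one of the stored location keywords.
LOC_RE = re.compile(
    r'\b[A-Z][a-z]+ (?:City|Forest|Center|River|Street|Boulevard|Avenue|Crescent|Road)\b')

print(LOC_RE.findall("He drove down Portobello Street towards Sherwood Forest."))
# ['Portobello Street', 'Sherwood Forest']
```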


- Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police]
- Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
- Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]


Use of context-based patterns is helpful in ambiguous cases:
- "David Walton" and "Goldman Sachs" are indistinguishable
- But in "David Walton of Goldman Sachs":
  - if we have "David Walton" recognised as Person
  - we can use the pattern "[Person] of [Organization]"
  - and identify "Goldman Sachs" correctly.


[PERSON] earns [MONEY]
[PERSON] joined [ORGANIZATION]
[PERSON] left [ORGANIZATION]
[PERSON] joined [ORGANIZATION] as [JOBTITLE]
[ORGANIZATION]'s [JOBTITLE] [PERSON]
[ORGANIZATION] [JOBTITLE] [PERSON]
the [ORGANIZATION] [JOBTITLE]
part of the [ORGANIZATION]
[ORGANIZATION] headquarters in [LOCATION]
price of [ORGANIZATION]
sale of [ORGANIZATION]
investors in [ORGANIZATION]
[ORGANIZATION] is worth [MONEY]
[JOBTITLE] [PERSON]
[PERSON], [JOBTITLE]
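One such pattern can be sketched in a few lines. The function below applies the "[Person] of [Organization]" rule: given a string already recognised as a Person, the capitalised words following "of" are proposed as an Organization. This is a simplified illustration, not a full pattern engine.

```python
import re

def org_after_person(text, person):
    """Given a recognised Person, propose the capitalised phrase after
    '<person> of' as a candidate Organization."""
    m = re.search(re.escape(person) + r' of ((?:[A-Z][\w&.-]+ ?)+)', text)
    return m.group(1).strip() if m else None

print(org_after_person("David Walton of Goldman Sachs announced...", "David Walton"))
# Goldman Sachs
```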


- Patterns are only indicators based on likelihood
- Can set priorities based on frequency thresholds
- Need training data for each domain
- More semantic information would be useful (e.g. to cluster groups of verbs)


- Created as part of GATE
- GATE: Sheffield's open-source infrastructure for language processing
- GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
- GATE has a finite-state pattern-action rule language, used by ANNIE
- ANNIE modified for MUC guidelines: 89.5% f-measure on MUC-7 corpus

NE Components: the ANNIE system is a reusable and easily extendable set of components.


- Needed to store the indicator strings for the internal structure and context rules
- Internal location indicators: e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, ...} for address locations
- Internal organisation indicators: e.g., company designators {GmbH, Ltd, Inc, ...}
- Produces Lookup results of the given kind


- Orthographic co-reference module that matches proper names in a document
- Improves NE results by assigning an entity type to previously unclassified names, based on relations with classified NEs
- May not reclassify already classified entities
- Classification of unknown entities is very useful for surnames which match a full name, or abbreviations, e.g. [Napoleon] will match [Napoleon Bonaparte]; [International Business Machines Ltd.] will match [IBM]
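The surname case of this co-reference heuristic can be sketched as follows. The abbreviation case (IBM matching International Business Machines Ltd.) would need an extra initials rule and is omitted here; this is an illustration, not the actual orthomatcher implementation.

```python
def ortho_match(unknown, classified):
    """An unclassified name inherits the type of a classified name that
    contains it as a whole token (the surname / short-name case)."""
    for name, etype in classified.items():
        if unknown == name or unknown in name.split():
            return etype
    return None  # leave the name unclassified

print(ortho_match("Napoleon", {"Napoleon Bonaparte": "Person"}))  # Person
```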


ML approaches frequently break down the NER task in two parts:
- Recognising the entity boundaries
- Classifying the entities into the NE categories

Work is usually only on one task or the other.

Tokens in text are often coded with the IOB scheme:
- O - outside; B-NE - first word in NE; I-NE - all other words in NE

  Argentina  B-LOC
  played     O
  with       O
  Del        B-PER
  Bosque     I-PER
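Decoding an IOB sequence back into entity spans is a small loop. A sketch, applied to the example above:

```python
def iob_to_spans(tokens, tags):
    """Collect (entity_text, type) spans from IOB-tagged tokens."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # a new entity begins
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)            # continue the open entity
        else:                              # "O" closes any open entity
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

print(iob_to_spans(["Argentina", "played", "with", "Del", "Bosque"],
                   ["B-LOC", "O", "O", "B-PER", "I-PER"]))
# [('Argentina', 'LOC'), ('Del Bosque', 'PER')]
```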


- Based on Hidden Markov Models
- Features:
  - Capitalisation
  - Numeric symbols
  - Punctuation marks
  - Position in the sentence
- 14 features in total, combining the above info, e.g., containsDigitAndDash (09-96), containsDigitAndComma (23,000.00)
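Features of this kind are simple word-shape tests. An illustrative subset in the spirit of the feature names above (not the actual IdentiFinder feature set):

```python
import re

def word_features(w):
    """A few orthographic word-shape features, illustrative only."""
    return {
        "initCap": bool(re.match(r'[A-Z][a-z]', w)),
        "allCaps": w.isupper(),
        "containsDigitAndDash": bool(re.fullmatch(r'\d+-\d+', w)),      # e.g. 09-96
        "containsDigitAndComma": bool(re.search(r'\d,\d', w)),          # e.g. 23,000.00
    }

print(word_features("09-96")["containsDigitAndDash"])   # True
```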


- MUC-6 (English) and MET-1 (Spanish) corpora used for evaluation
- Mixed case English: IdentiFinder - 94.9% f-measure
- Spanish mixed case: IdentiFinder - 90%
- Lower case names, noisy training data, less training data
- Training data: 650,000 words, but similar performance with half of the data; less than 100,000 words reduces the performance to below 90% on English


- Finer-grained categorisation is needed for applications like question answering
- Person classification into 8 sub-categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police
- Approach using local context and global semantic information such as WordNet
- Used a decision list classifier and IdentiFinder to automatically construct a training set from untagged data
- Held-out set of 1300 instances hand annotated


Word frequency features - how often the words surrounding the target instance occur with a specific category in training:
- For each of the 8 categories, 10 distinct word positions = 80 features per instance:
  - the 3 words before & after the instance
  - the two-word bigrams immediately before and after the instance
  - the three-word trigrams before/after the instance

  #  Position          N-gram     Category     Freq.
  1  Previous unigram  introduce  politician   3
  2  Previous unigram  introduce  entertainer  43
  3  Following bigram  into that  politician   2
  4  Following bigram  into that  business     0


- Topic signatures and WordNet information: compute lists of terms that signal relevance to a topic/category [Lin&Hovy 00] and expand with WordNet synonyms to counter unseen examples
  - Politician: campaign, republican, budget
- The topic signature features convey information about the overall context in which each instance exists
- Due to differing contexts, instances of the same name in a single text were classified differently


- Evaluation metric: mathematically defines how to measure the system's performance against a human-annotated gold standard
- Scoring program: implements the metric and provides performance measures
  - for each document and over the entire corpus
  - for each type of NE


- Precision = correct answers / answers produced
- Recall = correct answers / total possible correct answers
- Trade-off between precision and recall
- F-Measure = (β^2 + 1)·P·R / (β^2·P + R)  [van Rijsbergen 75]
- β reflects the weighting between precision and recall, typically β = 1
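The F-measure formula above can be checked directly in code; with β = 1 it reduces to the familiar harmonic mean 2PR/(P+R):

```python
def f_measure(p, r, beta=1.0):
    """Weighted F-measure; beta > 1 favours recall, beta < 1 favours precision."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

print(f_measure(0.5, 0.5))  # 0.5
```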


We may also want to take account of partially correct answers:

  Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)

  Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)

Why: NE boundaries are often misplaced, so some results are partially correct.
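The lenient scores above translate directly into code; a sketch from the raw counts:

```python
def lenient_scores(correct, partial, incorrect, missing):
    """Precision/recall giving half credit to partially correct answers."""
    hits = correct + 0.5 * partial
    precision = hits / (correct + incorrect + partial)
    recall = hits / (correct + missing + partial)
    return precision, recall

# e.g. 8 correct, 2 partial, 1 spurious, 1 missed entity:
print(lenient_scores(8, 2, 1, 1))
```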



- Recent experiments are aimed at NE recognition in multiple languages
- The TIDES surprise language evaluation exercise measures how quickly researchers can develop NLP components in a new language
- CONLL'02, CONLL'03 focus on language-independent NE recognition

  Language    NE    Time/Date  Numeric exprs.  Org/Per/Loc
  Chinese     4454  17.2%      1.8%            80.9%
  English     2242  10.7%      9.5%            79.8%
  French      2321  18.6%      3%              78.4%
  Japanese    2146  26.4%      4%              69.6%
  Portuguese  3839  17.7%      12.1%           70.3%
  Spanish     3579  24.6%      3%              72.5%


- Numerical and time expressions are very easy to capture using rules
- Together they constitute about 20-30% of all NEs
- All numerical expressions in the 6 languages required only 5 patterns
- Time expressions similarly require only a few rules (less than 30 per language)
- Many of these rules are reusable across the languages
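As an illustration of why so few patterns suffice, one regular expression already covers several common numeric-expression shapes (percentages, currency amounts, spelled-out percent). This is an assumed example pattern, not one of the five patterns the slide refers to:

```python
import re

# One illustrative pattern covering percent, currency and "N percent" shapes.
NUMERIC = re.compile(r'\d(?:[\d.,]*\d)?\s?(?:%|percent|USD|EUR)|\$\s?\d(?:[\d.,]*\d)?')

print(NUMERIC.findall("Profits rose 4.5% to $23,000.00, or 12 percent of sales."))
# ['4.5%', '$23,000.00', '12 percent']
```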


- Extensive support for non-Latin scripts and text encodings, including conversion utilities
- Automatic recognition of encoding [Ignat et al 03]
  - Occupied up to 2/3 of the TIDES Hindi effort
- Bi-lingual dictionaries
- Annotated corpus for evaluation
- Internet resources for gazetteer list collection (e.g., phone books, yellow pages, bi-lingual pages)

Multilingual data in GATE: all processing, visualisation and editing tools use GUK.


- Deals with locations only
- Even more ambiguity than in one language:
  - Multiple places share the same name, such as the fourteen cities and villages in the world called 'Paris'
  - Place names that are also words in one or more languages, such as 'And' (Iran), 'Split' (Croatia)
  - Places have varying names in different languages (Italian 'Venezia' vs. English 'Venice', German 'Venedig', French 'Venise')


- Disambiguation module applies heuristics based on location size and country mentions (prefer the locations from the country mentioned most)
- Performance evaluation: 853 locations from 80 English texts; 96.8% precision, 96.5% recall


CONLL’2002 and 2003 shared tasks were NE
in Spanish, Dutch, English, and German


The most popular ML techniques used:


Maximum Entropy (5 systems)


Hidden Markov Models (4 systems)


Connectionist methods (4 systems)


Combining ML methods has been shown to
boost results


48


- The choice of features is at least as important as the choice of ML algorithm:
  - Lexical features (words)
  - Part-of-speech
  - Orthographic information
  - Affixes
  - Gazetteers
- External, unmarked data is useful to derive gazetteers and for extracting training instances


- Named Entity Recognition in Web Search
- Medical NER (Medline abstracts)


71% of the queries in search engines contain named entities. These named entities may be useful to process the query.


Motivating examples:
- Consider the query "harry potter walkthrough": the context of the query strongly indicates that the named entity "harry potter" is a "Game"
- Consider the query "harry potter cast": the context of the query strongly indicates that the named entity "harry potter" is a "Movie"


Identifying named entities can be very useful. Consider the following examples related to the query "harry potter walkthrough":
- Ranking: documents about videogames should be pushed up in the rankings
- Suggestion: relevant suggestions can be generated, like "harry potter cheats" or "lord of the rings walkthrough"


- Identification of Protein and Gene Terms in medical texts
- The identification of the relevant documents and the extraction of the information from them are hampered by the large size of literature databases and the lack of a widely accepted standard notation for biomedical entities.
- OSIRIS: http://ibi.imim.es/OSIRISv1.2.html


- Towards semantic tagging of entities
- New evaluation metrics for semantic entity recognition
- Expanding the set of entities recognised, e.g., vehicles, weapons, substances (food, drug)
- Finer-grained hierarchies, e.g., types of Organizations (government, commercial, educational, etc.), Locations (regions, countries, cities, water, etc.)


1) Build NER gazetteers for Romanian:
- Extract from Wikipedia lists of Romanian names for male/female
- Extract from Wikipedia lists of Romanian cities
- Extract from the Internet lists of Romanian companies

2) Extract from texts email addresses and phone numbers of any format:
- TEL +40-722-222-222, Phone: (722) 222-222, Tel (+40): 232222222
- emails including "john(at)smith.inc.edu" or "john(la)info punct uaic punct ro"

3) Extract as many dates as possible from texts, including "23 iunie 2009" (23 June 2009), "ieri" (yesterday), "anul trecut" (last year), "toamna 2001" (autumn 2001), "la ora 2 pm" (at 2 pm), etc.

4) Use the gazetteers and the programs from above to extract NEs from Romanian Wikipedia pages. Write the output in XML format.
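As a starting point for homework item 2, the obfuscated email forms can be normalised before matching, and a single regular expression can cover the listed phone formats. The patterns below are an illustrative sketch tuned to the example strings on this slide, not a complete solution:

```python
import re

def find_emails(text):
    """Normalise '(at)'/'(la)' and ' punct ' obfuscation, then match emails."""
    norm = text.replace("(at)", "@").replace("(la)", "@").replace(" punct ", ".")
    return re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', norm)

def find_phones(text):
    """Match the slide's phone formats: +40-722-222-222, (722) 222-222, 232222222."""
    return re.findall(
        r'(?:\+?\d{1,3}[-\s]?)?(?:\(\d{3}\)\s?|\d{3}[-\s]?)\d{3}[-\s]?\d{3,4}', text)

print(find_emails("write to john(at)smith.inc.edu or john(la)info punct uaic punct ro"))
# ['john@smith.inc.edu', 'john@info.uaic.ro']
```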


- Borthwick, A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation, 1999.
- Chinchor, N. MUC-7 Named Entity Task Definition, Version 3.5. Available from ftp.muc.saic.com/pub/MUC/MUC7-guidelines, 1997.
- C. Ignat, B. Pouliquen, A. Ribeiro and R. Steinberger. Extending an Information Extraction Tool Set to Eastern-European Languages. Proceedings of the Workshop on Information Extraction for Slavonic and other Central and Eastern European Languages (IESL'03), 2003.
- McDonald, D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky, editors: Corpus Processing for Lexical Acquisition, pages 21-39. MIT Press, Cambridge, MA, 1996.
- D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
- H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.
- K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.


- CCG Group: http://cogcomp.cs.illinois.edu/demo/ner/results.php
- LingPipe: http://alias-i.com/lingpipe/web/demo-ne.html
- Stanford NER: http://nlp.stanford.edu/software/CRF-NER.shtml