Three kinds of web data that can help computers make better sense of human language

Shane Bergsma

Johns Hopkins University


Fall 2011


Computers that understand language

“William Wilkinson’s ‘An Account of the
Principalities of Wallachia and Moldavia’
inspired this author’s most famous novel.”

Research Vision

Robust processing of human language requires knowledge beyond what's in small manually-annotated data sets.

Derive meaning from real-world data:

1) Raw text on the web
2) Bilingual text (words plus their translations)
   → Part 1: Parsing noun phrases
3) Visual data (labelled online images)
   → Part 2: Learning the meaning of words

Part 1: Parsing Noun Phrases (NPs)

Google: What pages/ads should be returned for the query "washed baby carrots"?

  [washed baby] carrots: carrots for washed babies
  vs.
  washed [baby carrots]: baby carrots that are washed

Training a parser via machine learning

[Diagram: "washed baby carrots" → PARSER (with weights w0) → "[washed baby] carrots" → TESTER: INCORRECT in training data]
Training a parser via machine learning

[Diagram: "washed baby carrots" → PARSER (with weights w1) → "washed [baby carrots]" → TESTER: CORRECT in gold standard]

Training corpus:
  retired [science teacher]
  [social science] teacher
  female [bus driver]
  [school bus] driver
  zebra [hair straightener]
  alleged [Canadian lover]
  …

[Figure: learning curves for a grammar correction task, from Banko & Brill, 2001: more data is better data]

Testing a parser on new data

[Diagram: "washed baby smell" → PARSER (with final weights wN) → "washed [baby smell]" → TESTER: INCORRECT]

Big Challenge: For parsing NPs, every word matters
- both parses are grammatical
- we can't generalize from "washed baby carrots" in training to "washed baby smell" at test time

Solution: New sources of data
- Having seen washed [baby carrots] in training…

English Data for Parsing

Human-Annotated: 1 MILLION words
  Penn (Parse-)Treebank [Marcus et al., 1993]

Bitexts: 1 BILLION words
  Canadian Hansards, etc. [Callison-Burch et al., 2010]

Web text (N-grams): 1 TRILLION words
  Google N-gram Data [Brants & Franz, 2006]

Task: Parsing NPs with conjunctions

1) [dairy and meat] production
2) [sustainability] and [meat production]

Is there an implicit phrase?
  yes: [dairy production] in (1)
  no: [sustainability production] in (2)

Our contributions: new semantic features from raw web text and a new approach to using bilingual data as soft supervision [Bergsma, Yarowsky & Church, ACL 2011]

One Noun Phrase or Two: A Machine Learning Approach

Input: "dairy and meat production" → features x

  x = (…, first-noun=dairy,
          second-noun=meat,
          first+second-noun=dairy+meat, …)

  h(x) = w · x   (predict one NP if h(x) > 0)

Set w via training on annotated training data using some machine learning algorithm.
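As a minimal sketch of this setup (the feature names follow the slide; the function names and toy weights below are hypothetical, and the real system learns w with a standard supervised learner):

```python
# Minimal sketch: indicator features and a linear score h(x) = w . x.

def extract_features(first_noun, second_noun):
    """Binary indicator features for an NP like 'first and second head'."""
    return {
        "first-noun=" + first_noun: 1.0,
        "second-noun=" + second_noun: 1.0,
        "first+second-noun=" + first_noun + "+" + second_noun: 1.0,
    }

def h(x, w):
    """Linear score; predict one NP if h(x) > 0."""
    return sum(w.get(feat, 0.0) * val for feat, val in x.items())

x = extract_features("dairy", "meat")
w = {"first+second-noun=dairy+meat": 1.5}  # toy weight, normally learned
print("one NP" if h(x, w) > 0 else "two NPs")  # -> one NP
```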

Leveraging Web-Derived Knowledge

[dairy and meat] production
- If there is only one NP, then it is implicitly talking about "dairy production"
- Do we see this phrase occurring a lot on the web? [Yes]

sustainability and [meat production]
- If there is only one NP, then it is implicitly talking about "sustainability production"
- Do we see this phrase occurring a lot on the web? [No]

The classifier has features for these counts.

Search Engine Page Counts for NLP

Early web work: Use an Internet search engine to get web counts [Keller & Lapata, 2003]

  "dairy production": 714,000 pages
  "sustainability production": 11,000 pages

Problem: Using a search engine is just too inefficient to get data on a large scale.

Google N-gram Data for NLP

Google N-gram Data [Brants & Franz, 2006]

N words in sequence + their count on the web:

  dairy producers       22724
  dairy production      17704
  dairy professionals     204
  dairy profits            82
  dairy propaganda         15
  dairy protein          1268

A compressed version of all the text on the web.

Enables new features/statistics for a range of tasks [Bergsma et al. ACL 2008, IJCAI 2009, ACL 2010, etc.]
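As an illustration, counts like these can be loaded into a lookup table and turned into log-count features; a minimal sketch (the tab-separated format below is a simplification of the actual distribution):

```python
import math

# Toy excerpt of the counts shown above, in 'n-gram<TAB>count' form.
NGRAM_LINES = ["dairy production\t17704", "dairy protein\t1268"]

COUNTS = {}
for line in NGRAM_LINES:
    ngram, count = line.rsplit("\t", 1)
    COUNTS[ngram] = int(count)

def log_count(phrase):
    """Log-count feature value; 0.0 if the phrase never occurs."""
    return math.log(COUNTS[phrase]) if phrase in COUNTS else 0.0

print(log_count("dairy production"))           # ~9.78
print(log_count("sustainability production"))  # 0.0 (unseen)
```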

Features for Explicit Paraphrases

  "dairy and meat production"  vs.  "sustainability and meat production"

Pattern: <head> of <N1> and <N2>
  ↑ Count(production of dairy and meat)
  ↓ Count(production of sustainability and meat)

Pattern: <N2> <head> and <N1>
  ↓ Count(meat production and dairy)
  ↑ Count(meat production and sustainability)

New paraphrases extending ideas in [Nakov & Hearst, 2005]
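A sketch of how such paraphrase patterns could become classifier features, reusing a log-count lookup like the one above (the pattern templates here are a simplified reading of the slide, not the paper's full feature set):

```python
def paraphrase_features(n1, n2, head, log_count):
    """Count features for the NP '<n1> and <n2> <head>'.

    A high count for '<head> of <n1> and <n2>' is evidence for ONE NP;
    a high count for '<n2> <head> and <n1>' is evidence for TWO NPs.
    """
    return {
        "oneNP:head-of-n1-and-n2": log_count(f"{head} of {n1} and {n2}"),
        "twoNP:n2-head-and-n1": log_count(f"{n2} {head} and {n1}"),
    }

# e.g. paraphrase_features("dairy", "meat", "production", log_count)
#      paraphrase_features("sustainability", "meat", "production", log_count)
```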






[Diagram: Human-Annotated Data (small) supplies Training Examples; raw Google N-gram Data (HUGE) supplies feature values; Feature Vectors x1, x2, x3, x4 feed Machine Learning to produce the Classifier h(x)]

Using Bilingual Data

Bilingual data: a rich source of paraphrases

  dairy and meat production ↔ producción láctea y cárnica

Build a classifier which uses bilingual features.

Applicable when we know the translation of the NP.

Bilingual "Paraphrase" Features

Pattern (Spanish):
  "dairy and meat production" → Count(producción láctea y cárnica)
  "sustainability and meat production" → unseen

Pattern (Italian):
  "dairy and meat production" → unseen
  "sustainability and meat production" → Count(sostenibilità e la produzione di carne)
Bilingual "Paraphrase" Features

Pattern (Finnish):
  "dairy and meat production" → Count(maidon- ja lihantuotantoon)
  "sustainability and meat production" → unseen

[Diagram: as before, but the features now come from Translation Data (medium-sized Bilingual Data): Human-Annotated Data (small) supplies Training Examples; Feature Vectors x1, x2, x3, x4 feed Machine Learning to produce the Classifier h(xb)]

[Diagram: two classifiers are trained from the same Training Examples: h(xm) with features from Google Data and h(xb) with features from Translation Data; unlabeled Bitext Examples ("coal and steel money", "rocket and mortar attacks") wait to be labeled]

[Diagram: the bilingual classifier h(xb)1 labels further bitext examples ("business and computer science", "the Bosporus and Dardanelles straits", "the environment and air transport")]

[Diagram: the examples labeled by h(xb)1 are added to the training examples for the monolingual classifier, yielding h(xm)1, and the process repeats with the two views teaching each other]

Co-Training: [Yarowsky '95], [Blum & Mitchell '98]

[Figure: error rate (%) of the co-trained classifiers h(xm)i and h(xb)i across co-training iterations i]

[Figure: error rate (%) on the Penn Treebank (PTB) for broad-coverage parsers, Nakov & Hearst (2005) (unsupervised), Pitler et al. (2010), a new supervised monoclassifier (both trained on 800 PTB examples), and the co-trained monoclassifier h(xm)N (2 training examples)]

Part 1: Conclusion

Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases
- New paraphrase features
- New way to use bilingual data as soft supervision to guide the use of monolingual features

Next steps: Use bilingual data even when we don't know the translations to begin with
- infer translations jointly with syntax
- i.e., beyond bitexts (1B), make use of huge (1T+) N-gram corpora in English, Spanish, French, …

Part 2: Using visual data to learn the meaning of words

Large volumes of visual data also reveal word meaning (semantics), but in a language-universal way.

Humans label their images as they post them online, providing the word-meaning link.

There are lots of images to work with [from Facebook's Twitter feed].

Part 2: Using visual data to learn the meaning of words

Progress in the area of "lexical semantics"

Task #1: learning translations of words into foreign languages using visual data, e.g.
  "turtle" in English = "tortuga" in Spanish

Main contribution: a totally new approach to building bilingual dictionaries [Bergsma and Van Durme, IJCAI 2011]


[Image grid: English Web Images vs. Spanish Web Images for turtle/tortuga, candle/vela, cockatoo/cacatúa]

Task #1: Bilingual Lexicon Induction

Why?
- Needed for automatic machine translation, cross-language information retrieval, etc.
- Poor coverage of human-compiled dictionaries/bitexts

How to do it with monolingual data only?
- Link words to information that is preserved across languages (clues to common meaning)

Clues to Common Meaning: Spelling [Koehn & Knight 2002, many others]

  natural : natural
  higiénico : hygienic
  radón : radon

  vela : candle (no spelling clue)
  *calle : candle (similar spelling, wrong translation)

Clues to Common Meaning: Images

[Images: web images for "candle", "calle", "vela"]

Visual similarities:
- high contrast
- black background
- glowing flame
Link words by web-based visual similarity

Step 1: Retrieve online images via Google Image Search (in each language), 20 images for each word

- Google is competitive with "hand-prepared datasets" [Fergus et al., 2005]

Step 2: Create Image Feature Vectors

Color histogram features
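A color-histogram vector is simple to compute; here is a sketch with NumPy and Pillow (the 8-bins-per-channel RGB grid is an assumption; the slide does not specify the binning actually used):

```python
import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    """L1-normalized RGB color histogram (bins**3 dimensions).

    Assumption: 8 bins per channel; the slide does not give the
    binning used in the experiments.
    """
    rgb = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(rgb, bins=(bins,) * 3, range=[(0, 256)] * 3)
    vec = hist.ravel()
    return vec / vec.sum()
```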

Step 2: Create Image Feature Vectors

SIFT keypoint features, using David Lowe's software [Lowe, 2004]

Step 3: Compute an Aggregate Similarity for Two Words

[Diagram: cosine similarity between image feature vectors; each English image is matched to its most similar foreign image (e.g., best-match scores 0.33, 0.55, 0.19, 0.46), and these scores are averaged over all English images]
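The aggregation the diagram describes (best cosine match per English image, averaged over all English images) can be sketched as follows, assuming the feature vectors from Step 2 (function names hypothetical):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def aggregate_similarity(english_vecs, foreign_vecs):
    """Average, over English images, of the best cosine match
    against any image retrieved for the foreign word."""
    best = [max(cosine(e, f) for f in foreign_vecs) for e in english_vecs]
    return sum(best) / len(best)

# With best-match scores 0.33, 0.55, 0.19 and 0.46 as in the diagram,
# the aggregate similarity is (0.33 + 0.55 + 0.19 + 0.46) / 4 = 0.3825.
```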

Output: Ranking of Foreign Translations by Aggregate Visual Similarities

English: rosary

  Spanish                       French
  1. camándula : 0.151          1. chapelet : 0.213
  2. puntaje : 0.140            2. activité : 0.153
  3. accidentalidad : 0.139     3. rosaire : 0.150

Experiments

- 500-word lists in each language
- Results on all pairs from German, English, Spanish, French, Italian, Dutch
- Avg. Top-N Accuracy: How often is the correct answer in the top N most similar words?
- Lots more details in the paper, including how we determine which words are 'physical objects'
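For concreteness, Top-N accuracy over a word list can be computed as in this sketch (hypothetical data structures: `ranked[w]` is the candidate list sorted by aggregate visual similarity, `gold[w]` the dictionary translation):

```python
def top_n_accuracy(ranked, gold, n):
    """Fraction of words whose correct translation is in the top n."""
    hits = sum(1 for w in gold if gold[w] in ranked[w][:n])
    return hits / len(gold)

# e.g. top_n_accuracy(ranked, gold, 1) and top_n_accuracy(ranked, gold, 20)
```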

[Figure: Average Top-1 and Top-20 accuracy (%) on 14 language pairs; y-axis from 0 to 80]

Task #2: Lexical Semantics from Images

Can you eat "migas"?
Can you eat "carillon"?
Can you eat "mamey"?

Selectional Preference: Is noun X a plausible object for verb Y? [Bergsma and Goebel, RANLP 2011]

Conclusion

Robust NLP needs to look beyond human-annotated data to exploit large corpora.

Size matters:
- Most parsing systems are trained on 1 million words
- We use:
  - billions of words in bitexts
  - trillions of words of monolingual text
  - hundreds of billions of online images (at 1000 words each, that's 100 trillion words!)

Questions + Thanks

Gold sponsors: [logos]

Platinum sponsors (collaborators): Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)