
Three kinds of web data that can help computers make better sense of human language

Shane Bergsma

Johns Hopkins University


Fall 2011

2

Computers that understand language

“William Wilkinson’s ‘An Account of the
Principalities of Wallachia and Moldavia’
inspired this author’s most famous novel.”

3

Research Vision

Robust processing of human language requires knowledge beyond what's in small manually-annotated data sets

Derive meaning from real-world data:

1) Raw text on the web

2) Bilingual text (words plus their translations)
→ Part 1: Parsing noun phrases

3) Visual data (labelled online images)
→ Part 2: Learning the meaning of words

4

Part 1: Parsing Noun Phrases (NPs)

Google: What pages/ads should be returned for the query "washed baby carrots"?

[washed baby] carrots = carrots for washed babies
vs.
washed [baby carrots] = baby carrots that are washed

5

Training a parser via machine learning

washed baby carrots → PARSER (with weights, w0) → [washed baby] carrots → TESTER: INCORRECT in training data

6

Training a parser via machine learning

washed baby carrots → PARSER (with weights, w1) → washed [baby carrots] → TESTER: CORRECT in gold standard

Training corpus:
retired [science teacher]
[social science] teacher
female [bus driver]
[school bus] driver
zebra [hair straightener]
alleged [Canadian lover]
…

7

More data is better data (learning curve)

[Learning curve for the Grammar Correction Task, from Banko & Brill, 2001]

8

Testing a parser on new data

washed baby smell → PARSER (with final weights, wN) → washed [baby smell] → TESTER: INCORRECT

Big Challenge: For parsing NPs, every word matters
- both parses are grammatical
- we can't generalize from "washed baby carrots" in training to "washed baby smell" at test time

Solution: New sources of data
- Having seen washed [baby carrots] in training…

9

English Data for Parsing

Human-Annotated: 1 MILLION words
Penn (Parse-)Treebank [Marcus et al., 1993]

Bitexts: 1 BILLION words
Canadian Hansards, etc. [Callison-Burch et al., 2010]

Web text [N-grams]: 1 TRILLION words
Google N-gram Data [Brants & Franz, 2006]

10

Task: Parsing NPs with conjunctions

1) [dairy and meat] production
2) [sustainability] and [meat production]

yes: [dairy production] in (1)
no: [sustainability production] in (2)

Our contributions: new semantic features from raw web text and a new approach to using bilingual data as soft supervision

[Bergsma, Yarowsky & Church, ACL 2011]

11

One Noun Phrase or Two: A Machine Learning Approach

Input: "dairy and meat production" → features: x

x = (…, first-noun=dairy, …
        second-noun=meat, …
        first+second-noun=dairy+meat, …)

h(x) = w · x (predict one NP if h(x) > 0)

Set w via training on annotated training data using some machine learning algorithm
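
A minimal sketch of this classifier in Python (my naming, not the paper's code): sparse binary features over the noun pair, scored by a dot product with learned weights.

```python
# Sketch: featurize an "N1 and N2 N3" coordination and score it with a
# linear model h(x) = w . x; predict ONE noun phrase if the score > 0.

def featurize(n1, n2, n3):
    """Map e.g. ("dairy", "meat", "production") to sparse binary features."""
    return {
        f"first-noun={n1}": 1,
        f"second-noun={n2}": 1,
        f"first+second-noun={n1}+{n2}": 1,
        # ... plus the web-count and paraphrase features on later slides,
        # which would also use n3 (the head noun).
    }

def h(w, x):
    """Linear score: dot product of weights w and feature vector x."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

# Hypothetical weights, as if learned from annotated training data:
w = {"first+second-noun=dairy+meat": 1.2, "first-noun=sustainability": -0.8}
x = featurize("dairy", "meat", "production")
print("one NP" if h(w, x) > 0 else "two NPs")   # -> one NP
```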

12

Leveraging Web-Derived Knowledge

[dairy and meat] production
If there is only one NP, then it is implicitly talking about "dairy production"
→ Do we see this phrase occurring a lot on the web? [Yes]

sustainability and [meat production]
If there is only one NP, then it is implicitly talking about "sustainability production"
→ Do we see this phrase occurring a lot on the web? [No]

Classifier has features for these counts

13

Search Engine Page Counts for NLP

Early web work: Use an Internet search engine to get web counts [Keller & Lapata, 2003]

"dairy production": 714,000 pages
"sustainability production": 11,000 pages

Problem: Using a search engine is just too inefficient to get data on a large scale

14

Google N-gram Data for NLP

Google N-gram Data [Brants & Franz, 2006]

N words in sequence + their count on the web:

dairy producers      22724
dairy production     17704
dairy professionals    204
dairy profits           82
dairy propaganda        15
dairy protein         1268

A compressed version of all the text on the web

Enables new features/statistics for a range of tasks [Bergsma et al. ACL 2008, IJCAI 2009, ACL 2010, etc.]
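
A rough sketch of how such counts get used, assuming the N-gram data has been extracted into a plain "ngram<TAB>count" text file (the file name here is hypothetical; the real corpus ships as many such files):

```python
# Load N-gram counts into a dictionary, then look up phrases as features.
from collections import defaultdict

counts = defaultdict(int)
with open("2gms.txt", encoding="utf-8") as f:   # hypothetical extract
    for line in f:
        ngram, count = line.rstrip("\n").split("\t")
        counts[ngram] = int(count)

# The web-count feature for the implicit phrase "dairy production":
print(counts["dairy production"])   # 17704 in the excerpt above
```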

15

Features for Explicit Paraphrases

dairy and meat production
sustainability and meat production

For an NP "N1 and N2 N3", count explicit paraphrases on the web:

Pattern: N3 of N1 and N2
↑ Count(production of dairy and meat)
↓ Count(production of sustainability and meat)

Pattern: N2 N3 and N1
↓ Count(meat production and dairy)
↑ Count(meat production and sustainability)

New paraphrases extending ideas in [Nakov & Hearst, 2005]
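A sketch of how these paraphrase counts could become classifier features; the pattern slots (N1, N2, N3 as above) and feature names are my notation, and the toy count is made up:

```python
def paraphrase_features(counts, n1, n2, n3):
    """Web counts for explicit paraphrases of 'n1 and n2 n3'."""
    return {
        # High count suggests ONE NP: "production of dairy and meat"
        "one-np-paraphrase-count": counts.get(f"{n3} of {n1} and {n2}", 0),
        # High count suggests TWO NPs: "meat production and sustainability"
        "two-np-paraphrase-count": counts.get(f"{n2} {n3} and {n1}", 0),
    }

toy_counts = {"production of dairy and meat": 312}   # made-up count
print(paraphrase_features(toy_counts, "dairy", "meat", "production"))
```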
16

[Diagram: Human-Annotated Data (small) supplies the training examples; Raw Data (HUGE: Google N-gram Data) supplies the features. Feature vectors x1, x2, x3, x4 feed machine learning to produce the classifier h(x).]

17

Using Bilingual Data

Bilingual data: a rich source of paraphrases

dairy and meat production ↔ producción láctea y cárnica

Build a classifier which uses bilingual features

Applicable when we know the translation of the NP

18

Bilingual "Paraphrase" Features

dairy and meat production
sustainability and meat production

Pattern (Spanish):
- dairy and meat production: Count(producción láctea y cárnica) [seen]
- sustainability and meat production: unseen

Pattern (Italian):
- dairy and meat production: unseen
- sustainability and meat production: Count(sostenibilità e la produzione di carne) [seen]

19

Bilingual "Paraphrase" Features

dairy and meat production
sustainability and meat production

Pattern (Finnish):
- dairy and meat production: Count(maidon- ja lihantuotantoon) [seen]
- sustainability and meat production: unseen

20

[Diagram: Human-Annotated Data (small) supplies the training examples; Bilingual Data (medium: translation data) supplies the features. Feature vectors x1, x2, x3, x4 feed machine learning to produce the classifier h(xb).]

21

[Diagram: from the same training examples, two classifiers are built: h(xm) with added features from Google Data, and h(xb) with added features from Translation Data. Unlabeled bitext examples await: "coal and steel money", "rocket and mortar attacks", …]

22

[Diagram: the bilingual classifier h(xb)1 labels new bitext examples: "business and computer science", "the Bosporus and Dardanelles straits", "the environment and air transport".]

23

[Diagram: the examples labeled by h(xb)1 are added to the training examples for the monolingual view, producing h(xm)1; the two classifiers keep labeling new data for each other.]

Co-Training: [Yarowsky '95], [Blum & Mitchell '98]

24

[Plot: error rate (%) of the co-trained classifiers h(xm)i and h(xb)i over co-training iterations i]

25

Error rate (%) on Penn Treebank (PTB)

[Bar chart, error rate 0-20%: Broad-coverage Parsers; Nakov & Hearst (2005), unsupervised; Pitler et al. (2010), 800 PTB training examples; New Supervised Monoclassifier, 800 PTB training examples; Co-trained Monoclassifier h(xm)N, 2 training examples]

26

Part 1: Conclusion

Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases
- New paraphrase features
- New way to use bilingual data as soft supervision to guide the use of monolingual features

Next steps: Use bilingual data even when we don't know the translations to begin with
- infer translations jointly with syntax
- i.e., beyond bitexts (1B words), make use of huge (1T+) N-gram corpora in English, Spanish, French, …

27

Part 2: Using visual data to learn the meaning of words

Large volumes of visual data also reveal word meaning (semantics), but in a language-universal way

Humans label their images as they post them online, providing the word-meaning link

There are lots of images to work with [from Facebook's Twitter feed]

28

Part 2: Using visual data to learn the meaning of words

Progress in the area of "lexical semantics"

Task #1: learning translations of words into foreign languages using visual data, e.g. "turtle" in English = "tortuga" in Spanish

Main contribution: a totally new approach to building bilingual dictionaries

[Bergsma and Van Durme, IJCAI 2011]


29

[Image grid: English web images for "turtle", "candle", "cockatoo" beside Spanish web images for "tortuga", "vela", "cacatúa"]

30

Task #1: Bilingual Lexicon Induction

Why?
- Needed for automatic machine translation, cross-language information retrieval, etc.
- Poor coverage of human-compiled dictionaries/bitexts

How to do it with monolingual data only?
- Link words to information that is preserved across languages (clues to common meaning)

31

Clues to Common Meaning: Spelling

[Koehn & Knight 2002, many others]

natural : natural
higiénico : hygienic
radón : radon
vela : candle (no spelling clue)
*calle : candle (spelling misleads)



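One common way to score such spelling clues is a normalized edit distance (in the spirit of, though not necessarily identical to, the measures in [Koehn & Knight 2002]):

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def spelling_similarity(a, b):
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(spelling_similarity("radón", "radon"))   # high: likely cognates
print(spelling_similarity("vela", "candle"))   # low: no spelling clue
print(spelling_similarity("calle", "candle"))  # misleadingly high
```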
32

Clues to Common Meaning: Images

[Images for "candle", "calle", "vela"]

Visual similarities (candle ↔ vela):
- high contrast
- black background
- glowing flame

33

Link words by web-based visual similarity

Step 1: Retrieve online images via Google Image Search (in each lang.), 20 images for each word

Google competitive with "hand-prepared datasets" [Fergus et al., 2005]

34

Step 2: Create Image Feature Vectors

Color histogram features
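
A sketch of such a feature vector, assuming images load as H x W x 3 uint8 arrays (e.g. via Pillow); the bin count is illustrative:

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """Concatenate per-channel intensity histograms, L1-normalized."""
    feats = []
    for c in range(3):  # R, G, B channels
        hist, _ = np.histogram(image[:, :, c],
                               bins=bins_per_channel, range=(0, 256))
        feats.append(hist)
    v = np.concatenate(feats).astype(float)
    return v / v.sum()

# Toy usage on a random "image":
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
print(color_histogram(img).shape)   # (24,)
```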

35

Step 2: Create Image Feature Vectors

SIFT keypoint features

Using David Lowe's software [Lowe, 2004]

36

Step 3: Compute an Aggregate Similarity for Two Words

[Diagram: pairwise cosine similarities between image feature vectors (e.g. 0.33, 0.55, 0.19, 0.46); take the best match for each English image, then average over all English images]
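
A sketch of this aggregation (best cosine match for each English image, then the average over all English images):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def aggregate_similarity(english_vecs, foreign_vecs):
    """Avg., over English images, of the best match to any foreign image."""
    best = [max(cosine(e, f) for f in foreign_vecs) for e in english_vecs]
    return sum(best) / len(best)
```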

37

Output: Ranking of Foreign Translations by Aggregate Visual Similarities

English: rosary
Spanish: 1. camándula: 0.151   2. puntaje: 0.140    3. accidentalidad: 0.139
French:  1. chapelet: 0.213    2. activité: 0.153   3. rosaire: 0.150






38

Experiments

- 500-word lists in each language
- Results on all pairs from German, English, Spanish, French, Italian, Dutch
- Avg. Top-N Accuracy: how often is the correct answer in the top N most similar words? (a toy computation follows this list)
- Lots more details in the paper, including how we determine which words are 'physical objects'
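
A toy version of the Top-N accuracy computation (the data below is illustrative, not the paper's):

```python
def top_n_accuracy(rankings, gold, n):
    """rankings: word -> candidates (best first); gold: word -> answer."""
    hits = sum(1 for w, ranked in rankings.items() if gold[w] in ranked[:n])
    return 100.0 * hits / len(rankings)

rankings = {"rosary": ["camándula", "puntaje", "accidentalidad"]}
gold = {"rosary": "camándula"}
print(top_n_accuracy(rankings, gold, 1))   # 100.0 on this toy example
```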

39

Average Top-N Accuracy on 14 Language Pairs

[Bar chart: average Top-1 and Top-20 accuracy (%, axis 0 to 80) over the 14 language pairs]
40

Task #2: Lexical Semantics from Images

Can you eat "migas"?
Can you eat "carillon"?
Can you eat "mamey"?

Selectional Preference: Is noun X a plausible object for verb Y?

[Bergsma and Goebel, RANLP 2011]

41

Conclusion

Robust NLP needs to look beyond human-annotated data to exploit large corpora

Size matters:
- Most parsing systems are trained on 1 million words
- We use:
  - billions of words in bitexts
  - trillions of words of monolingual text
  - hundreds of billions of online images (at ~1000 words each, that's ~100 trillion words!)

42

Questions + Thanks

Gold sponsors: [logos]

Platinum sponsors (collaborators): Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)