
Maximum Entropy Language Modeling with
Syntactic, Semantic and Collocational
Dependencies

Jun Wu and Sanjeev Khudanpur


Center for Language and Speech Processing

Johns Hopkins University

Baltimore, MD 21218


August 2000


NSF STIMULATE Grant No. IRI-9618874


STIMULATE Team in CLSP


Faculty:


Frederick Jelinek: syntactic language modeling


Eric Brill: consensus lattice rescoring


Sanjeev Khudanpur: maximum entropy language modeling


David Yarowsky: topic/genre dependent language modeling


Students:


Ciprian Chelba: syntactic language modeling


Radu Florian: topic/genre dependent language modeling


Lidia Mangu: consensus lattice rescoring


Jun Wu: maximum entropy language modeling


Peng Xu: syntactic language modeling


Outline


The maximum entropy principle


Semantic (Topic) dependencies


Syntactic dependencies


ME models with topic and syntactic dependencies


Conclusion and Future Work


The Maximum Entropy Principle


The maximum entropy (MAXENT) principle


When we make inferences based on incomplete information, we
should draw from that probability distribution that has the
maximum entropy permitted by the information we do have.


Example (Dice)


Let $p_i$ be the probability that the facet with $i$ dots faces up, $i = 1, 2, \ldots, 6$.
Seek the model $P = (p_1, p_2, \ldots, p_6)$ that maximizes

$$H(P) = -\sum_i p_i \log p_i .$$

From the Lagrangian

$$\mathcal{L}(P, \alpha) = -\sum_i p_i \log p_i + \alpha \Big( 1 - \sum_i p_i \Big) ,$$

setting $\frac{\partial \mathcal{L}}{\partial p_j} = -\log p_j - 1 + \alpha = 0$ gives $p_j = e^{\alpha - 1}$ for every $j$.

So $p_1 = p_2 = \cdots = p_6$. Choose $\alpha$ to normalize: $p_i = \frac{1}{6}$.


The Maximum Entropy Principle
(Cont.)


Example 2: Seek a probability distribution with constraints.

$$\hat{p}_2 = \frac{1}{4} \qquad (\hat{P} \text{ is the empirical distribution.})$$

The feature:

$$f_i = \begin{cases} 1 & \text{if } i = 2 \\ 0 & \text{otherwise} \end{cases}$$

Empirical expectation:

$$E_{\hat{P}}[f] = \sum_i \hat{p}_i f_i = \frac{1}{4}$$

Maximize $H(P) = -\sum_i p_i \log p_i$ subject to $E_P[f] = E_{\hat{P}}[f]$.

From the Lagrangian

$$\mathcal{L}(P, \alpha_1, \alpha_2) = -\sum_i p_i \log p_i + \alpha_1 \Big( 1 - \sum_i p_i \Big) + \alpha_2 \Big( \sum_i p_i f_i - \tfrac{1}{4} \Big) .$$

So $p_2 = \frac{1}{4}$ and $p_1 = p_3 = \cdots = p_6 = \frac{3}{20}$.
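
To make the constrained maximization concrete, here is a minimal numerical sketch (not from the original slides) that recovers the same answer with an off-the-shelf optimizer; the feature vector, target value and variable names simply restate the toy dice setup above.

```python
# Maximize H(P) over 6 outcomes subject to sum(p) = 1 and E_P[f] = 1/4,
# where f is the indicator of facet 2; the closed form above gives
# p_2 = 1/4 and p_i = 3/20 for the other facets.
import numpy as np
from scipy.optimize import minimize

f = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # feature: fires only on facet 2
target = 0.25                                   # empirical expectation of f

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))         # minimizing -H(P)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},        # normalization
    {"type": "eq", "fun": lambda p: np.dot(p, f) - target},  # feature constraint
]
p0 = np.full(6, 1.0 / 6.0)                       # start from the uniform distribution
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * 6, constraints=constraints)
print(np.round(res.x, 4))                        # approx. [0.15 0.25 0.15 0.15 0.15 0.15]
```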

Maximum Entropy Language Modeling



$\hat{P}$: empirical distribution,

$f_1, f_2, \ldots, f_k$: feature functions,

$E_{\hat{P}}(f_1), E_{\hat{P}}(f_2), \ldots, E_{\hat{P}}(f_k)$: their empirical expectations.


A maximum entropy (ME) language model is a maximum likelihood model in the exponential family

$$P(w \mid h) = \frac{\alpha_1^{f_1(h,w)} \, \alpha_2^{f_2(h,w)} \cdots \alpha_k^{f_k(h,w)}}{Z(h)} ,$$

which satisfies each constraint $E_P(f_i) = E_{\hat{P}}(f_i)$ while maximizing $H(P)$.


$h$ is the history, $w$ is the future.
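
As a small illustration of the exponential form above (a sketch, not the authors' implementation): the conditional probability is obtained by multiplying the weights of the active features and normalizing over the vocabulary. The two features, their weights and the tiny vocabulary below are made up for the example.

```python
# P(w | h) is proportional to the product over features of alpha_i ** f_i(h, w),
# normalized by Z(h) = sum of the same product over the whole vocabulary.
def me_prob(w, h, features, alphas, vocab):
    def score(word):
        s = 1.0
        for f, a in zip(features, alphas):
            s *= a ** f(h, word)
        return s
    z = sum(score(v) for v in vocab)             # normalizer Z(h)
    return score(w) / z

# Two illustrative binary features: a unigram feature and a bigram feature.
vocab = ["the", "contract", "ended"]
features = [lambda h, w: 1 if w == "contract" else 0,
            lambda h, w: 1 if (h[-1], w) == ("the", "contract") else 0]
alphas = [1.5, 2.0]
print(me_prob("contract", ("the",), features, alphas, vocab))
```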

Advantages and Disadvantage of
Maximum Entropy Language Modeling


Advantages:


Creating a “smooth” model that satisfies all empirical
constraints.


Incorporating various sources of information in a unified
language model.


Disadvantage:


Training is time- and space-consuming.


Motivation for Exploiting Semantic and Syntactic Dependencies


N-gram models take only local correlations between words into account.


Several dependencies in natural language have longer, sentence-structure-dependent spans that may compensate for this deficiency.


Need a model that exploits topic and syntax.

Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange.


Training a Topic Sensitive Model


Cluster the training data by topic.

TF-IDF vector (excluding stop words).

Cosine similarity.

K-means clustering.


Select topic dependent words:

$$f_t(w) \, \log \frac{f_t(w)}{f(w)} > \text{threshold}$$


Estimate an ME model with topic unigram constraints:

$$P(w_i \mid w_{i-1}, w_{i-2}, topic) = \frac{e^{\lambda_{w_i}} \, e^{\lambda_{(w_{i-1}, w_i)}} \, e^{\lambda_{(w_{i-2}, w_{i-1}, w_i)}} \, e^{\lambda_{(topic, w_i)}}}{Z(w_{i-1}, w_{i-2}, topic)} ,$$

where

$$\sum_{w_{i-2}, w_{i-1}} P(w_i, w_{i-1}, w_{i-2} \mid topic) = \frac{\#[topic, w_i]}{\#[topic]} .$$
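
A minimal sketch of the clustering step, assuming scikit-learn as a stand-in for whatever tools were actually used: TF-IDF vectors over conversations (stop words excluded), length-normalized so that dot products are cosine similarities, then k-means. The placeholder strings and the small cluster count are illustrative only; the slides report on the order of 70 topics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# One string per training conversation; the real data has ~1100 of these.
conversations = ["placeholder conversation about futures and contracts",
                 "placeholder conversation about family and weather",
                 "another placeholder conversation about sports"]

vectorizer = TfidfVectorizer(stop_words="english")        # TF-IDF, stop words excluded
X = normalize(vectorizer.fit_transform(conversations))    # unit length: dot product = cosine

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # the slides use ~70 topic clusters
topic_of_conversation = kmeans.fit_predict(X)
```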


Recognition Using a Topic-Sensitive Model


Detect the current topic from:

Recognizer's N-best hypotheses vs. reference transcriptions.

Using N-best hypotheses causes little degradation (in perplexity and WER).


Assign a new topic for each:

Conversation vs. utterance.

Topic assignment for each utterance is better than topic assignment for the whole conversation.


See the Khudanpur and Wu ICASSP'99 paper and Florian and Yarowsky ACL'99 for details.
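
A hedged sketch of test-time topic assignment, reusing the vectorizer and (row-normalized) centroids from the clustering sketch above; the function name, the pooling of the N-best list and the threshold fallback are assumptions, not the exact recipe of the cited papers.

```python
import numpy as np

def assign_topic(nbest_hypotheses, vectorizer, centroids, threshold=0.0):
    """Pick the closest topic centroid for one utterance (or conversation).

    nbest_hypotheses: list of hypothesis strings (e.g. a 10-best list);
    centroids: array of unit-length topic centroid vectors.
    Returns None (use the topic-independent model) if nothing is similar enough."""
    text = " ".join(nbest_hypotheses)              # pool the N-best list
    v = vectorizer.transform([text]).toarray()[0]
    v = v / (np.linalg.norm(v) + 1e-12)            # unit length
    sims = centroids @ v                           # cosine similarity to every topic
    best = int(np.argmax(sims))
    return best if sims[best] > threshold else None
```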


Experimental Setup


The experiments are based on the WS97 dev test set.

Vocabulary: 22K (closed).

LM training set: 1100 conversations, 2.1M words.

AM training set: 60 hours of speech data.

Acoustic model: state-clustered cross-word triphone models (6700 states, 12 Gaussians/state).

Front end: 13 MF-PLP + Δ + ΔΔ, per-conversation-side CMS.

Test set: 19 conversations (2 hours), 18K words.

No speaker adaptation.


The evaluation is based on rescoring 100-best lists from the first recognition pass.





Topic Assignment During Testing: Reference Transcriptions vs. Hypotheses


Even with a WER of over 38%, there is only a small loss in perplexity and a negligible loss in WER when the topic assignment is based on recognizer hypotheses instead of the correct transcriptions.


Comparisons with the oracle indicate that there is little room for further improvement.

        None    Manual   Ref     10best   Oracle
PPL     79.0    73.1     73.8    74.4     72.5
WER     38.5%   37.8%    37.8%   37.9%    37.7%

Topic Assignment During Testing: Conversation Level vs. Utterance Level


Topic assignment based on utterances gives a slightly better result than assignment based on whole conversations.


Most utterances prefer the topic-independent model.


Less than one half of the remaining utterances prefer a topic other than that assigned at the conversation level.

        None    Ref.C   Ref.U   10best.C   10best.U
PPL     79.0    73.8    73.3    74.4       73.5
WER     38.5%   37.8%   37.8%   37.9%      37.8%

Utterance-level topic choice: 83% no topic, 10% agree with the conversation-level topic, 7% disagree.

ME Method vs Interpolation


The ME model with only topic-dependent unigram constraints outperforms the interpolated topic-dependent trigram model.


The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

        None    +T1gram   +T2gram   +T3gram   ME
PPL     79.0    78.4      77.3      76.1      73.5
WER     38.5%   38.5%     38.3%     38.1%     37.8%

Model   3gram   +topic 1-gram   +topic 2-gram   +topic 3-gram   ME
Size    499K    +70*11K         +70*26K         +70*55K         +16K


ME vs Cache-Based Models


The cache-based model reduces the perplexity, but increases the WER.


The cache-based model introduces 0.6% more repeated errors than the trigram model does.


The cache model may not be practical when the baseline WER is high.

        3gram   Cache   ME
PPL     79.0    75.2    73.5
WER     38.5%   38.9%   37.8%

$$P(w_i \mid w_{i-1}, w_{i-2}) = \lambda \, P_3(w_i \mid w_{i-1}, w_{i-2}) + (1 - \lambda) \, P_c(w_i)$$
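
For comparison, a hedged sketch of the cache interpolation in the formula above: a unigram cache over the words decoded so far in the conversation, mixed with a base trigram probability. `trigram_prob` and the default weight are placeholders, not values from the slides.

```python
from collections import Counter

class CacheLM:
    def __init__(self, trigram_prob, lam=0.9):
        self.trigram_prob = trigram_prob   # function (w, w1, w2) -> P_3(w | w1, w2)
        self.lam = lam                     # interpolation weight on the trigram
        self.cache = Counter()             # words decoded so far in this conversation

    def update(self, word):
        self.cache[word] += 1

    def prob(self, w, w1, w2):
        total = sum(self.cache.values())
        p_cache = self.cache[w] / total if total else 0.0
        return self.lam * self.trigram_prob(w, w1, w2) + (1 - self.lam) * p_cache
```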

Summary of Topic-Dependent Language Modeling


We significantly reduce both the perplexity (7%) and WER (0.7% absolute) by incorporating a small number of topic constraints with N-grams using the ME method.


Using N-best hypotheses causes little degradation (in perplexity and WER).


Topic assignment at the utterance level is better than at the conversation level.


The ME method is more efficient than linear interpolation in combining topic dependencies with N-grams.


The topic-dependent model is better than the cache-based model in reducing WER when the baseline is poor.


Exploiting Syntactic Dependencies


A stack of parse trees for each sentence prefix is generated.


All sentences in the training set are parsed by a left-to-right parser.

[Figure: a partial parse of "The contract ended with a loss of 7 cents after ...", tagged DT NN VBD IN DT NN IN CD NNS; the exposed head words are h_{i-2} = "contract" (NP) and h_{i-1} = "ended" (VP), the two preceding words are w_{i-2} = "cents" and w_{i-1} = "after", and nt_{i-2}, nt_{i-1} are the corresponding non-terminal labels.]


Exploiting Syntactic Dependencies (Cont.)


A probability is assigned to each word as:

$$P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid T_i, W_1^{i-1}) \, \rho(T_i \mid W_1^{i-1})
= \sum_{T_i \in S_i} P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}) \, \rho(T_i \mid W_1^{i-1})$$

[Figure: the partial parse of "The contract ended with a loss of 7 cents after ..." again, showing the two preceding words w_{i-2}, w_{i-1}, the two exposed head words h_{i-2}, h_{i-1} ("contract", "ended"), and their non-terminal labels nt_{i-2}, nt_{i-1}.]




It is assumed that most of the useful information is embedded in the 2 preceding words and 2 preceding heads.
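
A small sketch of the probability assignment above, under the assumption that each partial parse in the stack exposes its last two head words and non-terminal labels; `cond_prob` and the parse attribute names are illustrative stand-ins for the conditional model P(w_i | w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}).

```python
# Each partial parse T in the stack S_i contributes its conditional word
# probability, weighted by the parse's posterior rho(T | W_1^{i-1}).
def word_prob(word, history_words, parse_stack, cond_prob):
    """parse_stack: list of (parse, posterior) pairs for the current sentence prefix."""
    w1, w2 = history_words[-1], history_words[-2]
    total = 0.0
    for parse, rho in parse_stack:
        h1, h2 = parse.exposed_heads[-1], parse.exposed_heads[-2]       # head words
        nt1, nt2 = parse.exposed_labels[-1], parse.exposed_labels[-2]   # non-terminals
        total += cond_prob(word, w1, w2, h1, h2, nt1, nt2) * rho
    return total
```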


Training a Syntactic ME Model


Estimate an ME model with syntactic constraints:

$$P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}) =
\frac{e^{\lambda_{w_i}} \, e^{\lambda_{(w_{i-1}, w_i)}} \, e^{\lambda_{(w_{i-2}, w_{i-1}, w_i)}} \,
      e^{\lambda_{(h_{i-1}, w_i)}} \, e^{\lambda_{(h_{i-2}, h_{i-1}, w_i)}} \,
      e^{\lambda_{(nt_{i-1}, w_i)}} \, e^{\lambda_{(nt_{i-2}, nt_{i-1}, w_i)}}}
     {Z(w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2})} ,$$

where the marginal constraints equate the model's marginals with the relative frequencies

$$\frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]} , \qquad \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]} , \qquad \frac{\#[nt_{i-2}, nt_{i-1}, w_i]}{\#[nt_{i-2}, nt_{i-1}]}$$

of the word, head-word and non-terminal N-grams, e.g.

$$\sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, w_i) = \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]} .$$


See Chelba and Jelinek ACL'98 and Wu and Khudanpur ICASSP'00 for details.



Experimental Results of Syntactic LMs

        3gram   NT      HW      Both
PPL     79.0    75.1    74.5    74.0
WER     38.5%   37.8%   37.7%   37.5%


Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute.


Head word N-gram constraints result in a 6% reduction in perplexity and 0.8% absolute in WER.


Non-terminal and head word constraints together reduce the perplexity by 6.3% and WER by 1.0% absolute.



ME vs Interpolation

        3gram   Interp  ME
PPL     79.0    75.5    74.0
WER     38.5%   37.9%   37.5%

The ME model is more effective in using syntactic dependencies than the interpolation model:

$$P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}) =
\lambda \, P_3(w_i \mid w_{i-1}, w_{i-2}) + (1 - \lambda) \, P_{slm}(w_i \mid h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2})$$

Head Words inside vs. outside 3gram Range

[Figure: two partial parses of "The contract ended with a loss of 7 cents after ...". In one, the exposed head words h_{i-2}, h_{i-1} ("contract", "ended") lie outside the trigram window of the preceding words w_{i-2}, w_{i-1}; in the other, the heads coincide with the two preceding words and therefore fall inside trigram range.]


Syntactic Heads inside vs. outside Trigram Range

WER        Inside   Outside
Trigram    37.8%    40.3%
NT         37.2%    39.4%
HW         37.4%    38.8%
Both       36.9%    38.9%

(73% of predicted words have their syntactic heads inside trigram range, 27% outside.)


The WER of the baseline trigram model is relatively high when syntactic heads are beyond trigram range.


Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.5%) than when they are within trigram range.


However, non-terminal N-gram constraints help almost evenly in both cases.


Can this gain be obtained from a POS class model too?


The WER reduction for the model with both head word and non-terminal constraints (1.4%) is more than the overall reduction (1.0%) when head words are beyond trigram range.



Contrasting the Smoothing Effect of NT Class LM vs POS Class LM

        3gram   POS     NT
PPL     79.0    75.9    75.1
WER     38.5%   38.0%   37.8%


The POS model reduces PPL by 4% and WER by 0.5%.


The overall gains from POS N-gram constraints are smaller than those from NT N-gram constraints.


Syntactic analysis seems to perform better than just using the two previous word positions.


An ME model with part-of-speech (POS) N-gram constraints is built as:

$$P(w_i \mid w_{i-1}, w_{i-2}, pos_{i-1}, pos_{i-2}) =
\frac{e^{\lambda_{w_i}} \, e^{\lambda_{(w_{i-1}, w_i)}} \, e^{\lambda_{(w_{i-2}, w_{i-1}, w_i)}} \,
      e^{\lambda_{(pos_{i-1}, w_i)}} \, e^{\lambda_{(pos_{i-2}, pos_{i-1}, w_i)}}}
     {Z(w_{i-1}, w_{i-2}, pos_{i-1}, pos_{i-2})}$$
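
A self-contained sketch of how the POS class features above are formed. A real system would run a POS tagger over the word history; the toy tag lookup here is only a stand-in for it.

```python
# Toy POS lookup in place of an actual tagger (assumption for illustration only).
TOY_TAGS = {"the": "DT", "contract": "NN", "ended": "VBD", "with": "IN"}

def pos_class_features(w, history):
    """history: the preceding words; returns the active POS class features for w."""
    pos2, pos1 = (TOY_TAGS.get(x, "UNK") for x in history[-2:])
    return [("pos-bigram", pos1, w), ("pos-trigram", pos2, pos1, w)]

print(pos_class_features("ended", ["the", "contract"]))
# [('pos-bigram', 'NN', 'ended'), ('pos-trigram', 'DT', 'NN', 'ended')]
```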


POS Class LM vs NT Class LM

WER        Inside   Outside
Trigram    37.8%    40.3%
POS        37.6%    39.2%
NT         37.2%    39.4%

[Pie charts: trigram coverage of the test set is 63.2% vs. 36.8% when the syntactic heads are inside trigram range, and 54.4% vs. 45.6% when they are outside.]


When the syntactic heads are beyond trigram range, the trigram coverage in the test set is relatively low.


The back-off effect of the POS N-gram constraints is effective in reducing WER in this case.


NT N-gram constraints work in a similar manner. Overall, they are more effective, perhaps because they are linguistically more meaningful.


Performance improves further when lexical head words are applied on top of the non-terminals.


Summary of Syntactic Language Modeling


Syntactic heads in the language model are complementary to N-grams: the model improves significantly when the syntactic heads are beyond N-gram range.


Head word constraints provide syntactic information. Non-terminals mainly provide a smoothing effect.


Non-terminals are linguistically more meaningful predictors than POS tags, and therefore are more effective in supplementing N-grams.


The syntactic model reduces perplexity by 6.3%, WER by 1.0% (absolute).



Combining Topic, Syntactic and N-gram Dependencies in an ME Framework


Probabilities are assigned as:

$$P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}, topic) \, \rho(T_i \mid W_1^{i-1})$$


Only marginal constraints are necessary.


The ME composite model is trained:

$$P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}, topic) =
\frac{e^{\lambda_{w_i}} \, e^{\lambda_{(w_{i-1}, w_i)}} \, e^{\lambda_{(w_{i-2}, w_{i-1}, w_i)}} \,
      e^{\lambda_{(h_{i-1}, w_i)}} \, e^{\lambda_{(h_{i-2}, h_{i-1}, w_i)}} \,
      e^{\lambda_{(nt_{i-1}, w_i)}} \, e^{\lambda_{(nt_{i-2}, nt_{i-1}, w_i)}} \,
      e^{\lambda_{(topic, w_i)}}}
     {Z(w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}, topic)}$$
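
A hedged sketch of what "combining in an ME framework" amounts to computationally: the feature set is simply the union of the word N-gram, head-word, non-terminal and topic features, all sharing one normalization. The function below only lists the active features (names are illustrative); the probability is then computed exactly as in the earlier exponential-form sketch.

```python
def composite_features(w, w1, w2, h1, h2, nt1, nt2, topic):
    """Active features for predicting w in the composite ME model."""
    feats = [
        ("unigram", w),
        ("bigram", w1, w),
        ("trigram", w2, w1, w),
        ("head-bigram", h1, w),
        ("head-trigram", h2, h1, w),
        ("nt-bigram", nt1, w),
        ("nt-trigram", nt2, nt1, w),
    ]
    if topic is not None:            # utterances assigned "no topic" keep only the rest
        feats.append(("topic-unigram", topic, w))
    return feats
```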


Overall Experimental Results

        3gram   Topic   Syntax  Comp
PPL     79.0    73.5    74.0    67.9
WER     38.5%   37.8%   37.5%   37.0%


Baseline trigram WER is 38.5%.


Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.


Syntactic heads result in a 6% reduction in perplexity and 1.0% absolute in WER.


Topic-dependent constraints and syntactic constraints together reduce the perplexity by 13% and WER by 1.5% absolute.


The gains from topic and syntactic dependencies are nearly additive.



Content Words vs. Stop Words


1/5 of test tokens are content-bearing words (22% content words, 78% stop words).


The topic-sensitive model reduces WER by 1.4% on content words, which is twice the overall improvement (0.7%).


The syntactic model improves WER on both content words and stop words.


The composite model has the advantages of both models and reduces WER on content words more significantly (2.1%).

WER            Trigram  Topic   Syntactic  Composite
Stop words     37.6%    36.9%   36.3%      36.2%
Content words  42.2%    40.8%   41.9%      40.1%

Head Words inside vs. outside 3gram Range


The WER of the baseline trigram model is relatively high when head words are beyond trigram range.


The topic model helps when the trigram is inappropriate.


The WER reduction for the syntactic model (1.4%) is more than the overall reduction (1.0%) when head words are outside trigram range.


The WER reduction for the composite model (2.2%) is more than the overall reduction (1.5%) when head words are outside trigram range.

WER         Inside   Outside
Trigram     37.8%    40.3%
Topic       37.3%    39.1%
Syntactic   36.9%    38.9%
Composite   36.5%    38.1%

(73% of predicted words have their head words inside trigram range, 27% outside.)

Further Insight Into the Performance

WER                       Trigram  Topic   Syntactic  Composite
Stop words, inside        37.3%    36.8%   36.3%      36.0%
Content words, inside     40.0%    39.0%   39.4%      38.1%
Stop words, outside       38.3%    37.3%   36.8%      36.4%
Content words, outside    48.6%    46.1%   47.2%      46.0%

(Share of test tokens: 57% stop words inside trigram range, 22% stop words outside, 15% content words inside, 6% content words outside.)


The composite model reduces the WER of content words by 2.6% absolute when the syntactic predicting information is beyond trigram range.

Concluding Remarks


A language model incorporating two diverse sources of long-range dependence with N-grams has been built.


The WER on content words is reduced by 2.1%, most of it due to topic dependence.


The WER on head words beyond trigram range is reduced by 2.2%, most of it due to syntactic dependence.


These two sources of non-local dependencies are complementary and their gains are almost additive.


An overall perplexity reduction of 13% and a WER reduction of 1.5% (absolute) are achieved on Switchboard.


Ongoing and Future Work



Improve the training algorithm.


Apply this method to other tasks (Broadcast News).


Acknowledgement


We thank Radu Florian and David Yarowsky for their help with topic detection and data clustering, and Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) used for the experimental results reported here.


This work is supported by the National Science Foundation under STIMULATE grant No. IRI-9618874.