Natural Language Processing Assignment – Final Presentation

Varun Suprashanth, 09005063
Tarun Gujjula, 09005068
Asok Ramachandran, 09005072
Part 1: POS Tagger

Tasks Completed
• Implementation of Viterbi – unigram, bigram.
• Five-fold evaluation.
• Per-POS accuracy.
• Confusion matrix.
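The bigram Viterbi step can be sketched as below. This is a minimal illustration, not the assignment's actual code: the `trans`/`emit` lookup tables, the `'^'` start marker, and the toy probabilities in the usage are all assumptions about how the estimated probabilities might be stored.

```python
from collections import defaultdict

def viterbi_bigram(words, tags, trans, emit):
    """Most likely tag sequence under a bigram HMM.

    trans[(t_prev, t)] and emit[(t, w)] hold probabilities estimated
    from the annotated corpus; '^' marks the sentence start.
    """
    # initialise with transitions out of the start state
    V = [{t: trans[('^', t)] * emit[(t, words[0])] for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        V.append({})
        back.append({})
        for t in tags:
            # best previous tag for reaching tag t at position i
            p = max(tags, key=lambda q: V[i - 1][q] * trans[(q, t)])
            V[i][t] = V[i - 1][p] * trans[(p, t)] * emit[(t, w)]
            back[i][t] = p
    # recover the path by following back-pointers from the best final tag
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

Using `defaultdict(float)` for the tables lets unseen transitions and emissions fall through to probability 0 without special-casing.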
Per-POS Accuracy for the Bigram Assumption
[Bar chart: per-POS accuracy, y-axis 0–1.2]
Screenshot of Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0        2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0      31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1     161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD       7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG       8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN       8        0        0        3        0        2    0    0     1     0        0    0
AJC           2        0        0        0        0        0   69    0     0    11        0    0
AJS           6        0        0        0        0        0    0   38     0     2        0    0
AT0         192        0        0        0        0        0    0    0  7000    13        0    0
AV0         120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0      10        7        0        0        0        0    0    0     0    16       33    0
AVP          24        0        0        0        0        0    0    0     1    11        0  737
Part 2: Discriminative vs. Generative
Problem Statement
• Generate unigram parameters of P(t_i|w_i); the annotated corpus is already available.
• Compute the argmax of P(T|W) directly; do not invert through Bayes' theorem.
• Compare the unigram-based performance of (2) with the HMM-based system.
Tasks Completed
• Generated unigram parameters of P(t_i|w_i).
• Computed the argmax of P(T|W).
• Compared the unigram-based performance of the HMM-based system with the above.
• The generative model produced better results on ambiguous sentences.
Discriminative
• argmax_T P(T|W) = argmax_T P(T_1..N | W_1..N)
  = argmax_T1 P(T_1 | W_1..N) · argmax_T2 P(T_2 | W_1..N) · … · argmax_TN P(T_N | W_1..N)
• Assuming each word-tag pair to be independent,
• argmax_T P(T|W) = ∏_i argmax_Ti P(T_i | W_i)
• Precision: 0.896788
• F-measure: 0.896788
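A minimal sketch of the discriminative unigram tagger described above, assuming the annotated corpus is available as (word, tag) pairs. The function names and the `'NN1'` fallback for unseen words are illustrative assumptions, not part of the original system.

```python
from collections import Counter, defaultdict

def train_discriminative(tagged_corpus):
    """Estimate the argmax tag of P(t|w) per word, with no Bayes inversion."""
    counts = defaultdict(Counter)            # counts[word][tag]
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # under the independence assumption the sentence-level argmax
    # decomposes into a per-word argmax over tags
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_sentence(words, best_tag, default='NN1'):
    # each word is tagged on its own, per the unigram assumption
    return [best_tag.get(w, default) for w in words]
```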
Per-PoS Accuracy
[Bar chart: per-PoS accuracy (0–1.2) across all tags, AJ0 through VVZ-NN2]
Generative
• argmax_T P(T|W) = argmax_T P(W|T) · P(T)
• Assuming the unigram assumption and word-tag pairs to be independent,
• argmax_T P(T|W) = argmax_T ∏_i P(W_i | T_i) · P(T_i)
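The generative scoring can be sketched as follows, again assuming a (word, tag) pair corpus; the relative-frequency estimates and the `train_generative` name are illustrative, not the assignment's actual implementation.

```python
from collections import Counter

def train_generative(tagged_corpus):
    """Return a function scoring argmax_t P(w|t) * P(t) from raw counts."""
    tag_counts = Counter(t for _, t in tagged_corpus)
    pair_counts = Counter(tagged_corpus)
    total = sum(tag_counts.values())
    tags = list(tag_counts)

    def best_tag(word):
        # P(w|t) = c(w,t)/c(t) and P(t) = c(t)/total, per the unigram assumption
        return max(tags, key=lambda t: (pair_counts[(word, t)] / tag_counts[t])
                                       * (tag_counts[t] / total))
    return best_tag
```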
Part 3: Analysis of Corpora Using Word Prediction
Tasks Completed
• Predicted the next word on the basis of patterns occurring in both corpora.
• The first corpus had untagged-word sentences; the second had tagged-word sentences.
• The corpus with tagged words gives better results for word prediction.
Untagged Corpus
• argmax_w P(w | W_1..N) = argmax_w c(W_1..N · w) / c(W_1..N)
• where c(·) is the count.
• By the bigram assumption,
• argmax_w P(w | W_1..N) = argmax_w c(W_N · w) / c(W_N)
• By the trigram assumption,
• argmax_w P(w | W_1..N) = argmax_w c(W_{N−1} · W_N · w) / c(W_{N−1} · W_N)
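The untagged bigram case can be sketched as below, assuming the corpus is a flat token list. Note that the denominator c(W_N) is constant over the candidate words, so the raw bigram counts alone decide the argmax.

```python
from collections import Counter

def predict_next(history, corpus_tokens):
    """Bigram next-word prediction: argmax_w c(W_N . w) / c(W_N)."""
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    last = history[-1]
    # the denominator c(W_N) is the same for every candidate w,
    # so the bigram counts already determine the argmax
    candidates = {w: n for (prev, w), n in bigram_counts.items() if prev == last}
    return max(candidates, key=candidates.get) if candidates else None
```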
Tagged Corpus
• argmax_w P(w | W_1..N, T_1..N) = argmax_w Σ_i c(<w_1,T_1>, <w_2,T_2>, … <w_N,T_N>, <w,T_i>) / c(<w_1,T_1>, <w_2,T_2>, … <w_N,T_N>)
• Using the bigram assumption,
• argmax_w P(w | W_1..N, T_1..N) = argmax_w Σ_i c(<w_N,T_N>, <w,T_i>) / c(<w_N,T_N>)
• Using the trigram assumption,
• argmax_w P(w | W_1..N, T_1..N) = argmax_w Σ_i c(<w_{N−1},T_{N−1}>, <w_N,T_N>, <w,T_i>) / c(<w_{N−1},T_{N−1}>, <w_N,T_N>)
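The tagged-corpus bigram case can be sketched like this, assuming the corpus is a list of (word, tag) pairs; summing a candidate word's counts over every tag it follows the context with mirrors the sum over i. Function and variable names are illustrative.

```python
from collections import Counter, defaultdict

def predict_next_tagged(context_pair, tagged_tokens):
    """Bigram prediction over a tagged corpus: counts of each candidate
    word are summed over every tag it appears with after <w_N, T_N>."""
    bigram_counts = Counter(zip(tagged_tokens, tagged_tokens[1:]))
    scores = defaultdict(int)
    for (prev, (word, tag)), n in bigram_counts.items():
        if prev == context_pair:
            scores[word] += n        # sum over the tags T_i of the next word
    return max(scores, key=scores.get) if scores else None
```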
Examples
• Example 1:
  Tagged context: TO0_to VBI_be CJC_or XX0_not TO0_to → predicted: VBI_be
  Untagged context: to be or not to → predicted: The
• Example 2:
  Tagged context: AJ0_complete CJC_and AJ0_utter → predicted: NN1_contempt
  Untagged context: complete and utter → predicted: Loud
Examples (contd.)
• Example 3:
  Tagged context: PNQ_who VBZ_is DPS_your AJ0-NN1_favourite → predicted: NN1_gardening
  Untagged context: who is your favourite → predicted: is
Results
• Raw-text LM: word prediction accuracy 13.21%
• POS-tagged text LM: word prediction accuracy 15.53%
Part 4: A* Implementation
Problem Statement
• The goal is to see which algorithm is better for POS tagging: Viterbi or A*.
• Look upon the columns of POS tags above the words as forming the state-space graph.
• The start state S is '^' and the goal state G is '$'.
• Your job is to come up with a good heuristic. One possibility is that the heuristic value h(N), where N is a node on a word W, is the product of the distance of W from '$' and the least arc cost in the state-space graph.
• g(N) is the cost of the best path found so far from '^' to N.
• Run A* with this heuristic and see the result.
• Compare the result with Viterbi.
A* Implementation
• Precision: 0.937254
• F-measure: 0.937254
[Bar chart: per-PoS accuracy (0–1.2) across all tags, AJ0 through VVZ-NN2]
Screenshot of Confusion Matrix
[Confusion-matrix screenshot with per-tag counts for the A* tagger]
Heuristics
• h = g × (N − n) / n
• where N is the length of the sentence, and n is the index of the current word in the sentence.
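A* with this heuristic can be sketched over the tag lattice as below. Using negative log probabilities as arc costs, and the toy cost table in the test, are assumptions for illustration, not the assignment's actual model.

```python
import heapq

def astar_tag(words, tags, cost):
    """A* over the tag lattice. cost(prev_tag, tag, word) is an arc cost,
    e.g. -log of the transition-times-emission probability. The heuristic
    from the slide, h = g * (N - n) / n, assumes the remaining words cost
    roughly as much per word as the path so far."""
    N = len(words)
    # frontier entries: (f = g + h, g, position n, tag, path so far)
    frontier = [(0.0, 0.0, 0, '^', [])]
    while frontier:
        f, g, n, t, path = heapq.heappop(frontier)
        if n == N:
            return path               # first completed path popped wins
        for t2 in tags:
            g2 = g + cost(t, t2, words[n])
            h2 = g2 * (N - (n + 1)) / (n + 1)
            heapq.heappush(frontier, (g2 + h2, g2, n + 1, t2, path + [t2]))
```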
A* vs. Viterbi
Part 5: YAGO

Problem Statement
• Take as input two words and show a path between them, listing all the concepts encountered on the way.
• For example, on the path from 'bulldog' to 'cheshire cat', one would presumably encounter 'bulldog - dog - mammal - cat - cheshire cat'. Similarly for 'VVS Laxman' and 'Hyderabad', or 'Tendulkar' and 'Tennis' (you will be surprised!).
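The path search itself can be sketched as a breadth-first search over the concept graph; the toy adjacency mapping in the test stands in for YAGO's actual subclass/type edges and is purely illustrative.

```python
from collections import deque

def concept_path(graph, source, target):
    """Shortest path between two concepts in an adjacency mapping."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:            # rebuild the path from the parents
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in graph.get(node, ()):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None                       # no path between the two concepts
```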
Part 6: Parser Projection

Example
• English: Dhoni is the captain of India.
• Hindi: dhoni bhaarat ke kaptaan hai.
• Hindi parse:
  [ [ [dhoni] NN ] NP [ [ [ [ [bhaarat] NNP ] NP [ke] P ] PP [kaptaan] NN ] NP [hai] VBZ ] VP ] S
• English parse:
  [ [ [Dhoni] NN ] NP [ [is] VBZ [ [the] ART [captain] NN ] NP [ [of] P [ [India] NNP ] NP ] PP ] VP ] S
Problems and Conclusions
• Many idioms in English are translated literally, even though they mean something else,
• e.g. phrases like "break a leg", "He lost his head", "French kiss", "Flip the bird".
• Noise because of misalignments.
Natural Language Toolkit
• The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
• NLTK includes graphical demonstrations and sample data.
• It is accompanied by extensive documentation, including a book that explains the underlying concepts behind the language processing tasks supported by the toolkit.
• It provides lexical resources such as WordNet.
• It has a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.