
LING / C SC 439/539


Programming assignment #2

Due Sun Feb 24 at 11:59 p.m.


Sentiment analysis of movie reviews using Support Vector Machines


A. Introduction

Sentiment analysis is a subfield of NLP concerned with determining opinion and subjectivity in text. It has applications in the analysis of online product reviews, recommendations, blogs, and other types of opinionated documents.

In this assignment you will develop classifiers for sentiment analysis of movie reviews using Support Vector Machines (SVMs), in the manner of Pang, Lee, and Vaithyanathan [1], the first research paper on this topic. The goal is to develop a classifier that performs sentiment analysis, assigning a movie review a label of "positive" or "negative" that predicts whether the author of the review liked the movie or disliked it.

You may use programming and scripting languages of your choice for this assignment, but for the machine learning you must use SVMlight (section D).


B. Data

The data (available on the course web page) consists of 1,000 positive and 1,000 negative reviews. These have been divided into training, validation, and test sets of 800, 100, and 100 reviews, respectively. In order to encourage you not to optimize against the testing set while developing your classifiers, the testing data will not be immediately available.

The reviews were obtained from Pang's website [2], and then part-of-speech tagged using NLTK, the Natural Language Toolkit [3, 4].

Each document is formatted as one sentence per line. Each token is of the format word/POSTAG, where a "word" also includes punctuation. Each word is in lowercase. There is sometimes more than one slash in a token, such as in writer/director/NN.


C. Baseline system

For a baseline system, think of 20 words that you think would be indicative of a positive movie review, and 20 words that you think would be indicative of a negative review.

To develop the baseline classifier, take this approach: given a movie review, count how many times it contains either a positive word or a negative word (token occurrences). Assign the label POSITIVE if the review contains more positive words than negative words. Assign the label NEGATIVE if it contains more negative words than positive words. If there are an equal number of positive and negative words, it is a TIE.

Mostly-complete Python code is provided for the baseline classifier (sentiment-baseline.py).
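As an illustration of the counting rules above (this is a sketch, not the provided sentiment-baseline.py; the word lists here are placeholders for your own 20-word lists):

```python
# Placeholder word lists -- replace with your own 20 positive and
# 20 negative words.
POSITIVE_WORDS = {"great", "excellent", "wonderful"}
NEGATIVE_WORDS = {"awful", "boring", "terrible"}

def classify_baseline(review_text):
    """Label a tagged review POSITIVE, NEGATIVE, or TIE by word counts."""
    pos = neg = 0
    for line in review_text.splitlines():
        for token in line.split():
            # Tokens look like word/POSTAG; split on the last slash
            # so tokens such as writer/director/NN keep the full word.
            word = token.rsplit("/", 1)[0]
            if word in POSITIVE_WORDS:
                pos += 1
            elif word in NEGATIVE_WORDS:
                neg += 1
    if pos > neg:
        return "POSITIVE"
    if neg > pos:
        return "NEGATIVE"
    return "TIE"
```

Note that ties are a real outcome here and must be reported separately in the evaluation (section F).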


D. Machine learning

The machine learning software to be used is SVMlight [5], which learns Support Vector Machines for binary classification. It is available for Unix systems, Windows, and Mac OS X.

You will need to read the documentation on the SVMlight website in order to figure out how to use the software. It may be helpful to first create a small "toy" dataset by hand, and then train and test the SVM on it.

When training the classifier, select the option for classification:

    -z {c,r,p}  - select between classification (c), regression (r), and
                  preference ranking (p)


A training file is of the format:

    <line>    .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
    <target>  .=. +1 | -1 | 0 | <float>
    <feature> .=. <integer> | "qid"
    <value>   .=. <float>
    <info>    .=. <string>


Since we are doing binary classification, the value of <target> should be +1 or -1.

Every feature (expressed as an integer) is associated with a value, which is a floating-point number. If you want a feature to be binary-valued, you may use values of 0.0 and 1.0.

With binary features, it is not necessary to include an explicit representation of features that do not occur. For example, suppose a document contains 100 different words out of a vocabulary of 50,000 possible words. If you are using binary features, it suffices to include a feature with a value of 1.0 for each of the words that do occur. You do not have to include a feature with a value of 0.0 for each of the 49,900 words that do not appear in the document.

You do not need to perform smoothing.
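To make the format concrete, here is a minimal sketch of producing one training line with binary unigram features. The mapping feature_ids (word to integer feature number) is an assumption of this sketch; however you build yours, SVMlight expects the feature numbers on each line in increasing order:

```python
def svmlight_line(label, words, feature_ids):
    """Format one document as an SVMlight training line.

    label is +1 or -1; feature_ids is an assumed dict mapping each
    vocabulary word to its integer feature number. Only features
    that occur are written, each with value 1.0, in increasing
    order of feature number.
    """
    ids = sorted({feature_ids[w] for w in words if w in feature_ids})
    feats = " ".join("%d:1.0" % i for i in ids)
    return "%+d %s" % (label, feats)
```

For example, a positive review containing only the words mapped to features 1 and 2 would be written as the line "+1 1:1.0 2:1.0".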

E. Feature sets

Use these feature sets for training and testing your classifier:

1. unigrams
2. bigrams
3. unigrams + POS
4. adjectives
5. top unigrams
6. (539 only) optimized

Detailed explanation:

1. unigrams: use the word unigrams that occurred >= 4 times in the training data. Let this quantity be N.

2. bigrams: use the N most-frequent bigrams.

3. unigrams + POS: use all combinations of word/POS for each of the unigrams in (1). Since a word may occur with multiple POS tags, the quantity of this type of feature will be greater than N.

4. adjectives: use the adjectives that occurred >= 4 times. Let this quantity be M. Adjectives have the POS tag JJ.

5. top unigrams: use the M most-frequent word unigrams.

6. (539 only) optimized: try to produce the best classifier possible. For example, you could try new features, such as adverb + adjective bigrams (to pick up on features such as "very/RB good/JJ"), or you could choose different cutoff values for the frequencies of different types of features. You could also try different settings for training the SVM. The optimized classifier should be produced through a process of repeatedly training the classifier and computing its performance on the validation set.

To verify whether you are computing these features correctly, you may inspect the code in sentiment-feature-count.py.
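A sketch of how feature sets (1) and (2) might be collected, assuming the training documents have already been read in as lists of word/POSTAG tokens (the variable names here are illustrative, not from sentiment-feature-count.py):

```python
from collections import Counter

def words(doc):
    # Strip the POS tag; split on the last slash so tokens such as
    # writer/director/NN keep the full word.
    return [t.rsplit("/", 1)[0] for t in doc]

def frequent_unigrams(tagged_docs, min_count=4):
    """Feature set (1): word unigrams occurring >= min_count times."""
    counts = Counter(w for doc in tagged_docs for w in words(doc))
    return {w for w, c in counts.items() if c >= min_count}

def top_bigrams(tagged_docs, n):
    """Feature set (2): the n most-frequent word bigrams."""
    counts = Counter()
    for doc in tagged_docs:
        ws = words(doc)
        counts.update(zip(ws, ws[1:]))
    return [bg for bg, _ in counts.most_common(n)]
```

For set (2), N would be the size of the set returned by frequent_unigrams; the other feature sets follow the same counting pattern with different filters.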


F. Evaluation

Train the SVMs on the training data and perform preliminary tests on the validation data. To evaluate your classifiers, compute the accuracy rate on the testing data, which is the percentage of movie reviews correctly classified. For the baseline classifier, also compute the number of ties.

When the testing data is released, re-train your classifiers on both the training and validation sets, and evaluate on the testing set. Do not further optimize your system in order to improve performance on the testing set.
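Computing the accuracy rate is straightforward; one way to do it, assuming you have read the gold labels and SVMlight's real-valued predictions (positive output means class +1, negative output means class -1) into parallel lists:

```python
def accuracy(gold_labels, prediction_values):
    """Percentage of documents whose predicted sign matches the gold label.

    gold_labels holds +1/-1 labels; prediction_values holds the
    real-valued outputs from SVMlight's classification step.
    """
    correct = sum(1 for g, p in zip(gold_labels, prediction_values)
                  if (p > 0) == (g > 0))
    return 100.0 * correct / len(gold_labels)
```

This is only the sign-agreement check; the file handling (reading the predictions file SVMlight writes) is up to you.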

G. Analysis

Produce a document that states:

- A list of the positive and negative words chosen for your baseline system

- Performance of the baseline system on the test set

- A table listing the number of distinct features for each feature set. Since the split of the data into training and testing is not exactly the same as Pang et al.'s, the quantity of different features will not be identical.

- A table of performance of the classifiers on the validation set and test set

- A written comparison of your results with Pang et al.'s (minimum 5 lines)

- (439 only) Examine the misclassified reviews. Identify 5 different characteristics of misclassified reviews, and show an excerpt from a review for each characteristic.

- (539 only) Produce a table listing the 50 most-frequently misclassified reviews (across all 6 classifiers) in the validation set, and the number of classifiers by which they were misclassified. For example, the review cv808_12635.txt might have been misclassified by 4 classifiers. List 5 different characteristics of the frequently misclassified reviews, showing excerpts from 2 reviews for each characteristic. For each of these characteristics, describe a possible feature function that could be added to improve performance. Do you think that these additional feature functions would lead to a substantial improvement in performance? Explain.


H. Submission

Put all files (listed below) in a compressed directory. Name the file LASTNAME_FIRSTNAME.<extension>, and e-mail it to echan3@email.arizona.edu.

- All source code

- One example of a feature file that you produced

- Your written document

- Any additional files that you would like to attach

I. References

[1] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP 2002.

[2] http://www.cs.cornell.edu/people/pabo/movie-review-data/

[3] http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html

[4] http://nltk.org/

[5] http://svmlight.joachims.org/