Data Mining Final Project
Nick Foti
Eric Kee
Topic: Author Identification
•
Author Identification
–
Given writing samples, can we determine who wrote them?
–
This is a well-studied field
•
See also: “stylometry”
–
This has been applied to works such as
•
The Bible
•
Shakespeare
•
Modern texts as well
Corpus Design
•
A corpus is:
–
A body of text used for linguistic analysis
•
Used Project Gutenberg to create corpus
•
The corpus was designed as follows
–
Four authors of varying similarity
•
Anne Brontë
•
Charlotte Brontë
•
Charles Dickens
•
Upton Sinclair
–
Multiple books per author
•
Corpus size: 90,000 lines of text
Dataset Design
•
Extracted features common in literature
–
Word Length
–
Frequency of “glue” words
•
See Appendix A and [1,2] for list of glue words
•
Note: corpus was processed using
–
C#, Matlab, Python
•
Data set parameters are
–
Number of dimensions: 309
•
Word length and 308 glue words
–
Number of observations: ≈ 3,000
•
Each observation ≈ 30 lines of text from a book
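The feature extraction described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the tokenizer, the use of relative frequencies rather than raw counts, and the names `GLUE_WORDS`, `extract_features`, and `observations` are all assumptions, and the glue-word list is abbreviated to five words instead of the 308 in Appendix A.

```python
import re

# Hypothetical, abbreviated glue-word list; the project used 308 words (Appendix A).
GLUE_WORDS = ["the", "and", "of", "to", "but"]

# Assumed tokenizer: runs of letters and apostrophes, lowercased.
WORD_RE = re.compile(r"[a-z']+")

def extract_features(lines, glue_words=GLUE_WORDS):
    """Return [average word length, glue-word relative frequencies...] for one observation."""
    words = WORD_RE.findall(" ".join(lines).lower())
    if not words:
        return [0.0] * (1 + len(glue_words))
    avg_len = sum(len(w) for w in words) / len(words)
    counts = {g: 0 for g in glue_words}
    for w in words:
        if w in counts:
            counts[w] += 1
    return [avg_len] + [counts[g] / len(words) for g in glue_words]

def observations(book_lines, chunk=30):
    """Split a book into consecutive chunks of `chunk` lines, one observation each."""
    return [book_lines[i:i + chunk] for i in range(0, len(book_lines), chunk)]
```

With the full glue-word list, each observation would become a 309-dimensional vector: one word-length value plus 308 frequencies.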
Classifier Testing and Analysis
•
Tested classifier with test data
–
Used testing and training data sets
•
70% for training, 30% for testing
–
Used cross-validation
•
Analyzed Classifier Performance
–
Used ROC plots
–
Used confusion matrices
•
Used common plotting scheme (right)
[Figure: example of the plotting scheme. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP; values shown: 78%, 22%, 55%, 45%. Red dots indicate true-positive cases.]
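A 70/30 split and a confusion matrix could be computed along the following lines. This is a sketch under assumptions: the slides do not say how the split was randomized or how the plotted percentages were normalized, so the shuffling, the row-wise normalization, and the function names here are illustrative only.

```python
import random

def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and split it into (training, testing) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(true_labels, predicted_labels, classes):
    """Count observations by (true class, predicted class)."""
    matrix = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def row_rates(matrix):
    """Convert counts to fractions of each true class (assumed normalization)."""
    rates = {}
    for t, row in matrix.items():
        total = sum(row.values())
        rates[t] = {p: (n / total if total else 0.0) for p, n in row.items()}
    return rates
```

Repeating the split with different seeds and averaging the rates gives a simple form of cross-validation.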
Binary Classification
Word Length Classification
•
Calculated average word length for each observation
•
Computed Gaussian kernel density from word length samples
•
Used ROC curve to calculate cutoff
–
Optimized sensitivity and specificity with equal importance
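The density estimate and cutoff selection can be sketched as below. The bandwidth, the convention that values at or above the cutoff are predicted "positive", and the function names are assumptions. Weighting sensitivity and specificity equally is implemented here by maximizing their sum (Youden's J statistic), which is one common reading of that criterion.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def best_cutoff(positives, negatives):
    """Scan observed values as cutoffs; predict positive when value >= cutoff.

    Returns (cutoff, sensitivity, specificity) maximizing sensitivity + specificity.
    """
    best = None
    for c in sorted(set(positives) | set(negatives)):
        sens = sum(v >= c for v in positives) / len(positives)
        spec = sum(v < c for v in negatives) / len(negatives)
        if best is None or sens + spec > best[0]:
            best = (sens + spec, c, sens, spec)
    return best[1], best[2], best[3]
```

For example, `best_cutoff(long_word_author, short_word_author)` would be called with the average word lengths of each author's training observations; which author has the longer words is not stated in the slides.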
Word Length: Anne B. vs Upton S.
[Figure: classification results. Legend: Anne Brontë, Charlotte Brontë. Panels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP; values shown: 100%, 0%, 100%, 0%.]
Word Length: Brontë vs. Brontë
[Figure: classification results. Legend: Anne Brontë, Charlotte Brontë. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP; values shown: 100%, 0%, 78.1%, 21.9%.]
Principal Component Analysis
•
Used PCA to find a better axis
•
Notice: distribution similar to word length distribution
•
Is word length the only useful dimension?
[Figure: Anne Brontë vs. Upton Sinclair. Panel titles: Word Length Density; PCA Density; Without word length.]
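One way to ask whether word length dominates the principal axis is to compute the first principal component with and without that column and compare the projected densities. The sketch below uses power iteration on the sample covariance matrix. It is an illustrative stand-in, not the project's code, and the function names are assumptions.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, projected scores) for the top principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration; the uneven start vector avoids the most obvious symmetric ties.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate data: no variance along the current direction
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

def drop_column(rows, index):
    """Remove one feature (for example, word length) from every observation."""
    return [r[:index] + r[index + 1:] for r in rows]
```

Running `first_principal_component(drop_column(rows, 0))` would repeat the analysis without word length, assuming word length is stored in column 0.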
Principal Component Analysis
•
It appears that word length is the most useful axis
•
We’ll come back to this…
[Figure: Anne Brontë vs. Upton Sinclair, PCA Density.]
K-Means
•
Used K-means to find dominant patterns
–
Unnormalized
–
Normalized
•
Trained K-means on training set
•
To classify observations in test set
–
Calculate distance of observation to each class mean
–
Assign observation to the closest class
•
Performed cross-validation to estimate performance
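The classification step above (distance to each class mean, then assign to the closest) can be sketched as follows. Two assumptions are worth flagging: the class means are computed directly from labeled training rows, since the slides do not say how K-means centroids were matched to authors, and "normalized" is read as z-scoring each column. All function names are illustrative.

```python
import math

def class_means(rows, labels):
    """Mean feature vector of the training rows for each class label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, x in enumerate(row):
            acc[j] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def nearest_class(row, means):
    """Assign the row to the class whose mean is closest in Euclidean distance."""
    def distance(label):
        return math.sqrt(sum((x - m) ** 2 for x, m in zip(row, means[label])))
    return min(means, key=distance)

def column_stats(rows):
    """Per-column mean and standard deviation, used for the normalized variant."""
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / len(rows)
        stds.append(math.sqrt(var) or 1.0)  # constant column: leave unscaled
    return means, stds

def normalize(row, means, stds):
    """Z-score one observation using training-set statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]
```

In the normalized variant, `column_stats` would be computed on the training set only and applied to both training and test rows before calling `class_means` and `nearest_class`.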
Unnormalized K-means
[Figure: Anne Brontë vs. Upton Sinclair. Panels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP; values shown: 98.1%, 1.9%, 92.1%, 7.9%.]
Unnormalized K-means
[Figure: Anne Brontë vs. Charlotte Brontë. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP; values shown: 95.7%, 4.3%, 74.7%, 25.3%.]
Normalized K-means
[Figure: Anne Brontë vs. Upton Sinclair. Panels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP; values shown: 53.3%, 46.7%, 49.4%, 50.6%.]
Normalized K-means
[Figure: Anne Brontë vs. Charlotte Brontë. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP; values shown: 15.8%, 84.2%, 86.7%, 13.3%.]
Discriminant Analysis
•
Performed discriminant analysis
–
Computed with equal covariance matrices
•
Used average Omega of class pairs
–
Computed with unequal covariance matrices
•
Quadratic discrimination fails because covariance matrices have 0 determinant (see equation below)
–
Computed theoretical misclassification probability
•
To perform quadratic discriminant analysis
–
Compute Equation 1 for each class
–
Choose class with minimum value
(1)
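Equation 1 did not survive in the text, so the sketch below assumes the standard quadratic discriminant score, (x - mean)' inverse(cov) (x - mean) + ln|cov| - 2 ln(prior), with the class of minimum score chosen. It also shows why a covariance matrix with zero determinant breaks the computation: the inverse and the log-determinant are both undefined. The function names are illustrative, not the project's.

```python
import math

def solve_and_logdet(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting.

    Returns (x, log|det(matrix)|); raises ValueError if the matrix is singular.
    """
    n = len(matrix)
    a = [list(row) + [vector[i]] for i, row in enumerate(matrix)]
    logdet = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        logdet += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x, logdet

def quadratic_score(x, mean, cov, prior):
    """Assumed form of Equation 1: Mahalanobis term + ln|cov| - 2 ln(prior)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    solved, logdet = solve_and_logdet(cov, diff)
    mahalanobis = sum(d * s for d, s in zip(diff, solved))
    return mahalanobis + logdet - 2.0 * math.log(prior)

def classify(x, classes):
    """classes maps a class name to (mean, cov, prior); pick the minimum score."""
    return min(classes, key=lambda name: quadratic_score(x, *classes[name]))
```

Passing the same pooled covariance matrix for every class turns this into the equal-covariance case, since the ln|cov| term then no longer affects the comparison.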
Discriminant Analysis
[Figure: Anne Brontë vs. Upton Sinclair. Panels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP; values shown: 92.2%, 3.8%, 96.2%, 7.8%.]
Theoretical P(err) = 0.149
Empirical P(err) = 0.116
Discriminant Analysis
[Figure: Anne Brontë vs. Charlotte Brontë. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP; values shown: 92.7%, 7.3%, 89.2%, 10.8%.]
Theoretical P(err) = 0.181
Empirical P(err) = 0.152
Logistic Regression
•
Fit linear model to training data on all dimensions
•
Threw out singular dimensions
–
Left with ≈ 298 coefficients + intercept
•
Projected training data onto synthetic variable
–
Found threshold by minimizing misclassification error
•
Projected testing data onto synthetic variable
–
Used threshold to classify points
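The fit, project, threshold sequence above can be sketched as below. The project used R, so this pure-Python gradient-descent fit is only a stand-in. The learning rate, epoch count, midpoint threshold candidates, and function names are assumptions, and the "synthetic variable" is read as the linear predictor.

```python
import math

def fit_logistic(rows, labels, rate=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    d = len(rows[0])
    weights, intercept = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, y in zip(rows, labels):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, x in enumerate(row):
                grad_w[j] += (p - y) * x
            grad_b += p - y
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad_w)]
        intercept -= rate * grad_b / len(rows)
    return weights, intercept

def project(row, weights, intercept):
    """Project an observation onto the synthetic variable (linear predictor)."""
    return intercept + sum(w * x for w, x in zip(weights, row))

def best_threshold(scores, labels):
    """Pick the midpoint threshold with the fewest training misclassifications."""
    ordered = sorted(set(scores))
    candidates = [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])] or ordered
    def errors(t):
        return sum((s > t) != bool(y) for s, y in zip(scores, labels))
    return min(candidates, key=errors)
```

Test observations are then classified by comparing `project(row, weights, intercept)` against the threshold found on the training set.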
Logistic Regression
[Figure: Anne Brontë vs Charlotte Brontë. Panels: Anne B, Charlotte B; values shown: 89.5%, 10.5%, 8%, 92%.]
Logistic Regression
[Figure: Anne Brontë vs Upton Sinclair. Panels: Anne B TP, Anne B FP, Upton S TP, Upton S FP; values shown: 98%, 2%, 99%, 2%.]
4-Class Classification
4-Class K-means
•
Used K-means to find patterns among all classes
–
Unnormalized
–
Normalized
•
Trained using a training set
•
Tested performance as in 2-class K-means
•
Performed cross-validation to estimate performance
Unnormalized K-Means
[Figure: 4-class confusion matrix. Classes: Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US); cells labeled TP or FP per class. Values shown: 22%, 54%, 87%, 34%, 59%, 88%.]
Normalized K-Means
[Figure: 4-class confusion matrix. Classes: Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US); cells labeled TP or FP per class. Values shown: 20%, 67%, 26%, 67%, 70%, 67%, 27%.]
Additional K-means testing
•
Also tested K-means without word length
–
Recall that we had perfect classification with 1D word length (see plot below)
–
Is K-means using only 1 dimension to classify?
Note: perfect classification only occurs between Anne B. and Sinclair
[Figure: Anne Brontë vs. Upton Sinclair.]
Unnormalized K-Means (No Word Length)
[Figure: 4-class confusion matrix. Classes: Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US); cells labeled TP or FP per class. Values shown: 35%, 29%, 44%, 33%, 35%, 43%, 72%.]
•
K-means can classify without word length
Multinomial Regression
•
Multinomial distribution
–
Extension of binomial distribution
•
Random variable is allowed to take on n values
•
Used multinom(…) to fit log-linear model for training
–
Used 248 dimensions (max limit on computer)
–
Returns 3 coefficients per dimension and 3 intercepts
•
Found probability that each observation belongs to each class
Multinomial Regression
•
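Multinomial Logit Function is (standard form, with one reference class)
$\Pr(y_i = j) = \dfrac{\exp(c_j + \beta_j^{\top} x_i)}{1 + \sum_{k} \exp(c_k + \beta_k^{\top} x_i)}$
where $\beta_j$ are the coefficients and $c_j$ are the intercepts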
•
To classify
–
Compute probabilities
•
Pr(y_i = Dickens), Pr(y_i = Anne B.), …
–
Choose class with maximum probability
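The classification rule can be sketched as follows, assuming the usual reference-class parameterization (three coefficient vectors and three intercepts for four authors, with the fourth author as the baseline). Which author served as the baseline is not stated in the slides, and the function names and example coefficients are illustrative.

```python
import math

def class_probabilities(x, coefficients, intercepts, baseline):
    """Multinomial logit probabilities.

    coefficients and intercepts are dicts keyed by non-baseline class name;
    the baseline class has a linear predictor fixed at 0.
    """
    logits = {baseline: 0.0}
    for name, beta in coefficients.items():
        logits[name] = intercepts[name] + sum(b * v for b, v in zip(beta, x))
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(z - top) for name, z in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def classify(x, coefficients, intercepts, baseline):
    """Choose the class with maximum probability."""
    probs = class_probabilities(x, coefficients, intercepts, baseline)
    return max(probs, key=probs.get)
```

Subtracting the largest linear predictor before exponentiating leaves the probabilities unchanged but avoids overflow when many dimensions are used.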
Multinomial Regression
[Figure: 4-class confusion matrix. Classes: Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US); cells labeled TP or FP per class. Values shown: 78%, 86%, 83%, 93%.]
Multinomial Regression
[Figure: 4-class confusion matrix (without word length). Classes: Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US); cells labeled TP or FP per class. Values shown: 76%, 79%, 79%, 91%.]
•
Multinomial regression does not require word length
Appendix A: Glue Words
I
a
aboard
about
above
across
after
again
against
ago
ahead
all
almost
along
alongside
already
also
although
always
am
amid
amidst
among
amongst
an
and
another
any
anybody
anyone
anything
anywhere
apart
are
aren't
around
as
aside
at
away
back
backward
backwards
be
because
been
before
beforehand
behind
being
below
between
beyond
both
but
by
can
can't
cannot
could
couldn't
dare
daren't
despite
did
didn't
directly
do
does
doesn't
doing
don't
done
down
during
each
either
else
elsewhere
enough
even
ever
evermore
every
everybody
everyone
everything
everywhere
except
fairly
farther
few
fewer
for
forever
forward
from
further
furthermore
had
hadn't
half
hardly
has
hasn't
have
haven't
having
he
hence
her
here
hers
herself
him
himself
his
how
however
if
in
indeed
inner
inside
instead
into
is
isn't
it
its
itself
just
keep
kept
later
least
less
lest
like
likewise
little
low
lower
many
may
mayn't
me
might
mightn't
mine
minus
more
moreover
most
much
must
mustn't
my
myself
near
need
needn't
neither
never
nevertheless
next
no
no-one
nobody
none
nor
not
nothing
notwithstanding
now
nowhere
of
off
often
on
once
one
ones
only
onto
opposite
or
other
others
otherwise
ought
oughtn't
our
ours
ourselves
out
outside
over
own
past
per
perhaps
please
plus
provided
quite
rather
really
round
same
self
selves
several
shall
shan't
she
should
shouldn't
since
so
some
somebody
someday
someone
something
sometimes
somewhat
still
such
than
that
the
their
theirs
them
themselves
then
there
therefore
these
they
thing
things
this
those
though
through
throughout
thus
till
to
together
too
towards
under
underneath
undoing
unless
unlike
until
up
upon
upwards
us
versus
very
via
was
wasn't
way
we
well
were
weren't
what
whatever
when
whence
whenever
where
whereas
whereby
wherein
wherever
whether
which
whichever
while
whilst
whither
who
whoever
whom
with
whose
within
why
without
will
won't
would
wouldn't
yet
you
your
yours
yourself
yourselves
Conclusions
•
Authors can be identified by their word usage frequencies
•
Word length may be used to distinguish between the Brontë sisters
–
Word length does not, however, extend to all authors (See Appendix C)
•
The glue words describe genuine differences between all four authors
–
K-means distinguishes the same patterns that multinomial regression classifies
•
This indicates that supervised training finds legitimate patterns, rather than artifacts
•
The Brontë sisters are the most similar authors
•
Upton Sinclair is the most different author
Appendix B: Code
•
See Attached .R files
Appendix C: Single Dimension 4-Author Classification
Classification using Multinomial Regression
[Figure: 4-class confusion matrix. Classes: Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US); cells labeled TP or FP per class. Values shown: 22%, 46%, 94%, 6%, 11%, 54%, 96%, 3%.]
References
[1] Argamon, Saric, Stein, “Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results,” SIGKDD 2003.
[2] Mitton, “Spelling checkers, spelling correctors and the misspellings of poor spellers,” Information Processing and Management, 1987.