Data Mining Final Project


Nov 20, 2013


Nick Foti

Eric Kee

Topic: Author Identification


Author Identification


Given writing samples, can we determine who wrote them?


This is a well-studied field


See also: “stylometry”


This has been applied to works such as


The Bible


Shakespeare


Modern texts as well

Corpus Design


A corpus is:



A body of text used for linguistic analysis


Used Project Gutenberg to create corpus


The corpus was designed as follows


Four authors of varying similarity


Anne Brontë


Charlotte Brontë


Charles Dickens


Upton Sinclair


Multiple books per author


Corpus size: 90,000 lines of text


Dataset Design


Extracted features common in literature


Word Length


Frequency of “glue” words


See Appendix A and [1,2] for list of glue words



Note: corpus was processed using


C#, Matlab, Python


Data set parameters are


Number of dimensions: 309


Word length and 308 glue words


Number of observations: ≈ 3,000


Each observation ≈ 30 lines of text from a book
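
The feature extraction described above can be sketched as follows. This is a minimal illustration, not the project's actual code (which used C#, Matlab, and Python and is not shown in the slides). The function names, the word-splitting regular expression, and the choice of relative frequency (count divided by total words) are assumptions; glue words are expected in lowercase.

```python
import re

def extract_features(lines, glue_words):
    """Return [average word length, relative frequency of each glue word]
    for one observation (a chunk of text lines)."""
    text = " ".join(lines).lower()
    # Assumed tokenization: letters, optionally joined by apostrophes or
    # hyphens, so "can't" and "no-one" count as single words.
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text)
    if not words:
        return [0.0] * (1 + len(glue_words))
    avg_len = sum(len(w) for w in words) / len(words)
    counts = {g: 0 for g in glue_words}
    for w in words:
        if w in counts:
            counts[w] += 1
    return [avg_len] + [counts[g] / len(words) for g in glue_words]

def make_observations(book_lines, glue_words, lines_per_obs=30):
    """Split a book into consecutive chunks of about 30 lines and featurize
    each chunk. The final chunk may be shorter than lines_per_obs."""
    return [
        extract_features(book_lines[i:i + lines_per_obs], glue_words)
        for i in range(0, len(book_lines), lines_per_obs)
    ]
```

With the 308 glue words of Appendix A, each observation would be a vector of 309 numbers, matching the dimension count stated above.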

Classifier Testing and Analysis


Tested classifier with test data


Used testing and training data sets


70% for training, 30% for testing


Used cross-validation


Analyzed Classifier Performance


Used ROC plots


Used confusion matrices



Used common plotting scheme (right)

[Figure: example plot for Anne B. vs. Charlotte B. Labels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Percentages in extracted order: 78%, 22%, 55%, 45%. Red dots indicate true-positive cases.]
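
A minimal sketch of the 70%/30% split and of per-class rates, under stated assumptions. The slides do not define exactly how the "TP" and "FP" percentages were computed; here the first number is the fraction of a class's test observations labeled correctly and the second is the remainder, which is only one possible reading. The function names and the fixed random seed are hypothetical.

```python
import random

def train_test_split(observations, labels, train_fraction=0.7, seed=0):
    """Shuffle the indices and split them into training and testing sets."""
    indices = list(range(len(observations)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(indices)))
    train, test = indices[:cut], indices[cut:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(observations, train), pick(labels, train),
            pick(observations, test), pick(labels, test))

def per_class_rates(true_labels, predicted_labels):
    """For each true class: (fraction labeled correctly, fraction mislabeled)."""
    rates = {}
    for cls in sorted(set(true_labels)):
        preds = [p for t, p in zip(true_labels, predicted_labels) if t == cls]
        correct = sum(p == cls for p in preds) / len(preds)
        rates[cls] = (correct, 1.0 - correct)
    return rates
```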




Binary Classification

Word Length Classification


Calculated average word length for each observation


Computed Gaussian kernel density from word length samples


Used ROC curve to calculate cutoff


Optimized sensitivity and specificity with equal importance
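
One common way to weight sensitivity and specificity equally is to pick the cutoff that maximizes their sum (Youden's J statistic). The sketch below assumes that criterion and works directly on the raw average-word-length samples rather than on a kernel density estimate. The function name and the convention that the "positive" author has the larger values are assumptions.

```python
def best_cutoff(scores_pos, scores_neg):
    """Return (threshold, J) maximizing sensitivity + specificity - 1.

    An observation is called positive when its score is >= threshold.
    Assumes the positive class tends to have the larger scores.
    """
    candidates = sorted(set(scores_pos) | set(scores_neg))
    best_t, best_j = None, -1.0
    for t in candidates:
        sensitivity = sum(s >= t for s in scores_pos) / len(scores_pos)
        specificity = sum(s < t for s in scores_neg) / len(scores_neg)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

A value of J equal to 1 corresponds to a cutoff that separates the two authors perfectly on the training samples.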

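Word Length: Anne B. vs. Upton S.

[Figure: word length classification, Anne B. vs. Upton Sinclair. Labels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP. Percentages in extracted order: 100%, 0%, 100%, 0%. Legend as extracted: Anne Brontë, Charlotte Brontë.]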
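Word Length: Brontë vs. Brontë

[Figure: word length classification, Anne B. vs. Charlotte B. Labels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Percentages in extracted order: 100%, 0%, 78.1%, 21.9%. Legend: Anne Brontë, Charlotte Brontë.]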

Principal Component Analysis


Used PCA to find a better axis


Notice: distribution similar to word length distribution


Is word length the only useful dimension?

[Figure: Anne Brontë vs. Upton Sinclair. Panel titles: Word Length Density; PCA Density; Without word length.]

Principal Component Analysis


It appears that word length is the most useful axis


We’ll come back to this…

[Figure: Anne Brontë vs. Upton Sinclair, PCA Density.]
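
For readers unfamiliar with PCA, the idea of "finding a better axis" can be sketched as computing the direction of largest variance and projecting each observation onto it. The sketch below uses power iteration on the sample covariance matrix in plain Python; it is an illustration under that assumption, not the method used in the project, and it is far too slow for 309 dimensions in practice. The function name is hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, projections) for the top principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Arbitrary starting direction; power iteration would stall if this
    # happened to be exactly orthogonal to the top component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections
```

Plotting the density of `projections` per author would give a picture comparable to the PCA density plots referred to above.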

K-Means


Used K-means to find dominant patterns


Unnormalized


Normalized


Trained K-means on training set


To classify observations in test set


Calculate distance of observation to each class mean


Assign observation to the closest class


Performed cross-validation to estimate performance
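
The train-then-assign procedure above can be sketched as follows. This is a minimal version of Lloyd's algorithm with caller-supplied starting centroids, plus nearest-mean assignment. The slides do not say how the "normalized" variant was scaled; the `zscore` helper (zero mean, unit variance per dimension) is an assumption, as are all function names.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda k: distance(point, centroids[k]))

def kmeans(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and mean update until stable."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[k]
            for k, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def zscore(points):
    """Assumed normalization: zero mean and unit variance in every dimension."""
    n = len(points)
    cols = list(zip(*points))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(p, means, stds)] for p in points]
```

A test observation is then classified with `nearest(observation, centroids)`, where each centroid has been matched to an author using the training labels.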

Unnormalized K-means

[Figure: Anne Brontë vs. Upton Sinclair. Labels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP. Percentages in extracted order: 98.1%, 1.9%, 92.1%, 7.9%.]

Unnormalized K-means

[Figure: Anne Brontë vs. Charlotte Brontë. Labels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Percentages in extracted order: 95.7%, 4.3%, 74.7%, 25.3%.]

Normalized K-means

[Figure: Anne Brontë vs. Upton Sinclair. Labels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP. Percentages in extracted order: 53.3%, 46.7%, 49.4%, 50.6%.]

Normalized K-means

[Figure: Anne Brontë vs. Charlotte Brontë. Labels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Percentages in extracted order: 15.8%, 84.2%, 86.7%, 13.3%.]

Discriminant Analysis


Performed discriminant analysis


Computed with equal covariance matrices


Used average Omega of class pairs


Computed with unequal covariance matrices


Quadratic discrimination fails because covariance matrices have 0 determinant (see equation below)


Computed theoretical misclassification probability


To perform quadratic discriminant analysis


Compute Equation 1 for each class


Choose class with minimum value

(1)
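
Equation 1 itself is not present in this copy of the slides. A standard form of the quadratic discriminant score that matches the description (one value per class, smallest value wins, and a breakdown when a class covariance matrix has zero determinant) is sketched below. The symbols $\mu_k$ (class mean), $\Sigma_k$ (class covariance), and $\pi_k$ (class prior) are assumptions of this sketch and may differ from the original notation.

```latex
d_k(x) = (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) + \ln\lvert \Sigma_k \rvert - 2 \ln \pi_k
```

When $\lvert \Sigma_k \rvert = 0$, neither $\Sigma_k^{-1}$ nor $\ln\lvert \Sigma_k \rvert$ is defined, which is consistent with the failure of quadratic discrimination noted above.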

Discriminant Analysis

[Figure: Anne Brontë vs. Upton Sinclair. Labels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP. Percentages in extracted order: 92.2%, 3.8%, 96.2%, 7.8%.]

Theoretical P(err) = 0.149

Empirical P(err) = 0.116
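
The slides do not state which formula produced the theoretical error. One textbook result, assuming two Gaussian classes with a shared covariance matrix and equal priors, is P(err) = Φ(−Δ/2), where Δ is the Mahalanobis distance between the class means and Φ is the standard normal CDF. The sketch below implements that formula alongside a plain empirical error rate; the function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lda_error_probability(mahalanobis_distance):
    """Phi(-Delta / 2): assumes two Gaussian classes, equal covariance,
    equal priors. This is an assumed formula, not taken from the slides."""
    return normal_cdf(-mahalanobis_distance / 2.0)

def empirical_error(true_labels, predicted_labels):
    """Fraction of test observations assigned to the wrong class."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)
```

Under this formula, identical class means (Δ = 0) give an error of 0.5, and the error shrinks toward 0 as the classes move apart.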

Discriminant Analysis

[Figure: Anne Brontë vs. Charlotte Brontë. Labels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Percentages in extracted order: 92.7%, 7.3%, 89.2%, 10.8%.]

Theoretical P(err) = 0.181

Empirical P(err) = 0.152

Logistic Regression


Fit linear model to training data on all dimensions


Threw out singular dimensions


Left with ≈ 298 coefficients + intercept


Projected training data onto synthetic variable


Found threshold by minimizing error of misclassification


Projected testing data onto synthetic variable


Used threshold to classify points
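
The steps above (fit, project onto the synthetic variable, choose a threshold, classify) can be sketched in plain Python. The project's own fit was done in R and is not shown; this version uses batch gradient descent, and the learning rate, epoch count, and function names are assumptions. Labels are 0 and 1.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the log-loss; returns (weights, intercept)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= learning_rate * grad_b / len(X)
        w = [wj - learning_rate * g / len(X) for wj, g in zip(w, grad_w)]
    return w, b

def project(X, w, b):
    """The synthetic variable: the linear predictor b + w.x for each row."""
    return [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]

def best_threshold(scores, y):
    """Threshold with the fewest training misclassifications,
    predicting class 1 when score > threshold."""
    ordered = sorted(set(scores))
    candidates = [ordered[0] - 1.0] + ordered
    return min(candidates,
               key=lambda t: sum((s > t) != bool(yi) for s, yi in zip(scores, y)))
```

Test points are classified by computing `project(X_test, w, b)` and comparing each value with the threshold found on the training data.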

Logistic Regression

[Figure: Anne Brontë vs. Charlotte Brontë. Labels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Percentages in extracted order: 89.5%, 10.5%, 8%, 92%. Legend: Anne Brontë, Charlotte Brontë.]

Logistic Regression

[Figure: Anne Brontë vs. Upton Sinclair. Labels: Anne B. TP, Anne B. FP, Upton S. TP, Upton S. FP. Percentages in extracted order: 98%, 2%, 99%, 2%. Legend: Anne Brontë, Upton Sinclair.]




4-Class Classification

4-Class K-means


Used K-means to find patterns among all classes


Unnormalized


Normalized


Trained using a training set


Tested performance as in 2-class K-means


Performed cross-validation to estimate performance
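
Cross-validation as mentioned above can be sketched as a k-fold loop. The number of folds used in the project is not stated, so `k=5` is an assumption, as are the function names and the `fit`/`predict` callback interface.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n items."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(X, y, fit, predict, k=5):
    """Average accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a label.
    Requires at least k observations so that no fold is empty.
    """
    accuracies = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Any of the classifiers in this deck could be plugged in through `fit` and `predict`, for example a K-means trainer and a nearest-mean assignment.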

Unnormalized K-Means

[Figure: 4-class confusion matrix for Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US), and Charles Dickens (CD). Cell labels pair each author abbreviation with TP or FP. Percentages in extracted order: 22%, 54%, 87%, 34%, 59%, 88%.]

Normalized K-Means

[Figure: 4-class confusion matrix for Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US), and Charles Dickens (CD). Cell labels pair each author abbreviation with TP or FP. Percentages in extracted order: 20%, 67%, 26%, 67%, 70%, 67%, 27%.]

Additional K-means testing


Also tested K-means without word length


Recall that we had perfect classification with 1D word length (see plot below)


Is K-means using only 1 dimension to classify?


Note: perfect classification only occurs between Anne B. and Sinclair

[Figure: word length plot, Anne Brontë vs. Upton Sinclair.]

Unnormalized K-Means (No Word Length)

[Figure: 4-class confusion matrix for Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US), and Charles Dickens (CD). Cell labels pair each author abbreviation with TP or FP. Percentages in extracted order: 35%, 29%, 44%, 33%, 35%, 43%, 72%.]


K-means can classify without word length

Multinomial Regression


Multinomial distribution


Extension of binomial distribution


Random variable is allowed to take on n values


Used multinom(…) to fit log-linear model for training


Used 248 dimensions (max limit on computer)


Returns 3 coefficients per dimension and 3 intercepts


Found probability that each observation belongs to each class

Multinomial Regression


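Multinomial Logit Function is

$$\Pr(y_i = j) = \frac{\exp\left(c_j + \beta_j^{\top} x_i\right)}{1 + \sum_{k=1}^{3} \exp\left(c_k + \beta_k^{\top} x_i\right)}$$

for each of the three non-reference classes $j$, where $\beta_j$ are the coefficients and $c_j$ are the intercepts


To classify


Compute probabilities


$\Pr(y_i = \text{Dickens})$, $\Pr(y_i = \text{Anne B.})$, …


Choose class with maximum probability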
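
The classification rule above can be sketched as follows, assuming the usual reference-class parameterization of a multinomial logit model: one class has its score fixed at 0 and each of the other three classes has its own coefficient vector and intercept, which agrees with the "3 coefficients per dimension and 3 intercepts" noted earlier. Which author served as the reference class is not stated in the slides; the function names and argument layout are hypothetical.

```python
import math

def class_probabilities(x, coefficients, intercepts, classes):
    """Multinomial logit probabilities.

    classes[0] is the assumed reference class (score 0); coefficients[i] and
    intercepts[i] belong to classes[i + 1].
    """
    scores = [0.0] + [
        c + sum(b * xi for b, xi in zip(beta, x))
        for beta, c in zip(coefficients, intercepts)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {name: e / total for name, e in zip(classes, exps)}

def classify(x, coefficients, intercepts, classes):
    """Choose the class with maximum probability."""
    probs = class_probabilities(x, coefficients, intercepts, classes)
    return max(probs, key=probs.get)
```

For example, with made-up coefficients `[[1.0], [0.0], [-1.0]]`, zero intercepts, and classes `["CD", "AB", "CB", "US"]`, the input `[2.0]` is assigned to `"AB"` because that class has the largest linear score.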

Multinomial Regression

[Figure: 4-class confusion matrix for Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US), and Charles Dickens (CD). Cell labels pair each author abbreviation with TP or FP. Percentages in extracted order: 78%, 86%, 83%, 93%.]

[Figure: 4-class confusion matrix (without word length) for Upton Sinclair (US), Charles Dickens (CD), Anne Brontë (AB), and Charlotte Brontë (CB). Cell labels pair each author abbreviation with TP or FP. Percentages in extracted order: 76%, 79%, 79%, 91%.]

Multinomial Regression



Multinomial regression does not require word length


Appendix A: Glue Words

I

a

aboard

about

above

across

after

again

against

ago

ahead

all

almost

along

alongside

already

also

although

always

am

amid

amidst

among

amongst

an

and

another

any

anybody

anyone

anything

anywhere

apart

are

aren't

around

as

aside

at

away

back

backward

backwards

be

because

been

before

beforehand

behind

being

below

between

beyond

both

but

by

can

can't

cannot

could

couldn't

dare

daren't

despite

did

didn't

directly

do

does

doesn't

doing

don't

done

down

during

each

either

else

elsewhere

enough

even

ever

evermore

every

everybody

everyone

everything

everywhere

except

fairly

farther

few

fewer

for

forever

forward

from

further

furthermore

had

hadn't

half

hardly

has

hasn't

have

haven't

having

he

hence

her

here

hers

herself

him

himself

his

how

however

if

in

indeed

inner

inside

instead

into

is

isn't

it

its

itself

just

keep

kept

later

least

less

lest

like

likewise

little

low

lower

many

may

mayn't

me

might

mightn't

mine

minus

more

moreover

most

much

must

mustn't

my

myself

near

need

needn't

neither

never

nevertheless

next

no

no-one

nobody

none

nor

not

nothing

notwithstanding

now

nowhere

of

off

often

on

once

one

ones

only

onto

opposite

or

other

others

otherwise

ought

oughtn't

our

ours

ourselves

out

outside

over

own

past

per

perhaps

please

plus

provided

quite

rather

really

round

same

self

selves

several

shall

shan't

she

should

shouldn't

since

so

some

somebody

someday

someone

something

sometimes

somewhat

still

such

than

that

the

their

theirs

them

themselves

then

there

therefore

these

they

thing

things

this

those

though

through

throughout

thus

till

to

together

too

towards

under

underneath

undoing

unless

unlike

until

up

upon

upwards

us

versus

very

via

was

wasn't

way

we

well

were

weren't

what

whatever

when

whence

whenever

where

whereas

whereby

wherein

wherever

whether

which

whichever

while

whilst

whither

who

whoever

whom

with

whose

within

why

without

will

won't

would

wouldn't

yet

you

your

yours

yourself

yourselves


Conclusions


Authors can be identified by their word usage frequencies


Word length may be used to distinguish between the Brontë sisters


Word length does not, however, extend to all authors (see Appendix C)


The glue words describe genuine differences between all four authors


K-means distinguishes the same patterns that multinomial regression classifies


This indicates that supervised training finds legitimate patterns, rather than artifacts


The Brontë sisters are the most similar authors


Upton Sinclair is the most different author



Appendix B: Code


See Attached .R files

Appendix C: Single Dimension 4-Author Classification

[Figure: 4-class confusion matrix, classification using multinomial regression, for Anne Brontë (AB), Charlotte Brontë (CB), Upton Sinclair (US), and Charles Dickens (CD). Cell labels pair each author abbreviation with TP or FP. Percentages in extracted order: 22%, 46%, 94%, 6%, 11%, 54%, 96%, 3%.]

References


[1] Argamon, Saric, Stein, “Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results,” SIGKDD 2003.


[2] Mitton, “Spelling checkers, spelling correctors and the misspellings of poor spellers,” Information Processing and Management, 1987.