One-Way ANOVA

plantationscarf, Artificial Intelligence and Robotics

25 Nov 2013


Cluster Analysis

Grouping Cases or Variables


Clustering Cases


Goal is to cluster cases into groups based
on shared characteristics.


Start out with each case being a one-case cluster.


The clusters are located in k-dimensional space, where k is the number of variables.


Compute the squared Euclidean distance
between each case and each other case.

Squared Euclidean Distance

$$\sum_{i=1}^{v} (X_i - Y_i)^2$$

The sum across variables (from i = 1 to v) of the squared difference between the score on variable i for the one case (X_i) and the score on variable i for the other case (Y_i).

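
The distance defined above can be sketched in a few lines of Python. This is only an illustration: the function name `squared_euclidean` and the two example cases are hypothetical and are not taken from the faculty data.

```python
def squared_euclidean(x, y):
    """Sum across variables of the squared difference between two cases' scores."""
    if len(x) != len(y):
        raise ValueError("both cases need a score on every variable")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


# Hypothetical scores on v = 2 variables (not taken from the faculty data).
case_a = [1.0, 2.0]
case_b = [4.0, 6.0]

print(squared_euclidean(case_a, case_b))  # (1 - 4)^2 + (2 - 6)^2 = 25.0
```
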
Agglomerate


The two cases closest to each other are
agglomerated into a cluster.


The distances between entities (clusters
and cases) are recomputed.


The two entities closest to each other are
agglomerated.


This continues until all cases end up in
one cluster.
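
The agglomeration steps above can be sketched as a short loop. The slides do not say how the distance between two multi-case clusters is recomputed, so this sketch assumes average linkage (the mean squared Euclidean distance over all cross-cluster pairs of cases); the function names and the four example cases are hypothetical.

```python
from itertools import combinations


def squared_euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def average_linkage(cluster_a, cluster_b, data):
    """Assumed rule: mean squared distance over all cross-cluster pairs of cases."""
    total = sum(squared_euclidean(data[i], data[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(data):
    """Return the merge schedule as (stage, cluster_a, cluster_b, coefficient)."""
    clusters = [(i,) for i in range(len(data))]  # each case starts as a one-case cluster
    schedule = []
    stage = 0
    while len(clusters) > 1:
        # Find the two entities (cases or clusters) closest to each other.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], data))
        coefficient = average_linkage(a, b, data)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))  # the merged cluster replaces both
        stage += 1
        schedule.append((stage, a, b, coefficient))
    return schedule


# Hypothetical data: four cases measured on two variables.
data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
schedule = agglomerate(data)
for row in schedule:
    print(row)
```

With n cases the loop always runs n - 1 stages, and the last stage leaves every case in a single cluster.
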

What is the Correct Solution?


You may have theoretical reasons to
expect a certain k-cluster solution.


Look at that solution and see if it matches
your expectations.


Alternatively, you may try to make sense
out of solutions at two or more levels of
the analysis.

Faculty Salaries


Subjects were faculty in Psychology at
ECU.


Variables were rank, experience, number
of publications, course load, and salary.


Data are at ClusterAnonFaculty.sav


Also see the statistical output


Analyze, Classify, Hierarchical Cluster

Statistics


Plots

Method

Save

Proximity Matrix


We did not request this, but if we had, it
would display a measure of dissimilarity
for each pair of entities.


The pair of cases with the smallest
squared Euclidean distance is clustered first.

         Cluster Combined
Stage    Cluster 1    Cluster 2    Coefficients
1        32           33           .000

Look at the Agglomeration Schedule.

Cases 32 and 33 are clustered. They
are very similar (distance = 0.000)

Agglomeration Schedule

         Cluster Combined                          Stage Cluster First Appears
Stage    Cluster 1    Cluster 2    Coefficients    Cluster 1    Cluster 2    Next Stage
1        32           33           .000            0            0            9
2        41           42           .000            0            0            6
3        43           44           .000            0            0            6
4        37           38           .000            0            0            5
5        37           39           .001            4            0            7
6        41           43           .002            2            3            27

Stages 2-5


The agglomeration schedule shows that in
Stage 2 cases 41 and 42 are clustered.


In Stage 3 cases 43 and 44 are clustered.


In Stage 4 cases 37 and 38 are clustered.


In Stage 5 case 39 is added to the cluster
that contains cases 37 and 38.


And so on.
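
One way to check a reading of the schedule is to replay it. The sketch below uses the six stages listed in the agglomeration schedule and assumes that a merged cluster keeps the label in the "Cluster 1" column, which matches the description of Stage 5. The function name `replay` is hypothetical.

```python
# First six stages of the agglomeration schedule: (stage, cluster 1, cluster 2).
stages = [(1, 32, 33), (2, 41, 42), (3, 43, 44),
          (4, 37, 38), (5, 37, 39), (6, 41, 43)]


def replay(stages):
    """Rebuild cluster membership and the 'Stage Cluster First Appears' columns."""
    members = {}    # cluster label -> set of case numbers
    formed_at = {}  # cluster label -> stage at which that cluster was last formed
    first_appears = []
    for stage, first, second in stages:
        # 0 means the entity is still a single case at this stage.
        first_appears.append((stage, formed_at.get(first, 0), formed_at.pop(second, 0)))
        # Assumption: the merged cluster keeps the "Cluster 1" label.
        members[first] = members.pop(first, {first}) | members.pop(second, {second})
        formed_at[first] = stage
    return members, first_appears


members, first_appears = replay(stages)
for label, cases in sorted(members.items()):
    print(label, sorted(cases))
for row in first_appears:
    print(row)
```

After Stage 6 the replay holds three multi-case clusters: cases 32 and 33; cases 37, 38, and 39; and cases 41 through 44. The recomputed "first appears" values for Stages 5 and 6 agree with the table.
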

Vertical Icicle, Two Clusters


Look at the top of the display (next slide).


You can see two clusters:


On the left, Boris through Willy


On the right, Deanna through Sunila


The two-cluster solution was adjuncts versus
full-time faculty.


Vertical Icicle, Three Clusters


Look at the second-highest white bar
in the icicle plot.


Now there are three clusters


Adjuncts


Junior faculty (Deanna through Mickey)


Senior faculty (Lawrence through Roslyn)

Vertical Icicle, Four Clusters


Look at the white bar furthest to the right.


Now there are four clusters


Adjuncts


Junior faculty


The acting chair (Lawrence)


The rest of the senior faculty (Catalina
through Roslyn)




The Dendrogram


At the far right you can see the two cluster
solution.


The next step to the left shows the three
cluster solution.


The next step to the left shows the four
cluster solution.


And so on.


Truncated and rotated dendrogram on next slide.

Compare Two Clusters


The 2 cluster solution was adjuncts versus
everybody else.


Look at the t tests in the output.


Adjuncts had lower rank, experience,
number of publications, course load, and
salary.


Compare Three Clusters


Look at the ANOVAs and plots.


The senior faculty had higher salary,
experience, rank, and number of pubs.
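
As a rough sketch of what such an ANOVA computes, the F ratio for one variable across the clusters can be worked out by hand. The salary values below are made up for illustration and are not the ECU data; the function name `one_way_anova_f` is hypothetical.

```python
def one_way_anova_f(groups):
    """F ratio: between-groups mean square over within-groups mean square."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up salaries (in thousands) for three clusters; NOT the ECU data.
adjuncts = [10.0, 12.0, 14.0]
junior = [40.0, 42.0, 44.0]
senior = [60.0, 62.0, 64.0]

f_ratio = one_way_anova_f([adjuncts, junior, senior])
print(round(f_ratio, 2))
```

A large F ratio means the cluster means differ by much more than the scores vary within clusters. That is expected when the clusters were formed from the same variables being tested.
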

Compare Four Clusters


The acting chair had a higher salary and
number of publications.



I Could Not Help Myself


With these data on hand, I could not resist
predicting salary from the other variables.


Salary was well correlated with Rank,
FTEs, Publications, and Experience.


In the multiple regression, only Rank and
FTEs had significant unique effects.


The residuals suggest who was being
overpaid and who underpaid.
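
A minimal sketch of the residual idea, assuming an ordinary least-squares fit with NumPy. The five rows of data are made up and only two predictors are used for brevity; none of the numbers come from the ECU file.

```python
import numpy as np

# Made-up values for five faculty members; NOT the ECU data.
rank = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
fte = np.array([0.5, 1.0, 1.0, 0.75, 1.0])
salary = np.array([21.0, 39.0, 52.0, 55.0, 73.0])  # in thousands

# Design matrix with an intercept column followed by the predictors.
X = np.column_stack([np.ones_like(rank), rank, fte])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)

predicted = X @ coef
residuals = salary - predicted  # positive: paid more than the model predicts

for i, r in enumerate(residuals):
    print(i, "above prediction" if r > 0 else "at or below prediction")
```

A positive residual marks a case paid more than the predictors would suggest, and a negative residual marks one paid less.
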


Split by Sex


For men, the unique effect of number of
publications was positive: more
publications, higher salary.


For women it was negative: more
publications, lower salary.


Curious.

Workaholism


Aziz & Zickar (2005)


Workaholics may be defined as those


High in work involvement,


High in drive to work, and


Low in work enjoyment.


For each case, a score was obtained for
each of these three dimensions.


The Three Cluster Solution


Workaholics


High work involvement


High drive to work


Low work enjoyment


Positively engaged workers


High work involvement


Medium drive to work


High work enjoyment


Unengaged workers


Low work involvement


Low drive to work


Low work enjoyment


Past research/theory indicated there
should be six clusters, but the theorized
six clusters were not obtained.

Clustering Variables


FactBeer.sav


The statistical output.


Analyze, Classify, Hierarchical Cluster

Statistics

Plots

Method

Proximity Matrix


Is simply the intercorrelation matrix.


The two most correlated variables are Color and Aroma (r = .909); they are
clustered on the first step.


Stage 2: Size and Alcohol (r = .904) are clustered.


Stage 3: Taste is added to the cluster that
already contains Color and Aroma.

Also See Other Tables & Plots


Stage 4: Cost is added to the cluster that
already contains Size and Alcohol.


Stage 5: The two clusters are combined.


But they are not very similar (similarity
coefficient = .038).


Now we have one cluster with six variables
and one with one (Reputation).
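
Clustering variables by correlation can be sketched the same way as clustering cases, except that the most similar pair (highest r) is merged at each stage. The slides do not state how similarity between multi-variable clusters is computed, so this sketch assumes the mean r over cross-cluster pairs. The ratings are made up; only the variable names are borrowed from the slides, and the values are not from FactBeer.sav.

```python
from itertools import combinations
from statistics import mean


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cluster_variables(columns):
    """Merge the most correlated pair at each stage (assumed: mean r between clusters)."""
    clusters = [(name,) for name in columns]

    def similarity(a, b):
        return mean(pearson_r(columns[u], columns[v]) for u in a for v in b)

    schedule = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: similarity(*pair))
        schedule.append((a, b, similarity(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return schedule


# Made-up ratings from six respondents; NOT the FactBeer.sav data.
columns = {
    "color": [1, 2, 3, 4, 5, 6],
    "aroma": [1, 2, 3, 4, 6, 5],
    "size": [6, 1, 4, 2, 5, 3],
    "alcohol": [6, 1, 4, 2, 3, 5],
}

schedule = cluster_variables(columns)
for a, b, r in schedule:
    print(a, b, round(r, 3))
```

In this toy example the last stage joins two clusters whose similarity is close to zero, which mirrors the low similarity coefficient reported at Stage 5.
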