# CLUSTER ANALYSIS IN SAS

Nov 25, 2013

MULTIVARIATE STATISTICS / G

This is a fairly general program for carrying out a cluster analysis on the heptathlon data.

data cluster;

infile 'e:\Multivariate\wh08.txt';

input id name \$ hurd hurdpt high highpt shot shotpt

two twopt long longpt jav javpt eight eightpt;

proc cluster

/* will be based on Euclidean distance between individuals */

method = single

/* other options are average and complete */

simple

/* prints summary statistics of the variables */

nonorm;

/* see below */

var hurd high shot two long jav eight;

id name;

/* identifies rows by the names of the athletes */

run;

/* draw the dendrogram */

proc tree;

id name;

run;

The nonorm option prevents the distances from being normalised to unit mean or unit root mean square. If you use method = average, the cluster options should include nosquare, thus:

proc cluster method = average simple nonorm nosquare;

where nosquare prevents the distances from being squared.
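For reference, the three values of method mentioned above (single, average and complete) differ only in how the distance between two clusters is built from the distances between individuals. The notation below is introduced here for illustration and does not come from the handout: d(x_i, x_j) is the distance between individuals i and j, and K and L are clusters containing N_K and N_L individuals.

```latex
% Cluster-to-cluster distances for the three linkage methods.
% d, K, L, N_K and N_L are illustrative notation, not the handout's.
\begin{align*}
D_{\text{single}}(K, L)   &= \min_{i \in K,\, j \in L} d(x_i, x_j) \\
D_{\text{complete}}(K, L) &= \max_{i \in K,\, j \in L} d(x_i, x_j) \\
D_{\text{average}}(K, L)  &= \frac{1}{N_K N_L} \sum_{i \in K} \sum_{j \in L} d(x_i, x_j)
\end{align*}
```

Single and complete linkage depend only on the ordering of the distances, whereas an average changes if the distances are squared first, which is why nosquare matters for method = average.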

To standardise the heptathlon data, use the following program.

data stdise;

infile 'e:\Multivariate\wh08.txt';

input id name \$ hurd hurdpt high highpt shot shotpt two twopt long longpt jav javpt eight eightpt;

/* make a copy of the variables to be standardised */

hurds = hurd;

highs = high;

shots = shot;

twos = two;

longs = long;

javs = jav;

eights = eight;

/* do the standardisation */

proc standard mean = 0 std = 1 out = stddata;

var hurds highs shots twos longs javs eights;

run;

/* print out the new values of the variables */

proc print data = stddata;

var hurds highs shots twos longs javs eights;

run;

/* redo the clustering: you can of course add to these commands if desired */

proc cluster data = stddata method = single simple nonorm;

var hurds highs shots twos longs javs eights;

run;
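A quick way to confirm that the standardisation did what was intended is to ask for the mean and standard deviation of the new variables. This is only a sketch; it assumes the stddata data set produced by the program above is still available in the same SAS session.

```sas
/* check the standardised copies: each mean should be (very nearly) 0
   and each standard deviation should be 1 */
proc means data = stddata mean std;
  var hurds highs shots twos longs javs eights;
run;
```

Values that are only approximately zero (for example of the order of 1E-15) are rounding error rather than a problem with the program.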

If the data are already in the form of a distance matrix, use this form of SAS program to produce clusters. The program below applies clustering algorithms to the "alpha beta gamma delta epsilon" distance matrix considered in lectures. This is the route you will have to take if you want to use a distance measure other than Euclidean.

data cluster (type = distance);

input name \$ alpha beta gamma delta epsilon;

cards;

alpha 0 2 4 1 2

beta 2 0 6 3 5

gamma 1 6 0 5 6

delta 4 3 5 0 4

epsilon 2 5 6 4 0

;

proc cluster method = single nonorm;

var alpha beta gamma delta epsilon;

id name;

run;

proc tree; id name;

run;
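The same distance data set can be clustered again with a different linkage, and the tree can be cut to give an explicit list of cluster memberships. The following is a sketch rather than part of the original handout: the data set names tree_complete and groups are made up for illustration, and the choice of two clusters is arbitrary.

```sas
/* complete linkage on the same distance matrix, saving the tree */
proc cluster data = cluster method = complete nonorm outtree = tree_complete;
  var alpha beta gamma delta epsilon;
  id name;
run;

/* cut the tree into two clusters and save the memberships */
proc tree data = tree_complete nclusters = 2 out = groups noprint;
  id name;
run;

/* one row per object, with the cluster it was assigned to */
proc print data = groups;
run;
```

Changing nclusters shows how the groups merge as the tree is cut at a higher level.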

For continuous variables, Doug provided the following macros and instructions.

%include 'n:\Student\Apstats5\SAS/xmacro.sas';

%include 'n:\Student\Apstats5\SAS/distnew.sas';

%include 'n:\Student\Apstats5\SAS/stdize.sas';

%include 'n:\Student\Apstats5\SAS/distance.sas';

data toy;

input X1 X2 @@;

datalines;

2 0 2 1 4 0 4 3 5 3 5 8 6 6

;

run;

%distance(data=toy, out=dist, shape=square, method=city, var=x1-x2);

Title 'City block distance matrix';

proc print data=dist;

run;

%distance(data=toy, out=dist, shape=square, method=euclid, var=x1-x2);

Title 'Euclidean distance matrix';

proc print data=dist;

run;
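As a hand check on the two printed matrices, the distances between the first and fourth observations of the toy data, (2, 0) and (4, 3), can be worked out directly. The subscripts 1 and 4 are simply the observation numbers in the order the data are read, and the check assumes that method=city and method=euclid give the usual city block and Euclidean distances, as the titles indicate.

```latex
% Observation 1 is (2, 0); observation 4 is (4, 3).
\begin{align*}
d_{\text{city}}(1, 4)   &= |2 - 4| + |0 - 3| = 2 + 3 = 5 \\
d_{\text{euclid}}(1, 4) &= \sqrt{(2 - 4)^2 + (0 - 3)^2} = \sqrt{13} \approx 3.606
\end{align*}
```

These values should appear in row 4, column 1 of the corresponding printed matrices, and again in row 1, column 4 because shape=square requests the full symmetric matrix.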

%stdize(data=toy, var=x1-x2, out=std);

%distance(data=std, out=dist_std, shape=square, method=city, var=x1-x2);

Title 'City block distance matrix with standardised values';

proc print data=dist_std;

run;

%distance(data=std, out=dist_std, shape=square, method=euclid, var=x1-x2);

Title 'Euclidean distance matrix with standardised values';

proc print data=dist_std;

run;
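If the macro files are not to hand, broadly the same steps can probably be carried out with built-in procedures. The sketch below is an alternative to Doug's macros, not a description of them: it assumes a SAS/STAT release that includes PROC STDIZE and PROC DISTANCE, and the output names std2 and dist2 are made up for illustration. It has not been checked that the macros and the procedures give identical output.

```sas
/* standardise x1 and x2 to mean 0 and standard deviation 1 */
proc stdize data = toy out = std2 method = std;
  var x1 x2;
run;

/* city block distances between the standardised observations;
   use method = euclid for Euclidean distances instead */
proc distance data = std2 out = dist2 method = cityblock shape = square;
  var interval(x1 x2);
run;

title 'City block distance matrix with standardised values (PROC DISTANCE)';
proc print data = dist2;
run;

/* dist2 is a distance data set, so it can be passed to PROC CLUSTER
   in the same way as the distance matrix entered by hand earlier */
proc cluster data = dist2 method = single nonorm;
run;
```

Because no id statement is given, the observations are labelled by their observation number in the cluster history.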

I do not know of any SAS procedures for distances involving binary-valued variables.