
Oct 15, 2013


Lab 9



Creating Association Rules (due M 4/30)


In this lab we will use the apriori algorithm to create association rules for the Adult data set
available on the UCI Machine Learning Repository.

This data set involves Census data and contains a class attribute of whether or not
income exceeds $50,000.



PART 1


In this part we turn numerical attributes into nominal and create transaction data out of a data matrix.


1. Download the Adult data set from the UCI repository and create a csv file.

http://archive.ics.uci.edu/ml/datasets/Adult

The data are in adult.data and the attribute names are in adult.names.
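One way to build the csv file directly in R is sketched below. adult.data has no header row, so column names have to be supplied; the names used here are an assumption, chosen to match the identifiers that appear later in this lab (education.num, hours.per.week, native.country, Class). Two sample rows in the adult.data layout stand in for the downloaded file so that the sketch runs on its own; in practice raw_file would be the path to adult.data and csv_file would be "Adult.csv".

```r
# Column names are an assumption chosen to match the names used later in the lab.
col_names <- c("age", "workclass", "fnlwgt", "education", "education.num",
               "marital.status", "occupation", "relationship", "race", "sex",
               "capital.gain", "capital.loss", "hours.per.week",
               "native.country", "Class")

# Stand-in for the downloaded adult.data: two sample rows in the same layout
# (comma-separated, no header line, a space after each comma).
raw_file <- tempfile(fileext = ".data")
writeLines(c(
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
  "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K"
), raw_file)

# Read without a header and attach the column names.
raw <- read.csv(raw_file, header = FALSE, col.names = col_names)

# Write a csv file that does have a header row, then read it back.
csv_file <- tempfile(fileext = ".csv")
write.csv(raw, csv_file, row.names = FALSE)
adult <- read.csv(csv_file, header = TRUE, stringsAsFactors = TRUE)
str(adult)
```

Because white space is not stripped, text values keep their leading space (for example " United-States"), which is why item labels later in the lab contain a space after the equals sign.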


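2. Open R and read in the data set.

> adult <- read.csv("Adult.csv", header = TRUE, stringsAsFactors = TRUE)

> summary(adult)

(Setting stringsAsFactors = TRUE matters in R 4.0 and later, where read.csv otherwise leaves text
columns as character vectors.)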

How many attributes are there? How many records? Which attributes are integers?
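These questions can be answered with a few base R calls. The sketch below uses a small hypothetical data frame named toy in place of adult; the values are made up for illustration.

```r
# Hypothetical stand-in for the adult data frame.
toy <- data.frame(age = c(39L, 50L, 38L),
                  workclass = factor(c("Private", "State-gov", "Private")),
                  hours.per.week = c(40L, 13L, 40L))

n_records    <- nrow(toy)   # number of records (rows)
n_attributes <- ncol(toy)   # number of attributes (columns)

# Names of the attributes that are stored as integers.
integer_attrs <- names(toy)[sapply(toy, is.integer)]

print(c(records = n_records, attributes = n_attributes))
print(integer_attrs)
```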


3. Remove some attributes and convert integer attributes to factors.

a. Remove fnlwgt and education.num.

> adult[["fnlwgt"]] <- NULL

> adult[["education.num"]] <- NULL


b. Group age.

> adult[["age"]] <- ordered(cut(adult[["age"]], c(15, 25, 45, 65, 100)),
      labels = c("Young", "Middle", "Older", "Senior"))


c. Group hours.per.week.

> adult[["hours.per.week"]] <- ordered(cut(adult[["hours.per.week"]],
      c(0, 25, 40, 60, 168)),
      labels = c("Part-time", "Full-time", "Over-time", "VeryHigh"))


d. Group capital.gain and capital.loss.

> adult[["capital.gain"]] <- ordered(cut(adult[["capital.gain"]],
      c(-Inf, 0, median(adult[["capital.gain"]][adult[["capital.gain"]] > 0]), Inf)),
      labels = c("None", "Low", "High"))

> adult[["capital.loss"]] <- ordered(cut(adult[["capital.loss"]],
      c(-Inf, 0, median(adult[["capital.loss"]][adult[["capital.loss"]] > 0]), Inf)),
      labels = c("None", "Low", "High"))
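The grouping steps above rely on two behaviours of cut(): intervals are closed on the right by default, and the middle break for capital gains and losses is the median of the positive values only. The sketch below illustrates both on small hypothetical vectors (ages and gains are made-up values, not taken from the Adult data). It passes labels to cut() directly, which is an alternative spelling of the ordered(cut(...), labels = ...) pattern used in the lab.

```r
# Hypothetical ages chosen to sit on and around the break points.
ages <- c(17, 25, 26, 45, 66)
age_breaks <- c(15, 25, 45, 65, 100)

# Right-closed intervals: (15,25], (25,45], (45,65], (65,100].
age_group <- cut(ages, breaks = age_breaks,
                 labels = c("Young", "Middle", "Older", "Senior"),
                 ordered_result = TRUE)
print(data.frame(ages, age_group))

# A value equal to the lowest break falls outside every interval.
lowest <- cut(15, breaks = age_breaks)
print(is.na(lowest))

# Hypothetical capital gains: most records have none.
gains <- c(0, 0, 0, 100, 200, 5000)

# The middle break is the median of the positive gains only.
pos_median <- median(gains[gains > 0])
gain_group <- cut(gains, breaks = c(-Inf, 0, pos_median, Inf),
                  labels = c("None", "Low", "High"),
                  ordered_result = TRUE)
print(table(gain_group))
```

Taking the median over positive values only splits the non-zero records roughly in half; a median over all values would be 0 whenever most records have no gain.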



We will need the association rules package arules. Documentation is available on e-reserve.


Load the package in R.


> library(arules)


4. Create transaction data.

> adult2 <- as(adult, "transactions")

> adult2

> inspect(adult2[1:2])

How many transactions are there? How many items?
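To see where the item count comes from, the following base R sketch mimics the idea behind the conversion on a small hypothetical data frame: every attribute/level pair becomes one item labelled "attribute=level", and every row becomes one transaction. It illustrates the concept only and is not the arules implementation.

```r
# Hypothetical two-attribute data frame of factors.
toy <- data.frame(
  sex   = factor(c("Male", "Female", "Male")),
  Class = factor(c("<=50K", "<=50K", ">50K"))
)

# One item per attribute/level pair.
items <- unlist(lapply(names(toy), function(a) {
  paste0(a, "=", levels(toy[[a]]))
}))
print(items)

# Each row becomes one transaction: one item per attribute.
# apply() returns a matrix with one column per row of toy.
transactions <- apply(toy, 1, function(row) paste0(names(toy), "=", row))
print(t(transactions))

n_items        <- length(items)       # sum of the number of levels
n_transactions <- ncol(transactions)  # one per record
```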





PART 2


For this part we will use the apriori algorithm to create association rules from the census data.


5. Create association rules. Use minsup=0.5 and minconf=0.85.

> rules <- apriori(adult2, parameter = list(supp = 0.5, conf = 0.85, target = "rules", minlen = 2))

> summary(rules)

> inspect(rules)


How many rules met the minimum threshold criteria? What interestingness measures are provided by
default? Which rule has the highest of each?


> inspect(head(sort(rules, by = "confidence")))
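As a reminder of what the thresholds mean, the sketch below computes support, confidence and lift by hand for a single rule {A} => {B}. The vectors A and B are hypothetical presence indicators for two items over eight transactions; they are not taken from the Adult data.

```r
# Hypothetical item indicators for 8 transactions (TRUE = item present).
A <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

supp_A  <- mean(A)       # fraction of transactions containing A
supp_B  <- mean(B)       # fraction of transactions containing B
supp_AB <- mean(A & B)   # support of the rule {A} => {B}

# Confidence: how often B appears among transactions that contain A.
confidence <- supp_AB / supp_A

# Lift: confidence relative to how common B is overall.
lift <- confidence / supp_B

print(c(support = supp_AB, confidence = confidence, lift = lift))
```

A rule whose right-hand side is very common can reach high confidence while its lift stays close to 1.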


6. Add additional interestingness measures.

Add IS and phi-coefficient.

> quality(rules) <- cbind(quality(rules),
      IS = interestMeasure(rules, "cosine", transactions = adult2),
      phi = interestMeasure(rules, "phi", transactions = adult2))


Which rule has the highest of each of the new measures?
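For reference, the two added measures can be computed by hand for a single rule {A} => {B}. The sketch below uses the usual definitions (IS as the cosine of the two item indicator vectors, phi as their correlation) on hypothetical indicator vectors; it is meant as a check on intuition, not as a description of how arules computes them internally.

```r
# Hypothetical item indicators for 8 transactions (TRUE = item present).
A <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

supp_A  <- mean(A)
supp_B  <- mean(B)
supp_AB <- mean(A & B)

# IS (cosine): joint support scaled by the geometric mean of the supports.
IS <- supp_AB / sqrt(supp_A * supp_B)

# Phi coefficient: covariance of the indicators over the product of
# their standard deviations.
phi <- (supp_AB - supp_A * supp_B) /
  sqrt(supp_A * (1 - supp_A) * supp_B * (1 - supp_B))

# Phi equals the Pearson correlation of the 0/1 indicator vectors.
phi_check <- cor(as.numeric(A), as.numeric(B))

print(c(IS = IS, phi = phi, phi_check = phi_check))
```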


7. Find the rules with lift above 1 and rhs= “native.country= United-States”. What are the top 3 rules?

> rules.sub <- subset(rules, subset = rhs %pin% "native.country= United-States" & lift > 1)

> inspect(sort(rules.sub)[1:3])
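The filter above combines a partial match on the right-hand-side item label with a threshold on lift. A rough base R analogue on a hypothetical table of rules is sketched below; the data frame rules_df and all of its values are invented for illustration, and grepl() stands in for the partial matching done by %pin%. The final ordering assumes the rules are ranked by support.

```r
# Hypothetical rule table; labels and numbers are made up.
rules_df <- data.frame(
  rhs = c("{native.country= United-States}", "{race= White}",
          "{native.country= United-States}", "{native.country= United-States}"),
  support = c(0.60, 0.55, 0.70, 0.52),
  lift    = c(1.02, 1.05, 0.99, 1.01),
  stringsAsFactors = FALSE
)

# Keep rules whose rhs label contains the pattern and whose lift exceeds 1.
keep <- grepl("native.country= United-States", rules_df$rhs, fixed = TRUE) &
  rules_df$lift > 1
rules_sub <- rules_df[keep, ]

# Rank the remaining rules by support, highest first.
rules_sub <- rules_sub[order(rules_sub$support, decreasing = TRUE), ]
print(rules_sub)
```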



PART 3


In this part we will try to predict the class using the strongest rules.


8. Locate all rules that have the class attribute on the right-hand side.

> rules.sub <- subset(rules, subset = rhs %pin% "Class")

> rules.sub

> inspect(sort(rules.sub, by = "support")[1:10])

> inspect(sort(rules.sub, by = "confidence")[1:10])


How many have Class = >50K and how many Class = <=50K? Comment on the support of the most
confident rules. How confident are you about the ones with the highest support? How useful are the
rules for classifying income?


9. Are there any attributes you would exclude from this analysis?


Consider looking at the distribution of certain attributes.

> summary(adult)


Are there any heavily weighted towards one value? Which ones? Did you notice these attributes in the
rules? How do they impact support and confidence?
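Besides reading the summary output, the share of the most frequent value can be computed for each attribute directly. The sketch below does this on a small hypothetical data frame of factors (the values are made up); applied to adult, the same sapply() call would give one number per attribute.

```r
# Hypothetical data frame of factors.
toy <- data.frame(
  native.country = factor(c("United-States", "United-States", "United-States",
                            "United-States", "Mexico")),
  sex = factor(c("Male", "Female", "Male", "Female", "Male"))
)

# For each attribute, the proportion of records taking its most common value.
top_share <- sapply(toy, function(x) max(prop.table(table(x))))
print(sort(top_share, decreasing = TRUE))
```

An item that appears in most transactions clears a minimum support threshold on its own and, as a right-hand side, yields high confidence almost regardless of the left-hand side.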


Is the class attribute weighted heavily towards one value? How might you change this attribute for our
analysis (if you had the original income information)?


10. Repeat the association analysis without these attributes.

Will this create any new rules? Why or why not? How do your results change?


Comment on your findings.

What rules seem potentially useful (and why)?