15 Oct 2013

Lab 9

Creating Association Rules (due M 4/30)

In this lab we will use the apriori algorithm to create association rules for the Adult data set available on the UCI Machine Learning Repository.

This data set involves Census data and contains a class attribute of whether or not income exceeds $50,000.

PART 1

In this part we turn numerical attributes into nominal and create transaction data out of a data matrix.

1. Download the Adult data set from the UCI repository and create a csv file.

2. Open R and read in the data set.

-

> summ

How many attributes are there? How many records? Which attributes are integers?
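The questions above can be answered with a few base R calls. The following is only a sketch: the three-row CSV text and the name `toy` are invented for illustration (the lab's own file and data-frame name are not shown here), and reading from a text string stands in for reading the real csv file.

```r
# Hypothetical miniature of the Adult csv; the values are made up.
csv_text <- "age,workclass,hours.per.week,Class
39,State-gov,40,<=50K
50,Self-emp,13,<=50K
38,Private,60,>50K"

# stringsAsFactors = TRUE turns text columns into factors.
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

n_records    <- nrow(toy)                              # number of records
n_attributes <- ncol(toy)                              # number of attributes
integer_cols <- names(toy)[sapply(toy, is.integer)]    # integer attributes

str(toy)
print(integer_cols)
```

On the real data set the same three expressions apply once the csv file has been read in.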

3. Remove some attributes and convert integer attributes to factors.

a. Remove fnlwgt and education.num.

-

NULL

-

NULL

b. Group age.

-

labels = c("Young", "Middle", "Older", "Senior"))

c. Group hours.per.week.

week"]] <- ordered(cut(week"]], c(0,25,40,60,168)), labels = c("Part-time", "Full-time", "Over-time", "VeryHigh"))
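To see how these break points assign values, the sketch below applies the same `cut` call to a small made-up vector. The vector `hours` and the name `hours_group` are hypothetical; only the break points and labels come from the lab. Note that `cut` builds right-closed intervals by default, so 25 falls in (0,25] and 40 in (25,40].

```r
# Made-up hours-per-week values, chosen to hit every interval and both edges.
hours <- c(10, 25, 40, 41, 60, 80)

# Intervals are (0,25], (25,40], (40,60], (60,168]; ordered() then renames
# the four levels in that order.
hours_group <- ordered(cut(hours, c(0, 25, 40, 60, 168)),
                       labels = c("Part-time", "Full-time", "Over-time", "VeryHigh"))

print(data.frame(hours, hours_group))
```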

d. Group capital.gain and capital.loss.

gain"]] <- ordered(cut(gain"]], c(-Inf,0, median(gain"]][gain"]]>0]), Inf)), labels = c("None", "Low", "High"))

-

-Inf,0,

"High"))
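The grouping above splits at zero and at the median of the strictly positive values, so "None" means no gain at all and the positive gains are divided into a lower and an upper half. A minimal sketch on made-up numbers (the vector `gain` and the names `threshold` and `gain_group` are hypothetical):

```r
# Made-up capital gains: mostly zero, as is typical for this kind of attribute.
gain <- c(0, 0, 0, 1000, 5000, 20000)

# Median of the positive values only; zeros are excluded from the median.
threshold <- median(gain[gain > 0])

# Intervals are (-Inf,0], (0,threshold], (threshold,Inf].
gain_group <- ordered(cut(gain, c(-Inf, 0, threshold, Inf)),
                      labels = c("None", "Low", "High"))

print(threshold)
print(table(gain_group))
```

Because the intervals are right-closed, a value exactly equal to the median lands in "Low".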

We will need the association rules package arules. Documentation is available on e-reserve.

> library(arules)

4. Create transaction data.

> adult2 <- as( , "transactions")

How many transactions are there? How many items?
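As a rough illustration of what the coercion does, the sketch below converts a tiny made-up data frame of factors into transactions. The data frame `toy` and its values are hypothetical, and the example assumes the arules package is installed. Each row becomes one transaction and each attribute=level pair becomes one item.

```r
library(arules)

# Hypothetical data frame; every column must be a factor (or logical).
toy <- data.frame(
  age = factor(c("Young", "Middle", "Middle", "Senior")),
  sex = factor(c("Male", "Female", "Male", "Female"))
)

toy_trans <- as(toy, "transactions")

n_transactions <- length(toy_trans)               # one per row
n_items        <- length(itemLabels(toy_trans))   # one per attribute=level pair

print(itemLabels(toy_trans))
summary(toy_trans)
```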

PART 2

For this part we will use the apriori algorithm to create association rules from the census data.

5. Create association rules. Use minsup=0.5 and minconf=0.85.

> rules <- apriori(adult2, parameter = list(supp = 0.5, conf = 0.85, target = "rules", minlen=2))

> summary(rules)

> inspect(rules)

How many rules met the minimum threshold criteria? What interestingness measures are provided by default? Which rule has the highest of each?
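When reading the rule output it helps to recall how support, confidence and lift are defined. The sketch below computes them by hand in base R for a single rule A => B. The logical vectors `A` and `B` are made up: each element says whether a transaction contains the left-hand or right-hand side.

```r
# Hypothetical item indicators for 8 transactions.
A <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
B <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

support    <- mean(A & B)          # fraction of transactions containing A and B
confidence <- support / mean(A)    # estimate of P(B | A)
lift       <- confidence / mean(B) # confidence relative to the base rate of B

# Thresholds from step 5 of the lab.
passes <- support >= 0.5 && confidence >= 0.85

print(c(support = support, confidence = confidence, lift = lift))
print(passes)
```

In this toy case the rule clears the support threshold (0.625) but not the confidence threshold (about 0.833), so it would not be reported.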

>

6. Add interestingness measures.

-coefficient.

> quality(rules) <- cbind(quality(rules), IS = interestMeasure(rules, method =

Which rule has the highest of each of the new measures?
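Assuming the column named IS refers to the IS (cosine) measure, it can be computed by hand as the joint support divided by the geometric mean of the two individual supports. The sketch below does this in base R on made-up indicator vectors. The names `A`, `B` and `is_measure` are hypothetical and the code does not use arules.

```r
# Hypothetical item indicators for 8 transactions.
A <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
B <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

support_A  <- mean(A)
support_B  <- mean(B)
support_AB <- mean(A & B)

# IS (cosine): s(A,B) / sqrt(s(A) * s(B)).
is_measure <- support_AB / sqrt(support_A * support_B)

# Equivalent form: sqrt(lift * support).
lift   <- support_AB / (support_A * support_B)
is_alt <- sqrt(lift * support_AB)

print(c(IS = is_measure, IS_from_lift = is_alt))
```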

7. Find the rules with lift above 1 and rhs = “native.country= United-States”. What are the top 3 rules?

> rules.sub <- subset(rules, subset = rhs %pin% "native.country= United-States" & lift > 1)

> inspect(sort(rules.sub)[1:3])
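The same mine, filter and sort pattern can be tried end to end on a tiny example before running it on the census data. The sketch below assumes the arules package is installed. The data frame `toy`, its attributes and the choice to sort by lift are all illustrative and not part of the lab.

```r
library(arules)

# Hypothetical data: six records, two nominal attributes.
toy <- data.frame(
  weather = factor(c("sunny", "sunny", "rainy", "sunny", "rainy", "sunny")),
  play    = factor(c("yes", "yes", "no", "yes", "no", "yes"))
)
toy_trans <- as(toy, "transactions")

# Same thresholds as step 5; verbose = FALSE only silences the progress log.
rules <- apriori(toy_trans,
                 parameter = list(supp = 0.5, conf = 0.85,
                                  target = "rules", minlen = 2),
                 control = list(verbose = FALSE))

# %pin% does partial matching on item labels, so "play=" matches any play item.
rules_sub <- subset(rules, subset = rhs %pin% "play=" & lift > 1)

# Sort by lift and show at most three rules (guards against short rule sets).
top <- sort(rules_sub, by = "lift")
inspect(top[seq_len(min(3, length(top)))])
```

Indexing with `seq_len(min(3, length(top)))` avoids an out-of-range subscript when fewer than three rules survive the filter.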

PART 3

In this part we will try to predict the class using the strongest rules.

8. Locate all rules that have

> rules.sub <- subset(rules, subset = rhs %pin% "Class")

> rules.sub

> inspect(sort(rules.sub, by = "support")[1:10])

> inspect(sort(rules.sub, by = "confidence")[1:10])

How many have Class = >50K and how many Class = <=50K? Comment on the support of the most confident rules. How confident are you about the ones with the highest support? How useful are the rules for classifying income?

9. Are there any attributes you would exclude from this analysis?

Consider looking at the distribution of certain attributes.

Are there any heavily weighted towards one value? Which ones? Did you notice these attributes in the rules? How do they impact support and confidence?

Is the class attribute weighted heavily towards one value? How might you change this attribute for our analysis (if you had the original income information)?
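One way to look at these distributions is to compute, for each attribute, the share of records held by its most common value. The sketch below does this in base R on a made-up data frame. The names `toy`, `dominant_share` and `skewed`, and the 0.85 cut-off (borrowed from the minconf value in step 5), are illustrative choices only.

```r
# Hypothetical data: one heavily skewed attribute and one balanced attribute.
toy <- data.frame(
  native.country = factor(c(rep("United-States", 9), "Mexico")),
  sex            = factor(c(rep("Male", 5), rep("Female", 5)))
)

# Relative frequency of the most common level in each column.
dominant_share <- sapply(toy, function(col) max(prop.table(table(col))))

# Attributes whose most common value covers at least 85% of the records.
skewed <- names(dominant_share)[dominant_share >= 0.85]

print(dominant_share)
print(skewed)
```

An item that appears in 90% of records clears a minimum support of 0.5 on its own, and a rule with that item on the right-hand side needs a confidence of 0.9 just to reach a lift of 1.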

10. Repeat the association analysis without these attributes.

Will this create any new rules? Why or why not? How do your results change?