Data Mining Tools

Oct 15, 2013

Assignment 4

Due: Monday, December 6th, 9:00 am


Data Mining


Data mining is a powerful tool for studying patterns and relations in the large volumes of data that
surround us. In class, we have learned about clustering (unsupervised learning), classification (using
principal component analysis, networks, fuzzy logic, and other learning tools), and models that help to
understand and predict values for new data based on a training data set (using decision trees and
association rules).

There are a number of commercial as well as research products that implement some or all of the above
tools. One such example is Weka, open source software issued under the GNU General Public License.
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be
applied directly to a dataset or called from your own Java code. Weka contains tools for data
pre-processing, classification, regression, clustering, association rules, and visualization. More
advanced programmers and researchers can use Weka to develop new machine learning and data mining
schemes.

What you should do:

1: Follow the link to the official Weka web site, http://www.cs.waikato.ac.nz/ml/weka/, and read the
“Getting started” information. Download and install Weka version weka-3-6-3 for Windows, Mac or
Linux. Note that on Windows x64 both self-extracting executables have been tested:
weka-3-6-3jre.exe (34.8 MB) and weka-3-6-3.exe (20.85 MB). Install and run the software. Remember
that if you are using lab machines, the software must be uninstalled following the completion of the
assignment.

2: Run Weka (Weka 3.6 with console). You should be able to see the basic interface.





3: Sample data sets are downloaded together with Weka and come with the program files. The specific
data file format that Weka works with is .arff. Use the /data/contact-lenses.arff data set for this
assignment. You can browse other datasets if desired. For the lenses data set, demonstrate that you can
perform the following FIVE major Weka functionalities in the Weka EXPLORER package (you may find out
more about them from the on-line help or the comprehensive Weka tutorials under the Documentation
links).
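
For orientation, an .arff file is plain text: a header with @relation and @attribute declarations,
followed by an @data section with one comma-separated row per instance (lines starting with % are
comments). The sketch below is a minimal illustration in plain Java, not Weka's own loader. The class
name ArffSketch and the three-row sample are made up for this example and are not the contents of
contact-lenses.arff; quoted attribute names and sparse rows are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class ArffSketch {
    // Made-up sample for illustration only (NOT the real contact-lenses.arff).
    public static final String SAMPLE = String.join("\n",
        "% toy example",
        "@relation toy-lenses",
        "@attribute age {young, presbyopic}",
        "@attribute astigmatism {no, yes}",
        "@attribute lenses {soft, hard, none}",
        "@data",
        "young,no,soft",
        "young,yes,hard",
        "presbyopic,no,none");

    /** Attribute names declared in the header, in declaration order. */
    public static List<String> attributeNames(String arff) {
        List<String> names = new ArrayList<>();
        for (String raw : arff.split("\n")) {
            String line = raw.trim();
            if (line.toLowerCase().startsWith("@attribute")) {
                // Layout assumed: "@attribute <name> <type or {nominal values}>"
                String[] parts = line.split("\\s+", 3);
                names.add(parts[1]);
            }
        }
        return names;
    }

    /** Number of data rows after the @data marker; comments and blank lines are skipped. */
    public static int countInstances(String arff) {
        boolean inData = false;
        int count = 0;
        for (String raw : arff.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue;
            }
            if (line.equalsIgnoreCase("@data")) {
                inData = true;
                continue;
            }
            if (inData) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(attributeNames(SAMPLE));
        System.out.println(countInstances(SAMPLE));
    }
}
```

Opening the real data file in a text editor and comparing it with this layout is a quick way to see
which attributes and how many instances the Explorer should report after loading.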


A. [6p total] Run the Classify utility. For the test option, choose “Use training set”.

A1. [1p] For the PRISM classifier, report the resulting Prism rules.

A2. [3p] For Decision table, report the number of correctly classified instances, the number of
incorrectly classified instances and the mean absolute error.

A3. [2p] For the Ridor classifier, report the number of rules and list them all.
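
To help interpret the numbers requested in A2, the sketch below computes a count of correctly
classified instances and a mean absolute error on toy predictions. It is an illustration only, not
Weka's code. The class name EvalSketch is hypothetical, and the error definition used here (absolute
difference between each predicted class probability and the 0/1 indicator of the true class, averaged
over classes and then over instances) is an assumption that should be checked against the Weka
documentation.

```java
class EvalSketch {
    /** Counts instances whose predicted class index equals the actual class index. */
    public static int countCorrect(int[] actual, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) {
                correct++;
            }
        }
        return correct;
    }

    /**
     * Mean absolute error for nominal classes under the ASSUMED definition:
     * for each instance, average |p_j - indicator(j is the true class)| over
     * all classes j, then average that value over all instances.
     */
    public static double meanAbsoluteError(int[] actual, double[][] probabilities) {
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double perInstance = 0.0;
            for (int j = 0; j < probabilities[i].length; j++) {
                double target = (actual[i] == j) ? 1.0 : 0.0;
                perInstance += Math.abs(probabilities[i][j] - target);
            }
            total += perInstance / probabilities[i].length;
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        int[] actual = {0, 1};
        int[] predicted = {0, 0};
        // Toy class-probability estimates, one row per instance.
        double[][] probabilities = {{1.0, 0.0}, {0.5, 0.5}};
        System.out.println(countCorrect(actual, predicted));
        System.out.println(meanAbsoluteError(actual, probabilities));
    }
}
```

Under this definition a classifier can classify an instance correctly and still contribute to the
error whenever its predicted probability for the true class is below 1.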


B. [6p total] Run the Cluster functionality. Run the DBScan, Hierarchical Clustering and SimpleKMeans
methods (on the training set). Store clusters for visualization. For Hierarchical Clustering and
SimpleKMeans, choose the number of clusters to be 5 (by clicking on the string with parameters next to
the CHOOSE button, below Clusterer) [2p]. Use the default settings for DBScan [1p]. Report the
clustering results for all 3 methods (clusterer’s output) [3p].
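
As background for the SimpleKMeans run, the sketch below shows the generic k-means loop (assign each
point to its nearest centroid, then move each centroid to the mean of its points) on one-dimensional
toy numbers. It is not Weka's SimpleKMeans: the class name KMeansSketch, the fixed starting centroids
and the data are assumptions for illustration, and the lenses attributes are nominal, which Weka
handles in its own way.

```java
import java.util.Arrays;

class KMeansSketch {
    /** Assigns each point the index of its nearest centroid (ties go to the lower index). */
    public static int[] assign(double[] points, double[] centroids) {
        int[] labels = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (Math.abs(points[i] - centroids[c]) < Math.abs(points[i] - centroids[best])) {
                    best = c;
                }
            }
            labels[i] = best;
        }
        return labels;
    }

    /** Moves each centroid to the mean of its points; an empty cluster keeps its old centroid. */
    public static double[] update(double[] points, int[] labels, double[] old) {
        double[] sum = new double[old.length];
        int[] count = new int[old.length];
        for (int i = 0; i < points.length; i++) {
            sum[labels[i]] += points[i];
            count[labels[i]]++;
        }
        double[] next = old.clone();
        for (int c = 0; c < old.length; c++) {
            if (count[c] > 0) {
                next[c] = sum[c] / count[c];
            }
        }
        return next;
    }

    /** Repeats assign/update until the centroids stop changing or maxIter is reached. */
    public static double[] run(double[] points, double[] initial, int maxIter) {
        double[] centroids = initial.clone();
        for (int iter = 0; iter < maxIter; iter++) {
            double[] next = update(points, assign(points, centroids), centroids);
            if (Arrays.equals(next, centroids)) {
                break;
            }
            centroids = next;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 10.0, 11.0};
        double[] centroids = run(points, new double[]{0.0, 5.0}, 100);
        System.out.println(Arrays.toString(centroids));
        System.out.println(Arrays.toString(assign(points, centroids)));
    }
}
```

The number of clusters here is simply the length of the initial centroid array, which plays the same
role as the number of clusters set in the Explorer.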


C. [4p total] Run the Associate functionality with the Apriori associator on the lenses dataset.
Provide written answers from the resulting run.

C1 [1p] What is the minimum support reported?

C2 [1p] Minimum confidence?

C3 [1p] Generated sets of large itemsets?

C4 [1p] Best rules found?
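
The two quantities asked about in C1 and C2 can be illustrated on a toy table. In the sketch below, the
support of an itemset is the number of rows that contain all of its items, and the confidence of a rule
A => B is support(A and B) divided by support(A). The class name AssocSketch and the four rows are made
up for illustration; they are not taken from the lenses data, and this is not the Apriori search
itself, only the measures it reports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssocSketch {
    /** Builds one row as a set of "attribute=value" items. */
    public static Set<String> row(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Made-up rows for illustration only. */
    public static List<Set<String>> toyRows() {
        List<Set<String>> rows = new ArrayList<>();
        rows.add(row("tear=reduced", "lenses=none"));
        rows.add(row("tear=reduced", "lenses=none"));
        rows.add(row("tear=normal", "lenses=soft"));
        rows.add(row("tear=normal", "lenses=none"));
        return rows;
    }

    /** Number of rows that contain every item of the given itemset. */
    public static int supportCount(List<Set<String>> rows, Set<String> items) {
        int count = 0;
        for (Set<String> r : rows) {
            if (r.containsAll(items)) {
                count++;
            }
        }
        return count;
    }

    /** Confidence of antecedent => consequent; 0.0 if the antecedent never occurs. */
    public static double confidence(List<Set<String>> rows,
                                    Set<String> antecedent,
                                    Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        int antecedentCount = supportCount(rows, antecedent);
        if (antecedentCount == 0) {
            return 0.0;
        }
        return (double) supportCount(rows, both) / antecedentCount;
    }

    public static void main(String[] args) {
        List<Set<String>> rows = toyRows();
        System.out.println(supportCount(rows, row("tear=reduced")));
        System.out.println(confidence(rows, row("tear=reduced"), row("lenses=none")));
        System.out.println(confidence(rows, row("tear=normal"), row("lenses=soft")));
    }
}
```

A rule is only reported when both its support and its confidence reach the minimum values, which is
why those two thresholds appear at the top of the associator output.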


D. [2p total] Select Attributes functionality: choose Principal Component Analysis as the attribute
evaluator, with the Ranker search method (parameters chosen by the system). Use the full training set.
Provide a screenshot of the Attribute selection output (with correlation matrix, eigenvalues and
eigenvectors). No discussion of the output is needed.
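
For readers curious about the correlation matrix in that output, the sketch below computes Pearson
correlations between numeric columns in plain Java. It is an illustration only: the class name
CorrelationSketch and the toy columns are made up, the eigenvalue and eigenvector step is not shown,
and nominal attributes such as those in the lenses data would first need a numeric encoding, which is
left to Weka here.

```java
class CorrelationSketch {
    /** Pearson correlation of two equally long columns; NaN if either column is constant. */
    public static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("columns must be non-empty and equally long");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Symmetric matrix of pairwise correlations; columns[k] holds the values of attribute k. */
    public static double[][] correlationMatrix(double[][] columns) {
        int k = columns.length;
        double[][] matrix = new double[k][k];
        for (int a = 0; a < k; a++) {
            for (int b = 0; b < k; b++) {
                matrix[a][b] = pearson(columns[a], columns[b]);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Toy numeric columns, made up for illustration.
        double[][] columns = {{1.0, 2.0, 3.0}, {2.0, 4.0, 6.0}, {6.0, 4.0, 2.0}};
        for (double[] rowValues : correlationMatrix(columns)) {
            System.out.println(java.util.Arrays.toString(rowValues));
        }
    }
}
```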



E. [2p total] Provide a screenshot of the visualization for the lenses dataset (with all attributes).
Please change Plot Size, Jitter and Colors from the defaults (no need for multiple screenshots; one is
enough for one chosen setting).

Expand one chosen quadrant to show the X/Y point distribution (one only).


Bonus: [3p] There is a variety of applications and projects that use Weka. The full list is available
under the Further Information > Related Projects menu item.

Some interesting examples are:

WekaMetal - a meta-learning extension to Weka.
Tertius: a system for rule discovery.
MARFF - extension of ARFF for Multi-Relational Applications.
TClass - classifying multivariate time series.
Bayesian Network Classifiers - with bindings for Weka.
Weka on Text - software for text mining.
Judge - software for document classification and clustering.
Fuzzy algorithms - for clustering and classification.
Agent Academy - Java integrated development framework for creating Intelligent Agents and Multi Agent Systems
GeneticProgramming - Genetic Programming Classifier for Weka
Weka-GDPM - extended version of Weka 3.4 to support automatic geographic data preprocessing for spatial data mining.
OpenSubspace - An open source framework for evaluation and exploration of subspace clustering algorithms in WEKA
Olex-GA - A genetic algorithm for the induction of rule-based text classifiers
Graph RAT - A framework for combining graph and non-graph algorithms
TUBE - Tree-based Density Estimation Algorithms

Your goal is to choose ONE from the above REDUCED LIST of applications (there are more links on the
web site, but they are less relevant to course material), run it and answer the questions below:

Written description:

Q1. [1p] Name of the chosen Weka project from the above list and a one-sentence justification of why
this project/topic was chosen.

Q2. [1p] Main functionality of the project chosen (one paragraph).

Q3. [1p] Examples of applications (i.e. which data sets/databases can be studied with this tool).

Q4. [1p] Your experience with how easy it was to run it, or whether it was possible at all.

NOTE: due to the highly distributed and complex nature of the projects, some links might be
deactivated during the course of the assignment. If the issue persists, please choose an alternative
project and inform your TA that the project is no longer available.

What to submit

Submit a WRITTEN REPORT as a .doc or .pdf file to your TA, according to TA requirements. The course
late assignment policy allows for up to 2 days late submission, based on the date and time it is
received by your TA, with a penalty of 10% of your mark for each late day.

A sample file for testing your program may be provided by your TA.

Collaboration

The assignment must be done individually, so everything that you hand in must be your original work.
Copying another student's work is academic misconduct.