Data Mining Tools
, 9:00 am
Data mining is a powerful tool for studying patterns and relations in the large volumes of data that surround us. In
class, we have learned about clustering (unsupervised learning), classification (using principal
component analysis, networks, fuzzy logic, and other learning tools), and models that help to
understand and predict values for new data based on a training data set (for example, using decision trees).
There are a number of commercial as well as research products that implement some or all of the above
tools. One such example is Weka, Open Source software issued under the GNU General Public License. Weka is
a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied
directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing,
classification, regression, clustering, association rules, and visualization. More advanced programmers
and researchers can use Weka to develop new machine learning and data mining schemes.
What you should do
Follow the link to the official Weka web site and read the “Getting
started” information. Download and install Weka 3 for Windows, Mac or Linux. Note that the
Windows x64 version has been tested with both self-extracting downloads (4.8 MB and 20.85 MB).
Install and run the software. Remember that if you are
using lab machines, the software must be uninstalled following the completion of the assignment.
Run Weka (Weka 3.6 with console).
You should be able to see a b
Sample databases are downloaded with Weka and stored with the Weka program files. The specific
data file format that Weka works with is .arff. Use the /data/contact-lenses.arff database for this
assignment. You can browse other datasets if desired. For the lenses data set, demonstrate that you can
use the following major Weka functionalities in the Weka EXPLORER PACKAGE (you may learn
more about them in the online help or the comprehensive Weka tutorials under the Documentation links).
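Since the assignment relies on the .arff format, it may help to see how such a file is laid out: a header with one @relation line and one @attribute line per column, followed by @data and comma-separated rows; lines starting with % are comments. The sketch below is a minimal illustration, not Weka's own loader. The function name `parse_arff` and the toy data are hypothetical (the toy file is not the real contact-lenses.arff), and quoting, sparse rows and missing values are deliberately ignored.

```python
def parse_arff(text):
    """Parse a small, simple ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []  # list of (name, list of nominal values) or (name, type string)
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):  # nominal attribute: {a, b, c}
                values = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            else:  # e.g. numeric or string
                attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical toy file, made up for illustration only.
SAMPLE = """% toy example
@relation toy-lenses
@attribute age {young, old}
@attribute tear-rate {reduced, normal}
@attribute lenses {none, soft}
@data
young,normal,soft
old,reduced,none
young,reduced,none
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))
```

Opening the real data file in a text editor and comparing it with this layout is a quick way to check which attribute is the class before running the Explorer.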
Run Classify utility.
For test option choose “Use training set”.
A1. [1p] For PRISM classifier, report the resulting Prism rules.
A2. [3p] For Decision table, report the number of correctly classified instances, the number of
incorrectly classified instances, and the mean absolute error.
A3. [2p] For Ridor classifier, report the number of rules and list them all.
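To interpret the numbers requested in A2, the following sketch shows one way such statistics can be computed by hand. It assumes that, for a nominal class, the mean absolute error is the average absolute difference between the predicted class probabilities and the 0/1 indicator of the true class, averaged over classes and instances; this is our reading of the summary output rather than a statement of Weka's exact implementation. The function name `evaluate` and the toy predictions are hypothetical.

```python
def evaluate(actual, predicted_dists, classes):
    """Return (correct, incorrect, mean_absolute_error) for nominal predictions.

    actual          -- list of true class labels
    predicted_dists -- list of dicts mapping class label -> predicted probability
    classes         -- list of all class labels
    """
    correct = 0
    total_abs_err = 0.0
    for label, dist in zip(actual, predicted_dists):
        # The predicted label is the class with the highest probability.
        best = max(classes, key=lambda c: dist.get(c, 0.0))
        if best == label:
            correct += 1
        # Assumed definition: per-instance error averaged over all classes.
        total_abs_err += sum(
            abs(dist.get(c, 0.0) - (1.0 if c == label else 0.0)) for c in classes
        ) / len(classes)
    n = len(actual)
    return correct, n - correct, total_abs_err / n


# Hypothetical toy predictions, not results from the lenses data set.
CLASSES = ["none", "soft", "hard"]
ACTUAL = ["none", "soft", "hard", "none"]
PREDICTED = [
    {"none": 1.0},
    {"soft": 0.75, "none": 0.25},
    {"none": 1.0},
    {"none": 1.0},
]

correct, incorrect, mae = evaluate(ACTUAL, PREDICTED, CLASSES)
print(correct, incorrect, round(mae, 4))
```

Note that when “Use training set” is selected, these figures describe how well the model fits the data it was built from, not how well it would predict new instances.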
Run Cluster functionality. Run the DBScan, Hierarchical Clustering and SimpleKMeans methods
(on training set). Store clusters for visualization. For Hierarchical Clustering and SimpleKMeans choose the
number of clusters to be
(by clicking the string with parameters next to the CHOOSE button).
Use default settings for DBScan.
Provide clustering results for all 3 methods.
Run Associate functionality with the Apriori associator on the lenses dataset. Provide
answers from the resulting output:
What is the minimum support?
Generated sets of large itemsets?
Best rules found?
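The three quantities asked about above (minimum support, large itemsets, best rules) can be illustrated with a small sketch. It performs a level-wise search for frequent itemsets, which is the core idea of Apriori, but omits the candidate pruning step and the way Weka iteratively lowers the minimum support. The function names and the toy transactions are hypothetical and are not taken from the lenses data set.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets whose relative support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]  # candidates of size 1
    result = {}
    size = 1
    while level:
        kept = {}
        for candidate in level:
            count = support_count(candidate, transactions)
            if count / n >= min_support:
                kept[candidate] = count
        result.update(kept)
        size += 1
        # Next candidates: unions of kept itemsets that have exactly `size` items.
        level = list({a | b for a in kept for b in kept if len(a | b) == size})
    return result


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent ==> consequent."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)


# Hypothetical toy transactions; each item is an attribute=value pair.
TRANSACTIONS = [
    frozenset({"tear=reduced", "lenses=none"}),
    frozenset({"tear=reduced", "lenses=none"}),
    frozenset({"tear=normal", "lenses=soft"}),
    frozenset({"tear=normal", "lenses=none"}),
]

frequent = frequent_itemsets(TRANSACTIONS, min_support=0.5)
rule_conf = confidence(
    frozenset({"tear=reduced"}), frozenset({"lenses=none"}), TRANSACTIONS
)
print(len(frequent), rule_conf)
```

In this toy example the rule tear=reduced ==> lenses=none holds in every transaction that contains its antecedent, which is what a confidence of 1 means in the list of best rules.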
Run Select Attributes functionality. Choose Principal Component Analysis
with the Ranker search method (parameters chosen by the system). Use full training set. Provide a
screenshot of the Attribute selection output (with correlation matrix, eigenvalues and eigenvectors). No
discussion of the output is needed.
Provide a screenshot of the visualization for the lenses dataset (with Jitter and Colors set;
no need for multiple screenshots, one is enough).
Expand one chosen quadrant to show the X/Y point distribution (one only).
There is a variety of applications and projects that use Weka. The full list is available under the
Related Projects menu item.
Some interesting examples are:
learning extension to Weka.
: a system for rule discovery.
extension of ARFF for Multi
classifying multivariate time series.
Bayesian Network Classifiers with bindings for Weka.
Weka on Text: software for text mining.
software for document classification and clustering.
for clustering and classification.
Java integrated development framework for creating Intelligent Agents and Multi Agent Systems
Genetic Programming Classifier for Weka
extended version of Weka 3.4 to support automatic geographic data preprocessing for spatial data mining.
An open source framework for evaluation and exploration of subspace clustering algorithms in WEKA
A genetic algorithm for the induction of rule-based text classifiers
A framework for combining graph and non
based Density Estimation Algorithms
Your goal is to choose ONE from the above REDUCED LIST of applications (there are more links on the
web site, but they are less relevant to course material), run it and
answer the questions below:
Name of the chosen Weka project from the above list
One sentence justification why this project/topic was chosen
Main functionality of the project chosen
Example of applications (i.e. which data sets/databases can be studied with this tool)
Your experience with how easy it was to run it or whether it was possible at all.
Due to the highly distributed and complex nature of the project, some links might be dead
during the course of the assignment. If the issue persists, please choose an alternative project and
inform your TA that the project is no longer available.
What to submit
Submit a WRITTEN REPORT as a .doc or .pdf file to your TA, according to TA requirements.
The assignment policy allows for up to 2 days late submission, based on the date and time it is received by
your TA, with a penalty of 10% of your mark for each late day.
A sample file for testing your program may be provided by your TA.
The assignment must be done individually, so everything that you hand in must be your original work.
Copying another student's work is academic misconduct.