Data Mining Workbenches: an overview
& comparison focusing on open
CS240B notes by C.
Comparing KDD/DM Toolsets
Many packages and very few in
An Evaluation by USDA Forest Service comparing
, Orange, and
Analytics Survey (annual)
by USDA Forest Service (USFS)
Forest Service (USFS
technology to map USFS Forest Inventory and Analysis
(FIA) biomass, forest type, forest type groups, and
National Forest vegetation
The results of the study were reported by: B.
A. J. Lister, H. Fisk and Dan Wendt “
of Open Source
Data Mining Software
Symposium on Forest
and Analysis (FIA
), October 2008
By the University
NZ, in 1993
GNU General Public License (GPL) in 1995.
An extension of the S language (Bell Labs)
packages are supplied with the basic
each including many functions
project.org offers 1,364 additional
packages extending the basic R functionality.
Environment for Knowledge
by the University of Waikato, New Zealand, which
supports the software with funds by the NZ government.
in 1993 and released in 1996. A GPL package
WEKA is a collection of machine
interface tools (R, SQL)
University of Ljubljana,
. Still evolving: frequent new releases
Main routines & libraries in C++, but Python is
used to call the routines
using both scripting and GUI
Orange also has a GUI version called Orange
Canvas, which allows for interactive machine
learning “visual programming”.
Goodnight and North Carolina State University
1970s. In 1976 the
and further develop the
® currently has 10,658
and is the largest
privately held software
company with annual
billion (in 2007)
SAS® is used in 109
different industries, with 44,000
customer sites worldwide.
® is purchased by contacting a distributor
directly: it can
cost several thousand dollars depending
the options. The
the software, technical support, and
licenses, which are renewed regularly, incurring more costs.
How easy is the interface to use and understand?
Are there a variety of models and options available?
How easy to use is the software’s programming language?
Does the software integrate easily with other programs?
How widespread is the software?
of useful features & algorithms
and academic repute
SAS®: The Enterprise Guide for SAS® has a user
system that allows for the building of graphical models.
GUIs also exist for other SAS® modules, but unlike WEKA and
Orange there is no universal GUI for SAS
SAS® is primarily driven by its own programming language, a
new user will require some training
R, like SAS®, is used by numerous industries and thus
has a wide variety of models and options.
R is driven by its own scripting language, which does require
some training and/or experience
GUIs for specific functions only.
WEKA does have a comprehensive GUI with many models and
options available. WEKA’s GUI is easy to use, but users need a good
understanding of modeling techniques.
Familiarity with Java is needed to extend WEKA and link it with other software
WEKA can be expanded and used within R,
Open source data visualization and analysis for novice and
experts. Data mining through visual programming or Python scripting
Orange website (
) Orange has a good website on how
to integrate Orange with Python.
The number of models and options available in Orange lags behind not only
SAS® and R but WEKA as well.
significantly faster than
on classification trees.
Orange is the least stable although new
versions are released monthly
is a stable program, but also does not
work well with large datasets.
recently introduced MOA to
process massive data sets in a stream
Most Popular Data Mining Software
Analytics Survey (Early 2007)
about the tools used often and occasionally.
Clearly more popular than the rest were:
SAS Enterprise Miner
C4.5 / C5.0
Critical Mass and Popularity
Top ten most used packages by KDD Nuggets Survey (May 2007):
SAS Enterprise Miner
Microsoft SQL Server
Note: Microsoft Excel omitted as it's not really "data mining" software, and
I've merged the tools offered by a single vendor (SPSS and SAS)
see the full survey results
Votes from tool vendors were removed.
Comparing with 2008
data mining tools/software used,
the big changes are growth in SPSS,
, and R.
Popular Data Mining Software (cont.)
is taken every year and
summary report can be obtained free
2009 SURVEY HIGHLIGHTS:
and R made substantial movement up
miner’s tool rankings this year, and are now used by large
both academic and for-profit data miners.
SAS Enterprise Miner dropped in data miner’s tool rankings
2010 SURVEY HIGHLIGHTS:
After a steady rise across the past few years, R overtook other tools
to become the tool used by more data
STATISTICA has also been climbing in
STATISTICA, IBM SPSS Modeler, and R received the
satisfaction ratings in both 2010 and 2009.
Witten, I.H.; Frank, E.
Data Mining: Practical machine
learning tools and techniques. 2nd Edition,
WEKA Manual for Version 3.6.0
, G.. “Orange: From
experimental machine learning to interactive data
mining”, 2004. (
R Development Core Team.
A language and
environment for statistical computing
. R Foundation for
Compared to R, WEKA is weaker in classical statistics but
stronger in machine learning (data mining) algorithms.
WEKA has developed a set of extensions covering diverse
areas, such as text mining, visualization and bioinformatics.
WEKA 3.6 includes support for importing PMML models
(Predictive Model Markup Language). PMML is an XML-based
standard for expressing statistical and data mining models.
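Because PMML is plain XML, a model file can be inspected with any XML parser. The sketch below is only an illustration: the PMML fragment is hand-written for this example (it was not exported by WEKA), the field names are hypothetical, and the helper `list_data_fields` is not part of any PMML library. It uses Python's standard `xml.etree.ElementTree` to list the fields declared in the document's `DataDictionary`.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative PMML fragment (assumption: not produced by WEKA;
# field names are hypothetical).
PMML_TEXT = """<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <Header description="toy example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="species" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""


def list_data_fields(pmml_text):
    """Return (name, optype, dataType) for every DataField in a PMML document."""
    root = ET.fromstring(pmml_text)
    # ElementTree writes namespaced tags as '{uri}Tag'; reuse the root's
    # namespace so the lookup does not depend on one PMML version.
    prefix = ""
    if root.tag.startswith("{"):
        prefix = root.tag[: root.tag.index("}") + 1]
    return [
        (field.get("name"), field.get("optype"), field.get("dataType"))
        for field in root.iter(prefix + "DataField")
    ]


fields = list_data_fields(PMML_TEXT)
for name, optype, data_type in fields:
    print(name, optype, data_type)
```

The same approach extends to the model elements of a real PMML file, whose exact structure depends on the model type and PMML version.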
can interface with many systems and formats: SQL,
WEKA has 2 limitations:
Java implementation is somewhat slower than an equivalent in
of the algorithms require all the data to be stored in main
memory, which restricts application to small or medium
MOA: Massive Online Analysis
MOA supports bi-directional interaction with WEKA
to deal with scaling up the implementation of state-of-the-art
algorithms to real-world dataset sizes using a streaming setting
MOA: a software environment for testing algorithms and running
experiments for online learning from evolving data streams
A DSMS will then be required to deploy these algorithms on actual
MOA is not a DSMS
Downloads available under GNU GPL license
Several Data Sets used:
SEA Concepts Generator:
abrupt concept drift
STAGGER Concepts Generator by
: used as
for CVFDT versus VFDT
Random RBF Generator
Function Generator It was introduced by
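As an illustration of abrupt concept drift, the sketch below generates a SEA-style stream in plain Python. It is not MOA's generator: the function names are hypothetical, and the attribute range [0, 10], the rule "positive when x1 + x2 <= threshold", and the four thresholds follow the commonly cited description of the SEA concepts, which is an assumption here rather than something stated in these notes. Class noise is omitted for simplicity.

```python
import random

# Assumed thresholds for the four successive concepts (not from these notes).
SEA_THRESHOLDS = (8.0, 9.0, 7.0, 9.5)


def sea_like_stream(n_per_concept, thresholds=SEA_THRESHOLDS, seed=1):
    """Yield (x, label, theta); theta switches abruptly between blocks."""
    rng = random.Random(seed)
    for theta in thresholds:
        for _ in range(n_per_concept):
            # Three attributes in [0, 10]; only the first two affect the label.
            x = [rng.uniform(0.0, 10.0) for _ in range(3)]
            label = 1 if x[0] + x[1] <= theta else 0
            yield x, label, theta


def positive_rate_per_concept(n_per_concept):
    """Fraction of positive labels in each concept block."""
    positives = [0] * len(SEA_THRESHOLDS)
    for i, (_, label, _) in enumerate(sea_like_stream(n_per_concept)):
        positives[i // n_per_concept] += label
    return [p / n_per_concept for p in positives]


rates = positive_rate_per_concept(2000)
print([round(r, 3) for r in rates])
```

The class balance jumps at each block boundary, which is the kind of sudden change a drift-aware stream learner has to detect and adapt to.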
MOA Currently supports:
Classification and clustering methods
is easily extensible and has a nice GUI
, G. Holmes, R.
DATA STREAM MINING: A Practical Approach
et al.: MOA:
Massive Online Analysis, a
Framework for Stream