Data Mining Workbenches: an overview & comparison focusing on open-source packages

Oct 30, 2013


CS240B notes by C. Zaniolo

Comparing KDD/DM Toolsets

Many packages and very few in-depth comparisons

An evaluation by the USDA Forest Service comparing R, WEKA, Orange, and SAS®

Several user-satisfaction/popularity surveys:

KDnuggets

Rexer Analytics Survey (annual)


An Evaluation of CART Programs by USDA Forest Service (USFS)

USFS uses classification and regression tree (CART) technology to map USFS Forest Inventory and Analysis (FIA) biomass, forest type, forest type groups, and National Forest vegetation.

The results of the study were reported by B. Ruefenacht, G. Liknes, A. J. Lister, H. Fisk, and Dan Wendt, “Evaluation of Open Source Data Mining Software Packages”, Proc. Symposium on Forest Inventory and Analysis (FIA), October 2008, Park City, UT.
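As a rough illustration of what a CART-style program does at each tree node, the sketch below searches one numeric attribute for the threshold that minimizes weighted Gini impurity. It is a minimal sketch, not the code of any package compared here: the function names and the toy canopy-height data are hypothetical, and real CART implementations add recursion, pruning, and support for regression targets.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'.

    Returns (None, impurity of the unsplit node) when no split improves on it.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: canopy height in metres and a made-up vegetation label.
heights = [2.0, 3.5, 4.0, 12.0, 15.5, 18.0]
types = ["shrub", "shrub", "shrub", "conifer", "conifer", "conifer"]
print(best_split(heights, types))  # prints (8.0, 0.0)
```

A full tree is grown by applying the same search recursively to the two subsets produced by the chosen threshold.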



R: (http://www.r-project.org)

By the University of Auckland, NZ, in 1993

Released under the GNU General Public License (GPL) in 1995.

An extension of the S language (Bell Labs)

Twelve packages are supplied with the basic R distribution, each including many functions

http://cran.r-project.org offers 1,364 additional packages extending the basic R functionality.


WEKA: www.cs.waikato.ac.nz/ml/weka/

Waikato Environment for Knowledge Analysis

By the University of Waikato, New Zealand, which supports the software with funds from the NZ government.

Started in 1993 and released in 1996. A GPL package

WEKA is a collection of machine-learning algorithms implemented in Java, plus data preprocessing tools, visualization tools, and interface tools (R, SQL)


Orange: www.ailab.si/orange/

By the University of Ljubljana, Slovenia, in 2004, under GPL. Still evolving: frequent new releases

Main routines & libraries are in C++, but Python is used to call the routines and access the libraries (www.ailab.si/orange/doc/ofb/).

Users can add their own machine-learning algorithms using both scripting and GUI environments

Orange also has a GUI version called Orange Canvas, which allows for interactive machine-learning “visual programming”.



SAS® (Statistical Analysis System)

By Jim Goodnight and North Carolina State University associates in the early 1970s. In 1976 the SAS Institute was founded to distribute and further develop the increasingly popular software.

SAS® currently has 10,658 employees and is the largest privately held software company, with annual revenue of $2.15 billion (in 2007). SAS® is used in 109 countries across different industries, with 44,000 customer sites worldwide.

SAS® is purchased by contacting a distributor directly: it can cost several thousand dollars depending on the options. The purchase includes the software, technical support, and licenses, which are renewed regularly, incurring more costs.


Evaluation Criteria


Cost


Usability:


How easy is the interface to use and understand?


Are there a variety of models and options available?


How easy to use is the software’s programming language?


Does the software integrate easily with other programs?


Performance w.r.t. speed, stability, and accuracy.

Critical Mass: how widespread is the software?

Uniqueness of useful features & algorithms

Defensibility w.r.t. citations and academic repute




Usability


SAS®: The Enterprise Guide for SAS® has a user-friendly GUI system that allows for the building of graphical models.

GUIs also exist for other SAS® modules, but unlike WEKA and Orange there is no universal GUI for SAS®.

SAS® is primarily driven by its own programming language, so a new user will require some training.

R, like SAS®, is used by numerous industries and thus has a wide variety of models and options.

R is driven by its own scripting language, which does require some training and/or experience.

R has GUIs for specific functions only.


Usability (Cont.)


WEKA has a comprehensive GUI with many models and options available. WEKA’s GUI is easy to use, but users need a good understanding of modeling techniques.

Familiarity with Java is needed to extend WEKA and link it with other software programs.

WEKA can also be expanded and used within R.

Orange: open-source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting.

The Orange website (http://www.ailab.si/orange/) has good material on how to integrate Orange with Python.


The number of models and options available in Orange lags behind not only
SAS® and R but WEKA as well.


Performance notes


R is significantly faster than WEKA and Orange on classification trees.

Orange is the least stable, although new versions are released monthly.

WEKA is a stable program, but it does not work well with large datasets.

The WEKA team recently introduced MOA to process massive data sets in a stream-like mode.



Evaluation Results


Most Popular Data Mining Software


The Rexer Analytics Survey (early 2007) asked about the tools used often and occasionally. Clearly more popular than the rest were:

SPSS or SPSS Clementine

"Own Code"

SAS or SAS Enterprise Miner

Followed by:

R

Weka

C4.5 / C5.0



Critical Mass and Popularity

Top ten most used packages according to the KDnuggets survey (May 2007):

SPSS / SPSS Clementine

Salford Systems CART/MARS/TreeNet/RF

Yale (now RapidMiner)

SAS / SAS Enterprise Miner

Angoss Knowledge Studio / Knowledge Seeker

KXEN

Weka

R

Microsoft SQL Server?

MATLAB?

Note: Microsoft Excel omitted as it's not really "data mining" software, and I've merged the tools offered by a single vendor (SPSS and SAS)

You can see the full survey results



Comments by Gregory Piatetsky-Shapiro, KDnuggets Editor:

Votes from tool vendors were removed.

Compared with the 2008 KDnuggets poll on data mining tools/software used, the big changes are growth in SPSS, RapidMiner, and R.

Popular Data Mining Software (cont.)


The Rexer Analytics Survey is taken every year, and the summary report can be obtained free.

2009 SURVEY HIGHLIGHTS:

Open-source tools Weka and R made substantial movement up data miners’ tool rankings this year, and are now used by large numbers of both academic and for-profit data miners.

SAS Enterprise Miner dropped in data miners’ tool rankings.

2010 SURVEY HIGHLIGHTS:

R: After a steady rise across the past few years, R overtook other tools to become the tool used by more data miners (43%) than any other.

STATISTICA has also been climbing in the rankings.

STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.


Selected References


Witten, I.H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, 2005.

Bouckaert, R.R.; et al. WEKA Manual for Version 3.6.0. 2008.

Demsar, J.; Zupan, B.; Leban, G. “Orange: From experimental machine learning to interactive data mining”, 2004. (http://www.ailab.si/orange).

R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2008.


About Weka

In comparison to R, WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.

WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization and bioinformatics.

WEKA 3.6 includes support for importing PMML models (Predictive Modeling Markup Language). PMML is an XML-based standard for expressing statistical and data mining models.

WEKA can interface with many systems and formats: SQL, LibSVM, SVM-Light, ….

WEKA has 2 limitations:

The Java implementation is somewhat slower than an equivalent in C/C++.

Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.
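Because PMML is plain XML, a model file can be inspected with any XML parser. The sketch below lists the fields declared in a PMML data dictionary using only the Python standard library. The embedded fragment is hand-written and heavily simplified for illustration (a real PMML file would also contain a model element), the function name is hypothetical, and this is not how WEKA's own importer works.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified PMML-style fragment (illustrative assumption only).
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="toy model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="height" optype="continuous" dataType="double"/>
    <DataField name="forest_type" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""


def list_data_fields(pmml_text):
    """Return (name, optype, dataType) for every DataField element in the document."""
    root = ET.fromstring(pmml_text)
    fields = []
    for elem in root.iter():
        # ElementTree reports tags as '{namespace}Tag'; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "DataField":
            fields.append((elem.get("name"), elem.get("optype"), elem.get("dataType")))
    return fields


for name, optype, data_type in list_data_fields(PMML_TEXT):
    print(name, optype, data_type)
```

Matching on the local tag name keeps the sketch independent of the PMML version namespace used in a given file.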

MOA: Massive Online Analysis


MOA supports bi-directional interaction with WEKA, to deal with scaling up the implementation of state-of-the-art algorithms to real-world dataset sizes using a streaming setting

MOA: a software environment for testing algorithms and running experiments for online learning from evolving data streams

A DSMS (data stream management system) will then be required to deploy these algorithms on actual data streams: MOA is not a DSMS
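To make "online learning from a data stream" concrete, the sketch below evaluates a learner in test-then-train (often called prequential) fashion: each arriving example is first used to test the current model and only then to update it, so the whole stream never has to sit in main memory. This is a generic illustration in Python, not MOA's API. The class and function names are hypothetical, and the majority-class learner is a deliberately trivial stand-in for a real stream classifier.

```python
class MajorityClassLearner:
    """Trivial online learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(stream, learner):
    """Test on each example first, then train on it; return overall accuracy."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn(x, y)  # the example is discarded after this update
        total += 1
    return correct / total if total else 0.0


def toy_stream():
    """Hypothetical stream of (features, label) pairs produced one at a time."""
    for label in [1, 1, 0, 1]:
        yield (0.0,), label


print(prequential_accuracy(toy_stream(), MajorityClassLearner()))  # prints 0.5
```

Only the learner's summary statistics are kept between examples, which is what lets this style of evaluation scale to streams far larger than memory.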







Downloads available under the GNU GPL

Several data sets used:

SEA Concepts Generator: artificial dataset with abrupt concept drift

STAGGER Concepts Generator by Schlimmer and Granger

Rotating Hyperplane: used as testbed for CVFDT versus VFDT

Random RBF Generator

Waveform Generator

Function Generator, introduced by Agrawal et al.
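The idea of a generator with abrupt concept drift can be sketched in a few lines. The example below is a simplified, SEA-like stream written from scratch, not MOA's generator: each example has three attributes drawn uniformly from [0, 10], the label depends only on whether the first two sum to at most a threshold, and the threshold changes abruptly between blocks. The function name, the default threshold values, and the absence of label noise are assumptions made for illustration.

```python
import random


def sea_like_stream(n_per_concept, thresholds=(8.0, 9.0, 7.0, 9.5), seed=42):
    """Yield (attributes, label) pairs; the labelling rule changes abruptly per block."""
    rng = random.Random(seed)
    for theta in thresholds:  # each threshold defines one concept
        for _ in range(n_per_concept):
            x = tuple(rng.uniform(0.0, 10.0) for _ in range(3))
            # Only the first two attributes matter; the third is irrelevant.
            yield x, int(x[0] + x[1] <= theta)


# Fraction of positive labels in each block shifts when the concept changes.
examples = list(sea_like_stream(1000))
for block in range(4):
    labels = [y for _, y in examples[block * 1000:(block + 1) * 1000]]
    print(block, sum(labels) / len(labels))
```

A learner trained on one block will typically lose accuracy right after a switch, which is the behaviour such generators are meant to expose.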


MOA currently supports:

Classification and clustering methods

System is easily extensible and has a nice GUI

Good documentation:

Albert Bifet, G. Holmes, R. Kirkby & B. Pfahringer: DATA STREAM MINING: A Practical Approach. May 2011.

Albert Bifet et al.: MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering (2010)