Data Mining and Cross-Validation


Nov 12, 2013 (3 years and 6 months ago)


Data Mining and Cross-Validation over distributed / grid-enabled
networks: current state of the art

Presented by: Juan Bernal


Introduction to Data Mining

Instructor: Dr. Khoshgoftaar,

Florida Atlantic University

Spring, 2008



Cross-Validation definition and importance

Why is Cross-Validation a computationally intensive task?

Distributing Data Mining processes over a computer network.

WEKA and distributed Data Mining: How it is done

Other Projects implementing grids/distributed networks

Weka Parallel

Grid Weka





Data Mining today is performed on vast amounts of ever-growing
data. The need to analyze and extract information from
databases in different domains demands more computational
resources, with results expected in the minimum amount of
time possible.

Many different projects try to address Data Mining processes
over distributed or grid-enabled networks. All of them attempt
to use all the available computer resources in a grid or
networked environment to reduce the time it takes to obtain
results, and even to increase the accuracy of the results
obtained.

One of the most computationally intensive Data Mining tasks is
Cross-Validation, which is the focus of many grid/distributed
network Data Mining tools.


Cross-Validation (CV) is the standard Data Mining method for
evaluating the performance of classification algorithms, mainly
to estimate the error rate of a learning technique.

In CV a dataset is partitioned into n folds. Each fold in turn
is used for testing and the remainder is used for training. The
procedure of training and testing is repeated n times so that
each partition or fold is used exactly once for testing.

The standard way of predicting the error rate of a learning
technique given a single, fixed sample of data is to use a
stratified 10-fold cross-validation.

Stratification means making sure that, when sampling is done,
each class is properly represented in both training and test
datasets. This is achieved by randomly sampling the dataset
class by class when building the n fold partitions, so that
each fold keeps roughly the class proportions of the full
dataset.

10-Fold Cross-Validation

In a stratified 10-fold Cross-Validation the data is divided
randomly into 10 parts in which the class is represented in
approximately the same proportions as in the full dataset. Each
part is held out in turn and the learning scheme is trained on
the remaining nine-tenths; then its error rate is calculated on
the holdout set. The learning procedure is executed a total of
10 times on different training sets, and finally the 10 error
rates are averaged to yield an overall error estimate.
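
The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not WEKA's implementation: the names stratified_folds, majority_class_learner and cross_validate are hypothetical, and the "learning scheme" is a toy that always predicts the most common training class.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Return k lists of indices with roughly equal class proportions.

    Assumes there are at least k instances, so that no fold is empty.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    next_fold = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out to the folds in turn.
        for idx in indices:
            folds[next_fold % k].append(idx)
            next_fold += 1
    return folds


def majority_class_learner(train_labels):
    """Toy learning scheme: always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _instance: most_common


def cross_validate(instances, labels, k=10, seed=0):
    """Hold out each fold in turn and average the k holdout error rates."""
    folds = stratified_folds(labels, k, seed)
    error_rates = []
    for held_out in folds:
        held = set(held_out)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held]
        predict = majority_class_learner(train_labels)
        errors = sum(predict(instances[i]) != labels[i] for i in held_out)
        error_rates.append(errors / len(held_out))
    return sum(error_rates) / k
```

With 70 instances of one class and 30 of another, every fold receives 7 and 3 of them respectively, so the toy learner misclassifies 3 out of 10 instances in each fold.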

[Figure: 10-fold cross-validation, graphical example]

Why is Cross-Validation a computationally intensive task?

When seeking an accurate error estimate, it is standard
procedure to repeat the 10-fold CV process 10 times. This means
invoking the learning algorithm 100 times, which is a
computationally intensive and time-consuming task.
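
The arithmetic behind this cost can be made explicit. A minimal sketch, assuming equal-sized folds and a hypothetical dataset of 5000 instances (the function name and the dataset size are illustrative, not taken from the slides):

```python
def repeated_cv_cost(n_instances, runs=10, folds=10):
    """Count learner invocations and total training instances for repeated CV."""
    invocations = runs * folds
    # Each invocation trains on every fold except the held-out one
    # (approximate when n_instances is not a multiple of folds).
    train_size = n_instances * (folds - 1) // folds
    return invocations, invocations * train_size


invocations, trained = repeated_cv_cost(n_instances=5000)
print(invocations, trained)  # 100 450000
```

Every one of the 100 invocations is independent of the others, which is what makes the task a natural candidate for distribution.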

Given the nature of Cross-Validation, many researchers have
worked on executing this process more efficiently over grid or
networked computer environments.

Distributing Data Mining processes over a computer network

Different projects, including WEKA, have implemented a way to
distribute Data Mining processes, and in particular
Cross-Validation, over networked computers. In almost all
projects a client-server approach is used, and methods like
Java RMI (Remote Method Invocation) and WSRF (Web Services
Resource Framework) are implemented to allow network
communications between clients and servers.
Also, WEKA is the main tool on which the different projects
are based to achieve Data Mining over computer networks, due
to its easily accessible Java source code and its adaptability.

WEKA distribution of Data Mining
Processes over several computers

The WEKA tool contains a feature to split an experiment
and distribute it across several processors.

Distributing an experiment involves splitting it into
subexperiments that RMI sends to the hosts for execution. The
experiment can be partitioned by dataset, where each
subexperiment is self-contained and applies all schemes to a
single dataset. On the other hand, with few datasets the
partitions can be set by run. For example, a 10 times 10-fold
CV would be split into 10 subexperiments, one per run.
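
The two partitioning strategies can be illustrated with a small Python sketch. It only models the bookkeeping, not WEKA's actual classes or its RMI transport; the function names and the dictionary layout are hypothetical.

```python
def split_by_dataset(datasets, schemes, runs):
    """One subexperiment per dataset: all schemes and all runs on that dataset."""
    return [
        {"datasets": [d], "schemes": list(schemes), "runs": list(runs)}
        for d in datasets
    ]


def split_by_run(datasets, schemes, runs):
    """One subexperiment per run: all schemes on all datasets for that run."""
    return [
        {"datasets": list(datasets), "schemes": list(schemes), "runs": [r]}
        for r in runs
    ]


# A 10 times 10-fold CV on a single dataset: splitting by dataset would give
# only one subexperiment, while splitting by run gives ten.
by_run = split_by_run(["iris"], ["J48", "PART"], range(1, 11))
print(len(by_run))  # 10
```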

This feature is available from the Experimenter section of
the WEKA tool, which is the main section under which
research is done.

Under the Experimenter the ability to distribute
processes is found under the advanced version of the
Setup panel.

WEKA requirements for
distributing experiments

Each host:

Needs Java installed

Needs access to databases to be used

Needs to be running the
weka.experiment.RemoteEngine experiment server

Distributing an experiment works best if the results are sent
to a central database by selecting JDBC as the result
destination. Alternatively, each host can save its results to
a separate ARFF file, and the files can be merged afterwards.

WEKA difficulties for
distributed implementation

File and directory permissions can be difficult to set

Manually installing and configuring each host with the
Weka experimenter server and the remote.policy file,
which grants the remote engine its network permissions.
Manually initializing or starting each host.

Setting up a centralized database server and access.

On the positive side, once all these configurations and
preparations are done, the experiment can be executed and
time can be saved by distributing the workload among the
hosts.

WEKA Experimenter Tutorial:

Other Projects implementing grids / distributed
networks for Data Mining and Cross-Validation

Based on Weka, there are some projects that try to
improve the process of performing data mining and
cross-validation over numerous computers:

Weka Parallel

Grid Weka




Weka-Parallel: Machine Learning in Parallel

Weka-Parallel was created with the intention of being able to run the
cross-validation portion of any given classifier very quickly. This
speed increase is accomplished by simultaneously calculating the
necessary information using many different machines.

To achieve communication from the computer running Weka (client)
to the other computers (servers), Weka-Parallel uses a simple
connection established by the Socket class in the java.net package.
Each server starts a daemon that listens on a port; the socket then
opens a data stream and an object stream to send and receive data.
RMI was not used to manage the client calls that make the servers
run the methods needed to calculate specific folds of CV. Instead, the
client sends integer codes to the servers telling them which methods to
execute.
Each server receives a copy of the dataset, and information on which
fold it has to perform. The client computer maintains an index to
track which fold each server performs, and assigns folds in Round
Robin order.
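
A Round Robin assignment of folds to servers can be sketched as follows. This is a simplified, static illustration in Python, not Weka-Parallel's Java code; the function name and the server names are hypothetical.

```python
def assign_folds_round_robin(num_folds, servers):
    """Map each server to the folds it evaluates, cycling through the servers."""
    assignment = {server: [] for server in servers}
    for fold in range(num_folds):
        # The fold index doubles as the client's position in the server list.
        server = servers[fold % len(servers)]
        assignment[server].append(fold)
    return assignment


plan = assign_folds_round_robin(10, ["s1", "s2", "s3"])
print(plan)  # {'s1': [0, 3, 6, 9], 's2': [1, 4, 7], 's3': [2, 5, 8]}
```

With more folds than servers, each server simply works through several folds, which is why a fine-grained split such as 500 folds spreads evenly over a small number of machines.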


Speedup performance analysis

An experiment was done running the J48 decision tree
classifier with default parameters on the Waveform-5000
dataset from the UCI repository. The Waveform-5000 dataset
contains 5300 points in 21 dimensions, and the goal is to
find the classifier that correctly distinguishes between 3
classes of waves. A 500-fold cross-validation was run using
up to 14 comparable computers.
Weka-Parallel link:


In the Grid-enabled Weka, execution of the following tasks can be
distributed across several computers in an ad-hoc Grid:

Building a classifier on a remote machine.

Testing a previously built classifier on several machines in parallel.

Labeling a dataset using a previously built classifier on several machines in
parallel.

Using several machines to perform parallel cross-validation.

Labeling involves applying a previously learned classifier to an unlabelled
data set to predict instance labels.

Testing takes a labeled data set, temporarily removes class labels, applies
the classifier, and then analyses the quality of the classification algorithm by
comparing the actual and the predicted labels.

Finally, for n-fold cross-validation a labeled data set is partitioned into
n folds, and n training and testing iterations are performed. On each
iteration, one fold is used as a test set, and the rest of the data is used as
a training set. A classifier is learned on the training set and then validated
on the test data.

Grid Weka is similar to the Weka-Parallel project, but allows for performing
more functions in parallel on remote machines (and also includes better load
balancing, fault monitoring, and dataset management).


The labeling function is distributed by partitioning the data
set, labeling several partitions in parallel on different
available machines, and merging the results into a single
labeled data set.

The testing function is distributed in a similar way, with
test statistics being computed in parallel on several machines
for different subsets of the test data.

Distributing cross-validation is also straightforward:
individual iterations for different folds are executed on
different machines.
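
The partition, label in parallel, and merge pattern described for the labeling function can be sketched in Python. This is an illustration only: worker threads stand in for remote machines, the classifier is any callable, and none of the names below come from Grid Weka.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(dataset, parts):
    """Split dataset into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(dataset), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(dataset[start:end])
        start = end
    return chunks


def label_partition(classifier, chunk):
    """Apply a previously built classifier to every instance in one chunk."""
    return [(instance, classifier(instance)) for instance in chunk]


def label_distributed(classifier, dataset, workers=4):
    """Label the chunks in parallel and merge the results in original order."""
    chunks = partition(dataset, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of the input chunks.
        labeled_chunks = pool.map(lambda c: label_partition(classifier, c), chunks)
        return [pair for chunk in labeled_chunks for pair in chunk]
```

Because the chunks are contiguous and results are collected in submission order, the merged output lines up with the original data set without any extra sorting step.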

Grid Weka: Setup details

It uses a custom interface for communication between clients
and servers, utilizing native Java object serialization for data
exchange.

It is mainly run from the Java command line.

It uses a .weka-parallel configuration file on the client computer
to set up the list of servers, in the following format:

PORT=<Port number>

<Machine IP address or DNS name>

<Number of Weka servers running on this machine>

<Max. amount of memory on this machine in Mbytes>

<Machine IP address or DNS name>

For each Weka server, a copy of the Weka software (the .jar
file) is made on the selected machines and the Weka server
class is run as follows: java weka.core.DistributedServer
<Port number>

If a machine is going to run more than one Weka server, each
server should have its own directory so that the generated
results are not mixed together.

Performance analysis between
Weka-Parallel and Grid Weka

Grid Weka sacrifices some performance in exchange for more
features, compared to Weka-Parallel. These features are load
balancing, data recovery/fault monitoring, and more data
mining functions than just cross-validation.

Grid Weka Development:

Grid Weka HowTo:


Inhambu is a distributed object-oriented system that
supports the execution of data mining applications
on clusters of PCs and workstations.

Inhambu is a system that uses the idle resources in
a cluster composed of commodity PCs, supporting
the execution of DM applications based on the
Weka tool.

Its goal is to address scheduling and load sharing,
overloading and contention avoidance, heterogeneity, and
fault tolerance when performing Data Mining processes on a
grid or cluster of computers.

Inhambu: architecture

The architecture of Inhambu implements:

An application layer, which consists of a modified implementation of
Weka with specific components implemented and deployed at the
client and server sides. The client component executes the user
interface and generates DM tasks, while the server contains the
core Weka classes which execute the DM tasks.

A resource management layer, which provides for the execution of
Weka in a distributed environment.

The trader provides publishing and discovery mechanisms for
clients and servers.

Inhambu: Improvements

Scheduling and load sharing: Implementation of static and
dynamic performance indices. Static performance indices are
usually implemented as fixed values that quantify the amounts
of resources and capacities of a machine. Once an index is
created, dynamic performance monitoring keeps it updated.

Overloading and contention avoidance: Implementation of a “best
effort” policy: to avoid overloading a computer, it can only
be chosen to receive load entities if its load index is below a given
threshold. The default threshold is 0.7, chosen from the relationship
between the utilization index and the response time of a computer.

Heterogeneity: Based on the Capacity State Index that is maintained,
distribution of the work can be enhanced in heterogeneous clusters.

Fault tolerance: Checkpointing and recovery were implemented on
the client side.
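
The threshold-based policy can be illustrated with a short Python sketch. Only the 0.7 default comes from the text above. The function names and the spare-capacity formula are assumptions made for illustration and are not Inhambu's actual Capacity State Index.

```python
LOAD_THRESHOLD = 0.7  # default threshold quoted in the text


def eligible_hosts(load_index, threshold=LOAD_THRESHOLD):
    """Keep only the hosts whose load index is below the threshold."""
    return {h: load for h, load in load_index.items() if load < threshold}


def pick_host(load_index, capacity, threshold=LOAD_THRESHOLD):
    """Among eligible hosts, prefer the one with the largest spare capacity.

    Spare capacity is modelled as capacity * (1 - load): an illustrative
    formula, not the index Inhambu actually maintains.
    """
    candidates = eligible_hosts(load_index, threshold)
    if not candidates:
        return None  # best effort: every host is busy, so the caller retries later
    return max(candidates, key=lambda h: capacity[h] * (1.0 - candidates[h]))


load = {"a": 0.9, "b": 0.5, "c": 0.2}
capacity = {"a": 10, "b": 4, "c": 1}
print(pick_host(load, capacity))  # b
```

Host "a" is the most powerful but is excluded because its load is above the threshold, and "b" wins over the idle but slower "c".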

Inhambu: performance against Weka-Parallel

Performance was evaluated by running experiments on 2
real-world databases: Adults Census Income, and a dataset
for the diffuse cell lymphoma.

The first performance test was done using J48 and PART
(results shown in the tables on the original slide).
Inhambu and Weka-Parallel perform roughly similarly for
fine-granularity tasks, and Inhambu performs better than
Weka-Parallel when running tasks whose granularity is
coarser.

Inhambu: Performance on non-dedicated and heterogeneous clusters

Notice that Weka-Parallel can lead to better performance in
the presence of shorter tasks, such as J4.8, due to its low
communication overhead (it uses sockets). Despite the higher
overhead due to the use of RMI, Inhambu has better
performance in the presence of longer tasks.

Inhambu link:


The goal of Weka4WS is to extend Weka to support
remote execution of the data mining algorithms
through the Web Services Resource Framework
(WSRF) Web Services.

To enable remote invocation, all the data mining
algorithms provided by the Weka library are
exposed as a Web Service.

Weka4WS has been developed using the WSRF Java library
provided by Globus Toolkit 4 (GT4), which is an OGSA
(Open Grid Services Architecture) implementation.
Weka4WS structure

In the Weka4WS framework all nodes use the GT4 services for
standard Grid functionalities, such as security and data
management. Those nodes can be distinguished in two
categories:

user nodes, which are the local machines of the users
providing the Weka4WS client software

computing nodes, which provide the Weka4WS Web Services
allowing the execution of remote data mining tasks.

A storage node can be used when a centralized database is
needed.
Weka4WS: Setup details

Weka4WS requires Globus Toolkit 4 on the computing nodes and
only the Java WS Core (a subset of Globus Toolkit) on the user
nodes. Since GT4 only runs on Unix platforms, the computing
nodes need to be Unix or Linux.

The Weka4WS client can be installed on either Unix or Windows.
Due to the nature of the web-service-oriented approach there are
security requirements: Weka4WS runs in a security context and
uses grid-map authorization (only users listed in the service
grid-map can execute it). Authentication using certificates is
required.

On the client computer a machines file is needed that lists all the
computing nodes. This is the only setup/configuration Weka4WS
needs. The format of this file:

# ==================== computing node ====================



container port

gridFTP port



Weka4WS: performance

The performance of Weka4WS was analyzed by executing a
typical data mining task in different network scenarios. In
particular, the execution times of the different steps needed
to perform the overall data mining task were evaluated to
determine the overhead on LAN vs. WAN.

No performance comparisons were done against other
Grid-enabled data mining tools.

Weka4WS paper:


The area of Data Mining and Cross-Validation over Grid-enabled
environments is in constant development.

The latest efforts try to develop and implement standard
frameworks such as the OGSA (Open Grid Services
Architecture) for data mining tools.

From the analysis of each of the presented tools, Weka4WS
appears the most interesting overall. Still, other projects have
positive features that will eventually be consolidated into a
single Grid Data Mining Tool based on Weka.

Further research will focus on enhancing the performance of the
current tools that use RMI and WSRF, to reduce the overhead
caused by communications. A further research topic is the use
of available peer-to-peer or Internet networks to make it
easier to perform data mining tasks over an Internet cluster
available to everyone.