Scalable Clustering on the Data Grid

Patrick Wendel (pjw4@doc.ic.ac.uk)

Moustafa Ghanem

Yike Guo


Discovery Net

Department of Computing

Imperial College, London

20/09/2005

All Hands Meeting, Nottingham

Outline


Discovery Net


Data Clustering


Mining Distributed Data


Description of the strategy


Deployment


Evaluation


Conclusions


Future Work


Discovery Net


Multidisciplinary project funded by the EPSRC under the UK e-Science
programme (started Oct 2002, ended March 2005)



Developed an infrastructure for Knowledge Discovery Services for
integrating and analysing data collected from high throughput devices
and sensors



Applications to:


Life Sciences


High throughput genomics and proteomics


Real-time Environmental Monitoring


High throughput dispersed air sensing technology


Geo-Hazard modelling


Earthquake modelling through satellite imagery



The project covered many areas including infrastructure, applications
and algorithms (text mining)



Produced the Discovery Net platform which aims to integrate, compose,
coordinate and deploy knowledge discovery services using a workflow
technology.


Discovery Net

Using Distributed Computing Resources

[Figure labels: Scientific Information; Scientific Discovery; Literature; Databases; Operational Data; Images; Instrument Data]


e-Science: large-scale science that will increasingly
be carried out through distributed global
collaborations enabled by the Internet.



Data Clustering


We concentrate on a particular class of data mining
algorithms: Clustering



A class of exploratory data mining techniques used to
find groups of points that are similar or close to each
other.




Popular analysis technique. Useful for exploring,
understanding, modelling large data sets




Two main types of clustering:


Hierarchical: Reorganises the data set into a hierarchy
of clusters based on their similarity.


Partition/Model based: Tries to partition the data set into
a number of clusters, or tries to fit a statistical model (e.g.
mixture of Gaussians) to a data set




Successfully applied to sociological data, image
processing and genomic data.



Mining Data on the Grid


Changing environment for data analysis:



From analysing data files held locally (or
close to the algorithm), to using remote data
sources and remote services through
portals, and now towards distributed data
executions.


Distributed data sources:


Data mining processes can now require data
spread across multiple organisations



Service-oriented approach:


High-level functionalities are now available through
well-defined services, instead of providing low-level
(terminal etc.) access to resources



Goal


Design a service-oriented distributed data
clustering strategy:


that can be deployed on a Grid environment
(i.e. a standards-based, service-oriented,
secure distributed environment)


that would allow end users/data analysts
to deploy it easily against their own data sets


Requirements 1/2


Performance issues:


The analysis process using data grids directly and
analysis services must be more efficient than
gathering all the data on my desktop!


Accuracy:


The strategy should at least provide a model more
representative of the overall data set than the models
built from individual fragments


Security:


The deployed strategy should ensure consistent
handling of authentication and authorization aspects
throughout


Privacy:


Restricted access to the data source


Requirements 2/2


Heterogeneity of the resources used and/or connectivity


It’s very unlikely the set of resources involved in the
distributed analysis process will be similar or work over
networks of similar bandwidth


Loose coupling between resources participating in the
distributed analysis


The analyst has less control over what is available/provided
by each data grid or each analysis service. Therefore the
framework should, as much as possible, be unaffected by
minor differences between functionalities provided by each
site.


Service-oriented approach: The deployment of the analysis
process should be based on the co-ordination of high-level
services (instead of a dedicated distributed algorithm, e.g.
an MPI implementation)


Current strategy


We restrict the current framework to the case where
instances are distributed but have the same attributes
on every fragment (i.e. horizontal fragmentation)





Based on the EM-Clustering algorithm (an algorithm that
fits a mixture-of-Gaussians model).


Hierarchical clustering inherently complex to distribute


Statistical approach of EM provides a sound basis to
define a model combination strategy
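
As an illustration of the kind of model each site would fit, here is a minimal sketch of EM for a mixture of Gaussians. It is not the Discovery Net implementation: it assumes 1-D data for brevity, and the names `gaussian_pdf` and `fit_gmm_1d`, the quantile-based initialisation and the fixed iteration count are all choices made for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(points, k, iterations=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Returns a list of (weight, mean, variance) tuples.
    """
    n = len(points)
    ordered = sorted(points)
    overall_mean = sum(points) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in points) / n, 1e-6)
    # Assumed initialisation: means at evenly spaced quantiles of the data.
    means = [ordered[min(n - 1, int((j + 0.5) * n / k))] for j in range(k)]
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in points:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            if total <= 0.0:
                resp.append([1.0 / k] * k)  # numerical underflow fallback
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            if mass < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = mass / n
            means[j] = sum(r[j] * x for r, x in zip(resp, points)) / mass
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, points)) / mass
            variances[j] = max(spread, 1e-6)  # floor avoids a zero variance
    return list(zip(weights, means, variances))
```

Each fitted component is fully described by three numbers, which is what makes a partial model cheap to ship to another site.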


Approach


Generate clustering models at each data
source location (compute near the data)



Transfer partial models in standard format
(PMML) to a combiner site



Normalise the relative weights of each
model



Perform an EM-based method on partial
models to generate a global model.
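
The weight normalisation step could look like the sketch below. The slides do not state the normalisation rule, so this assumes each site's component weights are rescaled by that site's share of the total number of points. The function name `normalise_weights` and the `(n_points, components)` input layout are hypothetical.

```python
def normalise_weights(partial_models):
    """Pool partial models so that all component weights sum to 1.

    partial_models: list of (n_points, [(weight, mean, variance), ...]),
    one entry per site. Assumed rule: scale by the site's share of points.
    """
    total_points = sum(n_points for n_points, _ in partial_models)
    if total_points <= 0:
        raise ValueError("partial models must cover at least one point")
    pooled = []
    for n_points, components in partial_models:
        share = n_points / total_points
        for weight, mean, var in components:
            pooled.append((weight * share, mean, var))
    return pooled
```

Under this assumption a site holding three quarters of the data contributes three quarters of the total weight, regardless of how many components it reports.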



Combining Cluster Models


Derived from the EM-Clustering algorithm
itself


Adapted to take as input the models
generated at each site


Each partial model is treated like a (very)
compressed representation of the fragment
(similar to the two-step approaches of some
scalable clustering algorithms).


More detailed algorithm and formulae in
proceedings
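
The authors' exact algorithm is in the proceedings and is not reproduced here. The sketch below only illustrates the idea of running an EM-like loop over component summaries instead of raw points: it softly assigns each partial component to a global component using the expected log-density, then moment-matches. It assumes 1-D components and already normalised weights. The name `combine_models` and the initialisation are hypothetical.

```python
import math


def combine_models(pooled, k, iterations=30):
    """Reduce pooled (weight, mean, variance) components to k global ones."""
    ordered = sorted(pooled, key=lambda c: c[1])
    n = len(ordered)
    # Assumed initialisation: evenly spaced pooled components, equal weights.
    glob = []
    for j in range(k):
        _, m, v = ordered[min(n - 1, int((j + 0.5) * n / k))]
        glob.append([1.0 / k, m, v])
    for _ in range(iterations):
        # E-step: score = log weight + expected log-density of the partial
        # component N(m, v) under the global component N(M, V).
        resp = []
        for _, m, v in ordered:
            scores = [
                math.log(W) - 0.5 * math.log(2.0 * math.pi * V)
                - 0.5 * ((m - M) ** 2 + v) / V
                for W, M, V in glob
            ]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            resp.append([e / z for e in exps])
        # M-step: moment-match each global component to its assigned mass.
        for j in range(k):
            mass = sum(w * r[j] for (w, _, _), r in zip(ordered, resp))
            if mass < 1e-12:
                continue  # leave an empty component unchanged
            mean = sum(w * r[j] * m for (w, m, _), r in zip(ordered, resp)) / mass
            var = sum(w * r[j] * (v + (m - mean) ** 2)
                      for (w, m, v), r in zip(ordered, resp)) / mass
            glob[j] = [mass, mean, max(var, 1e-6)]
    return [tuple(c) for c in glob]
```

The variance update adds each partial component's own variance to its squared distance from the global mean, so the spread inside a fragment is not lost when components are merged.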


Deployment: Discovery Net


The Discovery Net platform is used to build and deploy this framework.




Implementation based on an open architecture re-using common protocols
and common infrastructure elements (such as the Globus Toolkit).




It also defines its own protocol for workflows, the Discovery Process Markup
Language (DPML), which allows the definition of data analysis workflows to
be executed on distributed resources.



The platform comprises a server that stores and schedules the workflows and
manages the data, and a thick client that supports the workflow construction
process.



This gives the end user the ability to define application-specific workflows
performing such tasks as distributed data mining.



The model combiner is implemented as a workflow activity in Discovery Net


Deployment

[Figure: deployment diagram. Data sources (Source A, Source B, Source C) each undergo partial clustering on Discovery Net servers; the partial models are sent as PMML to the combiner site, which produces the global model.]


Deployment: Workflow


The Discovery Net client enables the composition
and the execution of the distributed process as a
workflow constructed visually.


The execution engine will coordinate the distributed
execution


Accuracy Evaluation: Data Distribution


Comparison of the accuracy of the combined model with the
average accuracy of the partial models against the entire data set
(i.e. have we gained some accuracy by considering the
fragments together?)


Accuracy will strongly depend on how the data is distributed
among different sites. In the evaluation we introduce a
randomness ratio to determine how similar the data distribution
is among fragments.


0 meaning that each site would have data drawn from
different distributions


1 meaning that the data from all fragments are drawn from
the same distribution


Measured by the log-likelihood function of the test data set:


The likelihood function of a data set represents how likely
that data is to follow the distribution function
defined by the model
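
For a mixture model, the log-likelihood of a test set is the sum over points of the log of the weighted component densities. A minimal sketch follows, assuming 1-D Gaussian components stored as `(weight, mean, variance)` tuples. The function name and the underflow floor are choices made for this example.

```python
import math


def log_likelihood(points, model):
    """Log-likelihood of points under a 1-D Gaussian mixture.

    model: list of (weight, mean, variance) tuples whose weights sum to 1.
    """
    total = 0.0
    for x in points:
        density = 0.0
        for weight, mean, var in model:
            density += (weight * math.exp(-0.5 * (x - mean) ** 2 / var)
                        / math.sqrt(2.0 * math.pi * var))
        # Floor guards against log(0) when a point is far from every component.
        total += math.log(max(density, 1e-300))
    return total
```

Values closer to zero indicate a better fit, which is why the curves in the evaluation are all negative.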


Accuracy Evaluation: Data Distribution

[Figure: Randomness ratio effect. x-axis: Ratio (0 to 1.2); y-axis: log-likelihood (-120000 to 0); series: Average log-Likelihood, Combined log-Likelihood]

As expected, the ratio has a huge effect on gained accuracy.
For low ratios, each fragment is less
representative of the complete data set, so the combined
model outperforms the partial ones.
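
The slides do not say how the fragments were generated for a given randomness ratio. One possible construction, shown purely as an assumption, sends each point to a random fragment with probability equal to the ratio and otherwise to the fragment tied to its generating cluster. The name `fragment_by_ratio` and the `(cluster_label, value)` input layout are hypothetical.

```python
import random


def fragment_by_ratio(labelled_points, n_fragments, ratio, seed=0):
    """Split (cluster_label, value) pairs into n_fragments lists.

    ratio = 1.0: every point goes to a uniformly random fragment, so all
    fragments follow the same distribution.
    ratio = 0.0: every point goes to the fragment tied to its cluster label,
    so fragments follow different distributions.
    This mixing rule is an illustrative assumption, not the authors' method.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    rng = random.Random(seed)
    fragments = [[] for _ in range(n_fragments)]
    for label, value in labelled_points:
        if rng.random() < ratio:
            target = rng.randrange(n_fragments)
        else:
            target = label % n_fragments
        fragments[target].append((label, value))
    return fragments
```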


Accuracy Evaluation: Number of fragments

[Figure: log-likelihood (-80000 to 0) against number of fragments (2 to 10); series: Avg Likelihood, Combined Likelihood]

(r = 0.2, 10,000 points, 5 clusters). Accuracy degrades as the
number of fragments increases, but so does the average
accuracy of the models generated from individual fragments.


Accuracy Evaluation: Increasing data size


(r = 0.2, d = 5, 5 fragments). The combined model's accuracy advantage
over the partial ones stays consistent as the data size grows.

[Figure: Increasing data size. x-axis: # instances (0 to 1200000); y-axis: log-likelihood (-5000000 to 0); series: Average log-Likelihood, Combined log-Likelihood]

Performance Evaluation


Performance evaluation is only partially relevant, as the process does not
feed back combined models and partial models are generated near the
data.


The heterogeneity of real deployments is difficult to take into account.


Time in seconds, for an increasing number of fragments


Performance Evaluation


Execution time with lower dimensionality and larger data sets


Conclusions


Encouraging results in terms of accuracy vs.
performance, given the constraints.


But is the trade-off between accuracy and flexibility
(generally the case in distributed data mining)
acceptable?


This should be part of a wider exploratory process,
probably as a first step towards understanding the
data set.


Being part of the Discovery Net platform, the
distributed analysis process can be simply designed
from the Discovery Net client software.


Future Work


First step towards more generic distributed data mining
strategies (classification algorithms, association rules)


Needs evaluation against real data sets!


Possible improvements including:


Refinement through feedback


Use of a more complex intermediate summary
structure for the partial models (e.g. tree structures
containing summary information)


Estimation of the number of clusters (using the Bayesian
Information Criterion)


Plenty of other clustering algorithms to try.
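
For the cluster-count estimation listed among the possible improvements, the Bayesian Information Criterion penalises the log-likelihood by the number of free parameters. The sketch below uses the standard definition, BIC = p ln(n) - 2 ln(L), and the usual parameter count for a full-covariance Gaussian mixture. How it would be wired into the distributed strategy is not covered by the slides, and the function names are hypothetical.

```python
import math


def gmm_param_count(k, d):
    """Free parameters of a k-component, d-dimensional full-covariance GMM."""
    mixing_weights = k - 1              # weights sum to 1
    means = k * d
    covariances = k * d * (d + 1) // 2  # symmetric d x d matrices
    return mixing_weights + means + covariances


def bic(log_lik, n_params, n_points):
    """Bayesian Information Criterion; lower is better with this sign."""
    return n_params * math.log(n_points) - 2.0 * log_lik
```

One would fit models for several candidate values of k, compute `bic` for each, and keep the k with the lowest score.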




Questions?