NII International Internship Project: Distributed Data Clustering

Supervisor: Michael HOULE, Visiting Professor




For data mining applications, the size of main memory typically limits the size of the
data sets that can be analyzed. Avoiding the main-memory limitation necessitates a
choice between the use of external memory (disk) or distributed processing (multiple
cores). Either approach also requires a clustering method that is inherently
decomposable. Relatively few parallelizable clustering methods are known, most of
which involve the partitioning of the data set, the independent clustering of each
partition, and the merging of the resulting clusters across all partitions. For data mining
applications, this divide-and-conquer approach has the effect of missing those very
small aggregations (the nuggets of information) that may prove to be the most
valuable to the user. For example, if the data set is partitioned into 10 subsets for
clustering, any aggregation of 30 points would have (on average) 3 points in any
given partition, typically too few to be recognized as a cluster for that partition.
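
The arithmetic in this example can be checked with a small simulation. The sketch below is illustrative only: it assumes that points are assigned to partitions uniformly at random, and the function names and the threshold `min_fragment` (the smallest fragment a per-partition clustering could recognize) are hypothetical, not taken from the project.

```python
import random
from collections import Counter

def expected_per_partition(aggregation_size, num_partitions):
    """Average number of points of one aggregation that land in each partition."""
    return aggregation_size / num_partitions

def partition_counts(aggregation_size, num_partitions, rng):
    """Assign each point of the aggregation to a uniformly random partition
    (an assumption) and return how many points each partition received."""
    counts = Counter(rng.randrange(num_partitions) for _ in range(aggregation_size))
    return [counts[p] for p in range(num_partitions)]

def fraction_with_recognizable_fragment(aggregation_size, num_partitions,
                                        min_fragment, trials=2000, seed=0):
    """Fraction of random partitionings in which at least one partition holds
    min_fragment or more points of the aggregation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = partition_counts(aggregation_size, num_partitions, rng)
        if max(counts) >= min_fragment:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(expected_per_partition(30, 10))
    print(fraction_with_recognizable_fragment(30, 10, min_fragment=10))
```

Here `expected_per_partition(30, 10)` returns 3.0, matching the figure in the text. The second function estimates how rarely any single partition retains a large fragment of the aggregation under the uniform-assignment assumption.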


The project will investigate the application of the relevant-set correlation (RSC)
clustering model [1] to the clustering of data from distributed databases, in such a way
that the smallest nuggets of information are still preserved. Developed at NII, RSC is
a generic model for clustering that requires no direct knowledge of the nature or
representation of the data. In lieu of such knowledge, the model relies solely on the
existence of an oracle for queries-by-example, that accepts a reference to a data item
and returns a ranked set of items relevant to the query. In principle, the role of the
oracle could be played by any similarity search structure, or even a search engine
whose internal ranking function and relevancy scores are secret. The quality of cluster
candidates, the degree of association between pairs of cluster candidates, and the
degree of association between clusters and data items are all assessed according to the
statistical significance of a form of correlation among pairs of relevant sets and/or
candidate cluster sets.
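
To make the oracle idea concrete, the following is a minimal sketch under stated assumptions. The brute-force function `knn_oracle` is a hypothetical stand-in for the oracle (any similarity search structure could play this role), and `set_correlation` computes the Pearson correlation between the membership indicator vectors of two sets. That is one standard way to correlate sets; whether it coincides exactly with the measure defined in [1] is an assumption here.

```python
import math

def knn_oracle(points, query_index, k):
    """Hypothetical query-by-example oracle: given a reference (index) to an item,
    return the indices of the k most relevant items, ranked by Euclidean distance.
    The query item itself is ranked first; ties are broken by index."""
    q = points[query_index]
    ranked = sorted(range(len(points)),
                    key=lambda i: (math.dist(points[i], q), i))
    return ranked[:k]

def set_correlation(a, b, n):
    """Pearson correlation between the indicator vectors of sets a and b,
    both drawn from a data set of n items."""
    a, b = set(a), set(b)
    denom = math.sqrt(len(a) * (n - len(a)) * len(b) * (n - len(b)))
    if denom == 0:
        raise ValueError("each set must be non-empty and smaller than the data set")
    return (n * len(a & b) - len(a) * len(b)) / denom

if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    r0 = knn_oracle(points, 0, 3)   # relevant set of item 0
    r1 = knn_oracle(points, 1, 3)   # relevant set of item 1
    r3 = knn_oracle(points, 3, 3)   # relevant set of item 3
    print(set_correlation(r0, r1, len(points)))
    print(set_correlation(r0, r3, len(points)))
```

Note that the correlation is computed from set memberships alone, so the clustering side never needs access to the distances or scores used inside the oracle.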


Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC,
has already been developed and demonstrated for very large, high-dimensional
datasets, using a fast approximate similarity search structure (the SASH [2]) as the
oracle [1].

The features of GreedyRSC include:

• The ability to scale to large data sets, both in terms of the number of items and
  the size of the attribute sets.

• Genericity, in its ability to deal with different types of attributes (categorical,
  ordinal, spatial).

• Automatic determination of an appropriate number of clusters, with the user
  specifying as input parameters only the minimum desired cluster size and the
  maximum allowable correlation (proportion of overlap) between pairs of clusters.

• Robustness with respect to noisy data.

• The ability to identify clusters of any size (as small as three items).
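
The role of the two user parameters can be illustrated with a simple greedy filter. This is a sketch only and is not GreedyRSC itself: the candidate scores, the function names, and the overlap measure (shared items divided by the geometric mean of the two sizes) are all assumptions made for illustration.

```python
import math

def overlap(a, b):
    """Hypothetical overlap measure: shared items relative to the
    geometric mean of the two set sizes (both sets must be non-empty)."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def select_clusters(candidates, min_size, max_overlap):
    """Greedily keep candidates in order of decreasing score, skipping any that
    are smaller than min_size or that overlap an already kept cluster by more
    than max_overlap. The number of clusters is not specified in advance.

    candidates: list of (score, set_of_item_ids) pairs; scores are assumed given.
    """
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    selected = []
    for _score, members in sorted(candidates, key=lambda c: c[0], reverse=True):
        if len(members) < min_size:
            continue
        if all(overlap(members, kept) <= max_overlap for kept in selected):
            selected.append(members)
    return selected

if __name__ == "__main__":
    candidates = [
        (0.90, {1, 2, 3, 4}),
        (0.80, {2, 3, 4, 5}),     # overlaps the first candidate heavily
        (0.70, {10, 11, 12}),     # a small but distinct aggregation
        (0.95, {20, 21}),         # below the minimum size
    ]
    print(select_clusters(candidates, min_size=3, max_overlap=0.5))
```

With these inputs the filter keeps `{1, 2, 3, 4}` and `{10, 11, 12}`: the number of clusters falls out of the two thresholds rather than being supplied by the user.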

As currently implemented, GreedyRSC is a batch method that makes use of a single
CPU. It is capable of clustering large data sets through the use of external memory
(disk), by breaking the data into several chunks, each of which can reside in main
memory. However, the computational cost due to the division of the data into
c chunks increases by a factor of c.
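
As a rough illustration of chunked processing, the sketch below answers a single nearest-neighbor query over data that has been split into chunks, holding one chunk at a time and merging partial results. It is a hypothetical brute-force example, not the GreedyRSC or SASH implementation, and it does not reproduce the cost analysis behind the factor of c; it only shows that each query must visit every one of the c chunks.

```python
import heapq
import math

def chunked_knn(chunks, query, k):
    """Return the ids of the k items nearest to `query`, scanning one chunk at a
    time, together with the number of chunks visited.

    chunks: iterable of lists of (item_id, point) pairs; each list stands in for
            a chunk small enough to reside in main memory (an assumption).
    """
    best = []            # running list of (distance, item_id), at most k entries
    chunks_visited = 0
    for chunk in chunks:
        chunks_visited += 1
        local = [(math.dist(point, query), item_id) for item_id, point in chunk]
        # Merge this chunk's results with the best found so far.
        best = heapq.nsmallest(k, best + local)
    return [item_id for _dist, item_id in best], chunks_visited

if __name__ == "__main__":
    chunks = [
        [(0, (0, 0)), (1, (5, 5))],
        [(2, (1, 0)), (3, (9, 9))],
        [(4, (0, 2)), (5, (7, 7))],
    ]
    print(chunked_knn(chunks, (0, 0), k=3))
```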


Some of the topics that may be investigated within the scope of an internship project
include (but are not limited to):

• Scalability of similarity search to extremely large and extremely high-dimensional
  data sets. This may include the development of parallel or distributed versions
  of existing sequential designs, or the development of fast similarity join
  algorithms.

• Extensions and modifications of existing parallel designs of GreedyRSC.

• Application of GreedyRSC to large-scale clustering tasks in support of other
  projects at NII, such as the production of large visual vocabularies for NII’s
  entries in the annual TrecVid video retrieval and classification challenges.

The details of the internship project
will be negotiated individually with the students,
taking their background and interests into account.


The ideal duration of this project is 6 months, although visits of as short as 3 months
will still be considered. Although it is possible to reduce the length of the internship
after being accepted, it may be difficult to extend the duration beyond that which is
stated in the candidate’s application. Therefore, candidates are strongly recommended
to state in their application only the longest possible duration for their intended stay at
NII.


3. References

[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th
SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta,
GA, USA, 2008.


[2] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely
high-dimensional data sets", in Proc. 21st IEEE International Conference on Data
Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.