Spiros Papadimitriou Jimeng Sun IBM T.J. Watson Research Center Hawthorne, NY, USA Reporter: Nai-Hui, Ku

25 Nov 2013



Introduction


Related Work


Distributed Mining Process


Co-clustering Huge Datasets


Experiments


Conclusions


Problems


Huge datasets


Data from natural sources arrive in an impure form


Proposed Method


A comprehensive Distributed Co-clustering (DisCo) solution


Using Hadoop


DisCo is a scalable framework under which various co-clustering algorithms can be implemented


Map-Reduce framework


employs a distributed storage cluster


block-addressable storage


a centralized metadata server


a convenient data access and storage API for Map-Reduce tasks


Co-clustering


Algorithm


cluster shapes


checkerboard partitions


single bi-cluster


Exclusive row and column partitions


overlapping partitions


Optimization criteria


code length
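
To make the "code length" criterion concrete, here is a minimal sketch. It assumes a binary matrix and an entropy-based cost of n_pq * H(g_pq / n_pq) bits per block, and it ignores the cost of describing the partition itself. The function names and the 0-based group labels are illustrative choices, not taken from the slides.

```python
import math

def binary_entropy(p):
    """Shannon entropy, in bits, of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_length(A, r, c, k, l):
    """Bits needed to encode the cells of binary matrix A, block by block.

    r and c are row and column label vectors with 0-based labels
    (an assumption of this sketch); k and l are the group counts.
    """
    ones = [[0] * l for _ in range(k)]
    cells = [[0] * l for _ in range(k)]
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            ones[r[i]][c[j]] += a
            cells[r[i]][c[j]] += 1
    total = 0.0
    for p in range(k):
        for q in range(l):
            if cells[p][q]:
                density = ones[p][q] / cells[p][q]
                total += cells[p][q] * binary_entropy(density)
    return total
```

Under this cost, a checkerboard partition whose blocks are all-ones or all-zeros costs zero bits, so homogeneous blocks are preferred over mixed ones.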


Identifying the source and obtaining the data

Transforming raw data into the appropriate format for data analysis

Presenting results visually, or turning them into the input for other applications


Data pre-processing


Processing 350 GB raw network event log


Needs over 5 hours to extract source/destination IP pairs


Achieve much better performance on a few commodity nodes running Hadoop


Setting up Hadoop required minimal effort
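
The extraction of source/destination IP pairs can be sketched as a map function and a reduce function, simulated locally in plain Python. The log layout (whitespace-separated fields with the source IP second and the destination IP third) is purely hypothetical, since the slides do not describe the raw format, and `run_job` stands in for the shuffle that Hadoop would perform.

```python
from collections import defaultdict

def map_log_line(line):
    """Map: parse one raw log line and emit ((src_ip, dst_ip), 1).

    Assumes a hypothetical layout: '<timestamp> <src_ip> <dst_ip> ...'.
    """
    fields = line.split()
    if len(fields) < 3:
        return  # skip malformed lines
    yield (fields[1], fields[2]), 1

def reduce_pair(key, values):
    """Reduce: count how often one (src, dst) pair occurred."""
    return key, sum(values)

def run_job(lines):
    """Local stand-in for the shuffle: group map output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_log_line(line):
            groups[key].append(value)
    return dict(reduce_pair(k, v) for k, v in sorted(groups.items()))
```

Each distinct pair in the output corresponds to one edge of the graph that is later co-clustered.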


Specifically for co-clustering, there are two main preprocessing tasks:


Building the graph from raw data


Pre-computing the transpose


During co-clustering optimization, we need to iterate over both rows and columns.


Need to pre-compute the adjacency lists for both the original graph and its transpose
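
The two preprocessing tasks can be sketched in a few lines, assuming the graph is given as a list of (source, destination) edge pairs. The helper names are illustrative, and a real job would run this as Map-Reduce steps rather than in memory.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Group edges by source node into sorted, de-duplicated adjacency lists."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
    return {node: sorted(nbrs) for node, nbrs in adj.items()}

def transpose_edges(edges):
    """Swap the endpoints of every edge, giving the transposed graph."""
    return [(dst, src) for src, dst in edges]

# Usage: row-wise lists from the original graph, column-wise from its transpose.
edges = [(0, 1), (0, 2), (1, 2), (0, 1)]
row_lists = build_adjacency(edges)
col_lists = build_adjacency(transpose_edges(edges))
```

With both sets of lists stored up front, row iterations read `row_lists` and column iterations read `col_lists`, so neither pass has to scan the matrix in its non-native order.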


Definitions and overview


Matrices are denoted by boldface capital letters


Vectors are denoted by boldface lowercase letters


a_ij: the (i, j)-th element of matrix A


Co-clustering algorithms employ a checkerboard partition of the original adjacency matrix into a grid of sub-matrices


For an m × n matrix, a co-clustering is a pair of row and column labeling vectors


r(i): the group label of the i-th row of the matrix


G: the k × ℓ group matrix


g_pq gives the sufficient statistics for the (p, q) sub-matrix
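
As an illustration of these definitions, the sketch below computes a group matrix from row adjacency lists and the two label vectors. It assumes a binary matrix, so that the sufficient statistic of each sub-matrix is simply its number of nonzeros, and it uses 0-based labels. The function name is hypothetical.

```python
def group_matrix(rows, r, c, k, l):
    """Return the k x l matrix G of nonzero counts per (row group, column group).

    rows maps a row index i to the list of column indices j with a_ij = 1;
    r and c are the row and column label vectors (0-based labels assumed).
    """
    G = [[0] * l for _ in range(k)]
    for i, cols in rows.items():
        for j in cols:
            G[r[i]][c[j]] += 1
    return G

# Usage: a 4 x 4 matrix with two dense diagonal blocks.
rows = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [2]}
G = group_matrix(rows, r=[0, 0, 1, 1], c=[0, 0, 1, 1], k=2, l=2)
```

G is small (k × ℓ) regardless of the size of the matrix itself, which is what makes it cheap to share with every task.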



Map function


Reduce function


Global sync
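
The three steps named above can be sketched for one row iteration as follows. This is an illustrative local simulation, not the authors' code: it assumes a binary matrix, nonzero counts as sufficient statistics, a clamped log-likelihood row cost, and 0-based labels. All function names are hypothetical.

```python
import math
from collections import defaultdict

def row_stats(cols, c, l):
    """Count the nonzeros of one row that fall in each column group."""
    stats = [0] * l
    for j in cols:
        stats[c[j]] += 1
    return stats

def row_cost(stats, p, G, row_sizes, col_sizes):
    """Bits to encode one row using the block densities of row group p."""
    bits = 0.0
    for q, ones in enumerate(stats):
        cells = row_sizes[p] * col_sizes[q]
        density = (G[p][q] / cells) if cells else 0.5
        # Clamp the density away from 0 and 1 so the logarithms stay finite.
        d = min(max(density, 1e-6), 1 - 1e-6)
        bits -= ones * math.log2(d) + (col_sizes[q] - ones) * math.log2(1 - d)
    return bits

def cc_map(i, cols, c, G, row_sizes, col_sizes):
    """Map: pick the cheapest row group for row i, emit (group, (stats, i))."""
    stats = row_stats(cols, c, len(col_sizes))
    best = min(range(len(G)),
               key=lambda p: row_cost(stats, p, G, row_sizes, col_sizes))
    return best, (stats, i)

def cc_reduce(p, values):
    """Reduce: sum the statistics of all rows assigned to group p."""
    l = len(values[0][0])
    total = [sum(stats[q] for stats, _ in values) for q in range(l)]
    members = [i for _, i in values]
    return p, total, members

def row_iteration(rows, r, c, G):
    """Global sync: run map and reduce, then rebuild r and G."""
    k, l = len(G), len(G[0])
    row_sizes = [r.count(p) for p in range(k)]
    col_sizes = [c.count(q) for q in range(l)]
    shuffled = defaultdict(list)
    for i, cols in rows.items():
        p, value = cc_map(i, cols, c, G, row_sizes, col_sizes)
        shuffled[p].append(value)
    new_r, new_G = list(r), [[0] * l for _ in range(k)]
    for p, values in shuffled.items():
        _, total, members = cc_reduce(p, values)
        new_G[p] = total
        for i in members:
            new_r[i] = p
    return new_r, new_G
```

A column iteration would be the same procedure applied to the transposed adjacency lists, with the roles of r and c swapped.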


Setup


39 nodes


Two dual-core processors


8 GB RAM


Linux RHEL4


4 Gbps Ethernet


SATA disks, 65 MB/sec or roughly 500 Mbps


The total capacity of our HDFS cluster was just 2.4 terabytes


HDFS block size was set to 64MB (default value)


Java


Sun JDK version 1.6.0_03



The pre-processing step on the ISS data


Default values


39 nodes


6 concurrent maps per node


5 reduce tasks


256MB input split size
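
As a rough back-of-the-envelope check of what these defaults imply, the sketch below estimates the number of map tasks and map "waves". It assumes the input is the 350 GB raw log mentioned earlier, one map task per input split, 1 GB = 1024 MB, and all map slots available at once. None of these assumptions is stated on the slides.

```python
import math

def map_waves(input_gb, split_mb, nodes, maps_per_node):
    """Return (number of splits, concurrent map slots, waves of map tasks)."""
    splits = math.ceil(input_gb * 1024 / split_mb)
    slots = nodes * maps_per_node
    return splits, slots, math.ceil(splits / slots)

# Usage with the default values listed above.
splits, slots, waves = map_waves(input_gb=350, split_mb=256, nodes=39, maps_per_node=6)
```

Under these assumptions the job has 1400 splits and 234 concurrent map slots, so the map phase finishes in about 6 waves.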



Relatively low-cost components achieve I/O rates that exceed those of high-performance storage systems.


Performance scales almost linearly with the number of machines/disks.