Summarizing data streams using segment-wise distributional clustering


4 Nov 2013




Abstract:

Data streams have become an essential topic in recent years because advances in
hardware technology enable the automated recording of large amounts of data. The
primary constraint in the effective mining of streams is the large volume of data
that must be processed in real time. Density estimation provides a simple and
effective overview of the probabilistic data distribution of a stream segment, but
the direct use of density distributions turns out to be an inefficient storage and
processing mechanism in practice. We therefore introduce the concept of cluster
histograms, which provide an efficient way to estimate and summarize the most
important data distribution profiles over different stream segments. These
profiles can be constructed in a supervised or unsupervised way, depending on the
nature of the underlying application. They can also be used for change detection,
anomaly detection, segmental nearest neighbour search, or supervised stream
segment classification. In addition, these techniques can be used for modeling
other kinds of data, such as text and categorical data. This flexibility of the
cluster histogram framework follows from its general way of storing the historical
density profile of the data stream. As a result, our proposed method is an
analytical framework for density-based mining of data streams.







Scope of the project:


Clustering, stream mining, data streams



Introduce the basic concepts underlying the cluster histogram framework.

Discuss how cluster histograms are maintained dynamically.

Illustrate how to use the cluster histogram approach for a number of
unsupervised applications.

Show how to use the technique for classification in the presence of class labels.

Present methods for segment-wise distributional clustering.
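
To make the scope items above concrete, here is a minimal Java sketch of one
possible reading of a cluster histogram: each point of a stream segment is
assigned to its nearest cluster centroid, and the segment is summarized by the
fraction of points per cluster. The class name SegmentHistogram, the fixed
centroids, and the use of the L1 distance as a change score are illustrative
assumptions, not details taken from the project.

```java
import java.util.Arrays;

/** Hypothetical helper: summarizes a stream segment as a histogram over fixed centroids. */
final class SegmentHistogram {

    /** Index of the centroid closest to the point (squared Euclidean distance). */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centroids.length; j++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[j][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = j;
            }
        }
        return best;
    }

    /** Fraction of the segment's points that fall into each cluster. */
    static double[] histogram(double[][] segment, double[][] centroids) {
        double[] freq = new double[centroids.length];
        if (segment.length == 0) {
            return freq; // empty segment: all frequencies stay at zero
        }
        for (double[] point : segment) {
            freq[nearestCentroid(point, centroids)] += 1.0;
        }
        for (int j = 0; j < freq.length; j++) {
            freq[j] /= segment.length;
        }
        return freq;
    }

    /** L1 distance between two histograms; an assumed, illustrative change score. */
    static double l1Distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += Math.abs(a[j] - b[j]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumed centroids; how they are obtained is outside this sketch.
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] earlier = {{0.1, 0.2}, {0.3, -0.1}, {9.8, 10.1}, {-0.2, 0.0}};
        double[][] later = {{10.2, 9.9}, {9.7, 10.3}, {0.1, 0.1}, {10.0, 10.4}};
        double[] h1 = histogram(earlier, centroids);
        double[] h2 = histogram(later, centroids);
        System.out.println(Arrays.toString(h1) + " vs " + Arrays.toString(h2));
        System.out.println("change score = " + l1Distance(h1, h2));
    }
}
```

A large distance between the histograms of two segments suggests that the data
distribution has shifted between them. Where to place a detection threshold is
application dependent and is not fixed by this sketch.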



Here we propose a statistical characterization of sliding windows over the stream
segments, called cluster histograms. We will show that this representation plays a
unique role in extracting the most relevant density profiles of different segments
of a data stream. Because density estimation provides a relevant overview of the
probabilistic data distribution, cluster histograms can be applied to a variety of
complex data mining tasks. Such tasks range from unsupervised methods such as
change detection to supervised tasks such as stream segment classification or
event detection. In many real data sets, only a small portion of the space is
populated with data points, because of abnormalities in the underlying data. For
space and time efficiency, it is therefore preferable to summarize the density
distribution only at the dense regions of the data. While density-based methods
are often used to construct clusters, the reverse technique of constructing
density estimates from clusters is rarely used in real applications. We will show
that the clustering and density estimation problems are related in such a way that
it is possible to approximate the density distribution of a data set with the use
of fine-grained clusters. This is possible only when a very large number of data
points is available for constructing the clusters, so data streams provide the
ideal framework for cluster-based data summarization. Different stream segments
may be clustered in different ways. It is essential to characterize these
differences in clustering behavior so that interesting characteristics of
different segments of the data stream are revealed.
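
The relation between clustering and density estimation described above can be
sketched in Java as follows. The example is one-dimensional and uses a Gaussian
kernel with a fixed bandwidth. Both choices, along with the class and method
names, are assumptions made for illustration and are not specified by the
project. The sketch compares a classic kernel density estimate, which places one
kernel on every data point, with an approximation that places one kernel on each
cluster centroid, weighted by the share of points in that cluster.

```java
/** Hypothetical 1-D illustration: kernel density estimate vs. cluster-weighted approximation. */
final class ClusterDensity {

    /** Gaussian kernel with bandwidth h, evaluated at offset u from the kernel centre. */
    static double gaussian(double u, double h) {
        double z = u / h;
        return Math.exp(-0.5 * z * z) / (h * Math.sqrt(2.0 * Math.PI));
    }

    /** Classic kernel density estimate: one kernel per data point, O(n) work per query. */
    static double kernelDensity(double x, double[] points, double h) {
        double sum = 0.0;
        for (double p : points) {
            sum += gaussian(x - p, h);
        }
        return sum / points.length;
    }

    /**
     * Cluster-based approximation: one kernel per centroid, weighted by the
     * fraction of points assigned to that cluster. O(k) work per query.
     */
    static double clusterDensity(double x, double[] centroids, int[] counts, double h) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0; // no points summarized yet
        }
        double sum = 0.0;
        for (int j = 0; j < centroids.length; j++) {
            sum += ((double) counts[j] / total) * gaussian(x - centroids[j], h);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] points = {0.9, 1.0, 1.1, 4.9, 5.0, 5.2};
        double[] centroids = {1.0, 5.0}; // assumed fine-grained cluster centres
        int[] counts = {3, 3};           // points per cluster
        double h = 0.5;                  // assumed bandwidth
        for (double x = 0.0; x <= 6.0; x += 1.0) {
            System.out.println(x + ": kde=" + kernelDensity(x, points, h)
                    + " cluster=" + clusterDensity(x, centroids, counts, h));
        }
    }
}
```

The finer the clusters, the closer the second estimate comes to the first; in
the limit where every point is its own cluster the two coincide. The
cluster-based form only needs the centroids and counts, which is what makes it
attractive for streams.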




Proposed System:


CLUSTER HISTOGRAM FRAMEWORK:




This section discusses the cluster histogram framework. Cluster histograms are
designed to construct effective summaries of the probabilistic data distribution
over different sliding windows of the stream. The most direct way of finding the
data distribution is kernel density estimation. While density estimation has been
adapted to large databases, its storage and use in real-time stream mining
environments present a number of challenges. This is because kernel density
estimation is inherently inefficient: it estimates the density distribution over
all regions of the data, including the sparse ones. This quickly becomes
impractical for the non-uniform data distributions of high-dimensional streams.
Our aim is to use the stream segment characterization not just for change
detection, but also for a variety of other tasks such as segment classification
and anomaly detection. To achieve this goal for fast data streams or segments, it
is important to update and store the corresponding data distributions
efficiently. The cluster histogram framework turns out to be a practical,
efficient and scalable approach to doing so.
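
As a rough illustration of why such a summary is cheap to update, the Java
sketch below keeps per-cluster counts for the most recent points of a stream.
Each arrival costs one increment and at most one decrement, independent of the
window size. The class SlidingClusterHistogram and its methods are hypothetical,
and the sketch assumes that each arriving point has already been mapped to a
cluster id elsewhere.

```java
import java.util.ArrayDeque;

/** Hypothetical sketch: cluster counts over the most recent `capacity` stream points. */
final class SlidingClusterHistogram {
    private final int capacity;
    private final long[] counts;
    private final ArrayDeque<Integer> window = new ArrayDeque<Integer>();

    SlidingClusterHistogram(int numClusters, int capacity) {
        if (numClusters <= 0 || capacity <= 0) {
            throw new IllegalArgumentException("numClusters and capacity must be positive");
        }
        this.capacity = capacity;
        this.counts = new long[numClusters];
    }

    /** Records the cluster id of a new point; evicts the oldest point if the window is full. */
    void add(int clusterId) {
        if (clusterId < 0 || clusterId >= counts.length) {
            throw new IllegalArgumentException("unknown cluster id: " + clusterId);
        }
        if (window.size() == capacity) {
            int evicted = window.removeFirst();
            counts[evicted]--;
        }
        window.addLast(clusterId);
        counts[clusterId]++;
    }

    /** Relative frequency of each cluster within the current window. */
    double[] frequencies() {
        double[] freq = new double[counts.length];
        int n = window.size();
        if (n == 0) {
            return freq; // nothing seen yet: all zeros
        }
        for (int j = 0; j < counts.length; j++) {
            freq[j] = (double) counts[j] / n;
        }
        return freq;
    }

    public static void main(String[] args) {
        SlidingClusterHistogram hist = new SlidingClusterHistogram(2, 3);
        int[] clusterIds = {0, 0, 1, 1, 1}; // assumed cluster assignments of arriving points
        for (int id : clusterIds) {
            hist.add(id);
            System.out.println(java.util.Arrays.toString(hist.frequencies()));
        }
    }
}
```

Storing the cluster ids of the window is the simplest way to support eviction.
A design that keeps only counts per time slot would use less memory, at the cost
of coarser window boundaries.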


System requirements:

Software Requirements

Platform          : JDK 1.6
Program Language  : JAVA
Tool IDE          : NetBeans
Data Base         : MySQL
Operating System  : Windows XP

Hardware Requirements

Processor         : Pentium IV Processor
RAM               : 512 MB
Hard Drive        : 10 GB