Clustering of the self-organizing map



Vesanto, J.; Alhoniemi, E.

Neural Networks Res. Centre, Helsinki Univ. of Technol., Espoo

This paper appears in: Neural Networks, IEEE Transactions on
On page(s): 586-600
Volume: 11, May 2000
ISSN: 1045-9227
References Cited: 49
CODEN: ITNNEP
INSPEC Accession Number: 6633557


Abstract:

The self-organizing map (SOM) is an excellent tool in the exploratory phase of data mining. It projects the input space onto the prototypes of a low-dimensional regular grid that can be effectively utilized to visualize and explore properties of the data. When the number of SOM units is large, similar units need to be grouped, i.e., clustered, to facilitate quantitative analysis of the map and the data. In this paper, different approaches to clustering of the SOM are considered. In particular, the use of hierarchical agglomerative clustering and partitive clustering using k-means are investigated. The two-stage procedure, in which the SOM is first used to produce the prototypes that are then clustered in the second stage, is found to perform well when compared with direct clustering of the data, and to reduce the computation time.


Index Terms:

data analysis; data mining; learning (artificial intelligence); self-organising feature maps
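The two-stage procedure summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the SOM training stage is represented here by a small hand-made list of prototype vectors (in practice these would come from a trained map, one per map unit), and a bare-bones k-means with farthest-point initialization clusters the prototypes instead of the full data set.

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Bare-bones k-means with farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # next center: the point farthest from all chosen centers
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sqdist(p, centers[c]))
        # update step: each center becomes the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

# Stage 1 (assumed already done): a trained SOM has reduced the raw data
# to a small set of prototype vectors (made-up values for illustration).
prototypes = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),   # units near one region
              (4.0, 4.1), (4.2, 3.9), (3.9, 4.0)]     # units near another

# Stage 2: cluster the prototypes instead of the (much larger) data set.
labels = kmeans(prototypes, k=2)
```

The computational saving comes from stage 2 operating on the handful of prototypes rather than on every data vector.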







Reference list:

1. D. Pyle, "Data Preparation for Data Mining", Morgan Kaufmann, San Francisco, CA, 1999.
2. T. Kohonen, "Self-Organizing Maps", Springer, Berlin/Heidelberg, Germany, vol.30, 1995.
3. J. Vesanto, "SOM-based data visualization methods", Intell. Data Anal., vol.3, no.2, pp.111-126, 1999.
4. "Fuzzy Models for Pattern Recognition: Methods that Search for Structures in Data", IEEE, New York, 1992.
5. G. J. McLachlan, K. E. Basford, "Mixture Models: Inference and Applications to Clustering", Marcel Dekker, New York, vol.84, 1987.
6. J. C. Bezdek, "Some new indexes of cluster validity", IEEE Trans. Syst., Man, Cybern. B, vol.28, pp.301-315, 1998.
7. M. Blatt, S. Wiseman, E. Domany, "Superparamagnetic clustering of data", Phys. Rev. Lett., vol.76, no.18, pp.3251-3254, 1996.
8. G. Karypis, E.-H. Han, V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling", IEEE Comput., vol.32, pp.68-74, Aug. 1999.
9. J. Lampinen, E. Oja, "Clustering properties of hierarchical self-organizing maps", J. Math. Imag. Vis., vol.2, no.2-3, pp.261-272, Nov. 1992.
10. E. Boudaillier, G. Hebrail, "Interactive interpretation of hierarchical clustering", Intell. Data Anal., vol.2, no.3, 1998.
11. J. Buhmann, H. Kühnel, "Complexity optimized data clustering by competitive neural networks", Neural Comput., vol.5, no.3, pp.75-88, May 1993.
12. G. W. Milligan, M. C. Cooper, "An examination of procedures for determining the number of clusters in a data set", Psychometrika, vol.50, no.2, pp.159-179, June 1985.
13. D. L. Davies, D. W. Bouldin, "A cluster separation measure", IEEE Trans. Patt. Anal. Machine Intell., vol.PAMI-1, pp.224-227, Apr. 1979.
14. S. P. Luttrell, "Hierarchical self-organizing networks", Proc. 1st IEE Conf. Artificial Neural Networks, London, U.K., pp.2-6, 1989.
15. A. Varfis, C. Versino, "Clustering of socio-economic data with Kohonen maps", Neural Network World, vol.2, no.6, pp.813-834, 1992.
16. P. Mangiameli, S. K. Chen, D. West, "A comparison of SOM neural network and hierarchical clustering methods", Eur. J. Oper. Res., vol.93, no.2, Sept. 1996.
17. T. Kohonen, "Self-organizing maps: Optimization approaches", Artificial Neural Networks, Elsevier, Amsterdam, The Netherlands, pp.981-990, 1991.
18. J. Moody, C. J. Darken, "Fast learning in networks of locally-tuned processing units", Neural Comput., vol.1, no.2, pp.281-294, 1989.
19. J. A. Kangas, T. K. Kohonen, J. T. Laaksonen, "Variants of self-organizing maps", IEEE Trans. Neural Networks, vol.1, pp.93-99, Mar. 1990.
20. T. Martinetz, K. Schulten, "A 'neural-gas' network learns topologies", Artificial Neural Networks, Elsevier, Amsterdam, The Netherlands, pp.397-402, 1991.
21. B. Fritzke, "Let it grow: Self-organizing feature maps with problem dependent cell structure", Artificial Neural Networks, Elsevier, Amsterdam, The Netherlands, pp.403-408, 1991.
22. Y. Cheng, "Clustering with competing self-organizing maps", Proc. Int. Conf. Neural Networks, vol.4, pp.785-790, 1992.
23. B. Fritzke, "Growing grid: A self-organizing network with constant neighborhood range and adaptation strength", Neural Process. Lett., vol.2, no.5, pp.9-13, 1995.
24. J. Blackmore, R. Miikkulainen, "Visualizing high-dimensional structure with the incremental grid growing neural network", Proc. 12th Int. Conf. Machine Learning, pp.55-63, 1995.
25. S. Jockusch, "A neural network which adapts its structure to a given set of patterns", Parallel Processing in Neural Systems and Computers, Elsevier, Amsterdam, The Netherlands, pp.169-172, 1990.
26. J. S. Rodrigues, L. B. Almeida, "Improving the learning speed in topological maps of patterns", Proc. Int. Neural Network Conf., pp.813-816, 1990.
27. D. Alahakoon, S. K. Halgamuge, "Knowledge discovery with supervised and unsupervised self evolving neural networks", Proc. Fifth Int. Conf. Soft Comput. Inform./Intell. Syst., pp.907-910, 1998.
28. A. Gersho, "Asymptotically optimal block quantization", IEEE Trans. Inform. Theory, vol.IT-25, pp.373-380, July 1979.
29. P. L. Zador, "Asymptotic quantization error of continuous signals and the quantization dimension", IEEE Trans. Inform. Theory, vol.IT-28, pp.139-149, Mar. 1982.
30. H. Ritter, "Asymptotic level density for a class of vector quantization processes", IEEE Trans. Neural Networks, vol.2, pp.173-175, Jan. 1991.
31. T. Kohonen, "Comparison of SOM point densities based on different criteria", Neural Comput., vol.11, pp.2171-2185, 1999.
32. R. D. Lawrence, G. S. Almasi, H. E. Rushmeier, "A scalable parallel algorithm for self-organizing maps with applications to sparse data problems", Data Mining Knowl. Discovery, vol.3, no.2, pp.171-195, June 1999.
33. T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela, "Self organization of a massive document collection", IEEE Trans. Neural Networks, vol.11, pp.XXX-XXX, May 2000.
34. P. Koikkalainen, "Fast deterministic self-organizing maps", Proc. Int. Conf. Artificial Neural Networks, vol.2, pp.63-68, 1995.
35. A. Ultsch, H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis", Proc. Int. Neural Network Conf., Dordrecht, The Netherlands, pp.305-308, 1990.
36. M. A. Kraaijveld, J. Mao, A. K. Jain, "A nonlinear projection method based on Kohonen's topology preserving maps", IEEE Trans. Neural Networks, vol.6, pp.548-559, May 1995.
37. J. Iivarinen, T. Kohonen, J. Kangas, S. Kaski, "Visualizing the clusters on the self-organizing map", Proc. Conf. Artificial Intell. Res. Finland, Helsinki, Finland, pp.122-126, 1994.
38. X. Zhang, Y. Li, "Self-organizing map as a new method for clustering and data analysis", Proc. Int. Joint Conf. Neural Networks, pp.2448-2451, 1993.
39. J. B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis", Psychometrika, vol.29, no.1, pp.1-27, Mar. 1964.
40. J. B. Kruskal, "Nonmetric multidimensional scaling: A numerical method", Psychometrika, vol.29, no.2, pp.115-129, June 1964.
41. J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis", IEEE Trans. Comput., vol.C-18, pp.401-409, May 1969.
42. P. Demartines, J. Hérault, "Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets", IEEE Trans. Neural Networks, vol.8, pp.148-154, Jan. 1997.
43. A. Varfis, "On the use of two traditional statistical techniques to improve the readability of Kohonen Maps", Proc. NATO ASI Workshop Statistics Neural Networks, 1993.
44. S. Kaski, J. Venna, T. Kohonen, "Coloring that reveals high-dimensional structures in data", Proc. 6th Int. Conf. Neural Inform. Process., pp.729-734, 1999.
45. F. Murtagh, "Interpreting the Kohonen self-organizing map using contiguity-constrained clustering", Patt. Recognit. Lett., vol.16, pp.399-408, 1995.
46. O. Simula, J. Vesanto, P. Vasara, R.-R. Helminen, "Industrial Applications of Neural Networks", CRC, Boca Raton, FL, pp.87-112, 1999.
47. J. Vaisey, A. Gersho, "Simulated annealing and codebook design", Proc. Int. Conf. Acoust., Speech, Signal Process., pp.1176-1179, 1988.
48. J. K. Flanagan, D. R. Morrell, R. L. Frost, C. J. Read, B. E. Nelson, "Vector quantization codebook generation using simulated annealing", Proc. Int. Conf. Acoust., Speech, Signal Process., vol.3, pp.1759-1762, 1989.
49. T. Graepel, M. Burger, K. Obermayer, "Phase transitions in stochastic self-organizing maps", Phys. Rev. E, vol.56, pp.3876-3890, 1997.

Grid-clustering: an efficient hierarchical clustering method for very large data sets

Schikuta, E.

Inst. of Appl. Comput. Sci. & Inf. Syst., Wien Univ.

This paper appears in: Pattern Recognition, 1996. Proceedings of the 13th International Conference on
25-29 Aug 1996
Location: Vienna, Austria
On page(s): 101-105 vol.2
Number of Pages: 4 vol. (xxxi+976+xxix+922+xxxi+1008+xxix+788)
INSPEC Accession Number: 5443777


Abstract:

Clustering is a common technique for the analysis of large images. In this paper a new approach to hierarchical clustering of very large data sets is presented. The GRIDCLUS algorithm uses a multidimensional grid data structure to organize the value space surrounding the pattern values, rather than to organize the patterns themselves. The patterns are grouped into blocks and clustered with respect to the blocks by a topological neighbor search algorithm. The runtime behavior of the algorithm outperforms all conventional hierarchical methods. A comparison of execution times to those of other commonly used clustering algorithms, and a heuristic runtime analysis, are presented.


Index Terms:

computer vision; data structures; search problems; topology
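The core grid idea can be illustrated with a toy sketch. This is not the GRIDCLUS algorithm (which builds a hierarchy over blocks of varying density); it only shows the two ingredients the abstract names: binning points into cells of a grid over the value space, then merging non-empty cells that touch via a topological neighbor search. The 2-D data and cell size are made up for illustration.

```python
from collections import defaultdict, deque

def grid_cluster(points, cell=1.0):
    """Toy grid clustering: connect axis-adjacent non-empty cells (2-D only)."""
    # bin each point into an integer grid cell
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(x // cell) for x in p)].append(p)
    # flood-fill over axis-adjacent non-empty cells
    cluster_of = {}
    cid = 0
    for start in cells:
        if start in cluster_of:
            continue
        queue = deque([start])
        cluster_of[start] = cid
        while queue:
            cx, cy = queue.popleft()
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in cells and nb not in cluster_of:
                    cluster_of[nb] = cid
                    queue.append(nb)
        cid += 1
    # label each point by its cell's cluster
    return [cluster_of[tuple(int(x // cell) for x in p)] for p in points]

pts = [(0.2, 0.3), (0.8, 0.9), (1.1, 0.4),     # one connected cell region
       (5.5, 5.5), (5.9, 5.1)]                 # another, far away
labels = grid_cluster(pts)
```

Because only non-empty cells are visited, the cost depends on the number of occupied cells rather than on pairwise point comparisons, which is where grid methods gain on very large data sets.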


A scalable parallel subspace clustering algorithm for massive data sets

Nagesh, H.S.; Goil, S.; Choudhary, A.

Dept. of Electron. & Comput. Eng., Northwestern Univ., Evanston, IL

This paper appears in: Parallel Processing, 2000. Proceedings. 2000 International Conference on
21-24 Aug 2000
Location: Toronto, Ont., Canada
On page(s): 477-484
References Cited: 19
Number of Pages: xx+590
INSPEC Accession Number: 6742433


Abstract:

Clustering is a data mining problem which finds dense regions in a sparse multi-dimensional data set. The attribute values and ranges of these regions characterize the clusters. Clustering algorithms need to scale with the database size and also with the large dimensionality of the data set. Further, these algorithms need to explore the embedded clusters in a subspace of a high dimensional space. However, the time complexity of the algorithm to explore clusters in subspaces is exponential in the dimensionality of the data and is thus extremely compute intensive. Thus, parallelization is the choice for discovering clusters for large data sets. In this paper we present a scalable parallel subspace clustering algorithm which has both data and task parallelism embedded in it. We also formulate the technique of adaptive grids and present a truly unsupervised clustering algorithm requiring no user inputs. Our implementation shows near linear speedups with negligible communication overheads. The use of adaptive grids results in two orders of magnitude improvement in the computation time of our serial algorithm over current methods, with much better quality of clustering. Performance results on both real and synthetic data sets with a very large number of dimensions on a 16-node IBM SP2 demonstrate our algorithm to be a practical and scalable clustering technique.


Index Terms:

computational complexity; data mining; parallel algorithms; pattern clustering; very large databases


Clustering soft-devices in the semantic grid

Zhuge, H.

Inst. of Comput. Technol., Acad. Sinica, China

This paper appears in: Computing in Science & Engineering [see also IEEE Computational Science and Engineering]
On page(s): 60-62
Nov/Dec 2002
ISSN: 1521-9615
INSPEC Accession Number: 7470995


Abstract:

Soft-devices are promising next-generation Web resources. They are software mechanisms that provide services to each other and to other virtual roles according to the content of their resources and related configuration information. They can contain various kinds of resources such as text, images, and other services. Configuring a resource in a soft-device is similar to installing software in a computer: both processes contain multi-step human-computer interactions.


Index Terms:

Internet; information resources

A new data clustering approach for data mining in large databases

Cheng-Fa Tsai; Han-Chang Wu; Chun-Wei Tsai

Dept. of Manage. Inf. Syst., Nat. Pingtung Univ. of Sci. & Technol.

This paper appears in: Parallel Architectures, Algorithms and Networks, 2002. I-SPAN '02. Proceedings. International Symposium on
22-24 May 2002
Location: Makati City, Metro Manila, Philippines
On page(s): 278-283
References Cited: 37
Number of Pages: xiv+368
INSPEC Accession Number: 7329072


Abstract:

Clustering is the unsupervised classification of patterns (data items, feature vectors, or observations) into groups (clusters). Clustering in data mining is very useful to discover distribution patterns in the underlying data. Clustering algorithms usually employ a distance metric-based similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions. In this paper, we present a new data clustering method for data mining in large databases. Our simulation results show that the proposed novel clustering method performs better than a fast self-organizing map (FSOM) combined with the k-means approach (FSOM+k-means) and the genetic k-means algorithm (GKA). In addition, in all the cases we studied, our method produces much smaller errors than both the FSOM+k-means approach and GKA.


Index Terms:

data mining; genetic algorithms; pattern clustering; self-organising feature maps; very large databases


A distribution-based clustering algorithm for mining in large spatial databases

Xiaowei Xu; Ester, M.; Kriegel, H.-P.; Sander, J.

Munich Univ.

This paper appears in: Data Engineering, 1998. Proceedings., 14th International Conference on
23-27 Feb 1998
Location: Orlando, FL, USA
On page(s): 324-331
References Cited: 14
Number of Pages: xxi+605
INSPEC Accession Number: 5856765


Abstract:

The problem of detecting clusters of points belonging to a spatial point process arises in many applications. In this paper, we introduce the new clustering algorithm DBCLASD (Distribution-Based Clustering of LArge Spatial Databases) to discover clusters of this type. The results of experiments demonstrate that DBCLASD, contrary to partitioning algorithms such as CLARANS (Clustering Large Applications based on RANdomized Search), discovers clusters of arbitrary shape. Furthermore, DBCLASD does not require any input parameters, in contrast to the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which requires two input parameters that may be difficult to provide for large databases. In terms of efficiency, DBCLASD is between CLARANS and DBSCAN, close to DBSCAN. Thus, the efficiency of DBCLASD on large spatial databases is very attractive when considering its nonparametric nature and its good quality for clusters of arbitrary shape.


Index Terms:

data analysis; deductive databases; knowledge acquisition; very large databases; visual databases


Clustering algorithms and validity measures

Halkidi, M.; Batistakis, Y.; Vazirgiannis, M.

Dept. of Inf., Athens Univ. of Econ. & Bus.

This paper appears in: Scientific and Statistical Database Management, 2001. SSDBM 2001. Proceedings. Thirteenth International Conference on
18-20 Jul 2001
Location: Fairfax, VA, USA
On page(s): 3-22
References Cited: 32
Number of Pages: x+279
INSPEC Accession Number: 7028952


Abstract:

Clustering aims at discovering groups and identifying interesting distributions and patterns in data sets. Researchers have extensively studied clustering since it arises in many application domains in engineering and the social sciences. In recent years the availability of huge transactional and experimental data sets, and the arising requirements for data mining, created needs for clustering algorithms that scale and can be applied in diverse domains. The paper surveys clustering methods and approaches available in the literature in a comparative way. It also presents the basic concepts, principles and assumptions upon which the clustering algorithms are based. Another important issue is the validity of the clustering schemes resulting from applying algorithms. This is also related to the inherent features of the data set under concern. We review and compare clustering validity measures available in the literature. Furthermore, we illustrate the issues that are under-addressed by the recent algorithms and we address new research directions.


Index Terms:

data mining; pattern clustering; transaction processing; very large databases
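Surveys of validity measures typically include the classic Davies-Bouldin index, which scores a given partition by comparing within-cluster scatter to between-cluster separation (lower is better). A minimal pure-Python version, assuming Euclidean points and a precomputed labeling, might look like this; the two example data sets are made up:

```python
def davies_bouldin(points, labels):
    """Davies-Bouldin index: mean over clusters of the worst-case
    (scatter_i + scatter_j) / centroid_distance(i, j) ratio. Lower is better."""
    ks = sorted(set(labels))
    groups = {k: [p for p, l in zip(points, labels) if l == k] for k in ks}
    centroids = {k: tuple(sum(xs) / len(g) for xs in zip(*g))
                 for k, g in groups.items()}
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # scatter: mean distance of a cluster's members to its centroid
    scatter = {k: sum(dist(p, centroids[k]) for p in g) / len(g)
               for k, g in groups.items()}
    total = 0.0
    for i in ks:
        total += max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                     for j in ks if j != i)
    return total / len(ks)

# compact, well-separated clusters score low...
tight = davies_bouldin([(0, 0), (0, 1), (10, 10), (10, 11)], [0, 0, 1, 1])
# ...while spread-out, nearby clusters score high
loose = davies_bouldin([(0, 0), (0, 4), (3, 0), (3, 4)], [0, 0, 1, 1])
```

Running the same index over partitions produced with different parameter settings (e.g. different k) is the standard way such measures are used to pick a clustering.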


`1+1>2': merging distance and density based clustering

Dash, M.; Liu, H.; Xiaowei Xu

Sch. of Comput., Nat. Univ. of Singapore

This paper appears in: Database Systems for Advanced Applications, 2001. Proceedings. Seventh International Conference on
18-21 Apr 2001
Location: Hong Kong, China
On page(s): 32-39
References Cited: 18
Number of Pages: xiv+362
INSPEC Accession Number: 6912922


Abstract:

Clustering is an important data exploration task. Its use in data mining is growing very fast. Traditional clustering algorithms which no longer cater for the data mining requirements are increasingly being modified. Clustering algorithms are numerous and can be divided into several categories. Two prominent categories are distance-based and density-based (e.g. k-means and DBSCAN, respectively). While k-means is fast, easy to implement, and converges to local optima almost surely, it is also easily affected by noise. On the other hand, while density-based clustering can find arbitrary shape clusters and handle noise well, it is also slow in comparison due to the neighborhood search for each data point, and faces a difficulty in setting the density threshold properly. We propose BRIDGE, which efficiently merges the two by exploiting the advantages of one to counter the limitations of the other, and vice versa. BRIDGE enables DBSCAN to handle very large data efficiently and improves the quality of k-means clusters by removing the noisy points. It also helps the user in setting the density threshold parameter properly. We further show that other clustering algorithms can be merged using a similar strategy. An example given in the paper merges BIRCH clustering with DBSCAN.


Index Terms:

data mining; database theory; pattern clustering; very large databases
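The density-based side of this comparison can be made concrete with a naive DBSCAN. This is an illustrative sketch, not the BRIDGE algorithm: it uses an O(n^2) neighborhood scan (exactly the per-point search cost the abstract criticizes), exposes the two parameters eps and min_pts, and marks noise with the label -1. The point list is made-up data.

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Naive DBSCAN with quadratic neighborhood search. Returns labels; -1 = noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbs = neighbors(i)
        if len(nbs) < min_pts:
            labels[i] = -1            # tentatively noise; may later join a cluster
            continue
        labels[i] = cid               # i is a core point: start a new cluster
        queue = deque(nbs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cid       # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            jnbs = neighbors(j)
            if len(jnbs) >= min_pts:  # j is also a core point: expand further
                queue.extend(jnbs)
        cid += 1
    return labels

pts = [(0, 0), (0.5, 0), (1, 0), (0.5, 0.5),   # dense blob
       (10, 10), (10.5, 10), (10, 10.5),       # second blob
       (50, 50)]                               # isolated noise point
labels = dbscan(pts, eps=1.0, min_pts=3)
```

The noise labels this produces are the points a BRIDGE-style hybrid would strip out before (or after) running a distance-based method such as k-means.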

Interactively exploring hierarchical clustering results [gene identification]

Jinwook Seo; Shneiderman, B.

Department of Computer Science & Human-Computer Interaction Laboratory, Maryland Univ., College Park, MD

This paper appears in: Computer
On page(s): 80-86
Volume: 35, Jul 2002
ISSN: 0018-9162
CODEN: CPTRB4
INSPEC Accession Number: 7330434


Abstract:

To date, work in microarrays, sequenced genomes and bioinformatics has focused largely on algorithmic methods for processing and manipulating vast biological data sets. Future improvements will likely provide users with guidance in selecting the most appropriate algorithms and metrics for identifying meaningful clusters, i.e. interesting patterns in large data sets, such as groups of genes with similar profiles. Hierarchical clustering has been shown to be effective in microarray data analysis for identifying genes with similar profiles, and thus possibly with similar functions. Users also need an efficient visualization tool, however, to facilitate pattern extraction from microarray data sets. The Hierarchical Clustering Explorer integrates four interactive features to provide information visualization techniques that allow users to control the processes and interact with the results. Thus, hybrid approaches that combine powerful algorithms with interactive visualization tools will join the strengths of fast processors with the detailed understanding of domain experts.


Index Terms:

Hierarchical Clustering Explorer; algorithmic methods; arrays; bioinformatics; biological data sets; biology computing; data mining; data visualisation; gene functions; gene identification; gene profiles; genetics; hierarchical systems; interactive exploration; interactive information visualization tool; interactive systems; meaningful cluster identification; metrics; microarray data analysis; pattern clustering; pattern extraction; process control; sequenced genomes


Clustering validity assessment: finding the optimal partitioning of a data set

Halkidi, M.; Vazirgiannis, M.

This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
29 Nov-2 Dec 2001
Location: San Jose, CA, USA
On page(s): 187-194
References Cited: 26
Number of Pages: xxi+677
INSPEC Accession Number: 7169295


Abstract:

Clustering is a mostly unsupervised procedure, and the majority of clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. As a consequence, in most applications the resulting clustering scheme requires some sort of evaluation regarding its validity. In this paper we present a clustering validity procedure, which evaluates the results of clustering algorithms on data sets. We define a validity index, S_Dbw, based on well-defined clustering criteria, enabling the selection of optimal input parameter values for a clustering algorithm that result in the best partitioning of a data set. We evaluate the reliability of our index both theoretically and experimentally, considering three representative clustering algorithms run on synthetic and real data sets. We also carried out an evaluation study to compare S_Dbw performance with other known validity indices. Our approach performed favorably in all cases, even those in which other indices failed to indicate the correct partitions in a data set.


Index Terms:

data mining; pattern clustering


Fast hierarchical clustering based on compressed data

Rendon, E.; Barandela, R.

Pattern Recognition Lab., Technol. Inst. of Toluca, Metepec, Mexico

This paper appears in: Pattern Recognition, 2002. Proceedings. 16th International Conference on
On page(s): 216-219 vol.2
2002
ISSN: 1051-4651
Number of Pages: 4 vol. (xxix+834+xxxv+1116+xxxiii+1068+xxv+418)
INSPEC Accession Number: 7474651


Abstract:

Clustering in data mining is the process of discovering groups in a dataset, in such a way that the similarity between the elements of the same cluster is maximal and between different clusters is minimal. Some algorithms attempt to group a representative sample of the whole dataset and later to perform a labeling process in order to group the rest of the original database. Other algorithms perform a pre-clustering phase and later apply some classic clustering algorithm in order to create the final clusters. We present a pre-clustering algorithm that not only provides good results and efficient optimization of main memory but is also independent of the data input order. The efficiency of the proposed algorithm and a comparison of it with the pre-clustering BIRCH algorithm are shown.


Index Terms:

data compression; data mining; pattern clustering
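The "classic clustering algorithm" that such pre-clustering phases feed into is often hierarchical agglomeration. A generic single-link version can be sketched as below; this O(n^3) toy operates on raw points for clarity and is not the paper's compressed-data algorithm. In a pre-clustering scheme the input would be the compressed summaries rather than the original points.

```python
def single_link_agglomerate(points, k):
    """Merge singleton clusters bottom-up until k clusters remain (single link)."""
    clusters = [[p] for p in points]
    def link(a, b):
        # single-link distance: squared distance between the closest pair
        return min(sum((x - y) ** 2 for x, y in zip(p, q)) for p in a for q in b)
    while len(clusters) > k:
        # find the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

clusters = single_link_agglomerate([(0, 0), (0, 1), (8, 8), (8, 9), (0, 2)], k=2)
sizes = sorted(len(c) for c in clusters)
```

Running agglomeration on a small set of pre-cluster summaries instead of every record is what makes the hierarchical step affordable on large databases.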


Clustering spatial data in the presence of obstacles: a density-based approach

Zaiane, O.R.; Chi-Hoon Lee

Database Lab., Alberta Univ., Edmonton, Alta., Canada

This paper appears in: Database Engineering and Applications Symposium, 2002. Proceedings. International
On page(s): 214-223
2002
ISSN: 1098-8068
Number of Pages: xiii+295
INSPEC Accession Number: 7426599


Abstract:

Clustering spatial data is a well-known problem that has been extensively studied. Grouping similar data in large 2-dimensional spaces to find hidden patterns or meaningful sub-groups has many applications, such as satellite imagery, geographic information systems, medical image analysis, marketing, computer vision, etc. Although many methods have been proposed in the literature, very few have considered physical obstacles that may have significant consequences on the effectiveness of the clustering. Taking these constraints into account during the clustering process is costly, and the modeling of the constraints is paramount for good performance. In this paper, we investigate the problem of clustering in the presence of constraints such as physical obstacles and introduce a new approach to model these constraints using polygons. We also propose a strategy to prune the search space and reduce the number of polygons to test during clustering. We devise a density-based clustering algorithm, DBCluC, which takes advantage of our constraint modeling to efficiently cluster data objects while considering all physical constraints. The algorithm can detect clusters of arbitrary shape and is insensitive to noise, the input order, and the difficulty of constraints. Its average running complexity is O(N log N), where N is the number of data points.


Index Terms:

computational complexity; data mining; pattern clustering; visual databases




Clustering large datasets in arbitrary metric spaces

Ganti, V.; Ramakrishnan, R.; Gehrke, J.; Powell, A.; French, J.

Dept. of Comput. Sci., Virginia Univ., Charlottesville, VA

This paper appears in: Data Engineering, 1999. Proceedings., 15th International Conference on
23-26 Mar 1999
Location: Sydney, NSW, Australia
On page(s): 502-511
References Cited: 26
Number of Pages: xxiii+648
INSPEC Accession Number: 6233205


Abstract:

Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function, along with the collection of objects, describes a distance space. In a distance space, the only operation possible on data objects is the computation of the distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects. We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm, BUBBLE, is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm, BUBBLE-FM, improves upon BUBBLE by reducing the number of calls to the distance function, which may be computationally very expensive. Both algorithms make only a single scan over the database while producing high clustering quality. In a detailed experimental evaluation, we study both algorithms in terms of scalability and quality of clustering. We also show results of applying the algorithms to a real life dataset.


Index Terms:

data handling; data mining; database theory; trees (mathematics); very large databases
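The key constraint in a distance space (distances only, no vector averaging) can be illustrated with a tiny k-medoids-style loop. This is not BUBBLE itself, merely a sketch of the restriction: cluster representatives must be actual data objects chosen to minimize within-cluster distance, because centroids cannot be computed without vector operations. The string data and the length-difference distance are made-up illustrations.

```python
def k_medoids(objects, dist, k, iters=10):
    """Toy k-medoids: uses only a distance function, never vector arithmetic."""
    medoids = list(objects[:k])
    for _ in range(iters):
        # assign every object to its nearest medoid
        clusters = {m: [] for m in medoids}
        for o in objects:
            nearest = min(medoids, key=lambda m: dist(o, m))
            clusters[nearest].append(o)
        # replace each medoid by the member minimizing total in-cluster distance
        medoids = [min(members, key=lambda c: sum(dist(c, o) for o in members))
                   for members in clusters.values() if members]
    return medoids

# example non-vector distance space: strings under a toy length-difference distance
words = ["a", "ab", "abc", "abcdefgh", "abcdefghi", "abcdefghij"]
length_dist = lambda u, v: abs(len(u) - len(v))
medoids = k_medoids(words, length_dist, k=2)
```

Nothing in the loop ever adds or averages objects, which is exactly the discipline an algorithm for arbitrary metric spaces must maintain.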


Improving OLAP performance by multidimensional hierarchical clustering

Markl, V.; Ramsak, F.; Bayer, R.

Bayerisches Forschungszentrum für Wissensbasierte Syst., München

This paper appears in: Database Engineering and Applications, 1999. IDEAS '99. International Symposium Proceedings
2-4 Aug 1999
Location: Montreal, Que., Canada
On page(s): 165-177
References Cited: 34
IEEE Catalog Number: PR00265
Number of Pages: xiii+467
INSPEC Accession Number: 6352678


Abstract:

Data warehousing applications cope with enormous data sets in the range of gigabytes and terabytes. Queries usually either select a very small subset of this data or perform aggregations on a fairly large data set. Materialized views storing pre-computed aggregates are used to efficiently process queries with aggregations. This approach increases resource requirements in disk space and slows down updates because of the view maintenance problem. Multidimensional hierarchical clustering (MHC) of OLAP data overcomes these problems while offering more flexibility for aggregation paths. Clustering is introduced as a way to speed up aggregation queries without additional storage cost for materialization. Performance and storage cost of our access method are investigated and compared to current query processing scenarios. In addition, performance measurements on real world data for a typical star schema are presented.


Index Terms:

data mining; data warehouses; distributed databases; pattern clustering; query processing



Clustering algorithms for large sets of heterogeneous remote sensing data

Palubinskas, G.; Datcu, M.; Pac, R.

Remote Sensing Data Center, German Aerosp. Res. Establ., Wessling

This paper appears in: Geoscience and Remote Sensing Symposium, 1999. IGARSS '99 Proceedings. IEEE 1999 International
28 Jun-2 Jul 1999
Location: Hamburg, Germany
On page(s): 1591-1593 vol.3
References Cited: 13
Number of Pages: 5 vol. (xci+2770)
INSPEC Accession Number: 6440656


Abstract:

The authors introduce a concept for a global classification of remote sensing images in large archives, e.g. covering the whole globe. Such an archive, for example, will be created after the Shuttle Radar Topography Mission in 1999. The classification is realized as a two-step procedure: unsupervised clustering and supervised hierarchical classification. Features derived from different and non-commensurable models are combined using an extended k-means clustering algorithm and supervised hierarchical Bayesian networks incorporating any available prior information about the domain.


Index Terms:

belief networks; data mining; feature extraction; geographic information systems; geophysical signal processing; geophysical techniques; image classification; remote sensing




Clustering-regression-ordering steps for knowledge discovery in spatial databases

Lazarevic, A.; Xiaowei Xu; Fiez, T.; Obradovic, Z.
Sch. of Electr. Eng. & Comput. Sci., Washington State Univ., Pullman, WA

This paper appears in: Neural Networks, 1999. IJCNN '99. International Joint Conference on
07/10/1999 - 07/16/1999, Location: Washington, DC, USA
On page(s): 2530-2534 vol.4
References Cited: 11
Number of Pages: 6 vol. lxii+4439
INSPEC Accession Number: 6589860

Abstract:
Precision agriculture is a new approach to farming in which environmental
characteristics at a sub-field level are used to guide crop production
decisions. Instead of applying management actions and production inputs
uniformly across entire fields, they are varied to match site-specific needs. A
first step in this process is to define spatial regions having similar
characteristics and to build local regression models describing the relationship
between field characteristics and yield. From these yield prediction models, one
can then determine optimum production input levels. Discovery of “similar”
regions in fields is done by applying the DBSCAN clustering algorithm on data
from more than one field, ignoring spatial attributes and the corresponding
yield values. The experimental results on real-life agriculture data show
observable improvements in prediction accuracy, although there are many
unresolved issues in applying the proposed method in practice.

Index Terms: agriculture; data mining; pattern recognition; visual databases
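The abstract above leans on DBSCAN's density-based grouping. As a rough, hypothetical sketch of that idea — not the authors' implementation, and without the spatial indexes a real DBSCAN would use — a minimal version over 2-D points could look like this (all names here are illustrative):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: label each point with a cluster id, or -1 for noise.
    A point is 'core' if it has at least min_pts neighbors (itself included)
    within radius eps; clusters grow by expanding from core points."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # tentatively noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:     # previously noise: adopt as border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nj)
    return labels
```

Two tight groups separated by empty space come out as two clusters, and an isolated point stays labeled -1.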

Self-organizing systems for knowledge discovery in large databases

Hsu, W.H.; Anvil, L.S.; Pottenger, W.M.; Tcheng, D.; Welge, M.
National Center for Supercomput. Applications, Illinois Univ., Urbana, IL

This paper appears in: Neural Networks, 1999. IJCNN '99. International Joint Conference on
07/10/1999 - 07/16/1999, Location: Washington, DC, USA
On page(s): 2480-2485 vol.4
References Cited: 20
Number of Pages: 6 vol. lxii+4439
INSPEC Accession Number: 6589850

Abstract:
We present a framework in which self-organizing systems can be used to perform
change of representation on knowledge discovery problems and to learn from very
large databases. Clustering using self-organizing maps is applied to produce
multiple, intermediate training targets that are used to define a new supervised
learning and mixture estimation problem. The input data is partitioned using a
state-space search over subdivisions of attributes, and self-organizing maps are
applied to the input data restricted to each subset of input attributes. This
approach yields the variance-reducing benefits of techniques such as stacked
generalization, but uses self-organizing systems to discover factorial (modular)
structure among abstract learning targets. This research demonstrates the
feasibility of applying such structure in very large databases to build a
mixture of ANNs for data mining and KDD.

Index Terms: data mining; learning (artificial intelligence); search problems; self-organising feature maps; very large databases


Clustering very large databases using EM mixture models

Bradley, P.S.; Fayyad, U.M.; Reina, C.A.
Microsoft Res., USA

This paper appears in: Pattern Recognition, 2000. Proceedings. 15th International Conference on
09/03/2000 - 09/07/2000, Location: Barcelona, Spain
On page(s): 76-80 vol.2
References Cited: 13
Number of Pages: 4 vol(xxxi+1134+xxxiii+1072+1152+xxix+881)
INSPEC Accession Number: 6887409

Abstract:
Clustering very large databases is a challenge for traditional pattern
recognition algorithms, e.g. the expectation-maximization (EM) algorithm for
fitting mixture models, because of high memory and iteration requirements. Over
large databases, the cost of the numerous scans required to converge and the
large memory requirement of the algorithm become prohibitive. We present a
decomposition of the EM algorithm requiring a small amount of memory by limiting
iterations to small data subsets. The scalable EM approach requires at most one
database scan and is based on identifying regions of the data that are
discardable, regions that are compressible, and regions that must be maintained
in memory. Data resolution is preserved to the extent possible based upon the
size of the memory buffer and fit of the current model to the data.
Computational tests demonstrate that the scalable scheme outperforms similarly
constrained EM approaches.

Index Terms: data mining; maximum likelihood estimation; pattern clustering; probability; very large databases
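For context, the in-memory EM iteration that the paper's scalable decomposition builds on can be sketched as follows. This is a plain 1-D Gaussian-mixture EM, with none of the compression/discarding machinery that makes the authors' version single-scan; the quantile initialization and the function name are illustrative choices, not from the paper (assumes k >= 2):

```python
import math

def em_gmm_1d(data, k=2, iters=100):
    """Plain (non-scalable) EM for a 1-D Gaussian mixture: the E step computes
    per-point component responsibilities, the M step re-estimates the mixture
    weights, means and variances from those responsibilities."""
    pts = sorted(data)
    # spread the initial means over the data range (quantile initialization)
    mu = [pts[round(i * (len(pts) - 1) / (k - 1))] for i in range(k)]
    var = [1.0] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E step: responsibility of component j for each point x
        resp = []
        for x in pts:
            p = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M step: weighted re-estimates of weight, mean and variance
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(pts)
            mu[j] = sum(r[j] * x for r, x in zip(resp, pts)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, pts)) / nj, 1e-6)
    return w, mu, var
```

On two well-separated groups the fitted means settle near the group centers; the scalability problem the paper addresses is that every iteration above touches every point.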





DGLC: a density-based global logical combinatorial clustering algorithm for large mixed incomplete data

Ruiz-Shulcloper, J.; Alba-Cabrera, E.; Sanchez-Diaz, G.
Dept. of Electr. & Comput. Eng., Tennessee Univ., Knoxville, TN

This paper appears in: Geoscience and Remote Sensing Symposium, 2000. Proceedings. IGARSS 2000. IEEE 2000 International
07/24/2000 - 07/28/2000, Location: Honolulu, HI, USA
On page(s): 2846-2848 vol.7
Number of Pages: 7 vol.(clvi+3242)
INSPEC Accession Number: 6804410

Abstract:
Clustering has been widely used in areas such as pattern recognition, data
analysis and image processing. Recently, clustering algorithms have been
recognized as a powerful tool for data mining. However, the well-known
clustering algorithms offer no solution for the case of large mixed incomplete
data sets. The authors discuss the possibilities of applying the methods,
techniques and philosophy of the logical combinatorial approach to clustering in
these kinds of data sets. They present the new clustering algorithm DGLC for
discovering β0-density connected components in large mixed incomplete data sets.
This algorithm combines the ideas of logical combinatorial pattern recognition
with the density-based notion of a cluster. Finally, an example is shown to
illustrate the working of the algorithm.

Index Terms: data mining; geophysical signal processing; geophysical techniques; pattern recognition; remote sensing; terrain mapping





A new validation index for determining the number of clusters in a data set

Haojun Sun; Shengrui Wang; Qingshan Jiang
Dept. of Math. & Comput. Sci., Sherbrooke Univ., Que.

This paper appears in: Neural Networks, 2001. Proceedings. IJCNN '01. International Joint Conference on
07/15/2001 - 07/19/2001, Location: Washington, DC, USA
On page(s): 1852-1857 vol.3
References Cited: 12
Number of Pages: 4 vol. xlvi+3014
INSPEC Accession Number: 7036661

Abstract:
Clustering analysis plays an important role in solving practical problems in
such domains as data mining in large databases. In this paper, we are interested
in fuzzy c-means (FCM) based algorithms. The main purpose is to design an
effective validity function to measure the result of clustering and detect the
best number of clusters for a given data set in practical applications. After a
review of the relevant literature, we present the new validity function.
Experimental results and comparisons are given to illustrate the performance of
the new validity function.

Index Terms: data mining; neural nets; pattern clustering
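The paper's specific FCM-based validity function is not reproduced here, but the general recipe — cluster the data, then score the partition on compactness versus separation, and prefer the candidate number of clusters with the best score — can be sketched with a simple ratio index of my own choosing standing in for theirs:

```python
import math, random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over tuples of coordinates."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest current center
            groups[min(range(k), key=lambda j: math.dist(p, centers[j]))].append(p)
        # move each center to the mean of its group (keep it if the group is empty)
        centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

def validity(centers, groups):
    """Compactness / separation ratio: lower is better. A generic stand-in for
    the paper's index, usable with at least two clusters."""
    compact = sum(math.dist(p, c) for c, g in zip(centers, groups) for p in g)
    separation = min(math.dist(a, b) for i, a in enumerate(centers)
                     for b in centers[i + 1:])
    return compact / (len(groups) * separation)
```

A partition that matches the true grouping scores lower (better) than one that splits the data arbitrarily, which is exactly the behavior a validity index is designed to exploit when scanning over candidate k.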





Image data mining from financial documents based on wavelet features

El Badawy, O.; El-Sakka, M.R.; Hassanein, K.; Kamel, M.S.
Dept. of Syst. Design Eng., Waterloo Univ., Ont.

This paper appears in: Image Processing, 2001. Proceedings. 2001 International Conference on
10/07/2001 - 10/10/2001, Location: Thessaloniki, Greece
On page(s): 1078-1081 vol.1
References Cited: 11
Number of Pages: 3 vol.(lxx+1133+1108+1110)
INSPEC Accession Number: 7211050

Abstract:
We present a framework for clustering and classifying cheque images according
to their payee-line content. The features used in the clustering and
classification processes are extracted from the wavelet domain by means of
thresholding and counting of wavelet coefficients. The feasibility of this
framework is tested on a database of 2620 cheque images. This database consists
of cheques from 10 different accounts. Each account is written by a different
person. Clustering and classification are performed separately on each account
using distance-based techniques. We achieved correct-classification rates of 86%
and 81% for the supervised and unsupervised learning cases, respectively. These
rates are the average of correct-classification rates obtained from the 10
different accounts.

Index Terms: banking; cheque processing; data mining; document image processing; feature extraction; handwritten character recognition; image classification; learning (artificial intelligence); pattern clustering; unsupervised learning; wavelet transforms




Gradual clustering algorithms

Fei Wu; Gardarin, G.
PRiSM Lab., Versailles Univ., Versailles

This paper appears in: Database Systems for Advanced Applications, 2001. Proceedings. Seventh International Conference on
04/18/2001 - 04/21/2001, Location: Hong Kong, China
On page(s): 48-55
References Cited: 13
Number of Pages: xiv+362
INSPEC Accession Number: 6912924

Abstract:
Clustering is one of the important techniques in data mining. The objective of
clustering is to group objects into clusters such that objects within a cluster
are more similar to each other than objects in different clusters. The
similarity between two objects is defined by a distance function, e.g., the
Euclidean distance, which satisfies the triangle inequality. Distance
calculation is computationally very expensive, and many algorithms have been
proposed so far to solve this problem. This paper considers the gradual
clustering problem. From practice, we noticed that the user often begins
clustering on a small number of attributes, e.g., two. If the result is
partially satisfying, the user will continue clustering on a higher number of
attributes, e.g., ten. We refer to this problem as the gradual clustering
problem. In fact, gradual clustering can be considered as vertically incremental
clustering. Approaches are proposed to solve this problem. The main idea is to
reduce the number of distance calculations by using the triangle inequality. Our
method first stores in an index the distances between a representative object
and objects in n-dimensional space. Then these pre-computed distances are used
to avoid distance calculations in (n+m)-dimensional space. Two experiments on
real data sets demonstrate the added value of our approaches. The implemented
algorithms are based on the DBSCAN algorithm with an associated M-Tree as index
tree. However, the principles of our idea can well be integrated with other tree
structures such as MVP-Tree, R*-Tree, etc., and with other clustering
algorithms.

Index Terms: data mining; pattern clustering; tree data structures; very large databases
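The core trick described above — caching distances to a representative object and using the triangle inequality to skip full distance computations — can be illustrated in isolation. The function names and the single-pivot simplification below are mine, not the paper's (which works through an M-Tree):

```python
import math

def build_index(pivot, points):
    """Precompute each point's distance to a fixed pivot (representative object)."""
    return [math.dist(pivot, p) for p in points]

def range_query(q, points, pivot, pivot_dists, radius):
    """Return the points within `radius` of q, skipping a distance computation
    whenever the triangle inequality already rules a point out:
        |d(q, pivot) - d(p, pivot)| <= d(q, p)
    so if that cheap lower bound exceeds `radius`, p cannot qualify."""
    dq = math.dist(q, pivot)
    hits, computed = [], 0
    for p, dp in zip(points, pivot_dists):
        if abs(dq - dp) > radius:      # pruned without computing d(q, p)
            continue
        computed += 1
        if math.dist(q, p) <= radius:
            hits.append(p)
    return hits, computed
```

On 100 collinear points, a radius-2.5 query around the middle point computes only 5 real distances instead of 100, while returning exactly the same answer as a brute-force scan.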


A similarity-based soft clustering algorithm for documents

King-Ip Lin; Kondadadi, R.
Dept. of Math. Sci., Memphis Univ., Memphis, TN

This paper appears in: Database Systems for Advanced Applications, 2001. Proceedings. Seventh International Conference on
04/18/2001 - 04/21/2001, Location: Hong Kong, China
On page(s): 40-47
References Cited: 22
Number of Pages: xiv+362
INSPEC Accession Number: 6912923

Abstract:
Document clustering is an important tool for applications such as Web search
engines. Clustering documents enables the user to have a good overall view of
the information contained in the documents that he has. However, existing
algorithms suffer in various respects: hard clustering algorithms (where each
document belongs to exactly one cluster) cannot detect the multiple themes of a
document, while soft clustering algorithms (where each document can belong to
multiple clusters) are usually inefficient. We propose SISC (similarity-based
soft clustering), an efficient soft clustering algorithm based on a given
similarity measure. SISC requires only a similarity measure for clustering and
uses randomization to help make the clustering efficient. Comparison with
existing hard clustering algorithms like K-means and its variants shows that
SISC is both effective and efficient.

Index Terms: data mining; document handling; pattern clustering; very large databases




Clustering of web users using session-based similarity measures

Jitian Xiao; Yanchun Zhang
Sch. of Comput. & Inf. Sci., Edith Cowan Univ., Mount Lawley, WA

This paper appears in: Computer Networks and Mobile Computing, 2001. Proceedings. 2001 International Conference on
10/16/2001 - 10/19/2001, Location: Los Alamitos, CA, USA
On page(s): 223-228
References Cited: 15
Number of Pages: xii+529
INSPEC Accession Number: 7114126

Abstract:
One important research topic in web usage mining is the clustering of web users
based on their common properties. Informative knowledge obtained from web user
clusters has been used for many applications, such as the prefetching of pages
between web clients and proxies. This paper presents an approach for measuring
similarity of interests among web users from their past access behaviors. The
similarity measures are based on the user sessions extracted from the users'
access logs. A multi-level scheme for clustering a large number of web users is
proposed, as an extension to the method proposed in our previous work (2001).
Experiments were conducted, and the results obtained show that our clustering
method is capable of clustering web users with similar interests.

Index Terms: Internet; data mining; pattern clustering; user interface management systems
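The paper defines its own session-based similarity measures, which are not reproduced here. As a generic stand-in for the idea, the similarity of two users' sessions can be taken as the Jaccard overlap of the pages they visited, with a greedy single-pass grouping on top (both functions are illustrative assumptions, not the authors' multi-level scheme):

```python
def session_similarity(s1, s2):
    """Jaccard overlap of the page sets of two sessions: |A ∩ B| / |A ∪ B|."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_users(sessions, threshold=0.5):
    """Greedy single-pass grouping: a user joins the first existing group whose
    representative session is similar enough, otherwise starts a new group."""
    groups = []   # list of (representative_session, [user_ids])
    for user, pages in sessions.items():
        for rep, members in groups:
            if session_similarity(rep, pages) >= threshold:
                members.append(user)
                break
        else:
            groups.append((pages, [user]))
    return [members for _, members in groups]
```

Two users who share most of their visited URLs land in one group; a user with disjoint browsing starts a group of their own.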





A fast algorithm to cluster high dimensional basket data

Ordonez, C.; Omiecinski, E.; Ezquerra, N.
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA

This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
11/29/2001 - 12/02/2001, Location: San Jose, CA, USA
On page(s): 633-636
References Cited: 17
Number of Pages: xxi+677
INSPEC Accession Number: 7169363

Abstract:
Clustering is a data mining problem that has received significant attention
from the database community. Data set size, dimensionality and sparsity have
been identified as aspects that make clustering more difficult. The article
introduces a fast algorithm to cluster large binary data sets where data points
have high dimensionality and most of their coordinates are zero. This is the
case with basket data: transactions containing items can be represented as
sparse binary vectors with very high dimensionality. An experimental section
shows performance, advantages and limitations of the proposed approach.

Index Terms: data mining; pattern clustering; very large databases
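The key observation behind clustering sparse binary data — that the distance from a transaction to a centroid can be computed by touching only the transaction's non-zero coordinates — can be sketched as follows. This is a generic illustration, not the article's algorithm; the naive "first k baskets" initialization and the dense centroid update are simplifications of mine:

```python
def sparse_kmeans(baskets, k, n_items, iters=10):
    """K-means over sparse binary vectors (each basket is a set of item ids).
    Squared distance from basket x to centroid c expands to
        ||x - c||^2 = sum_i c_i^2 + sum_{i in x} (1 - 2 * c_i)
    so the per-basket cost is proportional to basket size, not to n_items."""
    # naive init: first k baskets as binary centroids (a real run picks seeds better)
    centers = [[1.0 if i in b else 0.0 for i in range(n_items)]
               for b in baskets[:k]]
    assign = []
    for _ in range(iters):
        norms = [sum(c_i * c_i for c_i in c) for c in centers]
        assign = []
        for b in baskets:
            dists = [norms[j] + sum(1.0 - 2.0 * centers[j][i] for i in b)
                     for j in range(k)]
            assign.append(min(range(k), key=dists.__getitem__))
        # recompute centroids as per-item frequencies within each cluster
        # (kept dense here for clarity; they could be stored sparsely too)
        for j in range(k):
            members = [b for b, a in zip(baskets, assign) if a == j]
            if members:
                centers[j] = [sum(i in b for b in members) / len(members)
                              for i in range(n_items)]
    return assign, centers
```

With two groups of baskets sharing disjoint item sets, the assignments separate the groups after the first pass.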


A scalable algorithm for clustering sequential data

Guralnik, V.; Karypis, G.
Dept. of Comput. Sci., Minnesota Univ., Minneapolis, MN

This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
11/29/2001 - 12/02/2001, Location: San Jose, CA, USA
On page(s): 179-186
References Cited: 15
Number of Pages: xxi+677
INSPEC Accession Number: 7169294

Abstract:
In recent years, we have seen an enormous growth in the amount of available
commercial and scientific data. Data from domains such as protein sequences,
retail transactions, intrusion detection, and Web logs have an inherent
sequential nature. Clustering of such data sets is useful for various purposes.
For example, clustering of sequences from commercial data sets may help
marketers identify different customer groups based upon their purchasing
patterns. Grouping protein sequences that share similar structure helps in
identifying sequences with similar functionality. Over the years, many methods
have been developed for clustering objects according to their similarity.
However, these methods tend to have a computational complexity that is at least
quadratic in the number of sequences. In this paper we present an entirely
different approach to sequence clustering that does not require an
all-against-all analysis and uses a near-linear-complexity K-means based
clustering algorithm. Our experiments using data sets derived from sequences of
purchasing transactions and protein sequences show that this approach is
scalable and leads to reasonably good clusters.

Index Terms: biology computing; computational complexity; data mining; molecular biophysics; pattern clustering; proteins; retail data processing; sequences







A hypergraph based clustering algorithm for spatial data sets

Jong-Sheng Cherng; Mei-Jung Lo
Dept. of Electr. Eng., Da Yeh Univ., Changhwa

This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
11/29/2001 - 12/02/2001, Location: San Jose, CA, USA
On page(s): 83-90
References Cited: 14
Number of Pages: xxi+677
INSPEC Accession Number: 7169282

Abstract:
Clustering is a discovery process in data mining and can be used to group
together the objects of a database into meaningful subclasses which serve as
the foundation for other data analysis techniques. The authors focus on dealing
with a set of spatial data. For spatial data, the clustering problem becomes
that of finding the densely populated regions of the space and grouping these
regions into clusters such that the intracluster similarity is maximized and
the intercluster similarity is minimized. We develop a novel hierarchical
clustering algorithm that uses a hypergraph to represent a set of spatial data.
This hypergraph is initially constructed from the Delaunay triangulation graph
of the data set and can correctly capture the relationships among sets of data
points. Two phases are developed for the proposed clustering algorithm to find
the clusters in the data set. We evaluate our hierarchical clustering algorithm
with spatial data sets which contain clusters of different sizes, shapes,
densities, and noise. Experimental results on these data sets are very
encouraging.

Index Terms: data mining; graph theory; mesh generation; pattern clustering; visual databases





A genetic rule-based data clustering toolkit

Sarafis, I.; Zalzala, A.M.S.; Trinder, P.W.
Dept. of Comput. & Electr. Eng., Heriot-Watt Univ., Edinburgh

This paper appears in: Evolutionary Computation, 2002. CEC '02. Proceedings of the 2002 Congress on
05/12/2002 - 05/17/2002, Location: Honolulu, HI, USA
On page(s): 1238-1243
References Cited: 17
IEEE Catalog Number: 02TH8600
Number of Pages: 2 vol.xxxvi+2034
INSPEC Accession Number: 7328863

Abstract:
Clustering is a hard combinatorial problem and is defined as the unsupervised
classification of patterns. The formation of clusters is based on the principle
of maximizing the similarity between objects of the same cluster while
simultaneously minimizing the similarity between objects belonging to distinct
clusters. This paper presents a tool for database clustering using a rule-based
genetic algorithm (RBCGA). RBCGA evolves individuals consisting of a fixed set
of clustering rules, where each rule includes d non-binary intervals, one for
each feature. The investigations attempt to alleviate certain drawbacks related
to the classical minimization of the square-error criterion by suggesting a
flexible fitness function which takes into consideration cluster asymmetry,
density, coverage and homogeneity.

Index Terms: data mining; database theory; genetic algorithms; knowledge based systems; least mean squares methods; pattern clustering; very large databases





Clustering in the framework of collaborative agents

Pedrycz, W.; Vukovich, G.
Dept. of Electr. & Comput. Eng., Alberta Univ., Edmonton, Alta.

This paper appears in: Fuzzy Systems, 2002. FUZZ-IEEE'02. Proceedings of the 2002 IEEE International Conference on
05/12/2002 - 05/17/2002, Location: Honolulu, HI, USA
On page(s): 134-138
Number of Pages: 2 vol.xxxi+1621
INSPEC Accession Number: 7322085

Abstract:
We are concerned with data mining in a distributed environment such as the
Internet. As sources of data are distributed across the WWW cyberspace, this
organization implies a need to develop computing agents exhibiting some form of
collaboration. We propose a model of collaborative clustering realized over a
collection of datasets in which a computing agent carries out an individual
(local) clustering process. The essence of a global search for data structures
carried out in this environment deals with a determination of crucial common
relationships occurring across the network. Depending upon the way in which
datasets are accessible and on the detailed mechanism of interaction, we
introduce the concepts of horizontal and vertical collaboration. These modes
depend upon the way in which datasets are accessed. The clustering algorithms
interact with each other by exchanging information about "local" partition
matrices. In this sense, the required communication links are established at
the level of information granules (more specifically, fuzzy sets or fuzzy
relations forming the partition matrices) rather than the data directly
available in the databases.

Index Terms: Internet; WWW cyberspace; World Wide Web; clustering; collaborative agents; data mining; data structures; dataset access methods; distributed data sources; distributed environment; fuzzy relations; fuzzy set theory; fuzzy sets; horizontal collaboration; information granules; information resources; interaction mechanism; local partition matrices; matrix algebra; multi-agent systems; pattern clustering; vertical collaboration


δ-clusters: capturing subspace correlation in a large data set

Jiong Yang; Wei Wang; Haixun Wang; Yu, P.
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY

This paper appears in: Data Engineering, 2002. Proceedings. 18th International Conference on
02/26/2002 - 03/01/2002, Location: San Jose, CA, USA
On page(s): 517-528
References Cited: 13
Number of Pages: xix+735
INSPEC Accession Number: 7254404

Abstract:
Clustering has been an active research area of great practical importance in
recent years. Most previous clustering models have focused on grouping objects
with similar values on a (sub)set of dimensions (e.g., subspace clusters) and
assumed that every object has an associated value on every dimension (e.g.,
biclusters). These existing cluster models may not always be adequate for
capturing coherence exhibited among objects. Strong coherence may still exist
among a set of objects (on a subset of attributes) even if they take quite
different values on each attribute and the attribute values are not fully
specified. This is very common in many applications including bioinformatics
analysis as well as collaborative filtering analysis, where the data may be
incomplete and subject to biases. In bioinformatics, a bicluster model has
recently been proposed to capture coherence among a subset of the attributes.
We introduce a more general model, referred to as the δ-cluster model, to
capture coherence exhibited by a subset of objects on a subset of attributes,
while allowing absent attribute values. A move-based algorithm (FLOC) is
devised to efficiently produce near-optimal clustering results. The δ-cluster
model takes the bicluster model as a special case, where the FLOC algorithm
performs far superior to the bicluster algorithm. We demonstrate the
correctness and efficiency of the δ-cluster model and the FLOC algorithm on a
number of real and synthetic data sets.

Index Terms: data mining; pattern clustering; very large databases


Binary rule generation via Hamming Clustering

Muselli, M.; Liberati, D.
Ist. per i Circuiti Elettronici, Consiglio Nazionale delle Ricerche, Genova, Italy

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on
On page(s): 1258-1268
Volume: 14, Nov/Dec 2002
ISSN: 1041-4347
INSPEC Accession Number: 7472285

Abstract:
The generation of a set of rules underlying a classification problem is
performed by applying a new algorithm called Hamming Clustering (HC). It
reconstructs the AND-OR expression associated with any Boolean function from a
training set of samples. The basic kernel of the method is the generation of
clusters of input patterns that belong to the same class and are close to each
other according to the Hamming distance. Inputs which do not influence the
final output are identified, thus automatically reducing the complexity of the
final set of rules. The performance of HC has been evaluated through a variety
of artificial and real-world benchmarks. In particular, its application in the
diagnosis of breast cancer has led to the derivation of a reduced set of rules
solving the associated classification problem.

Index Terms: Boolean functions; data mining; generalisation (artificial intelligence); learning (artificial intelligence); medical diagnostic computing; medical expert systems; pattern clustering
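The kernel described above — grouping same-class training patterns that are close in Hamming distance — can be illustrated with a toy greedy grouping. The seed-based strategy below is an assumption for illustration only; the actual HC algorithm goes on to build AND-OR expressions and prune irrelevant inputs, which this sketch does not attempt:

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    return sum(x != y for x, y in zip(a, b))

def hamming_clusters(samples, max_dist):
    """Greedy grouping of (pattern, label) samples: a pattern joins the first
    cluster whose seed has the same class label and lies within `max_dist`
    bit flips; otherwise it seeds a new cluster. (Sketch only — not HC's
    rule-extraction procedure.)"""
    clusters = []   # list of (seed_pattern, label, [member_patterns])
    for pattern, label in samples:
        for seed, c_label, members in clusters:
            if c_label == label and hamming(seed, pattern) <= max_dist:
                members.append(pattern)
                break
        else:
            clusters.append((pattern, label, [pattern]))
    return clusters
```

Patterns of the same class within one bit flip of a seed collapse into one cluster, while a same-class pattern two flips away starts a new one — the locality that HC later converts into compact rules.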





HD-Eye: visual mining of high-dimensional data

Hinneburg, A.; Keim, D.A.; Wawryniuk, M.
Inst. of Comput. Sci., Halle Univ.

This paper appears in: Computer Graphics and Applications, IEEE
On page(s): 22-31
Volume: 19, Sep/Oct 1999
ISSN: 0272-1716
References Cited: 18
CODEN: ICGADZ
INSPEC Accession Number: 6353310

Abstract:
Clustering in high-dimensional databases poses an important problem. However,
we can apply a number of different clustering algorithms to high-dimensional
data. The authors consider how an advanced clustering algorithm combined with
new visualization methods interactively clusters data more effectively.
Experiments show these techniques improve the data mining process.

Index Terms: data mining; data visualisation; very large databases


Chameleon: hierarchical clustering using dynamic modeling

Karypis, G.; Eui-Hong Han; Kumar, V.
Dept. of Comput. Sci., Minnesota Univ., Minneapolis, MN

This paper appears in: Computer
On page(s): 68-75
Volume: 32, Aug 1999
ISSN: 0018-9162
References Cited: 11
CODEN: CPTRB4
INSPEC Accession Number: 6332134

Abstract:
Clustering is a discovery process in data mining. It groups a set of data in a
way that maximizes the similarity within clusters and minimizes the similarity
between two different clusters. Many advanced algorithms have difficulty
dealing with highly variable clusters that do not follow a preconceived model.
By basing its selections on both interconnectivity and closeness, the Chameleon
algorithm yields accurate results for these highly variable clusters. Existing
algorithms use a static model of the clusters and do not use information about
the nature of individual clusters as they are merged. Furthermore, one set of
schemes (the CURE algorithm and related schemes) ignores the information about
the aggregate interconnectivity of items in two clusters. Another set of
schemes (the Rock algorithm, group averaging method, and related schemes)
ignores information about the closeness of two clusters as defined by the
similarity of the closest items across two clusters. By considering either
interconnectivity or closeness only, these algorithms can select and merge the
wrong pair of clusters. Chameleon's key feature is that it accounts for both
interconnectivity and closeness in identifying the most similar pair of
clusters. Chameleon finds the clusters in the data set by using a two-phase
algorithm. During the first phase, Chameleon uses a graph partitioning
algorithm to cluster the data items into several relatively small subclusters.
During the second phase, it uses an algorithm to find the genuine clusters by
repeatedly combining these subclusters.

Index Terms: data analysis; data mining; graph theory; pattern clustering









Reference list:

1. D. Pyle, "Data Preparation for Data Mining", Morgan Kaufmann, San Francisco, CA, 1999.

2. T. Kohonen, "Self-Organizing Maps", Springer, Berlin/Heidelberg, Germany, vol. 30, 1995.

3. J. Vesanto, "SOM-based data visualization methods", Intell. Data Anal., vol. 3, no. 2, pp. 111-126, 1999.

4. "Fuzzy Models for Pattern Recognition: Methods that Search for Structures in Data", IEEE, New York, 1992.

5. G. J. McLachlan, K. E. Basford, "Mixture Models: Inference and Applications to Clustering", Marcel Dekker, New York, vol. 84, 1987.

6. J. C. Bezdek, "Some new indexes of cluster validity", IEEE Trans. Syst., Man, Cybern. B, vol. 28, pp. 301-315, 1998.

7. M. Blatt, S. Wiseman, E. Domany, "Superparamagnetic clustering of data", Phys. Rev. Lett., vol. 76, no. 18, pp. 3251-3254, 1996.

8. G. Karypis, E.-H. Han, V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling", IEEE Comput., vol. 32, pp. 68-74, Aug. 1999.

9. J. Lampinen, E. Oja, "Clustering properties of hierarchical self-organizing maps", J. Math. Imag. Vis., vol. 2, no. 2-3, pp. 261-272, Nov. 1992.

10. E. Boudaillier, G. Hebrail, "Interactive interpretation of hierarchical clustering", Intell. Data Anal., vol. 2, no. 3, 1998.

11. J. Buhmann, H. Kühnel, "Complexity optimized data clustering by competitive neural networks", Neural Comput., vol. 5, no. 3, pp. 75-88, May 1993.

12. G. W. Milligan, M. C. Cooper, "An examination of procedures for determining the number of clusters in a data set", Psychometrika, vol. 50, no. 2, pp. 159-179, June 1985.

13. D. L. Davies, D. W. Bouldin, "A cluster separation measure", IEEE Trans. Patt. Anal. Machine Intell., vol. PAMI-1, pp. 224-227, Apr. 1979.

14. S. P. Luttrell, "Hierarchical self-organizing networks", Proc. 1st IEE Conf. Artificial Neural Networks, London, U.K., pp. 2-6, 1989.

15. A. Varfis, C. Versino, "Clustering of socio-economic data with Kohonen maps", Neural Network World, vol. 2, no. 6, pp. 813-834, 1992.

16. P. Mangiameli, S. K. Chen, D. West, "A comparison of SOM neural network and hierarchical clustering methods", Eur. J. Oper. Res., vol. 93, no. 2, Sept. 1996.

17. T. Kohonen, "Self-organizing maps: Optimization approaches", Artificial Neural Networks, Elsevier, Amsterdam, The Netherlands, pp. 981-990, 1991.

18. J. Moody, C. J. Darken, "Fast learning in networks of locally-tuned processing units", Neural Comput., vol. 1, no. 2, pp. 281-294, 1989.

19. J. A. Kangas, T. K. Kohonen, J. T. Laaksonen, "Variants of self-organizing maps", IEEE Trans. Neural Networks, vol. 1, pp. 93-99, Mar. 1990.

20. T. Martinetz, K. Schulten, "A 'neural-gas' network learns topologies", Artificial Neural Networks, Elsevier, Amsterdam, The Netherlands, pp. 397-402, 1991.

21. B. Fritzke, "Let it grow: Self-organizing feature maps with problem dependent cell structure", Artificial Neural Networks, Elsevier, Amsterdam, The Netherlands, pp. 403-408, 1991.

22. Y. Cheng, "Clustering with competing self-organizing maps", Proc. Int. Conf. Neural Networks, vol. 4, pp. 785-790, 1992.

23. B. Fritzke, "Growing grid: A self-organizing network with constant neighborhood range and adaptation strength", Neural Process. Lett., vol. 2, no. 5, pp. 9-13, 1995.

24. J. Blackmore, R. Miikkulainen, "Visualizing high-dimensional structure with the incremental grid growing neural network", Proc. 12th Int. Conf. Machine Learning, pp. 55-63, 1995.

25. S. Jockusch, "A neural network which adapts its structure to a given set of patterns", Parallel Processing in Neural Systems and Computers, Elsevier, Amsterdam, The Netherlands, pp. 169-172, 1990.

26. J. S. Rodrigues, L. B. Almeida, "Improving the learning speed in topological maps of patterns", Proc. Int. Neural Network Conf., pp. 813-816, 1990.

27. D. Alahakoon, S. K. Halgamuge, "Knowledge discovery with supervised and unsupervised self evolving neural networks", Proc. Fifth Int. Conf. Soft Comput. Inform./Intell. Syst., pp. 907-910, 1998.

28. A. Gersho, "Asymptotically optimal block quantization", IEEE Trans. Inform. Theory, vol. IT-25, pp. 373-380, July 1979.

29. P. L. Zador, "Asymptotic quantization error of continuous signals and the quantization dimension", IEEE Trans. Inform. Theory, vol. IT-28, pp. 139-149, Mar. 1982.

30. H. Ritter, "Asymptotic level density for a class of vector quantization processes", IEEE Trans. Neural Networks, vol. 2, pp. 173-175, Jan. 1991.

31. T. Kohonen, "Comparison of SOM point densities based on different criteria", Neural Comput., vol. 11, pp. 2171-2185, 1999.

32. R. D. Lawrence, G. S. Almasi, H. E. Rushmeier, "A scalable parallel algorithm for self-organizing maps with applications to sparse data problems", Data Mining Knowl. Discovery, vol. 3, no. 2, pp. 171-195, June 1999.

33. T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela, "Self organization of a massive document collection", IEEE Trans. Neural Networks, vol. 11, pp. XXX-XXX, May 2000.

34. P. Koikkalainen, "Fast deterministic self-organizing maps", Proc. Int. Conf. Artificial Neural Networks, vol. 2, pp. 63-68, 1995.

35. A. Ultsch, H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis", Proc. Int. Neural Network Conf., Dordrecht, The Netherlands, pp. 305-308, 1990.

36. M. A. Kraaijveld, J. Mao, A. K. Jain, "A nonlinear projection method based on Kohonen's topology preserving maps", IEEE Trans. Neural Networks, vol. 6, pp. 548-559, May 1995.

37. J. Iivarinen, T. Kohonen, J. Kangas, S. Kaski, "Visualizing the clusters on the self-organizing map", Proc. Conf. Artificial Intell. Res. Finland, Helsinki, Finland, pp. 122-126, 1994.

38. X. Zhang, Y. Li, "Self-organizing map as a new method for clustering and data analysis", Proc. Int. Joint Conf. Neural Networks, pp. 2448-2451, 1993.

39. J. B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis", Psychometrika, vol. 29, no. 1, pp. 1-27, Mar. 1964.

40. J. B. Kruskal, "Nonmetric multidimensional scaling: A numerical method", Psychometrika, vol. 29, no. 2, pp. 115-129, June 1964.

41. J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis", IEEE Trans. Comput., vol. C-18, pp. 401-409, May 1969.

42. P. Demartines, J. Hérault, "Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets", IEEE Trans. Neural Networks, vol. 8, pp. 148-154, Jan. 1997.

43. A. Varfis, "On the use of two traditional statistical techniques to improve the readability of Kohonen Maps", Proc. NATO ASI Workshop Statistics Neural Networks, 1993.

44. S. Kaski, J. Venna, T. Kohonen, "Coloring that reveals high-dimensional structures in data", Proc. 6th Int. Conf. Neural Inform. Process., pp. 729-734, 1999.

45. F. Murtagh, "Interpreting the Kohonen self-organizing map using contiguity-constrained clustering", Patt. Recognit. Lett., vol. 16, pp. 399-408, 1995.

46. O. Simula, J. Vesanto, P. Vasara, R.-R. Helminen, "Industrial Applications of Neural Networks", CRC, Boca Raton, FL, pp. 87-112, 1999.

47. J. Vaisey, A. Gersho, "Simulated annealing and codebook design", Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 1176-1179, 1988.

48. J. K. Flanagan, D. R. Morrell, R. L. Frost, C. J. Read, B. E. Nelson, "Vector quantization codebook generation using simulated annealing", Proc. Int. Conf. Acoust., Speech, Signal Process., vol. 3, pp. 1759-1762, 1989.

49. T. Graepel, M. Burger, K. Obermayer, "Phase transitions in stochastic self-organizing maps", Phys. Rev. E, vol. 56, pp. 3876-3890, 1997.







Redefining clustering for high-dimensional applications

Aggarwal, C.C.; Yu, P.S.

IBM Thomas J. Watson Res. Center, Yorktown Heights, NY

This paper appears in:
Knowledge and Data Engineering, IEEE Transactions on
On page(s): 210-225
Volume: 14, Mar/Apr 2002
ISSN: 1041-4347
References Cited: 39
CODEN: ITKEEH
INSPEC Accession Number: 7224458


Abstract:

Clustering problems are well known in the database literature for their use in
numerous applications, such as customer segmentation, classification, and trend
analysis. High-dimensional data has always been a challenge for clustering
algorithms because of the inherent sparsity of the points. Recent research results
indicate that, in high-dimensional data, even the concept of proximity or
clustering may not be meaningful. We introduce a very general concept of
projected clustering which is able to construct clusters in arbitrarily aligned
subspaces of lower dimensionality. The subspaces are specific to the clusters
themselves. This definition is substantially more general and realistic than the
currently available techniques, which limit the method to only projections from
the original set of attributes. The generalized projected clustering technique may
also be viewed as a way of trying to redefine clustering for high-dimensional
applications by searching for hidden subspaces with clusters which are created by
interattribute correlations. We provide a new concept of using extended cluster
feature vectors in order to make the algorithm scalable for very large databases.
The running time and space requirements of the algorithm are adjustable and are
likely to trade off with better accuracy.
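The notion of clusters that live in arbitrarily oriented low-dimensional subspaces can be illustrated with a crude alternating scheme, in the spirit of oriented projected clustering: assign each point to the cluster whose affine subspace is nearest, then re-fit each cluster's subspace by PCA on its members. This is only a loose sketch of the subspace idea, not the paper's actual algorithm or its extended cluster feature vectors; the toy data and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two line-shaped clusters in 3-D, each dense along its own
# arbitrarily oriented 1-D subspace (neither is axis-parallel).
t = rng.uniform(-1, 1, 200)
c1 = np.c_[t, t, 0.05 * rng.normal(size=200)]
c2 = np.c_[t, -t, 2 + 0.05 * rng.normal(size=200)]
data = np.vstack([c1, c2])

k, l = 2, 1                        # k clusters, each with an l-dim subspace
means = data[rng.choice(len(data), k, replace=False)]
bases = [np.eye(3)[:, :l] for _ in range(k)]   # initial subspace bases

for _ in range(20):
    # Assign each point to the cluster whose affine subspace is nearest,
    # measuring distance as the residual left after projecting onto it.
    dists = []
    for m, B in zip(means, bases):
        d = data - m
        resid = d - d @ B @ B.T
        dists.append((resid ** 2).sum(1))
    labels = np.argmin(np.array(dists), axis=0)
    # Re-fit each cluster's mean and top-l principal directions.
    for c in range(k):
        pts = data[labels == c]
        if len(pts) > l:
            means[c] = pts.mean(0)
            _, _, vt = np.linalg.svd(pts - means[c], full_matrices=False)
            bases[c] = vt[:l].T
```

The key property is that each cluster carries its own basis, so tilted structures that no single axis-parallel projection reveals can still be modeled.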


Index Terms:

data mining; pattern clustering; very large databases