Support Vector Machine Classification Based on Fuzzy Clustering for Large Data Sets

Jair Cervantes¹, Xiaoou Li¹, and Wen Yu²

¹ Sección de Computación, Departamento de Ingeniería Eléctrica, CINVESTAV-IPN,
  A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
² Departamento de Control Automático, CINVESTAV-IPN,
  A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
  yuw@ctrl.cinvestav.mx
Abstract. Support vector machines (SVM) have been successfully applied to a large number of classification problems. Despite their solid theoretical foundations and good generalization capability, training on large data sets remains a major challenge due to the training complexity, high memory requirements and slow convergence. In this paper we present a new method, SVM classification based on fuzzy clustering. Before applying SVM we use fuzzy clustering; at this stage the optimal number of clusters is not needed, which keeps the computational cost low, since we only need a rough partition of the training data set. SVM classification is first performed on the cluster centers; then de-clustering and SVM classification on the reduced data are applied. The proposed approach scales to large data sets with high classification accuracy and fast convergence. Empirical studies show that the proposed approach achieves good performance on large data sets.
1 Introduction
The digital revolution has made data capture easy and its storage practically free. As a consequence, enormous quantities of high-dimensional data are continuously stored in databases, and semi-automatic methods for classification from databases are necessary. Support vector machine (SVM) is a powerful technique for classification and regression. Training an SVM is usually posed as a quadratic programming (QP) problem to find a separating hyperplane, which involves a dense n × n matrix, where n is the number of points in the data set. For large data sets this requires huge amounts of computational time and memory, so the training complexity of SVM is highly dependent on the size of the data set [1][19]. Many efforts have been made on the classification of large data sets. Sequential Minimal Optimization (SMO) [12] transforms the large QP problem into a series of small QP problems, each involving only two variables [4][6][15]. In [11], boosting was applied to Platt's SMO algorithm, and the resulting Boost-SMO method was used for speeding up and scaling up SVM training. [18] discusses large-scale approximations for Bayesian inference for LS-SVMs. The results of [7] demonstrate that
a fair computational advantage can be obtained by using a recursive strategy for large data sets, such as those involved in data mining and text categorization applications. Vector quantization is applied in [8] to reduce a large data set by replacing examples with prototypes, which greatly reduces the training time for choosing optimal parameters. [9] proposes an approach based on an incremental learning technique and a multiple proximal support vector machine classifier. Random selection [2][14][16] tries to select data such that learning is maximized. However, it can over-simplify the training data set and lose the benefits of SVM, especially if the probability distributions of the training data and the testing data are different.
On the other hand, unsupervised classification, called clustering, is the classification of similar objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. The goal of clustering is to separate a finite unlabeled data set into a finite and discrete set of "natural," hidden data structures [13]. Some results [1][5][19] show that clustering techniques can help decrease the complexity of SVM training, but they need extra computations to build the hierarchical structure.
In this paper we propose a new approach for the classification of large data sets, named SVM classification based on fuzzy clustering. To the best of our knowledge, SVM classification based on fuzzy clustering has not yet been established in the literature.
In the partitioning step, the number of clusters is pre-defined to avoid the computational cost of determining the optimal number of clusters; we only need a rough partition of the training data set, from which we exclude the clusters with low probability of containing support vectors. The obtained clusters are classified as mixed category or uniform category; we extract support vectors with SVM and form a reduced set of clusters. We then de-cluster the reduced clusters and obtain a subset of the original data. Finally, we apply SVM again and finish the classification. Experiments are given to show the effectiveness of the new approach. The structure of the paper is organized as follows: after the introduction of SVM and fuzzy clustering in Section 2, we introduce SVM classification based on fuzzy clustering in Section 3. Section 4 presents experimental results on artificial and real data sets. We conclude our study in Section 5.
2 Support Vector Machine for Classification and Fuzzy Clustering
Assume that a training set X is given as

  (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)        (1)

i.e., X = {x_i, y_i}_{i=1}^{n}, where x_i ∈ R^d and y_i ∈ {+1, −1}. Training an SVM amounts to solving the following quadratic programming problem:
  max_{α}  −(1/2) Σ_{i,j=1}^{l} α_i y_i α_j y_j K(x_i, x_j) + Σ_{i=1}^{l} α_i

subject to:

  Σ_{i=1}^{l} α_i y_i = 0,  C ≥ α_i ≥ 0,  i = 1, 2, ..., l

where C > 0 and α = [α_1, α_2, ..., α_l]^T, α_i ≥ 0, i = 1, 2, ..., l, are the coefficients corresponding to the x_i. An x_i with nonzero α_i is called a support vector (SV). The function K is the Mercer kernel, which must satisfy the Mercer condition [17].
Let S be the index set of the SVs; then the optimal hyperplane is

  Σ_{i∈S} α_i y_i K(x_i, x) + b = 0

and the optimal decision function is defined as

  f(x) = sign( Σ_{i∈S} α_i y_i K(x_i, x) + b )        (2)

where x is the input to be classified, α_i are the Lagrange multipliers and y_i the class labels. A new object x can be classified using (2). The vectors x_i appear only inside inner products. There is one Lagrange multiplier α_i for each training point. When the maximum-margin hyperplane is found, only the points closest to the hyperplane satisfy α_i > 0; these points are called support vectors (SV), while all other points satisfy α_i = 0.
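As a small illustration of (2), the following scikit-learn sketch (our own example, not from the paper; SVC and its attributes are standard scikit-learn API) trains a linear SVM and classifies a new point via the sign of the decision function; only the support vectors, the points with nonzero α_i, determine the result.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable toy labels

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print(clf.support_vectors_.shape[0])         # number of points with alpha_i > 0
print(clf.predict([[0.5, 0.5]]))             # class of a new object x, per (2)
```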
Clustering essentially deals with the task of splitting a set of patterns into a number of more-or-less homogeneous classes (clusters) with respect to a suitable similarity measure, such that the patterns belonging to any one cluster are similar and the patterns of different clusters are as dissimilar as possible.
Let us formulate the fuzzy clustering problem as follows: consider a finite set of elements X = {x_1, x_2, ..., x_n} of dimension d in the Euclidean space R^d, i.e., x_j ∈ R^d, j = 1, 2, ..., n. The problem is to partition these data into k fuzzy sets with respect to a given criterion, which is usually the optimization of an objective function. The result of fuzzy clustering can be expressed by a partition matrix U = [u_{ij}], i = 1, ..., k, j = 1, ..., n, where u_{ij} is a numeric value in [0, 1]. There are two constraints on the values of u_{ij}. First, the total membership of each element x_j ∈ X over all classes equals 1. Second, every constructed cluster is non-empty and different from the entire set, i.e.,

  Σ_{i=1}^{k} u_{ij} = 1,  for all j = 1, 2, ..., n,
  0 < Σ_{j=1}^{n} u_{ij} < n,  for all i = 1, 2, ..., k.        (3)
A general form of the objective function is

  J(u_{ij}, v_i) = Σ_{i=1}^{k} Σ_{j=1}^{n} g[w(x_j), u_{ij}] d(x_j, v_i)
where w(x_j) is the a priori weight of each x_j, and d(x_j, v_i) is the degree of dissimilarity between the data point x_j and the supplemental element v_i, which can be considered the central vector of the i-th cluster. The degree of dissimilarity is defined as a measure that satisfies two axioms: 1) d(x_j, v_i) ≥ 0, and 2) d(x_j, v_i) = d(v_i, x_j).
Fuzzy clustering can thus be formulated as an optimization problem:

  min J(u_{ij}, v_i),  i = 1, 2, ..., k;  j = 1, 2, ..., n,
  subject to (3).
Here the objective function is

  J(u_{ij}, v_i) = Σ_{i=1}^{k} Σ_{j=1}^{n} u_{ij}^m ‖x_j − v_i‖²,  m > 1        (4)
where m is called the exponential weight; it influences the degree of fuzziness of the membership function. To solve the minimization problem, we differentiate the objective function (4) with respect to v_i (for fixed u_{ij}, i = 1, ..., k, j = 1, ..., n) and with respect to u_{ij} (for fixed v_i, i = 1, ..., k), and apply the conditions of (3), obtaining
  v_i = [ Σ_{j=1}^{n} (u_{ij})^m x_j ] / [ Σ_{j=1}^{n} (u_{ij})^m ],  i = 1, ..., k        (5)
  u_{ij} = (1/‖x_j − v_i‖²)^{1/(m−1)} / Σ_{l=1}^{k} (1/‖x_j − v_l‖²)^{1/(m−1)}        (6)
where i = 1, ..., k and j = 1, ..., n. The system described by (5) and (6) cannot be solved analytically. However, the following fuzzy clustering algorithm provides an iterative approach:

Step 1: Select the number of clusters k (2 ≤ k ≤ n) and the exponential weight m (1 < m < ∞). Choose an initial partition matrix U^(0) and a termination criterion ε.

Step 2: Calculate the fuzzy cluster centers {v_i^(l) | i = 1, 2, ..., k} using U^(l) and (5).

Step 3: Calculate the new partition matrix U^(l+1) using {v_i^(l) | i = 1, 2, ..., k} and (6).

Step 4: Calculate Δ = ‖U^(l+1) − U^(l)‖ = max_{i,j} |u_{ij}^(l+1) − u_{ij}^(l)|. If Δ > ε, set l = l + 1 and go to Step 2. If Δ ≤ ε, stop.
The iterative procedure described above minimizes the objective function (4) and converges to one of its local minima.
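To make the iteration concrete, here is a minimal NumPy sketch of Steps 1-4 using the update equations (5) and (6); the function name, its default parameters and the random initialization are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """X: (n, d) data matrix; k: number of clusters; m: exponential weight."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: random initial partition matrix U^(0) satisfying (3).
    U = rng.random((k, n))
    U /= U.sum(axis=0, keepdims=True)            # each column sums to 1
    for _ in range(max_iter):
        # Step 2: fuzzy cluster centers, equation (5).
        Um = U ** m                               # (k, n)
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Step 3: new partition matrix, equation (6).
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # squared distances, (k, n)
        d2 = np.maximum(d2, 1e-12)                # guard against division by zero
        inv = (1.0 / d2) ** (1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Step 4: termination test on the maximum membership change.
        delta = np.abs(U_new - U).max()
        U = U_new
        if delta <= eps:
            break
    return V, U
```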
3 SVM Classification Based on Fuzzy Clustering
Assume that we have an input-output data set. The task is to find the set of support vectors from the input-output data that maximizes the margin between the classes. From the above discussion, we need to

– eliminate the input data subsets that are far away from the decision hyperplane;
– find the support vectors from the reduced data set.

Our approach consists of the following steps.
3.1 Fuzzy Clustering: Sectioning the Input Data Space into Clusters
According to (1), let X = {x_1, x_2, ..., x_n} be the set of n input data, where each data point x_i can be considered as a point represented by a d-dimensional vector

  x_i = (w_{i1}, w_{i2}, ..., w_{id}),  1 ≤ i ≤ n,

where w_{ij} (0 < w_{ij} < 1, 1 ≤ j ≤ d) denotes the weight of the j-th attribute in x_i. Assume that we divide the original input data set into k clusters, where k > 2. Note that the number of clusters must be strictly greater than 2, because we want to reduce the original input data set by eliminating the clusters far from the decision hyperplane. If k = 2, the input data set would simply be split into the number of existing classes and no data could be eliminated.
Fuzzy clustering yields the k cluster centers together with the membership grade of each data point in each cluster by minimizing the objective function (4), where u_{ij}^m denotes the membership grade of x_j in cluster A_i, v_i denotes the center of A_i, and ‖x_j − v_i‖ denotes the distance of x_j to the center v_i. Both x_j and v_i are vectors of dimension d. As before, m influences the degree of fuzziness of the membership function. Note that the total membership of each element x_j ∈ X over all classes equals 1, i.e., Σ_{i=1}^{k} u_{ij} = 1 for 1 ≤ j ≤ n. The membership grade of x_j in A_i is calculated as in (6), and the cluster center v_i of A_i as in (5). The complete fuzzy clustering algorithm is given by Steps 1-4 of Section 2.
3.2 Classification of Cluster Centers Using SVM
Let (X, Y) be the set of training patterns, where X = {x_1, ..., x_j, ..., x_n} and Y = {y_1, ..., y_j, ..., y_n} with y_j ∈ {−1, 1}, and x_j = (x_{j1}, x_{j2}, ..., x_{jd})^T ∈ R^d, each component x_{ji} being a characteristic (attribute, dimension or variable). The fuzzy clustering process finds k partitions of X, C = {C_1, ..., C_k} (k < n), such that a) ∪_{i=1}^{k} C_i = X and b) C_i ≠ ∅, i = 1, ..., k, where

  C_i = {∪ x_j | y_j ∈ {−1, 1}},  i = 1, ..., k        (7)
Fig. 1. Uniform clusters and mixed clusters
That is, regardless of the membership grade each element receives in the clustering, each element keeps its original class label, as expressed by (7). The elements of an obtained cluster may all belong to a single class (as can be seen in Figure 1(b), where clusters 1, 2, 3, 4, 7 and 8 each contain elements of only one class); this type of cluster is defined as uniform:

  C^u = {∪ x_j | y_j = −1}  or  C^u = {∪ x_j | y_j = +1}.

In the contrary case, a cluster contains elements of both classes (see Figure 1(b), where clusters 5 and 6 contain elements of mixed class); this type of cluster is defined as mixed:

  C^m = {∪ x_j | y_j ∈ {−1, +1}},  with at least one element of each class.
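As a sketch of this step (the helper name and the hard-assignment rule are our own assumptions; the paper does not prescribe them), one can assign each point to its highest-membership cluster and call a cluster mixed exactly when it contains both labels:

```python
import numpy as np

def split_clusters(U, y):
    """U: (k, n) fuzzy partition matrix; y: (n,) labels in {-1, +1}.
    Returns hard assignments plus the ids of uniform and mixed clusters."""
    hard = U.argmax(axis=0)                  # highest-membership cluster per point
    uniform_ids, mixed_ids = [], []
    for i in range(U.shape[0]):
        labels = set(y[hard == i].tolist())
        # empty clusters fall into uniform_ids and are filtered out later
        (mixed_ids if labels == {-1, 1} else uniform_ids).append(i)
    return hard, uniform_ids, mixed_ids
```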
Fig. 2. Reduction of the input data set: (a) clusters with mixed elements, (b) clusters close to the separating hyperplane
We identify the clusters with elements of mixed category and the clusters with elements of uniform category. The clusters with mixed membership are set apart for later evaluation (see Figure 2(a)), because they contain the elements most likely to be support vectors. Each of the remaining clusters has a center and a uniform class label (see Figure 2(b)); with these cluster centers and their labels, we use SVM to find a decision hyperplane, which is defined by its support vectors. The support vectors found are then separated from the rest, as shown in Figure 3(b).
Fig. 3. Reduction of the input data set using partitioning clustering: (a) clusters of the original data set, (b) set of reduced clusters
3.3 De-clustering: Getting a Data Subset
The data subset obtained from the original data set given in (1) is formed by

  ∪_{i=1}^{l} C_i^m,  l ≤ k,   and   ∪_{i=1}^{p} C_i^u,  C_i^u ∈ svc,  p ≤ k,

where svc is defined as the set of uniform clusters whose centers are support vectors, and C^m and C^u denote the clusters with mixed and uniform elements, respectively. In Figure 3(b), the set of uniform clusters is represented by clusters 3, 4 and 5; note that the centers of these uniform clusters are support vectors (svc). The set of clusters with mixed membership is represented by clusters 1 and 2. At this point, the original data set has been reduced by keeping the clusters close to the optimal decision hyperplane and eliminating the clusters far away from it. However, the subset obtained is still formed of clusters, so we must de-cluster it to recover the individual elements.

The data subset obtained (X', Y') has the following characteristics:

  1) X' ⊂ X
  2) X' ∩ X = X',  X' ∪ X = X        (8)

These characteristics hold for both X and Y, i.e., the data set obtained is a subset of the original data set.
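A minimal sketch of the de-clustering step under the same assumptions (svc_ids, the ids of the uniform clusters whose centers turned out to be support vectors, comes from the center-level SVM of Section 3.2):

```python
import numpy as np

def decluster(hard, mixed_ids, svc_ids, X, y):
    """Keep every mixed cluster plus the uniform clusters whose centers
    were support vectors (svc_ids), and recover their member elements."""
    keep = np.isin(hard, list(mixed_ids) + list(svc_ids))
    return X[keep], y[keep]                  # (X', Y'), a subset of (X, Y)
```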
3.4 Classification of Reduced Data Using SVM
In this step, we apply SVM to the data subset obtained in steps 1, 2 and 3. Since the original data set was significantly reduced in the previous steps by eliminating the clusters whose centers are far from the optimal decision hyperplane, the training time at this stage is much smaller than the training time on the original data set.
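Putting the four steps together, a hedged end-to-end sketch might look as follows; fuzzy_c_means, split_clusters and decluster are the illustrative helpers sketched earlier, and the linear kernel and default parameters are our assumptions, not the paper's prescription.

```python
import numpy as np
from sklearn.svm import SVC

def fcm_svm(X, y, k=50, m=2.0):
    # 3.1: rough fuzzy partition; the optimal k is not needed.
    V, U = fuzzy_c_means(X, k, m)
    hard, uniform_ids, mixed_ids = split_clusters(U, y)
    # 3.2: SVM over the centers of the non-empty uniform clusters, each
    # labeled by the single class of its members; keep centers that are SVs.
    ids = [i for i in uniform_ids if np.any(hard == i)]
    Vc = V[ids]
    yc = np.array([y[hard == i][0] for i in ids])
    center_svm = SVC(kernel="linear").fit(Vc, yc)   # assumes both classes present
    svc_ids = [ids[j] for j in center_svm.support_]
    # 3.3: de-cluster back to individual training points.
    Xr, yr = decluster(hard, mixed_ids, svc_ids, X, y)
    # 3.4: final SVM on the much smaller reduced set.
    return SVC(kernel="linear").fit(Xr, yr)
```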
4 Experimental Results
To test and demonstrate the proposed technique for large data set classification, several experiments were designed and performed; the results are reported and discussed in this section.
First, let us consider a simple classification case. We generate a set of 40,000 random data points without specifying the ranges of random generation. The data set is two-dimensional and is denoted by X. A record r_i is labeled "+" if wx + b > th and "−" if wx + b < th, where th is a threshold, w is the weight vector associated with the input data and b is the bias. In this way, the data set is linearly separable.
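A small NumPy sketch of this generation scheme (the ranges, w, b and th below are assumed values, since the paper leaves them unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40000, 2))  # 40,000 two-dimensional points
w, b, th = np.array([1.0, -2.0]), 0.1, 0.0   # assumed weight, bias, threshold
y = np.where(X @ w + b > th, 1, -1)          # "+" if wx + b > th, "-" otherwise
```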
Figure 4 shows the input data set with 10000 points (see (a)), 1000 points (see (b)) and 100 points (see (c)). We now use fuzzy clustering to reduce the data set. After the first three steps of the proposed algorithm (Sections 3.1, 3.2 and 3.3), the input data sets are reduced to 337, 322 and 47 points respectively; see (d), (e) and (f) in Figure 4.
Fig. 4. Original data sets and reduced data sets with 10^4 data
We use a normal SVM with a linear kernel. The performance comparison between the SVM based on fuzzy clustering and the normal SVM is shown in Table 1. Here "%" represents the percentage of the original data set, "#" the number of data points, "V_S" the number of support vectors, "t" the total experimental time in seconds, "k" the number of clusters, and "Acc" the accuracy. Note that the data reduction is driven by the distance between the cluster centers and the optimal decision hyperplane, which can cause some support vectors to be eliminated. In spite of this, the SVM based on fuzzy clustering performs better, and its training time is small compared with the normal SVM.
Table 1. Comparison between the SVM based on fuzzy clustering and normal SVM with 10^4 data

                    normal SVM               SVM based on fuzzy clustering
  %        #        V_S   t        Acc       V_S   t        Acc      k
  0.125%   500      5     96.17    99.2%     4     3.359    99.12%   10
  0.25%    1000     5     1220     99.8%     4     26.657   99.72%   20
  25%      10000    -     -        -         3     114.78   99.85%   50
  50%      20000    -     -        -         3     235.341  99.88%   100
  75%      30000    -     -        -         3     536.593  99.99%   150
  100%     40000    -     -        -         3     985.735  99.99%   200
Second, we generated a set of 100,000 random data points, with the range and radius of random generation specified as in [19]. Figure 5 shows the input data set with 1000 data points (see (a)), 10000 (see (b)) and 50000 (see (c)). After fuzzy clustering, the input data sets are reduced to 257, 385 and 652 points respectively; see (d), (e) and (f) in Figure 5. For testing, we generated sets of 56,000 and 109,000 data points for the first example (Figure 4) and the second example (Figure 5) respectively. The performance comparison between the SVM based on fuzzy clustering and the normal SVM is shown in Table 2.
Fig. 5. Classification results with 10^5 data
Table 2. Comparison between the SVM based on fuzzy clustering and normal SVM with 10^5 data

                    normal SVM               SVM based on fuzzy clustering
  %        #        V_S   t        Acc       V_S   t         Acc      k
  0.25%    2725     7     3554     99.80%    6     36.235    99.74%   20
  25%      27250    -     -        -         4     221.297   99.88%   50
  50%      54500    -     -        -         4     922.172   99.99%   100
  75%      109000   -     -        -         4     2436.372  99.99%   150
Third, we use the Waveform data set [3]. The data set has 5000 waves and three classes, where the classes are generated from a combination of two or three basic waves. Each record has 22 attributes with continuous values between 0 and 6. In this example we use a normal SVM with an RBF kernel. Table 3 shows the training times for different data sizes. The performance of the SVM based on fuzzy clustering is good: the margin of the optimal decision is almost the same as with the normal SVM, while the training time is much smaller.
Table 3. Comparison between the SVM based on fuzzy clustering and normal SVM for the Waveform data set

                 normal SVM               SVM based on fuzzy clustering
  %      #       V_S   t       Acc        V_S   t       Acc      k
  8%     400     168   35.04   88.12%     162   22.68   87.3%    40
  12%    600     259   149.68  88.12%     244   47.96   87.6%    60
  16%    800     332   444.45  88.12%     329   107.26  88.12%   80
  20%    1000    433   1019.2  88.12%     426   194.70  88.12%   100
  24%    1200    537   2033.2  -          530   265.91  88.12%   120
5 Conclusion
In this paper, we developed a new classification method for large data sets that takes advantage of both fuzzy clustering and SVM. The proposed algorithm shares an idea with sequential minimal optimization (SMO): in order to work with large data sets, we partition the original data set into several clusters and thereby reduce the size of the QP problems. The experimental results show that the number of support vectors obtained with SVM classification based on fuzzy clustering is similar to that of the normal SVM approach, while the training time is significantly smaller.
References
1. Awad M., Khan L., Bastani F. and Yen I.L., "An Effective Support Vector Machine (SVMs) Performance Using Hierarchical Clustering," Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04), November 15-17, 2004, pp. 663-667.
2. Balcazar J.L., Dai Y. and Watanabe O., "Provably Fast Training Algorithms for Support Vector Machines," in Proc. of the 1st IEEE Int. Conf. on Data Mining, IEEE Computer Society, 2001, pp. 43-50.
3. Chang C.-C. and Lin C.-J., LIBSVM: a library for support vector machines, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
4. Collobert R. and Bengio S., "SVMTorch: Support Vector Machines for Large-Scale Regression Problems," Journal of Machine Learning Research, 1:143-160, 2001.
5. Boley D. and Cao D., "Training Support Vector Machines Using Adaptive Clustering," in Proc. of the SIAM Int. Conf. on Data Mining 2004, Lake Buena Vista, FL, USA.
6. Joachims T., "Making Large-Scale Support Vector Machine Learning Practical," in B. Scholkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1998.
7. Kim S.W. and Oommen B.J., "Enhancing Prototype Reduction Schemes with Recursion: A Method Applicable for 'Large' Data Sets," IEEE Transactions on Systems, Man and Cybernetics, Part B, Volume 34, Issue 3, pp. 1384-1397, 2004.
8. Lebrun G., Charrier C. and Cardot H., "SVM Training Time Reduction Using Vector Quantization," Proceedings of the 17th International Conference on Pattern Recognition, Volume 1, pp. 160-163, 2004.
9. Li K. and Huang H.K., "Incremental Learning Proximal Support Vector Machine Classifiers," Proceedings of the 2002 International Conference on Machine Learning and Cybernetics, Volume 3, pp. 1635-1637, 2002.
10. Luo F., Khan L., Bastani F., Yen I. and Zhou J., "A Dynamical Growing Self-Organizing Tree (DGSOT) for Hierarchical Clustering of Gene Expression Profiles," Bioinformatics, Volume 20, Issue 16, pp. 2605-2617, 2004.
11. Pavlov D., Mao J. and Dom B., "Scaling-Up Support Vector Machines Using Boosting Algorithm," Proceedings of the 15th International Conference on Pattern Recognition, Volume 2, pp. 219-222, 2000.
12. Platt J., "Fast Training of Support Vector Machines Using Sequential Minimal Optimization," in B. Scholkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1998.
13. Xu R. and Wunsch D. II, "Survey of Clustering Algorithms," IEEE Transactions on Neural Networks, Volume 16, Issue 3, pp. 645-678, May 2005.
14. Schohn G. and Cohn D., "Less is More: Active Learning with Support Vector Machines," in Proc. 17th Int. Conf. on Machine Learning, Stanford, CA, 2000.
15. Shih L., Rennie D.M., Chang Y. and Karger D.R., "Text Bundling: Statistics-Based Data Reduction," in Proc. of the 20th Int. Conf. on Machine Learning (ICML-2003), Washington, DC, 2003.
16. Tong S. and Koller D., "Support Vector Machine Active Learning with Applications to Text Classification," in Proc. 17th Int. Conf. on Machine Learning, Stanford, CA, 2000.
17. Vapnik V., The Nature of Statistical Learning Theory, Springer, N.Y., 1995.
18. Van Gestel T., Suykens J.A.K., De Moor B. and Vandewalle J., "Bayesian Inference for LS-SVMs on Large Data Sets Using the Nystrom Method," Proceedings of the 2002 International Joint Conference on Neural Networks, Volume 3, pp. 2779-2784, 2002.
19. Yu H., Yang J. and Han J., "Classifying Large Data Sets Using SVMs with Hierarchical Clusters," in Proc. of the 9th ACM SIGKDD, August 24-27, 2003, Washington, DC, USA.
20. Xu R. and Wunsch D., "Survey of Clustering Algorithms," IEEE Trans. Neural Networks, vol. 16, pp. 645-678, 2005.