CBC: Clustering Based Text Classification
Requiring Minimal Labeled Data

Hua-Jun Zeng¹, Xuan-Hui Wang², Zheng Chen¹, Wei-Ying Ma¹

¹ Microsoft Research Asia, Beijing, P. R. China
{i-hjzeng, zhengc, wyma}@microsoft.com

² University of Science and Technology of China, Hefei, Anhui, P. R. China
xhwang6@mail.ustc.edu.cn


Abstract. Semi-supervised learning methods construct classifiers using both labeled and unlabeled training data samples. While unlabeled data samples can help to improve the accuracy of trained models to a certain extent, existing methods still face difficulties when the labeled data is insufficient and biased against the underlying data distribution. In this paper, we present a clustering based classification (CBC) approach. Under this approach, the training data, including both the labeled and unlabeled data, is first clustered with the guidance of the labeled data. Some of the unlabeled data samples are then labeled based on the resulting clusters. Discriminative classifiers can subsequently be trained with the expanded labeled dataset. The effectiveness of the proposed method is justified analytically. Related issues such as how to expand the labeled dataset and how to combine clustering with classification are discussed. Our experimental results demonstrate that CBC outperforms existing algorithms when the size of the labeled dataset is very small.
1. Introduction
Text classification is a supervised learning task of assigning natural language text documents to one or more predefined categories or classes according to their contents. While it has been a classical problem in the field of information retrieval for half a century, it has recently attracted an increasing amount of attention due to the ever-expanding amount of text available in digital form. Its applications span a number of areas, including auto-processing of emails, filtering junk emails, and cataloguing Web pages and news articles. A large number of techniques have been developed for text classification, including Naive Bayes (Lewis 1998), Nearest Neighbor (Masand 1992), neural networks (Ng 1997), regression (Yang 1994), rule induction (Apte 1994), and Support Vector Machines (SVM) (Vapnik 1995, Joachims 1998). Among them, SVM has been recognized as one of the most effective text classification methods. Yang & Liu gave a comparative study of many such algorithms (Yang 1999).
As supervised learning methods, most existing text classification algorithms require sufficient training data so that the obtained classification model generalizes well. When the number of training examples in each class decreases, the classification accuracy of traditional text classification algorithms degrades dramatically. However, in practical applications, labeled documents are often very sparse because manually labeling data is tedious and costly, while unlabeled documents are often abundant. As a result, exploiting unlabeled data for text classification has recently become an active research problem. The general problem of exploiting unlabeled data in supervised learning is referred to as semi-supervised learning, or the labeled-unlabeled problem, in different contexts.
The problem, in the context of text classification, can be formalized as follows. Each sample text document is represented by a vector x ∈ ℝ^d. We are given two datasets D_l and D_u. Dataset D_l is a labeled dataset, consisting of data samples (x_i, t_i), where 1 ≤ i ≤ n and t_i is the class label with 1 ≤ t_i ≤ c. Dataset D_u is an unlabeled dataset, consisting of unlabeled data samples x_i, n+1 ≤ i ≤ n+m. The semi-supervised learning task is to construct a classifier with small generalization error on unseen data¹ based on both D_l and D_u. A number of semi-supervised text classification methods have been reported recently, including Co-Training (Blum & Mitchell, 1998), Transductive SVM (TSVM) (Joachims, 1999), and EM (Nigam et al., 2000); a comprehensive review can be found in Seeger (2001).

¹ A transductive setting of this problem simply uses the seen unlabeled data as the testing data.
While it has been reported that those methods obtain considerable improvement over traditional supervised methods when the training dataset is relatively small, our experiments indicate that they still face difficulties when the labeled dataset is extremely small, e.g. containing fewer than 10 labeled examples per class. This is somewhat expected, as most of those methods adopt the same iterative approach, which trains an initial classifier based heavily on the distribution presented in the labeled data. When the labeled set contains a very small number of samples, and the samples lie far from their corresponding class centers due to the high dimensionality, those methods will often have a poor starting point and will accumulate more errors over the iterations. On the other hand, although there are many more unlabeled data samples available, which should be more representative, they are not made full use of in the classification process.
The above observation motivated the work reported in this paper. We present CBC, a clustering based approach for classifying text documents with both labeled and unlabeled data. The philosophical difference between our approach and existing ones is that we treat semi-supervised learning as clustering aided by the labeled data, while existing algorithms treat it as classification aided by the unlabeled data. Traditional clustering is unsupervised and requires no training examples. However, the labeled data can provide important hints about the latent class variables. The labeled data also help determine parameters associated with clustering methods, thus impacting the final clustering result. Furthermore, the label information can be propagated to unlabeled data according to the clustering results. The expanded labeled set can then be used by subsequent discriminative classifiers to obtain low generalization error on unseen data. Experimental results indicate that our approach outperforms existing approaches, especially when the original labeled dataset is very small.
Our contributions can be summarized as follows. (1) We propose a novel clustering based classification approach that requires minimal labeled data in the training dataset to achieve high classification accuracy; (2) we provide an analysis that gives some insight into the problem and propose various implementation strategies; (3) we conduct comprehensive experiments to validate our approach and study related issues. The remainder of the paper is organized as follows. Section 2 reviews several existing methods. Our approach is outlined in Section 3 with some analysis. The detailed algorithm is then presented in Section 4. A performance study using several standard text datasets is presented in Section 5. Finally, Section 6 concludes the paper.
2. Semi-Supervised Learning: Motivations
As defined in the previous section, semi-supervised learning uses both the labeled dataset D_l and the unlabeled dataset D_u to construct a classification model. However, how the unlabeled data can help in classification is not a trivial problem. Different methods have been proposed according to different views of the unlabeled data.
Expectation-Maximization (EM) (Dempster et al., 1977) has a long history in semi-supervised learning. The motivation of EM is as follows. Essentially, any classification method learns a conditional probability model P(t|x,θ) from a certain model family to fit the real joint distribution P(x, t). With unlabeled data, a standard statistical approach to assessing the fitness of a learned model P(t|x,θ) is

$$\sum_{x_i \in D_l} \log P(x_i \mid t_i, \theta) P(t_i) + \sum_{x_i \in D_u} \log \sum_{t} P(x_i \mid t, \theta) P(t) \qquad (1)$$

where the latent labels of the unlabeled data are treated as missing variables. Given Eq. 1, a Maximum Likelihood Estimation (MLE) process can be conducted to find an optimal θ. Because the form of the likelihood often makes it difficult to maximize by partial derivatives, the Expectation-Maximization (EM) algorithm is generally used to find a locally optimal θ. For example, Nigam et al. (2000) combined EM with Naive Bayes and obtained improved performance over supervised classifiers. Theoretically, if a θ close to the global optimum can be found, the result will also be optimal under the given model family. However, the selection of a plausible model family is difficult, and the local optimum problem is serious, especially when given a poor starting point. For example, in Nigam's approach, EM is initialized by a Naive Bayes classifier trained on the labeled data, which may be heavily biased when there is insufficient labeled data.
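As an illustration, the following is a minimal sketch (ours, not the authors' code) of this EM procedure for a multinomial Naive Bayes model, alternating between soft labeling of D_u and re-estimation of θ; the function name and interface are hypothetical.

```python
import numpy as np

def nb_em(Xl, yl, Xu, c, n_iter=20, alpha=1.0):
    """Semi-supervised multinomial Naive Bayes trained with EM on Eq. 1.
    Xl: (n, d) term counts of labeled docs; yl: int labels in 0..c-1;
    Xu: (m, d) term counts of unlabeled docs. A sketch, not the paper's code."""
    Rl = np.eye(c)[yl]                            # hard responsibilities (n, c)
    Ru = np.full((Xu.shape[0], c), 1.0 / c)       # soft responsibilities (m, c)
    X = np.vstack([Xl, Xu])
    for _ in range(n_iter):
        R = np.vstack([Rl, Ru])
        # M-step: class priors P(t) and smoothed word distributions P(w|t).
        prior = R.sum(axis=0) / R.shape[0]
        word = R.T @ X + alpha
        word /= word.sum(axis=1, keepdims=True)
        # E-step: posterior over the latent labels of the unlabeled docs only.
        logp = np.log(prior) + Xu @ np.log(word).T
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exp
        Ru = np.exp(logp)
        Ru /= Ru.sum(axis=1, keepdims=True)
    return prior, word
```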
Co-Training and TSVM were proposed more recently and have shown superior performance over EM in many experiments. The Co-Training method (Blum & Mitchell, 1998) splits the feature set as x = (x_1, x_2) and trains two classifiers θ_1 and θ_2, each of which is sufficient for classification. With the assumption of compatibility, i.e. P(t|x_1, θ_1) = P(t|x_2, θ_2), Co-Training uses unlabeled data to place an additional restriction on the model parameter distribution P(θ), thus improving the estimation of the real θ. The algorithm initially constructs two classifiers based on the labeled data, and the two mutually select several confident examples to expand the training set. This is based on the assumptions that an initial weak predictor can be found and that the two feature sets are conditionally independent. However, when the labeled dataset is small, it is often heavily biased against the real data distribution, and the above assumptions will be seriously violated.
TSVM (Joachims, 1999) adopts a totally different way of exploiting the unlabeled data: it maximizes the margin over both the labeled data and the unlabeled data. TSVM works by finding a labeling t_{n+1}, t_{n+2}, ..., t_{n+m} of the unlabeled data D_u and a hyperplane ⟨w, b⟩ which separates both D_l and D_u with maximum margin. TSVM expects to find a low-density area of the data and constructs a linear separator in this area. Although empirical results indicate the success of the method, there is a concern that the large-margin hyperplane over the unlabeled data is not necessarily the real classification hyperplane (Zhang & Oles, 2000). In text classification, because of the high dimensionality and data sparseness, there are often many low-density areas between positive and negative labeled examples.
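For reference, the optimization TSVM solves can be written jointly over the hyperplane and the unknown labels of the unlabeled data (this restates Joachims' well-known formulation in this paper's notation; C and C* weight the slack of labeled and unlabeled examples, respectively):

$$\min_{t_{n+1},\dots,t_{n+m},\; w,\, b,\, \xi,\, \xi^*} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{m} \xi^*_j$$

$$\text{s.t.}\;\; t_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad t_{n+j} (w \cdot x_{n+j} + b) \ge 1 - \xi^*_j, \quad \xi_i \ge 0,\; \xi^*_j \ge 0$$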
Instead of using two conditionally independent feature sets in the co-training setting, Raskutti et al. (2002) co-trained two SVM classifiers using two feature spaces from different views: one is the original feature space, and the other is derived from clustering the labeled and unlabeled data. Nigam & Ghani (2002) proposed two hybrid algorithms, co-EM and self-training, using two randomly split feature sets in a co-training setting. After exhaustive experiments, they found that co-training algorithms are better than non-co-training algorithms such as self-training.
In summary, all existing semi-supervised methods still work in the supervised fashion; that is, they pay more attention to the labeled dataset and rely heavily on the distribution presented in it. With the help of the unlabeled data, extra information about the data distribution can improve generalization performance. However, if the number of samples in the labeled data is extremely small, such algorithms may not work well, as the labeled data can hardly represent the distribution of unseen data from the beginning. This is often the case for text classification, where the dimensionality is very high and a small labeled dataset represents just a few isolated points in a huge space. In Figure 1, we depict the results of applying two such algorithms to a text classification problem. The x-axis is the number of labeled samples in each class, and the y-axis is their performance in terms of the Micro-F1 measure defined in Section 5. We can see that the performance of both algorithms degrades dramatically when the number of labeled samples in each class drops below 16.
Figure 1. Performance of two existing algorithms (TSVM and Co-Training) and K-means with different sizes of labeled data (no. of classes = 5, total number of training data samples = 4000). The x-axis is the number of labeled data in each class (0 to 512); the y-axis is the Micro-F1 measure.
In Figure 1, we depict another line, the dotted line, to indicate the performance of using a clustering
method, K-means, to cluster the same set of training data. In the experiments, we ignore the labels; hence
it represents the performance of unsupervised learning. It is interesting to see that when the number of
labeled data in each class is less than 4, unsupervised learning in fact gives better performance than both
semi-supervised learning algorithms. This motivated us to develop a clustering based approach to the
problem of semi-supervised learning.
3. Clustering Based Classification
Our approach first clusters the training data, both labeled and unlabeled, with the guidance of the labeled data. Given the resulting clusters, some of the originally unlabeled data can be treated as labeled data with high confidence. Such an expanded labeled dataset can subsequently be used by classification algorithms to construct the final classification model.
3.1 The Basic Approach
CBC consists of the following two steps:
1. Clustering step: to cluster the training dataset including both the labeled and unlabeled data
and expand the labeled set according to the clustering result;
2. Classification step: to train classifiers with the expanded labeled data and remaining unlabeled
data.
Figure 2 gives an example to illustrate the traditional approach and our clustering based approach. The black and gray points in the figure represent data samples of two different classes. We have a very small number of labeled data, e.g. one for each class, represented by the points with '+' and '-' signs. Apparently, a classification algorithm trained with these two points will most likely find line A, shown in Figure 2(a), as the class boundary; and it is rather difficult to discover the real boundary B even with the help of the unlabeled data points. Firstly, because the initial labeled samples are highly biased, they provide poor starting points for iterative reinforcement algorithms such as Co-Training and EM. Moreover, the TSVM algorithm may also take line A as the result because it happens to lie in a low-density area. In fact, in a feature space of high dimensionality, a single sample is often highly biased, and many low-density areas will exist.

Figure 2. An illustrative example of clustering based classification. The black and gray data points are unlabeled examples; the big '+' and '-' are the two initially labeled examples, and the small '+' and '-' are examples expanded by clustering. (a) Classification with the original labeled data (boundary A); (b) expanding the labeled set by clustering; (c) classification with the newly labeled data (boundary B).
Our clustering based approach is shown in Figures 2(b) and (c). In the first step, a clustering algorithm is applied to the training samples; in the example, it results in two clusters. We then propagate the labels of the labeled data samples to the unlabeled samples that are closest to the cluster centroids. As a result, we have more labeled data samples, as shown in Figure 2(b). The second step of the approach is to use the expanded labeled data and the remaining unlabeled data to train a classifier. As a result, we obtain a better class boundary, as shown in Figure 2(c).

From the above description, we can see that our approach aims to combine the merits of both clustering and classification methods. That is, we use a clustering method to reduce the impact of the bias caused by the initial sparse labeled data. At the same time, with sufficient expanded labeled data, we can use discriminative classifiers to achieve better generalization performance than pure clustering methods.
3.2 Benefits of the Clustering Based Approach
In this subsection, we further analyze the benefit of integrating clustering into the classification process.
First, clustering methods are more robust to the bias caused by the initial sparse labeled data. Let us take k-means, the most popular clustering algorithm, as an example. In essence, k-means is a simplified version of EM working on spherical Gaussian distribution models. It can be approximately described as MLE of k spherical Gaussian distributions, where the means µ_1, ..., µ_k and the identical covariance Σ are the latent variables. Thus, with the aid of labeled data, the objective is to find an optimal θ = ⟨µ_1, ..., µ_k, Σ⟩ maximizing the log-likelihood of Eq. 1, where P(x|t_i, θ) equals

$$P(x \mid t_i, \theta) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma \rvert^{1/2}} \exp\!\Big( -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) \Big) \qquad (2)$$

When the number of labeled examples is small, the bias of the labeled examples will not greatly affect the likelihood estimation or the search for the optimal θ.
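To make the connection to k-means explicit (a one-line derivation of ours, not in the original): with Σ = σ²I fixed and uniform class priors, the normalization factor in Eq. 2 is constant, so maximizing the log-likelihood over the means reduces to minimizing the k-means objective

$$\arg\max_{\mu_1, \dots, \mu_k} \sum_i \log P(x_i \mid t_i, \theta) \;=\; \arg\min_{\mu_1, \dots, \mu_k} \sum_i \lVert x_i - \mu_{t_i} \rVert^2$$

which is exactly the within-cluster squared distance that k-means minimizes.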
Second, our clustering method is in fact a generative classifier, i.e., it constructs a classifier derived from the generative model of its data, P(x|t,θ). Ng & Jordan theoretically and empirically analyzed the asymptotic performance of generative and discriminative classifiers (such as logistic regression, which is a general form of SVM). They showed that generative classifiers reach their asymptotic performance faster than discriminative classifiers (Ng & Jordan 2002). Thus our clustering method is more effective with small training data and can more easily achieve high performance when the labeled data is sparse. To address the problem that generative classifiers usually have higher asymptotic error than discriminative classifiers, a discriminative classification method such as TSVM can be used in the second step of our approach, i.e., after clustering the unlabeled data and expanding the labeled dataset.
Third, our clustering is guided by the labeled data. Generally, clustering methods address the issue of finding a partition of the available data that maximizes a certain criterion, e.g. intra-cluster similarity and inter-cluster dissimilarity. The labeled data can be used to modify the criterion. There are also parameters associated with each clustering algorithm, e.g. the number k in k-means, or the split strategy of the dendrogram in hierarchical clustering. The labeled data can also guide the selection of these parameters. In our current implementation, we use a soft-constrained version of the k-means algorithm for clustering, where k is equal to the number of classes in the given labeled dataset. The labeled data points are used to obtain the initial labeled centroids, which are used during the clustering process to constrain the clustering result. The detailed algorithm is described in Section 4.
3.3 Combining Clustering with Classification
It is interesting to note that a TSVM classifier can also provide more confident examples for clustering. That is, the two-step clustering based classification, i.e., clustering followed by classification, can be viewed as one conceptual approach; another strategy for combining clustering and classification is iterative reinforcement. In the latter, we first train a clustering model L_1 based on all available data, obtaining an approximately correct classifier. Afterwards, we select from the unlabeled data the examples that are confidently classified by L_1 (i.e. examples with high likelihood) and combine them with the original labeled data to train a new model L_2. Because more labeled data are used, the obtained L_2 is expected to be more accurate and can in turn provide more confident training examples for L_1. We then use the new labeled dataset to train L_1 again. This process is iterated until all examples are labeled.
One key issue here is how to expand the labeled dataset. In principle, we simply assign labels to the most confident p% of examples from each of the resulting clusters. If we choose p = 100% after the first clustering pass, we obtain the two-step approach.

First, we need to determine the value of p. The selection of p is a tradeoff between the number of labeled samples and the noise possibly introduced by labeling errors. Obviously, with a higher p, a larger labeled dataset is obtained, and in general a classifier with higher accuracy can be trained with more training samples. On the other hand, when we expand more samples, we may introduce incorrectly labeled samples into the labeled dataset; these become noise and will degrade the performance of a classification algorithm. Furthermore, a small p means more iterations in the reinforcement process.

Second, we need to choose the confident examples. Note that any learned model is an estimate of the real data model P(x,t). Examples are confidently classified by a given model if a slight change of θ has no impact on them. When more examples are given, the model estimate becomes more accurate, and the number of confident examples grows. As illustrated in Figures 2(a) and (b), even when some of the data points are wrongly classified, the most confident data points, i.e. the ones with the largest margin under a classification model and the ones nearest to the centroid under a clustering model, are classified correctly with confidence. That is, a slight change of the decision boundary or centroid will not affect the labels of these data points.
We assume that the class labels t are uniformly distributed. Since the Gaussian is spherical, the log-likelihood of a given data point x* with estimated label t* is

$$\log P(x^*, t^* \mid \theta) = \log\big( P(x^* \mid t^*, \theta)\, P(t^* \mid \theta) \big) = -c_1 \lVert x^* - \mu_{t^*} \rVert^2 + c_2 \qquad (3)$$

where c_1 and c_2 are positive constants. The most probable points under a single Gaussian distribution are the points nearest to the distribution mean.
To get the most confident examples from the result of TSVM, we take a probabilistic view of the TSVM. Let us use logistic regression as an example, which is a general form of discriminative methods. The objective is to maximize

$$\sum_i \log \frac{1}{1 + e^{-y_i f(x_i, \theta)}} \qquad (4)$$

where f(x_i, θ) is a linear function depending on the parameter θ, and θ is typically a linear combination of the training examples. Under a margin-maximization classifier such as SVM, the likelihood of a given point x* with label t* = + can be derived from the above equation:

$$P(x^*, t^* = +) = P(x^*) \left( 1 - \frac{1}{1 + \exp\big( \sum_j \beta_j t_j \sum_k x_{jk} x^*_k + b \big)} \right) \qquad (5)$$

which considers the points with the largest margin to be the most probable.
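Both Eq. 3 and Eq. 5 reduce confidence to a single score per example (nearness to a centroid, or margin), so the expansion step can be sketched generically as below; the function name and interface are our own illustration, not from the paper.

```python
import numpy as np

def top_p_per_group(scores, groups, p):
    """Indices of the most confident p-fraction of examples in each group.
    Under the clustering model the score is similarity to the centroid (Eq. 3);
    under the margin classifier it is the margin (Eq. 5). 'groups' holds the
    cluster or class id of each example; p is a fraction in (0, 1]."""
    chosen = []
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        k = int(np.ceil(p * len(idx)))
        order = np.argsort(scores[idx])[::-1]    # most confident first
        chosen.extend(idx[order[:k]].tolist())
    return np.asarray(chosen, dtype=int)
```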
4. The Algorithm
In this section, we present the detailed algorithm for applying CBC to text data, which is generally represented by sparse term vectors in a high-dimensional space. Following the traditional IR approach, we tokenize all documents into terms and construct one component for each distinct term. Thus each document i is represented by a vector (w_{i1}, w_{i2}, ..., w_{id}), where w_{ij} is weighted by TFIDF (Salton, 1991), i.e. w_{ij} = TF_{ij} × log(N / DF_j), and N is the total number of documents.

Assuming the term vectors are normalized, the cosine function is a commonly used similarity measure for two documents:

$$sim(doc_j, doc_k) = \sum_{i=1}^{d} w_{ji} \cdot w_{ki}$$

This measure is also used in the clustering algorithm to calculate the distance from an example to a centroid (which is also normalized).
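A compact sketch of this weighting and similarity computation (ours, using dense arrays for clarity; a real implementation would use sparse matrices):

```python
import numpy as np

def tfidf_normalized(counts):
    """w_ij = TF_ij * log(N / DF_j), with rows L2-normalized so that the dot
    product of two rows is exactly the cosine similarity above.
    'counts' is a dense (N_docs, d) term-frequency array."""
    N = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)        # document frequency per term
    idf = np.log(N / np.maximum(df, 1))          # guard terms with DF = 0
    w = counts * idf
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)

# cosine similarity: W = tfidf_normalized(counts); sim_jk = W[j] @ W[k]
```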
This simple representation proves to be efficient for supervised learning (Joachims, 1998), e.g. in most tasks the classes are linearly separable. For the classification step, we use a TSVM classifier with a linear kernel. The detailed algorithm is presented as follows.




Algorithm CBC

Input:
- Labeled dataset D_l
- Unlabeled dataset D_u

Output:
- The full labeled set D_l' = D_l + (D_u, T_u*)
- A classifier L

1. Initialize the current labeled and unlabeled sets: D_l' = D_l, D_u' = D_u.
2. Repeat until D_u' = ∅:

   Clustering step
   - Calculate the initial centroids o_i = Σ_{j: t_j = i} x_j for i = 1, ..., c, x_j ∈ D_l, and set the current centroids o_i* = o_i.
   - The labels of the centroids, t(o_i) = t(o_i*), are set equal to the labels of the corresponding examples.
   - Repeat until the clustering result no longer changes:
     - Calculate the nearest current centroid o_j* for each o_i. If t(o_i) ≠ t(o_j*), exit the loop.
     - Assign t(o_i*) to each x_i ∈ D_l + D_u that is nearer to o_i* than to any other centroid.
     - Update the current centroids o_i* = Σ_{j: t_j = i} x_j for i = 1, ..., c, x_j ∈ D_l + D_u.
   - From each cluster, select the p% of examples x_i ∈ D_u' nearest to o_i* and add them to D_l'.

   Classification step
   - Train a TSVM classifier based on D_l' and D_u'.
   - From each class, select the p% of examples x_i ∈ D_u' with the largest margin and add them to D_l'.

The algorithm implements the iterative reinforcement strategy. During each iteration, a soft-constrained version of k-means is used for clustering. We compute the centroid of the labeled data for each class (called the labeled centroid) and use these as the initial centroids for k-means; the value of k is set to the number of classes in the labeled data. Then we run k-means on both the labeled and unlabeled data. The loop terminates when the clustering result no longer changes, or just before a labeled centroid would be assigned to a wrong cluster. This places soft constraints on the clustering, because the constraints are based not on the exact examples but on their centroids; the constraints reduce the bias in the labeled examples. Finally, unlabeled data are assigned the same label as the labeled centroid of their cluster.

After clustering, we select the most confident examples (i.e. the examples nearest to the cluster centroids) to form a new labeled set and, together with the remaining unlabeled data, train a TSVM classifier. Then, the examples with the largest margin are selected into the new labeled set for the next iteration.

It should be noted that the time complexity of a TSVM classifier is much higher than that of an SVM classifier, because it repeatedly switches the estimated labels of the unlabeled data while trying to find the maximum-margin hyperplane; the more unlabeled data, the more time it requires. When there is no unlabeled data, TSVM reduces to a pure SVM and runs much faster. In our algorithm, the last run of classification is therefore done by a standard SVM classifier.
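To make the procedure concrete, below is a compact sketch of the p = 100% variant (cluster once, then classify once). It is our illustration, not the authors' code: it assumes L2-normalized TF-IDF rows X, integer labels y valid where the boolean mask 'labeled' is true, and it substitutes scikit-learn's LinearSVC for the SVM-light TSVM used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC  # stand-in for the paper's TSVM (SVM-light)

def soft_constrained_kmeans(X, y, labeled, c, max_iter=100):
    """One CBC clustering step over L2-normalized TF-IDF rows X. Centroids
    start from the labeled centroids; iteration stops early if any labeled
    centroid becomes nearest to a centroid of a different label (the paper's
    soft constraint). A sketch, not the authors' code."""
    def normalize(M):
        return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    lab_cent = normalize(np.vstack(
        [X[labeled & (y == i)].sum(axis=0) for i in range(c)]))
    cent, assign = lab_cent.copy(), None
    for _ in range(max_iter):
        # Soft constraint: every labeled centroid must stay nearest to the
        # current centroid that carries its own label.
        if (np.argmax(lab_cent @ cent.T, axis=1) != np.arange(c)).any():
            break
        new = np.argmax(X @ cent.T, axis=1)      # cosine = dot on unit vectors
        if assign is not None and (new == assign).all():
            break
        assign = new
        cent = normalize(np.vstack(
            [X[assign == i].sum(axis=0) for i in range(c)]))
    return assign, cent

def cbc(X, y, labeled, c, p=1.0):
    """Two-step CBC with p = 100% (p is a fraction here): cluster once,
    expand the labels, classify once."""
    assign, cent = soft_constrained_kmeans(X, y, labeled, c)
    conf = (X * cent[assign]).sum(axis=1)        # similarity to own centroid
    expand = np.zeros(len(y), dtype=bool)
    for i in range(c):
        idx = np.where(~labeled & (assign == i))[0]
        k = int(np.ceil(p * len(idx)))
        expand[idx[np.argsort(conf[idx])[::-1][:k]]] = True
    y_new = np.where(labeled, y, assign)         # propagate cluster labels
    mask = labeled | expand
    return LinearSVC().fit(X[mask], y_new[mask])
```

Extending the sketch to the iterative reinforcement strategy amounts to looping the two steps with p < 100% and re-running the clustering on the expanded labeled set.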
5. A Performance Study
To evaluate the effectiveness of our approach, a comprehensive performance study has been conducted using several standard text datasets.
5.1 Datasets
Three commonly used datasets, 20-Newsgroups, Reuters-21578, and Open Directory Project (ODP) webpages, were used in our experiments. For each document, we extract features from its title and body separately for the Co-Training algorithm, and extract a single feature vector from both title and body for all other algorithms. Stop-words are eliminated and stemming is performed for all features. For body features, all words with low document frequency (less than three in the experiments) are removed. TFIDF is then used to index both titles and bodies, where the IDF is calculated only on the training dataset; words that appear only in the test dataset but not in the training dataset are discarded. Because the training time of the TSVM classifier is proportional to the number of classes, we did not use all the classes in each dataset. In the experiments, we select five classes from 20-Newsgroups (the same as Nigam & Ghani (2002)), the ten biggest classes from Reuters-21578 (as used in many previous studies), and the six biggest classes from ODP.
From the 20-Newsgroups dataset², we select only the five comp.* discussion groups, which form a very confusable but evenly distributed dataset for classification, i.e. there are almost 1000 articles in each group. We choose 80% of each group as the training set and the remaining 20% as the test set, which gives us 4000 training examples and 1000 test examples. Such a split is similar to that used in Nigam & Ghani (2002). For each article, the subject line and body are retained and the other content is discarded. After preprocessing, there are 14171 distinct terms, with 14059 in the body features and 2307 in the title features.
The Reuters-21578 dataset is downloaded from Yiming Yang's homepage³. We use the ModApte split to form the training and test sets, giving 7769 training examples and 3019 test examples. From the whole dataset, we select the ten biggest classes: earn, acq, money-fx, grain, crude, trade, interest, ship, wheat, and corn. After selecting these ten classes, there are 6649 training examples and 2545 test examples. After preprocessing, there are only 7771 distinct terms, with 7065 in the body features and 6947 in the title features.
The ODP webpages dataset used in our experiments is composed of the six biggest classes in the second level of the ODP directory⁴: Business/Management (858 documents), Computers/Software (2411), Shopping/Crafts (877), Shopping/Home&Garden (1170), Society/Religion&Spirituality (886), and Society/Holidays (881). We select 50% of each category as the training data and the remainder as the test data. From the experimental results below, we see that there is only a small difference between a 90% and a 50% split, so we conclude that such a split is sufficient for training a classifier. The webpages are preprocessed by an HTML parser and the pure text is extracted. After preprocessing, there are 16818 distinct terms for body, 3729 for title, and 17050 for title+body.

² Available at http://www.ai.mit.edu/people/jrennie/20Newsgroups/.
³ http://www-2.cs.cmu.edu/~yiming/
⁴ http://dmoz.org/
The Jochimss SVM-light package
5
is used for SVM and TSVM classification. We use a linear
kernel and set the weight C of the slack variables to default. The basic classifiers for the two feature sets
in the Co-Training method are Naive Bayes. During each iteration, we add 1% of examples with the
maximal classification confidence, i.e. examples with largest margin, into the labeled set. And the
performance is evaluated on the combined features.
5.2 Evaluation Metrics
We use the micro-averaged F1 measure over all classes to evaluate the classification results. Since this is a multi-class problem, the TSVM or SVM in our algorithm constructs several one-vs-rest binary classifiers. For each class i ∈ [1, c], let A_i be the number of documents whose real label is i, B_i the number of documents whose label is predicted to be i, and C_i the number of correctly predicted examples in this class. The precision and recall of class i are defined as P_i = C_i / B_i and R_i = C_i / A_i, respectively.
The F1 measure combines precision and recall into a unified measure. Because F1 differs across classes, two averaging functions can be used to judge the combined performance. The first is macro-averaging:

$$F1_{macro} = \frac{1}{c} \sum_{i=1}^{c} \frac{2 \times P_i \times R_i}{P_i + R_i} \qquad (6)$$

The second averaging function is micro-averaging, which first calculates the total precision and recall over all classes, P = ΣC_i / ΣB_i and R = ΣC_i / ΣA_i, and then computes the F1 measure:

$$F1_{micro} = \frac{2 \times P \times R}{P + R} \qquad (7)$$

Because the micro-averaged F1 is in fact a weighted average over all classes, which is more plausible for highly unevenly distributed classes, we use this measure in the following experiments.
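A small sketch of both averages (ours; class labels are assumed to be integers 0..c-1):

```python
import numpy as np

def macro_micro_f1(y_true, y_pred, c):
    """Macro- and micro-averaged F1 as in Eqs. 6 and 7."""
    A = np.array([(y_true == i).sum() for i in range(c)])  # true count per class
    B = np.array([(y_pred == i).sum() for i in range(c)])  # predicted count
    C = np.array([((y_true == i) & (y_pred == i)).sum() for i in range(c)])
    P = C / np.maximum(B, 1)
    R = C / np.maximum(A, 1)
    macro = np.mean(2 * P * R / np.maximum(P + R, 1e-12))
    P_all, R_all = C.sum() / B.sum(), C.sum() / A.sum()
    micro = 2 * P_all * R_all / (P_all + R_all)
    return macro, micro
```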
5.3 Results
We first evaluate our algorithm and compare it with TSVM and Co-Training on the 5 comp.* newsgroups dataset (see Figure 3). In our algorithm, we set the parameter p to 100%, which means we cluster once and classify once; the selection of p is explained later.

Figure 3. Comparative study of TSVM, Co-Training, and our algorithm CBC when given small training
data on 5 comp.* newsgroups data set.
The algorithms are run over several rounds with the number of labeled examples ranging over 1, 2, 4, 8, ..., 512 per class. For each number of labeled data, we randomly choose 10 sets to test the three methods and plot their average in Figure 3.

From Figure 3, we can see that when the number of labeled data is very small, CBC performs significantly better than the other algorithms. For example, when the number of labeled data in each class is 1, our method outperforms TSVM by 5% and Co-Training by 22%; when the number is 2, our method outperforms TSVM by 9% and Co-Training by 12%. Only when the number of labeled samples exceeds 64 do TSVM and Co-Training achieve slightly better performance than our method.
To evaluate the performance of CBC over a larger range of labeled data, we run the same algorithm together with TSVM, Co-Training, and SVM on different percentages of labeled data on the above three datasets. Figures 4, 5, and 6 illustrate the results. The horizontal axis indicates the percentage of labeled data among all training data, and the vertical axis is the Micro-F1 measure. We vary the percentage from 0.50% to 90%. The performance of a basic SVM, which does not exploit unlabeled data, is always the worst, because the labeled data is too sparse for SVM to generalize effectively.

On all datasets, CBC performs best when the labeled data percentage is less than 1%. When the number of labeled documents increases, the performance of our algorithm remains comparable to the other methods. The Co-Training algorithm has low performance when the training data is small because it is based on the assumption that an initial weak predictor can be found; when the training data is rather small, it is often largely biased against the real data distribution. However, Co-Training sometimes obtains superior results, especially on the Reuters dataset in Figure 4, because it exploits the feature split to form two views of the training data. In Figures 4, 5, and 6, TSVM always has an intermediate performance between Co-Training and our algorithm.
Figure 4. Micro-F1 of the four algorithms (TSVM, Co-Training, SVM, CBC) on the Reuters dataset. The x-axis is the percentage of the training data labeled (0 to 90%); the y-axis is the Micro-F1 measure.
Figure 5. Micro-F1 of the four algorithms (TSVM, Co-Training, SVM, CBC) on the ODP dataset. The x-axis is the percentage of the training data labeled (0 to 90%); the y-axis is the Micro-F1 measure.
Figure 6. Micro-F1 of the four algorithms (TSVM, Co-Training, SVM, CBC) on the 5 comp.* newsgroups dataset. The x-axis is the percentage of the training data labeled (0 to 90%); the y-axis is the Micro-F1 measure.
Figure 7. Micro-F1 of different algorithm settings with 0.5% labeled data on the 5 comp.* newsgroups dataset.

Figure 8. Variance of the different algorithms given different labeled data sizes (each calculated over 10 randomly selected labeled sets).
One limitation of our algorithm is that as the labeled data increases, its performance grows slowly and sometimes drops (as in Figure 4, on the Reuters dataset). This can be explained by the simple way labeled examples are integrated in the clustering step. In fact, labeled examples can not only provide constraints for the clustering but could also be used to modify the similarity measure. We hope to evaluate this further in future work.
In Figure 7, we empirically analyze the proposed algorithm under different values of the parameter p, based on experiments on the 5 comp.* newsgroups with the same 0.5% of data initially labeled. The parameter p controls how much data is labeled in each iteration. The 10% curve means we increase the labeled data by 10% each time; the 100% curve means we label all the data at once; and the exp2 curve means we add examples in a sequence of powers of 2 (i.e. we sequentially add 0.5%, 1%, 2%, 4%, ... of the examples to the labeled set).
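The exp2 schedule can be sketched as a small generator (a hypothetical helper of ours, illustrating the doubling sequence):

```python
def exp2_schedule(start=0.005):
    """Fractions of the data added to the labeled set at each iteration under
    the 'exp2' setting (0.5%, 1%, 2%, 4%, ...), stopping once all is labeled."""
    total, step = 0.0, start
    while total < 1.0:
        step = min(step, 1.0 - total)   # last increment labels the remainder
        yield step
        total += step
        step *= 2
```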
As can be seen from Figure 7, although the 10% and exp2 selections improve the classification monotonically, a simpler but more effective setting of this algorithm is to just cluster once and classify once. This can be explained by the fact that although the clustering step provides informative examples for the classification step, the examples selected by the classification step carry too little information for subsequent clustering in the proposed algorithm. That is, the examples selected by the TSVM and SVM classifiers cannot provide enough information for the clustering algorithm.
Finally, we evaluate the stability of the algorithms by the variance of the three algorithms, calculated over 10 randomly selected labeled sets for each sample size (see Figure 8). This experiment is also conducted on the 5 comp.* newsgroups. As can be seen, the TSVM algorithm generally has the smallest variance and Co-Training the largest. The large variance of Co-Training is natural for the aforementioned reason: the initial labeled data has a large impact on the initial weak predictor. The variance of our algorithm is lower than that of Co-Training but higher than that of TSVM, mainly because our simple version of the constrained k-means algorithm may also fall into a local minimum given a poor starting point.
6. Conclusion and Future Work
This paper presented a clustering based approach for semi-supervised text classification. The experiments showed the superior performance of our method over existing methods such as TSVM and Co-Training when the labeled data size is extremely small. When there is sufficient labeled data, our method is comparable to TSVM and Co-Training.

To the best of our knowledge, no prior work focused on clustering examples aided by labeled data has been reported. Some works on constrained clustering (Wagstaff et al., 2001; Klein et al., 2002) can be considered the most relevant. They use prior knowledge in the form of cannot-link and must-link constraints to guide clustering algorithms. In our case, the labeled data provides not only such constraints but also label information, which can be exploited to assign labels to unlabeled data.

The constrained clustering method in this paper may not be sophisticated enough to capture all the information carried in the labeled data. We plan to evaluate other clustering methods and to further adjust the similarity measure with the aid of labeled examples. Another direction is to evaluate the validity of the two general classifiers used in our framework, and to investigate the problems of example selection, confidence assessment, and noise control for further performance improvement.
References
Apte, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM TOIS, Vol. 12, No. 3, 223-251.

Blum, A. & Mitchell, T. (1998). Combining labeled and unlabeled data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (pp. 92-100).

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol (Eds.), Proceedings of the European Conference on Machine Learning (pp. 137-142). Berlin: Springer.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200-209). San Francisco: Morgan Kaufmann.

Klein, D., Kamvar, S. D., & Manning, C. D. (2002). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the Nineteenth International Conference on Machine Learning.

Lewis, D. D. (1998). Naive Bayes at forty: The independence assumption in information retrieval. In Proceedings of the European Conference on Machine Learning (ECML-98).

Masand, B., Linoff, G., & Waltz, D. (1992). Classifying news stories using memory based reasoning. In Proceedings of the 15th ACM SIGIR Conference (pp. 59-64).

Ng, A. Y. & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14.

Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th ACM SIGIR Conference.

Nigam, K. & Ghani, R. (2002). Analyzing the effectiveness and applicability of co-training. In Proceedings of the 9th International Conference on Information and Knowledge Management.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

Raskutti, B., Ferra, H., & Kowalczyk, A. (2002). Combining clustering and co-training to enhance text classification using unlabeled data. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining.

Salton, G. (1991). Developments in automatic text retrieval. Science, 253, 974-979.

Seeger, M. (2001). Learning with labeled and unlabeled data. Technical report, Edinburgh University.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York: Springer-Verlag.

Wagstaff, K., Cardie, C., Rogers, S., & Caruana, R. (2001). Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning.

Yang, Y. & Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM TOIS, Vol. 12, No. 3, 252-277.

Yang, Y. & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd ACM SIGIR Conference.

Zhang, T. & Oles, F. (2000). A probability analysis on the value of unlabeled data for classification problems. In Proceedings of the Seventeenth International Conference on Machine Learning.