CBC: Clustering Based Text Classification Requiring Minimal Labeled Data

Hua-Jun Zeng¹, Xuan-Hui Wang², Zheng Chen¹, Wei-Ying Ma¹

¹ Microsoft Research Asia, Beijing, P. R. China
{i-hjzeng, zhengc, wyma}@microsoft.com

² University of Science and Technology of China, Hefei, Anhui, P. R. China
xhwang6@mail.ustc.edu.cn

Abstract. Semi-supervised learning methods construct classifiers using both labeled and unlabeled training samples. While unlabeled samples can help improve the accuracy of trained models to a certain extent, existing methods still face difficulties when the labeled data is insufficient and biased against the underlying data distribution. In this paper, we present a clustering-based classification (CBC) approach. In this approach, the training data, including both labeled and unlabeled samples, is first clustered with the guidance of the labeled data. Some of the unlabeled samples are then labeled based on the resulting clusters. Discriminative classifiers can subsequently be trained on the expanded labeled dataset. The effectiveness of the proposed method is justified analytically, and related issues such as how to expand the labeled dataset and how to interleave clustering with classification are discussed. Our experimental results demonstrate that CBC outperforms existing algorithms when the labeled dataset is very small.

1. Introduction

Text classification is a supervised learning task of assigning natural language text documents to one or more predefined categories or classes according to their contents. While it has been a classical problem in the field of information retrieval for half a century, it has recently attracted an increasing amount of attention due to the ever-expanding amount of text available in digital form. Its applications span a number of areas, including the automatic processing of emails, filtering of junk emails, and cataloguing of Web pages and news articles. A large number of techniques have been developed for text classification, including Naive Bayes (Lewis 1998), Nearest Neighbor (Masand 1992), neural networks (Ng 1997), regression (Yang 1994), rule induction (Apte 1994), and Support Vector Machines (SVM) (Vapnik 1995, Joachims 1998). Among them, SVM has been recognized as one of the most effective text classification methods; Yang & Liu gave a comparative study of many of these algorithms (Yang 1999).

As supervised learning methods, most existing text classification algorithms require sufficient training data so that the obtained classification model can generalize well. When the number of training samples in each class decreases, the classification accuracy of traditional text classification algorithms degrades dramatically. However, in practical applications, labeled documents are often very sparse, because manually labeling data is tedious and costly, while unlabeled documents are often abundant. As a result, exploiting unlabeled data in text classification has recently become an active research problem. The general problem of exploiting unlabeled data in supervised learning is referred to as semi-supervised learning, or the labeled-unlabeled problem, in different contexts.

The problem, in the context of text classification, can be formalized as follows. Each sample text document is represented by a vector x ∈ ℝ^d. We are given two datasets D_l and D_u. Dataset D_l is a labeled dataset consisting of samples (x_i, t_i), where 1 ≤ i ≤ n and t_i is the class label with 1 ≤ t_i ≤ c. Dataset D_u is an unlabeled dataset consisting of unlabeled samples x_i, n+1 ≤ i ≤ n+m. The semi-supervised learning task is to construct a classifier with small generalization error on unseen data¹ based on both D_l and D_u. A number of semi-supervised text classification methods have been reported recently, including Co-Training (Blum & Mitchell, 1998), Transductive SVM (TSVM) (Joachims, 1999), and EM (Nigam et al., 2000); a comprehensive review can be found in Seeger (2001).

While it has been reported that those methods obtain considerable improvement over traditional supervised methods when the training dataset is relatively small, our experiments indicate that they still face difficulties when the labeled dataset is extremely small, e.g., fewer than 10 labeled examples per class. This is somewhat expected, as most of those methods adopt the same iterative approach, which trains an initial classifier based heavily on the distribution presented in the labeled data. When the labeled set contains a very small number of samples, and those samples are far from their corresponding class centers due to the high dimensionality, such methods often have a poor starting point and accumulate more errors over the iterations. On the other hand, although many more unlabeled samples are available, which should be more representative, they are not made full use of in the classification process.

The above observation motivated the work reported in this paper. We present CBC, a clustering-based approach for classifying text documents using both labeled and unlabeled data. The philosophical difference between our approach and existing ones is that we treat semi-supervised learning as clustering aided by the labeled data, while existing algorithms treat it as classification aided by the unlabeled data. Traditional clustering is unsupervised and requires no training examples. However, the labeled data can provide important hints about the latent class variables. The labeled data also help determine the parameters associated with clustering methods, thus influencing the final clustering result. Furthermore, the label information can be propagated to unlabeled data according to the clustering results. The expanded labeled set can then be used by subsequent discriminative classifiers to obtain low generalization error on unseen data. Experimental results indicate that our approach outperforms existing approaches, especially when the original labeled dataset is very small.

Our contributions can be summarized as follows. (1) We propose a novel clustering-based classification approach that requires minimal labeled data in the training dataset to achieve high classification accuracy. (2) We provide an analysis that gives some insight into the problem and propose various implementation strategies. (3) We conduct comprehensive experiments to validate our approach and study related issues. The remainder of the paper is organized as follows. Section 2 reviews several existing methods. Our approach is outlined in Section 3 with some analysis. The detailed algorithm is then presented in Section 4. A performance study using several standard text datasets is presented in Section 5. Finally, Section 6 concludes the paper.

¹ A transductive setting of this problem just uses the seen unlabeled data as testing data.

2. Semi-Supervised Learning: Motivations

As defined in the previous section, semi-supervised learning uses both the labeled dataset D_l and the unlabeled dataset D_u to construct a classification model. However, how the unlabeled data can help in classification is not a trivial problem. Different methods have been proposed according to different views of the unlabeled data.

Expectation-Maximization (EM) (Dempster et al., 1977) has a long history in semi-supervised learning. The motivation of EM is as follows. Essentially, any classification method learns a conditional probability model P(t|x,θ) from a certain model family to fit the real joint distribution P(x, t). With unlabeled data, a standard statistical approach to assessing the fitness of a learned model P(t|x,θ) is

    Σ_{x_i ∈ D_l} log P(t_i|x_i, θ)P(x_i) + Σ_{x ∈ D_u} log Σ_t P(t|x, θ)P(x)    (1)

where the latent labels of the unlabeled data are treated as missing variables. Given Eq. 1, a Maximum Likelihood Estimation (MLE) process can be conducted to find an optimal θ. Because the form of the likelihood often makes it difficult to maximize by partial derivatives, the EM algorithm is generally used to find a locally optimal θ. For example, Nigam et al. (2000) combined EM with Naive Bayes and obtained improved performance over supervised classifiers. Theoretically, if a θ close to the global optimum can be found, the result will also be optimal under the given model family. However, the selection of a plausible model family is difficult, and the local optima problem is serious, especially given a poor starting point. For example, in Nigam's approach, EM is initialized by Naive Bayes classifiers trained on the labeled data, which may be heavily biased when there is insufficient labeled data.
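As a concrete illustration of this EM procedure (a minimal sketch of the general idea, not the implementation of Nigam et al.; all function and variable names are our own), the following code runs EM over a multinomial Naive Bayes model, fixing the posteriors of the labeled documents to one-hot vectors and re-estimating the missing labels of the unlabeled documents in each E-step:

```python
import numpy as np

def semi_supervised_nb_em(X_l, y_l, X_u, n_classes, n_iters=20, alpha=1.0):
    """EM for multinomial Naive Bayes on labeled (X_l, y_l) and unlabeled X_u.

    X_l, X_u: term-count matrices (documents x vocabulary).
    Labeled posteriors are fixed to one-hot; unlabeled posteriors are the
    missing variables, re-estimated in each E-step.
    """
    X = np.vstack([X_l, X_u])
    n_l = X_l.shape[0]
    # Initialize posteriors: one-hot for labeled, uniform for unlabeled.
    post = np.full((X.shape[0], n_classes), 1.0 / n_classes)
    post[:n_l] = 0.0
    post[np.arange(n_l), y_l] = 1.0
    for _ in range(n_iters):
        # M-step: class priors and word distributions from soft counts
        # (Laplace smoothing with parameter alpha).
        priors = post.sum(axis=0) / post.sum()
        word_counts = post.T @ X + alpha              # (classes x vocab)
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        # E-step: recompute posteriors for the unlabeled documents only.
        log_joint = X @ np.log(word_probs).T + np.log(priors)
        log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(log_joint)
        p /= p.sum(axis=1, keepdims=True)
        post[n_l:] = p[n_l:]
    return priors, word_probs, post
```

With a single labeled document per class, the M-step is dominated by the soft counts of the unlabeled documents, which is exactly why a biased initialization can steer EM toward a poor local optimum.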

Co-Training and TSVM were proposed more recently and have shown superior performance over EM in many experiments. The Co-Training method (Blum & Mitchell, 1998) splits the feature set as x = (x₁, x₂) and trains two classifiers θ₁ and θ₂, each of which is sufficient for classification. Under the compatibility assumption, i.e., P(t|x₁,θ₁) = P(t|x₂,θ₂), Co-Training uses unlabeled data to place an additional restriction on the model parameter distribution P(θ), thus improving the estimation of the real θ. The algorithm initially constructs two classifiers based on the labeled data, and the two classifiers mutually select several confident examples to expand the training set. This rests on the assumptions that an initial weak predictor can be found and that the two feature sets are conditionally independent. However, when the labeled dataset is small, it is often heavily biased against the real data distribution, and the above assumptions are seriously violated.

TSVM (Joachims, 1999) adopts a totally different way of exploiting the unlabeled data: it maximizes the margin over both the labeled and the unlabeled data. TSVM works by finding a labeling t_{n+1}, t_{n+2}, ..., t_{n+m} of the unlabeled data D_u and a hyperplane <w, b> that separates both D_l and D_u with maximum margin. TSVM expects to find a low-density area of the data and constructs a linear separator in this area. Although empirical results indicate the success of the method, there is a concern that the large-margin hyperplane over the unlabeled data is not necessarily the real classification hyperplane (Zhang, 2000). In text classification, because of the high dimensionality and data sparseness, there are often many low-density areas between positive and negative labeled examples.

Instead of using two conditionally independent feature sets as in the co-training setting, Raskutti (2002) co-trained two SVM classifiers using two feature spaces from different views: one is the original feature space, and the other is derived from clustering the labeled and unlabeled data. Nigam & Ghani (2002) proposed two hybrid algorithms, co-EM and self-training, using two randomly split feature sets in the co-training setting. After exhaustive experiments, they found that co-training is better than non-co-training algorithms such as self-training.

In summary, all existing semi-supervised methods still work in a supervised fashion; that is, they pay more attention to the labeled dataset and rely heavily on the distribution presented in it. With the help of the unlabeled data, extra information on the data distribution can improve generalization performance. However, if the number of samples in the labeled data is extremely small, such algorithms may not work well, as the labeled data can hardly represent the distribution of unseen data from the beginning. This is often the case for text classification, where the dimensionality is very high and a small labeled dataset represents just a few isolated points in a huge space. In Figure 1, we depict the results of applying two algorithms to a text classification problem. The X-axis is the number of labeled samples in each class, and the Y-axis is their performance in terms of the micro-averaged F1 measure that will be defined in Section 5. We can see that the performance of both algorithms degrades dramatically when the number of labeled samples in each class drops below 16.

[Figure 1 plots the Micro-F1 measure (Y-axis, 0.2 to 0.8) against the number of labeled data in each class (X-axis, 0 to 512) for three curves: TSVM, Co-Training, and K-means.]

Figure 1. Performance of two existing algorithms with different sizes of labeled data (No. classes = 5, total number of training data samples = 4000).

In Figure 1, we also depict a dotted line to indicate the performance of using a clustering method, K-means, to cluster the same set of training data. In these experiments, we ignored the labels; hence the line represents the performance of unsupervised learning. It is interesting to see that when the number of labeled data in each class is less than 4, unsupervised learning in fact gives better performance than both semi-supervised learning algorithms. This motivated us to develop a clustering-based approach to the problem of semi-supervised learning.

3. Clustering Based Classification

Our approach first clusters the unlabeled data with the guidance of the labeled data. With those clusters, some originally unlabeled samples can be treated as labeled data with high confidence. Such an expanded labeled dataset can subsequently be used by classification algorithms to construct the final classification model.

3.1 The Basic Approach

CBC consists of the following two steps:

1. Clustering step: cluster the training dataset, including both the labeled and unlabeled data, and expand the labeled set according to the clustering result;
2. Classification step: train classifiers with the expanded labeled data and the remaining unlabeled data.
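The two steps above can be sketched as a small pipeline skeleton. This is our own illustration: `cluster_and_label` and `train_classifier` are placeholders for any concrete choices (soft-constrained k-means and TSVM in our implementation), and all names are hypothetical:

```python
def cbc(labeled, labels, unlabeled, cluster_and_label, train_classifier):
    """Two-step CBC skeleton.

    cluster_and_label: clusters labeled + unlabeled data with the guidance
        of the labels and returns (new_points, new_labels) for the unlabeled
        examples it can label with high confidence.
    train_classifier: trains the final (discriminative) classifier on the
        expanded labeled set plus any remaining unlabeled data.
    """
    # Step 1: clustering guided by the labeled data, then label propagation.
    new_points, new_labels = cluster_and_label(labeled, labels, unlabeled)
    expanded = labeled + new_points
    expanded_labels = labels + new_labels
    remaining = [x for x in unlabeled if x not in new_points]
    # Step 2: train a classifier on the expanded labeled set.
    return train_classifier(expanded, expanded_labels, remaining)
```

Any classifier trained in step 2 sees a much larger labeled set than the original one, which is the point of the decomposition.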

Figure 2 illustrates the traditional approach and our clustering-based approach, respectively. The black points and gray points in the figure represent data samples of two different classes. We have a very small number of labeled data, e.g., one per class, represented by the points with "+" and "-" signs. Apparently, a classification algorithm trained with these two points will most likely find line A, as shown in Figure 2(a), as the class boundary, and it is rather difficult to discover the real boundary B even with the help of the unlabeled data points. Firstly, because the initial labeled samples are highly biased, they provide poor starting points for iterative reinforcement algorithms such as Co-Training and EM. Moreover, the TSVM algorithm may also take line A as the result, because it happens to lie in a low-density area. In fact, in a feature space with high dimensionality, a single sample is often highly biased, and many low-density areas will exist.

Figure 2. An illustrative example of clustering-based classification. The black and gray data points are unlabeled examples. The big "+" and "-" are the two initially labeled examples, and the small "+" and "-" are examples labeled by clustering. (a) Classification with the original labeled data. (b) Expanding the labeled set by clustering. (c) Classification with the newly labeled data.

Our clustering-based approach is shown in Figures 2(b) and (c). During the first step, a clustering algorithm is applied to the training samples; in the example, it results in two clusters. Then we propagate the labels of the labeled samples to the unlabeled samples that are closest to the cluster centroids. As a result, we have more labeled samples, as shown in Figure 2(b). The second step of the approach is to use the expanded labeled data and the remaining unlabeled data to train a classifier. As the result, we obtain a better class boundary, as shown in Figure 2(c).

From the above description, we can see that our approach aims to combine the merits of both clustering and classification methods. That is, we use a clustering method to reduce the impact of the bias caused by the initially sparse labeled data. At the same time, with sufficient expanded labeled data, we can use discriminative classifiers to achieve better generalization performance than pure clustering methods.

3.2 Benefits of the Clustering Based Approach

In this subsection, we further analyze the benefit of integrating clustering into the classification process.

First, clustering methods are more robust to the bias caused by the initially sparse labeled data. Let us take k-means, the most popular clustering algorithm, as an example. In essence, k-means is a simplified version of EM working on spherical Gaussian distribution models. It can be approximately described by MLE of k spherical Gaussian distributions, where the means µ_1, ..., µ_k and the shared covariance Σ are latent variables. Thus, with the aid of the labeled data, the objective is to find an optimal θ = <µ_1, ..., µ_k, Σ> that maximizes the log-likelihood of Eq. 1, where P(x|t_i, θ) equals

    P(x|t_i, θ) = (2π)^{-d/2} |Σ|^{-1/2} · exp(-(1/2)(x - µ_i)^T Σ^{-1} (x - µ_i))    (2)

When the number of labeled examples is small, the unlabeled data dominate this likelihood, so the bias of the labeled examples does not much affect the likelihood estimation and the search for the optimal θ.

Second, our clustering method is in fact a generative classifier, i.e., it constructs a classifier derived from the generative model of its data, P(x|t,θ). Ng & Jordan theoretically and empirically analyzed the asymptotic performance of generative and discriminative classifiers (such as logistic regression, which is a general form of SVM). They showed that generative classifiers reach their asymptotic performance faster than discriminative classifiers (Ng & Jordan, 2002). Thus our clustering method is more effective with small training data and more easily achieves high performance when the labeled data is sparse. To address the problem that generative classifiers usually have higher asymptotic error than discriminative classifiers, a discriminative classification method such as TSVM can be used in the second step of our approach, i.e., after clustering the unlabeled data and expanding the labeled dataset.

Our clustering is guided by the labeled data. Generally, clustering methods address the issue of finding a partition of the available data that maximizes a certain criterion, e.g., intra-cluster similarity and inter-cluster dissimilarity. The labeled data can be used to modify the criterion. There are also parameters associated with each clustering algorithm, e.g., the number k in k-means, or the split strategy of the dendrogram in hierarchical clustering; the labeled data can also guide the selection of these parameters. In our current implementation, we use a soft-constrained version of the k-means algorithm for clustering, where k equals the number of classes in the given labeled dataset. The labeled data points are used to obtain the initial labeled centroids, which are used during the clustering process to constrain the clustering result. The detailed algorithm is described in Section 4.

3.3 Combining Clustering with Classification

It is interesting to note that a TSVM classifier can also provide more confident examples for clustering. That is, the two-step clustering-based classification, i.e., clustering followed by classification, can be viewed as a conceptual approach. Another strategy for combining clustering and classification is iterative reinforcement. That is, we first train a clustering model L1 based on all available data, obtaining an approximately correct classifier. Afterwards, we select from the unlabeled data the examples that are confidently classified by L1 (i.e., examples with high likelihood) and combine them with the original labeled data to train a new model L2. Because more labeled data are used, the obtained L2 is expected to be more accurate and can in turn provide more confident training examples for L1. We then use the new labeled dataset to train L1 again. This process is iterated until all examples are labeled.

One key issue here is how to expand the labeled dataset. In principle, we simply assign labels to the most confident p% of examples from each of the resulting clusters. If we choose p = 100% after the first clustering pass, we obtain the two-step approach.

First, we need to determine the value of p. The selection of p is a tradeoff between the number of labeled samples and the possible noise introduced by labeling errors. Obviously, with a higher p, a larger labeled dataset is obtained, and in general a classifier with higher accuracy can be trained with more training samples. On the other hand, when we expand more samples, we might introduce incorrectly labeled samples into the labeled dataset, which become noise and degrade the performance of the classification algorithm. Furthermore, a small p means more iterations in the reinforcement process.

Second, we need to choose confident examples. Note that any learned model is an estimate of the real data model P(x, t). We can regard examples as confidently classified by a given model if a slight change of θ has no impact on them. When more examples are given, the model estimate becomes more accurate, and the number of confident examples grows. As illustrated in Figures 2(a) and (b), even when some of the data points are wrongly classified, the most confident data points, i.e., the ones with the largest margin under a classification model and the ones nearest to the centroid under a clustering model, are correctly classified. That is, a slight change of the decision boundary or centroid will not affect the labels of these data points.
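Both notions of confidence can be computed directly. The sketch below is our own illustration (all names are hypothetical): it selects the fraction p of examples nearest to a cluster centroid under a clustering model, and the fraction p with the largest absolute margin under a linear classifier:

```python
import numpy as np

def confident_by_centroid(X, centroid, p):
    """Indices of the fraction p of rows of X closest to the centroid
    (cosine similarity, assuming rows and centroid are L2-normalized)."""
    sims = X @ centroid
    k = max(1, int(round(p * len(X))))
    return np.argsort(-sims)[:k]

def confident_by_margin(scores, p):
    """Indices of the fraction p of examples with the largest |margin|
    under a linear classifier (scores[i] = w . x_i + b)."""
    k = max(1, int(round(p * len(scores))))
    return np.argsort(-np.abs(scores))[:k]
```

These are the two selection rules used by the clustering step and the classification step, respectively, when expanding the labeled set.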

We assume that the class labels t are uniformly distributed. Since the Gaussian is spherical, the log-likelihood of a given data point x* and its estimated label t* is

    log P(x*, t*|θ) = log(P(x*|t*, θ)P(t*|θ)) = -c_1 ||x* - µ_{t*}||² + c_2    (3)

where c_1 and c_2 are positive constants. The most probable points under a single Gaussian distribution are thus the points nearest to the distribution mean.

To obtain the most confident examples from the result of TSVM, we take a probabilistic view of the TSVM. Let us take logistic regression as an example, which is a general form of discriminative methods. The objective is to maximize

    Σ_i log ( 1 / (1 + e^{-y_i f(x_i, θ)}) )    (4)

where f(x_i, θ) is a linear function depending on the parameter θ, and θ is typically a linear combination of the training examples. Under a margin-maximization classifier such as SVM, the likelihood of a given point x* with label t* = + can be derived from the above equation:

    P(x*, t* = +) = P(x*) · (1 - 1 / (1 + exp(Σ_j β_j t_j (Σ_k x_jk x*_k) + b)))    (5)

which considers the points with the largest margin the most probable.

4. The Algorithm

In this section, we present the detailed algorithm when CBC is applied to text data, which is generally represented by sparse term vectors in a high-dimensional space. Following the traditional IR approach, we tokenize all documents into terms and construct one vector component for each distinct term. Thus each document is represented by a vector (w_i1, w_i2, ..., w_id), where w_ij is weighted by TFIDF (Salton, 1991), i.e., w_ij = TF_ij × log(N/DF_j), where N is the total number of documents.

Assuming the term vectors are normalized, the cosine function is a commonly used similarity measure for two documents: sim(doc_j, doc_k) = Σ_{i=1}^{d} w_ji · w_ki. This measure is also used in the clustering algorithm to calculate the distance from an example to a centroid (which is also normalized).

This simple representation has proven effective for supervised learning (Joachims, 1998); e.g., in most tasks the classes are linearly separable. For the classification step, we use a TSVM classifier with a linear kernel. The detailed algorithm is presented as follows.
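The weighting and similarity above can be written out directly. This small sketch (our own illustration, not the paper's code) builds L2-normalized TFIDF vectors from raw term counts and compares documents by the dot product:

```python
import math

def tfidf_vectors(docs):
    """docs: list of term-count dicts. Returns L2-normalized TFIDF dicts
    with w_ij = TF_ij * log(N / DF_j)."""
    n = len(docs)
    df = {}
    for d in docs:
        for term in d:
            df[term] = df.get(term, 0) + 1
    vecs = []
    for d in docs:
        v = {t: tf * math.log(n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity of two normalized sparse vectors: sum_i u_i * v_i."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

Because the vectors are normalized, the same dot product measures the closeness of an example to a (normalized) cluster centroid.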

Algorithm CBC
Input: labeled dataset D_l; unlabeled dataset D_u.
Output: the fully labeled set D_l′ = D_l + (D_u, T_u*); a classifier L.

1. Initialize the current labeled and unlabeled sets: D_l′ = D_l, D_u′ = D_u.
2. Repeat until D_u′ = ∅:
   Clustering step
   - Calculate the initial centroids o_i = Σ_{j: t_j = i} x_j (i = 1..c, x_j ∈ D_l), and set the current centroids o_i* = o_i. The labels of the centroids, t(o_i) = t(o_i*), equal the labels of the corresponding examples.
   - Repeat until the clustering result does not change any more:
     − For each o_i, find the nearest current centroid o_j*. If t(o_i) ≠ t(o_j*), exit the loop.
     − Assign t(o_i*) to each x ∈ D_l + D_u that is nearer to o_i* than to any other centroid.
     − Update the current centroids o_i* = Σ_{j: t_j = i} x_j (i = 1..c, x_j ∈ D_l + D_u).
   - From each cluster, select the p% of examples x_i ∈ D_u′ nearest to o_i*, and add them to D_l′.
   Classification step
   - Train a TSVM classifier based on D_l′ and D_u′.
   - From each class, select the p% of examples x_i ∈ D_u′ with the largest margin, and add them to D_l′.

The algorithm implements the iterative reinforcement strategy. During each iteration, a soft-constrained version of k-means is used for clustering. We compute the centroids of the labeled data for each class (called "labeled centroids") and use them as the initial centroids for k-means. The value of k is set to the number of classes in the labeled data. Then we run k-means on both the labeled and unlabeled data. The loop terminates when the clustering result no longer changes, or just before a labeled centroid would be assigned to a wrong cluster. This places soft constraints on the clustering, because the constraints are based not on exact examples but on their centroid; the constraints reduce the bias in the labeled examples. Finally, unlabeled data are assigned the same label as the labeled centroid in the same cluster.

After clustering, we select the most confident examples (i.e., the examples nearest to the cluster centroids) to form a new labeled set, which, together with the remaining unlabeled data, is used to train a TSVM classifier. Then the examples with the largest margins are added to the new labeled set for the next iteration.

It should be noted that the time complexity of a TSVM classifier is much higher than that of an SVM classifier, because TSVM repeatedly switches the estimated labels of the unlabeled data while searching for the maximal-margin hyperplane. The more unlabeled data, the more time it requires. When there is no unlabeled data, TSVM reduces to a standard SVM and runs much faster. In our algorithm, the last round of classification is therefore done by a standard SVM classifier.
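The clustering step can be sketched as follows. This is our own minimal rendering (all names are ours) of the soft-constrained k-means described above, using cosine similarity on L2-normalized vectors, with the guard that stops just before a labeled centroid would fall into a wrong cluster:

```python
import numpy as np

def soft_constrained_kmeans(X_l, y_l, X_u, n_classes, max_iters=100):
    """k-means on the rows of X_l and X_u, initialized from the labeled
    centroids; returns propagated labels for all points and the centroids."""
    def normalize(M):
        return M / np.linalg.norm(M, axis=1, keepdims=True)

    X = normalize(np.vstack([X_l, X_u]))
    # Labeled centroids: mean of the labeled points per class, normalized.
    labeled_cents = normalize(np.vstack(
        [X[:len(X_l)][y_l == i].mean(axis=0) for i in range(n_classes)]))
    cents = labeled_cents.copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Soft constraint: stop before any labeled centroid's nearest
        # current centroid carries a different label.
        if np.any((labeled_cents @ cents.T).argmax(axis=1)
                  != np.arange(n_classes)):
            break
        new_labels = (X @ cents.T).argmax(axis=1)   # cosine similarity
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        cents = normalize(np.vstack(
            [X[labels == i].mean(axis=0) if np.any(labels == i) else cents[i]
             for i in range(n_classes)]))
    return labels, cents
```

The returned labels propagate each labeled centroid's class to every point in its cluster; selecting the p% of points most similar to their centroid then yields the expanded labeled set.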

5. A Performance Study

To evaluate the effectiveness of our approach, a comprehensive performance study was conducted using several standard text datasets.

5.1 Datasets

Three commonly used datasets, 20-Newsgroups, Reuters-21578, and Open Directory Project (ODP) webpages, were used in our experiments. For each document, we extract features from its title and body separately for the Co-Training algorithm, and extract a single feature vector from both title and body for all other algorithms. Stop words are eliminated and stemming is performed for all features. For body features, all words with low document frequency (set to less than three in the experiments) are removed. TFIDF is then used to index both titles and bodies, where IDF is calculated only on the training dataset. Words that appear only in the test dataset and not in the training dataset are discarded. Because the training time of the TSVM classifier is proportional to the number of classes, we did not use all the classes in each dataset. In the experiments, we select five classes from 20-Newsgroups (the same as Nigam & Ghani (2002)), the ten biggest classes from Reuters-21578, on which much research has been conducted, and the six biggest classes from ODP.

From the 20-Newsgroups dataset², we select only the five comp.* discussion groups, which form a very confusing but evenly distributed classification dataset, i.e., there are almost 1000 articles in each group. We choose 80% of each group as the training set and the remaining 20% as the test set, which gives us 4000 training examples and 1000 test examples. This split is similar to that used in Nigam & Ghani (2002). For each article, the subject line and body are retained and other content is discarded. After preprocessing, there are 14171 distinct terms, with 14059 in the body features and 2307 in the title features.

The Reuters-21578 dataset is downloaded from Yiming Yang's homepage³. We use the ModApte split to form the training and test sets, which contain 7769 training examples and 3019 test examples. From the whole dataset, we select the ten biggest classes: earn, acq, money-fx, grain, crude, trade, interest, ship, wheat, and corn. After this selection, there are 6649 training examples and 2545 test examples. After preprocessing, there are only 7771 distinct terms, with 7065 in the body features and 6947 in the title features, respectively.

The ODP webpages dataset used in our experiments is composed of the six biggest classes in the second level of the ODP directory⁴: Business/Management (858 documents), Computers/Software (2411), Shopping/Crafts (877), Shopping/Home&Garden (1170), Society/Religion&Spirituality (886), and Society/Holidays (881). We select 50% of each category as the training data and the remainder as the test data. From the experimental results below, we see that there is only a small difference between a 90% and a 50% split, so we conclude that such a split is sufficient for training a classifier. The webpages are preprocessed by an HTML parser, and pure text is extracted. After preprocessing, there are 16818 distinct terms in the body, 3729 in the title, and 17050 in title+body.

² Available at http://www.ai.mit.edu/people/jrennie/20Newsgroups/
³ http://www-2.cs.cmu.edu/~yiming/
⁴ http://dmoz.org/

Joachims's SVM-light package⁵ is used for SVM and TSVM classification. We use a linear kernel and set the weight C of the slack variables to its default value. The basic classifiers for the two feature sets in the Co-Training method are Naive Bayes. During each iteration, we add the 1% of examples with the maximal classification confidence, i.e., the examples with the largest margin, into the labeled set, and performance is evaluated on the combined features.

5.2 Evaluation Metrics

We use the micro-averaged F1 measure over all classes to evaluate the classification results. Since this is a multi-class classification problem, the TSVM or SVM in our algorithm constructs several one-versus-rest binary classifiers. For each class i ∈ [1, c], let A_i be the number of documents whose real label is i, B_i the number of documents whose label is predicted to be i, and C_i the number of correctly predicted examples in this class. The precision and recall of class i are defined as P_i = C_i/B_i and R_i = C_i/A_i, respectively.

The F1 measure combines precision and recall into a unified measure. Because F1 differs across classes, two averaging functions can be used to judge the combined performance. The first is macro-averaging:

    F1_macro = (1/c) Σ_{i=1}^{c} (2 × P_i × R_i) / (P_i + R_i)    (6)

The second averaging function is micro-averaging, which first calculates the total precision and recall over all classes, P = ΣC_i/ΣB_i and R = ΣC_i/ΣA_i; the F1 measure is then:

    F1_micro = (2 × P × R) / (P + R)    (7)

Because the micro-average of F1 is in fact a weighted average over all classes, which is more appropriate for highly unevenly distributed classes, we use this measure in the following experiments.
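Eqs. 6 and 7 can be computed directly from the per-class counts A_i, B_i, C_i; the helper below is our own illustration:

```python
def f1_scores(A, B, C):
    """A[i]: number of documents truly in class i; B[i]: number predicted
    as class i; C[i]: number correctly predicted in class i.
    Returns (macro_F1, micro_F1) as in Eqs. 6 and 7."""
    per_class = []
    for a, b, c in zip(A, B, C):
        p = c / b if b else 0.0          # precision P_i = C_i / B_i
        r = c / a if a else 0.0          # recall    R_i = C_i / A_i
        per_class.append(2 * p * r / (p + r) if p + r else 0.0)
    macro = sum(per_class) / len(per_class)
    # Micro-averaging pools the counts before computing P and R.
    P = sum(C) / sum(B)
    R = sum(C) / sum(A)
    micro = 2 * P * R / (P + R) if P + R else 0.0
    return macro, micro
```

Pooling the counts is what makes the micro-average weight each class by its size, which is why it is preferred for uneven class distributions.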

5.3 Results

We first evaluate our algorithms and compare them with TSVM and Co-Training on 5 comp.*

newsgroups data set (See Figure 3). In our algorithm, we set the parameter p to 100%, which means to

only cluster once and classify once. The selection of p will be explained later.

5

Available at http://svmlight.joachims.org/.

Figure 3. Comparative study of K-means, TSVM, Co-Training, and our algorithm CBC when given small training data on the 5 comp.* newsgroups data set.

The algorithms are run several times with the number of labeled examples per class ranging over 1, 2, 4, 8, ..., 512. For each labeled-data size, we randomly choose 10 labeled sets, test the three methods on each, and plot the averages in Figure 3.

From Figure 3, we can see that when the number of labeled examples is very small, CBC performs significantly better than the other algorithms. For example, when the number of labeled examples per class is 1, our method outperforms TSVM by 5% and Co-Training by 22%; when it is 2, our method outperforms TSVM by 9% and Co-Training by 12%. Only when the number of labeled samples exceeds 64 do TSVM and Co-Training achieve slightly better performance than our method.

To evaluate the performance of CBC over a larger range of labeled data, we run it together with TSVM, Co-Training, and SVM on different percentages of labeled data on the above three datasets. Figures 4, 5, and 6 illustrate the results: the horizontal axis indicates the percentage of labeled data among all training data, and the vertical axis is the micro-F1 measure. We vary the percentage from 0.50% to 90%. A basic SVM, which does not exploit unlabeled data, always performs worst because the labeled data is too sparse for it to generalize effectively.

On all datasets, CBC performs best when the labeled-data percentage is below 1%. As the number of labeled documents increases, the performance of our algorithm remains comparable to the other methods. The Co-Training algorithm performs poorly when the labeled set is small because it assumes an initial weak predictor can be found; with very little training data, that predictor is often heavily biased against the real data distribution. However, Co-Training sometimes obtains superior results, especially on the Reuters dataset in Figure 4, because it exploits the feature split to form two views of the training data. In Figures 4, 5, and 6, TSVM always has an intermediate


performance between Co-Training and our algorithm.


Figure 4. Micro-F1 of four algorithms on the Reuters data set


Figure 5. Micro-F1 of four algorithms on the ODP data set


Figure 6. Micro-F1 of four algorithms on the 5 comp.* newsgroups data set

Figure 7. Micro-F1 of different algorithm settings with 0.5% labeled data on 5 comp.* news groups.

Figure 8. Variance of different algorithms given different labeled data size (each calculated over 10

randomly selected samples).

One limitation of our algorithm is that as labeled data increases, performance grows slowly and sometimes even drops (as in Figure 4, on the Reuters set). This can be explained by the simple way labeled examples are integrated in the clustering step: labeled examples can not only provide constraints for the clustering but could also be used to modify the similarity measure. We plan to evaluate this further in future work.

In Figure 7, we empirically analyze the proposed algorithm under different values of the parameter p, based on experiments on the 5 comp.* newsgroups data with the same initial 0.5% labeled data. The parameter p controls how much data is labeled in each iteration. The 10% curve means we increase the labeled data by 10% each time; the 100% curve means we label all the data at once; and the exp2 curve means we add examples in powers of two (i.e. we sequentially add 0.5%, 1%, 2%, 4%, ... of the examples to the labeled set).
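The three schedules can be sketched as the cumulative fraction of data labeled after each clustering/classification round. This is our illustration of the schedules named in the text, not the paper's implementation; the helper function and its names are ours:

```python
def schedule(kind: str, start: float = 0.005):
    """Yield the cumulative fraction of data labeled after each round."""
    if kind == "100%":          # label everything in a single round
        yield 1.0
    elif kind == "10%":         # grow the labeled set by 10% of the data per round
        frac = start
        while frac < 1.0:
            frac = min(1.0, frac + 0.10)
            yield frac
    elif kind == "exp2":        # add as many examples as are already labeled
        frac = start            # (0.5% + 0.5% = 1%, + 1% = 2%, + 2% = 4%, ...)
        while frac < 1.0:
            frac = min(1.0, frac * 2)
            yield frac

print(list(schedule("100%")))        # [1.0]
print(list(schedule("exp2"))[:4])    # [0.01, 0.02, 0.04, 0.08]
```

Note that adding 0.5%, 1%, 2%, 4%, ... increments is the same as doubling the labeled set each round, which is why the exp2 branch simply multiplies by two.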

As can be seen from Figure 7, although the 10% and exp2 selection schedules improve the classification monotonically, a simpler but more effective setting of this algorithm is to just cluster once


and classify once. This can be explained by the fact that although the clustering step provides informative examples for the classification step, the classification step selects few examples that are informative for subsequent clustering in the proposed algorithm. That is, the examples selected by the TSVM and SVM classifiers cannot provide enough information for the clustering algorithm.

Finally, we evaluate the stability of the three algorithms by their variance, calculated over 10 randomly selected labeled sets for each sample size (see Figure 8). This experiment is also conducted on the 5 comp.* newsgroups. In general, TSVM has the smallest variance and Co-Training the largest. The large variance of Co-Training is natural, for the aforementioned reason that the initial labeled data has a large impact on the initial weak predictor. The variance of our algorithm is lower than Co-Training's but higher than TSVM's, mainly because our simple version of the constrained k-means algorithm may also fall into a local minimum given a poor starting point.
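The seeding idea behind this use of labeled data in k-means can be sketched as follows: each centroid is initialized from the labeled examples of one class rather than from a random starting point, which mitigates (though does not eliminate) the local-minimum sensitivity discussed above. This is our simplified illustration, not the paper's exact constrained k-means:

```python
import numpy as np

def seeded_kmeans(X, seeds, labels, n_iter=10):
    """X: (n, d) data; seeds: (m, d) labeled points; labels: class of each seed."""
    k = len(set(labels))
    # Initialize centroid c as the mean of the labeled examples of class c.
    centroids = np.stack([seeds[np.array(labels) == c].mean(axis=0) for c in range(k)])
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assign = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = X[assign == c].mean(axis=0)
    return assign

# Two well-separated toy clusters, one labeled example per class.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
seeds = np.array([[0.0, 0.0], [5.0, 5.0]])
assign = seeded_kmeans(X, seeds, labels=[0, 1])
print(assign.tolist())  # -> [0, 0, 1, 1]
```

Because cluster indices coincide with class labels under this seeding, the resulting assignment can be read directly as predicted labels for the unlabeled examples.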

6. Conclusion and Future Work

This paper presented a clustering-based approach for semi-supervised text classification. Our experiments showed the superior performance of the method over existing methods such as TSVM and Co-Training when the labeled data set is extremely small. When there is sufficient labeled data, our method is comparable to TSVM and Co-Training.

To the best of our knowledge, no prior work has focused on clustering examples with the aid of labeled data. Work on constrained clustering (Wagstaff et al., 2001; Klein et al., 2002) is the most closely related: it uses prior knowledge in the form of cannot-link and must-link constraints to guide clustering algorithms. In our case, however, the labeled data provide not only such constraints but also label information that can be exploited to assign labels to unlabeled data.

The constrained clustering method in this paper may not be sophisticated enough to capture all the information carried in labeled data. We plan to evaluate other clustering methods and to further adjust the similarity measure with the aid of labeled examples. Another direction is to evaluate the validity of the two general classifiers used in our framework, and to investigate example selection, confidence assessment, and noise control for further performance improvement.

References

Apte, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM TOIS, 12(3), 223-251.

Raskutti, B., Ferra, H., & Kowalczyk, A. (2002). Combining clustering and co-training to enhance text classification using unlabelled data. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining.

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (pp. 92-100).

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec & C. Rouveirol (Eds.), Proceedings of the European Conference on Machine Learning (pp. 137-142). Berlin: Springer.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200-209). San Francisco: Morgan Kaufmann.

Klein, D., Kamvar, S. D., & Manning, C. D. (2002). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the Nineteenth International Conference on Machine Learning.

Lewis, D. D. (1998). Naïve Bayes at forty: The independence assumption in information retrieval. In Proceedings of the European Conference on Machine Learning.

Masand, B., Linoff, G., & Waltz, D. (1992). Classifying news stories using memory based reasoning. In Proceedings of the 15th ACM SIGIR Conference (pp. 59-64).

Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14.

Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th ACM SIGIR Conference.

Nigam, K., & Ghani, R. (2002). Analyzing the effectiveness and applicability of co-training. In Proceedings of the 9th International Conference on Information and Knowledge Management.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

Salton, G. (1991). Developments in automatic text retrieval. Science, 253, 974-979.

Seeger, M. (2001). Learning with labeled and unlabeled data. Technical report, University of Edinburgh.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Wagstaff, K., Cardie, C., Rogers, S., & Caruana, R. (2001). Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning.

Yang, Y., & Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM TOIS, 12(3), 252-277.

Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd ACM SIGIR Conference.

Zhang, T., & Oles, F. (2000). A probability analysis on the value of unlabeled data for classification problems. In Proceedings of the Seventeenth International Conference on Machine Learning.
