Automated Extraction of Concepts and Identification of Topics from Large Text Document Collections
1 Overview
1.1 Document Representation
Typically the vector space model is used for object representation: original documents are represented by vectors of virtually any dimensionality, whose scalar components are called attributes or features (or terms, in the case of text documents). To build a term vector, the content of a document is analysed to extract the terms and count their frequencies. Preprocessing methods such as stemming, stop word removal, case folding and thesaural substitution of synonyms improve the results of subsequent processing steps by a significant margin. Each vector is weighted, typically using a TF/IDF weighting scheme, and terms carrying little information may be dropped altogether, leaving only terms with high discriminative power (dimensionality reduction by term selection; see the next section for more on dimensionality reduction).
Any pair of vectors can be compared by a similarity coefficient or a distance measure, which defines a metric in the vector space. Note that objects may also be represented by other data structures (for example suffix trees), provided a function for comparing any pair of objects can be defined.
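As an illustration of how term vectors might be built, the following minimal sketch applies case folding, stop word removal and one common TF/IDF variant (raw term count times log(N/df)). The stop word list, the tokenizer and the function names are assumptions made for this example; stemming and synonym substitution are left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a real resource).
STOP_WORDS = {"the", "a", "of", "and", "is", "on"}

def tokenize(text):
    """Case-fold, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, list of TF/IDF vectors), one vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    vectors = [[c[t] * math.log(n_docs / df[t]) for t in vocab] for c in counts]
    return vocab, vectors

docs = ["The cat sat on the mat", "The dog sat on the log", "Cats and dogs"]
vocab, vectors = tfidf_vectors(docs)
```

Because no stemming is applied here, "cat" and "cats" end up as separate terms, which is exactly the kind of mismatch the preprocessing steps named above are meant to remove.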
1.2 Dimensionality Reduction Methods
Two major types of dimension reduction methods can be distinguished: linear and non-linear. Linear techniques result in each of the components of the new variable being a linear combination of the original variables. Linear techniques are usually simpler and easier to implement than more recent methods that consider non-linear transforms.
Principal Component Analysis (PCA): PCA is, in the mean-square error sense, the best linear dimension reduction technique (note that there is a non-linear version too). Being based on the covariance matrix of the variables, it is a second-order method. PCA reduces the dimension of the data by finding orthogonal linear combinations (the PCs) of the original variables with the largest variance. Since the variance depends on the scale of the variables, each variable is first standardized to have mean zero and standard deviation one. After the standardization, the original variables, with possibly different units of measurement, are all in comparable units. The first PC is the linear combination with the largest variance, the second PC is the linear combination with the second largest variance that is orthogonal to the first PC, and so on. The interpretation of the PCs can be difficult: although they are uncorrelated variables constructed as linear combinations of the original variables, they do not necessarily correspond to meaningful concepts.
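The PCA steps described above (standardize, form the covariance matrix, take the directions of largest variance) can be sketched with NumPy as follows. The function name and the toy data are assumptions for illustration, and the sketch assumes that no variable is constant, since a zero standard deviation would break the standardization.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X (items x variables) onto the leading PCs."""
    X = np.asarray(X, dtype=float)
    # Standardize: zero mean and unit standard deviation per variable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    cov = np.cov(Z, rowvar=False)
    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]           # orthogonal directions
    return Z @ components, eigvals[order]    # scores and their variances

X = [[2.5, 2.4, 1.0], [0.5, 0.7, 3.0], [2.2, 2.9, 0.9],
     [1.9, 2.2, 1.5], [3.1, 3.0, 0.2]]
scores, variances = pca(X, 2)
```

The returned score columns are uncorrelated, and the variance of each column equals the corresponding eigenvalue of the covariance matrix.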
Latent Semantic Indexing (LSI, also called Latent Semantic Analysis) / Singular Value Decomposition (SVD): LSI was developed to resolve the so-called vocabulary mismatch problem. It handles synonymy (variability in human word choice) and polysemy (the same word often has different meanings) by considering the context of words. LSI infers dependence among the original terms and produces new, independent dimensions by looking at the patterns of term co-occurrence in the original document vectors and compressing these vectors into a lower-dimensional space whose dimensions are combinations of the original dimensions. At the heart of LSI lies an advanced statistical technique, the singular value decomposition (SVD), which is used to extract latent terms, whereby a latent term corresponds to a concept that may be described by several keywords. A term-document matrix is built from the weighted term vectors of the documents and is submitted to SVD, which constructs an n-dimensional abstract semantic space in which each original word is represented as a vector. LSI's representation of a document is the average of the vectors of the words it contains, independent of their order. Construction of the SVD matrix is computationally expensive, and there may be cases in which the matrix size cannot be reduced effectively. Nevertheless, LSI dimensionality reduction helps to reduce noise and automatically organizes documents into a semantic structure allowing efficient and powerful retrieval: relevant documents are retrieved even if they do not literally contain the query words. Like PCA, LSI is computationally very expensive and therefore can hardly be applied to larger data sets.
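A minimal sketch of the SVD step, assuming NumPy and a tiny unweighted term-document matrix invented for this example. It uses the common convention of taking document coordinates directly from the truncated SVD rather than averaging word vectors; the function names are hypothetical.

```python
import numpy as np

def lsi(term_doc, k):
    """Rank-k LSI: return (term vectors, document vectors) in latent space."""
    A = np.asarray(term_doc, dtype=float)            # terms x documents
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]                     # one row per term
    doc_vecs = Vt[:k, :].T * s[:k]                   # one row per document
    return term_vecs, doc_vecs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rows: car, automobile, engine, flower, petal. Columns: four toy documents.
A = [[1, 0, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 1]]
terms, docs = lsi(A, 2)
```

In the original space the rows for "car" and "automobile" have a cosine of 0.5. After reduction to two latent dimensions they point in the same direction, while the vehicle documents and the flower document remain orthogonal.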
Multidimensional Scaling (MDS): MDS attempts to find the structure in a matrix of proximity measures between objects. A matrix containing (dis-)similarity values between each pair of objects is computed in the original, high-dimensional space. Objects are projected into a lower-dimensional space by solving a minimization problem such that the distances between points in the low-dimensional space match the original (dis-)similarities as closely as possible, minimizing a goodness-of-fit measure called stress. MDS can be used to analyze any kind of distance or similarity matrix, but there is no simple way to interpret the nature of the resulting dimensions: axes from the MDS analysis are arbitrary and can be rotated in any direction. MDS has been one of the most widely used mapping techniques in information science, especially for document visualization. Traditionally MDS is computationally very expensive; recently, however, different nonlinear MDS approaches have been proposed that promise to handle larger data sets.
Factor Analysis (FA): FA is a linear multivariate exploratory technique, based on second-order data summaries, that can be used to examine a wide range of data sets. Primary applications of factor analytic techniques are (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, or to classify variables. Contrary to other methods, the factors can often be interpreted. FA assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. The goal of FA is to uncover such relations, and it can thus be used to reduce the dimension of datasets following the factor model.
Self-Organizing Maps (SOMs): SOMs are an artificial neural network approach to information visualization. During the learning phase, a self-organizing map algorithm iteratively modifies weight vectors to produce a typically 2-dimensional map in the output layer that exhibits as well as possible the relationships in the input layer. SOMs appear to be one of the most promising algorithms for organizing large volumes of information, but they have some significant deficiencies, including the absence of a cost function and the lack of a theoretical basis for choosing learning rate parameter schedules and neighborhood parameters to ensure topographic ordering. There are no general proofs of convergence, and the model does not define a probability density.
Random Projections (RP): A result of Johnson and Lindenstrauss asserts that any set of n points in d-dimensional Euclidean space can be embedded into q-dimensional Euclidean space, where q is logarithmic in n and independent of d, so that all pairwise distances are maintained within an arbitrarily small factor. Constructions of such embeddings involve projecting the n points onto a random q-dimensional hyperplane. The computational cost of RP is low, yet it still offers distance-preserving properties that make it an attractive candidate for certain dimensionality reduction tasks. Whereas the computational complexity of, for example, PCA is O(n*p^2)+O(p^3), n being the number of data items and p the dimensionality of the original space, RP has a time complexity of O(npq), where q is the dimensionality of the target space, which is logarithmic in n.
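One common way to realise such a projection is to multiply each vector by a matrix of independent Gaussian entries scaled by 1/sqrt(q). The sketch below assumes this construction (it is one of several possible choices), uses only the standard library, and all names, sizes and seeds are illustrative.

```python
import math
import random

def random_projection_matrix(d, q, seed=0):
    """d x q matrix of i.i.d. Gaussian entries scaled by 1/sqrt(q)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(q) for _ in range(q)]
            for _ in range(d)]

def project(x, R):
    """Map a d-dimensional vector to q dimensions in O(d*q) operations."""
    q = len(R[0])
    return [sum(x[i] * R[i][j] for i in range(len(x))) for j in range(q)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d, q = 1000, 50
R = random_projection_matrix(d, q)
rng = random.Random(1)
a = [rng.random() for _ in range(d)]
b = [rng.random() for _ in range(d)]
# Ratio of projected to original distance; expected to be close to 1.
ratio = euclidean(project(a, R), project(b, R)) / euclidean(a, b)
```

Projecting n vectors this way costs O(ndq), matching the O(npq) figure quoted above with p = d, and no decomposition of the data matrix is needed.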
Independent Component Analysis (ICA): ICA is a higher-order method that seeks linear projections (although there is a non-linear variant), not necessarily orthogonal to each other, that are as nearly statistically independent as possible. Statistical independence is a much stronger condition than uncorrelatedness: while the latter only involves the second-order statistics, the former depends on all the higher-order statistics.
Projection Pursuit (PP): PP is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and thus is useful for non-Gaussian datasets. It is more computationally intensive than second-order methods.
Pathfinder Network Scaling: Pathfinder Network Scaling is a structural and procedural modeling technique which extracts underlying patterns in proximity data and represents them spatially in a class of networks called Pathfinder Networks (PFnets). Pathfinder algorithms take estimates of the proximities between pairs of items as input and define a network representation of the items that preserves only the most important links. The resulting Pathfinder network consists of the items as nodes and a set of links connecting pairs of the nodes (the links are undirected for symmetrical and directed for non-symmetrical proximity estimates).
1.3 Cluster Analysis
Clustering methods form categories of related data objects from unorganised sets of data objects. Objects which are assigned to the same category, or cluster, must be similar according to a certain criterion, while objects which are not related must be assigned to different clusters. The procedure may also be applied hierarchically to create a hierarchy of clusters. This makes clustering similar to automatic classification, with the difference that in classification the categories are already known before processing (supervised process), while in clustering the categories are created dynamically during processing (unsupervised process). Clustering algorithms can be applied to almost any kind of data, not only text documents. Text documents are typically categorised thematically on the basis of their content, but the grouping of related objects can take place according to any other criteria. Generally speaking, the number of approaches and different principles used for clustering is very large, and new methods are continuously being developed, each having different characteristics tuned for application in specific areas.
Concept indexing (CI) is a dimensionality reduction technique based on clustering methods, aiming to be equally effective for unsupervised and supervised dimensionality reduction. The technique promises to achieve retrieval performance comparable to that obtained using LSI, while requiring an order of magnitude less time. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k clusters, which represent the axes of the lower-dimensional space, and then describing each document in terms of these new dimensions.
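The second step of concept indexing can be sketched as follows, under the assumption that the cluster assignment is already available from some clustering step and that every cluster is non-empty. Each document is described by its cosine similarity to the normalised centroid of each cluster; the function names and the toy data are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def concept_index(vectors, labels, k):
    """Describe each document by its similarity to k cluster centroids."""
    docs = [normalize(v) for v in vectors]
    dim = len(docs[0])
    centroids = []
    for c in range(k):
        members = [d for d, lab in zip(docs, labels) if lab == c]
        mean = [sum(m[j] for m in members) / len(members) for j in range(dim)]
        centroids.append(normalize(mean))
    # Each new coordinate is the cosine between a document and one centroid.
    return [[sum(a * b for a, b in zip(d, c)) for c in centroids] for d in docs]

vectors = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 4], [0, 1, 2, 2]]
labels = [0, 0, 1, 1]   # cluster of each document, assumed to be given
reduced = concept_index(vectors, labels, k=2)
```

Each four-dimensional document is now a two-dimensional vector whose largest coordinate corresponds to its own cluster.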
2 Definitions
Each data object (document) $d_i$ from the corpus $D = \{d_1, \ldots, d_N\}$ is represented by a feature vector $v$ of dimensionality $L$ with the form $v = (x_1, \ldots, x_L)$. The scalar components $x_j$, where $j = 1 \ldots L$, are the frequencies of the object's features (terms).
A set of $N$ vectors in a space of dimensionality $L$ can be represented by an $L \times N$ matrix $S = (d_1 \ldots d_N)$.
A cluster $c_i$ containing $n$ objects can be viewed as a set $c_i = \{d_1, \ldots, d_n\}$ if the order of its members is of no significance, or as an ordered list $c_i = \langle d_1, \ldots, d_n \rangle$ if the order of its members plays a role.
A fractional degree of membership, used by fuzzy clustering methods, of the object $d_i$ in the cluster $c_j$ is denoted by $u_{ij} \in [0, 1]$, where 0 stands for no membership and 1 stands for full membership. For non-fuzzy clustering methods $u_{ij}$ is discrete and has a value of either 0 or 1.
A set of $M$ clusters is denoted by $C = \{c_1, \ldots, c_M\}$ if the order of clusters is of no significance, or as an ordered list $C = \langle c_1, \ldots, c_M \rangle$ if the order of clusters plays a role.
A similarity coefficient of the form $S(v_1, v_2)$ can be applied to any pair of vectors and returns a value in the interval $[0, 1]$, where 0 means no similarity at all and 1 means that the vectors are equal.
A distance measure of the form $D(v_1, v_2)$ can be applied to any pair of vectors and returns a value in the interval $[0, \infty)$, where 0 means the highest possible relatedness and infinity means that the vectors are not related at all.
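Two widely used instances of these definitions are the cosine coefficient and the Euclidean distance. They are shown here only as examples (the definitions above do not prescribe a particular measure), and the function names are chosen for this sketch.

```python
import math

def cosine_similarity(v1, v2):
    """S(v1, v2): lies in [0, 1] for non-negative vectors such as term counts."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def euclidean_distance(v1, v2):
    """D(v1, v2): lies in [0, infinity) and is 0 only for identical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

print(cosine_similarity([1, 0], [0, 1]))     # 0.0, no shared terms
print(euclidean_distance([0, 0], [3, 4]))    # 5.0
```

Note that the cosine coefficient already reaches 1 for vectors that are proportional to each other, so it treats a document and a longer document with the same term profile as equal.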
$n_{ci}$: number of documents in cluster $c_i$
$n_{kj}$: number of documents in a previously known category $k_j$
$n_{ci,kj}$: number of documents in cluster $c_i$ from category $k_j$
3 Clustering Algorithms
3.1 Characteristics of Clustering Algorithms
Partitional vs. hierarchical: Clustering methods are usually divided according to the cluster structure which they produce. Partitional methods divide a set of N data objects into M clusters, producing a "flat" structure of clusters, i.e. each cluster contains only data objects. The more complex hierarchical methods produce a nested hierarchy of clusters where each cluster may contain clusters and/or data objects.
Agglomerative vs. divisive: Hierarchical methods can be agglomerative or divisive. The agglomerative methods, which are more common in practice, begin with single data objects, each representing one cluster, and perform up to N-1 merges of pairs of clusters. The divisive methods begin with all data objects placed in a single cluster and perform up to N-1 divisions of the clusters into smaller units to produce the hierarchy.
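The agglomerative procedure can be sketched as below. The single-link criterion (distance between the closest members) is an assumption chosen for brevity, as is the naive search over all cluster pairs, which is far too slow for large collections.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2, points):
    """Distance between two clusters = distance of their closest members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    """Return the merges as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(points))]   # one cluster per object
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        merges.append((a, b, single_link(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerate(points)
```

For the five points above the procedure performs N-1 = 4 merges; the sequence of merges, read from first to last, is the cluster hierarchy built bottom-up.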
Exclusive vs. overlapping: If the clustering method is exclusive, each object can be assigned to only one cluster at a time. An overlapping strategy allows multiple assignments of any object.
Fuzzy vs. hard clusters: Fuzzy methods are overlapping in their nature, assigning to an object a degree of
membership between 0.0 and 1.0 to every cluster. Hard clusters allow a degree of membership to be
either 0 or 1.
Deterministic vs. stochastic: Deterministic methods produce the same clusters if applied to the same starting conditions, which is not the case with stochastic methods.
Incremental vs. non-incremental: Incremental methods allow successive adding of objects to an already existing clustering of objects. Non-incremental methods require that all items be known in advance, before the actual processing takes place.
Order sensitive vs. order insensitive methods: Order sensitive methods produce clusters that depend on the order in which the objects are added. In order insensitive methods the order of items does not play a role; the method produces the same clusters independently of the data object order.
Ordered vs. unordered clusters: In an ordered clustering, the order of the created clusters and their children has a defined meaning. An example of an ordered classification is a hierarchy, which can be helpful for searching.
Scalable vs. non-scalable: An algorithm yielding excellent quality of results may be inapplicable to large data sets because of high time and/or space complexity. Scalable methods are capable of handling large datasets (which may not even fit into the main memory) while producing good results.
High-dimensional vs. low-dimensional: An algorithm yielding good results in a low-dimensional space may perform poorly when high dimensionality is involved. High dimensionality is harder to handle and typically requires special approaches (which, in turn, may not be suited for handling low-dimensional data sets).
Noise-insensitive vs. noise-sensitive (capability to handle outliers): A noise-sensitive algorithm may perform well on a data set with no noise but will produce poor results even when a moderate amount of noise is present; in some cases even a few outliers may cause a drop in performance. Noise-insensitive methods handle noise with a significantly smaller drop in performance.
Irregularly shaped clusters vs. hyperspherical clusters: An algorithm may be capable of identifying irregularly shaped or elongated clusters, or it may only be capable of finding hyperspherical clusters. Depending on the application this capability may or may not be an advantage.
Monothetic vs. polythetic: Monothetic means that only one feature is taken into consideration to determine the membership in a cluster. Polythetic means that several features are taken into consideration at once.
Feature type dependent vs. feature type independent: A method may only be capable of handling features having Boolean, discrete or real values. A feature type independent algorithm is capable of handling any type of feature.
Similarity (distance) measure dependent vs. similarity measure independent: An algorithm (or a particular implementation) may only perform well if a particular similarity (distance) measure is used, or may work equally well with any similarity (distance) function.
Interpretability of results: A method may deliver
Reliance on a priori knowledge or pre-defined parameters: This includes setting different thresholds or setting the number of clusters to identify. Providing a priori knowledge may be mandatory or may be a hint for the algorithm in order to produce better results.
3.2 Classification of Clustering Algorithms
Partitioning Methods
  o Relocation Algorithms
  o Probabilistic Clustering
  o Square Error (K-means, K-medoids Methods)
  o Graph Theoretic
  o Mixture-Resolving (EM) and Mode-Seeking Algorithms
Locality-Based Methods
  o Random Distribution Methods
  o Density-Based Algorithms (Connectivity Clustering, Density Function Clustering)
Clustering Algorithms Used in Machine Learning
  o Gradient Descent
  o Artificial Neural Networks (SOMs, ARTs, …)
  o Evolutionary Methods (GAs, EPs, …)
Hierarchical Methods
  o Agglomerative Algorithms (bottom-up)
  o Divisive Algorithms (top-down)
Grid-Based Methods
Methods Based on Co-Occurrence of Categorical Data
Constraint-Based Clustering
Nearest Neighbor Clustering
Fuzzy Clustering
Search-Based Approaches (deterministic, stochastic)
Scalable Clustering Algorithms
Algorithms for High Dimensional Data
  o Subspace Clustering (top-down, bottom-up)
  o Projection Techniques
  o Co-Clustering Techniques
3.3 Representation of Clusters
Representation through cluster members, by using one object or a set of selected members.
Centroid. The centroid is the most common way of representing a cluster. It is usually defined as the center of mass of all contained objects: $c = \frac{1}{n} \sum_{i=1}^{n} d_i$. Vectors for the objects contained in a cluster should be normalised. If this is not the case, large vectors will have a much stronger impact on the position of the centroid than small vectors. If only the relative profiles of the objects, and not their sizes, should be taken into consideration, all vectors must be normalised.
Geometric representation. Some boundary points, a bounding polygon containing all members of the cluster or a convex hull constructed from the cluster members can be used.
Decision tree or predicate representation. A cluster can be represented by nodes in a decision tree, which is equivalent to using a series of conjunctive logical expressions like $x < 5 \wedge x > 2$.
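The effect of normalisation on the centroid can be seen in a small sketch. The two toy vectors are invented for illustration: a short document using only the first term and a long document using only the second.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Center of mass: component-wise mean of the member vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

short_doc = [1.0, 0.0]      # few term occurrences
long_doc = [0.0, 100.0]     # many term occurrences
raw = centroid([short_doc, long_doc])
unit = centroid([normalize(short_doc), normalize(long_doc)])
print(raw)    # [0.5, 50.0], dominated by the long document
print(unit)   # [0.5, 0.5], both documents contribute equally
```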
3.4 Evaluation of Clustering Results
Internal quality measures depend on the representation:
o Self-similarity of cluster $c_k$ is the average similarity of the documents in the cluster: $S(c_k) = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} simil(d_i, d_j)$, where $N$ here denotes the number of documents in $c_k$.
o Squared error of cluster $c_j$ (also called distortion) is the sum of squared distances between each cluster member and the cluster centroid: $E^2(c_j) = \sum_{d_i \in c_j} \| d_i - c_j \|^2$
o Compactness of cluster $c_j$ is the average squared distance between each cluster member $d_i \in c_j$ and the cluster centroid $c_j$: $C(c_j) = \frac{1}{n_{cj}} \sum_{d_i \in c_j} \| d_i - c_j \|^2$
o Maximum relative error of cluster $c_j$: $E_{rel,max}(c_j) = \frac{\max_{d_i \in c_j} \| d_i - c_j \|^2}{C(c_j)}$
TODO: check if this makes sense
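A minimal sketch of the centroid-based internal measures, reading them as follows: squared error is the sum of squared Euclidean distances to the centroid, compactness is that sum divided by the cluster size, and maximum relative error is the largest squared distance divided by the compactness. The use of the Euclidean distance and the function names are assumptions made for this example.

```python
def centroid(members):
    n = len(members)
    return [sum(m[j] for m in members) / n for j in range(len(members[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def squared_error(members):
    """Sum of squared distances of all members to the centroid."""
    c = centroid(members)
    return sum(sq_dist(d, c) for d in members)

def compactness(members):
    """Average squared distance to the centroid."""
    return squared_error(members) / len(members)

def max_relative_error(members):
    """Largest squared distance relative to the average one."""
    c = centroid(members)
    return max(sq_dist(d, c) for d in members) / compactness(members)

square = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
skewed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]
print(squared_error(square), compactness(square), max_relative_error(square))
# 8.0 2.0 1.0
print(squared_error(skewed), compactness(skewed), max_relative_error(skewed))
# 12.0 3.0 3.0
```

A maximum relative error of 1 means that all members are equally far from the centroid; larger values indicate that at least one member lies much further out than the average.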
External quality measures are based on a known categorization:
o Precision of cluster $c_i$ in relation to category $k_j$: $P_{ci,kj} = n_{ci,kj} / n_{ci}$
o Recall of cluster $c_i$ in relation to category $k_j$: $R_{ci,kj} = n_{ci,kj} / n_{kj}$
o Max. precision of cluster $c_i$: $P_{max,ci} = \max_j P_{ci,kj}$
o Max. recall of cluster $c_i$: $R_{max,ci} = \max_j R_{ci,kj}$
o Clustering precision: $P = \sum_{i=1}^{M} \frac{n_{ci}}{n} P_{max,ci}$, where $n = \sum_{i=1}^{M} n_{ci}$
o Clustering recall: $R = \sum_{i=1}^{M} \frac{n_{ci}}{n} R_{max,ci}$, where $n = \sum_{i=1}^{M} n_{ci}$
o Entropy for a cluster: $E_{ci} = - \sum_{j} P_{ci,kj} \log(P_{ci,kj})$
o Clustering entropy: $E = \sum_{i} \frac{n_{ci}}{n} E_{ci}$, where $n = \sum_{i} n_{ci}$
o Information gain: the difference between the entropy of the clustering and the entropy of a random partition.
o F-measure of cluster $c_i$: $F_{ci} = \frac{2 P_{max,ci} R_{max,ci}}{P_{max,ci} + R_{max,ci}}$
o Clustering F-measure: $F = \frac{\sum_{i=1}^{M} n_{ci} F_{ci}}{\sum_{i=1}^{M} n_{ci}}$
TODO: check this
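The external measures can be computed from a table of counts with one row per cluster and one column per known category. The sketch below is one reading of the definitions above: per-cluster values are combined as averages weighted by cluster size, entropy uses the natural logarithm with the usual negative sign, and no cluster or category is assumed to be empty. The function name and the toy counts are hypothetical.

```python
import math

def external_measures(counts):
    """counts[i][j] = number of documents of category j found in cluster i."""
    n_cats = len(counts[0])
    n_c = [sum(row) for row in counts]                             # cluster sizes
    n_k = [sum(row[j] for row in counts) for j in range(n_cats)]   # category sizes
    total = sum(n_c)
    result = {"precision": 0.0, "recall": 0.0, "entropy": 0.0, "f_measure": 0.0}
    for i, row in enumerate(counts):
        p = [row[j] / n_c[i] for j in range(n_cats)]   # precision per category
        r = [row[j] / n_k[j] for j in range(n_cats)]   # recall per category
        p_max, r_max = max(p), max(r)
        entropy = -sum(x * math.log(x) for x in p if x > 0)
        f = 2 * p_max * r_max / (p_max + r_max)
        weight = n_c[i] / total                        # size-weighted average
        result["precision"] += weight * p_max
        result["recall"] += weight * r_max
        result["entropy"] += weight * entropy
        result["f_measure"] += weight * f
    return result

perfect = external_measures([[5, 0], [0, 5]])
mixed = external_measures([[8, 2], [1, 9]])
```

A clustering that reproduces the categories exactly yields precision, recall and F-measure of 1 and entropy of 0; mixing categories within clusters lowers the first three and raises the entropy.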
4 Research Directions
When dealing with text document collections, three problems arise: very high dimensionality of the vector space, a high level of noise and, very often, a large number of data items to be analyzed. Common dimensionality reduction techniques such as LSI scale poorly and cannot be applied to large document collections. Only recently have techniques that can deal with large document collections, such as random projections (RP), been introduced.
Clustering of text document collections is hard due to high dimensionality: the number of scalable algorithms capable of handling high dimensionality is small. Recently, techniques such as Subspace Clustering and Projection Techniques have been explored and seem to have yielded acceptable results.
The interpretability of results poses another problem. Finding human-readable, expressive labels describing the clusters is a further problem we plan to address.