Automated Extraction of Concepts and Identification of Topics from Large Text Document Collections


1 Overview

1.1 Document Representation

Typically the vector space model is used for object representation: original documents are represented by vectors of virtually any dimensionality, where the vector's scalar components are called attributes or features (or terms, in the case of text documents). To build a term vector the content of a document is analysed to extract the terms and count their frequencies, whereby preprocessing methods such as stemming, stop word removal, case folding and thesaural substitution of synonyms improve the results of subsequent processing steps by a significant margin. Each vector is weighted, typically using a TF/IDF weighting scheme, and terms carrying a small amount of information may be dropped altogether, leaving only terms with high discriminative power (dimensionality reduction by term selection; see the next section for more on dimensionality reduction). Any pair of vectors can be compared by a similarity coefficient or a distance measure which defines a metric in the vector space. Note that objects may also be represented by other data structures (for example suffix trees), provided a function for comparing any pair of objects can be defined.
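As a concrete illustration, the following sketch builds TF/IDF term vectors and compares a pair of them with a similarity coefficient; the use of scikit-learn and the tiny corpus are assumptions made purely for the example.

```python
# Minimal vector space model sketch, assuming scikit-learn is available.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "clustering groups related documents together",
    "dimensionality reduction compresses document vectors",
    "documents are represented as weighted term vectors",
]

# Case folding and English stop word removal are built in; stemming and
# thesaural substitution of synonyms would need an extra preprocessing step.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
term_doc = vectorizer.fit_transform(corpus)      # N x L TF/IDF matrix

# Any pair of vectors can be compared with a similarity coefficient,
# here the cosine similarity, which lies in [0, 1] for term vectors.
print(cosine_similarity(term_doc[0], term_doc[2]))
```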

1.2 Dimensionality Reduction Methods

Two major types of dimension reduction methods can be distinguished: linear and non-linear. Linear techniques result in each of the components of the new variable being a linear combination of the original variables. Linear techniques are usually simpler and easier to implement than more recent methods considering non-linear transforms.



Principal Component Analysis (PCA): Principal component analysis (PCA) is, in the mean-square error sense, the best linear dimension reduction technique (note that there is a non-linear version too). Being based on the covariance matrix of the variables, it is a second-order method. PCA reduces the dimension of the data by finding orthogonal linear combinations (the PCs) of the original variables with the largest variance. Since the variance depends on the scale of the variables, each variable is standardized to have mean zero and standard deviation one. After the standardization, the original variables with possibly different units of measurement are all in comparable units. The first PC is the linear combination with the largest variance, the second PC is the linear combination with the second largest variance and orthogonal to the first PC, and so on. The interpretation of the PCs can be difficult: despite the fact that they are uncorrelated variables constructed as linear combinations of the original variables, they do not necessarily correspond to meaningful concepts.
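A minimal sketch of these steps (standardization, covariance matrix, projection onto the top-variance directions) in plain NumPy; the data shape and the target dimensionality k are illustrative assumptions.

```python
import numpy as np

X = np.random.rand(100, 20)                 # 100 objects, 20 original variables
k = 2                                       # target dimensionality (assumed)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit standard deviation
cov = np.cov(Xs, rowvar=False)              # second-order statistics only
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric

order = np.argsort(eigvals)[::-1]           # sort PCs by decreasing variance
components = eigvecs[:, order[:k]]          # first k principal components
X_reduced = Xs @ components                 # project the data onto the k PCs
```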




Latent Semantic Indexing (LSI, also called Latent Semantic Analysis) / Singular Value Decomposition (SVD): LSI was developed to resolve the so-called vocabulary mismatch problem. It handles synonymy (variability in human word choice) and polysemy (the same word often has different meanings) by considering the context of words. LSI infers dependence among the original terms and produces new, independent dimensions by looking at the patterns of term co-occurrence in the original document vectors and compressing these vectors into a lower-dimensional space whose dimensions are combinations of the original dimensions. At the heart of LSI lies an advanced statistical technique - the singular value decomposition (SVD) - used to extract latent terms, whereby a latent term corresponds to a concept that may be described by several keywords. A term-document matrix is built from the documents' weighted term vectors and is submitted to SVD, which constructs an n-dimensional abstract semantic space in which each original word is represented as a vector. LSI's representation of a document is the average of the vectors of the words it contains, independent of their order. Construction of the SVD matrix is computationally expensive and although there may be cases in which the matrix size cannot be reduced effectively, LSI dimensionality reduction helps to reduce noise and automatically organizes documents into a semantic structure allowing efficient and powerful retrieval: relevant documents are retrieved even if they do not literally contain the query words. Similar to PCA, LSI is computationally very expensive and therefore can hardly be applied to larger data sets.
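A rough LSI sketch along these lines: TF/IDF term vectors are compressed with a truncated SVD into a lower-dimensional semantic space. scikit-learn, the corpus and the number of latent dimensions are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["car engine repair", "automobile motor maintenance",
          "bank interest rates", "river bank erosion"]

term_doc = TfidfVectorizer().fit_transform(corpus)   # weighted term-document matrix
lsi = TruncatedSVD(n_components=2)                   # SVD keeps 2 latent dimensions
doc_vectors = lsi.fit_transform(term_doc)            # documents in the semantic space
print(doc_vectors.shape)                             # (4 documents, 2 latent dimensions)
```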



Multidimensional Scaling (MDS): Multidimensional Scaling (MDS) attempts to find the structure in a matrix of proximity measures between objects. A matrix containing (dis-)similarity values between each pair of objects is computed in the original, high-dimensional space. Objects are projected into a lower-dimensional space by solving a minimization problem such that the distances between points in the low-dimensional space match the original (dis-)similarities as closely as possible, minimizing a goodness-of-fit measure called stress. MDS can be used to analyze any kind of distance or similarity matrix, but there is no simple way to interpret the nature of the resulting dimensions: axes from the MDS analysis are arbitrary, and can be rotated in any direction. MDS has been one of the most widely used mapping techniques in information science, especially for document visualization. Traditionally MDS is computationally very expensive; recently, however, different nonlinear MDS approaches have been proposed that promise to handle larger data sets.
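A minimal MDS sketch under the same idea, assuming scikit-learn; the dissimilarity matrix is computed with the cosine distance and the data are random placeholders.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.manifold import MDS

X = np.random.rand(50, 300)                          # 50 objects, 300 original dimensions
dissim = pairwise_distances(X, metric="cosine")      # (dis)similarity matrix between objects

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)                   # 2-D coordinates, e.g. for visualization
print(mds.stress_)                                   # goodness-of-fit measure (stress)
```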



Factor Analysis (FA): Factor analysis (FA) is a linear multivariate exploratory technique, based on second-order data summaries, that can be used to examine a wide range of data sets. The primary applications of factor analytic techniques are: (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, or to classify variables. Contrary to other methods, the factors can often be interpreted. FA assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. The goal of FA is to uncover such relations, and it can thus be used to reduce the dimension of datasets following the factor model.
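A short factor-analysis sketch, assuming scikit-learn's FactorAnalysis; the data and the number of assumed common factors are illustrative.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.rand(200, 30)                # 200 observations, 30 measured variables
fa = FactorAnalysis(n_components=5)        # assume 5 unknown common factors
scores = fa.fit_transform(X)               # factor scores per observation (200 x 5)
loadings = fa.components_                  # how each variable loads on each factor (5 x 30)
```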



Self-Organizing Maps (SOMs): SOMs are an artificial neural network approach to information visualization. During the learning phase, a self-organizing map algorithm iteratively modifies weight vectors to produce a typically 2-dimensional map in the output layer that reflects the relationships of the input layer as well as possible. SOMs appear to be one of the most promising algorithms for organizing large volumes of information, but they have some significant deficiencies, including the absence of a cost function and the lack of a theoretical basis for choosing learning rate parameter schedules and neighborhood parameters to ensure topographic ordering. There are no general proofs of convergence, and the model does not define a probability density.
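A deliberately small SOM training loop in plain NumPy to make the iterative weight updates concrete; the map size, learning-rate schedule and neighborhood radius are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 16))                      # 500 input vectors, 16 features
rows, cols, dim = 10, 10, data.shape[1]
weights = rng.random((rows, cols, dim))           # one weight vector per map unit
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

for t, x in enumerate(data):
    lr = 0.5 * np.exp(-t / len(data))             # decaying learning rate schedule
    sigma = 3.0 * np.exp(-t / len(data))          # shrinking neighborhood radius
    # best-matching unit: the unit whose weight vector is closest to the input x
    bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)), (rows, cols))
    dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-dist2 / (2 * sigma ** 2))         # Gaussian neighborhood function
    weights += lr * h[..., None] * (x - weights)  # pull the BMU and its neighbors toward x
```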



Random Projections (RP): A result of Johnson and Lindenstrauss asserts that any set of n points in d-dimensional Euclidean space can be embedded into q-dimensional Euclidean space, where q is logarithmic in n and independent of d, so that all pairwise distances are maintained within an arbitrarily small factor. Constructions of such embeddings involve projecting the n points onto a random q-dimensional hyperplane. The computational cost of RP is low, but it still offers distance-preserving properties that make it an attractive candidate for certain dimensionality reduction tasks. Whereas the computational complexity of, for example, PCA is O(n*p^2) + O(p^3), n being the number of data items and p being the dimensionality of the original space, RP has a time complexity of O(npq), where q is the dimensionality of the target space, which is logarithmic in n.
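A random-projection sketch of the Johnson-Lindenstrauss embedding, assuming scikit-learn; n, d and the allowed distortion eps are illustrative values.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

n, d = 1000, 10000
X = np.random.rand(n, d)

# q depends only on n and the allowed distortion eps, not on d.
q = johnson_lindenstrauss_min_dim(n_samples=n, eps=0.3)
print(q)

rp = GaussianRandomProjection(n_components=q)   # projection onto a random q-dim hyperplane
X_low = rp.fit_transform(X)                     # pairwise distances are roughly preserved
```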



Independent Component Analysis (ICA): ICA is a higher-order method that seeks linear projections (although there is a non-linear variant), not necessarily orthogonal to each other, that are as nearly statistically independent as possible. Statistical independence is a much stronger condition than uncorrelatedness: while the latter only involves second-order statistics, the former depends on all the higher-order statistics.
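A minimal ICA sketch, assuming scikit-learn's FastICA; the two synthetic source signals and the mixing matrix exist only for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # two independent source signals
mixed = sources @ np.array([[1.0, 0.5], [0.4, 1.0]])     # linearly mixed observations

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mixed)   # estimated, statistically independent components
```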



Projection Pursuit (PP): PP is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and thus is useful for non-Gaussian datasets. It is more computationally intensive than second-order methods.



Pathfinder Network Scaling: Pathfinder Network Scaling is a structural and procedural modeling technique which extracts underlying patterns in proximity data and represents them spatially in a class of networks called Pathfinder Networks (PFnets). Pathfinder algorithms take estimates of the proximities between pairs of items as input and define a network representation of the items that preserves only the most important links. The resulting Pathfinder network consists of the items as nodes and a set of links connecting pairs of the nodes (the links are undirected for symmetrical proximity estimates and directed for non-symmetrical ones).

1.3 Cluster Analysis

Clustering methods form categories of related data objects from unorganised sets of data objects. Objects which are assigned to the same category, or cluster, must be similar according to a certain criterion, while objects which are not related must be assigned to different clusters. The procedure may also be applied hierarchically to create a hierarchy of clusters. This makes clustering similar to automatic classification, with the difference that in the case of classification the categories are already known before processing (supervised process), while in the case of clustering the categories are created dynamically during processing (unsupervised process). Clustering algorithms can be applied to almost any kind of data, not only text documents. Text documents are typically categorised thematically on the basis of their content, but the grouping of related objects can take place according to any other criteria. Generally speaking the number of approaches and different principles used for clustering is very large, and new methods are continuously being developed, each having different characteristics tuned for application in specific areas.

Concept indexing (CI) is a dimensionality reduction technique based on clustering methods, aiming to be equally effective for unsupervised and supervised dimensionality reduction. The technique promises to achieve retrieval performance comparable to that obtained using LSI, while requiring an order of magnitude less time. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k clusters, which represent the axes of the lower-dimensional space, and then describing each document in terms of these new dimensions.
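A rough sketch in the spirit of this description: documents are clustered into k groups and then re-described by their similarity to the k cluster centroids. The library, corpus and value of k are assumptions for the example, not part of the original CI formulation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["stock market prices", "market trading volume",
          "soccer match results", "football league table"]
k = 2                                                  # assumed number of concept axes

term_doc = TfidfVectorizer().fit_transform(corpus)     # weighted term vectors
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(term_doc)

# Each document is re-described by its similarity to the k cluster centroids,
# giving a k-dimensional representation of the collection.
reduced = cosine_similarity(term_doc, km.cluster_centers_)
print(reduced.shape)                                   # (number of documents, k)
```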

2 Definitions



•  Each data object (document) d_i from the corpus D = {d_1, ..., d_N} is represented by a feature vector v of dimensionality L of the form v = (x_1, ..., x_L). The scalar components x_j, where j = 1...L, are the frequencies of the object's features (terms).

•  A set of N vectors in a space of dimensionality L can be represented by an N x L matrix S = (d_1 ... d_N).

•  A cluster c_i containing n objects can be viewed as a set c_i = {d_1, ..., d_n} if the order of its members is of no significance, or as an ordered list c_i = <d_1, ..., d_n> if the order of its members plays a role.

•  A fractional degree of membership, used by fuzzy clustering methods, of the object d_i to the cluster c_j is denoted by u_ij ∈ [0, 1], where 0 stands for no membership and 1 stands for full membership. For non-fuzzy clustering methods u_ij is discrete and has a value of either 0 or 1.

•  A set of M clusters is denoted by C = {c_1, ..., c_M} if the order of clusters is of no significance, or as an ordered list C = <c_1, ..., c_M> if the order of clusters plays a role.

•  A similarity coefficient of the form S(v_1, v_2) can be applied to any pair of vectors and returns a value in the interval [0, 1], where 0 means no similarity at all and 1 means that the vectors are equal.

•  A distance measure of the form D(v_1, v_2) can be applied to any pair of vectors and returns a value in the interval [0, ∞), where 0 means the highest possible relatedness and infinity means that the vectors are not related at all (a small sketch of both functions follows these definitions).

•  n_ci - the number of documents in cluster c_i

•  n_kj - the number of documents in a previously known category k_j

•  n_ci,kj - the number of documents in cluster c_i that come from category k_j
3 Clustering Algorithms

3.1 Characteristics of Clustering Algorithms



•  Partitional vs. hierarchical: Clustering methods are usually divided according to the cluster structure which they produce. Partitional methods divide a set of N data objects into M clusters, producing a "flat" structure of clusters, i.e. each cluster contains only data objects. The more complex hierarchical methods produce a nested hierarchy of clusters where each cluster may contain clusters and/or data objects (a minimal sketch contrasting the two follows this list).

•  Agglomerative vs. divisive: Hierarchical methods can be agglomerative or divisive. The agglomerative methods, which are more common in practice, produce up to N-1 connections of pairs, beginning from single data objects, each representing one cluster. The divisive methods begin with all data objects placed in a single cluster and perform up to N-1 divisions of the clusters into smaller units to produce the hierarchy.

•  Exclusive vs. overlapping: If the clustering method is exclusive, each object can be assigned to only one cluster at a time. An overlapping strategy allows multiple assignments of any object.

•  Fuzzy vs. hard clusters: Fuzzy methods are overlapping in their nature, assigning to an object a degree of membership between 0.0 and 1.0 for every cluster. Hard clusters allow a degree of membership of either 0 or 1.

•  Deterministic vs. stochastic: Deterministic methods produce the same clusters if applied to the same starting conditions, which is not the case with stochastic methods.

•  Incremental vs. non-incremental: Incremental methods allow successive adding of objects to an already existing clustering of objects. Non-incremental methods require that all items be known in advance, before the actual processing takes place.

•  Order sensitive vs. order insensitive methods: Order sensitive methods produce clusters that depend on the order in which the objects are added. In order insensitive methods the order of items does not play a role; the method produces the same clusters independently of the data object order.

•  Ordered vs. unordered clusters: In ordered clusterings the order of the created clusters and their children has a defined meaning. An example of an ordered classification is a hierarchy, which can be helpful for searching.

•  Scalable vs. non-scalable: An algorithm yielding excellent quality of results may be inapplicable to large data sets because of high time and/or space complexity. Scalable methods are capable of handling large datasets (which may not even fit into main memory) while producing good results.

•  High-dimensional vs. low-dimensional: An algorithm yielding good results in a low-dimensional space may perform poorly when high dimensionality is involved. High dimensionality is harder and typically requires special approaches (which may not be suited for handling low-dimensional data sets).

•  Noise-insensitive vs. noise-sensitive (capability to handle outliers): A noise-sensitive algorithm may perform well on a data set with no noise but will produce poor results even when a moderate amount of noise is present; in some cases even a few outliers may cause a drop in performance. Noise-insensitive methods handle noise with a significantly smaller drop in performance.

•  Irregularly shaped vs. hyperspherical clusters: An algorithm may be capable of identifying irregularly shaped or elongated clusters, or it may only be capable of finding hyperspherical clusters. Depending on the application this capability may or may not be an advantage.

•  Monothetic vs. polythetic: Monothetic means that only one feature is taken into consideration to determine the membership in a cluster. Polythetic means that multiple features are taken into consideration at once.

•  Feature type dependent vs. feature type independent: A method may only be capable of handling features having Boolean, discrete or real values. A feature type independent algorithm is capable of handling any type of feature.

•  Similarity (distance) measure dependent vs. similarity measure independent: An algorithm (or a particular implementation) may only perform well if a particular similarity (distance) measure is used, or may work equally well with any similarity (distance) function.



•  Interpretability of results: A method may deliver clusters that are easy to interpret and label, or results whose meaning is difficult to explain.



•  Reliance on a priori knowledge or pre-defined parameters: This includes setting different thresholds or setting the number of clusters to identify. Providing a priori knowledge may be mandatory, or may serve as a hint for the algorithm in order to produce better results.
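As referenced in the first item above, a minimal sketch contrasting a partitional and a hierarchical method, assuming scikit-learn; the data and parameters are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.random.rand(60, 5)

flat = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # "flat" partition
hier = AgglomerativeClustering(n_clusters=3).fit(X)                    # bottom-up hierarchy

print(flat[:10])            # one cluster label per object
print(hier.children_[:5])   # the first few pairwise merges of the cluster hierarchy
```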

3.2 Classification of Clustering Algorithms



•  Partitioning Methods
   o  Relocation Algorithms
   o  Probabilistic Clustering
   o  Square Error (K-means, K-medoids Methods)
   o  Graph Theoretic
   o  Mixture-Resolving (EM) and Mode-Seeking Algorithms

•  Locality-Based Methods
   o  Random Distribution Methods
   o  Density-Based Algorithms (Connectivity Clustering, Density Function Clustering)

•  Clustering Algorithms Used in Machine Learning
   o  Gradient Descent
   o  Artificial Neural Networks (SOMs, ARTs, …)
   o  Evolutionary Methods (GAs, EPs, …)

•  Hierarchical Methods
   o  Agglomerative Algorithms (bottom-up)
   o  Divisive Algorithms (top-down)

•  Grid-Based Methods

•  Methods Based on Co-Occurrence of Categorical Data

•  Constraint-Based Clustering

•  Nearest Neighbor Clustering

•  Fuzzy Clustering

•  Search-Based Approaches (deterministic, stochastic)

•  Scalable Clustering Algorithms

•  Algorithms for High Dimensional Data
   o  Subspace Clustering (top-down, bottom-up)
   o  Projection Techniques
   o  Co-Clustering Techniques



3.3 Representation of Clusters



•  Representation through cluster members, by using one object or a set of selected members.

•  Centroid: the most common way of representing a cluster. It is usually defined as the center of mass of all contained objects: c̄ = (1/n) Σ_{i=1..n} d_i. The vectors of the objects contained in a cluster should be normalised; if this is not the case, large vectors will have a much stronger impact on the position of the centroid than small vectors. If only the relative profiles of the objects, and not their sizes, should be taken into consideration, all vectors must be normalised (a small sketch follows this list).

•  Geometric representation: some boundary points, a bounding polygon containing all members of the cluster, or a convex hull constructed from the cluster members can be used.

•  Decision tree or predicate representation: a cluster can be represented by nodes in a decision tree, which is equivalent to using a series of conjunctive logical expressions like x < 5 ∧ x > 2.
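A small sketch of the centroid representation referenced above, with the normalisation step included; the shapes are placeholders.

```python
import numpy as np

cluster = np.random.rand(25, 100)     # 25 member vectors with 100 features each

# Length-normalise first so that large documents do not dominate the centre of mass.
normalised = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
centroid = normalised.mean(axis=0)    # c = (1/n) * sum of the member vectors
```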

3.4 Evaluation of Clustering Results



•  Internal quality measures depend on the representation:

   o  Self-similarity of cluster c_k is the average similarity of the documents in the cluster:
      S(c_k) = Σ_{d_i, d_j ∈ c_k, i≠j} simil(d_i, d_j) / (n_ck (n_ck - 1))

   o  Squared error of cluster c_i (also called distortion) is the sum of squared distances between each cluster member and the cluster centroid:
      E^2(c_i) = Σ_{d_j ∈ c_i} ||d_j - c̄_i||^2

   o  Compactness of cluster c_j is the average squared distance between each cluster member d_i ∈ c_j and the cluster centroid c̄_j:
      C(c_j) = Σ_{d_i ∈ c_j} ||d_i - c̄_j||^2 / n_cj

   o  Maximum relative error of cluster c_j:
      E^2_rel,max(c_j) = max_{d_i ∈ c_j} ||d_i - c̄_j||^2 / C(c_j)
      TODO: check if this makes sense

•  External quality measures are based on a known categorization (a sketch computing these follows the list):

   o  Precision of cluster c_i in relation to category k_j: P_ci,kj = n_ci,kj / n_ci

   o  Recall of cluster c_i in relation to category k_j: R_ci,kj = n_ci,kj / n_kj

   o  Max. precision of cluster c_i: P_max,ci = max_j P_ci,kj

   o  Max. recall of cluster c_i: R_max,ci = max_j R_ci,kj

   o  Clustering precision: P = Σ_{i=1..M} n_ci P_max,ci / N

   o  Clustering recall: R = Σ_{i=1..M} n_ci R_max,ci / N

   o  Entropy for a cluster: E_ci = -Σ_j P_ci,kj log(P_ci,kj)

   o  Clustering entropy: E = Σ_i n_ci E_ci / N

   o  Information gain: the difference in entropy between the clustering and the entropy of a random partition.

   o  F-measure of cluster c_i: F_ci = 2 P_max,ci R_max,ci / (P_max,ci + R_max,ci)

   o  Clustering F-measure: F = Σ_{i=1..M} n_ci F_ci / Σ_{i=1..M} n_ci
      TODO: check this
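A sketch computing the external measures above from the counts n_ci, n_kj and n_ci,kj; the contingency table is invented for illustration, and the entropy is written with the conventional minus sign.

```python
import numpy as np

# rows = clusters c_i, columns = known categories k_j, entries = n_ci,kj
n = np.array([[40,  5],
              [10, 45]])
n_ci = n.sum(axis=1)                       # documents per cluster
n_kj = n.sum(axis=0)                       # documents per known category
N = n.sum()                                # total number of documents

P = n / n_ci[:, None]                      # precision P_ci,kj
R = n / n_kj[None, :]                      # recall    R_ci,kj
P_max, R_max = P.max(axis=1), R.max(axis=1)

clustering_precision = (n_ci * P_max).sum() / N
clustering_recall = (n_ci * R_max).sum() / N
cluster_entropy = -(P * np.log(P)).sum(axis=1)   # per-cluster entropy (no zero counts here)
F = 2 * P_max * R_max / (P_max + R_max)          # per-cluster F-measure
clustering_F = (n_ci * F).sum() / n_ci.sum()
print(clustering_precision, clustering_recall, clustering_F)
```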



4 Research Directions

When dealing with collections of text documents three problems arise: the very high dimensionality of the vector space, a high level of noise and, very often, a large number of data items to be analyzed. Common dimensionality reduction techniques such as LSI scale poorly and cannot be applied to large document collections. Only recently have techniques which can deal with large document collections, such as random projections (RP), been introduced.

Clustering of text document collections is hard due to high dimensionality: the number of scalable algorithms capable of handling high dimensionality is small. Recently, techniques such as Subspace Clustering and Projection Techniques have been explored and seem to have yielded acceptable results.


Another problem pose
s the interpretability of results. Finding human
-
readable, expressive labels describing
the clusters is another problem we plan to address.