DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING
Jessica Lin and Dimitrios Gunopulos
Ângelo Cardoso
IST/UTL
December 2009

Outline
1. Introduction
   1. Latent Semantic Indexing (LSI)
   2. Random Projection (RP)
2. Combining LSI and Random Projection
3. Experiments
   1. Dataset and pre-processing
   2. Document Similarity
   3. Document Clustering

Introduction
Latent Semantic Indexing
- Vector-space model
  - Term-to-document matrix where each entry is the relative frequency of a term in the document
- Find a k-dimensional subspace onto which to project the original term-to-document matrix
  - SVD gives the optimal solution in the mean-squared-error sense
- Speed up queries
- Address synonymy
- Find the intrinsic dimensionality of the data

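The projection described above can be sketched with a truncated SVD. This is a minimal illustration on made-up random data, not the setup used in the experiments; the function name `lsi_project` and the matrix sizes are assumptions chosen for the example.

```python
import numpy as np

def lsi_project(A, k):
    """Project the columns (documents) of a term-by-document matrix A onto k dimensions."""
    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]            # top-k left singular vectors: basis of the subspace
    docs_k = U_k.T @ A        # k x n document coordinates, equal to diag(s[:k]) @ Vt[:k]
    A_k = U_k @ docs_k        # rank-k approximation of A
    return docs_k, A_k

rng = np.random.default_rng(0)
A = rng.random((30, 12))      # toy data: 30 terms, 12 documents (made up)
docs_k, A_k = lsi_project(A, k=3)

# Squared Frobenius error of the rank-k approximation versus the
# sum of the squared singular values that were discarded.
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, "fro") ** 2
discarded = np.sum(s[3:] ** 2)
```

Here `err` and `discarded` agree up to rounding, which is the sense in which the SVD is optimal in squared error: no other rank-k matrix gets closer to A.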
Introduction
Random Projection
- What if we construct the projection subspace at random?
- Johnson-Lindenstrauss lemma
  - If points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved
- Making the subspace orthogonal is computationally expensive
- However, we can rely on a result by Hecht-Nielsen:
  - In a high-dimensional space, there exists a much larger number of almost orthogonal than orthogonal directions

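Both claims can be checked numerically on a small example. The sketch below assumes a Gaussian random matrix scaled by 1/sqrt(k), which is one common construction and not necessarily the one used in the experiments; the sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 2000, 20, 500            # original dimension, number of points, target dimension
X = rng.standard_normal((m, n))    # columns are points (toy data)

# Random projection matrix; the 1/sqrt(k) scaling preserves squared lengths in expectation.
# The rows are NOT orthogonalized.
R = rng.standard_normal((k, m)) / np.sqrt(k)
Y = R @ X

def pair_dists(M):
    """Euclidean distances between all distinct pairs of columns of M."""
    i, j = np.triu_indices(M.shape[1], 1)
    return np.linalg.norm(M[:, i] - M[:, j], axis=0)

# Ratio of projected to original distance for every pair; values near 1 mean little distortion.
ratio = pair_dists(Y) / pair_dists(X)

# Near-orthogonality: largest absolute cosine between two distinct rows of R.
Rn = R / np.linalg.norm(R, axis=1, keepdims=True)
max_abs_cos = np.abs(Rn @ Rn.T - np.eye(k)).max()
```

On this toy data the ratios stay close to 1 and the largest cosine between two random directions stays small, which is why skipping the orthogonalization step costs little.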
Combining LSI and Random Projection
Motivation
- LSI
  - Captures the underlying semantics
  - Highly accurate
  - Can improve retrieval performance
  - Computationally expensive
    - O(cmn), where m is the number of terms, c is the average number of terms per document, and n is the number of documents
- Random Projection
  - Efficient in terms of computational time
  - Does not preserve as much information as LSI

Combining LSI and Random Projection
Algorithm
- Proposed in "Latent Semantic Indexing: A Probabilistic Analysis"; Papadimitriou, C.H., Raghavan, P., Tamaki, H., and Vempala, S.; Journal of Computer and System Sciences; 2000
- Idea
  - Improve Random Projection accuracy
  - Improve LSI computational time
- First, the data is pre-processed to a lower dimension k1 using Random Projection
- LSI is then applied to the reduced data to bring it down to the desired dimension k2
- Complexity is O(ml(l + c)), where l is the intermediate dimension (k1)
  - RP on the original data: O(mcl)
  - LSI on the reduced lower-dimensional data: O(ml²)

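The two steps can be sketched as follows. This is an illustration under assumptions, not the implementation used in the experiments: the function name `rp_lsi`, the Gaussian projection matrix, and the low-rank toy data are all made up for the example.

```python
import numpy as np

def rp_lsi(A, k1, k2, rng):
    """Reduce the columns of A (terms x documents) to k2 dimensions: RP first, then SVD."""
    m, n = A.shape
    R = rng.standard_normal((k1, m)) / np.sqrt(k1)     # step 1: random projection to k1 dims
    B = R @ A                                          # k1 x n intermediate matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)   # step 2: LSI on the smaller matrix
    return U[:, :k2].T @ B                             # k2 x n document coordinates

def pair_dists(M):
    """Euclidean distances between all distinct pairs of columns of M."""
    i, j = np.triu_indices(M.shape[1], 1)
    return np.linalg.norm(M[:, i] - M[:, j], axis=0)

rng = np.random.default_rng(0)
# Toy data with 5 underlying "topics" plus a little noise: 400 terms, 60 documents.
A = rng.standard_normal((400, 5)) @ rng.standard_normal((5, 60))
A += 0.01 * rng.standard_normal(A.shape)

D = rp_lsi(A, k1=100, k2=10, rng=rng)
# How well pairwise distances survive the two-step reduction.
corr = np.corrcoef(pair_dists(A), pair_dists(D))[0, 1]
```

The SVD here runs on a k1 x n matrix instead of the full m x n one, which is where the intended saving comes from.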
Experiments – Similarity
Dataset and Pre-processing
- Two subsets of the Reuters text categorization collection
- Common and rare words are removed
- Porter stemming
- Term-document matrix representation
  - Normalized to unit length
- Sets
  - Larger subset
    - 10377 documents
    - 12113 terms
    - Term-document matrix density is 0.4%
  - Smaller subset
    - 1831 documents
    - 5414 terms
    - Term-document matrix density is 0.8%

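A minimal sketch of this kind of pre-processing on four made-up sentences (not Reuters data). The document-frequency thresholds `min_df` and `max_df` are hypothetical stand-ins for "rare" and "common", and Porter stemming is left out to keep the example short.

```python
from collections import Counter
import numpy as np

docs = [
    "oil prices rise as opec cuts output",
    "opec output cuts lift oil prices",
    "wheat and corn exports rise",
    "corn exports fall as wheat prices rise",
]  # toy documents
tokens = [d.split() for d in docs]

# Document frequency: in how many documents each term occurs.
df = Counter(t for doc in tokens for t in set(doc))
min_df, max_df = 2, 2   # hypothetical cut-offs: drop terms that are too rare or too common
vocab = sorted(t for t, c in df.items() if min_df <= c <= max_df)
index = {t: i for i, t in enumerate(vocab)}

# Term-document matrix of raw counts: rows are terms, columns are documents.
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokens):
    for t in doc:
        if t in index:
            A[index[t], j] += 1

# Normalize each document vector to unit Euclidean length (guarding against empty columns).
norms = np.linalg.norm(A, axis=0)
A = A / np.where(norms == 0, 1, norms)

# Density: fraction of non-zero entries.
density = np.count_nonzero(A) / A.size
```

Dividing each column by the document length to get relative frequencies would not change the result, because the unit-length normalization removes any per-document scaling.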
Experiments – Similarity
Layout
- Three techniques for dimensionality reduction are compared
  - Latent Semantic Indexing (LSI)
  - Random Projection (RP)
  - Combination of Random Projection and LSI (RP_LSI)
- The dimensionality of the original data is reduced to k dimensions
  - k = 50, 100, 200, 300, 400, 500, 600

Experiments – Similarity
Metrics
- Euclidean distance
- Cosine of the angle between documents
- Determining the error
  - Randomly select 100 document pairs and calculate their distances before and after dimensionality reduction
  - Compute the correlation between the distance vectors before (x) and after (y) dimensionality reduction
  - Error is defined as [equation missing in source]

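The sampling and correlation steps can be sketched as below. The toy data and the Gaussian random projection used as the reducer are assumptions for illustration; any of the three techniques could be substituted. The sketch stops at the correlation, since the error formula itself is not given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 300, 100                       # terms, documents, reduced dimension (made up)
X = rng.standard_normal((m, 5)) @ rng.standard_normal((5, n))
X /= np.linalg.norm(X, axis=0)                # unit-length document vectors
R = rng.standard_normal((k, m)) / np.sqrt(k)
Y = R @ X                                     # reduced data

# 100 random pairs of distinct documents.
pairs = np.array([rng.choice(n, size=2, replace=False) for _ in range(100)])
i, j = pairs[:, 0], pairs[:, 1]

def euclid(M):
    """Euclidean distance for each sampled pair of columns."""
    return np.linalg.norm(M[:, i] - M[:, j], axis=0)

def cosine(M):
    """Cosine of the angle for each sampled pair of columns."""
    dots = (M[:, i] * M[:, j]).sum(axis=0)
    return dots / (np.linalg.norm(M[:, i], axis=0) * np.linalg.norm(M[:, j], axis=0))

# x: values before reduction, y: values after reduction.
corr_euclid = np.corrcoef(euclid(X), euclid(Y))[0, 1]
corr_cosine = np.corrcoef(cosine(X), cosine(Y))[0, 1]
```

A correlation close to 1 means the reduced representation ranks document pairs almost exactly as the original one does.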
Experiments – Similarity
Distance before and after dimensionality reduction
- The best technique in terms of error is LSI, as expected
- RP_LSI improves on the accuracy of RP for both Euclidean distance and dot product
* RP_LSI: k1 = 600

Experiments – Similarity
RP_LSI – k1 and k2 parameters
- The amount of the second reduction (the final dimension) matters more for achieving a small error than the amount of the first reduction
- This suggests that LSI plays a more important role in preserving similarity than RP

Experiments – Similarity
Running Time
- RP_LSI performs slightly worse than LSI on the larger dataset (more sparse)
- RP_LSI achieves a significant improvement over LSI on the smaller dataset (less sparse)
* RP_LSI: k1 = 600

Experiments – Clustering
Layout
- Clustering is applied to the data before and after dimensionality reduction
- Experiments are performed on the smaller dataset
- The clustering algorithm chosen is classic k-Means
  - Effective
  - Low computational cost
  - Document vectors are normalized to unit length before clustering
  - Centroids are normalized to unit length after clustering

Experiments – Clustering
k-Means
- The k-Means objective is to minimize the sum of intra-cluster errors
- The quality of dimensionality reduction is evaluated using this criterion
  - Since the dimensionality of the data is reduced, the criterion has to be computed in the original space to make the comparison possible
- The number of clusters is set to 5
  - This is roughly the number of main topics in the dataset
- Initialization is random
  - k-Means is repeated 20 times for each experiment and the average is taken

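The evaluation protocol can be sketched as follows: cluster in the reduced space, then score the resulting assignment in the original space. This is a simplified illustration with plain Euclidean k-Means (Lloyd's algorithm) on made-up data; it does not normalize the centroids, and the helper names are hypothetical.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=30):
    """Plain Lloyd's algorithm on the columns of X; returns one label per column."""
    n = X.shape[1]
    centroids = X[:, rng.choice(n, size=k, replace=False)].copy()   # random initialization
    for _ in range(n_iter):
        # Squared Euclidean distance from every centroid to every point: shape (k, n).
        d2 = ((X[:, None, :] - centroids[:, :, None]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=0)
        for c in range(k):
            members = X[:, labels == c]
            if members.shape[1] > 0:            # an empty cluster keeps its old centroid
                centroids[:, c] = members.mean(axis=1)
    return labels

def intra_cluster_error(X, labels, k):
    """Sum of squared distances to the cluster means, computed on the columns of X."""
    total = 0.0
    for c in range(k):
        members = X[:, labels == c]
        if members.shape[1] > 0:
            total += ((members - members.mean(axis=1, keepdims=True)) ** 2).sum()
    return total

rng = np.random.default_rng(0)
m, n, k, k_red = 200, 150, 5, 20
# Made-up data: 5 topic directions plus noise, columns normalized to unit length.
topics = rng.standard_normal((m, k))
X = topics[:, rng.integers(0, k, size=n)] + 0.5 * rng.standard_normal((m, n))
X /= np.linalg.norm(X, axis=0)
Y = (rng.standard_normal((k_red, m)) / np.sqrt(k_red)) @ X   # reduced data (RP as an example)
Y /= np.linalg.norm(Y, axis=0)

runs = 20   # repeat with different random initializations and average
err_original = np.mean([intra_cluster_error(X, kmeans_labels(X, k, rng), k) for _ in range(runs)])
err_reduced = np.mean([intra_cluster_error(X, kmeans_labels(Y, k, rng), k) for _ in range(runs)])
# Error of putting everything in one cluster: an upper bound for any partition.
total_scatter = intra_cluster_error(X, np.zeros(n, dtype=int), 1)
```

Both averages are measured on `X`, so they are directly comparable even though the second clustering never saw the original space.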
Experiments – Clustering
Results
- LSI and RP_LSI show results similar to the original data, even for smaller dimensions
- RP shows significantly worse performance for smaller dimensions and more similar performance for larger dimensions
- LSI shows slightly better results than RP_LSI
- Clustering results using Euclidean distance are similar

Conclusion
- LSI and Random Projection were compared
- The combination of Random Projection and LSI was analyzed
  - The sparseness of the data seems to play a central role in the effectiveness of this technique
  - The technique appears to be more effective the less sparse the original data is
    - SVD complexity grows linearly with the density of the data
    - Random Projection makes the data completely dense
    - The gain from first reducing the dimensionality competes with the extra cost added to the SVD computation by making the data completely dense

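The densification effect is easy to reproduce on synthetic data. The sketch below uses a made-up matrix with about 0.5% non-zero entries and a Gaussian random projection; the sizes are assumptions and are not the Reuters subsets.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k1 = 2000, 300, 200          # terms, documents, projected dimension (made up)

# Sparse toy matrix: roughly 0.5% of the entries are non-zero.
A = np.zeros((m, n))
mask = rng.random((m, n)) < 0.005
A[mask] = rng.random(mask.sum())

R = rng.standard_normal((k1, m)) / np.sqrt(k1)
B = R @ A                          # projected data

density_before = np.count_nonzero(A) / A.size
density_after = np.count_nonzero(B) / B.size
nnz_before = np.count_nonzero(A)   # entries a sparse SVD would have to touch
nnz_after = np.count_nonzero(B)
```

Although `B` has ten times fewer rows than `A`, it holds far more non-zero entries, so an SVD whose cost tracks the number of non-zeros can end up doing more work after the projection than before it.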
Conclusion
- Additional experiments are necessary to confirm that it is indeed the sparseness of the data that causes the running time to differ from what was previously expected
- Other dimensionality reduction algorithms that preserve the sparseness of the data might be useful in improving the running time of LSI

Questions