# DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING

Nov 25, 2013

Jessica Lin and Dimitrios Gunopulos

Ângelo Cardoso

IST/UTL

December 2009

Outline

1. Introduction
   1. Latent Semantic Indexing (LSI)
   2. Random Projection (RP)
2. Combining LSI and Random Projection
3. Experiments
   1. Dataset and pre-processing
   2. Document Similarity
   3. Document Clustering

Introduction

Latent Semantic Indexing

Vector-space model

Term-to-document matrix where each entry is the relative frequency of a term in the document

Find a subspace with k dimensions onto which to project the original term-to-document matrix

SVD is the optimal solution in the mean squared error sense

Speed up queries

Find the intrinsic dimensionality of data
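
The LSI steps on this slide can be sketched with NumPy as a truncated SVD of a small term-to-document matrix. This is only an illustration: the toy counts, the matrix sizes and the target dimension `k = 2` are assumptions made for the example and do not come from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-to-document matrix: 8 terms x 6 documents of made-up raw counts.
counts = rng.integers(0, 4, size=(8, 6)).astype(float)
counts[:, counts.sum(axis=0) == 0] = 1.0          # guard against empty documents
A = counts / counts.sum(axis=0, keepdims=True)    # relative term frequencies

k = 2  # target dimensionality (assumed for illustration)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: the best rank-k fit in the squared-error sense.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents expressed in the k-dimensional latent space (k x n_docs).
docs_k = np.diag(s[:k]) @ Vt[:k, :]

# The squared Frobenius error equals the sum of the discarded squared
# singular values.
err = np.linalg.norm(A - A_k, "fro") ** 2
```

Queries become cheaper because similarities are computed on the k rows of `docs_k` rather than on all term rows of `A`.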

Introduction

Random Projection

What if we randomly construct the subspace to project?

Johnson-Lindenstrauss lemma

If points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved

Making the subspace orthogonal is computationally expensive

However, we can rely on a result by Hecht-Nielsen:

In a high-dimensional space, there exists a much larger number of almost orthogonal than orthogonal directions
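
A minimal sketch of both claims, assuming a Gaussian random matrix as the projection (one common choice; the slides do not say which distribution was used). The sizes `n_points`, `d` and `k` are hypothetical. The code checks how much pairwise distances change after projection, and how close to orthogonal the random directions are without any explicit orthogonalization.

```python
import numpy as np

rng = np.random.default_rng(42)

n_points, d, k = 50, 2000, 400          # hypothetical sizes
X = rng.random((n_points, d))           # rows are points in the original space

# Random directions: i.i.d. Gaussian entries, scaled so that squared lengths
# are preserved in expectation. No orthogonalization is performed.
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R                               # projected points, n_points x k


def pairwise_dists(M):
    """Euclidean distances between all distinct pairs of rows of M."""
    sq = np.sum(M ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (M @ M.T)
    iu = np.triu_indices(len(M), k=1)
    return np.sqrt(np.maximum(d2[iu], 0.0))


# Ratio of projected to original distance for every pair; values near 1
# mean the distance was approximately preserved.
ratio = pairwise_dists(Y) / pairwise_dists(X)

# Near-orthogonality: cosines between distinct random directions are small.
cols = R / np.linalg.norm(R, axis=0)
gram = cols.T @ cols
max_off_diag = np.max(np.abs(gram - np.eye(k)))
```

Inspecting `ratio.min()`, `ratio.max()` and `max_off_diag` for different values of `k` and `d` shows how the distortion shrinks as the dimensions grow.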

Combining LSI and Random Projection

Motivation

LSI

Captures the underlying semantics

Highly accurate

Can improve retrieval performance

Computationally expensive

O(cmn), where m is the number of terms, c is the average number of terms per document and n is the number of documents

Random Projection

Efficient in terms of computational time

Does not preserve as much information as LSI

Combining LSI and Random Projection

Algorithm

Proposed in

Latent Semantic Indexing: A Probabilistic Analysis; Papadimitriou, C. H. and Raghavan, P. and Tamaki, H. and Vempala, S.; Journal of Computer and System Sciences; 2000

Idea

Improve Random Projection accuracy

Improve LSI computational time

First, the data is pre-processed to a lower dimension k1 using Random Projection

LSI is applied to the reduced lower-dimensional data to further reduce it to the desired dimension k2

Complexity is O(ml(l + c)), where l is the dimension after Random Projection (k1 above)

RP on the original data: O(mcl)

LSI on the reduced lower-dimensional data: O(ml²)
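
The two-step procedure can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the projection is Gaussian, the matrix is a dense NumPy array with mostly zero entries, and the sizes and the values of `k1` and `k2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

m_terms, n_docs = 2000, 300      # hypothetical sizes
k1, k2 = 100, 20                 # intermediate and final dimensions (assumed)

# Toy term-to-document matrix with roughly 1% non-zero entries.
mask = rng.random((m_terms, n_docs)) < 0.01
A = rng.random((m_terms, n_docs)) * mask

# Step 1: random projection of the term axis from m_terms down to k1.
R = rng.standard_normal((k1, m_terms)) / np.sqrt(k1)
B = R @ A                        # k1 x n_docs

# Step 2: LSI (truncated SVD) on the much smaller projected matrix.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
docs_k2 = np.diag(s[:k2]) @ Vt[:k2, :]   # k2 x n_docs
```

The SVD now runs on a `k1 x n_docs` matrix instead of an `m_terms x n_docs` one, which is where the intended saving comes from.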

Experiments

Similarity

Dataset and Pre-processing

Two subsets of Reuters categorization text collection

Common and rare words are removed

Porter stemming

Term-document matrix representation

Normalized to unit length

Sets

Larger subset

10377 documents

12113 terms

Term-document matrix density is 0.4%

Smaller subset

1831 documents

5414 terms

Term-document matrix density is 0.8%

Experiments

Similarity

Layout

Three techniques for dimensionality reduction are
compared

Latent Semantic Indexing (LSI)

Random Projection (RP)

Combination of Random Projection and LSI (RP_LSI)

The dimensionality of the original data is reduced to k dimensions

k = 50, 100, 200, 300, 400, 500, 600

Experiments

Similarity

Metrics

Euclidean Distance

Cosine of the angle between documents

Determining the error

Randomly select 100 document pairs and then calculate their distances before and after dimensionality reduction

Compute the correlation between the distance vectors before (x) and after (y) dimensionality reduction
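
The pair-sampling and correlation steps above can be sketched as below. Everything apart from the 100 pairs is an assumption for illustration: the toy documents are noisy copies of five made-up topic vectors, and a Gaussian random projection stands in for whichever reduction technique is being evaluated. Only the correlation is computed here.

```python
import numpy as np

rng = np.random.default_rng(1)

n_docs, d, k, n_topics = 200, 1000, 100, 5    # hypothetical sizes

# Toy documents: noisy copies of a few topic vectors, normalized to unit length.
topics = rng.random((n_topics, d))
docs = topics[rng.integers(0, n_topics, size=n_docs)] + 0.1 * rng.random((n_docs, d))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Stand-in reduction: Gaussian random projection to k dimensions.
reduced = docs @ (rng.standard_normal((d, k)) / np.sqrt(k))

# Randomly select 100 pairs of distinct documents.
pairs = np.array([rng.choice(n_docs, size=2, replace=False) for _ in range(100)])
i, j = pairs[:, 0], pairs[:, 1]

x = np.linalg.norm(docs[i] - docs[j], axis=1)         # distances before
y = np.linalg.norm(reduced[i] - reduced[j], axis=1)   # distances after

# Pearson correlation between the two distance vectors.
corr = np.corrcoef(x, y)[0, 1]
```

The same code works for the cosine metric by replacing the two `np.linalg.norm` lines with row-wise dot products of normalized vectors.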

Error is defined as

Experiments

Similarity

Distance before and after dimensionality reduction

The best technique in terms of error is LSI, as expected

We can see that RP_LSI improves the accuracy of RP in terms of Euclidean distance and dot product

* RP_LSI: k1 = 600

Experiments

Similarity

RP_LSI: k1 and k2 parameters

The amount of the second reduction (the final
dimension) is more important to achieve a smaller
error than the amount of the first reduction

This suggests that LSI plays a more important role in
preserving similarity than RP

Experiments

Similarity

Running Time

RP_LSI performs slightly worse than LSI for the
larger dataset (more sparse)

RP_LSI achieves a significant improvement over LSI
in the smaller dataset (less sparse)

* RP_LSI: k1 = 600

Experiments

Clustering

Layout

Clustering is applied on the data before and after
dimensionality reduction.

Experiments are performed on the smaller dataset

The clustering algorithm chosen is classic k-Means

Effective

Low computational cost

Document vectors are normalized to unit length before clustering

Centroids are normalized to unit length after clustering

Experiments

Clustering

k-Means

The k-Means objective function is to minimize the sum of intra-cluster errors

The quality of dimensionality reduction is evaluated using this criterion

Since the dimensionality of the data is reduced, we have to compute this criterion in the original space to make the comparison possible

The number of clusters is set to 5

Since this is roughly the number of main topics in the dataset

Initialization is random

k-Means is repeated 20 times for each experiment and the average is taken
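
One way to read this evaluation protocol is sketched below: cluster in the reduced space, then score the resulting assignment in the original space. Several details are assumptions rather than facts from the slides: the intra-cluster error is taken to be the squared Euclidean distance to the cluster mean, plain Lloyd's k-Means is used without centroid normalization, and a Gaussian random projection on toy data stands in for the real reduction and dataset. The helper names `kmeans_labels` and `criterion_in_original_space` are hypothetical.

```python
import numpy as np


def kmeans_labels(X, n_clusters, rng, n_iter=50):
    """Plain Lloyd's k-Means with random initialization; returns labels."""
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):          # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels


def criterion_in_original_space(X_orig, labels, n_clusters):
    """Sum of squared distances to cluster means, recomputed on X_orig."""
    total = 0.0
    for c in range(n_clusters):
        members = X_orig[labels == c]
        if len(members):
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


rng = np.random.default_rng(0)
n_docs, d, k, n_clusters, n_runs = 150, 300, 20, 5, 20   # toy sizes

topics = rng.random((n_clusters, d))
X = topics[rng.integers(0, n_clusters, size=n_docs)] + 0.2 * rng.random((n_docs, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)             # unit-length documents
X_red = X @ (rng.standard_normal((d, k)) / np.sqrt(k))    # stand-in reduction

scores_orig, scores_red = [], []
for _ in range(n_runs):
    scores_orig.append(
        criterion_in_original_space(X, kmeans_labels(X, n_clusters, rng), n_clusters))
    scores_red.append(
        criterion_in_original_space(X, kmeans_labels(X_red, n_clusters, rng), n_clusters))

avg_orig, avg_red = np.mean(scores_orig), np.mean(scores_red)
```

Comparing `avg_red` with `avg_orig` indicates how much clustering quality is lost by working in the reduced space, since both numbers are measured on the same original vectors.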

Experiments

Clustering

Results

LSI and RP_LSI show results similar to the original data even for smaller dimensions

RP shows significantly worse performance for smaller dimensions and more similar performance for larger dimensions

LSI shows slightly better results than RP_LSI

Clustering results using Euclidean distance are similar

Conclusion

LSI and Random Projection were compared

The combination of Random Projection and LSI is
analyzed

The sparseness of the data seems to play a central role in the effectiveness of this technique

The technique appears to be more effective the less sparse the original data is

SVD complexity is linear in the sparseness of the data

Random Projection makes the data completely dense

The gain from first reducing the data dimensionality competes with the cost of making the data completely dense
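
The densification effect can be seen on a toy example. The sizes, the 1% density and the Gaussian projection are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

m_terms, n_docs, k1 = 1000, 200, 50      # hypothetical sizes

# Toy term-to-document matrix with roughly 1% non-zero entries.
mask = rng.random((m_terms, n_docs)) < 0.01
A = rng.random((m_terms, n_docs)) * mask

# Random projection of the term axis down to k1 dimensions.
B = (rng.standard_normal((k1, m_terms)) / np.sqrt(k1)) @ A


def density(M):
    """Fraction of entries that are non-zero."""
    return np.count_nonzero(M) / M.size


density_before = density(A)
density_after = density(B)

# Number of non-zero values an algorithm has to touch before and after.
nnz_before = np.count_nonzero(A)
nnz_after = np.count_nonzero(B)
```

In this toy setting the projected matrix has far fewer rows but more non-zero values than the original, so an SVD whose cost grows with the number of non-zeros does not necessarily get cheaper.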

Conclusion

Additional experiments are necessary to confirm that it is indeed the sparseness of the data that causes the discrepancy between the observed running time and what was previously expected

Other dimensionality reduction algorithms that
preserve the sparseness of the data might be useful
in improving the running time of LSI

Questions
