
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING

Jessica Lin and Dimitrios Gunopulos

Ângelo Cardoso

IST/UTL

December 2009


Outline

1. Introduction
   1.1. Latent Semantic Indexing (LSI)
   1.2. Random Projection (RP)
2. Combining LSI and Random Projection
3. Experiments
   3.1. Dataset and pre-processing
   3.2. Document Similarity
   3.3. Document Clustering


Introduction
Latent Semantic Indexing

- Vector-space model
- Term-to-document matrix where each entry is the relative frequency of a term in the document
- Find a k-dimensional subspace onto which to project the original term-to-document matrix
  - SVD is the optimal solution in the mean-squared-error sense
- Speeds up queries
- Addresses synonymy
- Finds the intrinsic dimensionality of the data

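As a concrete illustration (a minimal numpy sketch, not the authors' code), the LSI projection via truncated SVD, with documents as the columns of the term-to-document matrix A:

import numpy as np

def lsi_project(A, k):
    # Truncated SVD: keep only the k largest singular values/vectors.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Each document (column of A) becomes a k-dimensional vector: S_k @ V_k^T.
    return np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(5, 4)        # toy example: 5 terms x 4 documents
docs_k = lsi_project(A, k=2)    # 2 x 4: one reduced column per document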

Introduction
Random Projection

- What if we construct the projection subspace randomly?
- Johnson-Lindenstrauss lemma
  - If points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved
- Making the subspace orthogonal is computationally expensive
- However, we can rely on a result by Hecht-Nielsen
  - In a high-dimensional space, there exist many more almost-orthogonal directions than orthogonal directions
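A minimal sketch of such a projection, assuming the common choice of i.i.d. Gaussian entries for the random matrix (the slides do not specify the distribution):

import numpy as np

def random_project(A, k, seed=0):
    rng = np.random.default_rng(seed)
    # i.i.d. Gaussian entries; in high dimension the rows are almost
    # orthogonal (Hecht-Nielsen), so no explicit orthogonalization is done.
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, A.shape[0]))
    return R @ A                  # k x n reduced representation

A = np.random.rand(1000, 50)      # 1000 terms x 50 documents
A_low = random_project(A, k=100)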

Combining LSI and Random Projection
Motivation

- LSI
  - Captures the underlying semantics
  - Highly accurate
  - Can improve retrieval performance
  - Time complexity is high: O(cmn), where m is the number of terms, c the average number of terms per document, and n the number of documents
- Random Projection
  - Efficient in terms of computation time
  - Does not preserve as much information as LSI

Combining LSI and Random Projection
Algorithm

- Proposed in
  - "Latent Semantic Indexing: A Probabilistic Analysis"; Papadimitriou, C.H., Raghavan, P., Tamaki, H. and Vempala, S.; Journal of Computer and System Sciences; 2000
- Idea
  - Improve Random Projection accuracy
  - Improve LSI computation time
- First the data is pre-processed to a lower dimension k1 using Random Projection
- LSI is then applied to the reduced lower-dimensional data to further reduce it to the desired dimension k2 (a sketch of the two-stage procedure follows this list)
- Complexity is O(ml(l + c)), where l = k1
  - RP on the original data: O(mcl)
  - LSI on the reduced lower-dimensional data: O(ml²)
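A minimal sketch of the two-stage reduction, under the same assumptions as the earlier snippets (k1 and k2 follow the slide's notation; the matrix sizes mirror the smaller Reuters subset used later):

import numpy as np

def rp_lsi(A, k1, k2, seed=0):
    # Stage 1: random projection of the (m x n) matrix down to k1 dimensions.
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(k1), size=(k1, A.shape[0]))
    B = R @ A                                    # k1 x n, now dense
    # Stage 2: LSI (truncated SVD) on the much smaller matrix B.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return np.diag(s[:k2]) @ Vt[:k2, :]          # k2 x n final representation

A = np.random.rand(5414, 1831)   # terms x documents (smaller-subset sizes)
X = rp_lsi(A, k1=600, k2=100)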

Experiments - Similarity
Dataset and Pre-processing

- Two subsets of the Reuters text categorization collection
- Common and rare words are removed
- Porter stemming
- Term-document matrix representation
- Normalized to unit length (a sketch of this pipeline follows this list)
- Sets
  - Larger subset
    - 10377 documents
    - 12113 terms
    - Term-document matrix density is 0.4%
  - Smaller subset
    - 1831 documents
    - 5414 terms
    - Term-document matrix density is 0.8%
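A hedged sketch of the pre-processing, assuming NLTK's PorterStemmer and scikit-learn's CountVectorizer as stand-ins for whatever tooling was actually used; the min_df/max_df thresholds for dropping rare and common words are illustrative:

import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_tokens(doc):
    return [stemmer.stem(t) for t in doc.lower().split()]

# min_df / max_df drop rare and common words; tune for the real collection.
vec = CountVectorizer(tokenizer=stem_tokens, min_df=1, max_df=1.0)
docs = ["Oil prices rose sharply", "Grain exports fell", "Oil exports rose again"]
X = vec.fit_transform(docs).T.toarray().astype(float)  # terms x documents
X = X / X.sum(axis=0)                  # relative term frequency per document
X = X / np.linalg.norm(X, axis=0)      # columns normalized to unit length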

Experiments - Similarity
Layout

- Three techniques for dimensionality reduction are compared
  - Latent Semantic Indexing (LSI)
  - Random Projection (RP)
  - Combination of Random Projection and LSI (RP_LSI)
- The dimensionality of the original data is reduced to a lower dimension k
  - k = 50, 100, 200, 300, 400, 500, 600


Experiments - Similarity
Metrics

- Euclidean distance
- Cosine of the angle between documents
- Determining the error
  - Randomly select 100 document pairs and calculate their distances before and after dimensionality reduction
  - Compute the correlation between the distance vectors before (x) and after (y) dimensionality reduction
  - The error is defined in terms of this correlation (a sketch follows this list)
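A sketch of this measurement; the slide's exact error formula is not shown, so error = 1 - correlation(x, y) is assumed here as one plausible definition:

import numpy as np

def similarity_error(A, A_low, n_pairs=100, seed=0):
    # Distances between random document pairs before and after reduction.
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    i, j = rng.integers(0, n, n_pairs), rng.integers(0, n, n_pairs)
    x = np.linalg.norm(A[:, i] - A[:, j], axis=0)          # before reduction
    y = np.linalg.norm(A_low[:, i] - A_low[:, j], axis=0)  # after reduction
    # Assumed instantiation: perfectly preserved distances give an error of 0.
    return 1.0 - np.corrcoef(x, y)[0, 1]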

Experiments - Similarity
Distance before and after dimensionality reduction

- As expected, the best technique in terms of error is LSI
- RP_LSI improves the accuracy of RP in terms of both Euclidean distance and dot product

* RP_LSI: k1 = 600

Experiments - Similarity
RP_LSI - k1 and k2 parameters

- The amount of the second reduction (the final dimension) matters more for achieving a small error than the amount of the first reduction
- This suggests that LSI plays a more important role than RP in preserving similarity

Experiments - Similarity
Running Time

- RP_LSI performs slightly worse than LSI on the larger dataset (more sparse)
- RP_LSI achieves a significant improvement over LSI on the smaller dataset (less sparse)

* RP_LSI: k1 = 600

Experiments - Clustering
Layout

- Clustering is applied to the data before and after dimensionality reduction
- Experiments are performed on the smaller dataset
- The clustering algorithm chosen is classic k-Means
  - Effective
  - Low computational cost
- Document vectors are normalized to unit length before clustering
- Centroids are normalized to unit length after clustering

Experiments - Clustering
k-Means

- The k-Means objective function is to minimize the sum of intra-cluster errors
- The quality of the dimensionality reduction is evaluated using this criterion
  - Since the dimensionality of the data is reduced, the criterion is computed in the original space to make the comparison possible
- The number of clusters is set to 5, roughly the number of main topics in the dataset
- Initialization is random
- k-Means is repeated 20 times for each experiment and the average is taken (a sketch of this evaluation follows this list)
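A hedged sketch of the evaluation loop, assuming scikit-learn's KMeans (the slides do not name the implementation):

import numpy as np
from sklearn.cluster import KMeans

def clustering_cost(A, A_low, n_clusters=5, n_runs=20):
    # Cluster in the reduced space, then score the partition in the
    # original space so results are comparable across reductions.
    costs = []
    for seed in range(n_runs):                       # 20 random restarts, averaged
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed).fit_predict(A_low.T)
        cost = 0.0
        for c in range(n_clusters):
            docs = A[:, labels == c]                 # original-space documents
            if docs.shape[1] == 0:
                continue
            centroid = docs.mean(axis=1, keepdims=True)
            centroid /= np.linalg.norm(centroid)     # unit-length centroid
            cost += np.sum((docs - centroid) ** 2)   # intra-cluster error
        costs.append(cost)
    return float(np.mean(costs))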


Experiments - Clustering
Results

- LSI and RP_LSI show results similar to the original data, even for smaller dimensions
- RP shows significantly worse performance for smaller dimensions and more similar performance for larger dimensions
- LSI shows slightly better results than RP_LSI
- Clustering results using Euclidean distance are similar

Conclusion

- LSI and Random Projection were compared
- The combination of Random Projection and LSI was analyzed
  - The sparseness of the data seems to play a central role in the effectiveness of this technique
  - The technique appears to be more effective the less sparse the original data is
    - SVD complexity is linear in the sparseness of the data
    - Random Projection makes the data completely dense
    - The gain from first reducing the dimensionality of the data is offset by the additional cost that the now completely dense data adds to the SVD computation


Conclusion

- Additional experiments are necessary to confirm that it is indeed the sparseness of the data that causes the discrepancy between the observed running time and what was previously expected
- Other dimensionality reduction algorithms that preserve the sparseness of the data might help improve the running time of LSI

Questions