Data Mining For Hypertext:

A Tutorial Survey


Based on a paper by:


Soumen Chakrabarti


Indian Institute of Technology Bombay.


Soumen@cse.iitb.ernet.in


Lecture by:


Noga Kashti


Efrat Daum




Let's start with definitions



Hypertext - a collection of documents (or "nodes") containing cross-references or "links" which, with the aid of an interactive browser program, allow the reader to move easily from one document to another.


Data Mining - analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.



Two Ways of Getting Information From the Web:



Clicking On Hyperlinks



Searching Via Keyword Queries


Some History



Before the Web became popular, hypertext had already been used and studied in the ACM SIGIR, SIGLINK/SIGWEB, and Digital Libraries communities.

The old IR (Information Retrieval) deals with documents, whereas the Web deals with semi-structured data.


Some Numbers...

The Web exceeds 800 million HTML pages on about three million servers.

Almost a million pages are added daily.

A typical page changes in a few months.

Several hundred gigabytes change every month.


Difficulties With Accessing
Information On The Web:


Usual problems of text search (synonymy, polysemy, context sensitivity) become much more severe.

Semi-structured data.

Sheer size and flux.

No consistent standard or style.


The Old Search Process Is
Often Unsatisfactory!


Deficiency of scale.



Poor accuracy (low recall and low
precision).



Better Solutions: Data Mining
And Machine Learning



Natural language (NL) techniques.

Statistical techniques for learning structure in various forms from text, hypertext, and semi-structured data.


Issues We'll Discuss


Models


Supervised learning


Unsupervised learning


Semi
-
supervised learning


Social network analysis


Models For Text


Representations for text based on statistical analyses only (bag-of-words):

The vector space model

The binary model

The multinomial model


Models For Text (cont.)


The vector space model:

Documents -> tokens -> canonical forms.

Each canonical token is an axis in a Euclidean space.

The t-th coordinate of d is n(d,t), the number of times term t occurs in document d

t is a term

d is a document


The Vector Space Model: Normalize
The Document Length To
1




























L1 normalization:  Σ_t n(d,t) = 1

L2 normalization:  Σ_t n(d,t)² = 1

Max normalization:  n(d,t) / max_t n(d,t)

TFIDF weighting:  the t-th coordinate of d is n(d,t)·IDF(t), where IDF(t) = log(1 + N/N_t), N is the number of documents and N_t is the number of documents containing term t.
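As a concrete illustration (not part of the original slides), here is a minimal Python sketch that builds TFIDF vectors for a made-up toy corpus, assuming the IDF(t) = log(1 + N/N_t) form above and L2-normalizing each document vector.

```python
import math
from collections import Counter

docs = [
    "auto transmission interchange",
    "linear potentiometer for a racing car gearbox",
    "car transmission repair",
]

# n(d,t): raw term counts per document
counts = [Counter(d.split()) for d in docs]
N = len(docs)

# N_t: number of documents containing term t
df = Counter(t for c in counts for t in c)

# IDF(t) = log(1 + N / N_t), as on the slide
idf = {t: math.log(1 + N / nt) for t, nt in df.items()}

def tfidf_vector(c):
    """TFIDF-weighted, L2-normalized document vector (sparse dict)."""
    v = {t: n * idf[t] for t, n in c.items()}
    norm = math.sqrt(sum(x * x for x in v.values()))
    return {t: x / norm for t, x in v.items()}

vectors = [tfidf_vector(c) for c in counts]
print(vectors[0])
```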



More Models For Text


The binary model: a document is a set of terms, which is a subset of the lexicon. Word counts are not significant.

The multinomial model: a die with |T| faces; every face t has a probability θ_t of showing up when tossed. Having decided the total word count, the author tosses the die and writes down the term that shows up, once per toss.



Models For Hypertext


Hypertext: text with hyperlinks.


Varying levels of detail.



Example: a directed graph (D, L)

D - the set of nodes/documents/pages

L - the set of links


Models For Semi-structured Data

A point of convergence for the Web (documents) and database (data) communities.



Models For Semi-structured Data (cont.)

For example, topic directories with tree-structured hierarchies.

Examples: Open Directory Project, Yahoo!

Another representation: XML.


Supervised Learning
(classification)


Algorithm initialization: training data, where each item is marked with a label or class from a discrete finite set.

Input: unlabeled data.

Algorithm role: guess the labels of the unlabeled data.


Supervised Learning (cont.)



Example: topic directories

Advantages: help structure, restrict keyword search, can enable powerful searches.


Probabilistic Models For Text
Learning


Let c_1, ..., c_m be m classes or topics, each with a set of training documents D_c.

Prior probability of a class:  Pr(c) = |D_c| / Σ_{c'} |D_{c'}|

T: the universe of terms in all the training documents.

Probabilistic Models For Text
Learning (cont.)


Naive Bayes classification:

Assumption: for each class c, there is a binary text generator model.

Model parameters: φ_{c,t} - the probability that a document in class c will mention term t at least once.


Naive Bayes classification
(cont.)







With the binary model:

Pr(d|c) = Π_{t ∈ d} φ_{c,t} · Π_{t ∈ T, t ∉ d} (1 − φ_{c,t})

Problems:

Short documents are discouraged (every lexicon term absent from d contributes a (1 − φ_{c,t}) factor).

The Pr(d|c) estimate is likely to be greatly distorted.


Naive Bayes classification
(cont.)


With the multinomial model:

Pr(d|c) = ( n(d) choose {n(d,t)} ) · Π_{t ∈ d} θ_{c,t}^{n(d,t)}

where n(d) = Σ_t n(d,t) is the total word count of document d and ( n(d) choose {n(d,t)} ) is the multinomial coefficient.


Naive Bayes classification
(cont.)


Problems:

Again, short documents are discouraged.

Inter-term correlations are ignored.

Multiplicative Φ_{c,t} "surprise" factor.

Conclusion:

Both models are effective.
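To make the generative models concrete, here is a minimal sketch (an illustration added to these notes, not the authors' code) of multinomial naive Bayes classification with add-one smoothing on a toy labeled corpus.

```python
import math
from collections import Counter, defaultdict

# Toy labeled training data: (class, text)
train = [
    ("cars", "auto transmission gearbox"),
    ("cars", "racing car gearbox"),
    ("db",   "relational database index"),
    ("db",   "database query index"),
]

docs_per_class = Counter(c for c, _ in train)
term_counts = defaultdict(Counter)          # term counts summed over D_c
for c, text in train:
    term_counts[c].update(text.split())

vocab = {t for counts in term_counts.values() for t in counts}

def log_prob(text, c):
    """log Pr(c) + sum_t n(d,t) * log theta_{c,t}, with add-one smoothing."""
    logp = math.log(docs_per_class[c] / len(train))          # prior Pr(c)
    total = sum(term_counts[c].values())
    for t, n in Counter(text.split()).items():
        theta = (term_counts[c][t] + 1) / (total + len(vocab))
        logp += n * math.log(theta)
    return logp

def classify(text):
    return max(docs_per_class, key=lambda c: log_prob(text, c))

print(classify("car transmission"))   # expected: "cars"
```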


More Probabilistic Models For
Text Learning


Parameter smoothing and feature
selection.


Limited dependence modeling.


The maximum entropy technique.


Support vector machines (SVMs).


Hierarchies over class labels.


Learning Relations


Classification extension: a combination of statistical and relational learning.

Improves accuracy.

The ability to invent predicates.

Can represent the hyperlink graph structure and the word statistics of neighboring documents.

Learned rules will not depend on specific keywords.



Unsupervised learning



Input: hypertext documents (no labels).

Output: a hierarchy among the documents.

What is a good clustering?



Basic clustering techniques


Techniques for clustering:

k-means

hierarchical agglomerative clustering



Basic clustering techniques


Documents are mapped to an unweighted vector space or to a TFIDF vector space.

Similarity between two documents:

cos(θ), where θ = the angle between their corresponding vectors,

or, equivalently, the distance between the (length-normalized) vectors.
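A minimal sketch of the cosine similarity used here, operating on sparse {term: weight} dictionaries such as the TFIDF vectors from the earlier example (the dictionary representation is an assumption for illustration):

```python
import math

def cosine(u, v):
    """cos(theta) for two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine({"car": 1.0, "gearbox": 2.0}, {"car": 0.5, "auto": 1.0}))
```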



k-means clustering

The k-means algorithm:

Input:
d_1, ..., d_n - a set of n documents
k - the number of clusters desired (k ≤ n)

Output:
C_1, ..., C_k - k clusters containing the n classified documents



k-means clustering

The k-means algorithm (cont.):

Initialize: guess k initial means m_1, ..., m_k

Until there are no changes in any mean:
  For each document d: d is in C_i if ||d − m_i|| is the minimum of all the k distances.
  For 1 ≤ i ≤ k: replace m_i with the mean of all the documents assigned to C_i.
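A minimal numpy sketch of this loop (an illustration added to the notes, not the slides' code), using Euclidean distance and a fixed iteration cap as a safeguard:

```python
import numpy as np

def kmeans(docs, k, iters=100, seed=0):
    """docs: (n, dim) array of document vectors. Returns (assignments, means)."""
    rng = np.random.default_rng(seed)
    means = docs[rng.choice(len(docs), size=k, replace=False)]  # initial guess
    assign = np.zeros(len(docs), dtype=int)
    for it in range(iters):
        # assignment step: nearest mean by Euclidean distance
        dists = np.linalg.norm(docs[:, None, :] - means[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assign):
            break                      # no document changed cluster
        assign = new_assign
        # update step: mean of each cluster (keep old mean if cluster is empty)
        for i in range(k):
            members = docs[assign == i]
            if len(members):
                means[i] = members.mean(axis=0)
    return assign, means

docs = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
print(kmeans(docs, k=2)[0])
```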


k-means clustering

The k-means algorithm - example:

(Figure: k-means iterations on a set of points, shown for K=2 with means m_1 and m_2, and for K=3 with means m_1, m_2 and m_3; at each iteration the means move toward the centers of their clusters until they stop changing.)


k-means clustering (cont.)

Problem:

high dimensionality

e.g., if each of 30000 dimensions has only two possible values, the vector space contains 2^30000 points.

Solution:

Project out some of the dimensions.


Agglomerative clustering


Documents are merged into super-documents or groups until only one group is left.

Some definitions:

s(d_1, d_2) = the similarity between documents d_1 and d_2

The self-similarity of group A:

s(A) = 1 / (|A|·(|A| − 1)) · Σ_{d_1, d_2 ∈ A, d_1 ≠ d_2} s(d_1, d_2)

Agglomerative clustering


The agglomerative clustering algorithm:

Input:
d_1, ..., d_n - a set of n documents

Output:
G - the final group, with a nested hierarchy



Agglomerative clustering
(cont.)


The agglomerative clustering algorithm:

Initialize: G := {G_1, ..., G_n}, where G_i = {d_i}

While |G| > 1:
  Find A and B in G such that s(A ∪ B) is maximized
  G := (G \ {A, B}) ∪ {A ∪ B}

Time: O(n²)
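A minimal sketch of this merge loop (an assumption for illustration; it uses the self-similarity s(A) defined two slides back, computed over sparse TFIDF dictionaries, and makes no attempt at the efficiency a real implementation would need):

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def self_similarity(group, vec):
    """s(A): average pairwise similarity over distinct pairs in A."""
    if len(group) < 2:
        return 1.0
    pairs = list(combinations(group, 2))
    return sum(cosine(vec[a], vec[b]) for a, b in pairs) / len(pairs)

def agglomerate(vec):
    """vec: {doc_id: sparse tfidf dict}. Returns the nested merge tree."""
    groups = {frozenset([d]): d for d in vec}        # group -> nested structure
    while len(groups) > 1:
        a, b = max(combinations(groups, 2),
                   key=lambda ab: self_similarity(ab[0] | ab[1], vec))
        merged = a | b
        groups[merged] = (groups.pop(a), groups.pop(b))
    return next(iter(groups.values()))

vec = {"a": {"car": 1.0}, "b": {"car": 1.0, "auto": 0.5}, "c": {"db": 1.0}}
print(agglomerate(vec))
```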




Agglomerative clustering
(cont.)


The agglomerative clustering algorithm - example:

(Figure: seven documents a-g placed against a similarity scale from 0 to 0.8; the dendrogram is built by merging the most similar groups first.)

Step 1: s(b,c) = 0.7, merge b and c
Step 2: s(f,g) = 0.6, merge f and g
Step 3: s(b-c, d) = 0.5, merge d into b-c
Step 4: s(e, f-g) = 0.4, merge e into f-g
Step 5: s(a, b-c-d) = 0.3, merge a into b-c-d
Step 6: s(a-b-c-d, e-f-g) = 0.1, merge the two remaining groups into one


Techniques from linear
algebra



Documents and terms are represented
by vectors
in Euclidean space
.



Applications of linear algebra to text
analysis:


Latent semantic indexing (LSI)


Random projections


Co-occurring terms

Example terms: auto, car, transmission, gearbox

Example documents:

"Auto Transmission interchange W/404 to 504??"

"Linear potentiometer for a racing car gearbox"




Latent semantic indexing (LSI)


Vector space model of documents:

Let m = |T|, the lexicon size.

Let n = the number of documents.

Define A, an m×n term-by-document matrix, where a_ij = the number of occurrences of term i in document j.



Latent semantic indexing (LSI)








(Figure: the term-by-document matrix with rows t_1, t_2, ..., t_m and columns d_1, d_2, ..., d_n; related terms such as "car" and "auto" should end up with similar rows.)

How to reduce it?


Singular Value Decomposition (SVD)

Let A ∈ R^{m×n}, m ≥ n, be a matrix.

The singular value decomposition of A is the factorization A = U·D·V^T, where:

U and V have orthonormal columns: U^T·U = V^T·V = I_n

D = diag(σ_1, ..., σ_n), with σ_i ≥ 0 for 1 ≤ i ≤ n

Then:

U = [u_1, ..., u_n]; u_1, ..., u_n are the left singular vectors

V = [v_1, ..., v_n]; v_1, ..., v_n are the right singular vectors

σ_1, ..., σ_n are the singular values of A.


Singular Value Decomposition (SVD)

A·A^T = (U·D·V^T)(V·D^T·U^T) = U·D·I·D·U^T = U·D²·U^T

⇒ A·A^T·U = U·D² = [σ_1²·u_1, ..., σ_n²·u_n]

⇒ for 1 ≤ i ≤ n: A·A^T·u_i = σ_i²·u_i

⇒ the columns of U are the eigenvectors of A·A^T.

Similarly, A^T·A = V·D²·V^T

⇒ the columns of V are the eigenvectors of A^T·A.

The eigenvalues of A·A^T (or of A^T·A) are σ_1², ..., σ_n².


Singular Value Decomposition (SVD)

Since A = U·D·V^T = σ_1·u_1·v_1^T + σ_2·u_2·v_2^T + ... + σ_n·u_n·v_n^T,

let A_k = σ_1·u_1·v_1^T + ... + σ_k·u_k·v_k^T (k ≤ n) be the k-truncated SVD.

rank(A_k) = k

||A − A_k||_2 ≤ ||A − M_k||_2 for any matrix M_k of rank k.




Singular Value Decomposition (SVD)

Note: A, A_k ∈ R^{m×n}

A (m×n) = U (m×n) · D (n×n) · V^T (n×n)

Reduction:

A_k (m×n) = U_k (m×k) · D_k (k×k) · V_k^T (k×n)




LSI with SVD


Define q ∈ R^m, a query vector:

q_i ≠ 0 if term i is part of the query.

Then A^T·q ∈ R^n is the answer vector:

(A^T·q)_j ≠ 0 if document j contains one or more terms of the query.

How to do it better?


LSI with SVD


Use A_k instead of A:

calculate A_k^T·q.

Now a query on "car" will also return documents containing the word "auto".
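A minimal numpy sketch of LSI retrieval via the k-truncated SVD (the tiny term-by-document matrix is invented for illustration): build A_k from the top-k singular triples and score documents with A_k^T·q.

```python
import numpy as np

# Toy term-by-document matrix A (rows: car, auto, gearbox, database)
terms = ["car", "auto", "gearbox", "database"]
A = np.array([
    [2.0, 0.0, 0.0],   # car
    [0.0, 2.0, 0.0],   # auto
    [1.0, 1.0, 0.0],   # gearbox
    [0.0, 0.0, 3.0],   # database
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # k-truncated SVD

q = np.zeros(len(terms))
q[terms.index("car")] = 1.0                   # query: "car"

print("plain scores:", A.T @ q)               # only the "car" document matches
print("LSI scores:  ", A_k.T @ q)             # the "auto" document also scores > 0
```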


Random projections


Theorem:

Let:

v ∈ R^n - a unit vector

H - a randomly oriented ℓ-dimensional subspace through the origin

X - the random variable giving the square of the length of the projection of v on H

Then:

E[X] = ℓ/n

and if ℓ is chosen between log n and O(√n), then for 0 < ε < 1/2:

Pr( |X − ℓ/n| > ε·ℓ/n ) < 2·√ℓ · e^{−(ℓ−1)·ε²/4}

Random projections


A projection of a set of points onto a randomly oriented subspace gives

small distortion in the inter-point distances.

The technique:

reduce the dimensionality of the points,

speeding up the distance computations.
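A minimal numpy sketch of the technique (an illustration; it uses a Gaussian random projection matrix rather than an explicit random subspace, which preserves distances in the same spirit up to scaling):

```python
import numpy as np

rng = np.random.default_rng(0)

n, dim, ell = 50, 10000, 400
points = rng.normal(size=(n, dim))

# Random projection matrix; scaling by 1/sqrt(ell) roughly preserves distances.
R = rng.normal(size=(dim, ell)) / np.sqrt(ell)
projected = points @ R                      # n x ell, much cheaper to work with

# Compare one pairwise distance before and after the projection.
d_before = np.linalg.norm(points[0] - points[1])
d_after = np.linalg.norm(projected[0] - projected[1])
print(d_before, d_after)                    # close, with small relative distortion
```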


Semi-supervised learning

Real-life applications:

labeled documents

unlabeled documents

Between supervised and unsupervised learning.


Learning from labeled and
unlabeled documents


Expectation Maximization (EM) Algorithm:


Initial: train a naive Bayes classifier using only
labeled data.


Repeat EM iterations until near-convergence:

E step: assign class probabilities Pr(c|d) to all documents not labeled, using the current θ_{c,t} estimates.

M step: re-estimate θ_{c,t} from all documents, weighted by Pr(c|d):

θ_{c,t} = (1 + Σ_d n(d,t)·Pr(c|d)) / (|T| + Σ_d Σ_{t'} n(d,t')·Pr(c|d))

The error is reduced by a third in the best cases.
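A minimal sketch of this EM loop on toy labeled and unlabeled documents (an illustration added to the notes; it reuses the multinomial naive Bayes idea from earlier, with a uniform class prior for simplicity):

```python
import math
from collections import Counter

labeled = [("cars", "auto gearbox"), ("db", "database index")]
unlabeled = ["car gearbox racing", "relational database query"]
classes = sorted({c for c, _ in labeled})
vocab = sorted({t for _, d in labeled for t in d.split()} |
               {t for d in unlabeled for t in d.split()})

def estimate(weighted_docs):
    """M step: theta_{c,t} with add-one smoothing from (weights, text) pairs."""
    theta = {}
    for c in classes:
        counts = Counter()
        for w, text in weighted_docs:
            for t, n in Counter(text.split()).items():
                counts[t] += w[c] * n
        total = sum(counts.values())
        theta[c] = {t: (1 + counts[t]) / (len(vocab) + total) for t in vocab}
    return theta

def posteriors(text, theta):
    """E step for one document: Pr(c|d), using a uniform class prior."""
    logs = {c: sum(n * math.log(theta[c][t])
                   for t, n in Counter(text.split()).items())
            for c in classes}
    m = max(logs.values())
    exp = {c: math.exp(v - m) for c, v in logs.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

# Initialize from labeled data only, then iterate E and M steps.
hard = [({c2: 1.0 if c2 == c else 0.0 for c2 in classes}, d) for c, d in labeled]
theta = estimate(hard)
for _ in range(10):
    soft = [(posteriors(d, theta), d) for d in unlabeled]   # E step
    theta = estimate(hard + soft)                           # M step
print({d: posteriors(d, theta) for d in unlabeled})
```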

Relaxation labeling


The hypertext model:


documents are nodes in a hypertext graph.


There are other sources of information
induced by the links.




Relaxation labeling


c = class, t = term, N(d) = neighbors of d

In supervised learning: Pr(t|c)

In hypertext, using the neighbors' terms: Pr( t(d), t(N(d)) | c )

A better model, using the neighbors' classes: Pr( t(d), c(N(d)) | c )

Circularity!



Relaxation labeling


Resolve the circularity:

Initialize: assign Pr^(0)(c|d) to each document d ∈ N(d_1), where d_1 is a test document (use a text-only classifier).

Iterations:

Pr^(r+1)( c(d) | d, N(d) ) = Σ_{c(N(d))} Pr( c(d) | d, c(N(d)) ) · Pr^(r)( c(N(d)) )

Social network analysis


Social networks:


between academics by co-authoring, advising.


between movie personnel by directing and acting.


between people by making phone calls


between web pages by hyperlinking to other web
pages.



Applications


Google


HITS







PageRank (Google):

PageRank(v) = p/N + (1 − p) · Σ_{u: u → v} PageRank(u) / OutDegree(u)

where:

"u → v" means "u links to v"

N = total number of nodes in the Web graph

(p is the probability of jumping to a random page)

Simulates a random walk on the Web graph.

Uses a score of popularity.

The popularity score is precomputed, independent of the query.
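A minimal sketch of iterating the PageRank equation above on a tiny hand-made link graph (the value p = 0.15 is a conventional choice, not from the slides):

```python
def pagerank(links, p=0.15, iters=50):
    """links: {node: [nodes it links to]}. Returns {node: PageRank score}."""
    nodes = list(links)
    N = len(nodes)
    rank = {u: 1.0 / N for u in nodes}
    for _ in range(iters):
        new = {v: p / N for v in nodes}
        for u, outs in links.items():
            for v in outs:
                new[v] += (1 - p) * rank[u] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))
```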

Hyperlink induced topic search
(HITS)


Depends on a search engine.

For each node u in the graph, calculate an authority score (a_u) and a hub score (h_u):

Initialize h_u = a_u = 1

Repeat until convergence:

a_u := Σ_{v: v → u} h_v   and   h_u := Σ_{v: u → v} a_v

The h and a vectors are normalized to length 1.
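A minimal sketch of the HITS iteration on the same kind of small link dictionary used in the PageRank example (an illustration added to the notes):

```python
import math

def hits(links, iters=50):
    """links: {node: [nodes it links to]}. Returns ({node: authority}, {node: hub})."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    auth = {u: 1.0 for u in nodes}
    hub = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # a_u := sum of h_v over pages v that link to u
        auth = {u: sum(hub[v] for v in nodes if u in links.get(v, [])) for u in nodes}
        # h_u := sum of a_v over pages v that u links to
        hub = {u: sum(auth[v] for v in links.get(u, [])) for u in nodes}
        # normalize both vectors to length 1
        na = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {u: x / na for u, x in auth.items()}
        hub = {u: x / nh for u, x in hub.items()}
    return auth, hub

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(hits(graph))
```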


Interesting pages include links to other interesting pages.

The goal:

many relevant pages

few irrelevant pages

fast




Conclusion


Supervised learning


Probabilistic models


Unsupervised learning


Techniques for clustering:

k-means (top-down)

agglomerative (bottom-up)

Techniques for reducing dimensionality:

LSI with SVD

Random projections

Semi-supervised learning

The EM algorithm

Relaxation labeling


References

http://www.engr.sjsu.edu/~knapp/HCIRDFSC/C/k_means.htm

http://ei.cs.vt.edu/~cs5604/cs5604cnCL/CL-illus.html

http://www.cs.utexas.edu/users/inderjit/Datamining

Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (Cutting, Karger, Pedersen, Tukey)