Searching the Web


Basic Information Retrieval

Who I Am


Associate Professor at UCLA Computer Science


Ph.D. from Stanford in Computer Science


B.S. from SNU in Physics


Got involved in early Web search-engine projects


Particularly in the Web-crawling part


Research on search engines and the social Web



Brief Overview of the Course


Basic principles and theories behind Web search engines


Not much discussion of implementation or tools, but I will be
happy to discuss them if there are any questions


Topics


Basic IR models, data structures, and algorithms


Topic-based models


Latent Semantic Indexing


Latent Dirichlet Allocation


Link-based ranking


Search-engine architecture


Issues of scale, Web crawling

Who Are You?


Background


Expectation


Career goal

Today’s Topic


Basic Information Retrieval (IR)


Three approaches for computer-based information management


Bag of words assumption


Boolean Model


String-matching algorithm


Inverted index


Vector-space model


Document-term matrix


TF-IDF vector and cosine similarity


Phrase queries


Spell correction

Computer-based Information Management


Basic problem


How to use computers to help humans store, organize and
retrieve information?


What approaches have been taken and what has been
successful?

Three Major Approaches


Database approach


Expert-system approach


Information-retrieval approach

Database Approach


Information is stored in a highly structured way


Data is stored in relational tables as tuples


Simple data model and query language


Relational model and SQL query language


Clear interpretation of data and query


No ambition to be “intelligent” like humans


Mainly focuses on highly efficient systems


“Performance, performance, performance”


It has been hugely successful


All major businesses use an RDB system


>$20B market


What are the pros and cons?


Expert-System Approach


Information is stored as a set of logical predicates


Bird(x), Cat(x), Fly(x), …


Given a query, the system infers the answer through
logical inference


Bird(Ostrich)


Fly(Ostrich)?


Popular approach in the 80s, but it has not been successful for
general information retrieval


What are the pros and cons?

Information-Retrieval Approach


Uses existing text documents as information source


No special structuring or database construction required


Text-based query language


Keyword-based query or natural-language query


The system returns the best-matching documents given the query


Had limited appeal until the Web became popular


What are the pros and cons?

Main Challenge of IR Approach


Relational Model


Interpretation of query and data is straightforward


Student(name, birthdate, major, GPA)


SELECT * FROM Student WHERE GPA > 3.0


Information Retrieval


Both queries and data are “fuzzy”


Unstructured text and “natural language” query


What documents are good matches for a query?


Computers do not “understand” the documents or the queries


Developing a “model” that a computer can execute is essential to
implementing this approach

Bag of Words: Major Simplification


Consider each document as a “bag of words”


“bag” vs. “set”


Ignore word ordering, but keep word count


Consider queries as bag of words as well


Great oversimplification, but works adequately in many
cases


“John loves only Jane” vs. “Only John loves Jane”


The limitation still shows up on current search engines


Still, how do we match documents and queries?


Boolean Model


Return all documents that contain the words in the query


Simplest model for information retrieval


No notion of “ranking”


A document is either a match or a non-match


Q: How to find and return matching documents?


Basic algorithm?


Useful data structure?

String-Matching Algorithm


Given a string “abcde”, find what documents contain the string


Q: Computational complexity of naïve matching of a string
of length m over a document of length n?


Q: Any efficient way?
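
As a point of reference (a minimal sketch in Python, not from the slides), the naïve matcher compares the word against every possible alignment in the document, so its worst case is O(n·m):

    def naive_match(D: str, W: str) -> int:
        """Return the first position where W occurs in D, or -1 if it does not occur."""
        n, m = len(D), len(W)
        for start in range(n - m + 1):        # every possible alignment of W in D
            if D[start:start + m] == W:       # up to m character comparisons each
                return start
        return -1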

String Matching Example (1)



m   0123456789
D:  ABCABABABC   (doc)
W:  ABABC        (word)
i   01234


Two cursors: m = 2, i = 1


m: the beginning of the matching part in D


i: the location of the matching character in W

String Matching Example (2)



m   0123456789
D:  ABCABABABC   (doc)
W:  ABABC        (word)
i   01234


Mismatch at m = 0, i = 2


Q: What can we do? Start again at m = 1, i = 0?

String Matching Example (2)



m   0123456789
D:  ABCABABABC   (doc)
W:     ABABC     (word)
i      01234


Mismatch at m = 3, i = 4


Q: What can we do? Start at m = 7, i = 0?

String Matching Example (3)

Algorithm KMP


If no substring in W is self-repeated, we can slide W “completely” past the matched portion

    m ← m + i
    i ← 0


If a suffix of the matched part is equal to a prefix of W, we have to slide back a little bit

    m ← m + i - x    // x is how much to slide back
    i ← x


The exact value of x depends on the length of the prefix of W matching the suffix of the matched part


T[0…|W|]: “slide-back” table recording the x values

Algorithm KMP

W:  string to look for
D:  document
T:  “slide-back” table in case of mismatch

while (m + i) < |D| do:
    if W[i] = D[m + i]:
        let i = i + 1
        if i = |W|, return m
    otherwise:
        let m = m + i - T[i]
        if i > 0, let i = T[i]
return no-match
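
A compact, executable version of the same idea (a sketch in Python, not part of the course material); build_table computes the “slide-back” values T used by the loop above:

    def build_table(W: str) -> list[int]:
        """T[i] = length of the longest proper prefix of W[:i] that is also a suffix
        of W[:i], i.e. how much of the match we can keep; T[0] = -1 by convention."""
        T = [-1] + [0] * (len(W) - 1)
        x = 0                                   # border length of W[:i-1]
        for i in range(2, len(W)):
            while x > 0 and W[i - 1] != W[x]:   # shrink the border until it extends
                x = T[x]
            if W[i - 1] == W[x]:
                x += 1
            T[i] = x
        return T

    def kmp_search(D: str, W: str) -> int:
        """Follows the loop above: m = start of the current alignment in D,
        i = position of the character currently compared in W."""
        T = build_table(W)
        m = i = 0
        while m + i < len(D):
            if W[i] == D[m + i]:
                i += 1
                if i == len(W):
                    return m
            else:
                m = m + i - T[i]
                if i > 0:
                    i = T[i]
        return -1                               # no-match

    print(kmp_search("ABCABABABC", "ABABC"))    # -> 5
    print(build_table("ABCDABD"))               # T values for the word on the next slide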


Algorithm KMP: T[i] Table


W: ABCDABD   (word)
i: 0123456


m ← m + i - T[i]


T[0] = -1, T[1] = 0


Q: What should be T[i] for i = 2…6?


Data Structure for Quick Document Matching


Boolean model


Find all documents that contain the keywords in Q.


Q: What data structure will be useful to do it fast?

Inverted Index


Allows quick lookup of the document ids containing a particular word


Lexicon/dictionary DIC       Postings lists

Stanford   →   PL(Stanford):  1, 2, 3, 9, 16, 18
UCLA       →   PL(UCLA):      3, 8, 10, 13, 16, 20
MIT        →   PL(MIT):       4, 5, 8, 10, 13, 19, 20, 22


Q: How can we use this to answer “UCLA Physics”?
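
As a concrete sketch (mine, not the slides’), the index can be held as a dictionary from word to a sorted postings list, and a conjunctive query is answered by intersecting the postings lists of its terms; the word-to-list assignment below simply mirrors the figure above, and a real corpus would of course also have an entry for “physics”:

    # Minimal in-memory inverted index over the postings lists shown above
    index = {
        "stanford": [1, 2, 3, 9, 16, 18],
        "ucla":     [3, 8, 10, 13, 16, 20],
        "mit":      [4, 5, 8, 10, 13, 19, 20, 22],
    }

    def boolean_and(index, query):
        """Boolean model: return the documents containing every query word (no ranking)."""
        postings = [set(index.get(w.lower(), [])) for w in query.split()]
        return set.intersection(*postings) if postings else set()

    print(sorted(boolean_and(index, "Stanford UCLA")))   # -> [3, 16]

Production systems walk the sorted lists in parallel instead of materializing sets, but the intersection is the same operation.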


Size of Inverted Index (1)


100M docs, 10KB/doc, 1000 unique words/doc, 10B/word, 4B/docid


Q: Document collection size?


Q: Inverted index size?


Heaps’ law: vocabulary size = k·n^b, with 30 < k < 100 and 0.4 < b < 1


k = 50 and b = 0.5 are a good rule of thumb
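
A back-of-the-envelope sketch of the two questions, using only the numbers above (my arithmetic, not the slides’):

    docs            = 100_000_000                 # 100M documents
    doc_size        = 10_000                      # 10 KB per document
    words_per_doc   = 1_000                       # unique words per document
    bytes_per_docid = 4

    collection_size = docs * doc_size                         # 1e12 bytes ~ 1 TB
    postings_size   = docs * words_per_doc * bytes_per_docid  # 4e11 bytes ~ 400 GB

    # Heaps' law estimate of the vocabulary, with k = 50, b = 0.5 and the total
    # number of word occurrences as a rough stand-in for n:
    vocabulary = 50 * (docs * words_per_doc) ** 0.5           # ~1.6e7 distinct words
    print(collection_size, postings_size, int(vocabulary))

Under these assumptions the postings lists dominate the index, while the dictionary (tens of millions of short strings) stays comparatively small.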

Size of Inverted Index (2)


Q: Between dictionary and postings lists, which one is
larger?




Q: Lengths of postings lists?




Zipf’s law: collection term frequency ∝ 1 / frequency rank


Q: How do we construct an inverted index?

Inverted Index Construction

C: set of all documents (corpus)
DIC: dictionary of the inverted index
PL(w): postings list of word w

1:  For each document d ∈ C:
2:      Extract all words in content(d) into W
3:      For each w ∈ W:
4:          If w ∉ DIC, then add w to DIC
5:          Append id(d) to PL(w)


Q: What if the index is larger than main memory?

Inverted-Index Construction


For large text corpus


Block-sort-based construction (sketched below)


Partition and merge
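
A rough sketch of that partition-and-merge idea (my own simplification, not the course’s code): index each block of documents that fits in memory, spill the partial index to disk sorted by word, then merge the sorted runs:

    import heapq, os, pickle, tempfile

    def index_block(block):
        """Build an in-memory partial index {word: list of doc ids} for one block."""
        partial = {}
        for doc_id, text in block:
            for w in set(text.lower().split()):
                partial.setdefault(w, []).append(doc_id)
        return partial

    def build_index(blocks):
        """Block-sorted construction: partial index per block, spill to disk, merge."""
        runs = []
        for block in blocks:                          # each block fits in main memory
            tmp = tempfile.NamedTemporaryFile(delete=False)
            pickle.dump(sorted(index_block(block).items()), tmp)   # sorted by word
            tmp.close()
            runs.append(tmp.name)

        streams = []
        for path in runs:                             # a real system would stream these
            with open(path, "rb") as f:
                streams.append(pickle.load(f))
            os.remove(path)

        merged = {}
        for word, postings in heapq.merge(*streams):  # k-way merge of the sorted runs
            merged.setdefault(word, []).extend(postings)
        return merged

    docs = [(1, "UCLA physics"), (2, "Stanford physics"), (3, "UCLA computer science")]
    print(build_index([docs[:2], docs[2:]]))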

Evaluation: Precision and Recall


Q: Are all matching documents what users want?



Basic idea: a model is good if it returns a document if and
only if the document is “relevant”.



R: set of “relevant” documents

D: set of documents returned by a model
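
With these two sets, the standard measures (the usual definitions, not spelled out on the slide) are:

    Precision = |R ∩ D| / |D|    (what fraction of the returned documents are relevant)
    Recall    = |R ∩ D| / |R|    (what fraction of the relevant documents are returned)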


Vector-Space Model


Main problem of Boolean model


Too many matching documents when the corpus is large


Any way to “rank” documents?


Matrix interpretation of Boolean model


Document-term matrix


Boolean 0 or 1 value for each entry


Basic idea


Assign real-valued weights to the matrix entries depending on
the importance of the term


“the” vs. “UCLA”


Q: How should we assign the weights?



TF-IDF Vector


A term t is important for document d


If t appears many times in d or


If t is a “rare” term


TF: term frequency


# occurrences of t in d


IDF: inverse document frequency


# documents containing t


TF-IDF weighting


TF × log(N / IDF)



Q: How to use it to compute query-document relevance?

Cosine Similarity


Represent both the query and the document as TF-IDF vectors


Take the inner product of the two normalized vectors to
compute their similarity




Note: |Q| does not matter for document ranking.


Division by |D| penalizes longer documents.
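
For reference, the standard formula is sim(Q, D) = (Q · D) / (|Q| |D|); a small sketch in Python (my own, reused in the example on the next slide):

    import math

    def cosine_similarity(q_vec, d_vec):
        """q_vec, d_vec: dicts mapping term -> TF-IDF weight."""
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
        d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
        return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0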

Cosine Similarity: Example


idf(UCLA) = 10, idf(good) = 0.1, idf(university) = idf(car) = idf(racing) = 1



Q = (UCLA, university), D = (car, racing)




Q = (UCLA, university), D = (UCLA, good)




Q = (UCLA, university), D = (university, good)
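
Using the cosine_similarity sketch above with these idf values as the vector weights (taking TF = 1 for every term), the three pairs work out roughly as follows; the arithmetic is mine, not the slides’:

    Q  = {"ucla": 10, "university": 1}
    D1 = {"car": 1, "racing": 1}
    D2 = {"ucla": 10, "good": 0.1}
    D3 = {"university": 1, "good": 0.1}

    for D in (D1, D2, D3):
        print(round(cosine_similarity(Q, D), 3))
    # -> 0.0 (no overlap), 0.995 (the rare term UCLA matches), 0.099 (only "university" matches)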




Finding High Cosine-Similarity Documents


Q: Under the vector-space model, does precision/recall make
sense?




Q: How to find the documents with highest cosine
similarity from corpus?




Q: Any way to avoid complete scan of corpus?

Inverted Index for TF-IDF


Q · d_i = 0 if d_i has no query words


Consider only the documents with query words


Inverted index: word → document


Lexicon (word, IDF):  Stanford 1/3530,  UCLA 1/9860,  MIT 1/937

Postings list (docid, TF):  (D1, 2), (D14, 30), (D376, 8)

(TF may be normalized by document size)
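
A sketch (my own toy structures, not the course’s) of how such an index avoids a complete scan: walk only the postings lists of the query terms and accumulate partial TF-IDF scores per document:

    from collections import defaultdict

    # Toy index with made-up numbers: word -> (idf weight, postings list of (doc_id, tf))
    index = {
        "ucla":    (10.0, [("D1", 2), ("D14", 30), ("D376", 8)]),
        "physics": ( 3.0, [("D14", 1), ("D99", 4)]),
    }

    def score(query):
        """Term-at-a-time accumulation: only documents on some query term's postings
        list ever receive a score, so the rest of the corpus is never touched."""
        scores = defaultdict(float)
        for word in query.lower().split():
            if word in index:
                idf, postings = index[word]
                for doc_id, tf in postings:
                    scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(score("UCLA physics"))
    # -> [('D14', 303.0), ('D376', 80.0), ('D1', 20.0), ('D99', 12.0)]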

Phrase Queries



“Harvard University Boston” exactly as a phrase


Q: How can we support this query?


Two approaches


Biword index


Positional index


Q: Pros and cons of each approach?




Rule of thumb: 2x to 4x size increase for a positional index
compared to a docid-only index
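
A sketch of the positional-index option (my own toy structure): store, for each word, the positions at which it occurs in each document, then require the query terms to appear at consecutive positions:

    # Toy positional index: word -> {doc_id: [positions in that document]}
    pos_index = {
        "harvard":    {1: [0, 17], 2: [4]},
        "university": {1: [1], 2: [5], 3: [0]},
        "boston":     {1: [2, 9], 3: [8]},
    }

    def phrase_match(phrase):
        """Return the doc ids where the words of `phrase` occur consecutively."""
        words = phrase.lower().split()
        if not words or any(w not in pos_index for w in words):
            return set()
        docs = set.intersection(*(set(pos_index[w]) for w in words))
        result = set()
        for d in docs:
            for p in pos_index[words[0]][d]:
                # every later word must appear exactly one position further along
                if all(p + k in pos_index[w][d] for k, w in enumerate(words[1:], start=1)):
                    result.add(d)
                    break
        return result

    print(phrase_match("Harvard University Boston"))   # -> {1}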




Spell correction


Q: What may the user have truly intended for the query
“Britnie Spears”? How can we find the correct spelling?


Given a user-typed word w, find its correct spelling c.


Probabilistic approach: Find c with the highest probability P(c|w).


Q: How to estimate it?


Bayes’ rule: P(c|w) = P(w|c)P(c)/P(w)


Q: What are these probabilities and how can we estimate
them?


Rule of thumb: about 3/4 of misspellings are within edit distance 1,
and 98% are within edit distance 2.
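
A minimal sketch of that noisy-channel idea, loosely in the style popularized by Peter Norvig (the word counts and the crude error model here are my own stand-ins): since P(w) is the same for every candidate, we pick the c maximizing P(w|c)·P(c), estimating P(c) from corpus counts and modeling P(w|c) by simply preferring smaller edit distance:

    from collections import Counter

    # Toy language model P(c): word frequencies from some corpus (made-up numbers)
    counts = Counter({"britney": 500, "britain": 300, "brittany": 200, "spears": 400})
    total = sum(counts.values())

    def edits1(word):
        """All strings within edit distance 1 of `word` (deletes, swaps, replaces, inserts)."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes  = [a + b[1:] for a, b in splits if b]
        swaps    = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
        inserts  = [a + c + b for a, b in splits for c in letters]
        return set(deletes + swaps + replaces + inserts)

    def correct(w):
        """argmax_c P(c|w) ~ argmax_c P(w|c) P(c), with candidates limited to edit
        distance <= 2 and P(w|c) modeled by preferring distance 0, then 1, then 2."""
        w = w.lower()
        for candidates in ([w], edits1(w), {e2 for e1 in edits1(w) for e2 in edits1(e1)}):
            known = [c for c in candidates if c in counts]
            if known:
                return max(known, key=lambda c: counts[c] / total)
        return w

    print(correct("Britnie"))   # -> 'britney' under these toy counts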


Summary


Boolean model


Vector-space model


TF-IDF weight, cosine similarity


String-matching algorithm


Algorithm KMP


Inverted index


Boolean model


TF-IDF model


Phrase queries


Spell correction