Lecture 9: Unstructured Data

cowphysicist, Internet and Web Applications

4 Dec 2013

Slide
1

Lecture 9: Unstructured Data


Information Retrieval


Types of Systems, Documents, Tasks


Evaluation: Precision, Recall


Search Engines (Google)


Architecture


Web Crawling


Query Processing


Inverted Indexes


PageRank (!)




Most of the IR portion of this material is taken from the course "Information retrieval
on the Internet" by Maier and Price, taught at PSU in alternate years.

Slide
2

Learning Objectives


LO9.1 Given a Transition matrix, draw a transition
graph, and vice versa.


LO9.2 Given a Transition matrix and a residence
vector, decide whether the vector is the PageRank for that matrix.


Slide
3

Information Retrieval (IR)


The study of Unstructured Data is called Information Retrieval (IR)


A Database refers to Structured Data

            DBMS                      IR
Target      Structured Data:          Unstructured Data:
            rows in tables            documents, media, etc.
Queries     SQL                       Keyword
Matching    precise                   approximate
Results     unordered (unless         list ordered by
            specified) list           matching priority

Slide
4

General types of IR systems


Web Pages


Full text documents


Bibliographies


Distributed variations


Metasearch


Virtual document collections


Slide
5

Types of Documents in IR Systems


Hyperlinked or not


Format


HTML


PDF


Word Processed


Scanned OCR


Type


Text


Multimedia


Semistructured, e.g., XML


Static or Dynamic

Slide
6

Types of tasks in IR systems


Find


an overview


a fact/answer a question


comprehensive information



a known item (document, page or site)


a site to execute a transaction (e.g., buy a book,
download a file)

Slide
7

Evaluation


How can we evaluate performance of an IR system?


System perspective


User perspective


User perspective: Relevance


(How well) does a document satisfy a user's need?


Ideally, an IR system will retrieve exactly those items
that satisfy the user's needs, no more, no less.


More: wastes user's time


Less: user misses valuable information

Slide
8

Notation

In response to a user's query:

The IR system reTrieves a set of documents T

The user knows the set of reLevant documents L

|X| denotes the number of documents in X

Ideally, T = L: no more (no junk), no less (no missing)

Slide
9

The big picture

[Venn diagram: two overlapping sets, T (retrieved) and L (relevant)]

Retrieved, Not Relevant = Junk

Relevant, Not Retrieved = Missing


Precision

\[ \text{Precision} = \frac{|T \cap L|}{|T|} \]

= fraction of retrieved items that were relevant

= 1 if all retrieved items were relevant (No Junk)


Recall

\[ \text{Recall} = \frac{|T \cap L|}{|L|} \]

= fraction of relevant items that were retrieved

= 1 if all the relevant items were retrieved (No Missing)
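
The two ratios can be sketched directly with Python sets. This is only an illustration: the document identifiers and the function names `precision` and `recall` are made up for the example, not taken from the slides.

```python
def precision(retrieved: set, relevant: set) -> float:
    """|T & L| / |T|: fraction of retrieved items that are relevant."""
    if not retrieved:
        raise ValueError("precision is undefined when nothing was retrieved")
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set, relevant: set) -> float:
    """|T & L| / |L|: fraction of relevant items that were retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when nothing is relevant")
    return len(retrieved & relevant) / len(relevant)


# Hypothetical example: T is what the system returned, L is what the user wanted.
T = {"d1", "d2", "d3", "d4"}
L = {"d2", "d4", "d5"}

print(precision(T, L))  # 2 of the 4 retrieved documents are relevant
print(recall(T, L))     # 2 of the 3 relevant documents were retrieved
```

Here d1 and d3 are junk (they lower precision) and d5 is missing (it lowers recall).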

Slide
10

Context


Precision and Recall were created for IR systems that
retrieved from a small set of items.


In that case one could calculate T and L.


Web search engines do not fit this model well; T and
L are huge.


Recall does not make sense in this model, but we
can apply the definition of "precision@10": the
fraction of the first 10 displayed items that are relevant.
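
A minimal sketch of precision@k, assuming the results arrive as a ranked list and that relevance judgments are available as a set. The name `precision_at_k` and the sample data are hypothetical. Dividing by k (rather than by the number of results actually returned) is one common convention, chosen here as an assumption.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the first k displayed results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_results[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k


# Hypothetical ranked output and relevance judgments.
ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
relevant = {"a", "c", "d", "h", "z"}

print(precision_at_k(ranked, relevant, 10))  # 4 relevant among the first 10
print(precision_at_k(ranked, relevant, 5))   # 3 relevant among the first 5
```

Note that "z" is relevant but never displayed; it would hurt recall, but precision@k ignores it.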

Slide
11

Experiment


Compute Precision@10,20 for Google, Bing and
Yahoo for this query:


Paris Hilton Hotel


Precision

= fraction of retrieved items that are relevant

Precision@10

Google

Bing

Yahoo

Slide
12

Search Engine Architecture


How often do you google?


What happens when you google?


http://www.google.com/corporate/tech.html


Average time: half a second


We need a crawler to create the indexes and docs.


Notice that the web crawler creates the docs.


From the docs, the indexes are created and the docs are
given ranks (cf. later slides).


Let's study the Web Crawler Algorithm (WCA)


Page 1143 of the handout

Slide
13

Web Crawler Algorithm


Input: Set of popular URLs S


Output: Repository of visited web pages R


Method:

1. If S is empty, end.

2. Select page p from S to crawl, delete p from S.

3. Get p* (page that p points to).

4. If p* is in R, return to (1).
   Else add p* to R, and add to S all outlinks from p* unless
   they are already in R or S.

5. Return to step (1).
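
The steps above can be sketched in Python as follows. This is a toy version under stated assumptions: `fetch` is a hypothetical callback that returns `(content, outlinks)` for a URL or `None` if the page cannot be retrieved, the repository R is keyed by URL, S is processed first-in first-out, and `max_pages` is an added safety limit. `TOY_WEB` is an invented in-memory stand-in for the real web.

```python
from collections import deque


def crawl(seed_urls, fetch, max_pages=1000):
    """Crawl from seed_urls and return the repository R as {url: content}."""
    frontier = deque(dict.fromkeys(seed_urls))  # S, duplicates removed, order kept
    seen = set(frontier)       # every URL ever put in S, so it also covers R
    repository = {}            # R
    while frontier and len(repository) < max_pages:  # step 1
        url = frontier.popleft()                     # step 2
        page = fetch(url)                            # step 3: get p*
        if page is None:
            continue
        content, outlinks = page
        repository[url] = content                    # step 4: add p* to R
        for link in outlinks:
            if link not in seen:                     # not already in R or S
                seen.add(link)
                frontier.append(link)
    return repository


# Invented four-page web: url -> (content, outlinks).
TOY_WEB = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a", "c"]),
    "c": ("page C", ["d"]),
    "d": ("page D", []),
}

print(list(crawl(["a"], TOY_WEB.get)))               # crawl order of the URLs
print(list(crawl(["a"], TOY_WEB.get, max_pages=2)))  # stops after two pages
```

Keeping one `seen` set answers the "already in R or S" test in a single lookup, because every URL in R passed through S first.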

Slide
14

WCA: Terminating Search


Limit the number of pages crawled


Total number of pages, or


Pages per site


Limit the depth of the crawl

Slide
15

WCA: Managing the Repository


Don't add duplicates to S


Need an index on S, probably hash


Don't add duplicates to R


Cannot happen since we search each URL only once?


A page can come from >1 URL; mirror sites


So use hash table of pages in R
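
One way to keep duplicate pages out of R is to key the hash table on a digest of the page content. The sketch below assumes exact byte-for-byte duplicates (as with plain mirror sites); the class name `Repository`, the method `add`, and the example URLs are hypothetical.

```python
import hashlib


def content_key(page_text: str) -> str:
    """Digest of the page content, used as the hash-table key."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


class Repository:
    """R: stores each distinct page once, whatever URL it came from."""

    def __init__(self):
        self._first_url_by_key = {}

    def add(self, url: str, page_text: str) -> bool:
        """Store the page; return False if the same content is already in R."""
        key = content_key(page_text)
        if key in self._first_url_by_key:
            return False
        self._first_url_by_key[key] = url
        return True

    def __len__(self):
        return len(self._first_url_by_key)


repo = Repository()
print(repo.add("http://a.example/", "hello web"))       # new content
print(repo.add("http://mirror.example/", "hello web"))  # same content, other URL
print(len(repo))
```

Pages that differ only slightly (for example by a timestamp) would not be caught by an exact digest; detecting those needs a different technique.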

Slide
16

WCA: Select Next Page in S?


Can use Random Search


Better: Most Important First


Can consider first set of pages to be most important


As pages are added, make them less important


Breadth first search


Can do a simplified PageRank (cf. later) calculation

Slide
17

WCA: Faster, Faster


Multiprogramming, Multiprocessing


Must manage locks on S


With billions of URLs, this becomes a bottleneck


So assign each process to a host/site, not a URL


This can become a denial-of-service attack, so throttle down and
take on several sites, organized by hash buckets


R also has bottleneck problems, and can be handled with
locks

Slide
18

On to Query Processing


Very different from structured data: no SQL, parser,
optimizer


Input is boolean combination of keywords


data [and] base


data OR base


Google's goal is an engine that "understands exactly
what you mean and gives you back exactly what you
want"

Slide
19

Inverted Indexes


When the crawl is complete, the search engine
builds, for each and every word, an inverted index.


An inverted index is a list of all documents
containing that word


The index may be a bit vector


It may also contain the location(s) of the word in the
document


Word: any word in any language, plus misspelling,
plus any sequence of characters surrounded by
punctuation!


Hundreds of millions of words


Farms of PCs, e.g. near Bonneville Dam, to hold all this data
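
A minimal sketch of an inverted index that also records word positions. The tokenizer (lowercasing and splitting on non-word characters) is a simplifying assumption and is narrower than the slide's definition of a word; the function name and the sample documents are invented.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        words = re.findall(r"\w+", text.lower())
        for position, word in enumerate(words):
            index[word][doc_id].append(position)
    return index


# Hypothetical three-document collection.
docs = {1: "data base systems", 2: "big data", 3: "home base"}
index = build_inverted_index(docs)

print(sorted(index["data"]))  # ids of documents containing "data"
print(index["data"][2])       # positions of "data" in document 2
```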


Slide
20

Mechanics of Query Processing

1. Relevant inverted indexes are found
   Typically the indexes are in memory, otherwise this could
   take a full half second

2. If they are bit vectors, they are ANDed or ORed,
   then materialized, then lists are handled


Result is many URLs.


Next step is to determine their rank so the highest
ranked URLs can be delivered to the user.
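
As an illustration of the bit-vector step, Python integers can stand in for bit vectors, with bit i set when document i contains the word. The helper names and the document ids are assumptions made for this sketch.

```python
def to_bit_vector(doc_ids):
    """Bit i is set exactly when document i contains the word."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def materialize(bits):
    """Turn a bit vector back into a sorted list of document ids."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical postings for the two keywords from the earlier slide.
data = to_bit_vector([1, 2, 5])
base = to_bit_vector([1, 3, 5])

print(materialize(data & base))  # query: data AND base
print(materialize(data | base))  # query: data OR base
```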


Slide
21

Ranking Pages


Indexes have returned pages. Which ones are most relevant to you?


There are many criteria for ranking pages; here are
some no-brainers (except the one marked !)


Presence of all words


All words close together


Words in important locations and formats on the page


! Words near anchor text of links in reference pages


But the killer criterion is PageRank


Slide
22

PageRank Intuition


You need to find a plumber. How do you do it?

1. Call plumbers and talk to them

2. ! Call friends and ask for plumber references

   Then choose plumbers who have the most references

3. !! Call friends who know a lot about plumbers (important
   friends) and ask them for plumber references

   Then choose plumbers who have the most references from
   important people.


Technique 1 was used before Google.


Google introduced technique 2 to search engines


Google also introduced technique 3


Techniques 2, and especially 3, wiped out the competition.


The big challenge: determine which pages are important

Slide
23

What does this mean for pages?

1. Most search engines look for pages containing the
   word "plumber"

2. Google searches for pages that are linked to by
   pages containing "plumber".

3. Google searches for pages that are linked to by
   important pages containing "plumber".


A web page is important if many important pages
link to it.


This is a recursive equation.


Google solves it by imagining a web walker.

Slide
24

The Web Walker


From page p, the walker follows a random link in p


Note that all links in p have equal weight


The walker walks for a very, very long time.


A residence vector [ y a m ] describes the percentage
of time that the walker spends on each page


What does the vector [1/3 1/3 1/3 ] mean?


In steady state, the residence vector will be (1st draft
of) the PageRank


Observe: pages with many in-links are visited often


Observe: important pages are visited most often
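
The walker can be simulated directly. The sketch below assumes a small three-page web whose link structure (`LINKS`) is chosen for illustration; the function name, the step count and the fixed random seed are likewise assumptions.

```python
import random

# Assumed link structure: page -> pages it links to.
LINKS = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def residence_vector(links, start, steps, seed=0):
    """Fraction of steps a random walker spends on each page."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    visits = {page: 0 for page in links}
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])  # every link has equal weight
    return {p: count / steps for p, count in visits.items()}


print(residence_vector(LINKS, "y", 200_000))
```

With this link structure the fractions should settle near 0.4, 0.4 and 0.2 for y, a and m; a longer walk gives a steadier estimate.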


Slide
25

Stochastic Transition Matrix


To describe the page walker's moves, we use a
stochastic transition matrix.


Stochastic = each column sums to 1


There are 3 web pages: Yahoo, Amazon and Microsoft


This matrix means that the Yahoo page has 2 outlinks, to
Yahoo (a self-link) and to Amazon, etc.

\[
\text{Matrix} =
\begin{pmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{pmatrix}
\]

(rows and columns in the order Y, A, M; each column holds the out-links of one page)
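
Going from a link graph to the transition matrix and back (LO9.1) can be sketched as below. Exact fractions are used so that the column sums can be compared with 1 without rounding. The function names are invented for the example, and `LINKS` encodes the Yahoo/Amazon/Microsoft example as read off the matrix.

```python
from fractions import Fraction

PAGES = ["Y", "A", "M"]
LINKS = {"Y": ["Y", "A"], "A": ["Y", "M"], "M": ["A"]}  # page -> out-links


def transition_matrix(pages, links):
    """Column j gives the probabilities of leaving page j for each target row."""
    row_of = {page: i for i, page in enumerate(pages)}
    matrix = [[Fraction(0)] * len(pages) for _ in pages]
    for col, source in enumerate(pages):
        out = links[source]
        for target in out:
            matrix[row_of[target]][col] += Fraction(1, len(out))
    return matrix


def graph_edges(pages, matrix):
    """Inverse direction: list (source, target, probability) for non-zero entries."""
    return [(pages[c], pages[r], matrix[r][c])
            for c in range(len(pages)) for r in range(len(pages))
            if matrix[r][c] != 0]


def is_stochastic(matrix):
    """True when every column sums to 1."""
    return all(sum(row[c] for row in matrix) == 1 for c in range(len(matrix)))


M = transition_matrix(PAGES, LINKS)
print(M)
print(graph_edges(PAGES, M))
print(is_stochastic(M))
```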

Slide
26

Transition Graph


Each Transition Matrix corresponds to a Transition
Graph, e.g.

[Transition graph with nodes Y, A, M and edges Y->Y (1/2), Y->A (1/2), A->Y (1/2), A->M (1/2), M->A (1)]

Slide
27

LO9.1:Transition Graph*


What is the Transition Graph for this Matrix?

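[Matrix over the pages Y, A, M; several entries were lost in extraction. Surviving entries, in reading order: 0, ½, 0, ½, 0]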

Slide
28

Solving for Page Rank


For small dimension matrices it is simple to calculate
the PageRank using Gaussian Elimination.


Remember [y,a,m] is the time the walker spends at
each site. Since it is a probability distribution,
y+a+m=1. Since the walker has reached steady
state,

\[
\begin{pmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
=
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
\]
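
A minimal sketch of the Gaussian elimination route, using exact fractions. The steady-state condition is rewritten as (M - I)v = 0, and one of the three (redundant) equations is replaced by y + a + m = 1. The solver `solve_linear_system` is a small illustrative helper that assumes the system has a unique solution; it is not a library routine.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Gauss-Jordan elimination with exact fractions (unique solution assumed)."""
    n = len(a)
    aug = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Clear the column in every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


half = Fraction(1, 2)
M = [[half, half, 0], [half, 0, 1], [0, half, 0]]

# Rows of (M - I), with the last row replaced by y + a + m = 1.
A = [[M[i][j] - (1 if i == j else 0) for j in range(3)] for i in range(3)]
A[2] = [1, 1, 1]
b = [0, 0, 1]

y, a, m = solve_linear_system(A, b)
print(y, a, m)
```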

Slide
29

Solving, ctd


Solving such small equations is easy, but in reality
the matrix dimension is the number of pages in the
web, so it is in the billions.


There is a simpler way, called relaxation.


Start with a distribution, typically equal values, and
transform it by the matrix.

\[
\begin{pmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{pmatrix}
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
=
\begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix}
\]
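
Relaxation is repeated matrix-vector multiplication. The sketch below assumes plain Python lists for the matrix and vector; the function names and the default of 100 iterations are choices made for the example.

```python
from fractions import Fraction


def relax_once(matrix, vector):
    """One relaxation step: multiply the vector by the transition matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def pagerank_by_relaxation(matrix, iterations=100):
    """Start from equal values and apply the matrix repeatedly."""
    n = len(matrix)
    vector = [1.0 / n] * n
    for _ in range(iterations):
        vector = relax_once(matrix, vector)
    return vector


M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

# First step, done with exact fractions to match the worked equation.
third = Fraction(1, 3)
M_exact = [[Fraction(x).limit_denominator() for x in row] for row in M]
print(relax_once(M_exact, [third, third, third]))

print(pagerank_by_relaxation(M))
```

For this matrix the repeated steps settle near 0.4, 0.4, 0.2.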

Slide
30

Solving, ctd


If we repeat this only 5-10* times the vectors
converge to values very close to [2/5,2/5,1/5]. Check
that this is a solution:

\[
\begin{pmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{pmatrix}
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
=
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
\]


This solution gives the PageRank of each page on
the Web.


It is also called the eigenvector of the matrix with
eigenvalue one.


Does this agree with our intuition about Page Rank?


*For real web values, at most 100 iterations suffice
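
Checking whether a given residence vector is the PageRank for a matrix (LO9.2) only needs one multiplication. This sketch uses exact fractions so equality is tested without rounding; the function name `is_pagerank` is hypothetical.

```python
from fractions import Fraction


def is_pagerank(matrix, vector):
    """True when the vector sums to 1 and the matrix maps it to itself."""
    if sum(vector) != 1:
        return False
    product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    return product == list(vector)


half = Fraction(1, 2)
M = [[half, half, 0], [half, 0, 1], [0, half, 0]]

print(is_pagerank(M, [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]))
print(is_pagerank(M, [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]))
```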

Slide
31

LO9.2: Identify Solution


Is [ 3/8, 1/4, 3/8 ] a solution for this transition matrix ?

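[Matrix over the pages Y, A, M; several entries were lost in extraction. Surviving entries, in reading order: 0, ½, 0, ½, 0]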

Slide
32

A Spider Trap


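Let's look at a more realistic example called a spider trap.

\[
M =
\begin{pmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 0 \\
0 & 1/2 & 1
\end{pmatrix}
\]


The Transition Graph is:

[Transition graph with nodes Y, A, M and edges Y->Y (1/2), Y->A (1/2), A->Y (1/2), A->M (1/2), M->M (1)]


The page M represents any set of web pages that does not
have a link outside the set.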

Slide
33

A Spider Trap


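The Page Rank is:

\[
\begin{pmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 0 \\
0 & 1/2 & 1
\end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
\]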


Relaxation arrives at this vector because a random
walker arrives at M and stays there in a loop.


This Page Rank vector violates the Page Rank
principle that inlinks should determine importance.
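
Running relaxation on the spider trap matrix shows the walker's time draining into M. This is a sketch only; the function name `relax` and the iteration count of 200 are assumptions.

```python
def relax(matrix, vector, iterations):
    """Apply the transition matrix to the vector the given number of times."""
    for _ in range(iterations):
        vector = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    return vector


SPIDER_TRAP = [[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]]

start = [1 / 3, 1 / 3, 1 / 3]
for steps in (1, 10, 200):
    print(steps, relax(SPIDER_TRAP, start, steps))
```

The matrix is still stochastic, so the entries keep summing to 1; the share held by Y and A simply shrinks toward zero.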

Slide
34

A Dead End


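A similar example, called a dead end, is

\[
M =
\begin{pmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 0 \\
0 & 1/2 & 0
\end{pmatrix}
\]


The Transition Graph is:

[Transition graph with nodes Y, A, M and edges Y->Y (1/2), Y->A (1/2), A->Y (1/2), A->M (1/2); M has no out-links]


The page M represents any set of web pages that does not
have out-links.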

Slide
35

A Dead End, ctd


A dead end matrix is not stochastic, because the column
for M sums to 0 rather than 1.


The only solution of M v = v for this dead end matrix is the
zero vector.


Relaxation arrives at the zero vector because a
random walker arrives at M and then has nowhere to
go.
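
The leak can be observed by tracking the total of the vector during relaxation. A sketch, with the function name `total_mass_after` and the step counts chosen for illustration:

```python
def total_mass_after(matrix, steps):
    """Sum of the residence vector after the given number of relaxation steps."""
    n = len(matrix)
    vector = [1.0 / n] * n
    for _ in range(steps):
        vector = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    return sum(vector)


DEAD_END = [[0.5, 0.5, 0.0],
            [0.5, 0.0, 0.0],
            [0.0, 0.5, 0.0]]

for steps in (0, 1, 10, 200):
    print(steps, total_mass_after(DEAD_END, steps))
```

After one step a third of the total is already gone (whatever sat on M is lost), and the remainder keeps shrinking toward zero.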

Slide
36

What to do?


In these cases, which happen all the time on the web,
the web walker algorithm does not identify which
pages are truly important.


But we can tweak the algorithm to do so: every 5th
step or so, the walker steps to a random page on the
web.


Then the walk (spider trap example) becomes

\[
P_{\text{new}} = 0.8
\begin{pmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 0 \\
0 & 1/2 & 1
\end{pmatrix}
P_{\text{old}}
+ 0.2
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
\]
Slide
37

Teleporter


Now our tweaked random walker is a teleporter.


With probability 80%* s/he follows a random link from
the current page, as before.


But with probability 20% s/he teleports to a random
page with uniform probability.


It could be anywhere on the web, even the current page


If s/he is at a dead end, with 100% probability s/he
teleports to a random page with uniform probability.


*80-20% are tunable parameters

Slide
38

Solving the Teleporter Equation


The equation on slide 36 describes the teleporter's
walk. It can be solved using relaxation or Gaussian
elimination.


The solution is (7/33, 5/33, 21/33).


It gives unreasonably high importance to M, but does
recognize that Y is more important than A.
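
The teleporter equation can be solved by relaxation in a few lines. The sketch below assumes the 80/20 split from the slides and the spider trap matrix; the function name `teleport_pagerank` and the iteration count are choices made for the example. It does not implement the special 100% teleport rule for dead ends, which the spider trap example does not need.

```python
def teleport_pagerank(matrix, follow=0.8, iterations=200):
    """Iterate P_new = follow * M * P_old + (1 - follow) * uniform vector."""
    n = len(matrix)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [follow * sum(m * v for m, v in zip(row, p)) + (1 - follow) / n
             for row in matrix]
    return p


SPIDER_TRAP = [[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]]

y, a, m = teleport_pagerank(SPIDER_TRAP)
print(y, a, m)
print(7 / 33, 5 / 33, 21 / 33)  # the solution quoted on the slide, for comparison
```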