Lecture 8: Web Searching


SEEM 5760 Client-Server Information Systems
2012-2013 Term 2
Department of Systems Engineering & Engineering Management
The Chinese University of Hong Kong
Dorbin Ng

Copyright © 2013 The Chinese University of Hong Kong. All Rights Reserved.
CUHK . 2013 . SEEM 5760 . Dorbin Ng . Version 1.3, March 15, 2013

Review

- Cloud computing
  - SaaS, PaaS, IaaS
  - Cloud computing engineering
- SOA, Web services
  - Mashups & Web APIs
- Mobile Computing
  - MVC & Push notification
- Internet of Things
  - RFID, NFC
- Social Networking

Overview

- Web Searching
  - Information Retrieval & Web Search Engines
  - Web Crawler
  - Indexing
  - PageRank: Link Analysis
  - Searching
  - Vector Space Model
- Multimedia Presentation
  - Multimedia Search

Information Retrieval (IR)

- The area of study concerned with searching for some information need
  - For documents
  - For information within documents
  - For metadata about documents
  - For data in relational databases
  - For web pages/information in the World Wide Web
- IR is interdisciplinary
  - Computer science, mathematics, library science
  - Information science, information architecture
  - Cognitive psychology, linguistics, statistics
- Web search engines are the most visible IR applications
  - E.g., Google

Information Retrieval

The 10 Best Search Engines of 2013 (About.com)

    Search Site           Features
 1  Dogpile               Pleasant presentation & helpful crosslink results
 2  Yippy                 Searching for obscure blogs, information, news, etc.
 3  Ask.com               Cleaner & easier presentation for reading
 4  Bing                  Offering suggestions in the leftmost column
 5  DuckDuckGo            "Zero-click" information; disambiguation prompts
 6  The Internet Archive  Snapshots of the entire WWW for years
 7  Webopedia             Encyclopedic resource on techno terminology & computer definitions
 8  Mahalo                "Human-powered" search site by a committee of editors to manually sift and vet content
 9  Yahoo!                "Web portal" for searching, discovery & exploration
10  Google                Fast, relevant, the largest single catalog of web pages available today

http://netforbeginners.about.com/od/navigatingthenet/tp/top_10_search_engines_for_beginners.htm

User Requirements for Web Search Engines

- The 3 key features wanted from a search engine:
  - Relevant results
    - Results you are actually interested in
    - Or, just one result to fill what you are looking for
  - Uncluttered, easy-to-read interface
    - User interface design for showing the result set
    - Is a result list the best user interface? Any other choice?
  - Helpful options to broaden or tighten a search
    - Providing suggestions for refining a query
    - Providing filtering options to trim down the result set
- Out of 290 search engines, the About.com Guide picked the 10 best search engines of 2013 from comments given by its readers.
- In general, these 10 should meet 99% of the searching needs of a regular everyday user.

Number of Monthly Web Searches

- [Chart: billions of searches per month, Dec 2012] Google: 65.2%, 115 B searches per month; Bing + finding things within Microsoft.com: 16.3%; other engines: 8.2%, 4.9%, 2.8%.

http://searchengineland.com/google-worlds-most-popular-search-engine-148089, Feb 11, 2013.

Number of Unique Searchers Per Month

- [Chart: millions of unique searchers per month, Dec 2012] Google: 76.6%, 1.17 B unique searchers; other engines: 19.2%, 19.2%, 4.9%.

Worldwide Popularity, Over Time: 2007-2012

- [Chart: % share per month of number of searches, 2007-2012] 65.2%, 4.9%, 2.8%, 2.5%.
- Google has tripled its volume → monetization opportunities
- Growth in billions of searches per month, 2007-2012:
  - Baidu: 3x
  - Microsoft (Bing): 2x
  - Yahoo: 1x (almost the same)
  - Yandex: ~7x

System Specifications of Web Search Engines

- Performing 3 basic tasks:
  - Searching the Internet, or selecting pieces of the Internet, based on important words
  - Keeping an index of the words they find, and where they find them
  - Allowing users to look for words or combinations of words found in that index
- Typically these days, a top search engine is
  - Indexing hundreds of millions of pages
  - Responding to tens of millions of queries per day
- Search engine in a nutshell
  - "What is Search Engine Optimization (SEO)?"
    http://www.youtube.com/watch?feature=player_embedded&v=hF515-0Tduk

Web Search Architecture

- [Diagram] Components operating over the World Wide Web: Web Crawlers, Indexing Engines, Link Analysis, a Page Database, an Index Database, a Link Database, a Relevance Ranker, and a Query Engine, which together form the Search Engine.

Brief History of Web Search Engine Development

- Archie
  - Created in 1990, the first tool for searching the Internet
  - Downloading directory listings of all files located on public anonymous FTP servers
  - Creating a searchable database of filenames
- Gopher
  - Created in 1991
  - Indexing plain text documents
  - "Veronica" and "Jughead" came along to search Gopher's index systems
- Wandex
  - Created in June 1993, the first actual Web search engine
  - With the creation of the first web robot, the Perl-based World Wide Web Wanderer, to generate an index called "Wandex"

Web Crawler

- A computer program that browses the World Wide Web in a systematic, automated manner or in an orderly fashion
- Aka: ants, automatic indexers, bots, web spiders, web robots
- Web sites or applications use spidering as a means of providing up-to-date data for
  - Search engine services
    - Creating a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches
  - Web maintenance services
    - Automating processes of checking links and validating HTML code
  - Information analysis
    - Pattern extraction: harvesting email addresses (usually for spam)
    - Content analysis: finance, marketing, science, national security

Web Crawler

Web Crawler Mechanism

- Spidering mechanism
  - Starting with a list of URLs to visit, called the seeds
  - Repeating the following steps:
    - Visiting these URLs
    - Extracting all the hyperlinks in the pages
    - Adding them to the list of URLs to visit
- Performance requirements
  - Scalable, as the web evolves continuously
  - Efficient, as crawling generates network traffic
  - Fresh, as pages are updated from time to time
- Web crawling policies
  - Selection policy: stating which pages to download
  - Re-visit policy: stating when to check for changes to the pages
  - Politeness policy: stating how to avoid overloading web sites
  - Parallelization policy: stating how to coordinate distributed web crawlers
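The spidering loop above can be sketched in a few lines. This is a minimal illustration, not a production crawler: real crawlers add politeness delays, robots.txt handling, and parallelism. The `fetch_links` callback and the toy page names are hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Spidering loop: start from the seed URLs, repeatedly visit a URL,
    extract its hyperlinks, and append unseen ones to the frontier
    (a simple breadth-first selection policy)."""
    frontier = deque(seeds)   # URLs still to visit
    visited = set()           # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):   # hyperlinks found in the page
            if link not in visited:
                frontier.append(link)
    return visited

# A toy "web" standing in for real HTTP fetches (hypothetical pages):
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}
pages = crawl(["A"], lambda u: toy_web.get(u, []))
print(sorted(pages))   # → ['A', 'B', 'C', 'D']
```

Swapping the queue for a priority queue ordered by, e.g., estimated page importance turns the same loop into a different selection policy.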

Web Crawler Architectures

- Challenges in web crawling:
  - Downloading hundreds of millions of pages over several weeks
  - Leading to challenges in system design, I/O & networking efficiency, robustness & manageability
- Commercial crawler architectures are kept as business secrets

Crawling More & Deeper

- Some well-known web crawlers
  - Googlebot
  - Yahoo! Slurp
  - Msnbot: Microsoft's Bing web crawler
- Open-source crawlers
  - crawler4j
  - GRUB
  - HTTrack: creating mirrors of web sites
  - Open Search Server
  - tkWWW Robot
- Crawling the deep web
  - A vast amount of web pages lie in the deep or invisible web
  - Typically only accessible by submitting queries to a database
  - "Pages" with no links
- Crawling Web 2.0 applications
  - Making AJAX applications crawlable
  - Allowing for dynamically created content to be visible to crawlers

Indexing in Information Retrieval

- Aka search engine indexing
- Collecting, parsing, and storing data to facilitate fast and accurate information retrieval
- Popular & common search engines focus on
  - Full-text indexing of online, natural language documents
- Media types
  - Text, video, audio, graphics, & other searchable objects
- Incorporating interdisciplinary concepts from
  - Linguistics, cognitive psychology, mathematics, informatics, physics, and computer science

Indexing

Purpose of Indexing

- Creating & storing an index
  - To optimize speed & performance in finding relevant documents for a search query
- Achieving the shortest possible query response time
  - Users don't want to wait for search results!
  - Or even worse, users can't imagine they need to wait!
  - Without an index, the search engine would scan every document in the corpus, which would require considerable time & computing power
    - And it would repeat this scanning for every query
    - E.g., while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours
- Storage & maintenance tradeoff
  - Extra: additional computer storage required to store the index
  - Extra: time required for updates to take place continuously
  - Saving: time saved for all search requests

Tokenization

- The process of breaking up a stream of text into words, phrases, symbols, or other meaningful elements called tokens or terms
- The list of tokens becomes the fundamental input supporting further processing such as parsing or text mining
- Typically, tokenization occurs at the word level
  - However, it is sometimes difficult to define what is meant by a "word"
- Some simple tokenizer heuristics
  - All contiguous strings of alphabetic characters are part of one token; likewise with numbers
  - Tokens are separated
    - By whitespace characters such as a space or line break
    - By punctuation characters
  - Punctuation & whitespace may or may not be included in the resulting list of tokens
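The heuristics above can be sketched with a single regular expression. This is one possible tokenizer under those heuristics, not the lecture's reference implementation; lower-casing is added as the simple normalization the next slide mentions.

```python
import re

def tokenize(text):
    """Heuristic tokenizer: contiguous runs of letters form one token,
    likewise runs of digits; whitespace and punctuation act as
    separators and are dropped from the token list."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+|[0-9]+", text)]

print(tokenize("It is what it is."))   # → ['it', 'is', 'what', 'it', 'is']
```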

Tokenization: Language Dependent

- In English and many other languages using some form of the Latin alphabet
  - A space is a good approximation of a word delimiter
  - Resulting tokens may be normalized
    - To just lower (or upper) case
    - By applying stemming (plural & singular, tense, etc.)
- In Chinese and Japanese
  - Sentences, but not words, are delimited
- In Thai and Lao
  - Phrases & sentences, but not words, are delimited
- In Arabic
  - Distinctive initial, medial, & final letter shapes are signals for text segmentation

From Forward Index to Inverted Index

- Forward index
  - Storing a list of words per document
  - Infeasible to query the forward index directly
    - Requiring sequential iteration through each document & each word to verify a matching document
    - The time, memory, and processing resources to perform such a query are not always technically realistic
- Inverted index
  - Storing a list of documents per word (index)
  - That is, inverting the document-to-indexes listing into an index-to-documents listing
  - With the inverted index created, a query can be resolved by jumping to the word id (via random access) in the inverted index

Document Set
  Doc1: Mary has a little lamb.
  Doc2: The wolf ate the lamb.

Forward Index
  Doc1: {Mary, has, a, little, lamb}
  Doc2: {The, wolf, ate, the, lamb}

Inverted Index
  Mary:   {Doc1}
  Has:    {Doc1}
  A:      {Doc1}
  Little: {Doc1}
  Lamb:   {Doc1, Doc2}   ← in 2 documents
  The:    {Doc2}         ← represented once
  Wolf:   {Doc2}
  Ate:    {Doc2}
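Inverting a forward index is a short loop. The sketch below builds the record-level index for the Mary/wolf example above; terms are lower-cased so "The" and "the" collapse into one entry, as in the slide's index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Invert a forward index (doc id → term list) into a record-level
    inverted index (term → set of doc ids containing the term)."""
    inverted = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            inverted[term.lower()].add(doc_id)
    return dict(inverted)

forward = {
    "Doc1": ["Mary", "has", "a", "little", "lamb"],
    "Doc2": ["The", "wolf", "ate", "the", "lamb"],
}
index = build_inverted_index(forward)
print(sorted(index["lamb"]))   # → ['Doc1', 'Doc2']
```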

Inverted Index

- In computer science, an index data structure storing a mapping from content, such as words or numbers, to its locations
  - In a database, a document, or a set of documents
- Allowing fast full-text searches
  - At a cost of increased processing when a document is added to the database
- Two main variants of inverted indexes
  - Record-level inverted index
    - Containing a list of references to documents for each word
  - Word-level inverted index
    - Additionally containing the positions of each word within a document
    - Supporting extra functionality like phrase searches, but needing more time and space to be created

Record Level Inverted Index Example

- Document set (Doc → text content)
  - D0 = "It is what it is."
  - D1 = "What is it?"
  - D2 = "It is a banana."
- Forward index: applying tokenization & case-conversion on each document (Doc → indexes)
  - D0 = (it, is, what, it, is)
  - D1 = (what, is, it)
  - D2 = (it, is, a, banana)
- Inverted index: for each index term in the document set, collecting the set of documents containing it (index → doc set)
  - "a"      = {D2}
  - "banana" = {D2}
  - "is"     = {D0, D1, D2}
  - "it"     = {D0, D1, D2}
  - "what"   = {D0, D1}
- Search for "what", "is", and "it"
  - {D0, D1} ∩ {D0, D1, D2} ∩ {D0, D1, D2} = {D0, D1}
  - Note: ∩ means "and"
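The intersection step above maps directly onto set operations. A minimal sketch of conjunctive (AND) query resolution over the record-level index from this slide:

```python
def search_and(inverted, terms):
    """Resolve a conjunctive (AND) query on a record-level inverted
    index by intersecting the posting sets of the query terms."""
    postings = [inverted.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

inverted = {
    "a": {"D2"}, "banana": {"D2"},
    "is": {"D0", "D1", "D2"}, "it": {"D0", "D1", "D2"},
    "what": {"D0", "D1"},
}
print(sorted(search_and(inverted, ["what", "is", "it"])))   # → ['D0', 'D1']
```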


Word Level Inverted Index Example

- Forward index (Doc → (term, position) pairs)
  - D0 = [(it, 0), (is, 1), (what, 2), (it, 3), (is, 4)]
  - D1 = [(what, 0), (is, 1), (it, 2)]
  - D2 = [(it, 0), (is, 1), (a, 2), (banana, 3)]
- Inverted index (index → (doc, position) set)
  - "a"      = {(D2, 2)}
  - "banana" = {(D2, 3)}
  - "is"     = {(D0, 1), (D0, 4), (D1, 1), (D2, 1)}
  - "it"     = {(D0, 0), (D0, 3), (D1, 2), (D2, 0)}
  - "what"   = {(D0, 2), (D1, 0)}
- Search for "what is it" (in this particular sequence)
  - {D0, D1} ∩ {D0, D1, D2} ∩ {D0, D1, D2} = {D0, D1}
  - After applying the index sequence = {D1}

Document Set
  D0 = "It is what it is."
  D1 = "What is it?"
  D2 = "It is a banana."
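The two-stage phrase search on this slide — intersect the posting sets, then check that positions follow the query sequence — can be sketched as follows. This is one straightforward way to implement it, assuming adjacent query terms must sit at consecutive positions p, p+1, p+2, …

```python
def phrase_search(positional, terms):
    """Phrase query on a word-level (positional) inverted index:
    first intersect the documents containing every term, then keep
    only documents where the terms occur at consecutive positions."""
    if not terms:
        return set()
    docs = set.intersection(
        *({d for d, _ in positional.get(t, set())} for t in terms))
    hits = set()
    for d in docs:
        # candidate start positions: where the first query term occurs
        starts = {p for dd, p in positional[terms[0]] if dd == d}
        if any(all((d, p + i) in positional[t] for i, t in enumerate(terms))
               for p in starts):
            hits.add(d)
    return hits

positional = {
    "a": {("D2", 2)}, "banana": {("D2", 3)},
    "is": {("D0", 1), ("D0", 4), ("D1", 1), ("D2", 1)},
    "it": {("D0", 0), ("D0", 3), ("D1", 2), ("D2", 0)},
    "what": {("D0", 2), ("D1", 0)},
}
print(phrase_search(positional, ["what", "is", "it"]))   # → {'D1'}
```

D0 survives the intersection but fails the position check ("what" at 2 is not followed by "is" at 3), reproducing the slide's result.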

Measuring Term Frequency

- Term count in a given document dj
  - Number of times a given term ti appears in that document
  - nij = number of occurrences of term ti in document dj
- Term frequency
  - A measure of the importance of the term ti within the particular document dj
  - A normalized term count, to prevent a bias towards longer documents possibly having a higher term count regardless of the actual importance of that term in the document
  - tfij = nij / Σk nkj
    - Σk nkj = sum of the numbers of occurrences of all k terms in document dj
- Specificity
  - The higher tf is, the more important or specific the term is in dj

Measuring Document Frequency

- Document count for a term ti
  - |Di| = number of documents where term ti appears; i.e., tfij ≠ 0
- Document frequency
  - A measure of the importance of the term ti within the entire document set (also called the corpus)
  - A normalized document count, to prevent a numeric bias relative to term frequency in a large document set D
  - dfi = |Di| / |D|
- Generality
  - The higher df is, the more general, or less important, the term is in D
  - When a term ti appears in almost all documents in a corpus, ti does not have much distinguishing power
- Inverse document frequency, idf
  - The importance of a term is proportional to the inverse of df
  - A logarithm of the inverse df is used to scale down differences in order of magnitude
  - idfi = log(|D| / |Di|)

Measuring Term Importance

- Combining the term frequency & document frequency of a given term to measure its importance
  - A term with high tf → the term is more important because it is specific
  - A term with high df (or low idf) → the term is less important because it is too general
- Thus, tfidfij = tfij x idfi provides a combined weight to evaluate how important a word is to a document in a corpus
  - → term frequency-inverse document frequency, tfidf
- The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus
- Forming one of the simplest ranking functions by summing the tfidf for each query term
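The tf and idf definitions above combine into a few lines of code. A minimal sketch, assuming documents are already tokenized term lists and using log base 10 to match Example 1 below (log(10,000,000/1,000) = 4):

```python
import math

def tfidf(term, doc, corpus):
    """tfidf = tf x idf, where tf = n_ij / Σ_k n_kj (normalized term
    count) and idf = log(|D| / |D_i|), with |D_i| the number of
    documents containing the term."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log10(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["it", "is", "what", "it", "is"],
          ["what", "is", "it"],
          ["it", "is", "a", "banana"]]
print(round(tfidf("banana", corpus[2], corpus), 3))   # → 0.119
```

Note that "is" scores 0 in every document here: it appears in all three documents, so idf = log(3/3) = 0, illustrating that a term present everywhere has no distinguishing power.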

tfidf Intuition

- Given a set of English text documents, we want to find which document is the most relevant to the query "the brown cow."
  - Starting with eliminating documents not containing all three query words
- To further distinguish the rest, we might
  - Count the number of times each query term occurs in each document → term frequency
  - Sum them all together
- However, the query term "the" is too common; its term frequency dominates those of the other two more meaningful terms "brown" & "cow."
  - That is, "the" (a common term) is not a good keyword to distinguish relevant & non-relevant documents and terms
  - Contrarily, the rarely occurring terms "brown" & "cow" are good keywords to distinguish relevant documents from non-relevant ones
- Thus, the inverse document frequency (or 1 / document frequency) factor
  - Diminishes the weight of terms that occur very frequently in the document collection &
  - Increases the weight of terms that occur rarely

tfidf Example 1

- A given document contains 100 words, wherein the word "cow" appears 3 times
  - tf for "cow" = nij / Σk nkj = 3/100 = 0.03
- Assume that this corpus has 10 million documents & "cow" appears in one thousand of these
  - idf for "cow" = log(|D| / |Di|) = log(10,000,000/1,000) = 4
    - where |D| is the number of documents in the document set
    - |Di| is the number of documents having word i
- Therefore, tfidf for "cow" = 0.03 x 4 = 0.12
- Just for comparison,
  - If |Di| = 10, idf = log(10,000,000/10) = 6; tfidf = 0.03 x 6 = 0.18
  - If |Di| = 1M, idf = log(10M/1M) = 1; tfidf = 0.03 x 1 = 0.03

tfidf Example 2

- For simplicity, without deviating from the original concepts,
  - Number of documents in the corpus |D| = 1,000
  - Number of terms in a document = 100
  - tfij = nij, instead of tfij = nij / Σk nkj (i.e., no normalization)
  - dfi = |Di|, instead of idfi = log(|D| / |Di|) (i.e., raw document count, no logarithm)
  - tfidfij = tfij / dfi
- Given: two of the documents
  - D0 = (…, "the [tf=20]", "chinese [tf=10]", "university [tf=20]", …)
  - D1 = (…, "the [tf=40]", "chinese [tf=1]", "university [tf=1]", …)
- Given: document frequencies
  - "the", df = 900
  - "chinese", df = 10
  - "university", df = 50
- tfidf for terms in D0:
  - "the": tfidf = 20/900 = 0.0222
  - "chinese": tfidf = 10/10 = 1
  - "university": tfidf = 20/50 = 0.4
  - Σ tfidf = 1.4222
- tfidf for terms in D1:
  - "the": tfidf = 40/900 = 0.0444
  - "chinese": tfidf = 1/10 = 0.1
  - "university": tfidf = 1/50 = 0.02
  - Σ tfidf = 0.1644
- → D0 is ranked higher based on the tfidf criteria

Link Analysis

- A subset of network analysis, exploring associations between objects such as people or web pages
- Providing the crucial relationships and associations between very many objects of different types that are not apparent from isolated pieces of information
- Computer-assisted or fully automatic computer-based link analysis is increasingly employed by many domains
  - Banks & insurance agencies in fraud detection
  - Medical sector in epidemiology & pharmacology
  - Law enforcement investigations, and many more
- Web link analysis
  - Using link-based centrality metrics, e.g., Google's PageRank
  - To understand & extract information from the structure of collections of web pages
  - E.g., interlinking between politicians' web sites or blogs

PageRank: Link Analysis

PageRank

- A link analysis algorithm developed by Larry Page at Stanford University in 1998 & used by the Google Internet search engine to rank resulting web pages
- PageRank assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set
- The algorithm can be applied to any collection of entities with reciprocal quotations and references (citations)
- The numerical weight that it assigns to any given element E is referred to as the PageRank of E and denoted by PR(E)

PageRank Concept of Importance

- In general, highly linked pages are more "important" than pages with few links
- Adding in the common-sense notion of importance:
  - A single link from a very important page should also be important
  - It should be ranked higher than many other pages with more links but from obscure places
- Intuitive description of PageRank
  - A page has high rank if the sum of the ranks of its backlinks is high
  - This covers both the case
    - When a page has many backlinks &
    - When a page has a few highly ranked backlinks

PageRank Concept: Described by Google

- Quantity of incoming links
  - "PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms."
  - "Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results."
- Quality of incoming links
  - "PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value."
  - "We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page's importance."

Graph-based Mathematical Algorithm

- A PageRank results from a mathematical algorithm based on the graph created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs like Wikipedia
- The rank value indicates the importance of a particular page
  - A hyperlink to a page counts as a vote of support
- The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links")
  - A page that is linked to by many pages with high PageRank receives a high rank itself
  - If there are no links to a web page, there is no support for that page

PageRanks for a Simple Network

- [Figure] Page C has a higher PageRank than Page E
  - Even though Page C has fewer links to it
  - The link Page C has is of a much higher value

PageRank Algorithm

- PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page
- PageRank can be calculated for collections of documents of any size
- Computational process:
  - Several research papers assume that, at the beginning of the computational process, the distribution is evenly divided among all documents in the collection
  - The PageRank computation requires several passes, called "iterations", through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value
- A probability is expressed as a numeric value between 0 & 1
  - A PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank

Simplified PageRank Algorithm for Illustration

- Assume a small universe of four web pages: A, B, C, D
- The initial approximation of PageRank would be evenly divided between these four documents
  - Hence, each document would begin with an estimated PageRank of 0.25
- If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A
- All PageRank PR( ) in this simplistic system would thus gather to A, because all links would be pointing to A
  - PR(A) = PR(B) + PR(C) + PR(D) = 0.75

[Figure] B, C, and D each link to A, each contributing 0.25.

Illustration of Simplified Algorithm

- Suppose that
  - Page B has a link to page C as well as to page A
  - Page D has links to all three pages
- The value of the link-votes is divided among all the outbound links on a page
  - Thus, page B gives a vote worth 0.125 (= 0.25/2) to page A & a vote worth 0.125 to page C
  - Only one third of D's PageRank is counted for A's PageRank (approximately 0.083 [= 0.25/3])
- PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
       = 0.25/2 + 0.25/1 + 0.25/3
       = 0.125 + 0.25 + 0.083
       = 0.458

[Figure] Contributions to A: 0.125 from B, 0.25 from C, 0.083 from D.

Simplified Algorithm

- In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the normalized number of outbound links L( )
  - It is assumed that links to specific URLs only count once per document
- PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)
- In the general case, the PageRank value for any page u can be expressed as
  - PR(u) = Σ v∈Bu ( PR(v) / L(v) )
  - The PageRank value for a page u is dependent on the PageRank values for each page v out of the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v
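One update pass of the simplified rule PR(u) = Σ_{v∈Bu} PR(v)/L(v) can be sketched directly, reproducing the worked example above (the four-page link structure is the one from the illustration):

```python
def pr_one_step(links, pr):
    """One pass of the simplified rule: each page passes its current
    rank, split evenly over its outbound links, to the pages it
    links to."""
    new_pr = {u: 0.0 for u in pr}
    for v, outs in links.items():
        for u in outs:
            new_pr[u] += pr[v] / len(outs)   # PR(v) / L(v)
    return new_pr

# B links to A and C; C links to A; D links to A, B, and C.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {p: 0.25 for p in "ABCD"}
step = pr_one_step(links, pr)
print(round(step["A"], 3))   # → 0.458, matching PR(A) on the slide
```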

Damping Factor in PageRank Algorithm

- The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking
- The probability, at any step, that the person will continue is a damping factor d
  - Various studies have tested different damping factors, but it is generally assumed that the damping factor will be set around 0.85
  - i.e., an 85% chance of continuing to click, or a 15% chance of stopping
- The damping factor is subtracted from 1
  - And this term is then added to the product of the damping factor and the sum of the incoming PageRank scores
  - → So any page's PageRank is derived in large part from the PageRanks of other pages
  - The damping factor adjusts the derived value downward

PageRank Equation

  PR(pi) = (1 - d)/N + d Σ pj∈M(pi) ( PR(pj) / L(pj) )

- Where
  - p1, p2, ..., pN are the pages under consideration,
  - M(pi) is the set of pages that link to pi,
  - L(pj) is the number of outbound links on page pj,
  - N is the total number of pages, and
  - d is the damping factor, set to 0.85
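The damped equation can be evaluated iteratively, starting from the even distribution 1/N and repeating passes until the values settle. A minimal sketch; the handling of dangling pages (pages with no outbound links, treated here as linking to every page so the rank mass is not lost) is a common convention the slides do not cover, not part of the lecture's definition:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterative evaluation of
    PR(p) = (1-d)/N + d * Σ_{q∈M(p)} PR(q)/L(q),
    starting from the even distribution 1/N."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # rank mass held by dangling pages, redistributed evenly
        dangling = sum(pr[p] for p in pages if not links[p])
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * (incoming + dangling / n)
        pr = new
    return pr

links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(links)
print(round(sum(ranks.values()), 3))   # → 1.0, a probability distribution
```

As expected from the link structure, A (the most linked-to page) ends up with the highest rank and D (with no incoming links) the lowest.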

Search Engine

- A search engine is an information retrieval system designed to help find information stored on a computer system
- The search results are usually presented in a list and are commonly called hits
- Search engines help to minimize the time required to find information and the amount of information which must be consulted, akin to other techniques for managing information overload

Search Engine

Web Search Engine

- Designed to search for information on the World Wide Web and FTP servers
- The information may consist of web pages, images, information and other types of files
- Some search engines also mine data available in databases or open directories

[Figure] The three most widely used web search engines and their approximate share as of late 2010

Web Search Query

- A query that a user enters into a web search engine to satisfy his or her information needs
- Web search queries are distinctive in that they are unstructured and often ambiguous
  - They vary greatly from standard query languages, which are governed by strict syntax rules
- Web query characteristics, from an analysis of the Excite search engine in 2001
  - Average query length: 2.4 terms
  - About 50% of users entered a single query, while a little less than 1/3 entered three or more unique queries
  - Close to 50% examined only the first one or two pages of results (10 results per page)
  - Less than 5% of users used advanced search features (e.g., Boolean operators like AND, OR, and NOT)

Vector Space Model

- An algebraic model for representing text documents (and any objects, in general) as
  - Vectors of identifiers, such as index terms
- A model supporting
  - Information retrieval, indexing, and relevancy rankings
- Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory
  - By comparing the deviation of angles (similarities) between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents
  - Relevance denotes how well a retrieved document or set of documents meets the information need of the user

Vector Space Model

Vector Space Model Definition

- Documents and queries are represented as vectors
  - dj = (w1j, w2j, ..., wtj), document j with term weights w1, ..., wt in it
  - q = (w1q, w2q, ..., wtq), query with term weights w1, ..., wt in it
- Each dimension corresponds to a separate term
  - If a term occurs in the document, its value in the vector is non-zero
- Several different ways of computing these term weight values have been developed
  - One of the best known schemes is tfidf weighting
  - That is, w1j = tfidf1j

Cosine Similarity Function

- Calculating the cosine of the angle between two vectors
  - Query & document vectors
    - For searching: how close (similar) a document is to a query
  - Document & document vectors
    - For clustering: how close (similar) two documents are
- Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product & magnitudes as

    cos(θ) = (A · B) / (‖A‖ ‖B‖)

- The attribute vectors A and B are usually their tfidf vectors
- The cosine similarity of two documents will range from 0 to 1, since the tfidf weights cannot be negative
  - 0 means nothing in common, whereas 1 means the same
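The formula above translates directly into code. A minimal sketch operating on two equal-length weight vectors (e.g., tfidf vectors of a query and a document):

```python
import math

def cosine_similarity(a, b):
    """cos(θ) = (A·B) / (‖A‖ ‖B‖) for two equal-length weight
    vectors; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(cosine_similarity([1, 0, 1], [1, 0, 1]))   # → 1.0 (identical direction)
print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # → 0.0 (nothing in common)
```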

IR Performance Measures

- Many different measures for evaluating the performance of information retrieval systems have been proposed
  - The measures require a collection of documents and a query
- All common measures described here assume a ground-truth notion of relevancy:
  - Every document is known to be either relevant or non-relevant to a particular query
  - In practice, queries may be ill-posed and there may be different shades of relevancy
- Precision and recall are two widely used metrics for evaluating IR performance

Precision & Recall

- When using precision and recall, the set of possible labels for a given instance is divided into two subsets
  - One of which is considered "relevant" for the purposes of the metric
- Recall is computed as the fraction of correct instances among all instances that actually belong to the relevant subset
- Precision is the fraction of correct instances among those that the algorithm believes to belong to the relevant subset
- Precision can be seen as a measure of exactness or fidelity, whereas recall is a measure of completeness

Graphic Illustration of Precision & Recall

- [Figure: oval of the search outcome (result) overlapping the relevant documents (left) and the non-relevant documents (right)]
- Recall and precision depend on the outcome (oval) of a query and its relation to all relevant documents (left) and the non-relevant documents (right)
- The more correct results (green), the better
- Precision: horizontal arrow
- Recall: diagonal arrow

Precision & Recall Definition

- Basis for evaluation
  - A set of retrieved documents
    - The list of documents produced by a web search engine for a query
  - A set of relevant documents
    - The list of all documents on the Internet that are relevant for a certain topic
- Precision: the fraction of retrieved documents that are relevant to the search
  - Precision = |{relevant} ∩ {retrieved}| / |{retrieved}|
- Recall: the fraction of the documents relevant to the query that are successfully retrieved
  - Recall = |{relevant} ∩ {retrieved}| / |{relevant}|
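Both definitions reduce to counting the overlap between the two sets. A minimal sketch; the document ids in the example are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Precision = |relevant ∩ retrieved| / |retrieved|;
    Recall    = |relevant ∩ retrieved| / |relevant|."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query outcome: 4 documents returned, 5 truly relevant,
# 2 of the returned documents are among the relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d5", "d6", "d7"})
print(p, r)   # → 0.5 0.4
```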

Multimedia Search Result Presentation

- Tired of reading a result list from a web search & then reading some more pages one by one?
- Would it be better if you could consume or experience the search result in another way?
- How about weaving together multiple data sources in near real-time to create an information consumption experience like interactive video presentations on reference topics?
- You may want to test drive this:
  - http://www.qwiki.com/ (pre-Feb 2013)

Multimedia Presentation & Search

Multimedia Search: Informedia Digital Video Library

- http://www.informedia.cs.cmu.edu/dli2/index.html
- More illustrative videos at http://www.informedia.cs.cmu.edu/demos/index.html
- Content-based indexing over video, image, audio, speech, text
- Processing: speech recognition, face detection, machine learning, named-entity extraction, co-occurrence analysis, etc.

Summary

- Web searching is virtually built upon client-server architecture
- Different backend services like crawlers, indexers, & analyzers are working hard on gathering information on the Internet
- This time-consuming and resource-intensive process is to ensure that we can search for what we look for in no time
- Now, we have grown beyond the stage of getting the needed search result quickly
  - We pretty much trust that web searches can bring us relevant information
  - What's left is that we want to consume the result in a better way