MidtermReview


Dec 4, 2013


Internet Systems

Review

Generally Speaking



Understand the essence of the
papers/systems we’ve studied.


Understand taxonomies/criteria for
comparison.


Terminology


Closed books/notes

Papers


Kleinberg


Google


Ferguson, Google vs. Microsoft


Cho


Rich get Richer


Pitkow


Lieberman


Nelson


Berners-Lee


Systems


Google


HITS


Outride


Direct Hit


Letizia


Powerscout


Watson


Margin Notes


Xanadu


Webtop/Open Search




Search Evaluation



Precision and Recall


Relevance


consensus relevance


author relevance


topic-specific relevance



Evaluations provided in papers


Google


HITS


Cho


Outride


TREC


Text Retrieval Conference


Standard testbeds for search evaluation


Precision and Recall

What is your precision and recall if:



You have a repository of a million documents, and you need to
find out about government subsidies for llama farming.



Of those million documents, twenty are relevant to your needs.


You do a search and the first page of your result list contains
sixteen documents.



Of those sixteen, ten are among those relevant to llama
subsidies.


Precision and Recall, Answer


Recall is 10/20, or 50%.



Precision is 10/16, or 62.5%.
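
The arithmetic behind this answer can be checked with a minimal sketch. The function name and parameter names are illustrative only; the numbers come from the llama-subsidy question.

```python
def precision_recall(retrieved: int, relevant_retrieved: int, relevant_total: int) -> tuple[float, float]:
    """Return (precision, recall) for one result list."""
    # Precision: what fraction of the returned documents are relevant.
    precision = relevant_retrieved / retrieved
    # Recall: what fraction of all relevant documents were returned.
    recall = relevant_retrieved / relevant_total
    return precision, recall


# 16 documents on the first page, 10 of them relevant, 20 relevant overall.
precision, recall = precision_recall(retrieved=16, relevant_retrieved=10, relevant_total=20)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# prints: precision = 62.5%, recall = 50.0%
```

Note that the size of the repository (a million documents) does not enter either measure.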

Hubs and Authorities


Hub: A page that points to many authorities.


Authority: A page that is pointed to by
many hubs.


What current system uses this concept for “subject-specific” ranking?


HITS


Get initial result list using traditional IR


Add ins/outs to set


Run iterative algorithm, computing hub
and authority score for each page on
each iteration.
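
The iterative step can be sketched as follows. This is a minimal illustration, not the course's reference solution: the three-page graph is made up, every score starts at 1, nothing is normalized, and each iteration is assumed to update authority scores first and then compute hub scores from the new authority scores. Some presentations update both from the previous iteration's values, which gives different numbers.

```python
def hits(links: dict[str, list[str]], iterations: int = 2) -> tuple[dict[str, int], dict[str, int]]:
    """Unnormalized HITS. links maps each page to the pages it points at."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1 for p in pages}
    auth = {p: 1 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing at p.
        new_auth = {p: 0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        auth = new_auth
        # Hub score: sum of the (new) authority scores of the pages p points at.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
    return hub, auth


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
hub, auth = hits(example_links, iterations=2)
print(sorted(hub.items()))   # prints: [('a', 8), ('b', 5), ('c', 1)]
print(sorted(auth.items()))  # prints: [('a', 1), ('b', 3), ('c', 5)]
```

Without normalization the scores grow on every iteration; only their relative order matters.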

HITS


Hubs and Authorities

Consider the following link graph table. An x in the row labeled d1 means d1 points at that page, e.g., d1 points at d2 and d4.

Suppose that after the initial text-based search and after adding ins and outs, we were left with the seven documents in the table.

Compute the Hub and Authority score of the seven documents, given an initial score of 1 for each. You need not normalize any scores and you need run through only two iterations.

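[Link graph table: rows d1 to d7, columns d1 to d7. Row d1 has an x under d2 and d4. The remaining rows carry 3 (d2), 1 (d3), 2 (d4), 2 (d5), 1 (d6), and 2 (d7) x marks; their column positions did not survive.]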

HITS vs. Page Rank

How could the concept of
hubs/authorities improve on page rank?




Important in general vs.

Authority for a specific topic

Generally important

Authority for topic A

Hubs for topic A

What are disadvantages of
HITS relative to Page Rank



Potential Topic Drift


TF not counted in ranking


But only documents containing the query terms are used.


Run-Time Delay

Page Rank

PR(p) = (1 - d) + d × (PR(in1)/outDegree(in1) + PR(in2)/outDegree(in2) + …)

where p is the page for which you are computing page rank, d is a damping factor, and in_i is the ith page pointing at page p.



Explain the heuristics on which this formula is
based.

Heuristics in Page Rank


Popular page is one pointed to by lots of
popular pages.



If a page links to a bunch of other pages
including p, p gets less credit


random surfer model basis


See http://www.iprcom.com/papers/pagerank/index.html for more info on how page rank works.
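
The Page Rank formula above can be sketched as a simple iteration. This is an illustration under stated assumptions, not Google's implementation: the three-page graph is made up, d = 0.85 and 50 iterations are arbitrary choices, every page starts at 1.0, and every page is assumed to have at least one outgoing link.

```python
def page_rank(links: dict[str, list[str]], d: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Iterate PR(p) = (1 - d) + d * sum(PR(in_i) / outDegree(in_i))."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Pages that point at p.
            incoming = [q for q, targets in links.items() if p in targets]
            # Each in-link passes on its rank divided by its out-degree, so a
            # page that links to many others gives each of them less credit.
            new_pr[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
        pr = new_pr
    return pr


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
ranks = page_rank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example c ranks highest because both other pages point at it, and a outranks b because a's only in-link comes from the highly ranked c and is not shared with any other target.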

Easy Question

With the Random Surfer model, is the user randomly visiting pages?



Inverted Index

word: hit, hit, hit

word2: hit, hit, hit, hit

….

Each hit: plain/fancy, docid, position in document

If two keywords input to a search, how are results
computed?
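
One plausible answer is that the docid lists for the two words are intersected, and the surviving documents are then ranked. The sketch below shows only the intersection step; the index contents and the function name are hypothetical, and ranking (word proximity, Page Rank) is left out.

```python
def search_all_terms(index: dict[str, set[str]], *words: str) -> set[str]:
    """Return the docids that contain every query word."""
    # A word missing from the index has an empty posting set.
    postings = [index.get(word, set()) for word in words]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical index: word -> set of docids.
index = {
    "hello": {"doc1"},
    "world": {"doc1", "doc2"},
    "big": {"doc2"},
}
print(search_all_terms(index, "big", "world"))  # prints: {'doc2'}
```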

Anchors



Google associates the text in an anchor with both the page that contains it and the page it points to.


Reason 1: Anchors often provide more
accurate descriptions of pointed to
page.


Reason 2: Anchors provide text for
images, programs, etc.

Building an inverted index

Suppose the following two documents were crawled by a search engine that built an inverted index similar to Google's. Show the inverted index that would be built.



www.nothing.edu/doc1.htm

hello world <a href="http://www.nothing.edu/doc2.htm"> Nothing </a>


www.nothing.edu/doc2.htm

big bad world

Sample inverted index

hello: doc1

world: doc1, doc2

Nothing: doc1, doc2

big: doc2

bad: doc2
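
The sample index can be reproduced with a minimal sketch. It is a simplification under stated assumptions: hits are reduced to bare docids (no plain/fancy flag, no positions), anchors are found with a simple regular expression rather than an HTML parser, and both pages are keyed by full http:// URLs so that the href target matches the key of doc2.

```python
import re
from collections import defaultdict

# Simplified patterns; a real crawler would use an HTML parser.
ANCHOR = re.compile(r'<a\s+href="([^"]+)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of URLs it is associated with."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, html in pages.items():
        # Anchor text is credited to the page the link points to.
        for target, anchor_text in ANCHOR.findall(html):
            for word in anchor_text.split():
                index[word].add(target)
        # Every visible word, anchor text included, is credited to its own page.
        for word in TAG.sub(" ", html).split():
            index[word].add(url)
    return dict(index)


DOC1 = "http://www.nothing.edu/doc1.htm"
DOC2 = "http://www.nothing.edu/doc2.htm"
pages = {
    DOC1: 'hello world <a href="http://www.nothing.edu/doc2.htm"> Nothing </a>',
    DOC2: "big bad world",
}
index = build_index(pages)
for word, urls in index.items():
    print(word, sorted(urls))
```

"Nothing" ends up listed under both documents even though doc2 never contains the word, which is the anchor-text effect described earlier.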

Pages without keywords

Describe how Page Rank and HITS allow pages that don’t contain keywords to be discovered as results.


Does this help recall or precision? Both?

What else is it helpful for?

Cho: The Rich get Richer


Search-dominant model


Users rarely look at any but the top results


New, quality pages have difficulty
breaking in.


When popularity does increase, it's quite sudden.

Personalization and Contextual
Computing


Outride


Letizia


Powerscout


Watson


Margin Notes


Google




What contextual information is used?



How is it applied?



Transparency



Obtrusiveness



Privacy



What contextual information is
used?


User Profile(s)


data explicitly input by user


browsing history


usage statistics


click popularity, stickiness


bookmarks


documents


Currently Open Documents


Collaborative filtering

How is context applied


Query Augmentation


and automated query creation (automated
information queries often using TFIDF)


Result Processing


Limiting the Search Space


Notifying user of previous searches


Eurekster
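
Automated query creation with TFIDF, mentioned above, can be sketched as picking the highest-scoring terms of the document the user currently has open. Everything here is an assumption made for illustration: the tiny corpus, whitespace tokenization, the smoothed IDF variant log((1 + N) / (1 + df)) + 1, and the function name. None of it is taken from a particular system.

```python
import math
from collections import Counter


def top_terms(document: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k terms of document with the highest TF-IDF score."""
    tf = Counter(document.lower().split())
    n = len(corpus)
    corpus_tokens = [set(doc.lower().split()) for doc in corpus]

    def idf(term: str) -> float:
        # Number of corpus documents that contain the term.
        df = sum(1 for tokens in corpus_tokens if term in tokens)
        return math.log((1 + n) / (1 + df)) + 1

    scores = {term: count * idf(term) for term, count in tf.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical background corpus and currently open document.
corpus = ["the web is big", "the llama is big", "the farm is for sale"]
document = "the llama farm subsidies for the llama"
print(top_terms(document, corpus, k=2))  # prints: ['llama', 'subsidies']
```

The returned terms could be submitted as a query on their own or appended to the user's query as augmentation.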


Limiting Search Space


Domain-specific libraries


explicit user choice (webtop)


automated two-phase (webtop++)



Neighborhood of current page (Letizia)



Seen/Haven’t seen (Outride)

Contextual Computing Issues


Identifying context switching, changing interests


Task model


Multiple profiles


Transparency


Does the user know what the system is doing?


User-Agent collaboration (e.g., Google Personal)


Obtrusiveness


Especially for automated information queries, but also
consider complexity of search.


Efficiency (Pitkow stressed this)


Privacy


Metasearch


API based as opposed to Scraping


Exploits advantage of subsets of web


Role a Standard API could play


dynamic list of information sources


Independence of sources/metasearch


Search in the World



Index Everything


phone conversations, email, pdf data


Hidden web


The Role of APIs


Separating presentation and data.


Economic benefit?


Standards


Search Results


Clustering


Tree/Graph view


see TouchGraph

Personal Information
Management


Associative Trails (Bush)


Entity Associations NOT made by author
and NOT embedded in either entity


del.icio.us is shared bookmarks (King)


bookmark = url -- assoc -- comment


Semantic web generalizes (Berners-Lee)


thing -- assoc -- thing

Personal Information
Management


“Document” wrong granularity


Blogs sending us this way


Document as a list of content pointers (Nelson)


Versioning and Permanence


global address space (Nelson, Berners-Lee, Archive)


Deep 2-way links


Can get to the full context of content


Structured over unstructured data