ITIS 1210 Introduction to Web-Based Information Systems


Dec 4, 2013



Internet Research Two

How Search Engines Rank Pages &

Constructing Complex Searches

How do Search Engines Crawl?


Gathering data from the Web is like
browsing:

1. Visit a page.

2. Record all the words on the page.

3. Choose a link you haven’t seen/recorded.

4. Click on the link.


Repeat 8 billion times.
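The loop above can be sketched in Python. This is only a minimal sketch: the `TOY_WEB` dictionary, its page names, and the `crawl` function are made up for illustration, and nothing is fetched over a real network.

```python
from collections import deque

# Hypothetical toy "web": page name -> (text on the page, links on the page).
TOY_WEB = {
    "home":    ("welcome to the home page", ["about", "news"]),
    "about":   ("about this site", ["home"]),
    "news":    ("news about the web", ["about", "archive"]),
    "archive": ("old news", []),
}

def crawl(start):
    """Visit every page reachable from `start`, recording the words on each."""
    index = {}                 # page -> set of words recorded on that page
    seen = {start}             # pages already visited or queued
    frontier = deque([start])  # pages queued but not yet visited
    while frontier:
        page = frontier.popleft()           # 1. visit a page
        text, links = TOY_WEB[page]
        index[page] = set(text.split())     # 2. record all the words
        for link in links:                  # 3. pick links not seen before
            if link not in seen:
                seen.add(link)
                frontier.append(link)       # 4. "click" on them later
    return index

index = crawl("home")
print(sorted(index))
```

The `seen` set and the `frontier` queue are one simple way to keep track of where the crawler has been and where it still has to go.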


Crawling the Web


One person with a Web browser, following
one link per second.


How long does it take to browse the
surface Web (8 billion pages)?

8 billion seconds = 133 million minutes

= 2.2 million hours

= 93 thousand days

= about 254 years
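The unit conversions can be checked with a few lines of Python. This is a rough sketch that assumes one page per second and a 365-day year; the variable names are illustrative.

```python
PAGES = 8_000_000_000      # surface Web size used in the slides
seconds = PAGES            # one link followed per second
minutes = seconds / 60
hours = minutes / 60
days = hours / 24
years = days / 365         # assumes 365-day years

print(f"{minutes:,.0f} minutes, {hours:,.0f} hours, "
      f"{days:,.0f} days, {years:.0f} years")
```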


Crawling the Web


How many people would it take to crawl
the surface Web in a week? If each person
follows one link per second (with no
sleep):



One week = about six hundred thousand seconds (604,800)

Eight billion / six hundred thousand = about thirteen thousand people
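The same estimate in Python, as a small sketch using the exact number of seconds in a week (the names are illustrative):

```python
import math

PAGES = 8_000_000_000
SECONDS_PER_WEEK = 7 * 24 * 60 * 60     # 604,800 seconds

# Each person follows one link per second, so one person covers
# SECONDS_PER_WEEK pages in a week; round up to whole people.
people = math.ceil(PAGES / SECONDS_PER_WEEK)
print(people)
```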


Challenges:


Remembering where you’ve been


Remembering where you haven’t been


Storing all the data

A (small) Server Farm

The Deep Web


Not all pages get crawled:


Private pages on Intranets (company
networks)


Pages that people don’t want crawled


Dynamic content pages (from databases)


Dynamic content pages make the size of the
Web effectively infinite!

Dynamic Content Example


zillow.com


Won’t be indexed

Identifying High Quality Web Pages


Google has ranked billions of Web pages
by "quality".



You enter your search terms:


UNC Charlotte HCI



Google finds the highest quality page
associated with these search terms.


Google Pagerank

Pretend you're surfing the Web randomly.


To move from page to page you could:


1) type in an address (www.sis.uncc.edu)

this includes using a bookmark


OR


2) follow a link.


Pagerank measures how likely you are to reach a
particular page through random surfing (either 1 or 2).

The main idea is that links to your page from important
web pages indicate that your page is important.


Computing Pagerank

(what’s the probability of getting to this page?)


Q = Web page

A, B, C, ... = Pages pointing to Q

L(A), L(B), L(C), ... = number of links on each page

Pagerank of Q:

R(Q) = (1 - d) + d * (R(A)/L(A) + R(B)/L(B) + ...)


d represents the relative chance of following a link to page Q,
and 1 - d represents the relative chance of going directly to
page Q (via typing in the address or using a bookmark):


In this example we use: d = 0.9 and (1 - d) = 0.1
(the original PageRank paper suggests d = 0.85)
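The formula can be written as a short Python function. This is a sketch, not Google's code: the function name `pagerank_of` and the example numbers are made up, and d = 0.9 is the value used in these slides.

```python
def pagerank_of(inbound, d=0.9):
    """R(Q) = (1 - d) + d * (R(A)/L(A) + R(B)/L(B) + ...).

    `inbound` is a list of (R(P), L(P)) pairs, one for each page P
    that points to Q.
    """
    return (1 - d) + d * sum(rank / links for rank, links in inbound)

# Hypothetical example: Q is linked from A (rank 1.0, 2 links on A)
# and from B (rank 1.0, 1 link on B).
print(pagerank_of([(1.0, 2), (1.0, 1)]))
```

A page with no inbound links gets only the (1 - d) part.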

Computing Pagerank

Pretend the Web has only four pages:


W X Y Z


Links:

W → X

Y → W

Y → Z

Z → W

L(W) = 1    L(X) = 0    L(Y) = 2    L(Z) = 1


Which page has the highest “quality”?

Computing Pagerank

Links: W → X, Y → W, Y → Z, Z → W

L(W) = 1    L(X) = 0    L(Y) = 2    L(Z) = 1

R(W) = (1 - d) + d * (R(Y)/L(Y) + R(Z)/L(Z))
     = 0.1 + 0.9 * (R(Y)/2 + R(Z)/1)

R(X) = 0.1 + 0.9 * R(W)

R(Y) = 0.1

R(Z) = 0.1 + 0.9 * (R(Y)/2)

Now, solve for: R(W), R(X), R(Y), R(Z)

Computing Values for R(W), R(X), R(Y) and R(Z)

We could use algebra to find the values, in the
same way we could solve for x and y in:

x = 1 + 2x + y

y = 2 + x + 3y


Algebraic Solution

w = R(W)    x = R(X)    y = R(Y)    z = R(Z)

w = 0.1 + 0.45y + 0.9z

x = 0.1 + 0.9w

y = 0.1

z = 0.1 + 0.45y

Substituting in order:

y = 0.1

z = 0.1 + 0.45 * 0.1 = 0.145

w = 0.1 + 0.45 * 0.1 + 0.9 * 0.145 = 0.2755

x = 0.1 + 0.9 * 0.2755 = 0.34795
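The substitution can be checked in Python. This sketch works only because the four-page example has no cycles, so each equation depends only on values that are already known; the variable names follow the slide.

```python
d = 0.9

y = 1 - d                            # no pages link to Y
z = (1 - d) + d * (y / 2)            # Y links to Z, and L(Y) = 2
w = (1 - d) + d * (y / 2 + z / 1)    # Y and Z link to W
x = (1 - d) + d * (w / 1)            # W links to X, and L(W) = 1

print(round(w, 5), round(x, 5), round(y, 5), round(z, 5))
```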

But solving for eight billion variables is hard.
Instead, we'll use fixed-point iteration.


Solution by Fixed-Point Iteration

Start with initial estimates of Pagerank for each page:

R(W) = 1.0    R(X) = 1.0    R(Y) = 1.0    R(Z) = 1.0

Apply equations to compute new estimates:

new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z))
         = 0.1 + 0.9 * (1.0/2 + 1.0)
         = 1.45

new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.0 = 1.0

new R(Y) = 0.1

new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (1.0/2) = 0.55

Solution by Fixed-Point Iteration

Start with updated estimates:

R(W) = 1.45    R(X) = 1.0    R(Y) = 0.1    R(Z) = 0.55

Apply equations to compute new estimates:

new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z))
         = 0.1 + 0.9 * (0.1/2 + 0.55)
         = 0.64

new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.45 = 1.405

new R(Y) = 0.1

new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (0.1/2) = 0.145

Solution by Iteration

iteration   R(W)      R(X)      R(Y)      R(Z)

0           1.00000   1.00000   1.00000   1.00000
1           1.45000   1.00000   0.10000   0.55000
2           0.64000   1.40500   0.10000   0.14500
3           0.27550   0.67600   0.10000   0.14500
4           0.27550   0.34795   0.10000   0.14500
5           0.27550   0.34795   0.10000   0.14500
...         ...       ...       ...       ...

Compute new estimates from the old until the estimates stop changing.
Note that this is the same answer as the traditional algebraic
approach, but this way scales better.
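The iteration can be sketched in Python for the four-page example. The names `LINKS`, `step`, and `iterate`, and the stopping tolerance, are assumptions made for this illustration. A real search engine works at a vastly larger scale.

```python
D = 0.9  # chance of following a link, as in the slides

# Links in the four-page example: page -> pages it links to.
LINKS = {"W": ["X"], "X": [], "Y": ["W", "Z"], "Z": ["W"]}

def step(ranks):
    """Compute new estimates for every page from the old estimates."""
    new = {}
    for page in LINKS:
        # Sum R(P)/L(P) over every page P that links to `page`.
        inbound = sum(ranks[src] / len(out)
                      for src, out in LINKS.items() if page in out)
        new[page] = (1 - D) + D * inbound
    return new

def iterate(ranks, tol=1e-9, max_iter=100):
    """Repeat `step` until no estimate changes by more than `tol`."""
    for i in range(1, max_iter + 1):
        new = step(ranks)
        if all(abs(new[p] - ranks[p]) <= tol for p in ranks):
            return new, i
        ranks = new
    return ranks, max_iter

ranks, iterations = iterate({page: 1.0 for page in LINKS})
for page, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(rank, 5))
```

Each call to `step` uses only the old estimates, which matches the rows of the table: row n is computed entirely from row n - 1.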


Final Pageranks

highest   page X   R(X) = 0.34795
          page W   R(W) = 0.27550
          page Z   R(Z) = 0.14500
lowest    page Y   R(Y) = 0.10000

Final Pageranks

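[Figure: link graph of pages W, X, Y and Z, each labeled with its number of outgoing links and its final Pagerank: W (1 link, 0.27550), X (0 links, 0.34795), Y (2 links, 0.10000), Z (1 link, 0.14500)]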

How does Google Use
Pagerank?


You enter search terms, such as “UNC
Charlotte HCI”


Google finds all the pages that have all
those words on them


Of all those pages, Google will list the
ones with the highest Pagerank first, but…


…other ‘magic ingredients’ are used by
Google: trade secrets of their algorithms.
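The first three steps can be sketched in Python. The page addresses, word sets, and Pagerank values below are made up, and the sketch leaves out the "magic ingredients" entirely.

```python
# Hypothetical mini-index: page -> (words on the page, Pagerank).
PAGES = {
    "hci.example/lab":    ({"unc", "charlotte", "hci", "lab"}, 0.35),
    "news.example/story": ({"unc", "charlotte", "basketball"}, 0.60),
    "blog.example/hci":   ({"unc", "charlotte", "hci", "notes"}, 0.15),
}

def search(query):
    """Return pages containing all query words, highest Pagerank first."""
    terms = set(query.lower().split())
    # Keep only pages whose word set contains every search term.
    matches = [url for url, (words, _) in PAGES.items() if terms <= words]
    return sorted(matches, key=lambda url: PAGES[url][1], reverse=True)

print(search("UNC Charlotte HCI"))
```

The news page has the highest Pagerank but is left out because it does not contain the word "hci".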

Introduction


Basic queries are somewhat limited


One or two keywords


Simple relationships


Limited syntax


Complex queries provide more power


Keywords and phrases can be connected to form
more complex relationships


Search filters can be employed to limit results

Understanding Boolean Operators


Syntax


Rules for combining simple words to form
complex sentences


Search engine syntax implemented by
applying Boolean logic


George Boole (1815-1864)

Understanding Boolean Operators


Boolean logic


Keywords act as nouns


Boolean operators act as conjunctions


They define the connections between keywords


Illustrated with Venn diagrams


John Venn (1834-1923)

Understanding Boolean Operators

[Venn diagram: the cats set within the WWW]

All web pages containing the word cats

Understanding Boolean Operators

[Venn diagram: the dogs set within the WWW]

All web pages containing the word dogs

Understanding Boolean Operators

[Venn diagram: overlapping cats and dogs sets within the WWW]

All web pages containing the words cats and dogs

Intersection of the two sets

Searches containing both words

Understanding Boolean Operators

[Venn diagram: overlapping cats and dogs sets within the WWW]

All web pages containing the words cats or dogs

Union of the two sets

Searches containing either word

Understanding Boolean Operators

[Venn diagram: overlapping cats and dogs sets within the WWW]

All web pages containing the words cats and not dogs

Exclusion of the dogs set

Searches containing one word but not the other

Understanding Boolean Operators

[Venn diagram: overlapping dogs and cats sets within the WWW]

All web pages containing the words dogs and not cats

Exclusion of the cats set

Searches containing one word but not the other
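The four Venn diagrams correspond directly to set operations. In this small Python sketch the page numbers are made up; each set holds the IDs of pages containing that word.

```python
# Hypothetical page IDs for the pages containing each word.
cats = {1, 2, 3, 4}
dogs = {3, 4, 5}

both = cats & dogs         # cats AND dogs: intersection of the two sets
either = cats | dogs       # cats OR dogs: union of the two sets
cats_only = cats - dogs    # cats AND NOT dogs: exclusion of the dogs set
dogs_only = dogs - cats    # dogs AND NOT cats: exclusion of the cats set

print(both, either, cats_only, dogs_only)
```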

Understanding Boolean Operators


Boolean operators


AND


OR


NOT


Instruct the engine on how to combine
keywords to produce results


Always use capital letters to avoid
confusion with and, or, not as keywords

Understanding Boolean Operators


AND


All these keywords must be on the Web page


OR


These keywords may or may not be on the
Web page


At least one of them must be


NOT


None of these keywords can be on the Web
page

Understanding Boolean Operators


Default operator


Some engines have a default Boolean
operator


Usually AND


Might be OR


Some engines may search for multiple
words as phrases

Understanding Boolean Operators


Boolean operators may be


Allowed on main page


Confined to Advanced search pages


Some engines use symbols instead


+ for AND


- for NOT


No space between sign and word:


+solar +energy -windmill
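One way such a query could be interpreted is sketched below in Python. The functions `parse_symbols` and `matches` are hypothetical and do not describe how any particular search engine parses queries.

```python
def parse_symbols(query):
    """Split '+solar +energy -windmill' into required and excluded words."""
    required, excluded = [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])      # + means AND
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])      # - means NOT
        # Words with no sign are ignored in this sketch.
    return required, excluded

def matches(page_words, query):
    """True if the page has every + word and none of the - words."""
    required, excluded = parse_symbols(query)
    return (all(word in page_words for word in required)
            and not any(word in page_words for word in excluded))

print(parse_symbols("+solar +energy -windmill"))
print(matches({"solar", "energy", "panels"}, "+solar +energy -windmill"))
```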

Narrowing Searches with AND


AND


Limits results


Forces inclusion of a stop word


Indicates that all keywords must be found
on the Web page


Adding more ANDed keywords limits
search more


Results should be more relevant because
the keyword list has expanded

Narrowing Searches with AND


Example:


“solar energy association” AND Portland

[Venn diagram: within the WWW, the overlap of the “solar energy association” set and the Portland set]

Narrowing Searches with AND


Example:


Henry +I same as “Henry I”

[Venn diagram: within the WWW, the overlap of the Henry set and the I set]

Expanding Searches with OR


OR expands results


Useful if you didn’t get enough returns from
your first search


The more keywords you add, the more results
you should get


Every page returned must have at least
one of the keywords on it


Good to use when you have synonyms

Expanding Searches with OR


Example:


oregon OR northwest

[Venn diagram: within the WWW, the union of the oregon set and the northwest set]

Restricting Queries with AND NOT


AND NOT excludes the keyword that
follows NOT


Limits your search


Produces fewer results


Useful if first search returns irrelevant
results


Use AND NOT to get rid of those results

Restricting Queries with AND NOT


Equivalent forms:


cats AND NOT dogs


cats AND-NOT dogs


cats NOT dogs


[Venn diagram: the cats set with its overlap with the dogs set excluded]

Restricting Queries with AND NOT


Example:


“solar energy association” AND portland AND NOT maine

[Venn diagram: the overlap of the “solar energy association” set and the portland set, with the maine set excluded]

Multiple Boolean Operators


Boolean operators allow you to focus a
search


Any logical combination of operators is
allowed


If it makes sense when spoken like a
sentence it’s probably OK to use


Order of operations is usually left to right


Use parentheses to organize terms

Multiple Boolean Operators


Bad example:


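constitution +american OR “united states”

[Venn diagram: the constitution, american and “united states” sets; read left to right, every “united states” page is included even if it does not mention constitution]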

Multiple Boolean Operators


Good example:


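constitution +(american OR “united states”)

[Venn diagram: the constitution, american and “united states” sets; only pages in the constitution set that also fall in the american or “united states” set are included]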
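The difference between the bad and good examples can be shown with sets in Python. The page numbers are made up, and the sketch assumes the engine reads the unparenthesized query from left to right.

```python
# Hypothetical page IDs containing each keyword or phrase.
constitution = {1, 2, 3}
american = {2, 6}
united_states = {3, 4, 5}

# Bad: read left to right, (constitution AND american) OR "united states".
bad = (constitution & american) | united_states

# Good: constitution AND (american OR "united states").
good = constitution & (american | united_states)

print(sorted(bad), sorted(good))
```

In the bad version, pages 4 and 5 are returned even though they never mention the constitution. The parentheses in the good version keep every result inside the constitution set.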