# ITIS 1210 Introduction to Web-Based Information Systems

Dec 4, 2013

Internet Research Two

How Search Engines Rank Pages & Constructing Complex Searches

How do Search Engines Crawl?

Gathering data from the Web is like browsing:

1. Visit a page.

2. Record all the words on the page.

3. Choose a link you haven’t seen/recorded.

4. Repeat 8 billion times.
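
The crawl loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_WEB` is a made-up four-page Web held in memory (no real pages are fetched), and the page names and the `crawl` helper are hypothetical.

```python
from collections import deque

# Hypothetical toy "Web": page -> (text on the page, links on the page).
TOY_WEB = {
    "a.html": ("cats and dogs", ["b.html", "c.html"]),
    "b.html": ("solar energy", ["a.html"]),
    "c.html": ("dogs only", ["b.html", "d.html"]),
    "d.html": ("oregon northwest", []),
}

def crawl(start, web):
    """Visit every page reachable from start, recording its words once."""
    index = {}                 # page -> set of words recorded on it
    to_visit = deque([start])  # links found but not yet followed
    seen = {start}             # remembering where we have been
    while to_visit:
        page = to_visit.popleft()        # 1. visit a page
        text, links = web[page]
        index[page] = set(text.split())  # 2. record all the words on the page
        for link in links:               # 3. choose links not seen before
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return index                         # 4. the loop repeats until no new links remain

index = crawl("a.html", TOY_WEB)
print(sorted(index))
```

The `seen` set is what keeps the crawler from revisiting pages; at Web scale, storing it and the index is a large part of the problem.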

Crawling the Web

One person with a Web browser, following one link per second:

How long does it take to browse the surface Web (8 billion pages)?

8 billion seconds ≈ 133 million minutes

≈ 2.2 million hours

≈ 93 thousand days

≈ 254 years

Crawling the Web

How many people would it take to crawl the surface Web in a week? If each person follows one link per second (with no sleep):

One week ≈ six hundred thousand seconds

Eight billion / six hundred thousand ≈ thirteen thousand people
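
The arithmetic for both crawling estimates can be checked with a short script. It assumes the same figures as the slides (8 billion pages, one page per second) and a 365-day year; the variable names are only illustrative.

```python
PAGES = 8_000_000_000   # surface Web size used on the slides
SECONDS_PER_PAGE = 1    # one link followed per second

# One person crawling alone
total_seconds = PAGES * SECONDS_PER_PAGE
minutes = total_seconds / 60
hours = minutes / 60
days = hours / 24
years = days / 365

# Many people crawling in parallel for one week
seconds_per_week = 7 * 24 * 60 * 60
people_needed = total_seconds / seconds_per_week

print(f"{minutes:,.0f} minutes, {hours:,.0f} hours, {days:,.0f} days, {years:,.0f} years")
print(f"{seconds_per_week:,} seconds per week, about {people_needed:,.0f} people")
```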

Challenges:

Remembering where you’ve been

Remembering where you haven’t been

Storing all the data

A (small) Server Farm

The Deep Web

Not all pages get crawled:

Private pages on Intranets (company
networks)

Pages that people don’t want crawled

Dynamic content pages (from databases)

Dynamic content pages make the size of the Web effectively infinite!

Dynamic Content Example

zillow.com

Won’t be indexed

Identifying High Quality Web
Pages

Google has ranked billions of Web pages
by "quality".

Example search: UNC Charlotte HCI

The top result is the highest quality page associated with these search terms.

Pretend you're surfing the Web randomly.

To move from page to page you could:

1) type in an address (www.sis.uncc.edu), which includes using a bookmark

OR

2) follow a link on the current page

Pagerank measures how likely you are to reach a particular page through random surfing (either 1 or 2).

Links from other web pages indicate that your page is important.

Computing Pagerank

Q = Web page

A, B, C, ... = Pages pointing to Q

L(A), L(B), L(C), ... = number of links on each page

Pagerank of Q:

R(Q) = (1 - d) + d * (R(A)/L(A) + R(B)/L(B) + ...)

d represents the relative chance of following a link to page Q, and 1 - d represents the relative chance of going directly to page Q (via typing in the address or using a bookmark).

Usually these are: d = 0.9, (1 - d) = 0.1
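
As a rough sketch, the Pagerank formula for a single page Q can be written as one small function. The function name `pagerank_of` and the sample numbers for pages A and B are hypothetical, chosen only to show how the terms R(A)/L(A) and R(B)/L(B) are summed.

```python
def pagerank_of(inbound, rank, links_out, d=0.9):
    """R(Q) = (1 - d) + d * (sum of R(P)/L(P) over pages P pointing to Q)."""
    return (1 - d) + d * sum(rank[p] / links_out[p] for p in inbound)

# Hypothetical example: pages A and B both point to Q.
rank = {"A": 1.0, "B": 0.5}   # current Pagerank estimates R(A), R(B)
links_out = {"A": 2, "B": 1}  # number of links on each page, L(A), L(B)

r_q = pagerank_of(["A", "B"], rank, links_out)
print(round(r_q, 5))
```

Here A passes on half of its rank (it has two links) and B passes on all of its rank (it has one link).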

Computing Pagerank

Pretend the Web has only four pages:

Pages: W, X, Y, Z

Links:

W links to X

Y links to W and Z

Z links to W

X has no outgoing links

L(W) = 1   L(X) = 0   L(Y) = 2   L(Z) = 1

Which page has the highest “quality”?

Computing Pagerank

R(W) = (1 - d) + d * (R(Y)/L(Y) + R(Z)/L(Z))
     = 0.1 + 0.9 * (R(Y)/2 + R(Z)/1)

R(X) = 0.1 + 0.9 * R(W)

R(Y) = 0.1

R(Z) = 0.1 + 0.9 * (R(Y)/2)

Now, solve for: R(W), R(X), R(Y), R(Z)

Computing Values for R(W), R(X), R(Y) and R(Z)

We could use algebra to find the values, in the same way we could solve for x and y in:

x = 1 + 2x + y

y = 2 + x + 3y

Algebraic Solution

w = R(W)   x = R(X)   y = R(Y)   z = R(Z)

w = 0.1 + 0.45y + 0.9z

x = 0.1 + 0.9w

y = 0.1

z = 0.1 + 0.45y

y = 0.1

z = 0.145

w = 0.2755

x = 0.34795

But solving for eight billion variables is hard. Instead, we can use fixed point iteration.
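
For this four-page example the algebra reduces to substitution, because y depends on nothing, z only on y, w on y and z, and x on w. A minimal sketch, using d = 0.9 as on the slides:

```python
d = 0.9

y = 1 - d                      # Y has no incoming links
z = (1 - d) + d * (y / 2)      # Z is linked only from Y, which has 2 links
w = (1 - d) + d * (y / 2 + z)  # W is linked from Y (2 links) and Z (1 link)
x = (1 - d) + d * w            # X is linked only from W, which has 1 link

print(f"w={w:.5f} x={x:.5f} y={y:.5f} z={z:.5f}")
```

This ordering trick works only because the example has no cycles; a real Web graph does, which is one reason direct algebra does not scale.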

Solution by Fixed-Point Iteration

Initial estimates of PageRank for each page:

R(W) = 1.0   R(X) = 1.0   R(Y) = 1.0   R(Z) = 1.0

Apply equations to compute new estimates:

new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z))
         = 0.1 + 0.9 * (1.0/2 + 1.0)
         = 1.45

new R(X) = 0.1 + 0.9 * R(W)
         = 0.1 + 0.9 * 1.0
         = 1.0

new R(Y) = 0.1

new R(Z) = 0.1 + 0.9 * (R(Y)/2)
         = 0.1 + 0.9 * (1.0/2)
         = 0.55

Solution by Fixed-Point Iteration

Current estimates:

R(W) = 1.45   R(X) = 1.0   R(Y) = 0.1   R(Z) = 0.55

Apply equations to compute new estimates:

new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z))
         = 0.1 + 0.9 * (0.1/2 + 0.55)
         = 0.64

new R(X) = 0.1 + 0.9 * R(W)
         = 0.1 + 0.9 * 1.45
         = 1.405

new R(Y) = 0.1

new R(Z) = 0.1 + 0.9 * (R(Y)/2)
         = 0.1 + 0.9 * (0.1/2)
         = 0.145

Solution by Iteration

Compute new estimates from the old until the estimates stop changing. Note that this gives the same result as the traditional algebraic approach, but this way scales better.

iteration   R(W)      R(X)      R(Y)      R(Z)
0           1.00000   1.00000   1.00000   1.00000
1           1.45000   1.00000   0.10000   0.55000
2           0.64000   1.40500   0.10000   0.14500
3           0.27550   0.67600   0.10000   0.14500
4           0.27550   0.34795   0.10000   0.14500
5           0.27550   0.34795   0.10000   0.14500
...         ...       ...       ...       ...
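
The iteration can be sketched in Python as follows. The link structure in `LINKS` is read off the four-page example (W links to X, Y links to W and Z, Z links to W); the function name, the stopping tolerance, and the iteration cap are assumptions made for this illustration.

```python
# Four-page example: page -> pages it links to.
LINKS = {"W": ["X"], "X": [], "Y": ["W", "Z"], "Z": ["W"]}

def iterate_pagerank(links, d=0.9, tolerance=1e-9, max_iterations=100):
    """Repeat the update R(Q) = (1 - d) + d * sum(R(P)/L(P)) until estimates stop changing."""
    ranks = {page: 1.0 for page in links}  # initial estimates
    history = [dict(ranks)]
    for _ in range(max_iterations):
        new_ranks = {}
        for page in links:
            # Sum R(P)/L(P) over every page P that links to this page.
            incoming = sum(ranks[p] / len(links[p]) for p in links if page in links[p])
            new_ranks[page] = (1 - d) + d * incoming
        history.append(new_ranks)
        converged = all(abs(new_ranks[p] - ranks[p]) < tolerance for p in links)
        ranks = new_ranks
        if converged:
            break
    return history

history = iterate_pagerank(LINKS)
for i, row in enumerate(history):
    print(i, " ".join(f"{row[p]:.5f}" for p in "WXYZ"))
```

Every new estimate is computed from the previous round's values only, which is why each row of the table depends on the row above it.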

Final Pageranks

highest   page X   R(X) = 0.34795

          page W   R(W) = 0.27550

          page Z   R(Z) = 0.14500

lowest    page Y   R(Y) = 0.10000

Final Pageranks

[Figure: the four-page Web graph labeled with each page’s final Pagerank: W = 0.27550, X = 0.34795, Y = 0.10000, Z = 0.14500]

Pagerank?

You enter search terms, such as “UNC Charlotte HCI”

Google finds all the pages that have all those words on them

Of all those pages, Google will list the ones with the highest Pagerank first, but…

…other ‘magic ingredients’ are used by Google as well
Introduction

Basic queries are somewhat limited

One or two keywords

Simple relationships

Limited syntax

Complex queries provide more power

Keywords and phrases can be connected to form more complex relationships

Search filters can be employed to limit results

Understanding Boolean Operators

Syntax

Rules for combining simple words to form
complex sentences

Search engine syntax implemented by
applying Boolean logic

George Boole, 1815-1864

Understanding Boolean Operators

Boolean logic

Keywords act as nouns

Boolean operators act as conjunctions

They define the connections between keywords

Illustrated with Venn diagrams

John Venn, 1834-1923

Understanding Boolean Operators

[Venn diagram: the set “cats” within the WWW]

All web pages containing the word cats

Understanding Boolean Operators

[Venn diagram: the set “dogs” within the WWW]

All web pages containing the word dogs

Understanding Boolean Operators

[Venn diagram: overlapping sets “cats” and “dogs” within the WWW]

All web pages containing the words cats and dogs

Intersection of the two sets

Searches containing both words

Understanding Boolean Operators

[Venn diagram: overlapping sets “cats” and “dogs” within the WWW]

All web pages containing the words cats or dogs

Union of the two sets

Searches containing either word

Understanding Boolean Operators

[Venn diagram: overlapping sets “cats” and “dogs” within the WWW]

All web pages containing the words cats and not dogs

Exclusion of the dogs set

Searches containing one word but not the other

Understanding Boolean Operators

[Venn diagram: overlapping sets “dogs” and “cats” within the WWW]

All web pages containing the words dogs and not cats

Exclusion of the cats set

Searches containing one word but not the other

Understanding Boolean Operators

Boolean operators

AND

OR

NOT

Instruct the engine on how to combine
keywords to produce results

Always use capital letters to avoid
confusion with and, or, not as keywords

Understanding Boolean Operators

AND

All these keywords must be on the Web page

OR

These keywords may or may not be on the
Web page

At least one of them must be

NOT

None of these keywords can be on the Web
page
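
The three operators map directly onto set operations, which is what the Venn diagrams show. A small sketch with invented page names:

```python
# Hypothetical sets of pages containing each keyword.
cats = {"page1", "page2", "page3"}
dogs = {"page3", "page4"}

cats_and_dogs = cats & dogs      # AND: intersection, both words present
cats_or_dogs = cats | dogs       # OR: union, at least one word present
cats_not_dogs = cats - dogs      # AND NOT: cats pages with the dogs set excluded
dogs_not_cats = dogs - cats      # AND NOT the other way around

print(sorted(cats_and_dogs))
print(sorted(cats_or_dogs))
print(sorted(cats_not_dogs))
print(sorted(dogs_not_cats))
```

AND can only shrink the result set and OR can only grow it, which matches the narrowing and expanding behavior described for searches.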

Understanding Boolean Operators

Default operator

Some engines have a default Boolean
operator

Usually AND

Might be OR

Some engines may search for multiple
words as phrases

Understanding Boolean Operators

Boolean operators may be

Allowed on main page

+ for AND

- for NOT

No space between sign and word:

+solar +energy -windmill
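
To show how an engine might read the sign syntax, here is a minimal sketch. The helper names `parse_query` and `matches` are hypothetical, and treating unsigned words as required is an assumption (it corresponds to a default operator of AND).

```python
def parse_query(query):
    """Split a query such as '+solar +energy -windmill' into required and excluded words."""
    required, excluded = [], []
    for token in query.split():
        if token.startswith("+"):
            required.append(token[1:])   # + means the word must appear
        elif token.startswith("-"):
            excluded.append(token[1:])   # - means the word must not appear
        else:
            required.append(token)       # assumption: default operator is AND
    return required, excluded

def matches(page_words, required, excluded):
    """True if the page has every required word and none of the excluded ones."""
    return (all(word in page_words for word in required)
            and not any(word in page_words for word in excluded))

required, excluded = parse_query("+solar +energy -windmill")
print(required, excluded)
print(matches({"solar", "energy", "panel"}, required, excluded))
print(matches({"solar", "energy", "windmill"}, required, excluded))
```

Because the sign is read as the first character of each token, a space between the sign and the word would break the parse, which is why the slide says to leave none.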

Narrowing Searches with AND

AND

Limits results

Forces inclusion of a stop word

Indicates that all keywords must be found on Web page

search more

Results should be more relevant because
the keyword list has expanded

Narrowing Searches with AND

Example:

“solar energy association” AND Portland

[Venn diagram: overlapping sets “solar energy association” and “Portland” within the WWW]

Narrowing Searches with AND

Example:

Henry +I same as “Henry I”

[Venn diagram: overlapping sets “Henry” and “I” within the WWW]

Expanding Searches with OR

OR expands results

Useful if you didn’t get enough returns from your first search

The more keywords you add, the more results you should get

Every page returned must have at least
one of the keywords on it

Good to use when you have synonyms

Expanding Searches with OR

Example:

oregon OR northwest

[Venn diagram: overlapping sets “oregon” and “northwest” within the WWW]

Restricting Queries with AND NOT

AND NOT excludes the keyword that
follows NOT

Produces fewer results

Useful if first search returns irrelevant
results

Use AND NOT to get rid of those results

Restricting Queries with AND NOT

Equivalent forms:

cats AND NOT dogs

cats AND-NOT dogs

cats NOT dogs

[Venn diagram: overlapping sets “cats” and “dogs”]

Restricting Queries with AND NOT

Example:

“solar energy association” AND portland AND NOT maine

[Venn diagram: overlapping sets “solar energy association”, “portland”, and “maine”]

Multiple Boolean Operators

Boolean operators allow you to focus a
search

Any logical combination of operators is
allowed

If it makes sense when spoken like a
sentence it’s probably OK to use

Order of operations is usually left to right

Use parentheses to organize terms

Multiple Boolean Operators

constitution +american OR “united states”

[Venn diagram: overlapping sets “constitution”, “american”, and “united states”]

Multiple Boolean Operators

Good example:

constitution +(american OR “united states”)

[Venn diagram: overlapping sets “constitution”, “american”, and “united states”]
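
The difference between the two queries can be illustrated with sets. The page names are invented, and the left-to-right reading of the first query is an assumption based on the usual order of operations mentioned earlier; a particular engine may group terms differently.

```python
# Hypothetical sets of pages containing each keyword or phrase.
constitution = {"p1", "p2", "p3"}
american = {"p1", "p4"}
united_states = {"p2", "p5"}

# Read left to right: (constitution AND american) OR "united states"
left_to_right = (constitution & american) | united_states

# With parentheses: constitution AND (american OR "united states")
grouped = constitution & (american | united_states)

print(sorted(left_to_right))
print(sorted(grouped))
```

Without parentheses, page p5 is returned even though it never mentions the constitution; grouping the OR terms keeps every result on topic.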