ITIS 1210
Introduction to Web

Based
Information Systems
Internet Research Two
How Search Engines Rank Pages &
Constructing Complex Searches
How do Search Engines Crawl?
Gathering data from the Web is like
browsing:
1.
Visit a page.
2.
Record all the words on the page
3.
Choose a link you haven’t seen/recorded
4.
Click on the link.
Repeat 8 billion times.
Crawling the Web
One person with a Web browser, following
one link per second.
How long does it take to browse the
surface Web (8 billion pages)?
8 billion seconds = 133 million minutes
= 2 million hours
= 93 thousand days
= 256 years
Crawling the Web
How many people would it take to crawl
the surface Web in a week? If each person
follows one link per second (with no
sleep):
One week = six hundred thousand seconds
Six hundred thousand / eight billion =
thirteen thousand
Challenges:
Remembering where you’ve been
Remembering where you haven’t been
Storing all the data
A (small) Server Farm
The Deep Web
Not all pages get crawled:
Private pages on Intranets (company
networks)
Pages that people don’t want crawled
Dynamic content pages (from databases)
Dynamic content pages make the size of the
Internet infinite!
Dynamic Content Example
zillow.com
Won’t be
indexed
Identifying High Quality Web
Pages
Google has ranked billions of Web pages
by "quality".
You enter your search terms:
UNC Charlotte HCI
Google finds the
highest quality page
associated with these search terms.
Google Pagerank
Pretend you're surfing the Web randomly.
To move from page to page you could:
1) type in an address (
www.
sis.uncc.edu
)
includes using a bookmark
OR
2) follow a link.
Pagerank
measures how likely you are to reach a
particular page through random surfing (either 1 or 2).
The main idea is that links to your page from important
web pages indicate that your page is important.
Computing Pagerank
(what’s the probability of getting to this page?)
Q
A, B, C, ...
L(A), L(B), L(C),...
= Web page
= Pages pointing to
Q
= number of links on each page
Pagerank of
Q
:
R(Q) = (
1

d
) +
d
∙
(R(A)/L(A) + R(B)/L(B) + ...)
d
represents the relative chance of
following a link
to page Q
and
1

d
represents the relative chance of going directly to
page Q (via typing in
the address or using a bookmark):
Usually these are:
d
=
0.9
(
1

d
)
= 0.1
Computing Pagerank
Pretend the Web has only four pages:
W X Y Z
Links:
W
X Y
W Y
Z Z
W
L(W)
=1
L(X)
=0
L(Y)
=2
L(Z)
=1
Which page has the highest “quality”?
Computing Pagerank
Links:
W
X Y
W Y
Z Z
W
L(W)
=1
L(X)
=0
L(Y)
=2
L(Z)
=1
R(W)
=
(1

d) + d
*
(R(Y)/L(Y) + R(Z)/L(Z))
= 0.1
+
0.9
*
(R(Y)/
2
+ R(Z)/
1
))
R(X)
= 0.1 + 0.9 *
R(W)
R(Y)
= 0.1
R(Z)
= 0.1
+
0.9
*
(R(Y)/
2
)
Now, solve for:
R(W), R(X), R(Y),
R(Z)
Computing Values for R(W), R(X), R(Y) and R(Z)
We could use algebra to find the values, in the
same way we could solve for
x
and
y
in:
x =
1
+
2
x + y
y =
2
+ x +
3
y
Algebraic Solution
w
=
R(W) x = R(X) y = R(Y) z = R(Z)
w
= 0.1
+
0.45
y
+ 0.9
z
x
= 0.1 + 0.9
w
y
= 0.1
z
= 0.1
+
0.45
y
y
= 0.1
z
= 0.145
w =
0.2775
x =
0.34795
But solving for eight billion variables is hard.
Instead, we'll use
fixed point iteration
.
Solution by Fixed

Point Iteration
Apply equations to compute
new
estimates:
new
R(W)
= 0.1 + 0.9
*
(R(Y)
/2
+ R(Z))
= 0.1 + 0.9 *
(
1.0
/2 +
1.0
)
=
1.45
new
R(X)
= 0.1 + 0.9 *
R(W)
=
0.1 + 0.9 *
1.0
=
1.0
new
R
(Y)
=
0.1
new
R(Z)
= 0.1 + 0.9 *
(R(Y)/
2
)
= 0.1 + 0.9 * (
1.0
/2) =
0.55
Start with
initial
estimates of PageRank for each page:
R(W)
=
1.0
R(X)
=
1.0
R(Y)
=
1.0
R(Z)
=
1.0
Solution by Fixed

Point Iteration
Start with updated estimates:
R(W)
=
1.45
R(X)
=
1.0
R(Y)
=
0.1
R(Z)
=
0.55
Apply equations to compute new estimates:
new
R(W)
= 0.1 + 0.9
*
(R(Y)
/2
+ R(Z))
= 0.1 + 0.9 *
(
0.1
/2 +
0.55
)
=
0.64
new
R(X)
= 0.1 + 0.9 *
R(W)
=
0.1 + 0.9 *
1.45
=
1.405
new
R
(Y)
=
0.1
new
R(Z)
= 0.1 + 0.9 *
(R(Y)/
2
)
= 0.1 + 0.9 * (
0.1
/2) =
0.145
Solution by Iteration
iteration
R(W) R(X) R(Y) R(Z)
0 1.00000 1.00000 1.00000 1.00000
1 1.45000 1.00000 0.10000 0.55000
2 0.64000 1.40500 0.10000 0.14500
Compute new estimates from the old
until
the estimates
stop changing.
Note that this is t
he
same answer as the
traditional algebraic approach, but this way scales better.
3 0.27550 0.67600 0.10000 0.14500
4 0.27550 0.34795 0.10000 0.14500
5 0.27550 0.34795 0.10000 0.14500
... ... ... ...
Final Pageranks
highest
page
X
R(X)
= 0.34795
.
page
W
R(W)
= 0.2755
.
.
page
Z
R(Z)
= 0.14500
lowest
page
Y
R(Y)
= 0.10000
Final Pageranks
Y
W
X
Z
2
1
1
0
0.10000
0.34795
0.27550
0.14500
How does Google Use
Pagerank?
You enter search terms, such as “UNC
Charlotte HCI”
Google finds all the pages that have
all
those words on them
Of all those pages, Google will list the
ones with the highest page rank first, but…
…other ‘magic ingredients’ are used by
Google: trade secrets of their algorithms.
Introduction
Basic queries are somewhat limited
One or two keywords
Simple relationships
Limited syntax
Complex queries provide more power
Keywords & phrase can be connected to form
more complex relationships
Search filters can be employed to limit results
Understanding Boolean Operators
Syntax
Rules for combining simple words to form
complex sentences
Search engine syntax implemented by
applying Boolean logic
George Boole
1815

1864
Understanding Boolean Operators
Boolean logic
Keywords act as nouns
Boolean operators act as conjunctions
They define the connections between keywords
Illustrated with Venn diagrams
John Venn
1834

1923
Understanding Boolean Operators
cats
W W W
All web pages containing the word cats
Understanding Boolean Operators
dogs
W W W
All web pages containing the word dogs
Understanding Boolean Operators
W W W
dogs
cats
All web pages containing the words cats
and
dogs
Intersection of
the two sets
Searches containing
both
words
Understanding Boolean Operators
W W W
cats
dogs
All web pages containing the words cats
or
dogs
Searches containing
either
word
Union of the
two sets
Understanding Boolean Operators
W W W
cats
dogs
All web pages containing the words cats
and not
dogs
Exclusion of
the dogs set
Searches containing one word but not the other
Understanding Boolean Operators
W W W
All web pages containing the words dogs
and not
cats
dogs
cats
Exclusion of
the cats set
Searches containing one word but not the other
Understanding Boolean Operators
Boolean operators
AND
OR
NOT
Instruct the engine on how to combine
keywords to produce results
Always use capital letters to avoid
confusion with and, or, not as keywords
Understanding Boolean Operators
AND
All these keywords must be on the Web page
OR
These keywords may or may not be on the
Web page
At least one of them must be
NOT
None of these keywords can be on the Web
page
Understanding Boolean Operators
Default operator
Some engines have a default Boolean
operator
Usually AND
Might be OR
Some engines may search for multiple
words as phrases
Understanding Boolean Operators
Boolean operators may be
Allowed on main page
Confined to Advanced search pages
Some engines use symbols instead
+ for AND

for NOT
No space between sign and word:
+solar +energy

windmill
Narrowing Searches with AND
AND
Limits results
Forces inclusion of a stop word
Indicates that
all
keywords must be found
on Web page
Adding more ANDed keywords limits
search more
Results should be more relevant because
the keyword list has expanded
Narrowing Searches with AND
Example:
“solar energy association” AND Portland
W W W
Solar energy
association
Portland
Narrowing Searches with AND
Example:
Henry +I same as “Henry I”
W W W
Henry
I
Expanding Searches with OR
OR expands results
Useful if you didn’t get enough returns from
your first search
The more keywords you add, the more results
you should get
Every page returned must have at least
one of the keywords on it
Good to use when you have synonyms
Expanding Searches with OR
Example:
oregon OR northwest
W W W
oregon
northwest
Restricting Queries with AND NOT
AND NOT excludes the keyword that
follows NOT
Limits your search
Produces fewer results
Useful if first search returns irrelevant
results
Use AND NOT to get rid of those results
Restricting Queries with AND NOT
Equivalent forms:
cats AND NOT dogs
cats AND

NOT dogs
cats NOT dogs
cats
–
dogs
Restricting Queries with AND NOT
Example:
“solar energy association” AND portland
AND NOT maine
Solar energy
association
portland
maine
Multiple Boolean Operators
Boolean operators allow you to focus a
search
Any logical combination of operators is
allowed
If it makes sense when spoken like a
sentence it’s probably OK to use
Order of operations is usually left to right
Use parentheses to organize terms
Multiple Boolean Operators
Bad example:
constitution +american OR “united states”
constitution
american
“united states”
Multiple Boolean Operators
Good example:
constitution +(american OR “united states”)
constitution
american
“united states”
Comments 0
Log in to post a comment