IELM 231: IT for Logistics and Manufacturing



Course Agenda

Introduction

IT applications design: Human-Computer Interface

Fundamental IT tools: sorting, searching

The Client-Server architecture, Interacting applications

IT in logistics, Case study 1: web search
- Search robots
- Data processing, Data storage/retrieval (DB, indexes)
- Data presentation: page ranking techniques

IT in logistics, Case study 2: web-based auctions
- How auctions work
- Web issues: session tracking
- Web issues: secure communications
- Web issues: cash transactions

The Internet and Web searching

Part 1. The infrastructure of the web
- The architecture of the Internet
- The ISO/OSI 7-layer communication protocol
- Client-server architecture for applications

Part 2. How to use a web search engine [case study: Google]
- Basic searching
- Advanced searching (Booleans) and filtering (setting up local engines)

Part 3. How search engines work
- Crawling the internet
- Indexing of information, database design
- Page ranking

History of search engines

1993: First web robot


World Wide Web Wanderer

Matthew Gray, Physics student from MIT

Objective: track all pages on web to monitor growth of the web

1994: First search engine


WebCrawler

Brian Pinkerton, CS student from U of Washington

Objective: download web pages, store the links in a keyword-searchable DB

1994: Jerry’s Guide to the Internet

Jerry Yang, David Filo, Stanford University

Objective: crawl for web pages, organize them by content into hierarchies
→ Yet Another Hierarchical Officious Oracle (Yahoo)

1994-97: Lycos, Infoseek, AltaVista, Excite, LookSmart (meta engine)

1998: Google (Sergey Brin, Larry Page, CS students, Stanford University)

Basics of web searching: two modes of searching

Web directories

[Boolean] keyword search

Basics of web directories

A hierarchy of main topics is formed (by search company)

Each topic may have several sub-topics, etc.

Users search for ‘main’ articles in any subtopic by ‘browsing’

All pages ‘known’ to engine are linked to the hierarchy by humans (editors)

Pages deep inside a hierarchy can be searched using ‘keyword search’

Basics of web searching: user's view

Basic search modes

Any page with the search term(s) occurring in it is returned

What if you search for multiple words?

implicit AND: the page must contain all the words in the search term

Exact phrase search

Typically by quoting the search term: "now is the time"

What happens if you use: now is the time (without quotes)?

Typically: 'is' and 'the' will be omitted, and the results will differ from above!

Basics of web searching: user's view ..

Boolean (Set theoretic) operations on search results:

Union: OR
  manufacturing die OR shell casting

Intersection: AND (implicit)
  manufacturing die casting

Some useful Google™ search modes:

Definition of a word or phrase
  define:search

Page contains numbers in a given range
  ming emperor 1400..1500

Simple calculator functions
  150*(1+0.05)**5 - 150

Difference (NOT): '-'
  manufacturing cold forging -hot

(A small set-based sketch of these Boolean operators follows below.)
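As a minimal sketch of how these Boolean operators map onto set operations over an inverted index (the index contents and document IDs below are invented purely for illustration):

```python
# Minimal sketch: Boolean keyword search as set operations over a toy
# inverted index. The index contents are invented for illustration.
index = {
    "manufacturing": {1, 2, 3, 5},
    "die":           {2, 3, 7},
    "casting":       {2, 3, 8},
    "shell":         {7, 8},
    "hot":           {5},
    "cold":          {2, 5},
    "forging":       {2, 5, 9},
}

def search_and(*terms):
    """Implicit AND: pages containing every term (set intersection)."""
    results = index.get(terms[0], set())
    for t in terms[1:]:
        results = results & index.get(t, set())
    return results

def search_or(*terms):
    """OR: pages containing any of the terms (set union)."""
    results = set()
    for t in terms:
        results |= index.get(t, set())
    return results

def search_not(include_terms, exclude_terms):
    """Difference ('-'): AND of the included terms minus pages with excluded terms."""
    return search_and(*include_terms) - search_or(*exclude_terms)

print(search_and("manufacturing", "die", "casting"))              # implicit AND -> {2, 3}
print(search_or("die", "shell"))                                  # OR -> {2, 3, 7, 8}
print(search_not(["manufacturing", "cold", "forging"], ["hot"]))  # '-hot' -> {2}
```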

Basics of web searching: user's view ...

Filtering

Anchor: the text in an HTML tag that carries the hyperlink to a page

Example:

<a href=http://www-ieem.ust.hk/dfaculty/ajay/courses/ielm330/330intro.html>IELM 330 web page</a>
Anchors have special significance in Google (we shall see this later)

Focused search

Search only on pages that occur within a particular site


IELM 330 site:ust.hk

Anchor text often indicates the content of the page it links to

Search term must occur in the anchor text
  inanchor:manufacturing-wiki

Search term must occur in the title of the page
  allintitle:joneja ajay

The Internet and Web searching

Part 1. The infrastructure of the web
- The architecture of the Internet
- The ISO/OSI 7-layer communication protocol
- Client-server architecture for applications

Part 2. How to use a web search engine [case study: Google]
- Basic searching
- Advanced searching (Booleans) and filtering (setting up local engines)

Part 3. How search engines work
- Crawling the internet
- Indexing of information, database design
- Page ranking

Collecting information on the internet: Crawlers

How do search engines know which page has what data?

- By the use of robots (also called spiders or crawlers)

- By asking sites to tell them who they are (registering)

Crawling: (a minimal Python sketch of this loop follows below)

  Start with a small list of web-pages (seeds) := OPEN

  For each URL in OPEN:
    Fetch the page;
    Process the page (index its contents → see later);
    Add all web-links (URLs) found on the page to OPEN
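A minimal sketch of this crawl loop in Python, using only the standard library; the seed URLs, the page limit, and the indexing step are placeholders, and politeness/error handling are omitted:

```python
# Minimal crawler sketch: maintain an OPEN frontier of URLs, fetch each page,
# hand it to an indexer, and add newly discovered links back to the frontier.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_page(url, html):
    print("indexed", url, len(html), "bytes")   # placeholder for the real indexer

def crawl(seeds, max_pages=100):
    open_frontier = deque(seeds)                 # OPEN := seed URLs
    seen = set(seeds)
    while open_frontier and len(seen) <= max_pages:
        url = open_frontier.popleft()            # breadth-first: FIFO frontier
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                             # skip unreachable pages
        index_page(url, html)                    # "process the page" (see indexing later)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)        # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                open_frontier.append(absolute)   # add new URLs to OPEN
```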

Problems in crawling

Too many web-sites: > 100 million
  Each website has many pages (estimated total: ~25 billion)

Suppose a web-site generates a page based on some inputs from the user
  How can we program a crawler to get this content?
  What potential problems can be faced by this technique?

Art gallery website:
  - 3 methods to sort thumbnails; 3 choices of font size; 2 skins
  → 18 URLs (using HTTP GET with different options) for the same page!

Dynamic page generation (e.g. DB based, or rapidly changing)

Some pages may not be linked from any seed

Crawler behavior

Basic crawler is easy to program

Policies used by crawlers determine their efficiency:

Selection policy: which page to download (among all URLs in OPEN)
  Breadth first is better than depth first → why?

Crawler behavior ..

Re-visit policy: how often to re-visit a page (to account for dynamic content)

How often to re-visit?
  Depends on how frequently the page is changed
  out-dated data → cost [but how to estimate this cost?]

Freshness of a page p in the repository at time t:
  F_p(t) = 1, if the local copy of p is identical to the live page
         = 0, otherwise

Age of a page p in the repository at time t:
  A_p(t) = 0, if the local copy of p is identical to the live page
         = t - (time when p was last modified), otherwise

[source: Wikipedia]
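A tiny illustration of these two measures in Python; the page contents and timestamps are invented, and we assume the crawler stores the fetch time and content of each local copy:

```python
# Illustration of freshness and age for one page in the repository.
# All values are invented; times are in days for readability.
def freshness(local_copy, live_page):
    """1 if the stored copy still matches the live page, else 0."""
    return 1 if local_copy == live_page else 0

def age(t_now, t_last_modified, local_copy, live_page):
    """0 if the copy is up to date, else how long it has been out of date."""
    if local_copy == live_page:
        return 0
    return t_now - t_last_modified

local = "<html>old catalogue</html>"
live  = "<html>new catalogue</html>"
print(freshness(local, live))                                               # 0: our copy is stale
print(age(t_now=30, t_last_modified=22, local_copy=local, live_page=live))  # 8 days out of date
```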

Crawler behavior ...

Policies used by crawlers determine their efficiency:

Politeness policy: how not to overload resources
  Crawlers can send download requests faster than the server, the network, or the indexer can handle
  Typical policy: wait n seconds before sending the next page request to the same server, n = 10 sec … 180 sec
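A short sketch of such a politeness policy, assuming the crawler supplies its own download function; the per-host bookkeeping and the default delay are illustrative:

```python
# Sketch of a politeness policy: remember when we last hit each host and
# sleep until at least `delay` seconds have passed before the next request.
import time
from urllib.parse import urlparse

last_request = {}           # host -> time of the most recent request to it

def polite_fetch(url, fetch, delay=10):
    host = urlparse(url).netloc
    wait = delay - (time.time() - last_request.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)    # back off so we do not overload this server
    last_request[host] = time.time()
    return fetch(url)       # `fetch` is whatever download function the crawler uses
```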

Crawler behavior ....

Policies used by crawlers determine their efficiency:

Parallelization policy: how to use multiple crawlers in a coordinated search
  advantage: the search engine can use multiple computers to build indexes
  problems:
  - how to coordinate and merge the results of the multiple computers
  - how to ensure that the same page is not downloaded by more than one process

Search engines: Indexing

What to do with the pages downloaded by the crawler?

Processing the data:
- categorization (mime-types)
- text processing
  - delete stop words [Why?] [How?]
  - identify and handle words
  - identify and handle phrases [How many?]
  - handling similar words
    - plurals, past/present, mis-spellings: stemming

Search engines: Indexing ..

Categorization (mime-types)

The indexer must know what type of data it received, and handle it accordingly:

A/V mime-types: audio/mpeg, audio/x-wav, audio/x-ms-wma, video/mpeg, …
  Currently: check meta data (anchor text, ALT-tags, HTTP Header, …)
  Future: digital signal processing → voice recognition → text → index

Image mime-types: image/gif, image/jpeg, image/png, image/tiff, …
  Currently: check meta data (anchor text, ALT-tags, HTTP Header, …)

Application mime-types: application/javascript, application/xhtml+xml, application/octet-stream, application/x-shockwave-flash, …
  Currently: if interpreted language → handle as text, otherwise meta data

Other mime-types: e.g. 3D data etc. → meta data, filenames, etc.

Search engines: Indexing ...

Text processing

- Identify and handle words

Stop words: very common words in text, such as the, a, an, on, …
  How to define? language experts (human) make the list
  How to handle? string-match → do not index stop words

- delete stop words

What is a word?
  - English: dictionary
  - Programming: contiguous alphanumeric characters with '_'
  - hyphenated [how does Google handle hyphenated search-terms?]
  - special characters?
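A minimal sketch of this word-level processing; the stop-word list is abbreviated and invented, and the tokenization rule (contiguous alphanumerics plus '_') follows the programming-style definition above:

```python
# Sketch: split text into words (contiguous alphanumerics and '_'),
# lower-case them, and drop stop words before indexing.
import re

STOP_WORDS = {"the", "a", "an", "on", "is", "and", "of", "to"}   # abbreviated list

def tokenize(text):
    return re.findall(r"[A-Za-z0-9_]+", text.lower())

def index_terms(text):
    return [w for w in tokenize(text) if w not in STOP_WORDS]

print(index_terms("Now is the time for die casting on the shop floor"))
# ['now', 'time', 'for', 'die', 'casting', 'shop', 'floor']
```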

Search engines: Indexing ....

- Identify and handle phrases

What is a phrase?
  - English: dictionary of phrases, human experts
  - How to identify: string match
  - How many?
    - Too many → the index is too large
    - Too few → real-time Boolean operations during search (takes time!)

Google n-gram dataset: ~24 GB compressed text files
  Number of tokens: 1,024,908,267,229
  Number of unigrams: 13,588,391
  Number of bigrams: 314,843,401
  Number of trigrams: 977,069,902
  Number of fourgrams: 1,313,818,354
  Number of fivegrams: 1,176,470,663

Search engines: Indexing .....

- Search for 'expert' == ? == Search for 'experts'

- Handling similar words
  - plurals, past/present, mis-spellings: stemming

Stemming: to reduce a given word to a 'canonical' root word

Basic methods:
  - Use a dictionary
  - Analyze the sentence, guess the meaning of the word used, find the root from a look-up table
  - Use simple processing rules (English)
    - if a word ends in 'ed', 'ing', 'ly' → remove 'ed', 'ing', 'ly'

- How to identify mis-spellings?
  - Dictionary lookup and match → expensive
  - Look-up table with a list of commonly misspelled words
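A toy suffix-stripping stemmer along the lines of the rule above; real engines use far more careful rule sets (e.g. the Porter stemmer), so this is only a sketch with an invented suffix list:

```python
# Toy suffix-stripping stemmer: drop common English suffixes so that
# 'expert' and 'experts', 'search' and 'searching' map to the same root.
SUFFIXES = ("ings", "ing", "ed", "ly", "es", "s")    # checked longest-first

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]              # strip the suffix, keep the root
    return word

print(stem("experts"), stem("searching"), stem("casted"))
# expert search cast
```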

Search engines: Indexing ......

Inverted indexes: how to store the processed data

Most common storage structure: Inverted Index

  ngram     Documents
  storm     URL1, URL6, …, URL_n
  teacup    URL1, …, URL_k
  …         …

How to store?
- Database file (lookup time may be long)
- Hash (faster lookup on average)

Challenge: Updating an index [Why?] [How?]
Challenge: Merging indexes [Why? → parallel processing] [How?]
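A minimal sketch of building and querying a hash-based inverted index; the documents and URLs are invented, and the terms are assumed to have already gone through the text processing described above:

```python
# Sketch: build a hash-based inverted index mapping each term to the set of
# URLs containing it, then answer a keyword query by intersecting those sets.
from collections import defaultdict

def build_inverted_index(pages):
    """pages: dict mapping URL -> list of processed terms on that page."""
    index = defaultdict(set)
    for url, terms in pages.items():
        for term in terms:
            index[term].add(url)
    return index

pages = {                                        # invented example documents
    "URL1": ["storm", "teacup", "warning"],
    "URL2": ["storm", "surge"],
    "URL3": ["teacup", "porcelain"],
}
index = build_inverted_index(pages)

def lookup(index, *terms):                       # implicit AND over the query terms
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(sorted(index["storm"]))                    # ['URL1', 'URL2']
print(sorted(lookup(index, "storm", "teacup")))  # ['URL1']
```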

Search engines: presenting results to the user

A typical query may match 100 million web pages (hits)

A typical user may look at perhaps 3-5 pages to find the information they want

A successful search engine must provide a link to the 'best' page among the top few

How to rank the relative likelihood that a page will provide the 'wanted' info?
- Google uses PageRank™

Search engines: PageRank

How to rank the relative likelihood that a page will provide the 'wanted' info?
- Google uses PageRank™

PageRank: invented by PhD student Larry Page
Similar older ideas: e.g. link analysis (graph theory), citation indexes

Main idea of PageRank:
  1. The importance of a page is related to how many other pages link to it, and
  2. how important those pages are

Is PageRank effective? [Google's market capitalization: US$ 200B!]

Page Ranking

Consider a web with 4 pages

Main reference for the PageRank notes: Bryan and Leise, The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google

An arrow from A → B means: Page A has a link to Page B

[Figure: a 4-page web; page 1 links to pages 2, 3, 4; page 2 links to pages 3, 4; page 3 links to page 1; page 4 links to pages 1, 3]

Let: importance score of page k = score of k = x_k

Simplest scheme:
  x_k = number of back-links to page k
  x_1 = 2, x_2 = 1, x_3 = 3, x_4 = 2

Problem? does it account for condition 2?

Page Ranking ..

[Figure: the same 4-page web]

Let: importance score of page k = score of k = x_k

Suppose page j is more important (e.g. yahoo.com)
  → those pages that are linked from page j should be important
  → additional importance of a page linked from page j ≈ x_j / n_j, where n_j = number of outgoing links on page j

Let L_k be the set of page k's backlinks; then:

  x_k = Σ_{j ∈ L_k} (x_j / n_j)

Note: we ignore any link from a page to itself

Page Ranking ...

[Figure: the same 4-page web]

Let L_k be the set of page k's backlinks; then x_k = Σ_{j ∈ L_k} (x_j / n_j)

For this web:
  x_1 = x_3 + x_4/2
  x_2 = x_1/3
  x_3 = x_1/3 + x_2/2 + x_4/2
  x_4 = x_1/3 + x_2/2

In matrix form, x = A x, with x = (x_1, x_2, x_3, x_4)^T and

  A = | 0    0    1    1/2 |
      | 1/3  0    0    0   |
      | 1/3  1/2  0    1/2 |
      | 1/3  1/2  0    0   |

Recall from Linear Algebra that for an n x n square matrix A, a scalar λ is an eigenvalue if there is an eigenvector x ≠ 0 such that A x = λ x

The PageRank of the pages is the eigenvector of A corresponding to eigenvalue = 1

Page Ranking ....

[Figure: the same 4-page web, with x = A x as above]

To find the eigenvector: solve the system using Gaussian elimination

In our example, any scalar multiple of:
  (12, 4, 9, 6)^T

Normalize (divide by 12 + 4 + 9 + 6 = 31):
  (12/31, 4/31, 9/31, 6/31) ≈ (.387, .129, .290, .194)

Problem: Is it necessary that the link matrix of any web has eigenvalue = 1?
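A quick check of this 4-page example, assuming numpy is available; it finds the eigenvector of the link matrix for eigenvalue 1 and normalizes it so the scores sum to 1:

```python
# Verify the 4-page example: the eigenvector of A for eigenvalue 1,
# normalized so its entries sum to 1, gives the PageRank scores.
import numpy as np

A = np.array([
    [0,   0,   1, 1/2],
    [1/3, 0,   0, 0  ],
    [1/3, 1/2, 0, 1/2],
    [1/3, 1/2, 0, 0  ],
])

eigenvalues, eigenvectors = np.linalg.eig(A)
k = np.argmin(np.abs(eigenvalues - 1))    # column whose eigenvalue is (closest to) 1
x = np.real(eigenvectors[:, k])
x = x / x.sum()                           # normalize so the scores sum to 1
print(np.round(x, 3))                     # [0.387 0.129 0.29  0.194]
```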

Page Ranking .....

Dangling node: a web page that has no outgoing links

Problem: Is it necessary that the link matrix of any web has eigenvalue = 1?

Theorem: The link matrix of a web with no dangling nodes always has 1 as an eigenvalue

Examine the matrix A:
  A_ij = 1/n_j, if page j links to page i
  A_ij = 0, otherwise
  → sum of all entries in column j = total (unweighted) contribution of page j = 1

Column stochastic matrix: a square matrix with no negative entries, and the sum of each column = 1

Page Ranking ......

Column stochastic matrix: a square matrix with no negative entries, and the sum of each column = 1
  → Our link matrix is a column stochastic matrix

Theorem: Every column stochastic matrix has 1 as an eigenvalue

Proof:
  1. (linear algebra): a square matrix A and its transpose, A^T, have the same eigenvalues
  2. Let e = an n x 1 vector with all entries = 1
  3. A is column stochastic → A^T e = e
     → A^T has an eigenvector e, with eigenvalue = 1
     → A has an eigenvalue = 1.  QED

Page Ranking issues

Problems with the Link Matrix approach, and how to handle them:
- Non-unique eigenvector for eigenvalue = 1
- Dangling nodes
- Computational effort and time
- Other factors that also contribute to the 'importance of the page'

Page Ranking issues: non-unique eigenvector

Consider the web in the figure

[Figure: a 5-page web; pages 1 and 2 link only to each other; pages 3 and 4 link to each other; page 5 links to pages 3 and 4]

  A = | 0  1  0  0  0   |
      | 1  0  0  0  0   |
      | 0  0  0  1  1/2 |
      | 0  0  1  0  1/2 |
      | 0  0  0  0  0   |

What are the eigenvectors of A corresponding to eigenvalue = 1?

Any linear combination of:
  (1/2, 1/2, 0, 0, 0)^T and (0, 0, 1/2, 1/2, 0)^T
  e.g.: (.375, .375, .125, .125, 0)^T

What is the relative importance of the pages??

Page Ranking issues: non-unique eigenvector ..

Consider the web in the figure [the same 5-page web and link matrix A as above]

Intuitive reason for the problem: this web can be partitioned into two independent sub-webs

Solution:
- Reorganize A into block diagonal form
- Find the Page Rank for each block independently

Mathematically: a non-unique eigenvector results when A has a block diagonal form:

  A = | A_1  0    …  0   |
      | 0    A_2  …  0   |
      | …                |
      | 0    0    …  A_k |

Page Ranking issues: Dangling nodes

Suppose a web has dangling nodes (any page with no outgoing links)

  → the corresponding column of A is all 0's

  → the matrix A is not column stochastic (it is column sub-stochastic)

  → for column sub-stochastic matrices, there must exist at least one eigenvalue, 0 < λ < 1, whose eigenvector has all non-negative entries

    [Note: this eigenvector is called the Perron eigenvector]
    [Note 2: we will omit the proof of its existence in this course]

  → we could use the Perron eigenvector to get page ranks (but we won't)

Dangling nodes: a probabilistic view of the Link matrix

Consider an imaginary, randomized web-surfer:
  He starts at a random page in the web and spends time t on the page.
  He then jumps randomly to one of the links on that page.

The Page Rank of a page = the expected time spent on that page by the randomized surfer!

How to model this view-point in our matrix A?

However: if our "randomized surfer" ends up on a dangling node, then he is stuck there for ever?!

One solution: if we are stuck at a dangling node, jump to a random page in the web

The S matrix: for any 0-column (→ dangling node) in A, make all entries = 1/n

Question: Is the matrix S column stochastic?
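A small sketch of forming S from A; the 3-page link matrix below, with one dangling node, is invented purely to show the column replacement (numpy assumed):

```python
# Sketch: turn a column sub-stochastic link matrix A into S by replacing
# each all-zero column (a dangling node) with a uniform column of 1/n.
import numpy as np

def make_S(A):
    S = A.astype(float).copy()
    n = S.shape[0]
    dangling = (S.sum(axis=0) == 0)    # columns that sum to 0 (dangling nodes)
    S[:, dangling] = 1.0 / n           # uniform random jump out of dangling nodes
    return S

# Invented 3-page example: page 3 has no outgoing links (column 3 is all zeros).
A = np.array([
    [0,   1/2, 0],
    [1/2, 0,   0],
    [1/2, 1/2, 0],
])
S = make_S(A)
print(S.sum(axis=0))                   # [1. 1. 1.] -> S is column stochastic
```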

Page Ranking Issues: Other factors influencing rank

Main idea (Google's damping factor):
Allow our randomized surfer to make a purely random jump "occasionally"

How to implement? What is the link matrix of a purely random surfer?

  R = | 1/n  1/n  …  1/n |
      | 1/n  1/n  …  1/n |
      | …                |
      | 1/n  1/n  …  1/n |

Note: R is column stochastic

Consider the Google matrix, G = d S + (1 - d) R, with d ∈ [0, 1]

Note: G is column stochastic (Why?)

Effect of using G: with probability d we follow the randomized surfer S, and with probability (1 - d) we make a purely random jump.
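Continuing the sketch above, the Google matrix can be formed as a weighted combination of S and the uniform matrix R; d = 0.85 is the value quoted on the next slide, and make_S and A come from the previous sketch:

```python
# Sketch: form the Google matrix G = d*S + (1-d)*R, where R is the uniform
# "purely random jump" matrix. Reuses make_S and A from the previous sketch.
import numpy as np

def google_matrix(A, d=0.85):
    n = A.shape[0]
    S = make_S(A)                          # fix dangling-node columns
    R = np.full((n, n), 1.0 / n)           # purely random surfer
    return d * S + (1 - d) * R

G = google_matrix(A)                       # A from the previous sketch
print(np.allclose(G.sum(axis=0), 1.0))     # True: G is column stochastic
```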

Page Ranking Issues: Other factors influencing rank ..

Main idea (Google's damping factor):
Allow our randomized surfer to make a purely random jump "occasionally"

Google matrix: G = d S + (1 - d) R

Google found that using matrix G and damping factor d = 0.85 "works very well" (→ gives a ranking that tends to rank "useful pages" higher)

Page Ranking Issues: Computing the eigenvector

The Rank of page n is the n-th element of the eigenvector, I, of the G matrix for eigenvalue = 1

Size of G: approx. 25 Billion x 25 Billion !!!
  → Using Gaussian elimination would take months on a supercomputer!

Properties of G:
  Large size (25 Billion x 25 Billion)
  Sparse (approx. 10 non-zero entries in a typical non-dangling-node column)
  1 is the largest eigenvalue of the column stochastic matrix G

We can use the Power Method to compute the I-vector

Page Ranking Issues: Computing the eigenvector ..

The Power Method to compute the eigenvector of the dominant eigenvalue:

  Start with an arbitrary rank vector, I_0
  Iterate: I_1 = G I_0, …, I_{k+1} = G I_k
  I_k converges to the eigenvector of the dominant eigenvalue

→ We only need matrix multiplication with a sparse matrix (~10 multiplies per column)

[NOTE: by writing G = d S + (1 - d) R, the effect of R is merely to add a fixed quantity to each row of I → faster computation]
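A short sketch of the power method on the Google matrix; numpy is assumed, G can come from the google_matrix sketch above, and the tolerance and iteration cap are illustrative:

```python
# Sketch: power iteration I_{k+1} = G @ I_k until the rank vector stops changing.
import numpy as np

def pagerank_power(G, tol=1e-10, max_iter=100):
    n = G.shape[0]
    I = np.full(n, 1.0 / n)                 # arbitrary starting rank vector
    for _ in range(max_iter):
        I_next = G @ I                      # one multiplication by G per iteration
        if np.abs(I_next - I).sum() < tol:  # stop once the ranks have converged
            return I_next
        I = I_next
    return I

ranks = pagerank_power(G)                   # G from the google_matrix sketch
print(np.round(ranks, 3))                   # normalized importance scores
```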

Brin and Page:

  Typically, Google needs 50-100 iterations to get a good I-vector

  → Typically, Google's computers take a few days to calculate it

  → Typically, Google updates the rankings once per month (the "Google dance")

Concluding notes

Why is web-search important?
  - For researchers, it is the only practical way to find information on the web
  - For E-commerce companies: higher rank → larger sales (why?)
  → This is the reason for the high stock value of Google, Baidu, …

Current trend:
  - The Page Ranking method dominates Directory-structured search engines

Technology note:
  - Page Ranking: a practical application of abstract Linear Algebra concepts!

References and Further Reading

Books:
  Sherman and Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See, Cyber Age Books, 2001
  [Note: some information in this book is already obsolete]

Web sources:
  1. Googleguide: google search guide
  2. David Austin's article on how Google PageRank works

Next: Auctions, Web-transactions, Ebay