Cornell Information Science






William Y. Arms

Manuel Calimlim

Lucy Walle

Felix Weigel

January 23, 2007


Research Seminar: The Web Lab

http://weblab.infosci.cornell.edu/

Cornell Information Science

2

The Web Lab: A Joint Project of Cornell
University and the Internet Archive

Faculty

William Arms, Johannes Gehrke, Dan Huttenlocher, Jon Kleinberg,
Michael Macy, David Strang,...

Researchers

Manuel Calimlim, Dave Lifka, Ruth Mitchell, Lucia Walle, Felix
Weigel,...

Students

Selcuk Aya, Pavel Dmitriev, Blazej Kot,
with more than 50 M.Eng. and undergraduate students from
Information Science and Computer Science

Internet Archive

Brewster Kahle, Tracey Jacquith, Michael Stack, Kris Carpenter,...

3

Introduction to the Web Lab

Mining the History of the Web

The Internet Archive's Web Collection



Complete crawls of the Web, every two months since 1996



Total archive is about 110,000,000,000 pages (110 billion)



Recent crawls are about 60+ TByte (compressed)



Total archive is about 1,900 TByte (compressed)


Metadata contains format, links, anchor text

4

The Library Stacks: the Internet Archive

5

The Wayback Machine


Demo:

http://www.archive.org/

6

Research using Metadata about Web Pages

Current NSF grant

Research using anchor text



links to microsoft.com and google.com

Changes to the link structure of the Web




differences between crawls



densification (increases in average node degree)

Formation of online groups

7

Example of Past Work: Social and Information
Networks, Joining a Community


Close to one billion (user, community) instances






Work by: Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and
Xiangyang Lan

8

The Never-ending Research Dialog

RESEARCHER: "Here's an analysis we would like to do..."

INFORMATION SCIENTIST: "We don't know how to do that analysis. Would this be any use to you?"

RESEARCHER: "Not as you suggest it, but here's another idea..."

INFORMATION SCIENTIST: "That might be possible, with the following modification..."

Both: "Let's try it and see."

9

The Role of Web Data for Social Science Research

Social networks are an important research topic


Emergence of global phenomena from local effects


Viral spreading of rumors


Behavior of individuals in a community


Roles in discussion threads, herd behavior in opinion polls


Network structure and dynamics


Strength of weak ties, triangle relations, homophily

10

How to Observe a Social Network?



Social network research before the web


Talk to people, make notes


Distribute questionnaires, gather statistics



Problems with this approach


Tedious task


Small scale



The Internet Archive is a great resource for research


Contains web pages with social networks


Records the history of the pages

11

Social Networks on the Web

The web contains many social networks


Sites for social networking, social bookmarking, file sharing


MySpace, Facebook, Flickr, Delicious


Community portals


Yahoo Groups, DBLife


Encyclopedia and folksonomy projects


Wikipedia, Wikia


Review sites and customer comments


Amazon, Netflix


Blogs, web forums, Usenet


12

The Bliss and Curse of Digital Data

Opportunities


Collecting network data at an unprecedented scale


Verifying hypotheses in many different networks


Monitoring communities at a finer granularity


Mining and searching social networks

Challenges


Finding suitable information on the web


Extracting information from web pages


Making web data persistent


Processing very large data sets


Access rights and privacy


13

Web Lab and Social Science Research



Collaboration with Cornell’s Institute for the Social Sciences



Our goal: Make data available to researchers


Large web graph database with multiple crawls


Packaged subsets of crawls for analysis


Visual extraction tool for creating new data sets (ongoing)


Small-scale crawling for adding new web sites (starting)


Full-text indexing (planned)


Demo of the extraction tool available at

http://www.cs.cornell.edu/~weigel/WrapperDemo/

14

Web Data Extraction


Researchers often care not about whole web pages, but about specific
substructures inside the pages


Blog postings


Web forums


Social tagging


News headlines


Tables of content


Bibliographies


Product details


Customer reviews

15

Web Data Collaboration Server

Data extraction



Writing extraction code is a tedious task



Create tools to make the data easily accessible in a structured
format (e.g., tables in a database); a rough sketch of such an
extraction rule follows this list

Data sharing



Extracting the same data repeatedly is a waste of time and storage
space



Let users share their data and extraction rules

Data curation



Web data is often incomplete and erroneous



Let users collaborate to correct and complete the data
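
To give a sense of what such an extraction rule might look like, here is a rough Python sketch that pulls a hypothetical blog-posting structure out of a page and stores it as rows in a database table. The rule format, field names, and table layout are all illustrative assumptions; the Web Lab's visual extraction tool generates its own wrappers.

```python
import re
import sqlite3

# Hypothetical extraction rule: a regular expression whose named groups become
# columns of a result table. The Web Lab's wrappers are produced by a visual
# tool and need not be regex-based; this only illustrates turning a page
# substructure into database rows.
POSTING_RULE = re.compile(
    r'<div class="post">\s*<h2>(?P<title>.*?)</h2>\s*'
    r'<span class="author">(?P<author>.*?)</span>',
    re.DOTALL,
)

def extract_postings(html: str):
    """Apply the rule to one page and return a list of {title, author} dicts."""
    return [m.groupdict() for m in POSTING_RULE.finditer(html)]

def store_postings(db: sqlite3.Connection, page_url: str, html: str):
    """Write the extracted substructures into a shared, structured table."""
    db.execute("CREATE TABLE IF NOT EXISTS posting (page_url TEXT, title TEXT, author TEXT)")
    db.executemany(
        "INSERT INTO posting VALUES (?, ?, ?)",
        [(page_url, p["title"], p["author"]) for p in extract_postings(html)],
    )
```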

16

Demonstration





Demo of the extraction tool available at

http://www.cs.cornell.edu/~weigel/WrapperDemo/

17

The Web Lab System

[System diagram: at the INTERNET ARCHIVE, the Wayback Machine, text indexes, and the Web Collection; at CORNELL UNIVERSITY, a file server, a computer cluster, text indexes, a page store, a structure database, and national supercomputers.]

18

Technical Processing: the Web Lab




Networking: Internet 2, National Lambda Rail

Wayback Machine: Commodity computers with local file systems

Structure database: Relational database system on large shared memory computer

Data analysis: Specialized Linux cluster with Hadoop distributed file system and MapReduce programming


Different types of computer for different functions

19

The Research Process

Select a subset for analysis



SQL query the relational database directly



Use the GetPages tool on the Web site to send an SQL query

Download the subset



To the researcher's computer



To the Web Lab file server

Clean up the data



MapReduce tasks on the Hadoop cluster

Data analysis



MapReduce tasks on the Hadoop cluster

20

Selection Methods

By known identifier (Wayback Machine)


web pages with the URL http://www.nsf.gov/

By character string (full text indexing) -- future


all pages containing "Internet is doubling every six months"


all pages containing the SARS-CoV genetic sequence

By metadata criteria


all web pages that link to microsoft.com but not to google.com


all email addresses that I used to receive mail from but have not had
mail from recently*


* Example provided by Marc Smith

21

Benefits of Using a Relational Database



Simple query language for retrieving data



Transaction support



Concurrency control for parallel queries



Multiple indices for high performance



Reliability since databases have built-in recovery functionality

22

Metadata Loading



The crawler outputs compressed metadata files (DAT
files).



Each DAT file has a set of crawled pages with page
metadata, including things like crawl time, IP address,
mime type, language encoding, etc.



Most importantly, the outgoing links from each page are
parsed, including the full URL and associated anchor
text.
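
As a rough sketch, the per-page metadata described above could be represented with record types like the following; the field names are illustrative only and do not reflect the actual DAT file format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OutLink:
    url: str           # full URL of the link target, as parsed from the page
    anchor_text: str   # text of the anchor carrying the link

@dataclass
class PageMetadata:
    url: str
    crawl_time: str    # when the page was fetched
    ip_address: str    # server IP address at crawl time
    mime_type: str     # e.g. "text/html"
    encoding: str      # language / character encoding
    outlinks: List[OutLink] = field(default_factory=list)
```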

23

Database Schema


Crawl -- Name of the crawl from which data is loaded

Page -- Metadata about each webpage plus fields to help find and extract the full html text

Link -- The outgoing links from crawled pages

Url -- Lookup table for unique URLs

Host -- Lookup table for unique hostnames
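
To make the earlier selection methods concrete, here is a sketch of how a metadata query against this schema might be issued from Python. The table joins and column names (page_id, url_id, host_id, hostname, and so on) are assumptions for illustration; the Web Lab's actual schema and database driver may differ.

```python
import sqlite3  # stand-in for the lab's relational database; any DB-API driver works the same way

# Hypothetical selection: pages that link to microsoft.com but not to google.com.
QUERY = """
SELECT DISTINCT p.page_id
FROM   Page p
JOIN   Link l1 ON l1.from_page_id = p.page_id
JOIN   Url  u1 ON u1.url_id = l1.to_url_id
JOIN   Host h1 ON h1.host_id = u1.host_id
WHERE  h1.hostname = 'microsoft.com'
AND    p.page_id NOT IN (
         SELECT l2.from_page_id
         FROM   Link l2
         JOIN   Url  u2 ON u2.url_id = l2.to_url_id
         JOIN   Host h2 ON h2.host_id = u2.host_id
         WHERE  h2.hostname = 'google.com'
       )
"""

def select_pages(db_path: str):
    """Run the selection query and return the matching page ids."""
    with sqlite3.connect(db_path) as conn:
        return [row[0] for row in conn.execute(QUERY)]
```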

24

Crawls Loaded Into SQL DB

Crawl    Period                           Database size  Pages        Links        Urls         Hosts
DJ       Jan-April 2002                   2.5 TB         1.1 billion  26 billion   250 million  16 million
DV       Jan-April 2004                   15 TB          1.3 billion  110 billion  TBD          TBD
EB       Jan-March 2005                   20 TB          3 billion    130 billion  20 billion   380 million
Amazon   Jan-April 2004, Jan-August 2005  570 GB         40 million   3 billion    35 million   356
Cornell  Jan-April 2002, Jan-April 2004   5 GB           800,000      12 million   750,000      40,000

25

Selection from the Database



SQL query the relational database directly



(Contact Manuel Calimlim)



Use the GetPages tool on the Web site to send an SQL query -- work in progress






26

Demonstration

Demonstration of the Web Lab web site


http://weblab.infosci.cornell.edu/

and the GetPages tool

27

Massive Data Analysis by Non-Specialists

A typical scientist or social scientist:



Has deep domain knowledge



Has good algorithmic understanding



Is often a competent computer user or has a research assistant
who is familiar with languages such as Fortran, Python, and
Matlab, or applications packages such as SAS and Excel.

But...



Has limited understanding of large-scale data analysis



Is not skilled at any form of computing that requires parallelism
or concurrency

Typical problem of scale:

Given 100 billion URLs, how do you identify duplicates?
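
As a minimal sketch of how a problem at this scale becomes approachable, the duplicate-URL question can be phrased as one map step and one reduce step. The functions below use a Hadoop Streaming style of plain lines on stdin; the single-machine driver at the bottom is only for demonstration and is not how the Web Lab cluster runs the job.

```python
import sys
from itertools import groupby

def map_urls(lines):
    """Map step: emit each raw URL as a key with a count of 1.
    Hadoop spreads these input lines across many map tasks."""
    for line in lines:
        url = line.strip()
        if url:
            yield url, 1

def reduce_urls(url, counts):
    """Reduce step: all counts for one URL arrive together, so any URL whose
    counts sum to more than 1 is a duplicate."""
    total = sum(counts)
    if total > 1:
        yield url, total

if __name__ == "__main__":
    # Tiny single-machine demonstration of the same logic.
    pairs = sorted(map_urls(sys.stdin))
    for url, group in groupby(pairs, key=lambda kv: kv[0]):
        for dup_url, count in reduce_urls(url, (c for _, c in group)):
            print(f"{dup_url}\t{count}")
```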



28

Hadoop and MapReduce Programming




Hadoop

An open source distributed file system similar to the Google File System. It supports MapReduce programming.


http://lucene.apache.org/hadoop/

MapReduce

A functional programming style to support large-scale data analysis without the need for global data structures.


In the 1960s, Fortran gave scientists a simple way to
translate mathematical problems into efficient computer
codes.


MapReduce programming gives researchers a simple way
to run massive data analysis on large computer clusters.

29

The MapReduce Paradigm

[Diagram: input data split into files (split 0 ... split 4) feeds M map tasks; each map task writes an intermediate file divided into R partitions; each of the R reduce tasks reads one partition from every intermediate file and writes one output file (Output 0, Output 1).]

30

A Web Graph Example

[Figure: a small example web graph with six numbered nodes.]

31

Building the Web Graph

URLs, pages, and links:




URLs contained in Web pages may link to pages never crawled



URLs not canonicalized: different URLs may refer to same page



Links are from a page to a URL

Web graph from crawl data:



Nodes are union of pages crawled and URLs seen



Each node and edge has time interval(s) over which it exists


32

Web Graph Example

Problem:

Given a set of URL pairs in uncanonicalized form (u0, v0), create a list of all the edges that point to each node of the web graph:

Replace each u0 or v0 with its canonicalized form u or v.

Create a list of all nodes of the graph, i.e., the set of unique u.

Discard all (u, v) pairs where u = v, or v is not a node of the graph.

Discard all duplicate edges.

For each node v, create a list (v, {u}), where {u} is the set of nodes that have edges to node v.

Each step is a simple programming task for a small number of links on a single computer. How can this simplicity be retained with huge numbers of links on a very large computer cluster?
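
For comparison, the single-computer version of these steps really is short. A rough Python sketch, where canonicalize() stands in for whatever URL normalization is actually used:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Illustrative canonicalization only: lower-case scheme and host, drop fragments."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def incoming_edge_lists(raw_pairs):
    """raw_pairs: iterable of uncanonicalized (u0, v0) URL pairs.
    Returns {v: {u, ...}} for every node v of the graph."""
    pairs = [(canonicalize(u0), canonicalize(v0)) for u0, v0 in raw_pairs]
    nodes = {u for u, _ in pairs}                      # the set of unique u
    edges = {(u, v) for u, v in pairs                  # sets discard duplicate edges
             if u != v and v in nodes}                 # drop self-links and non-nodes
    incoming = {}
    for u, v in edges:
        incoming.setdefault(v, set()).add(u)           # build the (v, {u}) lists
    return incoming
```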

33

MapReduce Example

Map task

Input:  (u0, v0)

Output: (u, d)    // Indicate that u is a from-URL
        (v, u)    // Indicate that v is a to-URL with link from u

d is a dummy marker. Do not output if u = v.

This is simple application code to write.
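
A minimal sketch of that map task in Python, written in the style of a Hadoop Streaming mapper (one tab-separated key-value pair per output line). The dummy marker and the canonicalize() helper are illustrative assumptions, not the Web Lab's actual code.

```python
import sys

DUMMY = "#"   # illustrative dummy marker d

def canonicalize(url: str) -> str:
    # Stand-in for the real URL normalization step.
    return url.strip().lower().rstrip("/")

def map_task(lines):
    """Each input line holds one uncanonicalized pair: u0 <tab> v0."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue
        u, v = canonicalize(parts[0]), canonicalize(parts[1])
        if u == v:
            continue          # do not output self-links
        yield u, DUMMY        # u is a from-URL, i.e. a node of the graph
        yield v, u            # v is a to-URL with a link from u

if __name__ == "__main__":
    for key, value in map_task(sys.stdin):
        print(f"{key}\t{value}")
```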


34

A MapReduce Example

Merge

The input to the reduce process merges the output values from the map task that correspond to each URL. For each URL, w, it creates a list:

    w, {d, ..., d, u1, ..., uk}

This merge is performed automatically by the system libraries.

35

A MapReduce Example

Reduce

Input: w, {d, ..., d, u1, ..., uk}, where w is any URL.

Output:

If there is no marker d in the list, discard and do not output. This corresponds to a URL that never appears as the first element of a (u, v) pair, i.e., a URL that is not a node of the graph.

Otherwise remove duplicates from u1, ..., uk and output.

The output is a to-URL and a list of the nodes that link to it:

    v, {u1, ..., uk}

This is simple application code to write.
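
A matching sketch of the reduce task, again in Hadoop Streaming style. It assumes the framework has already grouped and sorted the mapper output by key, as described on the merge slide; the marker value is the same illustrative assumption as in the mapper.

```python
import sys
from itertools import groupby

DUMMY = "#"   # must match the marker emitted by the map task

def reduce_task(sorted_lines):
    """sorted_lines: mapper output lines 'key<TAB>value', sorted by key,
    so all values for one URL w arrive together."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for w, group in groupby(parsed, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        if DUMMY not in values:
            continue                                         # w is not a node of the graph
        sources = sorted({v for v in values if v != DUMMY})  # remove duplicate from-URLs
        if sources:
            yield w, sources                                 # the to-URL and its in-links

if __name__ == "__main__":
    for to_url, from_urls in reduce_task(sys.stdin):
        print(f"{to_url}\t{','.join(from_urls)}")
```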


36

For the Future:

Examples of Tools and Services


The Web Lab is steadily building a set of tools for researchers



API and Web services



GetPages Web forms to select dataset by query of a relational
database with indexes by date, URL, domain name, file type,
anchor text, etc.



Focused Web crawling (modification of Heritrix crawler)



Extraction of Web graph from subset and calculations, e.g.,
PageRank, hubs and authorities (a minimal PageRank sketch follows this list)



Graph visualization



Natural language processing of anchor text
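
As a pointer to what the planned web-graph calculations involve, here is a minimal power-iteration PageRank sketch over (v, {u}) incoming-edge lists such as those produced by the reduce step above. The damping factor and iteration count are conventional defaults, not Web Lab choices, and a web-scale version would itself run as a series of MapReduce jobs.

```python
def pagerank(incoming, damping=0.85, iterations=50):
    """incoming: dict mapping each node v to the set of nodes u that link to v.
    Returns a dict of PageRank scores. Dangling nodes are ignored for simplicity."""
    nodes = set(incoming) | {u for us in incoming.values() for u in us}
    n = len(nodes)

    out_degree = {node: 0 for node in nodes}
    for v, sources in incoming.items():
        for u in sources:
            out_degree[u] += 1

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for v in nodes:
            mass = sum(rank[u] / out_degree[u]
                       for u in incoming.get(v, ()) if out_degree[u])
            new_rank[v] = (1 - damping) / n + damping * mass
        rank = new_rank
    return rank
```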

37

The Web Lab is Ready for Use

We are ready to work with a number of researchers:

Systems


Relational database operational


Hadoop pilot cluster (large cluster soon)


File server and web server operational

People


Manuel Calimlim (database)


Lucy Walle (Hadoop + MapReduce)

Tools


A variety of tools in prototype


Experience with large volumes of anchor text and URLs

38

Thanks

This work would not be possible without the forethought and
long-standing commitment of Brewster Kahle and the Internet
Archive to capture and preserve the content of the Web for
future generations.

This work has been funded in part by the National Science
Foundation, grants CNS-0403340, DUE-0127308, SES-0537606,
IIS-0634677, and IIS-0705774.





