Growing Parallel Paths for Entity-Page Retrieval

Data and Information Systems Laboratory

University of Illinois Urbana-Champaign

Advanced Data Mining

May 4, 2010

Growing Parallel Paths for Entity-Page Retrieval

Tim Weninger, Cindy Xide Lin, and Jiawei Han

Department of Computer Science

University of Illinois Urbana-Champaign, Urbana, IL


Work Submitted to VLDB'10

Problem: Entity Page Retrieval

Given: Reference page

Problem: Entity Page Retrieval

…Can we find entity pages of the same type?

Definitions:

Defn 1: Root-to-link path:
…-href_X contains HTML-TABLE-TR_1-TD-href_X

Defn 2: Parallel links:
Share a root-to-link path, i.e., lists of links.

Defn 3: Intra-page parallel paths:
…-href_C ǁ …-href_B
…-href_C ǁ …-href_X

Definitions:

Defn 4: Inter-page parallel paths:
…-href_C in Page A ǁ …-href_W in Page B

Defn 5: Parallel Web site paths:
Share intra- or inter-page parallel paths across multiple pages.

Properties of Parallel Paths

Prop. 1: Equal Path Length Property:


Parallel paths must contain the same number of pages.

Prop. 2: Parallel Page Property:


Two paths are parallel exactly when their respective pages, tested pairwise, are parallel.

Prop. 3: Equal Page Length Property:


Parallel paths must have the same number of nodes across pages.



Properties of Parallel Paths

Prop. 4: Divergent Path Property:


Parallel Paths can extend through separate pages

Prop. 5: Early Termination Property:


The test of two paths can be terminated at the first occurrence of a dissimilar node.
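One way to read Props. 1, 2, 3, and 5 together is as a nested comparison. The sketch below assumes a Web site path is a list of per-page root-to-link paths, each a tuple of tag names, and that two nodes are similar when their tag names match. The functions `pages_parallel` and `paths_parallel` and the sample paths are hypothetical. Page URLs are not compared at all, which is consistent with Prop. 4: the two paths may run through different pages.

```python
def pages_parallel(page_a, page_b):
    """Compare two root-to-link paths node by node."""
    if len(page_a) != len(page_b):        # Prop. 3: same number of nodes
        return False
    for node_a, node_b in zip(page_a, page_b):
        if node_a != node_b:              # Prop. 5: stop at first mismatch
            return False
    return True


def paths_parallel(path_a, path_b):
    """Compare two Web site paths page by page."""
    if len(path_a) != len(path_b):        # Prop. 1: same number of pages
        return False
    # Prop. 2: the path test is the conjunction of the per-page tests.
    # all() stops at the first page pair that fails.
    return all(pages_parallel(a, b) for a, b in zip(path_a, path_b))


# Hypothetical paths: two pages each, given as tuples of tag names.
example = [("html", "body", "ul", "li", "a"),
           ("html", "body", "table", "tr", "td", "a")]
candidate = [("html", "body", "ul", "li", "a"),
             ("html", "body", "table", "tr", "td", "a")]
other = [("html", "body", "ul", "li", "a"),
         ("html", "body", "div", "a")]
```

Here `paths_parallel(example, candidate)` holds, while `other` fails on its second page because the node counts differ.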



Finding Paths

Naive Method



Can be very costly


Growing Parallel Paths


First find example path


Then grow paths which are in parallel to the example


Repeat with alternate paths


This makes magic happen
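The first two steps above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the site is modeled as a dictionary from each page to its outgoing links, every link carries a precomputed root-to-link path signature, and the example path is found by breadth-first search. The names `find_example_path`, `grow_parallel`, and the toy `site` are hypothetical.

```python
from collections import deque


def find_example_path(site, start, goal):
    """Breadth-first search; returns [(signature, target), ...] or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            break
        for signature, target in site.get(page, []):
            if target not in parents:
                parents[target] = (page, signature)
                queue.append(target)
    if goal not in parents:
        return None
    steps = []
    page = goal
    while parents[page] is not None:      # walk back to the start page
        prev, signature = parents[page]
        steps.append((signature, page))
        page = prev
    steps.reverse()
    return steps


def grow_parallel(site, start, example_steps):
    """At each step, follow every link whose signature matches the example."""
    frontier = {start}
    for signature, _ in example_steps:
        frontier = {target
                    for page in frontier
                    for sig, target in site.get(page, [])
                    if sig == signature}
    return frontier


# Hypothetical site: page -> [(root-to-link signature, target page), ...]
site = {
    "home": [("NAV-A", "people"), ("NAV-A", "courses")],
    "people": [("UL-LI-A", "faculty"), ("UL-LI-A", "staff")],
    "faculty": [("TABLE-TR-TD-A", "prof_a"), ("TABLE-TR-TD-A", "prof_b"),
                ("P-A", "home")],
    "staff": [("DL-DT-A", "admin_a")],
    "courses": [("OL-LI-A", "cs101")],
}
example = find_example_path(site, "home", "prof_a")
entities = grow_parallel(site, "home", example)
```

In this toy site the example path runs home, people, faculty, prof_a, and growing along the same signatures reaches prof_b as well, while the staff and course pages drop out because their links carry different signatures.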


Repeating with alternate paths

k-shortest paths

Do a k-shortest-path search and explore all of these paths.

Removing links

After exploring a path, remove its edges from the graph.
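The link-removal variant can be sketched as a repeated shortest-path search. This is a minimal illustration, assuming an unweighted link graph stored as a dictionary of sets and plain breadth-first search for each round; it is not a true k-shortest-paths algorithm, since removing edges only yields edge-disjoint alternatives. The names `shortest_path`, `alternate_paths`, and the toy graph are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of pages or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for target in sorted(graph.get(page, ())):   # sorted for determinism
            if target not in parents:
                parents[target] = page
                queue.append(target)
    return None


def alternate_paths(graph, start, goal, k):
    """Find up to k paths, deleting each explored path's edges afterwards."""
    graph = {page: set(targets) for page, targets in graph.items()}  # copy
    found = []
    while len(found) < k:
        path = shortest_path(graph, start, goal)
        if path is None:
            break
        found.append(path)
        if len(path) < 2:            # start == goal: nothing left to remove
            break
        for src, dst in zip(path, path[1:]):
            graph[src].discard(dst)
    return found


# Hypothetical link graph with two routes to the same entity page.
graph = {
    "home": {"people", "research"},
    "people": {"faculty"},
    "faculty": {"han"},
    "research": {"data_mining"},
    "data_mining": {"han"},
}
paths = alternate_paths(graph, "home", "han", k=3)
```

With this toy graph the search returns two paths even though three were requested, because after both routes are removed no path from home remains.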


Interpreting the Output

Side Effect of Repeating with Alternate paths


Given: Jiawei Han

Result:
Jiawei Han      40
Cheng Zhai      38
Kevin Chang     38
Dan Roth        32
Vikram Adve      4
Roy Campbell     3
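The slide does not say how these numbers are computed. One plausible reading, stated here purely as an assumption, is that each number counts how many alternate paths reached that page, so pages found by many paths rank higher. The sketch below shows that kind of tally; `rank_by_support` and the toy input are hypothetical and do not reproduce the figures above.

```python
from collections import Counter


def rank_by_support(results_per_path):
    """Count, for each page, how many alternate paths retrieved it."""
    votes = Counter()
    for pages in results_per_path:
        votes.update(set(pages))     # one vote per path, even if repeated
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical output of three alternate paths (not data from the talk).
results_per_path = [
    ["page_a", "page_b", "page_c"],
    ["page_a", "page_b"],
    ["page_a", "page_d"],
]
ranking = rank_by_support(results_per_path)
```

Under this reading, a page reached by only a few paths, like the last two rows above, would be a weaker candidate for the same entity type than one reached by nearly all of them.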







Interpreting the Output

Side Effect of Path Finding


What do the link labels on the path tell us about the entity?

First path: People → Faculty → Jiawei Han → Personal Site

Second path: Research → Data Mining


Experiments

Top 25 CS Departments in US (according to US News)


Find all professors

United States Congress


Find all senators, representatives, and committees

UIUC only


Find all courses


Find all research groups


Baseline



Google’s "find similar" search (essentially TF-IDF-type ranking)

Results

Conclusions and Future Work

Given a reference page and an example entity type, we can retrieve all entity pages of the same type.


Implications:


We can use this for information integration


Search and retrieval can be enhanced


Shortcomings:


Most errors are due to incorrect list finding


Questions?