WDB


18 Nov 2013



Part II: Large-Scale Web Database Integration Systems

Definitions


Web database (database search engine): Web-accessible database (WDB)


Characteristics:


Data are structured and are stored in database systems.


Data are accessible through a Web search interface.


Result pages are dynamically generated by wrapping data
in HTML files.


Web database integration: the process of enabling unified access to multiple Web databases in the same application domain.

An Example Web Database

More Examples

WDB Integration System vs. MSE


Major differences between Web databases and
regular document search engines (DSE):


DSE searches Web pages while WDB searches database
entities.


WDB usually has a complex interface while DSE usually
has a simple interface.


DSE ranks results by similarity while WDB usually ranks
results by some attribute values.

WDB Integration System Architecture





[Figure: architecture diagram with two modules.
Integrated Interface Generation Module: Web Database Discovery (from the World Wide Web), WDB List, WDB Interface Schema Extraction, WDB Clustering By Domain (WDB Cluster 1 ... WDB Cluster n), Interface Integration (Integrated Interface 1 ... Integrated Interface n).
Query Processing Module: User query, Integrated Interface, Database Selection, Query Translation and Dispatch, WDB 1 ... WDB m, Result Extraction, Result Annotation, Entity Identification, Domain Mapping, Result Merging, Result.]

Main Technical Problems


WDB Search Interface Modeling


WDB Search Interface Extraction


WDB Search Interface Clustering


WDB Search Interface Integration


Global Query Mapping and Optimization


Search Result Extraction and Annotation


Online Entity Identification


Remaining Research Challenges

A Related Book


Eduard Dragut, Weiyi Meng, Clement Yu. Deep Web Query Interface Understanding and Integration. Morgan & Claypool Publishers, June 2012.


Table of Contents


Introduction


Query Interface Representation and Extraction


Query Interface Clustering and Categorization


Query Interface Matching


Query Interface Attribute Integration


Query Interface Integration


Summary and Future Research



WDB Query Interface Modeling

Problem: Represent the information on each interface in a format that is suitable for integration and query submission.

An Example WDB Interface

An attribute

WDB Interface Modeling

Different models have been proposed:

WISE Three-Level Model: site-level, attribute-level, and element-level.

Hierarchical Model: A search interface is modeled as an ordered tree of elements.

The hierarchical model is designed to capture the order semantics and the nested grouping of the attributes in an interface.

Querying Capability Model: Formally characterize what kinds of queries are valid for a search interface.

Hierarchical Model: An Example

aa.com
  1. Where Do You Want to Go?
    origin (From: City or Airport Code)
    destination (To: City or Airport Code)
  2. When Do You Want to Go?
    Departure Date
      depMonth, depDay, depTime
    Return Date
      retMonth, retDay, retTime
  3. Number of Passengers
    numAdult (Adults)
    numChild (Children)
  4. What are Your Service Preferences?
    cabinClass (Class of Service)
    maxiumStops (Number of Connections)
  5. Choose a Carrier
    carrier
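
The ordered-tree idea can be illustrated with a small sketch. The `Node` class and the `leaves` helper below are hypothetical names chosen for this illustration, not part of any published model implementation; only the first three groups of the aa.com example are encoded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of the ordered interface tree (hypothetical structure)."""
    label: str                      # text shown to the user, may be empty
    name: Optional[str] = None      # internal field name; None for group nodes
    children: List["Node"] = field(default_factory=list)  # order is significant


def leaves(node: Node) -> List[str]:
    """Return the field names in left-to-right order (the order semantics)."""
    if not node.children:
        return [node.name] if node.name else []
    names: List[str] = []
    for child in node.children:
        names.extend(leaves(child))
    return names


aa = Node("aa.com", children=[
    Node("1. Where Do You Want to Go?", children=[
        Node("From: City or Airport Code", "origin"),
        Node("To: City or Airport Code", "destination"),
    ]),
    Node("2. When Do You Want to Go?", children=[
        Node("Departure Date", children=[
            Node("", "depMonth"), Node("", "depDay"), Node("", "depTime"),
        ]),
        Node("Return Date", children=[
            Node("", "retMonth"), Node("", "retDay"), Node("", "retTime"),
        ]),
    ]),
    Node("3. Number of Passengers", children=[
        Node("Adults", "numAdult"),
        Node("Children", "numChild"),
    ]),
])

print(leaves(aa))
```

Internal nodes carry the nested grouping, and the child lists carry the sibling order, which are the two properties the hierarchical model is meant to capture.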

Query Interface Extraction


Automatic interface extraction: Automatically extract information described in an interface representation model from any given WDB interface.


Primarily two tasks:


Attribute extraction



Extract elements and labels from the interface.


Group elements and labels into logical attributes.


Attribute analysis


Extract and derive meta-information about each attribute based on the interface representation model.
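
A minimal sketch of the attribute extraction step, using Python's standard `html.parser`. It assumes the simplest possible case, in which every form element has an `id` and its label is an explicit `<label for=...>` tag; real interfaces often lack such markup, which is why extraction is a research problem. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class FormExtractor(HTMLParser):
    """Collect <label for=...> texts and form elements keyed by element id."""

    def __init__(self):
        super().__init__()
        self.labels = {}        # element id -> label text
        self.elements = {}      # element id -> (tag, name attribute)
        self._label_for = None  # id targeted by the label currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label":
            self._label_for = a.get("for")
        elif tag in ("input", "select", "textarea") and "id" in a:
            self.elements[a["id"]] = (tag, a.get("name"))

    def handle_data(self, data):
        if self._label_for and data.strip():
            self.labels[self._label_for] = data.strip()

    def handle_endtag(self, tag):
        if tag == "label":
            self._label_for = None


def extract_attributes(html):
    """Group each element with its label into one logical attribute."""
    parser = FormExtractor()
    parser.feed(html)
    return [
        {"label": parser.labels.get(eid, ""), "tag": tag, "name": name}
        for eid, (tag, name) in parser.elements.items()
    ]


sample = """
<form>
  <label for="o">From: City or Airport Code</label><input id="o" name="origin">
  <label for="d">To: City or Airport Code</label><input id="d" name="destination">
</form>
"""
attributes = extract_attributes(sample)
print(attributes)
```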

WDB Query Interface Clustering

Objective: Group WDBs into different clusters such that all WDBs in the same cluster are related to the same domain (e.g., sell the same type of products).

Techniques:

1. First, construct a concept hierarchy.

2. Then apply one of the following techniques:


Supervised clustering (training required)


Unsupervised clustering (no training required)
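
As a rough illustration of the unsupervised option, the sketch below groups interfaces by the Jaccard similarity of their label terms using a simple single-pass threshold clustering. This is a stand-in technique chosen for brevity, not the method from the slides: it skips the concept hierarchy step, and the threshold value and sample interfaces are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_interfaces(interfaces, threshold=0.3):
    """Assign each interface to the most similar existing cluster,
    or start a new cluster if no similarity reaches the threshold."""
    clusters = []  # each cluster: {"terms": set of terms, "members": [names]}
    for name, terms in interfaces.items():
        best, best_sim = None, threshold
        for c in clusters:
            sim = jaccard(terms, c["terms"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"terms": set(terms), "members": [name]})
        else:
            best["members"].append(name)
            best["terms"] |= set(terms)
    return [c["members"] for c in clusters]


interfaces = {
    "air1": {"from", "to", "departure", "return", "adults"},
    "air2": {"from", "to", "departure", "passengers"},
    "book1": {"title", "author", "isbn"},
    "book2": {"title", "author", "publisher"},
}
clusters = cluster_interfaces(interfaces)
print(clusters)
```

Because no number of clusters is fixed in advance, a threshold scheme like this can open new clusters as new domains appear, at the price of being sensitive to the input order and the threshold.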

Query Interface Integration


It is related to database schema integration.


Schema integration has been studied since the 1980s.


Based on different data models: ER model, relational model, object-oriented model, etc.


In different contexts: a single database during database design, or multiple databases in multidatabase/data warehouse systems.


Key issues: resolving name conflicts, data type conflicts, structural conflicts, data inconsistency, etc.


Manual approach: Integration rules are manually written.

Schema Integration vs. Interface Integration

Comparing WDB interface integration and database schema
integration.


WDB interface schema is simpler (one table/view versus
multiple tables of a database schema).


Attributes in WDB interface are more complex as they may
consist of multiple elements.


WDB interface mixes attributes and query conditions while a database schema does not.


Metadata needs to be extracted from a WDB interface while it is readily available in a database schema.


WDB interface integration needs to integrate element format, attribute layout and external values while database schema integration does not.

Attribute Matching

A key problem in schema/interface integration is to match
attributes from different schemas/interfaces.

A general framework for attribute matching [Rahm and Bernstein, VLDB Journal 2001].


Develop a number of matchers based on different information.


Dictionary-level information: attribute names


Schema-level information: data type, key, foreign key, …


Instance-level data: values of attributes


Utilize auxiliary information: Special dictionaries, thesaurus, user-input, …
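
The idea of combining several matchers can be sketched as follows. The three matcher functions, the attribute dictionaries, and the weights are all illustrative assumptions; the weighted sum is only one possible way to combine matcher scores.

```python
from difflib import SequenceMatcher


def name_matcher(a, b):
    """Dictionary-level: string similarity of the attribute names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a, b):
    """Schema-level: 1.0 if the declared data types agree, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def instance_matcher(a, b):
    """Instance-level: Jaccard overlap of the attribute values."""
    va, vb = set(a["values"]), set(b["values"])
    return len(va & vb) / len(va | vb) if va | vb else 0.0


def match_score(a, b, weights=(0.4, 0.2, 0.4)):
    """Weighted combination of the individual matcher scores."""
    scores = (name_matcher(a, b), type_matcher(a, b), instance_matcher(a, b))
    return sum(w * s for w, s in zip(weights, scores))


service = {"name": "Class of Service", "type": "enum",
           "values": ["Economy", "Business", "First"]}
cabin = {"name": "Cabin", "type": "enum", "values": ["Economy", "Business"]}
adults = {"name": "Adults", "type": "int", "values": ["1", "2", "3"]}

print(match_score(service, cabin), match_score(service, adults))
```

Here the names "Class of Service" and "Cabin" share little text, so the instance-level and schema-level matchers are what pull the two attributes together.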

Attribute Integration


After attribute matching, attributes are divided
into clusters such that each cluster corresponds to
a global attribute in the integrated interface.

Remaining issues:

1. Determine the name of the global attribute for each cluster.

2. Determine the domain type of each global attribute. The domain type will determine the format.

3. Determine the external values of each global attribute.
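
A minimal sketch of the three remaining steps for one cluster. The selection rules are assumptions made for illustration: the global name is the most frequent label, the external values are the ordered union of the local values, and the domain type is "finite" whenever any external values exist.

```python
from collections import Counter


def integrate_cluster(cluster):
    """Derive a global attribute from a cluster of matched local attributes."""
    # Step 1: name = most frequent local label (an assumed heuristic).
    names = Counter(attr["label"] for attr in cluster)
    global_name = names.most_common(1)[0][0]

    # Step 3: external values = union of local values, first-seen order.
    values = []
    for attr in cluster:
        for v in attr.get("values", []):
            if v not in values:
                values.append(v)

    # Step 2: domain type decides the format (selection list vs. text box).
    domain_type = "finite" if values else "free text"
    return {"name": global_name, "domain_type": domain_type, "values": values}


cluster = [
    {"label": "Class of Service", "values": ["Economy", "Business"]},
    {"label": "Class of Service", "values": ["Economy", "First Class"]},
    {"label": "Cabin", "values": []},
]
global_attr = integrate_cluster(cluster)
print(global_attr)
```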

Hierarchical Interface Integration (1)

An example of hierarchical schema representation

Interface:

1. Where Do You Want to Go?
   From: City    To: City
2. When Do You Want to Go?
   Departure Date: Jan 1 1am
   Return Date: Jan 1 1am
3. Number of Passengers?
   Adults: 1    Children: 0
4. Class of Service
   Economy, Business, First Class

Schema tree:

Root
  Where …
    From
    To
  When …
    Departure
      Dmonth, Dday, Dtime
    Return
      Rmonth, Rday, Rtime
  Number …
    Adult
    ……
  Class …

Siblings are ordered!

Hierarchical Interface Integration (2)

Simple mapping versus complex mapping


Simple mapping: 1-to-1 mapping between two fields


Complex mapping: 1-to-m mapping between one field in one interface and multiple fields in another interface

Examples of 1-to-m mappings

departure date / from date: month, day, year

No. of passengers / passengers: adults, children
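
A small sketch of how such complex mappings might be applied during query translation. The function names and the target field names `month`, `day`, and `year` are hypothetical. Note the asymmetry: a date can be split into its parts, but a passenger total cannot be split into adults and children without extra information, so only the merging direction is shown for passengers.

```python
from datetime import date


def split_date(value: date) -> dict:
    """1-to-m: one date field mapped to separate month/day/year fields."""
    return {"month": value.month, "day": value.day, "year": value.year}


def merge_passengers(adults: int, children: int) -> int:
    """m-to-1: adults and children mapped to a single passengers field."""
    return adults + children


local = split_date(date(2013, 11, 18))
print(local, merge_passengers(2, 1))
```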

Hierarchical Interface Integration (3)

[Figure: two interface trees to be merged.
American Express node labels: Please tell us about yourself, Please tell us about your employment, Company, address, State, City, Street, Occupation, Phone.
Chase node labels: Please tell us about your employment, Address, Country, State, Years there.]

How to merge?

Tree Merging

Hierarchical Interface Integration (4)



Grouping Constraint: Given subgroups in different user interfaces, is it possible to find a group such that all elements in each subgroup are in adjacent locations?

Example: The following example satisfies this requirement:

{state, city, street} and {country, state} can be merged into {country, state, city, street}
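
The grouping constraint can be checked by brute force for small groups, as in the sketch below. It tries every ordering of the merged elements and accepts the first one in which each subgroup occupies adjacent positions. The function names are hypothetical, and the running time grows factorially with the number of fields, so this is only meant to make the definition concrete.

```python
from itertools import permutations


def is_contiguous(order, group):
    """True if all members of `group` occupy adjacent positions in `order`."""
    positions = sorted(order.index(x) for x in group)
    return positions[-1] - positions[0] + 1 == len(positions)


def find_grouping(subgroups):
    """Return an ordering that keeps every subgroup adjacent, or None."""
    elements = sorted(set().union(*subgroups))
    for order in permutations(elements):
        if all(is_contiguous(order, g) for g in subgroups):
            return list(order)
    return None


order = find_grouping([{"state", "city", "street"}, {"country", "state"}])
print(order)

# Three pairwise-overlapping pairs cannot all be adjacent in a line.
impossible = find_grouping([{"a", "b"}, {"b", "c"}, {"a", "c"}])
print(impossible)
```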

Hierarchical Interface Integration (5)

[Figure: the American Express and Chase interface trees from the previous slide, followed by the integrated tree. Integrated tree node labels: Please tell us about yourself, Please tell us about your employment, Occupation, Phone, Years there, address, Country, State, City, Street.]

Preserving ancestor-descendant relationships

Hierarchical Interface Integration (6)

Naming attributes



Group Naming Compatibility: Names of attributes within a group in a user interface should be compatible.

Example of compatible naming: {adults, children} and {adults, infants} give {adults, children, infants}

Example of incompatible naming: {adults, children} and {#children, #infants} give {adults, children, #infants}

Search Result Annotation

Goal: Identify the semantic meaning of each piece of information within each search result record (SRR).


Before result annotation, SRRs on the result pages returned
from search engines need to be extracted first.


Some approaches combine result extraction and result
annotation in one step.

Data annotation is needed for


Comparison-shopping applications: entity identification, result merging, …


Deep Web crawling and data collection

Result Annotation: Problem Description

[Figure: a sample search result record with annotated parts, including title and authors.]

Entity Identification


Problem: Automatically derive rules to determine if two search result records from different WDBs are in fact the same entity (product).


Entity identification is closely related to entity matching, entity resolution, duplicate detection, and record linkage.


It is a classical problem in federated systems that
deal with data from multiple sources.
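
To make the problem concrete, here is a sketch of one hand-written matching rule for book records. The research goal stated above is to derive such rules automatically; this example only shows what a rule might look like. The record fields, the ISBN shortcut, and the 0.85 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation so superficial differences vanish."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def same_entity(r1, r2, threshold=0.85):
    """Decide whether two search result records denote the same book."""
    # A shared identifier, when both records have one, settles the question.
    if r1.get("isbn") and r2.get("isbn"):
        return r1["isbn"] == r2["isbn"]
    # Otherwise fall back to title similarity.
    sim = SequenceMatcher(
        None, normalize(r1["title"]), normalize(r2["title"])
    ).ratio()
    return sim >= threshold


a = {"title": "Deep Web Query Interface Understanding and Integration"}
b = {"title": "Deep Web Query Interface Understanding & Integration."}
c = {"title": "Database Systems"}

print(same_entity(a, b), same_entity(a, c))
```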

Remaining Research Challenges (1)

1. Automatic WDB discovery

Goal: Discover Web database interfaces from the Web
automatically.

Some issues to consider:


How to identify web pages that have a search interface?


There is already some existing work on this.


How to differentiate search interfaces for Web databases
from those for text search engines?


Is the information from the search interface sufficient? Do
we need information from search results?


How to learn a classifier?

Remaining Research Challenges (2)

2. Extraction and understanding of dynamic query interfaces


An increasing number of query interfaces are dynamic in the sense that the query interface may change after certain fields are selected. Two types of dynamic changes have been observed:


A change in the values of some fields (e.g., the values in a selection list).


A change in the structure of the query interface (e.g., some fields are added, deleted or modified).


Current query interface models do not consider dynamic query interfaces.

Remaining Research Challenges (3)

3. Handling boundary query interfaces in Web-scale clustering


There are two challenges in Web-scale clustering of query interfaces [Madhavan et al., 2007; Mahmoud and Aboulnaga, 2010]:


The number of domains is unknown in advance, which means that the number of clusters is unknown in advance.


There are likely many query interfaces with unclear domains, i.e., they lie on the boundaries between multiple domains.


Current solutions are not sufficiently accurate and leave significant room for improvement.

Remaining Research Challenges (4)

4. Web database selection

Goal: For any given user query, identify the Web
databases that are most likely to return good results.

Some issues to consider:


How to summarize the content of a Web database?


Numerical attributes


Categorical attributes


Textual attributes


Relationships among the attributes

Remaining Research Challenges (5)

Web database selection (continued)


How to obtain the summaries automatically?


How to design sample queries for each type of attribute?


How to use the summaries to do Web database selection?


How to measure usefulness based on different types of attributes?


How to combine usefulness across different attributes?
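
One possible way to picture summaries and usefulness is sketched below. Everything here is an assumption made for illustration: a numerical attribute is summarized by its (min, max) range, a categorical attribute by its value set, the per-attribute usefulness is 0 or 1, and usefulness is combined across attributes by a plain average.

```python
def condition_usefulness(summary, attr, value):
    """Usefulness of one database for one query condition (0.0 or 1.0)."""
    s = summary.get(attr)
    if s is None:
        return 0.0
    if isinstance(s, tuple):            # numerical attribute: (min, max)
        low, high = s
        return 1.0 if low <= value <= high else 0.0
    return 1.0 if value in s else 0.0   # categorical attribute: set of values


def rank_databases(summaries, query):
    """Rank databases by average usefulness over all query conditions."""
    scores = {}
    for db, summary in summaries.items():
        parts = [condition_usefulness(summary, a, v) for a, v in query.items()]
        scores[db] = sum(parts) / len(parts) if parts else 0.0
    return sorted(scores, key=scores.get, reverse=True)


summaries = {
    "wdb1": {"price": (5, 50), "format": {"paperback"}},
    "wdb2": {"price": (100, 900), "format": {"hardcover", "paperback"}},
}
ranking = rank_databases(summaries, {"price": 20, "format": "paperback"})
print(ranking)
```

Textual attributes and relationships among attributes are not covered by this sketch; they would need richer summaries than a range or a value set.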

Remaining Research Challenges (6)

5. Automatic SRR extraction from complex result pages

Goal: Automatically identify the rules to extract search
result records from complex result pages.

Some characteristics of complex result pages:


Records contain both text and images.


SRRs may be organized into multiple columns/multiple
sections.


SRRs have a variety of formats.


Result pages have no fixed sections (i.e., some sections only appear in some result pages).


Some SRRs are divided into multiple blocks.

Remaining Research Challenges (7)

6. Global query processing and optimization

Goal: Evaluate global queries efficiently and correctly.

Some issues to consider:


It consists of many steps:


Identify relevant Web databases (global cost)


Translate/map global queries to local queries (global cost)


Submit queries and receive results (communication cost)


Evaluate translated queries by local Web databases (local cost)


Extract search results from result pages (global cost)


Filter out unqualified results (global cost)


How to optimize the above process?


What are the differences between Web integration systems
and multidatabase/federated database systems?

The End!