
15 Oct 2013


Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach

AnHai Doan

Pedro Domingos

Alon Halevy

CS 652 Information Extraction and Information Integration

Problem & Solution


Problem


Large-scale Data Integration Systems


Bottleneck: Semantic Mappings


Solution


Multi-strategy Learning


Integrity Constraints


XML Structure Learner


1-1 Mappings




Learning Source Descriptions (LSD)


Components


Base learners


Meta-learner


Prediction converter


Constraint handler


Operating Phases


Training phase


Matching phase



Learners


Basic Learners


Name Matcher (Whirl)


Content Matcher (Whirl)


Naïve Bayes Learner


County-Name Recognizer


XML Learner


Meta-Learner (Stacking)
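
As a rough illustration of stacking, the sketch below fits one weight per base learner on held-out predictions and then combines the learners' confidence scores with those weights. The learner columns, the scores, and the gradient-descent fit are assumptions made for illustration; the slides do not specify how LSD's meta-learner is trained.

```python
def fit_weights(base_scores, targets, steps=2000, lr=0.1):
    """Fit one weight per base learner by gradient descent on squared error.

    base_scores: rows of confidence scores, one score per base learner
    targets: 1.0 if the candidate label was correct for that row, else 0.0
    """
    n_learners = len(base_scores[0])
    weights = [0.0] * n_learners
    for _ in range(steps):
        grads = [0.0] * n_learners
        for row, y in zip(base_scores, targets):
            err = sum(w * s for w, s in zip(weights, row)) - y
            for j, s in enumerate(row):
                grads[j] += 2 * err * s / len(base_scores)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


def combine(weights, row):
    """Weighted sum of the base learners' scores for one candidate label."""
    return sum(w * s for w, s in zip(weights, row))


# Hypothetical held-out scores: column 0 = name matcher, column 1 = Naive Bayes.
held_out = [[0.9, 0.5], [0.8, 0.4], [0.2, 0.6], [0.1, 0.5]]
correct = [1.0, 1.0, 0.0, 0.0]
weights = fit_weights(held_out, correct)
```

In this toy data the first learner's scores track correctness and the second's do not, so the fitted weights favor the first learner.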


Naïve Bayes Learner

Input instance = bags of tokens
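
A minimal sketch of a Naive Bayes classifier over bags of tokens, assuming whitespace tokenization and add-one smoothing (neither is stated in the slides). The class name, the training examples, and the DESCRIPTION label are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTokens:
    """Multinomial Naive Bayes over bags of tokens with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            tokens = text.lower().split()
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            n_tokens = sum(self.token_counts[label].values())
            log_p = math.log(count / total)  # log prior of the label
            for tok in tokens:
                num = self.token_counts[label][tok] + 1
                log_p += math.log(num / (n_tokens + len(self.vocab)))
            scores[label] = log_p
        return max(scores, key=scores.get)


# Hypothetical training examples: (instance text, mediated-schema label).
nb = NaiveBayesTokens()
nb.train([
    ("Miami, FL", "ADDRESS"),
    ("Boston, MA", "ADDRESS"),
    ("great location near beach", "DESCRIPTION"),
    ("spacious house with view", "DESCRIPTION"),
])
```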


XML Learner

Input instance = bags of tokens, including text tokens and structure tokens
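
One way to picture a bag that mixes text tokens with structure tokens is sketched below. The encoding is an assumption: nested tags become `<tag>` tokens and parent-child links become `parent>child` tokens. The slides do not say how LSD encodes structure, and the sample instance is hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_bag_of_tokens(xml_text):
    """Return text tokens plus structure tokens for one XML instance."""
    root = ET.fromstring(xml_text)
    tokens = []

    def visit(node, is_root):
        if not is_root:
            tokens.append(f"<{node.tag}>")  # structure token for a nested tag
        if node.text and node.text.strip():
            tokens.extend(node.text.lower().split())  # text tokens
        for child in node:
            tokens.append(f"{node.tag}>{child.tag}")  # structure token for an edge
            visit(child, False)
            if child.tail and child.tail.strip():
                tokens.extend(child.tail.lower().split())

    visit(root, True)
    return tokens


# Hypothetical instance of a source tag with nested elements.
instance = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
bag = xml_bag_of_tokens(instance)
```

A bag built this way can be fed to the same kind of token-based classifier as plain text, which is the point of treating structure as tokens.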


Domain Constraint Handler


Domain Constraints


Impose semantic regularities on schemas and source data in the domain


Can be specified at the beginning


When creating a mediated schema


Independent of any actual source schema


Constraint Handler


Domain constraints + Prediction Converter + Users’ feedback + Output mappings
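
To make the handler's role concrete, the sketch below picks the highest-scoring assignment of labels to source tags that violates no domain constraint. It uses exhaustive search, which only suits tiny inputs and is not claimed to be LSD's search strategy. The tag names, scores, and the `at_most_one` constraint are hypothetical.

```python
from itertools import product


def best_mapping(predictions, constraints):
    """Return the best-scoring label assignment that satisfies all constraints."""
    tags = list(predictions)
    best, best_score = None, float("-inf")
    for labels in product(*(predictions[t] for t in tags)):
        mapping = dict(zip(tags, labels))
        if not all(check(mapping) for check in constraints):
            continue  # skip assignments that break a domain constraint
        score = 1.0
        for tag, label in mapping.items():
            score *= predictions[tag][label]
        if score > best_score:
            best, best_score = mapping, score
    return best


def at_most_one(label):
    """Constraint: at most one source tag may map to the given label."""
    return lambda mapping: list(mapping.values()).count(label) <= 1


# Hypothetical converted predictions: source tag -> {label: confidence}.
predictions = {
    "area": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
    "location": {"ADDRESS": 0.7, "DESCRIPTION": 0.3},
}
mapping = best_mapping(predictions, [at_most_one("ADDRESS")])
unconstrained = best_mapping(predictions, [])
```

Without the constraint both tags would map to ADDRESS; with it, the weaker ADDRESS prediction gives way to its second choice.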



Training Phase


Manually Specify Mappings for Several Sources


Extract Source Data


Create Training Data for each Base Learner


Train the Base-Learner


Train the Meta-Learner


Example1 (Training Phase)


Example1 (Cont.)

Source Data

Training Data


Example1 (Cont.)

(“location”, ADDRESS)

(“Miami, FL”, ADDRESS)

Source Data: (location: Miami, FL)
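
The example above can be sketched as a small function that turns source data plus a manually specified mapping into training pairs: one pair from the tag name and one from each data value. The function name and the split into two lists are assumptions about how the pairs would be routed to name-based and content-based learners.

```python
def make_training_data(source_rows, manual_mapping):
    """Build (tag name, label) and (content, label) examples from source data."""
    name_examples, content_examples = [], []
    for tag, value in source_rows:
        label = manual_mapping[tag]
        if (tag, label) not in name_examples:
            name_examples.append((tag, label))  # one example per distinct tag
        content_examples.append((value, label))  # one example per data value
    return name_examples, content_examples


rows = [("location", "Miami, FL")]
manual = {"location": "ADDRESS"}
names, contents = make_training_data(rows, manual)
```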


Matching Phase


Extract and Collect Data


Match each Source-DTD Tag


Apply the Constraint Handler
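
A hedged sketch of the middle step: the instances collected for one source tag are scored individually, and the per-instance scores are averaged into a single tag-level prediction. Averaging is an assumption about how instance predictions are converted, and `toy_learner` is a hypothetical stand-in for the trained learners.

```python
from collections import defaultdict


def match_tag(instances, learner):
    """Average a learner's per-instance label scores into one tag-level score."""
    totals = defaultdict(float)
    for inst in instances:
        for label, score in learner(inst).items():
            totals[label] += score
    return {label: total / len(instances) for label, total in totals.items()}


def toy_learner(value):
    # Hypothetical rule: a value ending in a two-letter uppercase token
    # (such as a state code) is scored as a likely ADDRESS.
    last = value.split()[-1]
    looks_like_state = len(last) == 2 and last.isupper()
    p = 0.8 if looks_like_state else 0.25
    return {"ADDRESS": p, "DESCRIPTION": 1 - p}


# Hypothetical data collected for one source tag.
collected = {"area": ["Miami, FL", "Boston, MA", "Downtown"]}
tag_scores = {tag: match_tag(vals, toy_learner) for tag, vals in collected.items()}
best = {tag: max(s, key=s.get) for tag, s in tag_scores.items()}
```

The tag-level scores, rather than only the top label, are what a constraint handler would need in order to fall back to a second choice.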


Example2 (Matching Phase)


Example2 (Cont.)


Example2 (Cont.)


Experimental Evaluation


Measures


Matching accuracy of a source


Average matching accuracy of a source


Average matching accuracy of a domain


Experiment Results


Average matching accuracy for different domains


Contributions of base learners and domain constraint handler


Contributions of schema information and instance information


Performance sensitivity to the number of data instances
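
The measures listed on this slide can be sketched as follows, assuming that the matching accuracy of a source is the fraction of its tags whose predicted label equals the correct one, and that the averages are plain arithmetic means. The slides do not give the definitions, and the sources and labels below are hypothetical.

```python
def matching_accuracy(predicted, gold):
    """Fraction of source tags whose predicted label matches the correct label."""
    correct = sum(1 for tag, label in gold.items() if predicted.get(tag) == label)
    return correct / len(gold)


def average_accuracy(values):
    """Arithmetic mean of accuracies, e.g. over runs or over sources."""
    return sum(values) / len(values)


# Hypothetical correct and predicted mappings for two sources in one domain.
gold_a = {"location": "ADDRESS", "price": "PRICE", "agent": "AGENT", "notes": "DESCRIPTION"}
pred_a = {"location": "ADDRESS", "price": "PRICE", "agent": "AGENT", "notes": "ADDRESS"}
gold_b = {"area": "ADDRESS", "cost": "PRICE"}
pred_b = {"area": "ADDRESS", "cost": "DESCRIPTION"}

acc_a = matching_accuracy(pred_a, gold_a)
acc_b = matching_accuracy(pred_b, gold_b)
domain_acc = average_accuracy([acc_a, acc_b])
```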





Limitations


Enough Training Data


Domain Dependent Learners


Ambiguities in Sources


Efficiency


Overlapping of Schemas




Conclusion and Future Work


Improve over time


Extensible framework


Multiple types of knowledge


Non 1-1 mappings?