Querying Web-Sources within a Data Federation

basesprocketΔιαχείριση Δεδομένων

31 Οκτ 2013 (πριν από 3 χρόνια και 9 μήνες)

78 εμφανίσεις

1

Querying Web
-
Sources
within a Data
Federation

Lynn Wu
1
,
Aykut Firat
2
,

Tarik Alatovic
3
, Stuart Madnick
1


1
MIT Sloan School of Management

2
Northeastern University

3
INSEAD


International Conference on Information Systems (ICIS)

December 11, 2006

2

Motivating Scenario

You want:



The current stock quotes of all
companies listed on the Stock Exchange



that are in the biotechnology industry.


And you want to see each of the stock
quotes in all the major currencies.


3

Good News

All of the necessary information is available (and
for free) on the Web …


Listing of companies in an industry

Stock price for any company

Conversion between any two currencies

So what’s the problem?


4

Process


Part 1

Web sites are not like Relational (SQL) databases.

Must go step
-
by
-
step: first find all the biotech companies.

5

Biotechnology

Ticker

Acadia Pharmaceuticals Inc.

ACAD

Accentia Biopharmaceuticals, I

ABPI

Achillion Pharmaceuticals, Inc

ACHN

Acorda Therapeutics, Inc.

ACOR

Adherex Technologies Inc.

ADH

Advanced Cell Technology Inc.

ACTC.OB

Advanced Life Sciences Holding

ADLS

Advaxis Inc.

ADXS.OB

Adventrx Pharmaceuticals Inc.

ANX

Alfacell Corp.

ACEL

Alnylam Pharmaceuticals Inc.

ALNY

……

……

ACHN

ABPI

ADLS

ACOR

ADH

ANX

ACAD

237 Biotech firms

Process


Part 2

Then must find the stock price
of each, one
-
by
-
one.

6

Process


Part 3

Ticker

Price($)

ACAD

8.96

ABPI

3.55

ACHN

17.52

ACOR

17.13

ADH

0.31

ACTC.OB

0.76

ADLS

2.43

ADXS.OB

0.14

ANX

2.42

ACEL

1.62

ALNY

23.81

Ticker

Price($)

EURO

JPY

ACAD

8.96

6.8096

792.2289

ABPI

3.55

2.698

313.8853

ACHN

17.52

13.3152

1549.09

ACOR

17.13

13.0188

1514.607

ADH

0.31

0.2356

27.4097

ACTC.OB

0.76

0.5776

67.19798

ADLS

2.43

1.8468

214.8567

ADXS.OB

0.14

0.1064

12.37858

ANX

2.42

1.8392

213.9725

ACEL

1.62

1.2312

143.2378

ALNY

23.81

18.0956

2105.242

Then must convert stock price
of each, one
-
by
-
one.

7

General Scenario


Users often have to browse through many
websites and collect and process a lot of
information manually.


Wouldn’t it be great if you could get all the stock
quotes in the biotech industry using one query?



select ticker, price from yahooF where ticker IN
(select companyticker from companytable
where industry='Biotechnology')

8

Why is this so difficult?


Websites have various capability restrictions.


Web sites do not accept general queries (e,g.,
SQL).


Assuming they somehow accepted general queries,
there are still problems. For example:


select price from yahooF

This is not answerable as Yahoo! Finance requires at least one
ticker at a time to get the stock quote.


select exchanged, expressed, rate, date from olsen where
expressed='USD' and date= '12/10/06'

Must specify both currencies.

9

Existing Solutions


Commercial databases can incorporate heterogeneous
data sources through the use of wrappers:


However, there is no general
-
purpose wrapper that can
query the entire Web.


Need to construct one wrapper per website.


This is our focus


how can these be improved ?



Other options:


Using highly expressive context
-
free grammars to express
the capability restrictions


Has not been used widely in commercial systems due to their
complexity.

10

How does a Federated database system
handle the problem?

Example: IBM DB2

Wrapper

Web

Sources

Capability

Handler

Wrapper for S1

Capability

Handler

Data

Extraction

Wrapper for S2

Capability

Handler

Data

Extraction

Wrapper for

S3

S1
-
website

Wrapper: Request

-

Reply Protocol

Federation Engine

Query: Select ..from s1,s2,s3



IBM DB2

Data

Extraction

S2
-
website

S3
-
website

For web sites (S1, S2, S3),
each wrapper must be
custom crafted.

11

Research Contribution


Offer a complete, practical, and scalable solution
to easily incorporate websites into a data
federation.


Abstract wrapper components into separate
reasoning engines.


Capability reasoning engine for query planning and
execution


Data extraction engine

12

Our Solution

Two
-
Layered Architecture

current IBM solution

Three
-
Layered
Architecture


with
capability declaration


Wrapper

Web

Sources

Capability

Handler

Wrapper for S1

Capability

Handler

Data

Extraction

Wrapper for S2

Capability

Handler

Data

Extraction

Wrapper for S3

S1
-
website

Wrapper: Request

-

Reply Protocol

Federation Engine

Query: Select ..from s1,s2,s3



IBM DB2

Data

Extraction

S2
-
website

S3
-
website

Wrapper: Request

-

Reply Protocol

Federation Engine

Query: Select ..from s1,s2,s3



Wrapper,

Capability

Engine

S1
-
website

Web

Sources

Data

Extraction

Engine

IBM DB2

Data

Extraction

Engine

Query

planning

with

capability

declaration

CR

for S1

CR

for S3

CR

for S2

C

apability

R

ecord

Declaration

DE

for S1

DE

for S2

DE

for S3

D

ata

E

xtraction

Spec Files

Wrapper: Request

-

Reply Protocol

Federation Engine

Query: Select ..from s1,s2,s3



Wrapper,

Capability

Engine

Web

Sources

Data

Extraction

Engine

IBM DB2

Data

Extraction

Engine

Query

planning

with

capability

declaration

CR

for S1

CR

for S3

CR

for S2

C

apability

R

ecord

Declaration

DE

for S1

DE

for S2

DE

for S3

D

ata

E

xtraction

Spec Files

S2
-
website

S3
-
website

13

Adding a web source is simple.


Define the data extraction rules.


Define the capability record.



No procedural coding
involved at all !

14

Data Extraction: Cameleon Engine



Extract data from web
pages using declarative
specifications that extract
specific fields within a
website.



Can answer rudimentary
queries involving only a
single website.

Input param

Regular expression
identifying the region
and extracts the price

Example data extraction rules for Yahoo! Finance

15

Cameleon Studio tool enables quick creation
and testing of the data extraction rules

16

Capability Record


For Yahoo Finance!, we have two attributes of interest.


Cameleon extracts data and form a table format




Capability Record


Ticker

Price

relation(‘YahooF’,


[[‘Ticker’, string, bound(1)],


[‘Price’, number, free]],



['='])

relation(olsen,


[['Exchanged',string, bound(1)],


['Expressed',string, bound(1)],


['Rate',number, free],


['Date',string, bound(1)]],


['=']).


relation(‘companytable’,


[[‘Industry’, string, bound(1)],


[‘CompanyTicker’, string, free]],



['='])

Must provide one (and only one) Ticker at a time

(some sites allow up to 50 Tickers at a time).

Price is value returned.

Can only use equality (=) operator.

17

IBM DB2


Uses wrapper to access non
-
relational data sources.


DB2 first decomposes the original query into query
fragments and then sends them to wrappers.


Wrapper sends the result back to DB2 which then
assembles the final results.

DB2 XML Wrapper (Adapted from IBM).

18

Request
-
Reply
-
Compensate Protocol

Request
-
Reply
-
Compensate protocol example

Query Fragment

select price * 1.3

from YahooF

where ticker in (‘GE’, ‘IBM’, ‘MSFT’);

Request

HXP: Price

Table: YahooF

Predicates: ticker in (‘GE’, ‘IBM’, ‘MSFT’)

Wrapper plan 1

HXP: Price

Table: YahooF

Predicate: ticker = ‘GE’

Wrapper plan 2

HXP: Price

Table: YahooF

Predicate: ticker = ‘IBM’

Wrapper plan 3

HXP: Price

Table: YahooF

Predicate: ticker = ‘MSFT’

19

Query Planning


Now we have a capability record defined.


Add a secondary mini query planner that is
designed specifically to work with capability
records.


Can answer queries involving multiple web sources.


Specify a query execution order of query fragments.


Independent query fragments are executed first.


Followed by dependent query fragments that can
uses the prior results.

20

Our Solution


Example 1


Find all the stock quotes of biotech companies.

SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER IN
(SELECT COMPANYTICKER FROM COMPANYTABLE WHERE
INDUSTRY='BIOTECHNOLOGY' AND COMPANYTICKER <'AD'))

SELECT TICKER, PRICE FROM YAHOOF WHERE
TICKER =
[<unbound kind>]

Depends on the
previous query
fragment

SELECT COMPANYTICKER, INDUSTRY FROM
COMPANYTABLE WHERE INDUSTRY =
BIOTECHNOLOGY’ AND COMPANYTICKER < 'AD')


Independent query
fragment

21

Example Query

COMPANYTICKER INDUSTRY

---------------------------------------------


ACAD Biotechnology

ACAM Biotechnology

ACOR Biotechnology

ACEL Biotechnology

SELECT COMPANYTICKER,
INDUSTRY FROM COMPANYTABLE
WHERE INDUSTRY =
BIOTECHNOLOGY AND
COMPANYTICKER < AD


Independent
query
fragment

SELECT TICKER, PRICE FROM YAHOOF
WHERE TICKER =
[<unbound kind>]

Depends on the
previous query
fragment

SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER =
ACAD

SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER =
ACAM

SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER =
ACOR

SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER =
ACEL

TICKER PRICE

-------------------------------------------

ACAD 14.90

ACAM 6.51

ACOR 5.10

ACEL 3.18

22

Example 2


Now you want the stock price in all major currencies.


(select ticker, price from yahooF




where ticker IN (select companyticker from companytable



where industry=‘biotechnology’)


(select currency, olsen.rate from



(select currency from currency_map




where currency <> ‘USD') currency_map,




(select exchanged, 'USD', rate, ‘12/10/06'



from olsen where expressed= 'USD' and date=‘12/10/06') olsen




where currency_map.currency = olsen.exchanged and


currency_map.currency <> 'USD ') as exchange

select yahooF.ticker, yahooF.price * exchange.rate, exchange.curency from

23

Example 2

Get all the exchange rates against the USD
on Dec 10 2006

Query
fragment 1

Query
fragment 2

select olsen.rate,


from


(select currency,


from currency_map


where currency <> ‘USD')

currency_map
,


(select exchanged, ‘USD', rate, ‘12/10/06'


from olsen


where expressed=‘USD'


and date=‘12/10/06') olsen,

where currency_map.currency = olsen.exchanged

and currency_map.currency <> ‘USD'

24

(
select exchanged,
'
USD', rate, ’12/10/2006'


from olsen


where expressed= 'USD'


and date='12/10/06'


and
exchanged in (select currency from
currency_map where currency<>’USD’))

(select exchanged, ‘USD', rate, ’12/10/06'


from olsen


where expressed=‘USD'


and date=’12/10/06’) olsen

relation(olsen,


[['Exchanged',string, bound(1)],


['Expressed',string, bound(1)],


['Rate',number, free],


['Date',string, bound(1)]],


['=']).


select olsen.rate


from


(select currency


from currency_map


where currency <> 'USD')
currency_map
,


(select exchanged, 'USD', rate, '12/10/06'


from olsen


where expressed= 'USD'


and date='12/10/06') olsen,

where

currency_map.currency = olsen.exchanged

and currency_map.currency <> 'USD'

Query fragment 2

Modified Query
fragment 2

Capability record

25

Currency


rate

----------------------------------------

AUD



1.46


CAD



1.32


HKD






7.72

YPY



113.00

TICKER PRICE

-------------------------------------------

ACAD 14.90

ACAM 6.51

ACOR 5.10

TICKER PRICE($)

PRICExRATE

CURRENCY

-------------------------------------------------------------------------------------------------------------

ACAD 14.90

21.754


AUD

ACAD 14.90

19.668


CAD

ACAD 14.90

115.028


HKD

ACAD 14.90

1683.7


YPY

ACAM 6.51

9.505


AUD

ACAM 6.51

8.593


CAD


ACAM 6.51

50.257


HKD

ACAM

6.51

735.63


YPY

ACOR 5.10


7.446


AUD

ACOR 5.10


6.732


CAD

ACOR 5.10


39.372


HKD

ACOR 5.10


576.3


YPY

select ticker, price * exchange.rate, exchanged.currency

26

Conclusion


Three
-
layered architecture for querying web sources.



Instead of burying capability handling in each wrapper,
we created a generic capability handler.



Using this capability handler, adding a web source to a
federated database is as simple as declaring the
extraction rules and capability record for the source.



This was implemented and successfully tested.



This makes millions of semi
-
structured web sites into
useful “databases.”