STRUCTURED DATA ON THE WEB


TIANTIAN REN

NICOARA TALPES

Overview


This article discusses the nature of Web-embedded structured data and the challenges of managing it.

WEB


The Web is known as:

- A vast repository of shared documents

- A source of a significant amount of structured data covering a broad range of topics

Web Data

Forms of structured data on the Web:

- HTML tables

- HTML lists

- Back-end Deep Web databases (e.g., books sold on Amazon.com)

More than one billion data sets as of February 2011, counting only the English-language Web:

- More than 150 million sources come from HTML tables

- Roughly the same number of sources come from HTML lists

Structured Data


Structured Web data shares many similarities with the data traditionally managed by commercial database systems, but it also exhibits unusual characteristics of its own:

- Data embedded in "page context" must first be extracted

- There is no centralized data design or data-quality control

- It covers everything! (A traditional database typically focuses on a single domain.)

Structured Data

Given these unusual characteristics, the benefits of exploiting structured data are:

- Improved Web search

- Question answering

- Data integration from multiple Web sources

WebTables

WebTables is designed to extract relational-style data from the Web expressed using the HTML table tag.


WebTables

Figure 1: A Web page listing American Presidents. The table has four columns, each with a topic-specific label and type (e.g., a date range), and a tuple of data for each row. In effect, this Web page contains a small relational database.



WebTables

Not all table tags carry relational data. Many are used for page layout, calendars, and other non-relational purposes.




WebTables

Only about 1.1% of the content embedded in HTML table tags represents good relational tables.

Two main problems arise with these databases:

- How to extract the structured data from the Web in the first place, given that 98.9% of tables carry no relational data

- What to do with the resulting huge collection of databases



Table Extraction


Filter out all the non-relational tables. Solution:

a) use topic-independent features of each table

b) check whether each column contains a uniform data type (a minimal sketch of this check follows below)

Recover metadata. Solution:

a) determine whether or not the first row of the table includes labels for each column.
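
The column-uniformity check can be illustrated with a short sketch. This is not the WebTables implementation, just a minimal Python approximation with assumed cell-type categories (number, date, text) and hypothetical helper names:

```python
import re

# Crude patterns for two common cell types; everything else is "text".
NUMBER_RE = re.compile(r"^-?\d+([.,]\d+)?$")
DATE_RE = re.compile(r"^\d{4}([-/]\d{1,2}([-/]\d{1,2})?)?$")

def cell_type(cell: str) -> str:
    """Classify a cell as 'number', 'date', or 'text'."""
    cell = cell.strip()
    if NUMBER_RE.match(cell):
        return "number"
    if DATE_RE.match(cell):
        return "date"
    return "text"

def column_is_uniform(column: list[str], threshold: float = 0.8) -> bool:
    """A column counts as uniform if one type covers most of its cells."""
    if not column:
        return False
    types = [cell_type(c) for c in column]
    best = max(types.count(t) for t in set(types))
    return best / len(types) >= threshold

def looks_relational(rows: list[list[str]]) -> bool:
    """Accept a table only if every column of its data rows is type-uniform."""
    if len(rows) < 2:
        return False
    columns = zip(*rows[1:])  # skip the potential header row
    return all(column_is_uniform(list(col)) for col in columns)

# A president-style table passes; a one-row layout table does not.
table = [["President", "Term start", "Term end"],
         ["George Washington", "1789", "1797"],
         ["John Adams", "1797", "1801"]]
print(looks_relational(table))  # True
```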

Result

Together, the two techniques allow WebTables to recover 125 million high-quality databases from a large general Web crawl.

The tables in this corpus contain more than 2.6 million unique "schemas".

What to do with the resulting huge
collection of databases


- Structured data search: takes a keyword query and returns a ranked list of databases

- Attribute Correlation Statistics Database (ACSDb): built by summing individual attribute counts over all entries in the ACSDb. WebTables uses it to compute the probability that an attribute appears in a randomly chosen database, as well as conditional probabilities between attributes (a sketch follows below)
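
As a rough illustration, assuming the ACSDb is a mapping from schemas (attribute sets) to occurrence counts (the paper's exact layout may differ), the probabilities might be computed like this:

```python
from collections import Counter
from itertools import combinations

# Hypothetical miniature ACSDb: schema (set of attributes) -> table count.
acsdb = {
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"name", "addr", "city", "zip"}): 1,
    frozenset({"name", "size", "last-modified"}): 1,
}

attr_count = Counter()   # freq(a): how many schemas contain attribute a
pair_count = Counter()   # freq(a, b): how many schemas contain both a and b
total = sum(acsdb.values())

for schema, count in acsdb.items():
    for a in schema:
        attr_count[a] += count
    for a, b in combinations(sorted(schema), 2):
        pair_count[(a, b)] += count
        pair_count[(b, a)] += count

def p(a: str) -> float:
    """Probability that a randomly chosen schema contains attribute a."""
    return attr_count[a] / total

def p_given(a: str, b: str) -> float:
    """Conditional probability of seeing a, given that b is present."""
    return pair_count[(a, b)] / attr_count[b] if attr_count[b] else 0.0

print(p("name"))               # 1.0: every schema has "name"
print(p_given("addr", "zip"))  # 1.0: "zip" always co-occurs with "addr"
```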




What to do with the resulting huge
collection of databases


Attribute Correlation Statistics Database (ACSDb): using these probabilities in different ways, we can build new applications:

- Schema auto-complete (a greedy sketch follows below)

- Synonym finding
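
The auto-complete idea is to greedily suggest the attribute most likely to co-occur with what the user has typed so far. A minimal sketch, reusing the hypothetical p_given from the ACSDb snippet above (the paper's algorithm is more refined):

```python
def autocomplete(seed: set[str], candidates: set[str], k: int = 3) -> list[str]:
    """Greedily suggest up to k attributes that fit the seed schema.

    Each step picks the candidate with the highest conditional
    probability given any attribute already in the schema.
    Assumes p_given(...) from the ACSDb sketch above.
    """
    schema = set(seed)
    suggestions = []
    for _ in range(k):
        scored = [(max(p_given(c, s) for s in schema), c)
                  for c in candidates - schema]
        if not scored:
            break
        score, best = max(scored)
        if score == 0.0:
            break  # nothing left that co-occurs with the schema
        schema.add(best)
        suggestions.append(best)
    return suggestions

# A user who typed "name" might be offered address-book attributes:
print(autocomplete({"name"}, {"addr", "city", "zip", "size"}))
```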

WebTables innovations

- Table ranking: SchemaRank does not rely on search-engine ranking; it combines a linear regression over table features with an ACSDb schema-coherency score (how coherent the table's schema is)

- Indexing: a standard inverted index assumes a linear text model and cannot retrieve the features above, so WebTables modifies it: for each element it also remembers its (x, y) location in the table, which enables new kinds of queries (see the sketch below)

- Join graph traversal: a way to navigate the millions of recovered schemas
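
A rough sketch of how cell positions might be carried in the postings (an illustrative assumption; the actual index layout is more involved):

```python
from collections import defaultdict

# Hypothetical postings: token -> list of (table_id, x, y) cell locations.
index: dict[str, list[tuple[int, int, int]]] = defaultdict(list)

def index_table(table_id: int, rows: list[list[str]]) -> None:
    """Record every token along with its (column, row) position."""
    for y, row in enumerate(rows):
        for x, cell in enumerate(row):
            for token in cell.lower().split():
                index[token].append((table_id, x, y))

index_table(1, [["President", "Term start"],
                ["George Washington", "1789"]])

# A position-aware query: tables where a term appears in the header row.
hits = [tid for tid, x, y in index["president"] if y == 0]
print(hits)  # [1]
```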


Why is structured data useful

- The holy grail of Information Extraction (IE) is to create a knowledge base of all the facts on the Web

- It enables access to vast collections of data sources

- IE has traditionally focused on specific tasks: particular domains, specific websites

- The systems in this paper apply to the entire Web!

- The crux of the problem with the Web: the data is about everything! Uncertainty is key: it is infeasible to manually create mappings, heterogeneity persists even within one domain, and domain boundaries are unclear

Deep-web crawling

- Surfaces the contents of back-end databases that are accessible via HTML forms

- The data is available to users only through form interfaces

- It was long thought to be beyond reach: the Deep Web

- An estimated 10 million useful forms exist

- It holds most of the structured data on the Web

- So far the return on investment has been low

- Surfaced content will then be indexed by WebTables!

Example: Google's deep crawler

- The first large-scale Deep-Web surfacing system

- Goal: to scale, and to be domain- and language-independent

- Results: one million databases in 50 languages

- Contributes results to over 1,000 search queries per second

- Validated by search-traffic click-through


Implementation

- The first contribution addresses the choice of input combinations. It has two parts: a test for 'informative templates' and an algorithm that traverses the space of query templates (sketched below)

- The outcome is only a few hundred submissions per form, proportional to the size of the underlying database
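
A minimal sketch of the template-traversal idea, under the assumption that informativeness can be approximated by how distinct the result pages are (submit is a hypothetical callback; the real system clusters rendered pages):

```python
def distinct_fraction(pages: list[str]) -> float:
    """Fraction of result pages that are distinct: a crude informativeness proxy."""
    return len(set(pages)) / len(pages) if pages else 0.0

def search_templates(inputs: list[str], submit, max_size: int = 3,
                     threshold: float = 0.25) -> list[frozenset]:
    """Incrementally grow query templates, keeping only informative ones.

    Start with single-input templates; extend a template by one more
    input only if the smaller template already proved informative.
    """
    informative = []
    frontier = [frozenset({i}) for i in inputs]
    while frontier and len(next(iter(frontier))) <= max_size:
        next_frontier = []
        for template in frontier:
            pages = submit(template)  # sample form submissions
            if distinct_fraction(pages) >= threshold:
                informative.append(template)
                next_frontier += [template | {extra} for extra in inputs
                                  if extra not in template]
        frontier = list(set(next_frontier))
    return informative

# Toy form: only the "state" input actually changes the results page.
def submit(template):
    return [f"page-{i}" if "state" in template else "same" for i in range(5)]

print(search_templates(["state", "sort_by"], submit, max_size=2))
```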



Implementation

- Data-integration approach: create semantic mappings between a mediator form and the data sources

- Surfacing approach: pre-compute the most relevant form submissions for HTML forms and index the resulting URLs. This leverages the existing search infrastructure!

- Brute force generates far too many URLs

- Input-values solution: iterative probing that analyzes the text on result pages, plus an algorithm to predict input values for text boxes (sketched below)
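
The iterative-probing idea can be sketched as a keyword feedback loop (fetch_results and extract_keywords are hypothetical helpers standing in for form submission and result-page analysis):

```python
def probe_text_box(seed_words: set[str], fetch_results, extract_keywords,
                   rounds: int = 3, cap: int = 500) -> set[str]:
    """Iteratively grow the set of candidate values for a form's text box.

    Submit the current candidates, mine the result pages for new
    keywords, and feed those back in, until convergence, a round
    limit, or a size cap.
    """
    candidates = set(seed_words)
    for _ in range(rounds):
        new_words = set()
        for word in candidates:
            page = fetch_results(word)  # submit the form with this value
            new_words |= extract_keywords(page)
        before = len(candidates)
        candidates |= new_words
        if len(candidates) == before or len(candidates) >= cap:
            break  # converged or large enough
    return candidates
```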



Improvement areas

- Extract everything

- Make use of annotations in the index to distinguish bad query results

- Deeper extraction: recover more semantics by predicting the mapping from form inputs to column names

- Use the test of informativeness of form inputs to better crawl the Deep Web

- Crawl subsets of websites

- Develop heuristics for recognizing the data types of inputs (a toy sketch follows below)
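
Such heuristics might amount to simple pattern tests over sample values. The patterns below are illustrative guesses, not the crawler's actual rules:

```python
import re

# Illustrative patterns for common form-input types (US-centric guesses).
TYPE_PATTERNS = {
    "zip":   re.compile(r"^\d{5}(-\d{4})?$"),
    "price": re.compile(r"^\$?\d+(\.\d{2})?$"),
    "date":  re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
}

def guess_input_type(sample_values: list[str]) -> str:
    """Guess an input's data type from values users typically supply."""
    for name, pattern in TYPE_PATTERNS.items():
        if sample_values and all(pattern.match(v) for v in sample_values):
            return name
    return "free-text"

print(guess_input_type(["94043", "10001"]))   # zip
print(guess_input_type(["$9.99", "12.00"]))   # price
print(guess_input_type(["cheap red shoes"]))  # free-text
```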



Semantic services

- WebTables and the Deep-Web crawler together extract a collection of forms, of schemas, and of columns and values

- Aggregate form metadata for a domain

- Services built by aggregating large amounts of data: for the Deep Web, given an attribute, return values for the column (a hypothetical shape is sketched below); for IR, return properties for an entity
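
Such a service might expose something as simple as an attribute-to-values lookup over the aggregated corpus (a hypothetical shape, not an actual API):

```python
from collections import defaultdict

# Hypothetical aggregated store: attribute -> values seen in that column
# across all extracted tables and forms.
values_by_attribute: dict[str, set[str]] = defaultdict(set)
values_by_attribute["cuisine"] |= {"thai", "italian", "sushi"}
values_by_attribute["state"] |= {"CA", "NY", "WA"}

def values_for(attribute: str, limit: int = 10) -> list[str]:
    """Deep-Web service: given an attribute, return known column values."""
    return sorted(values_by_attribute.get(attribute, set()))[:limit]

print(values_for("cuisine"))  # ['italian', 'sushi', 'thai']
```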


Other structured data

- Relational data

- XML data

- Data warehouses

- Spatial and multimedia databases

- Workflows

- Probabilistic databases

- Parse-tree databases

- Graph-structured data, data streams



Integration

- Queries as keywords: bypass the learning curve of query languages

- Detect user intent, since keyword queries give the search engine fewer explicit cues

- Present structured data alongside regular Web search results

The other extreme: socially created data sets

- There is a trade-off between the investment in creating data structure and its benefits

- Enable a broader set of people to create data

- Annotated schemas: users add tags to describe the underlying content

- Annotations are lightweight and uploaded in XML

- Examples: Flickr, Google Co-op, blogs (KR)

- Structure-aware search engines: travel, weather, services

- In search: customized search engines



Conclusion


Challenges of structured data on the Web:

- difficult to extract

- disorganized and messy

Virtues of Web data: it can be created by anyone and covers every topic imaginable.