The Claremont Report on Database Research

2009-10-28

Tamkang University

周清江


Background


Senior database researchers have gathered every
few years to assess the state of database research
and to recommend problems and problem areas
that deserve additional focus.


Laguna Beach, Calif. in 1989


Palo Alto, Calif. (“Lagunita”) in 1990 and 1995


Cambridge, Mass. in 1996


Asilomar, Calif. in 1998


Lowell, Mass. in 2003


Claremont, Calif. in 2008



New Focus Areas


New database engine architectures


Declarative programming languages


Interplay of structured and unstructured data


Cloud data services


Mobile and virtual worlds




A Turning Point in Database Research


Unusually rich opportunities for


Technical advances, intellectual achievement,
entrepreneurship, and impact on science and society




Sense of change as a function of several factors


Breadth of excitement about Big Data


Data analysis as a profit center


Ubiquity of structured and unstructured data


Expanded development demand


Architecture shift in computing



Research Portfolio Change


Impact and Breadth


Evaluated by external measures


Helping new classes of users


Powering new computing platforms


Making conceptual breakthroughs across computing





Two Promising Approaches


Reformation


Deconstructing core data-centric ideas and systems


Reforming for new applications and architectural
realities



Synthesis


Leverage good research ideas that have yet to
develop identifiable, agreed-upon system
architectures


Data integration, information extraction, data privacy, etc.


Research Opportunities


Revisiting Database Engines


Declarative Programming for Emerging Platforms


The Interplay of Structured and Unstructured
Data


Cloud Data Services


Mobile Applications and Virtual Worlds




Research Opportunities


Main issues cut across the above topics


Management of uncertain information


data privacy and security


e-science and other scholarly applications


human-centric interactions with data


social networks and Web 2.0


personalization and contextualization of query- and
search-related tasks


streaming and networked data


self-tuning and adaptive systems, and


the challenges raised by new hardware technologies and
energy constraints


Revisiting Database Engines


Data-intensive tasks for which relational DBs
provide poor price/performance


Ex: text indexing, serving web pages, media delivery


Room for significant innovation within traditional
application domains


Analytics for business and science


The cost of software and management relative to hardware
is exorbitant


OLTP


Need to address data lifecycle issues


Data provenance, schema evolution, and versioning



Good time to try radical ideas


Revisiting Database Engines


Two directions of research projects


Revolutionary steps in DB system architecture


Broadening the range of applicability


Radically improving performance by designing special
purpose DB systems for specific domains



These efforts may be synergistic


Revisiting Database Engines


Important research topics in the core DB engine


Designing systems for clusters of many-core processors


Exploiting remote RAM and Flash as persistent media


Treating query optimization and physical data layout as a
unified, adaptive, self-tuning task to be carried out
continuously


Compressing and encrypting data at the storage layer,
integrated with data layout and query optimization


Designing systems for non-relational data models


Trading off consistency and availability for better
performance and scale-out to thousands of machines


Designing power-aware DBMS that limit energy costs
without sacrificing scalability
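
One of the topics above, compression at the storage layer integrated with data layout and query processing, can be illustrated with a small sketch. It assumes run-length encoding as the compression scheme and a single string column; the function names rle_encode and count_equal are hypothetical and do not come from the report. The point it tries to show is that a sorted layout produces long runs, and that a simple predicate count can then be answered without decompressing.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

Run = Tuple[str, int]  # (value, run length)


def rle_encode(column: Iterable[str]) -> List[Run]:
    """Compress a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]


def count_equal(runs: List[Run], target: str) -> int:
    """Answer COUNT(*) WHERE col = target directly on the compressed runs."""
    return sum(length for value, length in runs if value == target)


# Sorting the column (a physical layout decision) yields long runs,
# which is why layout and compression are best chosen together.
column = sorted(["DE", "US", "US", "FR", "US", "DE", "FR", "US"])
runs = rle_encode(column)
print(runs)                     # [('DE', 2), ('FR', 2), ('US', 4)]
print(count_equal(runs, "US"))  # 4
```

A real engine would weigh many encodings and layouts against the query workload; this sketch only shows why the two decisions interact.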


Declarative Programming for Emerging Platforms


The urgency of programmer productivity is increasing
exponentially as programmers target ever more
complex environments



Non-expert programmers need to be able to write robust code
that scales out across processors in both loosely- and
tightly-coupled architectures





Declarative Programming for Emerging Platforms


Example:


Map-Reduce


New declarative languages, based on Datalog, have been
developed for a variety of domain-specific systems


Network and distributed systems, computer games, machine
learning and robotics, compilers, security protocols, and
information extraction


Enterprise application programming


Ruby on Rails (http://www.ithome.com.tw/itadm/article.php?c=46863,
http://en.wikipedia.org/wiki/Ruby_on_Rails)


LINQ (Language-Integrated Query,
http://www.ithome.com.tw/itadm/article.php?c=44337,
http://en.wikipedia.org/wiki/Language_Integrated_Query)
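
To make the Map-Reduce example above concrete, here is a minimal single-process sketch of the programming model, assuming a word-count task. The names map_words, reduce_counts, and run_map_reduce are hypothetical, and no real framework API is implied. The programmer supplies only the two functions; grouping by key (and, in a real system, distribution across machines) is left to the runtime.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> int:
    """Reduce step: combine all values emitted for one key."""
    return sum(counts)


def run_map_reduce(
    inputs: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Toy runtime: apply the mapper, group by key, apply the reducer."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


docs = ["data services in the cloud", "cloud data"]
result = run_map_reduce(docs, map_words, reduce_counts)
print(result["data"], result["cloud"])  # 2 2
```

Because the mapper and reducer state what to compute rather than where, the same two functions could in principle run unchanged on one core or on many machines.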


Declarative Programming for Emerging Platforms


Research questions


Language design


Fairly expressive


Attractive syntax, typing and modularity, development
tools, smooth interactions with the rest of the computing
ecosystem


Efficient compilers and runtimes


Techniques to optimize code automatically


Across both the horizontal distribution of parallel
processors and the vertical distribution of tiers


Should extend techniques behind parallel and distributed
DBs



The Interplay of Structured and Unstructured Data


Within enterprises, heterogeneous collections of
structured data linked with unstructured data


On Web, structured data from


Millions of DBs hidden behind forms (deep web)


High quality data items in HTML tables on web
pages, and mashups providing dynamic views on
structured data


Data contributed by Web 2.0 services


Photo and video sites


Collaborative annotation services


Online structured data repositories




The Interplay of Structured and Unstructured Data


Challenges of managing dataspaces


Managing a rich collection of structured, semi-structured,
and unstructured data


On the web, previous contributions


Techniques for domain-specific search engines


Domain-independent technology for crawling through
forms, and surfacing the resulting HTML pages in
a search-engine index


Within enterprises, enterprise search and
discovery of relationships between structured
and unstructured data


The Interplay of Structured and Unstructured Data


Challenge 1


Extract structure and meaning from unstructured
and semi-structured data


Applying and managing predictions from large numbers
of independently developed extractors


Need algorithms to introspect about the correctness of
extractions


Better technology to manage data in context


Discover data sources


Discover implicit relationships


Determine the weight of an object’s context when
assigning it semantics


Maintain data provenance
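
The point above about managing predictions from many independently developed extractors can be sketched as follows. This is only an illustration under strong assumptions: each extractor reports a value with a self-assessed confidence, and the combination rule is a confidence-weighted vote. The type Prediction and the function combine_predictions are hypothetical names; the report does not prescribe any particular method.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Prediction = Tuple[str, str, float]  # (extractor name, extracted value, confidence)


def combine_predictions(predictions: List[Prediction]) -> Tuple[str, float]:
    """Return the value with the most total confidence and its share of the total."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    weight: Dict[str, float] = defaultdict(float)
    for _extractor, value, confidence in predictions:
        weight[value] += confidence
    total = sum(weight.values())
    if total <= 0:
        raise ValueError("confidences must sum to a positive number")
    best = max(weight, key=lambda value: weight[value])
    return best, weight[best] / total


predictions = [
    ("extractor_a", "Claremont", 1.0),
    ("extractor_b", "Claremont", 0.5),
    ("extractor_c", "Clairmont", 0.5),
]
value, share = combine_predictions(predictions)
print(value, share)  # Claremont 0.75
```

The returned share is a crude stand-in for the "introspection about correctness" the slide calls for; keeping the extractor names alongside each value is one simple way to retain provenance.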



The Interplay of Structured and Unstructured Data


Challenge 2


Develop methods for effectively querying and deriving
insight from the resulting sea of heterogeneous data


Analyze keyword query to extract its intended semantics


Route the query to relevant sources


Do not assume we have semantic mappings for the data sources


Cannot assume that the domain of the query or data sources is
known


The system should provide best-effort service and improve over
time


Develop index structures to support querying hybrid
data


Need new notions of correctness and consistency to provide
metrics and to make cost/quality tradeoffs
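
The idea of routing a keyword query to relevant sources without semantic mappings can be sketched in a few lines. The sketch assumes each source is summarized only by a set of terms observed in its data, and it scores a source by the fraction of query terms it covers. The function route_query, the source names, and the vocabularies are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def route_query(
    query: str, sources: Dict[str, Set[str]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Rank sources by the fraction of query terms found in their vocabulary."""
    terms = set(query.lower().split())
    scored: List[Tuple[str, float]] = []
    for name, vocabulary in sources.items():
        overlap = len(terms & vocabulary)
        if overlap:  # best effort: skip sources with no matching term
            scored.append((name, overlap / len(terms)))
    # Highest score first; ties broken by source name for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


sources = {
    "flights": {"flight", "airport", "fare"},
    "hotels": {"hotel", "room", "fare"},
    "papers": {"author", "title"},
}
ranked = route_query("cheap flight fare", sources)
print([name for name, _score in ranked])  # ['flights', 'hotels']
```

The score is a deliberately rough quality signal, in the spirit of a best-effort service that could be refined over time, for example from user feedback.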


The Interplay of Structured and Unstructured Data


Challenge 2


Innovation in creating data collections


Web 2.0


Users join ad-hoc communities to create, collaborate, curate,
and discuss data online


They rarely agree on schemata ahead of time


Schemata need to be inferred from the data and will be highly
dynamic


Schemata will be used to guide users to consensus


Need to incorporate visualizations effectively


They need to be easy to use
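
The statement that schemata need to be inferred from the data can be illustrated with a small sketch. It assumes user-contributed records arrive as flat dictionaries, and it reports, per field, the value types seen and the fraction of records that contain the field. The function infer_schema and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


def infer_schema(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Summarize observed fields: value types seen and share of records covered."""
    seen_types: Dict[str, Set[str]] = defaultdict(set)
    seen_count: Dict[str, int] = defaultdict(int)
    for record in records:
        for field, value in record.items():
            seen_types[field].add(type(value).__name__)
            seen_count[field] += 1
    return {
        field: {
            "types": sorted(seen_types[field]),
            "coverage": seen_count[field] / len(records),
        }
        for field in seen_types
    }


records = [
    {"title": "Sunset", "tags": ["beach"]},
    {"title": "Harbor", "year": 2008},
    {"title": "Notes", "year": "2008"},
]
schema = infer_schema(records)
print(schema["title"])          # {'types': ['str'], 'coverage': 1.0}
print(schema["year"]["types"])  # ['int', 'str']
```

Conflicting types for the same field, as with year here, are the kind of disagreement an inferred schema could surface to help a community converge.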


Cloud Data Services


Infrastructure change


Service-oriented cloud computing


Application services (salesforce.com)


Storage services (Amazon S3)


Compute services (Google App Engine, Amazon EC2)


Data services (Amazon SimpleDB, MS SQL Server Data Services,
Google Datastore)


Trade-off between functionality and operational costs


Manageability is particularly important


Limited human intervention


High-variance workloads: elastic provisioning


A variety of shared infrastructures: service tuning depends on
how the shared infrastructure is virtualized


Urgency of self-managing DB technologies
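
Elastic provisioning for high-variance workloads with limited human intervention can be sketched as a simple sizing rule. The sketch assumes the service knows its request rate and a fixed per-node capacity, and it adds a percentage of headroom before clamping to configured bounds. The function nodes_needed and all parameter names and defaults are hypothetical, not taken from any cloud provider.

```python
def nodes_needed(
    requests_per_sec: int,
    capacity_per_node: int,
    min_nodes: int = 1,
    max_nodes: int = 100,
    headroom_percent: int = 20,
) -> int:
    """Nodes required to serve the load plus headroom, clamped to [min, max]."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    numerator = requests_per_sec * (100 + headroom_percent)
    denominator = 100 * capacity_per_node
    target = -(-numerator // denominator)
    return max(min_nodes, min(max_nodes, target))


# A spiky workload: quiet period, normal load, sudden burst.
for load in (50, 1000, 100000):
    print(load, nodes_needed(load, capacity_per_node=400))
# 50 1
# 1000 3
# 100000 100
```

On shared, virtualized infrastructure the per-node capacity itself varies, which is one reason the slide notes that tuning depends on how the infrastructure is virtualized.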



Cloud Data Services


Challenges from scale of cloud computing


Today's SQL databases cannot simply scale to thousands of nodes


Different transactional implementation techniques?


Different storage semantics?


More work is needed to synthesize ideas from the
literature in cloud computing


Query optimization: limitations on either the plan space
or the search will be required


How programmers will express their programs in the
cloud



Cloud Data Services


Challenges from scale of cloud computing


Data security and privacy


Key to success: target usage scenarios in the cloud


New scenarios will emerge with their own challenges


Specialized services pre-loaded with large data sets


“Mash up” data from public and private domains


Services reaching out across clouds


Prevalent in scientific data “grids”


Federated cloud architectures will add to these challenges


Mobile Applications and Virtual Worlds


This new class of applications needs to manage diverse
user-created data, synthesize it intelligently, and provide
real-time services


Trends in the mobile space


Platforms to build mobile applications are mature


The emergence of mobile search and social networks suggests a
new set of mobile applications


Virtual worlds, like Second Life, increasingly blur the
distinctions with the real world


Suggest a more data-rich mixture (co-space)


Applications include rich social networking, massive multi-player
games, military training, edutainment and knowledge sharing


Mobile Applications and Virtual Worlds


New challenges


The need to process heterogeneous data streams to
materialize real-world events


The need to balance privacy against the collective
benefit of sharing personal real-time information


The need for more intelligent processing to send
interesting events in the co-space to someone in the
physical world


Moving Forward


Survey articles and tutorials are becoming an
increasingly important contribution


Risky or speculative papers are not championed effectively


A need for approachable books on scalable data
management algorithms and techniques


Time is ripe for projects to stimulate collaboration and
cross-fertilization of ideas, like information integration


Two areas are identified for competitions


System components for cloud computing


Large-scale information extraction