Amol's slides

7 Nov 2013

Probabilistic Databases

Amol Deshpande, University of Maryland

Overview


V.S. Subrahmanian

ProbView, PXML, Temporal Probabilistic Databases, Probabilistic Aggregates

Lise Getoor

Statistical Relational Learning, Probabilistic Relational Models, Entity Resolution

Amol

MauveDB: Statistical Modeling in Databases, Correlated tuples in probabilistic databases


Overview of Today’s Presentation


Model-based Views/MauveDB [Amol]

Goal: Making it easy to continuously apply statistical models to streaming data

Current focus on designing declarative interfaces, and on efficient maintenance algorithms

Less on the “probabilistic databases” issues

Statistical Relational Learning [Lise]

Representing arbitrarily correlated data and processing queries over it [Prithviraj]

Motivation


Unprecedented, and rapidly increasing, instrumentation of our everyday world

Huge data volumes generated continuously that must be processed in real time

Typically imprecise, unreliable, and incomplete data

Measurement noise, low success rates, failures, etc.

[Figure labels: wireless sensor networks; RFID; distributed measurement networks (e.g. GPS); industrial monitoring]

Data Processing Step 1


Process data using a statistical/probabilistic model


Regression and interpolation models


To eliminate spatial or temporal biases, handle missing data, make predictions

Filtering techniques (e.g. Kalman Filters), Bayesian Networks

To eliminate measurement noise, to infer hidden variables, etc.

[Figure labels: regression/interpolation models (temperature monitoring); Kalman Filters etc. (GPS data)]
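As a rough illustration of the filtering step, the sketch below runs a one-dimensional Kalman filter over noisy scalar readings. The random-walk state model, the noise variances `q` and `r`, and the function name `kalman_1d` are assumptions made for this example; none of them come from the slides or from MauveDB.

```python
def kalman_1d(observations, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Filter noisy scalar readings with a random-walk Kalman filter.

    q is the assumed process noise variance, r the assumed measurement
    noise variance. Returns the list of filtered state estimates.
    """
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: the state is assumed to stay put; uncertainty grows by q.
        p = p + q
        # Update: blend the prediction and the measurement via the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```

For example, `kalman_1d([1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0], x0=1.2)` returns one smoothed estimate per reading. A larger `r` relative to `q` trusts the model more and the measurements less.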

A Motivating Example


Inferring “transportation mode”/“activities” [Henry Kautz et al]

Using easily obtainable sensor data, e.g. GPS, RFID proximity data

Can do much if we can infer these automatically

[Figure labels: office, home]

Have access to noisy “GPS” data

Infer the transportation mode: walking, running, in a car, in a bus

Motivating Example


Inferring “transportation mode”/“activities” [Henry Kautz et al]

Using easily obtainable sensor data, e.g. GPS, RFID proximity data

Can do much if we can infer these automatically

[Figure labels: office, home]

Preferred end result: Clean path annotated with transportation mode

Dynamic Bayesian Network

Use a “generative model” for describing how the observations were generated

[Diagram, time slice t: M_t = transportation mode (walking, running, car, bus); X_t = true velocity and location; O_t = observed location]

Need conditional probability distributions

e.g. a distribution on (velocity, location) given the transportation mode

Prior knowledge or learned from data

Dynamic Bayesian Network

Use a “generative model” for describing how the observations were generated

[Diagram, time slices t and t+1: M_t, M_{t+1} = transportation mode (walking, running, car, bus); X_t, X_{t+1} = true velocity and location; O_t, O_{t+1} = observed location]
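One way to write down the joint distribution such a two-slice model defines is sketched below. The edge structure is an assumption, since the extracted slide text does not preserve the arrows: the mode M_t is taken to depend on M_{t-1}, the true state X_t on X_{t-1} and M_t, and the observation O_t on X_t only.

```latex
P(M_{1:T}, X_{1:T}, O_{1:T})
  = P(M_1)\, P(X_1 \mid M_1)\, P(O_1 \mid X_1)
    \prod_{t=2}^{T} P(M_t \mid M_{t-1})\, P(X_t \mid X_{t-1}, M_t)\, P(O_t \mid X_t)
```

Each factor on the right is one of the conditional probability distributions that must be supplied from prior knowledge or learned from data.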

Dynamic Bayesian Network

Given a sequence of observations (O_t), find the most likely M_t’s that explain it.

Or could provide a probability distribution on the possible M_t’s.

[Diagram, time slices t and t+1: M_t, M_{t+1} = transportation mode (walking, running, car, bus); X_t, X_{t+1} = true velocity and location; O_t, O_{t+1} = observed location]
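Finding the most likely sequence of modes can be sketched with the Viterbi algorithm on a simplified version of the model. This is an illustrative simplification and not the model from the slides: the continuous state X_t is dropped, observations are discretized into "slow" and "fast", only two modes are used, and all probabilities are made-up numbers.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most likely final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical parameters, chosen only for illustration.
STATES = ["Walking", "Car"]
START_P = {"Walking": 0.5, "Car": 0.5}
TRANS_P = {
    "Walking": {"Walking": 0.8, "Car": 0.2},
    "Car": {"Walking": 0.2, "Car": 0.8},
}
EMIT_P = {
    "Walking": {"slow": 0.9, "fast": 0.1},
    "Car": {"slow": 0.2, "fast": 0.8},
}
```

With these numbers, `viterbi(["slow", "slow", "fast", "fast"], STATES, START_P, TRANS_P, EMIT_P)` yields Walking, Walking, Car, Car. Producing a distribution over the M_t’s instead would use the forward-backward algorithm in place of the maximization.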

Statistical Modeling of Sensor Data


No support in database systems --> Database ends up being used as a backing store

With much replication of functionality

Very inefficient, not declarative…

How can we push statistical modeling inside a database system?


Abstraction: Model-based Views

An abstraction analogous to traditional database views

Present the output of the application of the model as a database view

That the user can query as with normal database views
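The idea of a model-based view can be sketched in a few lines of Python. The class below is hypothetical and is not MauveDB's interface: it keeps raw (time, value) readings, fits a least-squares line to them only when queried, and answers queries from the fitted model instead of from the raw rows. The names `RegressionView`, `insert`, and `select` are invented for this example.

```python
class RegressionView:
    """Hypothetical model-based view: exposes a fitted line, not raw rows."""

    def __init__(self, readings):
        # readings: (time, value) pairs from the raw table;
        # at least two distinct times are needed to fit a line.
        self.readings = list(readings)
        self._coeffs = None  # fitted lazily, on first query

    def _fit(self):
        # Ordinary least squares for value = slope * time + intercept.
        n = len(self.readings)
        mean_t = sum(t for t, _ in self.readings) / n
        mean_v = sum(v for _, v in self.readings) / n
        sxx = sum((t - mean_t) ** 2 for t, _ in self.readings)
        sxy = sum((t - mean_t) * (v - mean_v) for t, v in self.readings)
        slope = sxy / sxx
        self._coeffs = (slope, mean_v - slope * mean_t)

    def insert(self, time, value):
        # New raw data invalidates the fitted model.
        self.readings.append((time, value))
        self._coeffs = None

    def select(self, times):
        """Return (time, predicted value) rows, like querying a view."""
        if self._coeffs is None:
            self._fit()
        slope, intercept = self._coeffs
        return [(t, slope * t + intercept) for t in times]
```

For instance, a view built from the readings `(0, 20.0)`, `(1, 21.0)`, `(3, 23.0)` can be queried at time 2, for which no raw reading exists, and returns a value close to 22.0.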


Example DBN View

User view of the data (smoothed locations, inferred variables):

User   Time     Location     Mode      prob
John   5pm      (x’1, y’1)   Walking   0.9
John   5pm      (x’1, y’1)   Car       0.1
John   5:05pm   (x’2, y’2)   Walking   0
John   5:05pm   (x’2, y’2)   Car       1

Original noisy GPS data:

User   Time     Location
John   5pm      (x1, y1)
John   5:05pm   (x2, y2)

User query, e.g.: select count(*) group by mode sliding window 5 minutes

Application of the model/inference is pushed inside the database

Opens up many optimization opportunities

e.g. can do inference lazily when queried, etc.
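One possible reading of the example query over this view is an expected count per mode. This is an assumption: the slides do not say how count(*) is defined over probabilistic tuples. The sketch below sums tuple probabilities per mode, which gives the expected count by linearity of expectation. It also assumes that both timestamps fall inside a single window.

```python
from collections import defaultdict

# Rows of the example view: (user, time, mode, prob).
VIEW = [
    ("John", "5pm", "Walking", 0.9),
    ("John", "5pm", "Car", 0.1),
    ("John", "5:05pm", "Walking", 0.0),
    ("John", "5:05pm", "Car", 1.0),
]


def expected_count_by_mode(rows):
    """Sum tuple probabilities per mode to get the expected count."""
    totals = defaultdict(float)
    for _user, _time, mode, prob in rows:
        totals[mode] += prob
    return dict(totals)
```

On the rows above, the expected counts are about 0.9 for Walking and 1.1 for Car. Expected counts stay correct even when tuples are correlated. Other aggregates, such as the probability that a count exceeds a threshold, do not.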

Correlations

User   Time     Location     Mode      prob
John   5pm      (x’1, y’1)   Walking   0.9
John   5pm      (x’1, y’1)   Car       0.1
John   5:05pm   (x’2, y’2)   Walking   0
John   5:05pm   (x’2, y’2)   Car       1

Strong and complex correlations across tuples

- Mutual exclusivity

- Temporal correlations
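To see why mutual exclusivity matters, the sketch below compares two possible-worlds readings of the two 5pm tuples. It is an illustration only and says nothing about how MauveDB stores correlations. In the first reading exactly one of the tuples exists in each world. In the second, each tuple is wrongly treated as independently present with its marginal probability.

```python
from itertools import product


def prob_both_present(worlds):
    """worlds: list of (probability, set of modes present in that world)."""
    return sum(p for p, present in worlds if {"Walking", "Car"} <= present)


# Mutually exclusive: exactly one of the two 5pm tuples exists per world.
exclusive_worlds = [(0.9, {"Walking"}), (0.1, {"Car"})]

# Naive independence: each tuple is present with its own marginal probability.
marginals = {"Walking": 0.9, "Car": 0.1}
independent_worlds = []
for flags in product([True, False], repeat=len(marginals)):
    prob = 1.0
    present = set()
    for (mode, p), flag in zip(marginals.items(), flags):
        prob *= p if flag else 1.0 - p
        if flag:
            present.add(mode)
    independent_worlds.append((prob, present))
```

Under mutual exclusivity, the probability that John is both walking and in a car at 5pm is 0. Under the independence assumption it comes out as 0.9 × 0.1 = 0.09, so ignoring the correlation gives a wrong answer.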



MauveDB: Status


Written in the Apache Derby Java open source database system

Support for Regression- and Interpolation-based views

Neither produces probabilistic data

SIGMOD 2006 (w/ Sam Madden)

Currently building support for views based on Dynamic Bayesian networks [Bhargav]

Kalman Filters, HMMs, etc.

Initial focus on the user interfaces and efficient inference

Will generate probabilistic data; may not be able to do anything too sophisticated with it

Research Challenges/Future Work


Generalizing to arbitrary models?


Develop APIs for adding arbitrary models


Try to minimize the work of the model developer


Probabilistic databases


Uncertain data with complex correlation patterns


Query processing, query optimization


View maintenance in presence of high-rate measurement streams

Thanks !!

Mauve == Model-based User Views