Privacy-preserving Data Mining for the Internet of Things: State of the Art

Yee Wei Law (罗裔纬)

wsnlabs.com

Speaker's brief bio

- Ph.D. from University of Twente for research on security of wireless sensor networks (WSNs) in EU project EYES
- Research Fellowship on WSNs from The University of Melbourne
- ARC projects "Trustworthy sensor networks: theory and implementation", "BigNet"
- EU FP7 projects "SENSEI", "SmartSantander", "IoT-i", "SocIoTal"
- IBES seed projects on participatory sensing, smart grids
- Taught Master's course "Sensor Systems"
- Professional memberships:
  - Associate of (ISC)² (junior CISSP)
  - Smart Grid Australia Research Working Group
- Current research interests:
  - Privacy-preserving data mining
  - Secure/resilient control
  - Applications of the above to the IoT and smart grid
- Current research orientation:
  - Mixed basic/applied research in data science or network science
  - Research involving probabilistic/statistical, combinatorial, and matrix analysis
Agenda

- The IoT and its research priorities
- Participatory sensing (PS)
- Collaborative learning (CL)
- Introduction to privacy-preserving data mining
- Schemes suitable for PS and CL
- Research opportunities and challenges
- If time permits, SOCIOTAL

"A dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where physical and virtual 'things' have identities, physical attributes, and virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network."

- H. Sundmaeker et al., "Vision and Challenges for Realising the Internet of Things," Cluster of European Research Projects on the Internet of Things, Mar. 2010.

Evidence of the Internet of Things

- Nissan EPORO robot cars
- Smart grid

Research priorities

ITU-T: "Through the exploitation of identification, data capture, processing and communication capabilities, the IoT makes full use of things to offer services to all kinds of applications, whilst maintaining the required privacy."

Among research priorities:

- Mathematical models and algorithms for inventory management, production scheduling, and data mining
- Privacy-aware data processing

[Figure: from the ARPAnet and machine-to-machine communications to smart transport, smart grid, smart water, smart whatever. Some graphics from Sabina Jeschke.]

We have enough tech to hook things up; now we should find better ways of capturing and analyzing data.


Introducing participatory sensing and collaborative learning...

Shifting priorities

Participatory sensing

A process whereby individuals and communities use evermore
-
capable mobile phones
and cloud services to collect and analyze systematic data for use in discovery
.

Source:
Estrin

et al.

Citizen
-
provided data
can improve
governance with
benefits including
:


Increased public
safety


Increased social
inclusion and
awareness


Increased resource
efficiency for
sustainable
communities


Increased public
accountability

Data sharing scenarios

Lindell and Pinkas [2000]: "privacy-preserving data mining" refers to privacy-preserving distributed data mining.

Data sharing scenarios (cont'd)

- Collaborative learning: multiple data owners collaboratively analyze the union of their data with the involvement of a third-party data miner.
- Agrawal and Srikant [2000] coined the term "privacy-preserving data mining" to refer to privacy-preserving collaborative learning.
- Merely encrypting the data sent to the data miner is inadequate; the data should be masked, at a balanced point between accuracy and privacy.




Privacy-preserving collaborative learning

- Requirement imposed by participatory sensing: online data submission, offline data processing
- Design space:
  - Data type:
    - continuous or categorical
    - voice, images, videos, etc.
  - Data structure:
    - relational or time series
    - for relational data: horizontally or vertically partitioned
  - Data mining operation

[Roadmap figure: adversarial models; privacy criteria, split into syntactic and semantic (differential privacy, proposed criterion); techniques: SMC and randomization, the latter split into linear (additive, multiplicative) and nonlinear perturbation.]

Adversarial models

Semi
-
honest (honest but curious)


Passive attacker tries to
learn the private states of
other parties, without
deviating from protocol


By definition, semi
-
honest
parties do not collude

Malicious


Active attacker tries to learn
the private states of other
parties, and deviates
arbitrarily from protocol


Common approach: Design in the semi
-
honest model,
enhance it for the malicious model


General method: zero
-
knowledge proofs often not practical


Semi
-
honest model often realistic enough

Syntactic privacy criteria

To prevent syntactic attacks, e.g., table linkage:

- Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
- Attacker can infer the presence of its target's record in the anonymous table from the target's record in the nonanonymous table
- Relevant for relational data, not time series data

Example: k-anonymity

Semantic privacy criteria

To minimize the difference between adversarial prior knowledge and adversarial posterior knowledge about individuals represented in the database:

- General enough for most data types, relational or time series
- Examples:
  - Cryptographic privacy, realized through secure multiparty computation
  - Differential privacy, realized through randomization

Secure multiparty computation

Oblivious transfer:

- Introduced by Rabin [1981]
- Kilian [1988] showed oblivious transfer is sufficient for secure two-party computation
- Naor et al. [2001] reduce the amortized overhead of oblivious transfer to one exponentiation per logarithmic number of oblivious transfers
- Homomorphic encryption can be used in the semi-honest model

[Figure: two parties with private inputs x1 and x2 jointly compute the output f(x1, x2). Garbled circuits handle arbitrary functions [Beaver et al. 1990]. Metaphor: Yao's millionaires problem [1982]. Building block: 1-out-of-n oblivious transfer, in which the receiver chooses one of the sender's n values and the sender doesn't learn which.]
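To make the flavor of SMC concrete, here is a minimal sketch of a secure sum via additive secret sharing in the semi-honest model; this illustrates the SMC idea rather than the oblivious-transfer or garbled-circuit protocols above, and all names and values are illustrative.

```python
import secrets

P = 2**61 - 1  # public prime modulus; all arithmetic is mod P

def share(value, n_parties):
    """Split `value` into n additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Each party holds a private input.
inputs = [42, 17, 99]
n = len(inputs)

# Every party shares its input among all parties (share j goes to party j).
all_shares = [share(x, n) for x in inputs]

# Each party locally sums the shares it received...
partial_sums = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]

# ...and only the partial sums are published; their total reveals the sum
# f = x1 + x2 + x3, while no single input is revealed to a semi-honest party.
print(sum(partial_sums) % P)  # -> 158
```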

Differential privacy

- In cryptography, semantic security: whatever is computable about the cleartext given the ciphertext is also efficiently computable without the ciphertext
- Useless for PPDM: a DB satisfying the above has no utility
- Dwork [2006] proposed "differential privacy" for statistical disclosure control: add noise to query results

Differential privacy (cont'd)

- Theoretical basis for answering "sum queries"
- Sum queries can be used for histogram, mean, covariance, correlation, SVD, PCA, k-means, decision tree, etc.
- Differential privacy: a randomized mechanism K is ε-differentially private if, for any databases D1 and D2 differing in one row and any S ⊆ Range(K), Pr[K(D1) ∈ S] ≤ exp(ε) · Pr[K(D2) ∈ S]
- Sensitivity: Δf = max over such D1, D2 of ‖f(D1) − f(D2)‖₁
- Laplace noise: answering a sum query f with f(D) + Lap(Δf/ε) yields an ε-differentially private noisy sum query
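A minimal sketch of the Laplace mechanism for a noisy sum query, assuming each individual's value is clipped to a public range so the sensitivity is known; the function name and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sum(values, epsilon, lower, upper):
    """Answer a sum query with epsilon-differential privacy via the Laplace
    mechanism. Each individual's value is clipped to [lower, upper], so
    replacing one record changes the sum by at most (upper - lower)."""
    clipped = np.clip(values, lower, upper)
    sensitivity = upper - lower
    return clipped.sum() + rng.laplace(scale=sensitivity / epsilon)

ages = np.array([23, 35, 52, 41, 29])
print(noisy_sum(ages, epsilon=0.5, lower=0, upper=100))
```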

Taxonomy of attacks against randomization-based approaches

Known input/sample attack:

- The attacker has some input samples and all output samples, but does not know which input sample corresponds to which output sample
- Typically begins with establishing correspondences between the input samples and the output samples

Known input-output attack:

- The attacker has some input samples and all output samples, and knows which input sample corresponds to which output sample

Proposed privacy criterion: the distance between f(X) and the estimated f(X) is kept above a specified threshold under known attacks.

[Figure: several participants each send their Data to the data miner and receive a Result in return.]
Randomization

Additive perturbation: adds noise to the data.

iid noise is susceptible to:

- Spectral filtering attack by Kargupta et al. [2003]
- PCA attack by Huang et al. [2005]:
  - Estimate the covariance matrix of the original data
  - Find the eigenvalues and eigenvectors of the covariance matrix through PCA
- Bayesian estimation may not have an analytic form
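A minimal sketch of why iid additive noise is vulnerable, in the spirit of the PCA attack of Huang et al. [2005] (the published attack differs in detail); the data, dimensions, and eigenvalue threshold below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated original data: n samples of a d-dimensional low-rank signal.
n, d, rank = 2000, 10, 2
X = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, d))

# Additive perturbation with iid Gaussian noise.
sigma = 1.0
Y = X + rng.normal(scale=sigma, size=X.shape)

# PCA-style attack: iid noise inflates every covariance eigenvalue by about
# sigma^2, while the signal lives in a few leading eigenvectors.
cov = np.cov(Y, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top = eigvecs[:, eigvals > 2 * sigma**2]   # keep signal-bearing directions

# Project the perturbed data onto the estimated signal subspace.
X_hat = Y @ top @ top.T
print("error before filtering:", np.linalg.norm(Y - X) / np.linalg.norm(X))
print("error after  filtering:", np.linalg.norm(X_hat - X) / np.linalg.norm(X))
```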




[Taxonomy figure: randomization = randomized distortion or perturbation of data, split into linear perturbation (additive, multiplicative) and nonlinear perturbation, with annotations for time series data and relational data.]

Collaborative learning using additive perturbation

- Compared to multiplicative perturbation, it is easier to recover the source data distribution f_X(x) from the perturbed data distribution (estimated with kernel density estimation) and the noise distribution, by solving a deconvolution problem
- To resist attacks, the noise has to be correlated with the data and participant-specific
- PoolView [Ganti et al. 2008] builds a model of the data, then generates noise from the model
- Attack: with a common noise model, a participant (i) can reconstruct another participant's (j) data from the perturbed data
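A minimal sketch of the distribution-recovery step (KDE plus Fourier deconvolution), assuming additive Gaussian noise of known standard deviation; the grid, sample sizes, and regularization constant are illustrative, and the KDE kernel's own smoothing is ignored for simplicity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Participants' data X, perturbed by additive noise N of known distribution.
x = rng.normal(loc=2.0, scale=1.0, size=5000)
noise_sd = 0.8
y = x + rng.normal(scale=noise_sd, size=x.size)

# Grid on which to reconstruct the source density f_X.
grid = np.linspace(-3, 7, 1024)
dt = grid[1] - grid[0]

# Step 1: estimate the perturbed density f_Y with kernel density estimation.
f_y = stats.gaussian_kde(y)(grid)

# Step 2: deconvolve. In the Fourier domain, phi_Y = phi_X * phi_N, so
# phi_X = phi_Y / phi_N (regularized to avoid dividing by ~0).
freqs = np.fft.fftfreq(grid.size, d=dt)
phi_y = np.fft.fft(f_y)
phi_n = np.exp(-2 * (np.pi * noise_sd * freqs) ** 2)  # Gaussian char. func.
f_x = np.real(np.fft.ifft(phi_y / np.maximum(phi_n, 1e-3)))

# f_x now approximates the true N(2, 1) density of the source data.
print(grid[np.argmax(f_x)])  # mode should be near 2.0
```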

Collaborative learning using additive perturbation (cont'd)

Zhang et al. [2012]: the noise is data-dependent and participant-dependent.

Catches:

- The data miner has to know the participants' parameters, so the system is not resilient to collusion
- Data correlation between participants exposes them to attacks (recall the PCA-based attack?)

[Figure: the PDF of the data is reconstructed by the data miner based on the PDF of y and the noise.]

Multiplicative perturbation

Multiplies the data with noise: input x → perturbation → output y.

Rotation perturbation [Chen et al. 2005]:

- Noise matrix is an orthogonal matrix, with orthonormal rows and columns
- Giannella et al.'s [2013] attack can estimate the original data using all perturbed data and a small amount of original data:
  - Attack stage 1: find a maximally unique map β matching known original samples to perturbed samples; then we know which x_i is mapped to which y_i
  - Attack stage 2: find the estimate of the original data that maximizes the resulting likelihood
- Enhanced version: geometric perturbation
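A minimal sketch of rotation perturbation, assuming the orthogonal noise matrix is drawn via QR decomposition of a Gaussian matrix; the sizes are illustrative. The final check shows why perturbed data remain useful for distance-based mining.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 1000

# Random orthogonal noise matrix R (QR of a Gaussian matrix gives a
# random rotation/reflection).
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

X = rng.standard_normal((d, n))      # original data, one column per record
Y = R @ X                            # rotation perturbation: y = R x

# Orthogonality preserves inner products and Euclidean distances, which is
# why many data mining algorithms (k-means, kNN, SVM, ...) still work on Y:
print(np.allclose(X.T @ X, Y.T @ Y))  # True
```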

Multiplicative perturbation: random projection

- Projection by a Gaussian random matrix
- Statistically orthogonal
- Essentially a Johnson-Lindenstrauss transform: vectors are projected from d dimensions down to k dimensions, and inter-point distances change by a factor of (1 ± ε) as long as k = O(ε⁻² log n)
- Other Johnson-Lindenstrauss transforms exist
- Is the attack against the orthogonal transform adaptable for this?
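A minimal sketch of a Gaussian random projection in the Johnson-Lindenstrauss style; the 1/√k scaling is the standard choice, and all other values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 500, 1000, 300   # n points in d dims, projected down to k dims

X = rng.standard_normal((n, d))

# Gaussian random projection: entries iid N(0, 1/k), so squared distances
# are preserved in expectation (a Johnson-Lindenstrauss transform).
P = rng.normal(scale=1 / np.sqrt(k), size=(d, k))
Y = X @ P

# Compare an inter-point distance before and after projection.
i, j = 0, 1
before = np.linalg.norm(X[i] - X[j])
after = np.linalg.norm(Y[i] - Y[j])
print(f"distance ratio: {after / before:.3f}")  # close to 1 (within 1 ± eps)
```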


Collaborative learning using multiplicative perturbation

Goal: use a different perturbation matrix for each participant.

Liu et al. [2012]:

- Learn approximate inverses of R_u and R_v
- The data miner then gets estimates of X_u and X_v!
- What about the privacy criterion?

Nonlinear perturbation

- Relies on linear perturbation to achieve projection
- A near-many-to-one mapping provides the privacy property
- Can the many-to-one mapping be extended to the "normal" part of the curve?

Nonlinear + linear perturbation: pass the data through a nonlinear function such as tanh(x), then multiply by random matrices. Normalized values pass through nearly unchanged, while extreme values (potential outliers) are "squashed".
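A minimal sketch of the nonlinear-then-linear idea, assuming tanh as the squashing function and a Gaussian random matrix for the linear stage; all sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 5, 3, 1000

X = rng.standard_normal((d, n))
X[:, 0] = 25.0   # plant an extreme record (potential outlier)

# Nonlinear + linear perturbation: squash with tanh, then project with a
# random matrix. Values near 0 pass through almost linearly (tanh(x) ~ x),
# while extremes saturate to +/-1, so the mapping is nearly many-to-one
# for outliers.
R = rng.standard_normal((k, d)) / np.sqrt(k)
Y = R @ np.tanh(X)

print(np.tanh(25.0))  # ~1.0: extreme values saturate
print(np.tanh(0.3))   # ~0.29: normalized values pass through almost linearly
```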

Bayesian estimation attacks against multiplicative perturbation

- Solve the underdetermined system Y = RX for X
- Maximum a posteriori estimation (why? an underdetermined system has infinitely many solutions, so a prior on X is needed to pick one)
- If R is known:
  - Gaussian original data obviously simplifies the attacker's problem
- If R is not known:
  - Difficult optimization problem, although Gaussian data simplifies the problem
  - Choice of p(R) matters
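A minimal sketch of the known-R case with a standard Gaussian prior on x, where the MAP estimate has the closed form x̂ = Rᵀ(RRᵀ)⁻¹y; the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 8, 4            # m < n: Y = R X is underdetermined

R = rng.standard_normal((m, n))
x = rng.standard_normal(n)          # original record, prior x ~ N(0, I)
y = R @ x                           # observed perturbed record

# MAP/minimum-norm estimate under a standard Gaussian prior on x:
# x_hat = R^T (R R^T)^{-1} y (the posterior mean when y = R x is noiseless).
x_hat = R.T @ np.linalg.solve(R @ R.T, y)

print(np.allclose(R @ x_hat, y))                      # consistent with data
print(np.linalg.norm(x_hat - x) / np.linalg.norm(x))  # residual uncertainty
```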

Independent component analysis against multiplicative perturbation

Prerequisites for attacker:

- independence
- at most one Gaussian component
- sparseness (Laplace)
- m ≥ (n+1)/2

Steps:

- estimate R
- estimate X
- resolve permutation and scaling ambiguity

The perturbation matrix is treated as a mixing matrix, so the attack is blind source separation. [Figure: the m < n case calls for overcomplete/underdetermined ICA, sparse representation, or nonnegative matrix factorization; m = n and m > n are the standard cases.]
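A minimal sketch of an ICA attack for the square (m = n) case using scikit-learn's FastICA; the mixing matrix and sources are illustrative, and recovery is only up to permutation and scaling.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
n_samples = 5000

# Two independent, non-Gaussian (Laplace) sources -- the "original data".
S = rng.laplace(size=(n_samples, 2))

# Multiplicative perturbation: mix with an unknown square matrix R (m = n).
R = np.array([[1.0, 0.5],
              [0.3, 2.0]])
Y = S @ R.T

# ICA attack: treat R as a mixing matrix and blindly separate the sources.
S_hat = FastICA(n_components=2, random_state=0).fit_transform(Y)

# Correlate recovered vs true sources to check separation quality.
corr = np.corrcoef(S.T, S_hat.T)[:2, 2:]
print(np.round(np.abs(corr), 2))  # one entry per row/column close to 1
```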

Research opportunities and challenges

- Commercial interest?
- Large design space: effectiveness depends as much on the nature of the data as on the data mining algorithms
- Challenging multidisciplinary problems necessitate a broad range of tools:
  - Scenario-dependent privacy criteria
  - Defenses and attacks evolve side by side
  - Role of dimensionality reduction?
  - Steganography for "traitor tracing"?
  - Many more from syntactic privacy, SMC, etc.
- What is Big Data?
- Unsupervised learning of Big Data, e.g., Deep Learning

[Figure: participants' data undergo multiplicative and nonlinear perturbation; the perturbed data feed data mining algorithms while facing Bayesian estimation attacks and ICA attacks. Tools: statistical analysis, Bayesian analysis, matrix analysis, time series analysis, optimization, signal processing.]


SOCIOTAL

- Vision: from a business-centric Internet of Things to a citizen-centric Internet of Things
- Main non-technical aim: create trust and confidence in Internet of Things systems, while providing user-friendly ways to contribute to and use the system, thus encouraging the creation of services of high socio-economic value
- Main technical aims:
  - Reliable and secure communications
  - Trustworthy data collection
  - Privacy-preserving data mining
- Motivating use cases:
  - Alice's sensor network monitoring her house
  - Alice's friend Bob granted access to Alice's network while Alice's on vacation
  - Sensor network monitoring a community microgrid, feeding data to stakeholders
- Duration: Sep 2013 - Aug 2016
- Funding scheme: STREP
- Total cost: 3.69m
- EC contribution: 2.81m
- Contract number: CNECT-ICT-609112

Conclusion

- Looking back: the 1970s gave us statistical disclosure control; the 2000s gave us PPDM
- Technological development expands the design space and invites multidisciplinary input
- Socio-economic development plays a critical role



Syntactic privacy criteria/definitions

To prevent syntactic attacks:

- Table linkage:
  - Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
  - Attacker can infer the presence of its target's record in the anonymous table from the target's record in the nonanonymous table
- Record linkage:
  - Attacker has access to an anonymous table and a nonanonymous table, and the knowledge that its target is represented in both tables
  - Attacker can uniquely identify the target's record in the anonymous table from the target's record in the nonanonymous table
- Attribute linkage:
  - Attacker has access to an anonymous table, and the knowledge that its target is represented in the table; the attacker can infer the value(s) of its target's sensitive attribute(s) from the group (e.g., 30-40 year-old females) the target belongs to

Examples:

- k-anonymity
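A minimal sketch of checking k-anonymity over chosen quasi-identifiers with pandas; the table, column names, and generalizations are illustrative.

```python
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    """Return the k for which df is k-anonymous: the size of the smallest
    group of records sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())

# Toy table with generalized quasi-identifiers (age band, zip prefix).
table = pd.DataFrame({
    "age_band":  ["30-40", "30-40", "30-40", "20-30", "20-30"],
    "zip":       ["53*",   "53*",   "53*",   "54*",   "54*"],
    "diagnosis": ["flu",   "cold",  "flu",   "flu",   "cold"],  # sensitive
})

print(k_anonymity(table, ["age_band", "zip"]))  # -> 2: table is 2-anonymous
```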