Privacy-preserving Data Mining for the Internet of Things: State of the Art
Yee Wei Law (罗裔纬)
wsnlabs.com
Speaker’s brief bio
• Ph.D. from the University of Twente for research on the security of wireless sensor networks (WSNs) in the EU project EYES
• Research Fellowship on WSNs from The University of Melbourne
  – ARC projects “Trustworthy sensor networks: theory and implementation” and “BigNet”
  – EU FP7 projects “SENSEI”, “SmartSantander”, “IoT-i”, “SocIoTal”
  – IBES seed projects on participatory sensing and smart grids
  – Taught the Master’s course “Sensor Systems”
• Professional membership:
  – Associate of (ISC)² (junior CISSP)
  – Smart Grid Australia Research Working Group
Current research interests:
• Privacy-preserving data mining
• Secure/resilient control
• Applications of the above to the IoT and smart grid
Current research orientation:
• Mixed basic/applied research in data science or network science
• Research involving probabilistic/statistical, combinatorial, and matrix analysis
Agenda
• The IoT and its research priorities
  – Participatory sensing (PS)
  – Collaborative learning (CL)
• Introduction to privacy-preserving data mining
• Schemes suitable for PS and CL
• Research opportunities and challenges
• If time permits, SOCIOTAL
“A dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols, where physical and virtual ‘things’ have identities, physical attributes, and virtual personalities, use intelligent interfaces, and are seamlessly integrated into the information network.”
Source: H. Sundmaeker et al., “Vision and Challenges for Realising the Internet of Things,” Cluster of European Research Projects on the Internet of Things, Mar. 2010.
Evidence of the Internet of Things
[Images: Nissan EPORO robot cars; the smart grid]
Research priorities
ITU-T: “Through the exploitation of identification, data capture, processing and communication capabilities, the IoT makes full use of things to offer services to all kinds of applications, whilst maintaining the required privacy.”
Among the research priorities:
• Mathematical models and algorithms for inventory management, production scheduling, and data mining
• Privacy-aware data processing
[Images: smart transport, smart grid, smart water, smart whatever]
Shifting priorities
[Timeline: from the ARPAnet to machine-to-machine communications; some graphics from Sabina Jeschke]
• We have enough technology to hook things up; now we should find better ways of capturing and analyzing data.
• Introducing participatory sensing and collaborative learning...
Participatory sensing
“A process whereby individuals and communities use ever more capable mobile phones and cloud services to collect and analyze systematic data for use in discovery.” (Source: Estrin et al.)
Citizen-provided data can improve governance, with benefits including:
• Increased public safety
• Increased social inclusion and awareness
• Increased resource efficiency for sustainable communities
• Increased public accountability
Data sharing scenarios
Lindell and Pinkas [2000]: “privacy-preserving data mining” refers to privacy-preserving distributed data mining.
Data sharing scenarios (cont’d)
• Collaborative learning: multiple data owners collaboratively analyze the union of their data with the involvement of a third-party data miner.
• Agrawal and Srikant [2000] coined the term “privacy-preserving data mining” to refer to privacy-preserving collaborative learning.
• Encrypting the data sent to the data miner is inadequate; the data should be masked, at a balanced point between accuracy and privacy.
Privacy-preserving collaborative learning
• Requirement imposed by participatory sensing:
  – online data submission, offline data processing
• Design space:
  – Data type:
    • continuous or categorical
    • voice, images, videos, etc.
  – Data structure:
    • relational or time series
    • for relational data: horizontally or vertically partitioned
  – Data mining operation
[Roadmap diagram: adversarial models and privacy criteria, syntactic vs. semantic; semantic criteria covered include SMC, differential privacy, randomization, and the proposed criterion; randomization splits into linear (additive, multiplicative) and nonlinear perturbation]
Adversarial models
Semi-honest (honest but curious):
• A passive attacker tries to learn the private states of other parties, without deviating from the protocol
• By definition, semi-honest parties do not collude
Malicious:
• An active attacker tries to learn the private states of other parties, and deviates arbitrarily from the protocol
• Common approach: design in the semi-honest model, then enhance the design for the malicious model
• General method: zero-knowledge proofs, which are often not practical
• The semi-honest model is often realistic enough
Syntactic privacy criteria
• Aim: prevent syntactic attacks, e.g., table linkage:
  – The attacker has access to an anonymous table and a non-anonymous table, with the anonymous table being a subset of the non-anonymous table
  – The attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the non-anonymous table
• Relevant for relational data, not time series data
• Example: k-anonymity
Semantic privacy criteria
• Aim: minimize the difference between the adversary’s prior knowledge and posterior knowledge about individuals represented in the database
• General enough for most data types, relational or time series
• Examples:
  – Cryptographic privacy, achieved through secure multiparty computation
  – Differential privacy, achieved through randomization
Secure multiparty computation
Oblivious transfer:
• Introduced by Rabin [1981]
• Kilian [1988] showed oblivious transfer is sufficient for secure two-party computation
• Naor et al. [2001] reduced the amortized overhead of oblivious transfer to one exponentiation per logarithmic number of oblivious transfers
• Homomorphic encryption can be used in the semi-honest model
[Diagram: two parties with private inputs x1 and x2 jointly compute the output f(x1, x2)]
• Metaphor: Yao’s millionaires’ problem [1982]
• Garbled circuits extend this to arbitrary functions [Beaver et al. 1990]
• Building block: oblivious transfer
  – In 1-out-of-n oblivious transfer, the sender holds n values and the receiver chooses one of them; the sender does not learn which value was chosen
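Oblivious transfer and garbled circuits are too involved to sketch in a few lines, but the flavor of secure multiparty computation can be shown with a minimal additive secret-sharing sum. This is a textbook construction for honest-but-curious parties, not Yao’s protocol or an OT implementation, and all values are invented for illustration:

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; all arithmetic is done mod this

def make_shares(value, n_parties):
    """Split a private value into n additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

inputs = [42, 17, 99]          # each party's private input
n = len(inputs)

# Party i sends its j-th share to party j; a single share reveals nothing
all_shares = [make_shares(x, n) for x in inputs]

# Each party publishes only the sum of the shares it received
partials = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]

# The published partial sums combine to the total without exposing any input
print(sum(partials) % MODULUS)  # 158
```

A semi-honest party sees only uniformly random shares of the other inputs, matching the adversarial model above; colluding parties can pool shares, which is why collusion resistance matters later.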
Differential privacy
• In cryptography, semantic security means that whatever is computable about the cleartext given the ciphertext is also efficiently computable without the ciphertext
• An analogous guarantee is useless for PPDM: a database satisfying it would have no utility
• Dwork [2006] proposed “differential privacy” for statistical disclosure control: add noise to query results
Differential privacy (cont’d)
• Theoretical basis for answering “sum queries”: a sum query has the form Σᵢ g(i, dᵢ), where i is the row index and dᵢ is the row
• Sum queries can be used to compute histograms, means, covariance, correlation, SVD, PCA, k-means, decision trees, etc.
• Noisy sum queries: if one row can change the query result by at most Δ (the sensitivity), then adding Laplace noise Lap(Δ/ε) to the result gives ε-differential privacy, i.e., Pr[K(D₁) ∈ S] ≤ e^ε · Pr[K(D₂) ∈ S] for any databases D₁, D₂ differing in one row
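As a concrete illustration, here is a minimal sketch of an ε-differentially-private sum query; the salary data, the clipping bound (which doubles as the L1 sensitivity), and the ε value are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sum(values, epsilon, bound):
    """Release an epsilon-differentially-private sum.

    Each record is clipped to [0, bound], so one record changes the sum
    by at most `bound`: the L1 sensitivity. Laplace noise with scale
    sensitivity/epsilon then yields epsilon-DP (Dwork 2006).
    """
    clipped = np.clip(values, 0.0, bound)
    return clipped.sum() + rng.laplace(loc=0.0, scale=bound / epsilon)

salaries = rng.uniform(30_000, 120_000, size=1_000)   # invented data
print(noisy_sum(salaries, epsilon=0.5, bound=120_000.0))
```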
Taxonomy of attacks against randomization-based approaches
Known input/sample attack:
• The attacker has some input samples and all output samples, but does not know which input sample corresponds to which output sample
• Typically begins by establishing correspondences between the input samples and the output samples
Known input-output attack:
• The attacker has some input samples and all output samples, and knows which input sample corresponds to which output sample
Proposed privacy criterion: the distance between f(X) and the attacker’s estimate of f(X) is kept above a specified threshold under known attacks.
[Diagram: participants each submit perturbed data to the data miner and receive results]
Randomization
• Additive perturbation: adds noise to the data
• i.i.d. noise is susceptible to:
  – the spectral filtering attack by Kargupta et al. [2003]
  – the PCA attack by Huang et al. [2005]:
    • Estimate the covariance matrix of the original data from the perturbed data
    • Find the eigenvalues and eigenvectors of the covariance matrix through PCA
    • Project the perturbed data onto the dominant eigenvectors to filter out the noise
• Bayesian estimation may not have an analytic form
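A short sketch of why i.i.d. additive noise is weak, in the spirit of the PCA attack described above; the data model, noise level, and dimensions are invented, and the real attack also has to estimate the signal rank:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated original data: a rank-2 signal embedded in 10 dimensions
n, d, r = 2000, 10, 2
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))

sigma = 0.5
Y = X + rng.normal(scale=sigma, size=X.shape)   # i.i.d. additive perturbation

# PCA attack: subtract the (known) noise contribution from the sample
# covariance, keep the dominant eigenvectors, and project the noisy data
# onto that principal subspace to filter out most of the noise
cov_hat = np.cov(Y, rowvar=False) - sigma**2 * np.eye(d)
eigvals, eigvecs = np.linalg.eigh(cov_hat)
top = eigvecs[:, np.argsort(eigvals)[-r:]]      # estimated signal subspace
X_hat = Y @ top @ top.T                         # filtered reconstruction

print("perturbation MSE:", np.mean((Y - X) ** 2))      # ~ sigma^2
print("post-attack MSE: ", np.mean((X_hat - X) ** 2))  # much smaller
```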
[Taxonomy diagram: randomization is the randomized distortion or perturbation of data; it splits into linear perturbation (additive perturbation and multiplicative perturbation) and nonlinear perturbation, and is applied to both relational and time series data]
Collaborative learning using additive perturbation
• Compared to multiplicative perturbation, it is easier to recover the source data distribution f_X(x) from the perturbed data distribution and the noise distribution: the perturbed data distribution is estimated with kernel density estimation, and f_X(x) is then solved for through deconvolution
• To withstand attacks, the noise should be correlated with the data and participant-specific
• PoolView [Ganti et al. 2008] builds a model of the data, then generates noise from the model
• Attack: with a common noise model, a participant i can reconstruct another participant j’s data from j’s perturbed data
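A minimal numerical sketch of the distribution-recovery step: here the perturbed-data density is estimated with a plain histogram rather than kernel density estimation, and the deconvolution is a regularized Fourier division; the Gaussian data and noise parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Perturbed data Y = X + N, with the noise model publicly known
X = rng.normal(5.0, 1.0, size=50_000)        # participants' true values
Y = X + rng.normal(0.0, 0.8, size=X.size)    # what the data miner receives

grid = np.linspace(-3.0, 13.0, 1024)
dx = grid[1] - grid[0]

# Histogram estimate of the perturbed-data density f_Y (KDE also works)
f_Y, _ = np.histogram(Y, bins=grid.size, range=(grid[0], grid[-1] + dx),
                      density=True)

# Noise density sampled on a zero-centred grid with the same spacing
noise_grid = (np.arange(grid.size) - grid.size // 2) * dx
f_N = np.exp(-noise_grid**2 / (2 * 0.8**2)) / (0.8 * np.sqrt(2 * np.pi))

# f_Y = f_X convolved with f_N, so divide in the Fourier domain; the small
# constant regularises frequencies where f_N's spectrum nearly vanishes
F_Y = np.fft.fft(f_Y)
F_N = np.fft.fft(np.fft.ifftshift(f_N)) * dx
F_X = F_Y * np.conj(F_N) / (np.abs(F_N) ** 2 + 1e-3)
f_X_hat = np.real(np.fft.ifft(F_X))

print(f"reconstructed mode ~ {grid[np.argmax(f_X_hat)]:.2f} (true mode: 5.0)")
```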
Collaborative learning using additive perturbation (cont’d)
• Zhang et al. [2012] make the noise both data-dependent and participant-dependent; the data miner reconstructs the PDF of the original data from the PDF of the perturbed data y and the noise
• Catches:
  – The data miner has to know the participants’ parameters, so the system is not resilient to collusion
  – Data correlation between participants exposes them to attacks (recall the PCA-based attack?)
Multiplicative perturbation
Multiplicative perturbation multiplies the data with a noise matrix: input x, output y.
Rotation perturbation [Chen et al. 2005]:
• The noise matrix is an orthogonal matrix, i.e., its rows and columns are orthonormal
• Giannella et al.’s [2013] attack can estimate the original data using all the perturbed data and a small amount of original data:
  – Stage 1: find the maximally unique map β consistent with the distance-preserving perturbation; then we know which xᵢ is mapped to which yᵢ
  – Stage 2: find the perturbation-matrix estimate that maximizes the fit to the matched pairs
• Enhanced version: geometric perturbation
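A small sketch of rotation perturbation and the property that makes it attractive for data mining; the Gaussian data set and the columns-as-records convention are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

d, n = 5, 100
X = rng.normal(size=(d, n))                # columns are data records

# Rotation perturbation: the noise matrix is a random orthogonal matrix
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = Q @ X                                  # released perturbed data

# Orthogonal matrices preserve inner products and Euclidean distances,
# so distance-based miners (k-means, k-NN, ...) work on Y unchanged
print(np.allclose(X.T @ X, Y.T @ Y))       # True: Gram matrices are identical
```

This exact distance preservation is also what the attack above exploits: known inputs can be matched to outputs through their pairwise distances.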
Multiplicative perturbation: random projection
• Projection by a Gaussian random matrix:
  – Statistically orthogonal
  – Essentially a Johnson-Lindenstrauss transform: d-dimensional vectors are projected to k dimensions, and inter-point distances change by a factor of (1 ± ε) as long as k ≥ O(ε⁻² log n)
• Other Johnson-Lindenstrauss transforms exist
• Is the attack against orthogonal transforms adaptable to this setting?
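A minimal sketch of random projection as a Johnson-Lindenstrauss transform; the dimensions and the 1/√k scaling convention are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

n, d, k = 200, 1000, 300            # k on the order of eps^-2 log n
X = rng.normal(size=(n, d))

# Gaussian random projection, a Johnson-Lindenstrauss transform
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R                           # k-dimensional perturbed records

# Inter-point distances are preserved up to a (1 +/- eps) factor, w.h.p.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(f"distance ratio: {proj / orig:.3f}")   # close to 1
```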
Collaborative learning using multiplicative perturbation
• The goal is to use a different perturbation matrix for each participant
• Liu et al. [2012]: the data miner can learn an approximate inverse of the perturbation matrices R_u and R_v, and thereby obtain estimates of the data X_u and X_v!
• What about the privacy criterion?
Nonlinear perturbation
• Relies on linear perturbation to achieve projection
• A nearly many-to-one mapping provides the privacy property
• Nonlinear + linear perturbation: values are normalized, extreme values (potential outliers) are “squashed” by a nonlinear function such as tanh(x), and random matrices then project the result
• Can the many-to-one mapping be extended to the “normal” part of the curve?
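A sketch of the nonlinear + linear pipeline just described, with invented heavy-tailed data; tanh stands in for whatever squashing function a concrete scheme uses:

```python
import numpy as np

rng = np.random.default_rng(5)

X = rng.standard_t(df=3, size=(1000, 4))     # heavy-tailed data with outliers

# Normalise, squash extremes through tanh, then project with random matrices
Z = np.tanh((X - X.mean(axis=0)) / X.std(axis=0))   # extremes are "squashed"
R = rng.normal(size=(4, 3)) / np.sqrt(3)
Y = Z @ R                                    # released perturbed data

# tanh is nearly many-to-one at the tails: very different outliers map to
# almost the same value, which is where the privacy property comes from
print(np.tanh(3.0), np.tanh(8.0))            # ~0.9951 vs ~0.9999998
```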
Bayesian estimation attacks against multiplicative perturbation
• The attacker solves the underdetermined system Y = RX for X
• Maximum a posteriori (MAP) estimation (why? an underdetermined system needs a prior to single out a solution)
• If R is known:
  – Gaussian original data obviously simplifies the attacker’s problem
• If R is not known:
  – A difficult optimization problem, although Gaussian data simplifies it
  – The choice of the prior p(R) matters
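For the known-R, Gaussian-prior case, the MAP estimate has a closed form: the posterior mode of a noiseless underdetermined system under a standard Gaussian prior is the minimum-norm solution. A minimal sketch with invented dimensions:

```python
import numpy as np

rng = np.random.default_rng(6)

d, k = 10, 6                       # the projection discards 4 dimensions
R = rng.normal(size=(k, d))        # assume the attacker knows R

x = rng.normal(size=d)             # private record; prior x ~ N(0, I)
y = R @ x                          # observed perturbed record

# Under a Gaussian prior and noiseless Y = RX, the MAP estimate is the
# minimum-norm solution, i.e. the Moore-Penrose pseudoinverse applied to y
x_hat = np.linalg.pinv(R) @ y

print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```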
Independent component analysis against multiplicative perturbation
• The perturbation matrix is treated as an ICA mixing matrix, turning the attack into blind source separation
• Prerequisites for the attacker:
  – independence of the components
  – at most one Gaussian component
  – sparseness (e.g., Laplace-distributed components)
  – m ≥ (n+1)/2
• Steps:
  – estimate R
  – estimate X
  – resolve the permutation and scaling ambiguity
• Regimes: m < n calls for overcomplete/underdetermined ICA, typically via sparse representation; m = n is the standard case; nonnegative matrix factorization is a related separation technique
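A sketch of the ICA attack in the square (m = n) case, using scikit-learn’s FastICA (assumed available); Laplace sources satisfy the non-Gaussianity/sparseness prerequisites, and the mixing matrix plays the role of the perturbation matrix:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)

# Independent, sparse (Laplace) attributes: rows are attributes, columns records
n_attrs, n_recs = 3, 5000
X = rng.laplace(size=(n_attrs, n_recs))

R = rng.normal(size=(n_attrs, n_attrs))   # perturbation matrix, unknown to attacker
Y = R @ X                                 # perturbed data seen by the attacker

# Treat R as an ICA mixing matrix and run blind source separation
ica = FastICA(n_components=n_attrs, random_state=0)
X_hat = ica.fit_transform(Y.T).T          # recovered up to permutation/scaling

# Correlating recovered components with true attributes shows the recovery
# (resolving the permutation/scaling ambiguity needs side information)
corr = np.corrcoef(np.vstack([X, X_hat]))[:n_attrs, n_attrs:]
print(np.round(np.abs(corr), 2))          # one entry near 1 per row/column
```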
Research opportunities and challenges
• Commercial interest?
• Large design space: effectiveness depends as much on the nature of the data as on the data mining algorithms
• Challenging multidisciplinary problems necessitate a broad range of tools:
  – Scenario-dependent privacy criteria
  – Defenses and attacks evolve side by side
  – What is the role of dimensionality reduction?
  – Steganography for “traitor tracing”?
  – Many more from syntactic privacy, SMC, etc.
[Diagram: participants’ data passes through multiplicative and nonlinear perturbation before reaching the data mining algorithms, while Bayesian estimation attacks and ICA attacks target the perturbed data; tools: statistical analysis, Bayesian analysis, matrix analysis, time series analysis, optimization, signal processing]
• What is Big Data?
• Unsupervised learning of Big Data, e.g., deep learning
SOCIOTAL
• Vision: from a business-centric Internet of Things to a citizen-centric Internet of Things
• Main non-technical aim: create trust and confidence in Internet of Things systems, while providing user-friendly ways to contribute to and use the system, thus encouraging the creation of services of high socio-economic value
• Main technical aims:
  – Reliable and secure communications
  – Trustworthy data collection
  – Privacy-preserving data mining
Motivating use cases:
• Alice’s sensor network monitoring her house
• Alice’s friend Bob is granted access to Alice’s network while Alice is on vacation
• A sensor network monitoring a community microgrid feeds data to stakeholders
Duration: Sep 2013 – Aug 2016
Funding scheme: STREP
Total cost: €3.69m
EC contribution: €2.81m
Contract number: CNECT-ICT-609112
Conclusion
• Looking back: the 1970s gave us statistical disclosure control; the 2000s gave us PPDM
• Technological development expands the design space and invites multidisciplinary input
• Socio-economic development plays a critical role
Syntactic privacy criteria/definitions
To prevent syntactic attacks:
• Table linkage:
  – The attacker has access to an anonymous table and a non-anonymous table, with the anonymous table being a subset of the non-anonymous table
  – The attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the non-anonymous table
• Record linkage:
  – The attacker has access to an anonymous table and a non-anonymous table, and the knowledge that its target is represented in both tables
  – The attacker can uniquely identify the target’s record in the anonymous table from the target’s record in the non-anonymous table
• Attribute linkage:
  – The attacker has access to an anonymous table, and the knowledge that its target is represented in the table; the attacker can then infer the value(s) of its target’s sensitive attribute(s) from the group (e.g., 30-40 year-old females) the target belongs to
Example:
• k-anonymity
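A minimal sketch of checking k-anonymity on a toy table; the table, attribute names, and generalized values are invented for illustration:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True if every quasi-identifier combination occurs in >= k records."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(groups.values()) >= k

table = [  # generalized ages and truncated ZIP codes
    {"age": "30-40", "sex": "F", "zip": "53*", "disease": "flu"},
    {"age": "30-40", "sex": "F", "zip": "53*", "disease": "cold"},
    {"age": "30-40", "sex": "F", "zip": "53*", "disease": "flu"},
    {"age": "20-30", "sex": "M", "zip": "54*", "disease": "asthma"},
]

print(is_k_anonymous(table, ["age", "sex", "zip"], k=2))  # False: last group has 1 record
```

Note that even a k-anonymous group remains vulnerable to attribute linkage if its sensitive values are homogeneous, which is why stronger syntactic criteria exist.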