Privacy and Data Mining
Rakesh Agrawal
Intelligent Information Systems Research
IBM Almaden Research Center
Joint work with R. Srikant & A. Evfimievski
Thesis
Data mining has made tremendous advances in recent years in both research and commercialization.
However, serious concerns have emerged about the social implications of data mining that threaten the survivability of the field.
We, the data mining researchers, need to make a concerted effort to address these concerns, which will require:
– Advances in our theoretical understanding of the principles underlying Data Mining;
– An integrated approach to security and privacy in all phases of data management and analysis.
Data Mining at the Crossroads
“The President and Congress take those steps necessary to ensure the protection of U.S. persons’ privacy and the efficient and effective oversight of government data mining activities.”
Safeguarding Privacy in the Fight Against Terrorism, Technology & Privacy Advisory Committee, 2004.
“Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue.”
Forrester Research, 2001
“The right to privacy: the most cherished of human freedom.”
Warren & Brandeis, 1890
Outline
Preliminary Feasibility Results:
– Client-Server Setting
– Distributed Setting
Outlook
Client-Server Setting
Application scenario: A central server interested in building a data mining model using data obtained from a large number of clients, while preserving their privacy
– Web-commerce, e.g. recommendation service
Desiderata:
– Must not slow down the speed of client interaction
– Must scale to very large number of clients
During the application phase
– Ship model to the clients
– Use oblivious computations
Implications:
– Action taken to preserve privacy of a record must not depend on other records
– Speed vs. Accuracy trade-off
World Today
[Figure: Alice, Bob, and Chris send their true records to the Recommendation Service, which receives them unchanged.]
True records: (35, 95,000, J.S. Bach, painting, nasa); (45, 60,000, B. Spears, baseball, cnn); (42, 85,000, B. Marley, camping, microsoft)
World Today
[Figure: the Recommendation Service holds the true records of Alice, Bob, and Chris and feeds them to the Mining Algorithm, which produces the Data Mining Model.]
True records: (35, 95,000, J.S. Bach, painting, nasa); (45, 60,000, B. Spears, baseball, cnn); (42, 85,000, B. Marley, camping, microsoft)
New Order: Randomization to Protect Privacy
[Figure: Alice, Bob, and Chris each randomize their record before sending it to the Recommendation Service.]
True records: (35, 95,000, J.S. Bach, painting, nasa); (45, 60,000, B. Spears, baseball, cnn); (42, 85,000, B. Marley, camping, microsoft)
Randomized records received by the service: (50, 65,000, Metallica, painting, nasa); (38, 90,000, B. Spears, soccer, fox); (32, 55,000, B. Marley, camping, linuxware)
35 becomes 50 (35+15)
– Per-record randomization without considering other records
– Randomization parameters common across users
– Randomization techniques differ for numeric and categorical data
– Each attribute randomized independently
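The per-record randomization described on this slide can be sketched as follows. This is a minimal illustration, not the method of the cited papers: the attribute names (`age`, `salary`, `site`), the uniform noise ranges, and the keep-or-replace rule for categorical values are all assumptions made here for the example. What it does take from the slide is that each attribute is randomized independently, that numeric and categorical attributes are handled differently, and that the parameters are shared by all users.

```python
import random

# Hypothetical shared parameters, identical for every user (assumed values).
PARAMS = {
    "age": ("numeric", 15),          # add uniform noise from [-15, +15]
    "salary": ("numeric", 30_000),   # add uniform noise from [-30000, +30000]
    "site": ("categorical", 0.6, ["nasa", "cnn", "microsoft", "fox", "linuxware"]),
}

def randomize_record(record, params, rng):
    """Randomize one record on the client, one attribute at a time.

    Only the record itself and the shared parameters are used,
    never any other user's record.
    """
    out = {}
    for attr, value in record.items():
        spec = params[attr]
        if spec[0] == "numeric":
            spread = spec[1]
            out[attr] = value + rng.uniform(-spread, spread)
        else:
            keep_prob, domain = spec[1], spec[2]
            # Keep the true value with probability keep_prob,
            # otherwise replace it with a uniformly chosen domain value.
            out[attr] = value if rng.random() < keep_prob else rng.choice(domain)
    return out

alice = {"age": 35, "salary": 95_000, "site": "nasa"}
sent_to_server = randomize_record(alice, PARAMS, random.Random())
```

Only `sent_to_server` would leave the client in this sketch; the true record stays local.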
New Order: Randomization to Protect Privacy
True values Never Leave the User!
[Figure: the true records stay with Alice, Bob, and Chris; the Recommendation Service receives only the randomized records.]
New Order: Randomization Protects Privacy
[Figure: the randomized records held by the Recommendation Service pass through a Recovery step and the Mining Algorithm to produce the Data Mining Model.]
Recovery of distributions, not individual records
Reconstructing Distributions (Numeric Data)
To hide original values $x_1, x_2, \ldots, x_n$
– from probability distribution X (unknown)
we use $y_1, y_2, \ldots, y_n$
– from probability distribution Y
Problem: Given
– $x_1+y_1, x_2+y_2, \ldots, x_n+y_n$
– the probability distribution of Y
Estimate the probability distribution of X.
Reconstruction Algorithm
$f_X^0$ := Uniform distribution
j := 0
repeat
    $f_X^{j+1}(a)$ := Bayes’ Rule update
    j := j+1
until (stopping criterion met)
Bayes’ Rule update:
$$f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^j(z)\, dz}$$
(R. Agrawal, R. Srikant. Privacy Preserving Data Mining. SIGMOD 2000)
Converges to maximum likelihood estimate.
(D. Agrawal & C.C. Aggarwal, PODS 2001)
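A discretized version of the iteration above might look like the sketch below. It is an illustration under stated assumptions rather than the implementation from the SIGMOD 2000 paper: the unknown distribution of X is restricted to a small grid of candidate values, the integral becomes a sum over that grid, a fixed iteration count stands in for the stopping criterion, and the noise is assumed Gaussian. The names `reconstruct`, `gaussian_pdf`, and `grid` are hypothetical.

```python
import math
import random

def gaussian_pdf(d, sigma):
    """Density of the assumed noise distribution Y = N(0, sigma^2) at d."""
    return math.exp(-0.5 * (d / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def reconstruct(w, noise_pdf, grid, iterations=50):
    """Estimate the distribution of X on `grid` from randomized values w_i = x_i + y_i."""
    fx = [1.0 / len(grid)] * len(grid)           # f_X^0 := uniform
    for _ in range(iterations):                  # fixed count replaces the stopping criterion
        new_fx = [0.0] * len(grid)
        for wi in w:
            # Numerator of Bayes' rule for every candidate value a.
            joint = [noise_pdf(wi - a) * p for a, p in zip(grid, fx)]
            total = sum(joint)                   # discretized denominator
            if total > 0.0:
                for k, v in enumerate(joint):
                    new_fx[k] += v / total
        norm = sum(new_fx)                       # equals n unless a term was skipped
        fx = [v / norm for v in new_fx]
    return fx

# Synthetic check: 70% of the true values are 30, 30% are 50.
rng = random.Random(0)
true_x = [30] * 1400 + [50] * 600
w = [x + rng.gauss(0, 3) for x in true_x]        # only these values reach the server
estimate = reconstruct(w, lambda d: gaussian_pdf(d, 3.0), [20, 30, 40, 50, 60])
```

The server sees only `w` and the noise density, yet `estimate` should put most of its mass on 30 and 50; no individual `x` is recovered.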
Works Well
[Figure: Number of People (0 to 1200) vs. Age (20 to 60); curves: Original, Randomized, Reconstructed.]
Application to Building Decision Trees
Age   Salary   Repeat Visitor?
23    50K      Repeat
17    30K      Repeat
43    40K      Repeat
68    50K      Single
32    70K      Single
20    20K      Repeat
[Figure: decision tree]
Age < 25?
    Yes: Repeat
    No: Salary < 50K?
        Yes: Repeat
        No: Single
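As a quick consistency check, the tree on this slide can be written as a function and run against the six training records. The function name `classify` and the encoding of salary in thousands are choices made here for illustration.

```python
def classify(age, salary_k):
    """Decision tree from the slide: Age < 25, then Salary < 50K."""
    if age < 25:
        return "Repeat"
    if salary_k < 50:
        return "Repeat"
    return "Single"

# (age, salary in thousands, label) rows from the slide's table.
records = [
    (23, 50, "Repeat"),
    (17, 30, "Repeat"),
    (43, 40, "Repeat"),
    (68, 50, "Single"),
    (32, 70, "Single"),
    (20, 20, "Repeat"),
]

misclassified = [r for r in records if classify(r[0], r[1]) != r[2]]
```

Here `misclassified` comes out empty, so the tree fits the table exactly.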
Algorithms
Global
– Reconstruct for each attribute once at the beginning
By Class
– For each attribute, first split by class, then reconstruct separately for each class.
Local
– Reconstruct at each node
See SIGMOD 2000 paper for details.
Experimental Methodology
Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any corrections for randomization.
Test data not randomized.
Synthetic benchmark from [AGI+92].
Training set of 100,000 records, split equally between the two classes.
Accuracy vs. Randomization
[Figure: Accuracy (40 to 100) vs. Randomization Level (10 to 200) for Fn 3; curves: Original, Randomized, Reconstructed.]
More on Randomization
Privacy-Preserving Association Rule Mining Over Categorical Data
– Rizvi & Haritsa [VLDB 02]
– Evfimievski, Srikant, Agrawal, & Gehrke [KDD 02]
Privacy Breach Control: Probabilistic limits on what one can infer with access to the randomized data as well as mining results
– Evfimievski, Srikant, Agrawal, & Gehrke [KDD 02]
– Evfimievski, Gehrke & Srikant [PODS 03]
Outline
Preliminary Feasibility Results:
– Client-Server Setting
– Distributed Setting
Outlook
Distributed Setting
Sovereign entities interested in computation across private
databases
Implication: An entity has access to all the records in its
database prior to start of the computation
Considerable increase in available options
Sovereign Computing
Compute queries across databases so that no more information than necessary is revealed (without using a trusted third party).
Need is driven by several trends:
– End-to-end integration of information systems across companies (virtual organizations)
– Simultaneously compete and cooperate.
– Security: need-to-know information sharing
R. Agrawal, D. Asonov, P. Baliga, L. Liang, B. Porst, R. Srikant. A Reusable Platform for Building Sovereign Information Sharing Applications. DIVO 04.
R. Agrawal, D. Asonov, R. Srikant. Enabling Sovereign Information Sharing Using Web Services. SIGMOD 04 (Industrial Track).
R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. SIGMOD 03.
Security Application
Security Agency finds those passengers who are in its list of suspects, but not the names of other passengers.
Airline does not find anything.
[Figure: Agency (Suspect List) and Airline (Passenger List).]
http://www.informationweek.com/story/showArticle.jhtml?articleID=184010%79
Epidemiological Research
Validate hypothesis between adverse reaction to a drug and a specific DNA sequence.
Researcher should not learn anything beyond 4 counts:
                   Adverse Reaction   No Adv. Reaction
Sequence Present   ?                  ?
Sequence Absent    ?                  ?
[Figure: Medical Research Inst. querying two private sources, DNA Sequences and Drug Reactions.]
Minimal Necessary Sharing
[Figure: R = {a, u, v, x}, S = {b, u, v, y}.]
R ∩ S = {u, v}
– R must not know that S has b & y
– S must not know that R has a & x
Count (R ∩ S)
– R & S do not learn anything except that the result is 2.
Problem Statement: Minimal Sharing
Given:
– Two parties (honest-but-curious): R (receiver) and S (sender)
– Query Q spanning the tables R and S
– Additional (pre-specified) categories of information I
Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I
– For example, in the upcoming intersection protocols I = { |R|, |S| }
A Possible Approach
Secure Multi-Party Computation
– Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
– Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].
Prohibitive cost for database-size problems.
– Intersection of two relations of a million records each would require 144 days (Yao’s protocol)
Intersection Protocol
[Figure: R holds secret key a; S holds secret key b; S computes $f_b(S)$.]
$f_b(S)$ is shorthand for $\{ f_b(s) \mid s \in S \}$
Commutative Encryption
$f_a(f_b(s)) = f_b(f_a(s))$
$f(s, b, p) = s^b \bmod p$
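The commutative property of the power function can be checked directly. The sketch below uses a fixed Mersenne prime and small toy keys purely for illustration; the modulus, the keys, and the sample value are assumptions, and a real deployment would choose the group and keys much more carefully (for instance, keys that make the map invertible).

```python
P = 2**127 - 1          # a known Mersenne prime, used here only as an example modulus

def f(s, key, p=P):
    """Power function from the slide: f(s, key, p) = s^key mod p."""
    return pow(s, key, p)

a = 65537               # toy secret key for R (assumed value)
b = 257                 # toy secret key for S (assumed value)
s = 123456789           # some value already mapped into the group (assumed)

# Encrypting with a then b gives the same result as b then a,
# because (s^a)^b = (s^b)^a = s^(a*b) mod p.
left = f(f(s, b), a)
right = f(f(s, a), b)
```

Both `left` and `right` equal `pow(s, a * b, P)`, which is the identity the protocol relies on.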
Intersection Protocol
[Figure: S sends $f_b(S)$ to R. R encrypts it with its own key a to obtain $f_a(f_b(S))$, which equals $f_b(f_a(S))$ by the commutative property.]
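Intersection Protocol
[Figure: R sends $f_a(R)$ to S. S returns the pairs $\{ \langle f_a(r), f_b(f_a(r)) \rangle \}$. Since R knows $\langle r, f_a(r) \rangle$, it obtains $\langle r, f_b(f_a(r)) \rangle$ for each of its values.]
R already holds $f_b(f_a(S))$, so r is in the intersection exactly when $f_b(f_a(r))$ appears in $f_b(f_a(S))$.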
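The three Intersection Protocol slides can be put together in one small simulation. This is a toy sketch under several assumptions: both parties run in one process, values are mapped into the group with SHA-256, the modulus is a fixed Mersenne prime, and keys are drawn until they are coprime to p − 1 so that encryption is one-to-one. The helper names `h`, `make_key`, and `intersection` are hypothetical, and none of the parameter choices should be read as the ones used in the cited papers.

```python
import hashlib
import math
import secrets

P = 2**127 - 1                       # example prime modulus (assumption)

def h(value):
    """Map a string into {1, ..., P-1} (hypothetical hashing step)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def make_key():
    """Draw a key coprime to P-1 so that x -> x^key mod P is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def f(x, key):
    return pow(x, key, P)

def intersection(R, S):
    a, b = make_key(), make_key()            # a is R's secret, b is S's secret
    # S -> R: f_b(S), sorted so the order reveals nothing about S.
    fb_S = sorted(f(h(s), b) for s in S)
    # R applies its own key: f_a(f_b(S)) = f_b(f_a(S)).
    fab_S = {f(y, a) for y in fb_S}
    # R -> S: f_a(R); R remembers the pairs <r, f_a(r)>.
    fa_of = {r: f(h(r), a) for r in R}
    # S -> R: pairs <f_a(r), f_b(f_a(r))>.
    pairs = {y: f(y, b) for y in sorted(fa_of.values())}
    # R keeps r when its doubly encrypted value matches one from S.
    return {r for r, y in fa_of.items() if pairs[y] in fab_S}

common = intersection({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
```

With the sets from the Minimal Necessary Sharing slide, `common` is `{"u", "v"}`; R never sees an unencrypted element of S, and S never sees an unencrypted element of R.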
Intersection Size
[Figure: same message flow as the Intersection Protocol. R sends $f_a(R)$ to S and holds $f_b(f_a(S))$; the pairs $\{ \langle f_a(r), f_b(f_a(r)) \rangle \}$ are shown at both S and R.]
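A size-only variant can be sketched in the same toy setting. The key assumption, taken from my reading of the cited SIGMOD 03 paper rather than from this slide, is that S returns the doubly encrypted values without pairing them to $f_a(r)$ and in a reordered sequence, so R can count matches but cannot tell which of its values matched. The modulus, hashing step, and helper names are illustrative choices.

```python
import hashlib
import math
import secrets

P = 2**127 - 1                       # example prime modulus (assumption)

def h(value):
    """Map a string into {1, ..., P-1} (hypothetical hashing step)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def make_key():
    """Draw a key coprime to P-1 so that encryption is one-to-one."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def f(x, key):
    return pow(x, key, P)

def intersection_size(R, S):
    a, b = make_key(), make_key()
    fb_S = sorted(f(h(s), b) for s in S)       # S -> R
    fab_S = {f(y, a) for y in fb_S}            # R computes f_a(f_b(S))
    fa_R = sorted(f(h(r), a) for r in R)       # R -> S
    # S -> R: f_b(f_a(R)), unpaired and re-sorted, so R cannot link
    # a returned value back to a particular r.
    fba_R = sorted(f(y, b) for y in fa_R)
    return len(fab_S.intersection(fba_R))

size = intersection_size({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
```

For the example sets, `size` is 2, matching the Count (R ∩ S) result on the Minimal Necessary Sharing slide.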
Related Protocols
[Naor & Pinkas 99]: Two protocols for list intersection problem
– Oblivious evaluation of n polynomials of degree n each.
– Oblivious evaluation of $n^2$ linear polynomials.
[Huberman et al 99]: find people with common preferences, without revealing the preferences.
– Intersection protocols are similar
[Clifton et al, 03]: Secure set union and set intersection
– Similar protocols
Performance
Airline application: 150,000 (daily) passengers and 1 million people in the watch list:
– 120 minutes with one accelerator card
– 12 minutes with ten accelerator cards
Epidemiological research: 1 million patient records in the hospital and 10 million records in the Genebank:
– 37 hours with one accelerator card
– 3.7 hours with ten accelerator cards
AEP SSL CARD Runner 2000 ≈ $2K
– 20K encryptions per minute
– 10x improvement over software implementation
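A back-of-the-envelope estimate shows where numbers of this size could come from. The sketch assumes that running time is dominated by encryptions, that throughput scales linearly with the number of cards, and that every value in both inputs is encrypted a fixed number of times. The factor of two for the airline case follows from the intersection protocol (each value is encrypted once by each party); the factor of four for the epidemiological case is only a guess that happens to match the slide, not something the slide states.

```python
ENCRYPTIONS_PER_MINUTE = 20_000      # per accelerator card, from the slide

def estimated_minutes(n_r, n_s, cards=1, encryptions_per_value=2):
    """Rough running time if encryption cost dominates (assumption)."""
    total_encryptions = encryptions_per_value * (n_r + n_s)
    return total_encryptions / (ENCRYPTIONS_PER_MINUTE * cards)

# Airline: 150,000 passengers against a 1 million entry watch list.
airline_one_card = estimated_minutes(150_000, 1_000_000)             # 115.0 minutes
airline_ten_cards = estimated_minutes(150_000, 1_000_000, cards=10)  # 11.5 minutes

# Epidemiology: 1 million and 10 million records, with an assumed
# four encryptions per value.
epi_hours = estimated_minutes(1_000_000, 10_000_000,
                              encryptions_per_value=4) / 60
```

The airline estimates land close to the 120 and 12 minutes quoted above, and `epi_hours` is about 36.7 under the four-encryption guess.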
Other Ideas
Cryptographic approaches to building data mining models
– ID3 Classifier [Lindell & Pinkas 2000]
– Purdue Toolkit [Clifton et al. 2003]
Global approaches to data perturbation (e.g. swapping) from Statistical Disclosure Control Community
Model combination and Voting
– Potential for leakage from individual models
Private Distributed ID3
[Lindell & Pinkas, Crypto 2000]
How to build a decision-tree classifier on the union of two horizontally partitioned private databases
Basic Idea:
– Find attribute with highest information gain privately
– Independently split on this attribute and recurse
Selecting the Split Attribute
– Given v1 known to DB1 and v2 known to DB2, compute (v1 + v2) log (v1 + v2) and output random shares of the answer
– Given random shares, use Yao's protocol [FOCS 86] to compute information gain.
Trade-off (Compared to Randomization approach)
+ Accuracy
– Performance & scaling
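To see why (v1 + v2) log (v1 + v2) is the building block, note that the conditional entropy of the class given an attribute can be written entirely in terms of such quantities, where v1 and v2 are the two databases' local counts. The sketch below computes it in the clear, with no cryptography at all; it only illustrates the arithmetic that the private protocol would evaluate on shares. The count layout (`counts[value][cls]`) and the function names are assumptions made for this example.

```python
import math

def xlogx(v1, v2):
    """(v1 + v2) * log2(v1 + v2), with 0 log 0 taken as 0."""
    v = v1 + v2
    return v * math.log2(v) if v > 0 else 0.0

def conditional_entropy(counts1, counts2):
    """H(class | attribute) over the union of two horizontal partitions.

    counts1[value][cls] and counts2[value][cls] are the local counts
    held by DB1 and DB2 for the same attribute values and classes.
    """
    total = 0
    acc = 0.0
    for value in counts1:
        n1 = sum(counts1[value].values())
        n2 = sum(counts2[value].values())
        total += n1 + n2
        acc += xlogx(n1, n2)                       # |T(a_j)| log |T(a_j)|
        for cls in counts1[value]:
            acc -= xlogx(counts1[value][cls], counts2[value][cls])
    return acc / total

db1 = {"young": {"Repeat": 2, "Single": 0}, "old": {"Repeat": 1, "Single": 1}}
db2 = {"young": {"Repeat": 1, "Single": 0}, "old": {"Repeat": 0, "Single": 1}}
h_cond = conditional_entropy(db1, db2)
```

The attribute with the lowest conditional entropy has the highest information gain, so comparing these values across attributes selects the split.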
Outline
Preliminary Feasibility Results:
– Client-Server Setting
– Distributed Setting
Outlook
Two Contradictory Perceptions of Data Mining
Not very valuable in practice
Too Powerful. Ban it.
What Gives?
Huge gap between the way we (the data mining community) think of data mining and the way the world perceives it.
Our Focus
Formalisms and algorithms for building data mining models
The Way the World Thinks of Data Mining
Collect and organize data
– Province of Data Management
Extract value from data
– Province of data mining
Includes:
– Model building and application
– Complex querying
– Search
“Data mining” is defined to mean: searches of one or more electronic databases of information concerning (U.S.) persons…
Technology & Privacy Advisory Committee, March 2004.
What should we do?
Accept the mandate
Broaden our view of what we are about
Take responsibility for the technology we create and design in safeguards against misuse
Some Key Problems
Understand the interactions between
– Generality
– Accuracy
– Performance
– Disclosure
Address head-on the issues of
– Background information
– Multiple operation
– Maliciousness
Game-theoretic incentive compatibility coupled with auditing?
System design principles (privacy can’t be an afterthought)
Closing Thoughts
Solutions to complex problems such as privacy require a mix of legislation, social customs, market forces, and technology [Lessig]
By advancing technology, we can change the mix and improve the overall quality of the solution
Gold mine of challenging and exciting research problems (besides being useful)!
Thank you!
Questions?