Privacy and Data Mining

1

Privacy and Data Mining

Rakesh Agrawal


Intelligent Information Systems Research

IBM Almaden Research Center



Joint work with R. Srikant & A. Evfimievski

2

Thesis


Data mining has made tremendous advances in recent years
in both research and commercialization.


However, serious concerns have emerged about the social
implications of data mining that threaten the survivability of the
field.


We, the data mining researchers, need to make a concerted
effort to address these concerns, which will require:


Advances in our theoretical understanding of the principles
underlying Data Mining;


An integrated approach to security and privacy in all phases of
data management and analysis.


Data Mining at the Crossroads

“The President and Congress take those steps necessary to ensure the protection of U.S. persons’ privacy and the efficient and effective oversight of government data mining activities.”
Safeguarding Privacy in the Fight Against Terrorism, Technology & Privacy Advisory Committee, 2004.

“Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue.”
Forrester Research, 2001

“The right to privacy: the most cherished of human freedoms.”
Warren & Brandeis, 1890

4

Outline


Preliminary Feasibility Results:


Client-Server Setting


Distributed Setting


Outlook

5

Client-Server Setting


Application scenario: A central server is interested in building a data mining model using data obtained from a large number of clients, while preserving their privacy


Web commerce, e.g. a recommendation service


Desiderata:


Must not slow down the speed of client interaction


Must scale to a very large number of clients


During the application phase:


Ship the model to the clients


Use oblivious computations


Implications:


Action taken to preserve the privacy of a record must not depend on other records


Speed vs. accuracy trade-off

World Today

[Diagram] Alice, Bob, and Chris each send their true records to the Recommendation Service:

  Alice:  35, 95,000, J.S. Bach, painting, nasa
  Bob:    45, 60,000, B. Spears, baseball, cnn
  Chris:  42, 85,000, B. Marley, camping, microsoft

World Today

[Diagram] The Recommendation Service runs a Mining Algorithm over the collected records to build a Data Mining Model.

New Order: Randomization to Protect Privacy

[Diagram] Each client randomizes its record locally before sending it, e.g. Alice's age 35 becomes 50 (35+15). The service receives only the randomized records:

  Alice:  50, 65,000, Metallica, painting, nasa
  Bob:    38, 90,000, B. Spears, soccer, fox
  Chris:  32, 55,000, B. Marley, camping, linuxware

Per-record randomization without considering other records
Randomization parameters common across users
Randomization techniques differ for numeric and categorical data
Each attribute randomized independently

New Order: Randomization to Protect Privacy

[Diagram] True values never leave the user!

New Order: Randomization Protects Privacy

[Diagram] Before mining, the service runs a Recovery step: recovery of distributions, not individual records. The Mining Algorithm then builds the Data Mining Model from the recovered data.
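A minimal sketch of the per-record randomization step just described, assuming additive uniform noise for numeric attributes (as in the 35 + 15 = 50 example) and uniform replacement for categorical ones; the attribute names, noise range, and retention probability are illustrative assumptions, not the deck's actual parameters.

```python
import random

# Illustrative randomization parameters, common across all clients
# (assumed values; in practice the noise scale would be set per attribute).
NUMERIC_NOISE_RANGE = (-20, 20)      # additive noise for numeric attributes
CATEGORICAL_KEEP_PROB = 0.5          # probability of keeping a categorical value

def randomize_record(record, categorical_domains, rng=random):
    """Perturb one client record locally, independently of all other records.

    Numeric attributes get additive noise; categorical attributes are kept
    with probability CATEGORICAL_KEEP_PROB, otherwise replaced by a value
    drawn uniformly from the attribute's domain. Each attribute is
    randomized independently; true values never leave the client.
    """
    randomized = {}
    for attr, value in record.items():
        if isinstance(value, (int, float)):
            randomized[attr] = value + rng.uniform(*NUMERIC_NOISE_RANGE)
        elif rng.random() < CATEGORICAL_KEEP_PROB:
            randomized[attr] = value
        else:
            randomized[attr] = rng.choice(categorical_domains[attr])
    return randomized

# Example: Alice perturbs her record before sending it to the service.
alice = {"age": 35, "salary": 95000, "music": "J.S. Bach"}
domains = {"music": ["J.S. Bach", "B. Spears", "B. Marley", "Metallica"]}
print(randomize_record(alice, domains))
```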

11

Reconstructing Distributions
(Numeric Data)


To hide original values x_1, x_2, ..., x_n


from probability distribution X (unknown)


we use y_1, y_2, ..., y_n


from probability distribution Y


Problem: Given


x_1 + y_1, x_2 + y_2, ..., x_n + y_n


and the probability distribution of Y


Estimate the probability distribution of X.

12

Reconstruction Algorithm


f_X^0 := Uniform distribution

j := 0

repeat

  f_X^{j+1}(a) := Bayes' Rule:

  $$ f_X^{j+1}(a) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^{j}(a)}{\int f_Y\big((x_i + y_i) - z\big)\, f_X^{j}(z)\, dz} $$

  j := j + 1

until (stopping criterion met)


(R. Agrawal, R. Srikant. Privacy Preserving Data Mining. SIGMOD 2000)


Converges to maximum likelihood estimate.
(D. Agrawal & C.C. Aggarwal, PODS 2001)
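A minimal sketch of this update over a discretized (binned) domain, assuming uniform additive noise, a fixed iteration count as the stopping criterion, and NumPy; the names and parameters below are illustrative, not the paper's implementation.

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, bins, n_iter=100):
    """Estimate the distribution of X from observed w_i = x_i + y_i.

    w         : observed randomized values
    noise_pdf : function returning the density f_Y of the added noise
    bins      : candidate values a (bin centers) for X
    Implements the iterative Bayes-rule update above, starting from the
    uniform distribution, with the integral replaced by a sum over bins.
    """
    fx = np.ones(len(bins)) / len(bins)            # f_X^0 := uniform
    for _ in range(n_iter):                        # fixed iteration count as stopping criterion
        new_fx = np.zeros_like(fx)
        for wi in w:
            likelihood = noise_pdf(wi - bins) * fx     # f_Y(w_i - a) * f_X^j(a)
            new_fx += likelihood / likelihood.sum()    # normalize over a (Bayes' rule)
        fx = new_fx / len(w)
    return fx

# Example: ages uniform in [20, 40], noise uniform in [-20, 20].
rng = np.random.default_rng(0)
x = rng.uniform(20, 40, size=2000)
y = rng.uniform(-20, 20, size=2000)
bins = np.arange(0, 61, dtype=float)
uniform_noise_pdf = lambda d: ((d >= -20) & (d <= 20)) / 40.0
estimate = reconstruct_distribution(x + y, uniform_noise_pdf, bins, n_iter=50)
```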
13

Works Well

[Figure: number of people (0-1200) vs. age (20-60), comparing the Original, Randomized, and Reconstructed distributions.]
Application to Building Decision Trees

  Age   Salary   Repeat Visitor?
  23    50K      Repeat
  17    30K      Repeat
  43    40K      Repeat
  68    50K      Single
  32    70K      Single
  20    20K      Repeat

Decision tree learned from this data:

  Age < 25?   Yes -> Repeat
              No  -> Salary < 50K?   Yes -> Repeat
                                     No  -> Single
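As a quick check of the example, the tree above can be written out directly. A small illustrative sketch (helper names are made up) that classifies the six training records:

```python
def classify(age, salary):
    """The decision tree from the slide: split on Age < 25, then Salary < 50K."""
    if age < 25:
        return "Repeat"
    return "Repeat" if salary < 50_000 else "Single"

# The six training records from the table; all are classified correctly by the tree.
rows = [(23, 50_000, "Repeat"), (17, 30_000, "Repeat"), (43, 40_000, "Repeat"),
        (68, 50_000, "Single"), (32, 70_000, "Single"), (20, 20_000, "Repeat")]
assert all(classify(a, s) == label for a, s, label in rows)
```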
15

Algorithms


Global


Reconstruct for each attribute once at the beginning


By Class


For each attribute, first split by class, then reconstruct separately for each class (see the sketch after this list).


Local


Reconstruct at each node



See SIGMOD 2000 paper for details.
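A sketch of the By Class variant, reusing the reconstruct_distribution helper from the earlier sketch; the record layout is an illustrative assumption.

```python
def reconstruct_by_class(records, attribute, noise_pdf, bins):
    """'By Class' variant: partition the perturbed records by class label, then
    reconstruct the attribute's distribution separately for each class.

    records : list of (perturbed_values_dict, class_label) pairs
    Reuses reconstruct_distribution() from the earlier sketch.
    """
    by_class = {}
    for values, label in records:
        by_class.setdefault(label, []).append(values[attribute])
    return {label: reconstruct_distribution(w, noise_pdf, bins)
            for label, w in by_class.items()}

# Example (illustrative): per-class Age distributions from perturbed training data.
# per_class_age = reconstruct_by_class(training_records, "age", uniform_noise_pdf, bins)
```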


16

Experimental Methodology


Compare accuracy against


Original
: unperturbed data without randomization.


Randomized
: perturbed data but without making any corrections
for randomization.


Test data not randomized.


Synthetic benchmark from [AGI+92].


Training set of 100,000 records, split equally between the two
classes.

18

Accuracy vs. Randomization

[Figure: classification accuracy (40-100%) vs. randomization level (10-200) for the Original, Randomized, and Reconstructed data; synthetic benchmark function 3.]
19

More on Randomization


Privacy-Preserving Association Rule Mining over Categorical Data


Rizvi & Haritsa [VLDB 02]


Evfimievski, Srikant, Agrawal & Gehrke [KDD 02]


Privacy Breach Control: Probabilistic limits on what one can infer with access to the randomized data as well as mining results


Evfimievski, Srikant, Agrawal & Gehrke [KDD 02]


Evfimievski, Gehrke & Srikant [PODS 03]

20

Outline


Preliminary Feasibility Results:


Client-Server Setting


Distributed Setting


Outlook

21

Distributed Setting


Sovereign entities interested in computation across private
databases


Implication: An entity has access to all the records in its database prior to the start of the computation




Considerable increase in available options

22

Sovereign Computing


Compute queries across databases so that no more
information than necessary is revealed (without using a
trusted third party).


Need is driven by several trends:


End-to-end integration of information systems across companies (virtual organizations)


Simultaneously compete and cooperate.


Security: need-to-know information sharing

R. Agrawal, D. Asonov, P. Baliga, L. Liang, B. Porst, R. Srikant.
A Reusable Platform for Building
Sovereign Information Sharing Applications
. DIVO 04.

R. Agrawal, D. Asonov, R. Srikant.
Enabling Sovereign Information Sharing Using Web Services
.
SIGMOD 04 (Industrial Track).

R. Agrawal, A. Evfimievski, R. Srikant.
Information Sharing Across Private Databases
. SIGMOD 03.

23

Security Application


Security Agency finds those
passengers who are in its list
of suspects, but not the
names of other passengers.


Airline does not find anything.

[Diagram] Agency holds a Suspect List; Airline holds a Passenger List.


http://www.informationweek.com/story/showArticle.jhtml?articleID=184010%79

24

Epidemiological Research


Validate a hypothesized association between adverse reaction to a drug and a specific DNA sequence.


The researcher should not learn anything beyond 4 counts:


[Diagram] Two private databases, DNA Sequences and Drug Reactions; the Medical Research Inst. learns only the four counts below.

                      Adverse Reaction   No Adv. Reaction
  Sequence Present           ?                  ?
  Sequence Absent            ?                  ?
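The four counts can be viewed as sizes of intersections between sets held by the two parties, which is what the intersection-size protocol later in the deck computes privately. A non-private reference sketch, with made-up patient identifiers:

```python
# Reference (non-private) computation of the four counts.
# In the sovereign setting each party holds only its own sets, and the counts
# are obtained with the intersection-size protocol shown later in the deck.
all_patients = {"p1", "p2", "p3", "p4", "p5", "p6", "p7"}
with_sequence = {"p1", "p3", "p4", "p7"}          # patients carrying the DNA sequence
adverse = {"p1", "p2", "p4"}                      # patients with an adverse reaction

without_sequence = all_patients - with_sequence
no_adverse = all_patients - adverse

counts = {
    ("present", "adverse"):    len(with_sequence & adverse),
    ("present", "no adverse"): len(with_sequence & no_adverse),
    ("absent",  "adverse"):    len(without_sequence & adverse),
    ("absent",  "no adverse"): len(without_sequence & no_adverse),
}
```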

25

Minimal Necessary Sharing

[Diagram] R = {a, u, v, x}, S = {b, u, v, y}; R ∩ S = {u, v}.


R must not know that S has b & y


S must not know that R has a & x


Count(R ∩ S): R & S do not learn anything except that the result is 2.

26

Problem Statement:

Minimal Sharing


Given:


Two parties (honest-but-curious): R (receiver) and S (sender)


Query Q spanning the tables R and S


Additional (pre-specified) categories of information I


Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I


For example, in the upcoming intersection protocols: I = { |R|, |S| }

27

A Possible Approach


Secure Multi-Party Computation


Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.


Can be solved by building a combinatorial circuit and simulating that circuit [Yao86].


Prohibitive cost for database-size problems.


Intersection of two relations of a million records each would require 144 days (Yao's protocol)

Intersection Protocol

[Diagram] R holds set R and secret key a; S holds set S and secret key b. S sends f_b(S) to R, where f_b(S) is shorthand for { f_b(s) | s ∈ S }.

Commutative Encryption:
  f_a(f_b(s)) = f_b(f_a(s)),   e.g.  f(s, b, p) = s^b mod p

Intersection Protocol

[Diagram] R applies its own key to the received values, computing f_a(f_b(S)), which by the commutative property equals f_b(f_a(S)).

Intersection Protocol

[Diagram] R sends f_a(R) to S. S returns the pairs { < f_a(r), f_b(f_a(r)) > }. Since R knows < r, f_a(r) >, R can match f_b(f_a(r)) against f_a(f_b(S)) and learn which of its values lie in the intersection.

Intersection Size

[Diagram] A variant of the same exchange (R sends f_a(R), S returns doubly encrypted values) in which R learns only the size of the intersection, |R ∩ S|, rather than the intersecting values themselves.
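A toy end-to-end sketch of the intersection protocol using the commutative power-mod encryption f(s, b, p) = s^b mod p from the slides; the modulus, hashing step, and key sizes are illustrative only and nowhere near a secure parameter choice.

```python
import hashlib
import random

P = 2**127 - 1          # toy prime modulus (far too small for real use)

def to_group(item):
    """Hash an item into the multiplicative group mod P."""
    digest = hashlib.sha256(item.encode()).hexdigest()
    return int(digest, 16) % P

def f(s, key):
    """Commutative encryption from the slide: f(s, k, p) = s^k mod p."""
    return pow(s, key, P)

# Each party picks a secret exponent.
a = random.randrange(2, P - 1)      # R's key
b = random.randrange(2, P - 1)      # S's key

R = {"alice", "bob", "chris"}       # R's private set
S = {"bob", "chris", "dave"}        # S's private set

# Step 1: each party encrypts its own items with its own key.
fa_R = {f(to_group(r), a): r for r in R}     # R keeps the mapping f_a(r) -> r
fb_S = [f(to_group(s), b) for s in S]        # S sends f_b(S) to R

# Step 2: S double-encrypts what R sent and returns pairs <f_a(r), f_b(f_a(r))>.
pairs = {fa_r: f(fa_r, b) for fa_r in fa_R}

# Step 3: R double-encrypts f_b(S); by commutativity f_a(f_b(s)) == f_b(f_a(s)),
# so matches reveal the intersection to R (and only to R).
fa_fb_S = {f(v, a) for v in fb_S}
intersection = {fa_R[fa_r] for fa_r, both in pairs.items() if both in fa_fb_S}
print(intersection)                 # {'bob', 'chris'}
```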

32

Related Protocols


[Naor & Pinkas 99]: Two protocols for list intersection problem


Oblivious evaluation of n polynomials of degree n each.


Oblivious evaluation of n^2 linear polynomials.


[Huberman et al 99]: find people with common preferences,
without revealing the preferences.


Intersection protocols are similar


[Clifton et al, 03]: Secure set union and set intersection


Similar protocols

33

Performance


Airline application: 150,000 (daily) passengers and 1 million people in
the watch list:


120 minutes with one accelerator card


12 minutes with ten accelerator cards



Epidemiological research: 1 million patient records in the hospital and
10 million records in the Genebank:


37 hours with one accelerator card


3.7 hours with ten accelerator cards


AEP SSL CARD Runner 2000 ≈ $2K

20K encryptions per minute

10x improvement over software implementation

34

Other Ideas


Cryptographic approaches to building data mining models


ID3 Classifier [Lindell & Pinkas 2000]


Purdue Toolkit [Clifton et al. 2003]


Global approaches to data perturbation (e.g. swapping) from
Statistical Disclosure Control Community


Model combination and Voting


Potential for leakage from individual models

35

Private Distributed ID3

[Lindell & Pinkas, Crypto 2000]



How to build a decision-tree classifier on the union of two horizontally partitioned private databases


Basic Idea:


Find attribute with highest information gain privately


Independently split on this attribute and recurse


Selecting the Split Attribute


Given v1 known to DB1 and v2 known to DB2, compute (v1 + v2) log (v1
+ v2) and output random shares of the answer


Given random shares, use Yao's protocol
[FOCS 84]

to compute
information gain.


Trade-off (compared to the randomization approach)

+ Accuracy

– Performance & scaling
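A non-private reference sketch of what the random shares of (v1 + v2) log(v1 + v2) look like and how they recombine; in the actual Lindell & Pinkas protocol the shares are produced obliviously, so neither party ever sees the other's input or the combined value.

```python
import math
import random

def share_x_log_x(v1, v2, rng=random):
    """Return additive random shares s1, s2 with s1 + s2 = (v1 + v2) * log(v1 + v2).

    Reference illustration only: here the value is computed in the clear and then
    split; the private protocol produces the two shares without either party
    learning the other's input or the combined value.
    """
    value = (v1 + v2) * math.log(v1 + v2)
    s1 = rng.uniform(-1e6, 1e6)        # random share held by DB1
    s2 = value - s1                    # share held by DB2
    return s1, s2

s1, s2 = share_x_log_x(40, 25)
assert abs((s1 + s2) - 65 * math.log(65)) < 1e-9
```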

36

Outline


Preliminary Feasibility Results:


Client-Server Setting


Distributed Setting


Outlook

37

Two Contradictory Perceptions

of Data Mining


Not very valuable in practice


Too Powerful. Ban it.

38

What Gives?


Huge gap between the way we (the data mining community) think of data mining and the way the world perceives it.

39

Our Focus


Formalisms and algorithms for building data mining models

40

The Way the World Thinks of Data Mining


Collect and organize data


Province of Data Management


Extract value from data


Province of data mining


Includes:


Model building and application


Complex querying


Search



“Data mining” is defined to mean: searches of one or more electronic
databases of information concerning (U.S.) persons…
Technology &
Privacy Advisory Committee, March 2004.






41

What should we do?


Accept the mandate


Broaden our view of what we are about


Take responsibility for the technology we create and design-in safeguards against misuse

42

Some Key Problems


Understand the interactions between


Generality


Accuracy


Performance


Disclosure


Address head-on the issues of


Background information


Multiple operations


Maliciousness


Game-theoretic incentive compatibility coupled with auditing?


System design principles (privacy can’t be an afterthought)


43

Closing Thoughts


Solutions to complex problems such as privacy require a mix of legislation, social customs, market forces, and technology [Lessig]


By advancing technology, we can change the mix and
improve the overall quality of the solution


Gold mine of challenging and exciting research problems
(besides being useful)!

44

Thank you!

Questions?