Data Access Models


Material mostly taken from P. Donmez and J. G. Carbonell (2008). Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM ’08), pp. 619-628.

Copyright © 2013 A.W. Naik. These lecture notes are provided under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license. See http://creativecommons.org/licenses/by-nc-sa/3.0/ for complete terms of the license.

Recap

- Already discussed: ability to request data
  - Streaming vs. membership query models
- What other assumptions were we making?
  - Labels all cost the same (just minimize the number of labels)
  - We can always obtain a label
  - We’ll only get one label (only one oracle)
- Proactive Learning generalizes Active Learning

Case: Reluctant vs. Reliable Oracles

Imagine that we can request labels from two kinds of oracles:

- A reluctant oracle, which does not always return a label
- A reliable oracle, which always returns a label

Both oracles return the correct label.

The reliable oracle costs more to query than the reluctant oracle.
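The two oracle types can be simulated in a few lines. The sketch below is only an illustration under stated assumptions: the class `Oracle`, the helpers `expected_net_value` and `pick_oracle`, the costs, and the answer probability 0.4 are all hypothetical, and the cost is assumed to be paid on every query whether or not an answer comes back. The score "probability of an answer times the value of the label, minus the cost" is one simple way to compare the oracles, not necessarily the exact criterion used in the source paper.

```python
import random


class Oracle:
    """A labeling source with a per-query cost and a probability of answering."""

    def __init__(self, name, cost, p_answer, seed=0):
        self.name = name
        self.cost = cost
        self.p_answer = p_answer
        self._rng = random.Random(seed)

    def query(self, true_label):
        # Both oracles are correct whenever they answer; None means "no answer".
        if self._rng.random() < self.p_answer:
            return true_label
        return None


def expected_net_value(oracle, label_value):
    # Assumption: the cost is paid on every query, answered or not.
    return oracle.p_answer * label_value - oracle.cost


def pick_oracle(oracles, label_value):
    # Choose the oracle with the highest expected value net of cost.
    return max(oracles, key=lambda o: expected_net_value(o, label_value))


# Hypothetical numbers: the reluctant oracle is cheap but answers 40% of the time.
reluctant = Oracle("reluctant", cost=1.0, p_answer=0.4)
reliable = Oracle("reliable", cost=3.0, p_answer=1.0)
```

With these made-up numbers, a label worth 10 units favors the reliable oracle, while a label worth only 3 units favors the reluctant one, which is the trade-off the slide describes.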

Case: Reluctant vs. Reliable Oracles

- We have a fixed budget B
- Oracle costs: reluctant (C_0), reliable (C_1)
- Measure solutions by their “utility”
  - same units as cost

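In principle we want to minimize the cost and maximize the utility of the set σ of queried examples, so the cost enters the objective with a negative sign:

\[
\max_{\sigma} \; \mathbb{E}\big[\mathrm{Utility}[\sigma] \mid S\big] - \sum_{i \in \sigma} C(x_i)
\quad \text{such that} \quad \sum_{i \in \sigma} C(x_i) < B
\]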


Can’t directly maximize! Open problem: even if utility is submodular, does the greedy algorithm fare well here? (Constrained adaptive submodularity)
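For concreteness, here is a minimal sketch of what a greedy algorithm for a budgeted objective could look like. It is a generic marginal-gain-per-cost heuristic, not the method from the source paper, and whether such a heuristic fares well is exactly the open question raised above. The function `greedy_select`, the toy coverage utility, the costs, and the budget are all hypothetical.

```python
def greedy_select(candidates, cost, utility, budget):
    """Repeatedly add the affordable candidate with the best marginal gain per unit cost."""
    chosen, spent = [], 0.0
    remaining = set(candidates)
    while True:
        base = utility(chosen)
        best, best_ratio = None, 0.0
        for c in sorted(remaining):  # sorted for a deterministic tie-break
            if spent + cost[c] >= budget:  # keep total cost strictly below the budget
                continue
            ratio = (utility(chosen + [c]) - base) / cost[c]
            if ratio > best_ratio:
                best, best_ratio = c, ratio
        if best is None:  # nothing affordable improves the utility
            return chosen, spent
        chosen.append(best)
        spent += cost[best]
        remaining.remove(best)


# Toy submodular utility: number of distinct items covered by the chosen sets.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2, 3, 4}}
cost = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 4.0}


def coverage(chosen):
    return len(set().union(*(covers[c] for c in chosen)))


chosen, spent = greedy_select(covers, cost, coverage, budget=4.5)
```

On this toy instance the heuristic picks "b", then "a", then "c", spending 4.0 of the 4.5 budget.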

Notice their method is essentially a label propagation method!

“One” hypothesis is acting as a proxy for a set of decision boundaries.

Case: Accurate vs. Inaccurate Oracles

- The inaccurate oracle costs less than the accurate oracle
- Basic intuition:
  - if inaccuracy ∝ nearness to the 𝔼[Y|X] boundary
  - then only pay for accuracy near the 𝔼[Y|X] boundary
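The intuition can be sketched as a routing rule: send an example to the expensive accurate oracle only when the current estimate of P(Y=1|x) is close to the decision boundary, and to the cheap oracle otherwise. This is an illustration only; `route_query`, the boundary at 0.5, the margin of 0.15, and the prices in `ORACLE_COST` are hypothetical choices, not values from the source.

```python
ORACLE_COST = {"accurate": 5.0, "inaccurate": 1.0}  # hypothetical prices


def route_query(p_positive, margin=0.15):
    """Return which oracle to ask, given the current estimate of P(Y=1 | x)."""
    # Distance of the estimate from the assumed 0.5 decision boundary.
    distance = abs(p_positive - 0.5)
    return "accurate" if distance < margin else "inaccurate"


estimates = [0.05, 0.48, 0.55, 0.90]
routes = [route_query(p) for p in estimates]
total_cost = sum(ORACLE_COST[r] for r in routes)
```

Here only the two estimates near 0.5 are routed to the accurate oracle, so the total cost is 12.0 rather than the 20.0 it would cost to ask the accurate oracle about everything.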

We relate this method to Dasgupta and Hsu’s method in “Hierarchical sampling for active learning.”

Larger clusters in Dasgupta and Hsu’s method ≈ less reliable oracles

Case: Cost Difference Oracles



- Side knowledge: one oracle costs more for “hard” cases (“nearness to 𝔼[Y|X]”)
- Basic intuition:
  - if cost ∝ nearness to the 𝔼[Y|X] boundary
  - then only pay for accuracy near the 𝔼[Y|X] boundary
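A cost that grows near the boundary can be written down directly. The sketch below assumes a hypothetical linear cost model for the variable-cost oracle (`variable_cost`, with made-up `floor` and `ceiling` prices) and a hypothetical flat-cost alternative; it only illustrates how the cheaper oracle changes with the difficulty of the example.

```python
def variable_cost(p_positive, floor=0.1, ceiling=1.0):
    """Hypothetical price of a query that rises linearly as the estimate nears 0.5."""
    # closeness is 1.0 at the boundary (p = 0.5) and 0.0 at p = 0 or p = 1.
    closeness = 1.0 - 2.0 * abs(p_positive - 0.5)
    return floor + (ceiling - floor) * closeness


def cheaper_oracle(p_positive, flat_cost=0.5):
    """Pick the variable-cost oracle only when it undercuts the flat-cost one."""
    return "variable" if variable_cost(p_positive) < flat_cost else "flat"
```

Under these assumed prices, an easy example such as `cheaper_oracle(0.95)` goes to the variable-cost oracle, while a hard one such as `cheaper_oracle(0.5)` goes to the flat-cost oracle.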

Decision-centric Views

See: M. Saar-Tsechansky and F. Provost (2007). Decision-centric active learning of binary-outcome models. Information Systems Research, 18(1):1-19.



- Decision maker wants to estimate the expected utility from performing an action
  - x_i is an example
    - “the description of a consumer”
  - f_i is the (unknown) probability that the action with respect to x_i will be successful
    - “consumer x_i will respond to the marketing campaign, or will renew her contract”
  - f_i^th is a probability threshold for performing the action
- If f_i exceeds f_i^th by “enough” then utility will be locally maximized


Solution concept:

- Manage 𝔼[Y|X] accuracy only where we can reliably estimate actions having high utility
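One way to read this solution concept is that a label is worth buying only where it could change the decision. The sketch below is a simplified illustration, not the procedure from Saar-Tsechansky and Provost: the helper `decision_is_uncertain`, the standard errors, the multiplier `z`, the consumer data, and the threshold 0.30 are all hypothetical.

```python
def act(f_hat, threshold):
    """Perform the action when the estimated success probability exceeds the threshold."""
    return f_hat > threshold


def decision_is_uncertain(f_hat, std_err, threshold, z=2.0):
    """Flag examples whose estimate lies within z standard errors of the threshold."""
    return abs(f_hat - threshold) < z * std_err


# (name, estimated f_i, standard error of the estimate); all values are made up.
consumers = [("c1", 0.90, 0.05), ("c2", 0.32, 0.04), ("c3", 0.10, 0.20), ("c4", 0.25, 0.01)]
threshold = 0.30

# Spend labeling effort only where the act / do-not-act decision could still flip.
to_label = [name for name, f, se in consumers if decision_is_uncertain(f, se, threshold)]
```

In this toy data, c1 is clearly above the threshold and c4 is clearly below it, so neither needs more labels; c2 and c3 are selected because their estimates are not precise enough to settle the decision.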