Get Another Label?
Improving Data Quality and Data Mining Using Multiple, Noisy Labelers

Panos Ipeirotis
Stern School of Business, New York University

Joint work with Victor Sheng, Foster Provost, and Jing Wang

Motivation


Many tasks rely on high-quality labels for objects:

- relevance judgments for search engine results
- identification of duplicate database records
- image recognition
- song categorization
- videos

Labeling can be relatively inexpensive, using Mechanical Turk, the ESP game, …

Micro-Outsourcing: Mechanical Turk

Requesters post micro-tasks, a few cents each.

Motivation


Labels can be used in training predictive models.

But: labels obtained through such sources are noisy, and this directly affects the quality of the learned models.

Quality and Classification Performance

Labeling quality increases → classification quality increases.

[Figure: accuracy vs. training set size (Mushroom dataset), one learning curve per labeling quality Q = 0.5, 0.6, 0.8, 1.0]

How to Improve Labeling Quality


- Find better labelers: often expensive, or beyond our control
- Use multiple noisy labelers: repeated-labeling (our focus)

Majority Voting and Label Quality

Ask multiple labelers, keep the majority label as the "true" label.

Quality is the probability of the majority label being correct; P is the probability of an individual labeler being correct.

[Figure: integrated quality vs. number of labelers (1-13), one curve per individual labeler quality P = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
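To make these curves concrete, here is a minimal sketch (my own illustration, assuming independent labelers of identical quality p) of the integrated-quality computation:

    from math import comb

    def majority_quality(p: float, k: int) -> float:
        """Probability that the majority vote of k independent labelers,
        each correct with probability p, yields the correct label.
        Ties (possible for even k) are broken by a fair coin flip."""
        q = 0.0
        for i in range(k + 1):
            prob = comb(k, i) * p**i * (1 - p) ** (k - i)
            if 2 * i > k:       # clear majority is correct
                q += prob
            elif 2 * i == k:    # tie
                q += 0.5 * prob
        return q

    # majority_quality(0.7, 11) -> ~0.92; for p < 0.5, quality *drops* as k
    # grows, matching the P=0.4 curve in the figure.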


Tradeoffs for Modeling


- Get more examples → improve classification
- Get more labels per example → improve quality → improve classification

[Figure: accuracy vs. number of examples (Mushroom dataset), learning curves for labeling qualities Q = 0.5, 0.6, 0.8, 1.0]

Basic Labeling Strategies


- Single Labeling: get as many data points as possible, one label each
- Round-robin Repeated Labeling: repeatedly label data points, giving the next label to the one with the fewest labels so far (a minimal sketch follows)
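The round-robin rule is easy to state in code. A minimal sketch, where the get_label callback is a hypothetical stand-in for querying one more labeler:

    import heapq

    def round_robin_labels(n_examples: int, budget: int, get_label):
        """Spend `budget` labels, always labeling the example
        that has the fewest labels so far."""
        heap = [(0, i) for i in range(n_examples)]  # (label count, example id)
        labels = [[] for _ in range(n_examples)]
        for _ in range(budget):
            count, i = heapq.heappop(heap)
            labels[i].append(get_label(i))
            heapq.heappush(heap, (count + 1, i))
        return labels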

Repeated-Labeling vs. Single Labeling

P = 0.8 (labeling quality), K = 5 labels per example.

[Figure: learning curves for repeated labeling vs. single labeling]

With low noise, more (single-labeled) examples is better.

Repeated-Labeling vs. Single Labeling

P = 0.6 (labeling quality), K = 5 labels per example.

[Figure: learning curves for repeated labeling vs. single labeling]

With high noise, repeated labeling is better.

Selective Repeated-Labeling

We have seen: with enough examples and noisy labels, getting multiple labels is better than single-labeling.

Can we do better than the basic strategies?

Key observation: we have additional information to guide the selection of data for repeated labeling, namely the current multiset of labels.

Example: {+, -, +, +, -, +} vs. {+, +, +, +}

Natural Candidate: Entropy


Entropy is a natural measure of label uncertainty:

E(S) = -(|S+| / |S|) · log2(|S+| / |S|) - (|S-| / |S|) · log2(|S-| / |S|)

where |S+| and |S-| are the numbers of positive and negative labels in the multiset S, and |S| = |S+| + |S-|.

E({+,+,+,+,+,+}) = 0
E({+,-,+,-,+,-}) = 1

Strategy: get more labels for high-entropy label multisets.

What Not to Do: Use Entropy

[Figure: entropy-based selection improves quality at first, but hurts in the long run]

Why not entropy?

- In the presence of noise, entropy will be high even with many labels
- Entropy is scale invariant: (3+, 2-) has the same entropy as (600+, 400-) (see the sketch below)
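A minimal sketch of the entropy computation (the function name is mine), showing the scale invariance that makes raw entropy a poor selection criterion:

    from math import log2

    def label_entropy(pos: int, neg: int) -> float:
        """Entropy of a label multiset with pos positive and neg negative labels."""
        n = pos + neg
        return -sum((c / n) * log2(c / n) for c in (pos, neg) if c > 0)

    print(label_entropy(3, 2))      # ~0.971 bits
    print(label_entropy(600, 400))  # ~0.971 bits: same entropy, far more evidence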

Estimating Label Uncertainty (LU)


- Observe the +'s and -'s and compute Pr{+|obs} and Pr{-|obs}
- Label uncertainty S_LU = tail of the beta distribution on the far side of the 0.5 decision threshold

[Figure: beta probability density function over [0.0, 1.0], with the tail beyond 0.5 shaded as S_LU]

Label Uncertainty

p = 0.7, 5 labels (3+, 2-): entropy ≈ 0.97, CDF_b = 0.34

Label Uncertainty

p = 0.7, 10 labels (7+, 3-): entropy ≈ 0.88, CDF_b = 0.11

Label Uncertainty

p = 0.7, 20 labels (14+, 6-): entropy ≈ 0.88, CDF_b = 0.04

Note how entropy barely moves while the beta tail (CDF_b) keeps shrinking as labels accumulate at the same ~70/30 ratio.
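The CDF_b values above can be reproduced with a short sketch (assuming a uniform Beta(1,1) prior; scipy provides the beta CDF):

    from scipy.stats import beta

    def label_uncertainty(pos: int, neg: int) -> float:
        """S_LU: mass of the Beta(pos+1, neg+1) posterior on the far side
        of the 0.5 decision threshold from the majority label."""
        tail = beta(pos + 1, neg + 1).cdf(0.5)
        return min(tail, 1.0 - tail)

    for pos, neg in [(3, 2), (7, 3), (14, 6)]:
        print(round(label_uncertainty(pos, neg), 2))  # 0.34, 0.11, 0.04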

Quality Comparison

[Figure: labeling quality vs. number of labels (waveform dataset, p = 0.6) for UNF, MU, LU, LMU]

Label Uncertainty (LU) beats round robin, which is itself already better than single labeling.

Model Uncertainty (MU)


- Learning a model of the data provides an alternative source of information about label certainty
- Model uncertainty: get more labels for instances that cause model uncertainty

Intuition:
- For data quality: low-certainty "regions" may be due to incorrect labeling of the corresponding instances
- For modeling: why improve training-data quality where the model is already certain?

[Figure: the "self-healing" process — the model flags low-certainty examples ("?"), which are then selected for additional labels]

Label + Model Uncertainty


Label and model uncertainty (LMU): avoid examples where either strategy is certain. The combined score is the geometric mean of the label-uncertainty and model-uncertainty scores:

S_LMU = sqrt(S_LU · S_MU)
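As a one-line sketch (s_lu and s_mu stand for whatever scores the two strategies produce):

    from math import sqrt

    def lmu_score(s_lu: float, s_mu: float) -> float:
        """Combined LMU score: geometric mean of the LU and MU scores."""
        return sqrt(s_lu * s_mu)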

[Figure: labeling quality vs. number of labels (waveform dataset, p = 0.6) for UNF (uniform round robin), MU, LU, LMU]

Model Uncertainty alone also improves quality over uniform round robin; Label + Model Uncertainty does best.

Comparison: Model Quality (I)

Label & Model
Uncertainty

Across 12 domains, LMU is always better

than GRR. LMU is statistically significantly

better than LU and MU.

70
75
80
85
90
95
100
0
1000
2000
3000
4000
Number of labels (sick, p=0.6)
Accuracy
GRR
MU
LU
LMU
25

Comparison: Model Quality (II)

[Figure: accuracy vs. number of labels (mushroom dataset, p = 0.6) for GRR, MU, LU, LMU, and single labeling (SL); same pattern as above]

Summary of results


- Micro-outsourcing (e.g., MTurk, RentaCoder, the ESP game) changes the landscape for data acquisition
- Repeated labeling improves data quality and model quality
- With noisy labels, repeated labeling can be preferable to single labeling
- When labels are relatively cheap, repeated labeling can do much better than single labeling
- Round-robin repeated labeling works well
- Selective repeated labeling improves substantially




Example: Build an Adult Web Site Classifier

- Need a large number of hand-labeled sites
- Get people to look at sites and classify them as: G (general), PG (parental guidance), R (restricted), X (porn)

Cost/Speed Statistics

- Undergrad intern: 200 websites/hr, cost: $15/hr
- MTurk: 2500 websites/hr, cost: $12/hr

Bad news: Spammers!

Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).

Solution: Repeated Labeling

- Probability of correctness increases with the number of workers
- Probability of correctness increases with the quality of workers

1 worker: 70% correct
11 workers: 93% correct

11-Vote Statistics
- MTurk: 227 websites/hr, cost: $12/hr
- Undergrad: 200 websites/hr, cost: $15/hr

Single-Vote Statistics
- MTurk: 2500 websites/hr, cost: $12/hr
- Undergrad: 200 websites/hr, cost: $15/hr

But Majority Voting can be Expensive

Spammer among 9 workers: our "friend" ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…

We can compute error rates for each worker. Error rates for ATAMRO447HWJQ:

P[X → X] = 9.847%     P[X → G] = 90.153%
P[G → X] = 0.053%     P[G → G] = 99.947%
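A sketch of how such error rates fall out of a worker's confusion counts (the counts below are invented for illustration; numpy is assumed):

    import numpy as np

    # conf[i][j] = times the correct label was classes[i] and the worker answered classes[j]
    classes = ["G", "X"]
    conf = np.array([
        [1899,   1],   # true G: almost always answered G
        [ 920, 100],   # true X: mostly answered G -- the spammer signature
    ])

    rates = conf / conf.sum(axis=1, keepdims=True)    # P[true -> answered]
    avg_error = float(np.mean(1.0 - np.diag(rates)))  # average error rate, ~0.45 here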


Rejecting spammers and Benefits

Random answers would give an error rate of 50%. The average error rate for ATAMRO447HWJQ is 45.2%, barely better than random:

P[X → X] = 9.847%     P[X → G] = 90.153%
P[G → X] = 0.053%     P[G → G] = 99.947%

Action: REJECT and BLOCK.

Results:
- Over time you block all spammers
- Spammers learn to avoid your HITs
- You can decrease redundancy, as the quality of workers is higher

After rejecting spammers, quality goes up:

With spam:     1 worker → 70% correct;  11 workers → 93% correct
Without spam:  1 worker → 80% correct;   5 workers → 94% correct
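These before/after numbers are consistent with the majority-vote sketch from earlier, assuming independent workers:

    # Reusing majority_quality() from the sketch above:
    print(majority_quality(0.7, 11))  # ~0.92  (with spam)
    print(majority_quality(0.8, 5))   # ~0.94  (without spam)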

Correcting biases

Sometimes workers are careful but biased. Worker ATLJIK76YH1TF classifies G → P and P → R.

Average error rate for ATLJIK76YH1TF: 45.0%. Is ATLJIK76YH1TF a spammer?

Error Rates for Worker: ATLJIK76YH1TF

P[G → G] = 20.0%     P[G → P] = 80.0%     P[G → R] = 0.0%       P[G → X] = 0.0%
P[P → G] = 0.0%      P[P → P] = 0.0%      P[P → R] = 100.0%     P[P → X] = 0.0%
P[R → G] = 0.0%      P[R → P] = 0.0%      P[R → R] = 100.0%     P[R → X] = 0.0%
P[X → G] = 0.0%      P[X → P] = 0.0%      P[X → R] = 0.0%       P[X → X] = 100.0%

Correcting biases

For ATLJIK76YH1TF, we simply need to compute the "non-recoverable" error rate (technical details omitted).

Non-recoverable error rate for ATLJIK76YH1TF: 9%

The "condition number" of the matrix [how easy it is to invert the matrix] is a good indicator of spamminess.







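A sketch of that diagnostic; the matrices are illustrative, not the worker's actual numbers. A careful-but-biased worker's confusion matrix is easy to invert, while a spammer's is (nearly) singular:

    import numpy as np

    # Row-stochastic confusion matrices P[true -> answered] over (G, P, R, X)
    biased = np.array([        # careful, but shifts every class one step up
        [0.1, 0.9, 0.0, 0.0],
        [0.0, 0.1, 0.9, 0.0],
        [0.0, 0.0, 0.1, 0.9],
        [0.0, 0.0, 0.0, 1.0],
    ])
    spammer = np.full((4, 4), 0.25)   # answers carry no information

    print(np.linalg.cond(biased))     # moderate: true labels are recoverable
    print(np.linalg.cond(spammer))    # astronomically large: nothing to recover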

Too much theory?

An open-source implementation is available at:
http://code.google.com/p/get-another-label/

Input:
- Labels from Mechanical Turk
- Cost of incorrect labelings (e.g., X → G costlier than G → X)

Output:
- Corrected labels
- Worker error rates
- Ranking of workers according to their quality

Alpha version, more improvements to come! Suggestions and collaborations welcomed!

Many new directions…


- Strategies using the "learning-curve gradient"
- Increased compensation vs. labeler quality
- Multiple "real" labels
- Truly "soft" labels
- Selective repeated tagging

Other Projects


- SQoUT project: Structured Querying over Unstructured Text
  http://sqout.stern.nyu.edu
- Faceted Interfaces
- EconoMining project: The Economic Value of User Generated Content
  http://economining.stern.nyu.edu

SQoUT: Structured Querying over Unstructured Text

Information extraction applications extract structured relations from unstructured text. For example, an information extraction system (e.g., OpenCalais) can turn a news sentence into a tuple:

"July 8, 2008: Intel Corporation and DreamWorks Animation today announced they have formed a strategic alliance aimed at revolutionizing 3-D filmmaking technology, …"

Date       Company1      Company2
08/06/08   BP            Veneriu
04/30/07   Omniture      Vignette
06/18/06   Microsoft     Nortel
07/08/08   Intel Corp.   DreamWorks

Example application: alliances covered in The New York Times. Alliances and strategic partnerships before 1990 are sparsely covered in databases such as SDC Platinum.

In an ideal world…

1. Retrieve documents from database/web/archive
2. Process documents with the extraction system(s)
3. Extract output tuples

SELECT Date, Company1, Company2
FROM Alliances
USING OpenCalais
OVER NYT_archive
[WITH recall > 0.2 AND precision > 0.9]

(SIGMOD'06, TODS'07, ICDE'09, TODS'09)

SQoUT: The Questions

The same pipeline (retrieve documents, process them, extract output tuples) raises four questions:

1. How do we retrieve the documents? (Scan all? Specific websites? Query Google?)
2. How do we configure the extraction systems?
3. What is the execution time?
4. What is the output quality?

(SIGMOD'06 best paper, TODS'07, ICDE'09, TODS'09)

EconoMining Project: Show me the Money!

Applications (in increasing order of difficulty):
- Buyer feedback and seller pricing power in online marketplaces (ACL 2007)
- Product reviews and product sales (KDD 2007)
- Importance of reviewers based on economic impact (ICEC 2007)
- Hotel ranking based on "bang for the buck" (WebDB 2008)
- Political news (MSM, blogs), prediction markets, and news importance

Basic idea:
- Opinion mining is an important application of information extraction
- Opinions of users are reflected in some economic variable (price, sales)

Some Indicative Dollar Values

- A natural method for extracting sentiment strength and polarity
- Naturally captures the pragmatic meaning within the given context
- Captures misspellings as well

Example: "good packaging" sounds positive, yet its associated dollar value is -$0.56. Positive? Negative?

Thanks!

Q & A?

So…

- (Sometimes) the quality of multiple noisy labelers is better than the quality of the best labeler in the set
- Multiple noisy labelers improve quality; so, should we always get multiple labels?

Optimal Label Allocation