Mining with Noise Knowledge: Error Awareness Data Mining


1

Mining with Noise Knowledge:
Error Awareness Data Mining

Xindong Wu

Department of Computer Science

University of Vermont, USA;

Hong Kong Polytechnic University;

School of Computer and Information, Hefei University of Technology, China

Tsinghua University, Beijing, January 15, 2008

2

Outline

1. Introduction
   - Noise
   - Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

3

Noise Is Everywhere


- Random noise
  - "a random error or variance in a measured variable" (Han & Kamber 2001)
  - "any property of the sensed pattern which is not due to the true underlying model but instead to randomness in the world or the sensors" (Duda et al. 2000)
- Structured noise
  - Caused by systematic mechanisms
    - Equipment failure
    - Deceptive information

4

Noise Categories and Locations


- Categorized by type
  - Erroneous values
  - Missing values
- Categorized by variable type (Zhu & Wu 2004)
  - Independent variables: attribute noise
  - Dependent variable: class noise

5

Existing Efforts (1): Learning with Random Noise

Flow: Dataset D → Data preprocessing → Dataset D' → Learning → A single learner

Data preprocessing techniques:

- Identifying mislabeled examples (Brodley & Friedl 1999)
  - Noise filtering
- Erroneous attribute value detection (Teng 1999)
  - Attribute value prediction
- Missing attribute value acquisition (Zhu & Wu 2004, Zhu & Wu 2005)
  - Acquiring the most informative missing values
- Data imputation (Fellegi & Holt 1976)
  - Filling missing values

6

Existing Efforts (2): Classifier Ensembling with Random Noise

Flow: Dataset D → Ensemble Learning → Learners L1, L2, …, Ln → Combination → A single learner

- Bagging (Breiman 1996)
- Boosting (Freund & Schapire 1996)
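To make the ensembling flow above concrete, here is a minimal bagging sketch (my illustration of Breiman's idea, not code from the talk), assuming NumPy arrays with integer-coded class labels and scikit-learn trees as base learners:

```python
# Minimal bagging sketch: train base learners on bootstrap resamples of
# dataset D, then combine them by majority vote into "a single learner".
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_learners=10, seed=0):
    rng = np.random.default_rng(seed)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample of D
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    votes = np.stack([l.predict(X) for l in learners])  # (n_learners, n_instances)
    # Majority vote per instance; assumes integer-coded class labels.
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```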

7

Limitations


- Current ensembling methods are designed only to make the base learners diverse
- How can we learn from past noise-handling efforts to avoid future noise?

8

Outline

1. Introduction
   - Noise
   - Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

9

A System Framework for Noise-Tolerant Data Mining

- Error Identification and Instance Ranking
- Error Profiling and Reasoning
- Error-Tolerant Mining

11

Outline

1. Introduction
   - Noise
   - Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

12

Error Detection and Instance Ranking (AAAI-04)


- Error Detection
  - Construct a suspicious instance subset
  - Locate erroneous attribute values
- Impact-Sensitive Ranking
  - Rank suspicious instances based on the located erroneous attribute values and their impacts




Flowchart: Noisy Dataset D → Suspicious Instance Subset S → Erroneous Attribute Detection → Calculate Information-gain Ratios → Impact-sensitive Weight for Each Attribute → Overall Impact Value for Each Suspicious Instance → Impact-sensitive Ranking and Recommendation. (The first two stages perform error detection; the remaining stages perform impact-sensitive ranking.)
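The flow above can be read as a small procedure. Below is a minimal, hedged Python sketch of impact-sensitive ranking (a reconstruction from the slide, not the AAAI-04 implementation): it assumes an upstream error-detection step has already produced the suspicious subset and located the erroneous attribute values, so `suspicious` and `flagged` are hypothetical inputs.

```python
# Hedged sketch of impact-sensitive ranking: attributes with higher
# information-gain ratio get higher impact weights, and each suspicious
# instance is ranked by the total weight of its flagged erroneous values.
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information-gain ratio of one categorical attribute w.r.t. the class."""
    n = len(labels)
    cond = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    gain = entropy(labels) - cond
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

def rank_suspicious(data, labels, suspicious, flagged):
    """data: list of attribute-value lists; suspicious: instance indices;
    flagged: {instance index: attribute indices located as erroneous}."""
    n_attrs = len(data[0])
    weights = [gain_ratio([row[i] for row in data], labels) for i in range(n_attrs)]
    impact = {k: sum(weights[i] for i in flagged.get(k, ())) for k in suspicious}
    return sorted(suspicious, key=lambda k: impact[k], reverse=True)
```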

13

Outline

1. Introduction
   - Noise
   - Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

14

Error Profiling with Structured Noise


- There are unlimited types of structured noise; it occurs in many studies
- Objective
  - Construct a systematic approach
  - Study specific types of structured noise

15

Approach

Flowchart:
- Rule learning: input = noisy data D and purged data set D'; output = rules that describe the modification pattern
- Rule evaluation: apply the rules to noisy data set D2 to obtain the modified data set, then evaluate the rules

16

Associative Noise (ICDM '07)

- Associative noise
  - The error in one attribute is associated with other attribute values
  - The stability of certain measures is conditioned on other attributes
  - Intentionally planted false information
- Model assumptions
  - A noisy data set D and a purged data set D'
  - Associative corruption rules
  - Errors are only in feature attributes

17

Associative Profiling

- Take the purged data set D' as the base data set
- For each corrupted attribute Ai in D, add Ai into D' and label Ai as the class attribute
- Learn a classification tree from D'
- Obtain modification rules, for example:
  - If A1 = a11, A2 = a21, C = c1, then A5 = a51 => A5 = a52
  - If A2 = a21, A3 = a31, then A5 = a52 => A5 = a52
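A minimal sketch of this profiling step, assuming categorical data in pandas DataFrames: `clean` stands for D' plus the class column C, and `corrupted_a5` for the corrupted attribute A5 taken from D (both names are illustrative, not from the paper). The decision paths printed by `export_text` play the role of the modification rules.

```python
# Hedged sketch of associative profiling: learn a classification tree that
# predicts the corrupted attribute A5 from the clean attributes and the
# class, then read its root-to-leaf paths off as modification rules.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

def profile_attribute(clean: pd.DataFrame, corrupted_a5: pd.Series):
    # One-hot encode the clean attributes (plus class C) so the tree can split on them.
    X = pd.get_dummies(clean)
    tree = DecisionTreeClassifier(max_depth=4).fit(X, corrupted_a5)
    # Each path is a candidate rule:
    # "if <conditions>, then A5 = <clean value> => A5 = <predicted corrupted value>".
    print(export_text(tree, feature_names=list(X.columns)))
    return tree
```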

18

Associative Profiling Rules

- Invert the obtained rules:
  - If A1 = a11, A2 = a21, C = c1, then A5 = a51 => A5 = a52
  - becomes: If A1 = a11, A2 = a21, C = c1, then A5 = a52 => A5 = a51
- In D', learn a Bayes learner L for attribute A'i
- Evaluation
  - Correct the noisy data set D2 with the help of L, yielding the corrected data set D2'
  - Does D2' have higher quality than data set D2 in terms of supervised learning?
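Rule inversion and correction can be sketched as follows (a hedged illustration, not the paper's code). A rule is stored here as a (conditions, attribute, source value, target value) tuple; inversion swaps the last two fields, and the inverted rules are applied row-wise to D2 with pandas. The Bayes learner L would arbitrate when several inverted rules match; this sketch omits that step.

```python
# Hedged sketch: invert associative-corruption rules and apply them to D2.
# A rule is (conds, attr, src, dst): if all conditions hold and
# row[attr] == src, rewrite it to dst. Names are illustrative.
import pandas as pd

def invert(rule):
    conds, attr, src, dst = rule
    return conds, attr, dst, src          # a51 => a52 becomes a52 => a51

def correct(d2: pd.DataFrame, rules) -> pd.DataFrame:
    fixed = d2.copy()
    for conds, attr, src, dst in map(invert, rules):
        match = (fixed[attr] == src)
        for col, val in conds.items():    # e.g. {"A1": "a11", "A2": "a21", "C": "c1"}
            match &= (fixed[col] == val)
        fixed.loc[match, attr] = dst
    return fixed

# Example: correct(d2, [({"A1": "a11", "A2": "a21", "C": "c1"}, "A5", "a51", "a52")])
# rewrites A5 = a52 back to a51 wherever A1 = a11, A2 = a21 and C = c1.
```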

19

Outline

1. Introduction
   - Noise
   - Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

20

Error-Tolerant Data Mining

- Get a set of diverse base training sets by resampling
- Unify error detection, correction, and data cleansing for each base training set to improve its quality
- Classifier ensembling

21

C2 Flowchart

[Figure: C2 flowchart]

22

Accuracy Enhancement

Three steps:

1. Locate noisy data in the given dataset
2. Recommend possible corrections
   - Attribute prediction
   - Construct a solution set
3. Select and perform one correction for each noisy instance


23

Attribute Prediction

- Switch each attribute (Ai) with the class label to train a classifier APi:
  - Ik: A1, A2, .., Ai, .., AN, C
  - I'k: A1, A2, .., C, .., AN, Ai → Classification Algorithm → APi
- Use APi to evaluate whether attribute Ai possibly contains an error
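As a hedged sketch of attribute prediction (the talk used C4.5 via Weka; a scikit-learn tree stands in here), assuming attributes and class are integer-coded in NumPy arrays:

```python
# Hedged sketch of attribute prediction: for each attribute Ai, swap it with
# the class label and train a predictor APi; a disagreement between APi's
# prediction and the stored value marks that value as possibly erroneous.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def attribute_predictors(X: np.ndarray, y: np.ndarray):
    """X: (n_instances, n_attrs) integer-coded attributes; y: class labels."""
    predictors = []
    for i in range(X.shape[1]):
        features = np.column_stack([np.delete(X, i, axis=1), y])  # A1..AN minus Ai, plus C
        predictors.append(DecisionTreeClassifier().fit(features, X[:, i]))
    return predictors

def suspect_cells(X, y, predictors):
    """Boolean mask: True where APi disagrees with the stored value of Ai."""
    mask = np.zeros_like(X, dtype=bool)
    for i, ap in enumerate(predictors):
        features = np.column_stack([np.delete(X, i, axis=1), y])
        mask[:, i] = ap.predict(features) != X[:, i]
    return mask
```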

24

Construct a Solution Set

- Apply the attribute predictors AP1, AP2, …, APi to instance Ik (A1, A2, .., Ai, .., C) to obtain candidate corrections A1', A2', …, Ai'
- For example, a solution set for instance Ik:
  {A1 --> A1', Aj --> Aj', {Ak1 --> Ak1', Ak2 --> Ak2'}}
- k = 3: maximum attribute value changes
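One plausible way to enumerate such a solution set (an illustrative sketch, not the paper's procedure): `suspect` holds the attribute indices flagged by the APi predictors, `proposed` their predicted replacement values, and k bounds how many values a single candidate may change.

```python
# Hedged sketch: enumerate candidate corrections for one instance, changing
# at most k of its suspect attribute values at a time.
from itertools import chain, combinations

def solution_set(suspect, proposed, k=3):
    """suspect: flagged attribute indices; proposed: {index: predicted value}.
    Returns candidates, each a dict {attribute index: new value}."""
    subsets = chain.from_iterable(
        combinations(suspect, r) for r in range(1, k + 1))
    return [{i: proposed[i] for i in subset} for subset in subsets]

# e.g. solution_set([0, 4, 7], {0: "a12", 4: "a52", 7: "a71"}, k=3)
# yields single, double, and triple changes such as {0: "a12", 4: "a52"}.
```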


25

Select and Perform Corrections

Flowchart: dataset D → (resampling) → base sets D1, D2, …, Dn with suspicious subsets S1, S2, …, Sn → (noise locating, detecting, and correcting) → corrected sets D1'', D2'', …, Dn'' → classifier ensembling
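Putting the pieces together, a hedged sketch of this C2-style loop (illustrative, not the ICDM 2006 implementation). It mirrors the bagging sketch after slide 6, except that each bootstrap sample passes through a noise locating/correcting stage, here a user-supplied `clean` function such as one built from the attribute-prediction and solution-set sketches above.

```python
# Hedged C2-style sketch: like plain bagging, but each bootstrap sample is
# cleaned before training; combine the learners with majority vote as before.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def c2_fit(X, y, clean, n_learners=10, seed=0):
    """`clean` takes (X, y) and returns a corrected (X, y) pair."""
    rng = np.random.default_rng(seed)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=len(X))  # resampling: base set Di
        Xc, yc = clean(X[idx], y[idx])              # locating/correcting: Di''
        learners.append(DecisionTreeClassifier().fit(Xc, yc))
    return learners
```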

26

Experimental Results




- We integrate the Weka-3-4 packages into our system
- We use the C4.5 classification tree
- Real-world datasets from the UCI data repository
- Attribute error corruption scheme: erroneous attribute values are introduced into each attribute independently at noise level x × 100%
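A hedged sketch of one plausible reading of this corruption scheme: with probability x, each attribute value is replaced by a different value drawn uniformly from that attribute's observed domain.

```python
# Hedged sketch of the attribute-corruption scheme: with probability x,
# replace each attribute value with a different value from the same domain.
import numpy as np

def corrupt(X: np.ndarray, x: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    noisy = X.copy()
    for j in range(X.shape[1]):                      # corrupt attributes independently
        domain = np.unique(X[:, j])
        hit = rng.random(len(X)) < x                 # cells to corrupt at level x*100%
        for i in np.flatnonzero(hit):
            choices = domain[domain != noisy[i, j]]  # force a *different* value
            if choices.size:
                noisy[i, j] = rng.choice(choices)
    return noisy
```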

27

Results


- C2 won 34 trials
- Bagging won 4 trials
- 2 trials were tied

28

Results: Monks3

[Figures: performance comparison on base learners and on four methods, at noise levels 10% and 20%]

29

Results: Monks3 (continued)

[Figures: performance comparison on base learners and on four methods, at noise levels 30% and 40%]

30

Performance Discussion

- C2 (ICDM 2006), Bagging, and ACE (ICTAI 2005) all outperform classifier T
- C2 outperforms Bagging in most trials
- When the noise level is high, the accuracy enhancement module is less reliable
- Improvement can come from the following aspects:
  - Locating noisy data
  - Recommending possible corrections
  - Selecting and performing one correction

31

Concluding Remarks

- A defining problem, and hence a long-term issue: data mining from large, noisy data sources with different types of noise
- Structured noise
  - A specific type of structured noise: associative noise
  - Associative profiling
- Random noise
  - C2: Corrective Classification
- Future work: how to combine noise profiling and noise-tolerant mining when the noise types are unknown?

32

References

1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001.
2. R.O. Duda et al., Pattern Classification (2nd Edition), Wiley-Interscience, 2000.
3. Y. Zhang, X. Zhu, X. Wu and J.P. Bond, ACE: An Aggressive Classifier Ensemble with Error Detection, Correction and Cleansing, IEEE ICTAI 2005, pp. 310-317.
4. Y. Zhang, X. Zhu and X. Wu, Corrective Classification: Classifier Ensembling with Corrective and Diverse Base Learners, ICDM 2006, pp. 1199-1204.
5. X. Zhu and X. Wu, Class Noise vs. Attribute Noise: A Quantitative Study of Their Impacts, Artificial Intelligence Review, 22(3-4): 177-210, 2004.
6. Y. Zhang and X. Wu, Noise Modeling with Associative Corruption Rules, ICDM 2007, pp. 733-738.

33

Acknowledgements

- Joint work with:
  - Dr. Xingquan Zhu
  - Yan Zhang
  - Dr. Jeffrey Bond
- Supported by:
  - DOD (US)
  - NSF (US)