EUS SVMs: Ensemble of Under-Sampled SVMs for Data Imbalance Problems


EUS SVMs:
Ensemble of Under-Sampled SVMs for Data Imbalance Problems
Department of Industrial Engineering, Seoul National University
Data Mining Laboratory
Pilsung Kang
May 19th, 2006 · Seoul National University, Datamining Laboratory, Pilsung Kang
Introduction
Data Imbalance
• The number of samples in one class is much smaller than in the other classes.
• Most machine learning algorithms are trained under the assumption that the class ratios are roughly equal.
Real-world problems
•Fraud detection (Fawcett and Provost, 1997)
• Oil spill detection (Kubat et al., 1998)
•Response modeling (Shin and Cho, 2003)
• Remote sensing (Bruzzone and Serpico, 1997)
• Scene classification (Yan et al., 2003)
Many classification tasks suffer from the class imbalance problem.
Related Work
Under-Sampling
• Sample from the majority class according to some rule.
• The number of sampled data points is usually equal to the size of the minority class, or at least much smaller than the size of the majority class.
Advantage
•Much faster than other methods.
Disadvantage
• Cannot represent the entire data distribution, leading to a loss of generality and high variance.
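As a concrete illustration (not from the slides), random under-sampling takes only a few lines of Python; the function name and interface below are my own sketch:

```python
import random

def random_under_sample(majority, minority, seed=0):
    """Draw a majority-class subset the same size as the minority class,
    then combine it with the full minority class."""
    rng = random.Random(seed)
    sampled_majority = rng.sample(majority, k=len(minority))
    return sampled_majority + list(minority)
```

The speed advantage and the variance problem both follow directly: the classifier sees only len(minority) of the majority points, and a different random draw can give a noticeably different decision boundary.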
Related Work
Over-Sampling
• Sample or generate data from the minority class according to some rule.
• The number of sampled data points is usually equal to the size of the majority class, or at least much larger than the size of the minority class.
Advantage
•Can represent the entire data distribution.
Disadvantage
•Much slower than other methods.
•Information redundancy.
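The simplest variant, random over-sampling by replication, can be sketched as follows (my own naming; more elaborate rules such as SMOTE generate synthetic points instead of copies):

```python
import random

def random_over_sample(majority, minority, seed=0):
    """Replicate minority-class points (sampling with replacement)
    until both classes are the same size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority) + list(minority) + extra
```

The information-redundancy disadvantage is visible here: the extra points are exact duplicates, and the training set grows to twice the majority-class size, which is what slows training down.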
Related Work
Cost Modification
• Assign different costs to each class to compensate for the class imbalance.
Mixture Model
• Employ more than one algorithm and aggregate their outputs.
Partitioning
•Partition the majority class samples.
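To make the cost-modification idea concrete, a minimal cost-sensitive decision rule (my own illustration, not the slides' method) predicts the minority class whenever the expected cost of missing it exceeds the expected cost of a false alarm:

```python
def cost_sensitive_predict(p_minority, c_fn, c_fp):
    """Return +1 (minority) when p_minority * c_fn, the expected cost of a
    false negative, exceeds (1 - p_minority) * c_fp, the expected cost of a
    false positive; otherwise return -1 (majority)."""
    return 1 if p_minority * c_fn > (1.0 - p_minority) * c_fp else -1
```

With a false-negative cost ten times the false-positive cost, even a point with only a 20% minority probability is labelled as minority; with equal costs it is not.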
The Purpose and Scope of the Research
Purpose
Propose the Ensemble of Under-Sampled SVMs (EUS SVMs)
Expected Contribution
•It may improve the performance of the classifier.
• It may reduce the skewness of the data distribution.
Scope
• Domain: Imbalanced data
• Type of pattern recognition: Two-class classification
• Algorithm: Support Vector Machine
• Main purpose: Improve the performance of a classifier
Performance Measures
• Simple Accuracy = (TP+TN)/(TP+FN+FP+TN)
• A+ = TP/(TP+FN), A- = TN/(FP+TN)
• Geometric Mean (G-Mean) = sqrt(A+ × A-)
• Recall = TP/(TP+FN), Precision = TP/(TP+FP)
• F1 = 2 × Recall × Precision / (Recall + Precision)
• AUROC = Area Under the ROC curve

Confusion matrix:
                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)
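The measures above can be computed directly from the four confusion-matrix counts; a small sketch (function and dictionary keys are my own naming):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Compute the performance measures from confusion-matrix counts."""
    a_pos = tp / (tp + fn)                       # A+ (equals Recall)
    a_neg = tn / (fp + tn)                       # A-
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    g_mean = math.sqrt(a_pos * a_neg)            # geometric mean of A+ and A-
    precision = tp / (tp + fp)
    f1 = 2 * a_pos * precision / (a_pos + precision)
    return {"A+": a_pos, "A-": a_neg, "Accuracy": accuracy,
            "G-Mean": g_mean, "Precision": precision, "F1": f1}
```

Note how imbalance distorts simple accuracy: with 50 minority and 950 majority points, a classifier that misses a fifth of the minority class can still score 97% accuracy, while the G-Mean exposes the weaker minority-class performance.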
Curse of Imbalance
4*4 Checker Board Data 1 (Set A)

Set     Ratio (Min:Maj)   Minority Samples   Majority Samples   Total Samples
Set 1   1:1               320                320                640
Set 2   1:3               320                960                1,280
Set 3   1:5               320                1,600              1,920
Set 4   1:10              320                3,200              3,520
Set 5   1:30              320                9,600              9,920
Set 6   1:50              320                16,000             16,320
Base Classifier: Support Vector Machine
Sufficient Minority Class Data
Curse of Imbalance
[Figure: A+, A-, Accuracy, and G-Mean versus class ratio (1:1 to 1:50), sufficient minority class data]
Curse of Imbalance
Geometric Means and Total Elapsed Time of Existing Methods
[Figures: geometric mean (left) and total elapsed time, log scale (right) for No Sampling, Under-Sampling, Over-Sampling, and Modifying Cost at class ratios 1:1 to 1:50, sufficient minority class data]
Curse of Imbalance
4*4 Checker Board Data 2 (Set B)

Set     Ratio (Min:Maj)   Minority Samples   Majority Samples   Total Samples
Set 1   1:5               80                 400                480
Set 2   1:10              80                 800                880
Set 3   1:30              80                 2,400              2,480
Set 4   1:50              80                 4,000              4,080
Set 5   1:100             80                 8,000              8,080
Base Classifier: Support Vector Machine
Insufficient Minority Class Data
Curse of Imbalance
A+, A-, and Geometric Mean of Under-Sampling SVM
[Figure: A+, A-, and G-Mean curves for (a) sufficient minority class samples, class ratios 1:1 to 1:50, and (b) insufficient minority class samples, class ratios 1:5 to 1:100]
EUS SVMs (Ensemble of Under-Sampled SVMs)
Step 1: Partitioning Training Data
– Divide the training data into majority class and minority class data.
Step 2: Constructing Majority Class Data Subsets
– Randomly sample without replacement from the majority class data.
– The number of sampled data points is equal to the number of minority class data points.
– The full majority class data set is available in each sampling step.
Step 3: Constructing Ensemble Training Data Sets
– Construct the training data sets by combining each sampled subset with the minority class data.
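Steps 1 to 3 can be sketched with NumPy (a sketch under my own naming, assuming binary labels with the minority class labelled 1):

```python
import numpy as np

def build_eus_subsets(X, y, n_subsets, minority_label=1, seed=0):
    """Build N balanced training subsets for the EUS ensemble."""
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]   # Step 1: partition the data
    maj_idx = np.where(y != minority_label)[0]
    subsets = []
    for _ in range(n_subsets):
        # Step 2: sample |minority| majority points without replacement,
        # drawing from the full majority class each time
        sampled = rng.choice(maj_idx, size=min_idx.size, replace=False)
        # Step 3: combine the sample with the full minority class data
        idx = np.concatenate([min_idx, sampled])
        subsets.append((X[idx], y[idx]))
    return subsets
```

Each of the N balanced subsets then trains one SVM, and the N trained classifiers are combined in the aggregation step that follows.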
EUS SVMs (Ensemble of Under-Sampled SVMs)
[Diagram: the majority class data are divided into Majority Class Data Subsets 1 to N; each subset is combined with the full Minority Class Data to form Training Data Subsets 1 to N; each training subset trains one SVM classifier; the N classifiers are combined by ensemble aggregation]
Experiments
Data (4*4 Checker Board, Set B)

Set     Ratio (Min:Maj)   Minority Samples   Majority Samples   Total Samples
Set 1   1:5               80                 400                480
Set 2   1:10              80                 800                880
Set 3   1:30              80                 2,400              2,480
Set 4   1:50              80                 4,000              4,080
Set 5   1:100             80                 8,000              8,080
Experiments
Ensemble Aggregation Methods
Majority Voting
• Each individual classifier votes for one of the candidate outputs.
• The candidate output with the most votes becomes the output of the ensemble.
Weighted Voting
• Use each classifier's training performance as its voting weight.
Function Value Aggregation
• Aggregate each classifier's output function values.
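The three aggregation methods can be sketched for a two-class problem with ±1 labels (my own function names; a sketch, not the authors' implementation):

```python
import numpy as np

def majority_vote(preds):
    """preds: (n_classifiers, n_samples) array of +/-1 votes.
    The sign of the vote sum is the ensemble output."""
    return np.sign(preds.sum(axis=0))

def weighted_vote(preds, weights):
    """Weight each classifier's vote by its training performance."""
    return np.sign((weights[:, None] * preds).sum(axis=0))

def function_value_aggregation(decision_values):
    """Sum the raw SVM decision-function values, then take the sign,
    so confident classifiers contribute more than marginal ones."""
    return np.sign(decision_values.sum(axis=0))
```

Majority voting treats every classifier equally; the other two let either a classifier's track record (weighted voting) or its per-sample confidence (function value aggregation) break close votes.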
Experiment
Results
[Figure (a): performance of Under-Sampling and EUS SVMs (MV, WV, FVA) on the 4*4 Checker Board Data (Set B), class ratios 1:5 to 1:100]
Experiment
Results
[Figure (b): performance of Under-Sampling and EUS SVMs (MV, WV, FVA) on the Spiral Data, class ratios 1:5 to 1:100]
Experimental Result
• Both Under-Sampling and EUS SVMs are effective in dealing with data imbalance.
• EUS SVMs outperform the Under-Sampling SVM.
• There is no significant difference among the three ensemble aggregation methods.
Conclusion
• We proposed EUS SVMs (Ensemble of Under-Sampled SVMs) to overcome the limitations of existing methods in dealing with data imbalance.
• EUS SVMs proved more effective than existing methods on two synthetic data sets.
Future Work
• Experiments on real data sets should be performed.
• More sophisticated sampling methods could be adopted instead of simple random sampling.
Q & A