Research Progress Report

Oct 15, 2013

Active Learning for Class
Imbalance Problem

Problem to be addressed


Motivation


class imbalance problem



The situation in which at least one class has significantly
fewer training examples than the others


Equivalently, the training examples belonging to one class
heavily outnumber those belonging to the other class



Most machine learning algorithms assume the training data to be
balanced, e.g., support vector machines, logistic regression, and
the naïve Bayes classifier



During the last few decades, some effective methods have been
proposed to attack this problem, such as up-sampling,
down-sampling, and asymmetric bagging




Problem to be addressed


Detailed problem


Traditional machine learning algorithms are often
biased toward the majority class



Since the goal of these classifiers is to reduce the overall training
error, they do not take the class distribution into consideration



Consequently, examples from the majority class are
well-classified, while examples from the minority class
tend to be misclassified




Several Common Approaches


From the data perspective


Over-sampling


Under-sampling


Asymmetric Bagging


From the learning algorithm perspective


Adjusting the cost function


Tuning the related parameters
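A minimal sketch of the two data-perspective resampling approaches above, using NumPy (the function names and the balance-to-parity choice are my illustration, not from the paper; the minority class is assumed to be the smaller one):

```python
import numpy as np

def oversample_minority(X, y, minority_label=1, seed=0):
    """Duplicate minority examples (with replacement) until classes balance."""
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def undersample_majority(X, y, minority_label=1, seed=0):
    """Randomly drop majority examples down to the minority class size."""
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]
```

From the learning-algorithm perspective, the analogous adjustment in scikit-learn would be passing class_weight='balanced' to the classifier, which rescales the misclassification cost instead of resampling the data.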






Background Knowledge


Active Learning


Similar to semi-supervised learning, the key idea is to use
both labeled and unlabeled data for classifier training



Active learning is composed of four components


A small set of labeled training data, a large pool of unlabeled data,
a base learning algorithm, and an active learner (selection strategy)



Active learning is not a machine learning algorithm itself; it can
be seen as an enhancing wrapper method



The difference between semi-supervised learning and active
learning: semi-supervised methods let the model infer labels for
unlabeled data on its own, whereas active learning asks an oracle
(e.g., a human annotator) to label the selected instances, as
sketched below
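A schematic of how the four components interact in a pool-based loop (a sketch only; query_oracle stands in for the human annotator and select for whichever selection strategy is used):

```python
import numpy as np

def active_learning_loop(X_lab, y_lab, X_pool, base_learner, select,
                         query_oracle, budget=100):
    """Wrap any base learner: repeatedly pick an instance, ask the oracle,
    retrain. Labels come from the oracle, not from the model itself,
    which is the key difference from semi-supervised self-training."""
    for _ in range(budget):
        model = base_learner().fit(X_lab, y_lab)           # retrain on labeled set
        i = select(model, X_pool)                          # selection strategy picks one index
        X_lab = np.vstack([X_lab, X_pool[i:i + 1]])        # move instance to labeled set...
        y_lab = np.append(y_lab, query_oracle(X_pool[i]))  # ...with its oracle-provided label
        X_pool = np.delete(X_pool, i, axis=0)
    return base_learner().fit(X_lab, y_lab)
```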




Background Knowledge


Active Learning


Goals of active learning


Maximizing the learning performance while
minimizing the required labeled training examples


Achieving better performance using the same
amount of labeled training data


Needing fewer training samples to obtain the same
learning performance


Background Knowledge

An Example


SVM-based Active Learning


A small set of labeled training examples


A large pool of unlabeled data


Base learning algorithm: SVM


Active Learner (selection strategy)


Instances closest to the current separating
hyperplane are selected and sent to a human for labeling
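A sketch of this selection strategy with scikit-learn's SVC (my illustration of the idea, not the paper's code); the instance with the smallest absolute decision value is the one closest to the hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

def closest_to_hyperplane(model, X_pool):
    """Index of the unlabeled instance nearest the separating hyperplane."""
    margins = np.abs(model.decision_function(X_pool))  # |f(x)|, proportional to distance
    return int(np.argmin(margins))

# Usage, plugged into the active learning loop sketched earlier:
# model = SVC(kernel="linear").fit(X_lab, y_lab)
# i = closest_to_hyperplane(model, X_pool)  # instance to send for human labeling
```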

Problems


SVM-based Active Learning


In classical active learning methods, the most informative samples
are selected from the entire unlabeled pool



In other words, each iteration of active learning involves
computing the distance from every sample to the decision boundary



For large-scale data sets, this is time-consuming and
computationally inefficient


Paper Contribution


Proposed Method


Instead of querying the whole unlabeled pool,
a random subset of L instances is first selected


The closest sample is then selected from this subset, using the
criterion that, with a chosen probability, it is among the top
p·100% of instances closest to the hyperplane

Paper Contribution


Proposed Method


The probability that at least one of the L randomly sampled
instances is among the top p·100% closest instances is

P = 1 - (1 - p)^L


Solving for the subset size, we have

L = ceil( log(1 - P) / log(1 - p) )
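A quick numeric check of the formula (solving for L given a target probability P and top fraction p):

```python
import math

def subset_size(P=0.95, p=0.05):
    """Smallest L such that 1 - (1 - p)**L >= P."""
    return math.ceil(math.log(1 - P) / math.log(1 - p))

print(subset_size(0.95, 0.05))  # 59
print(1 - (1 - 0.05) ** 59)     # ~0.952, i.e. at least 95%
```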

Paper Contribution


Proposed Method


For example


With p = 0.05 and P = 0.95, L = 59: the active learner will pick one
instance that, with 95% probability, is among the top 5% of
instances closest to the separating hyperplane, by randomly
sampling only 59 instances regardless of the training set size
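Putting the two ideas together, a sketch of the proposed selection step (function and variable names are mine; L = 59 matches the example above):

```python
import numpy as np

def select_from_random_subset(model, X_pool, L=59, seed=None):
    """Query only L random candidates instead of the whole pool, then pick
    the candidate closest to the separating hyperplane."""
    rng = np.random.default_rng(seed)
    subset = rng.choice(len(X_pool), size=min(L, len(X_pool)), replace=False)
    margins = np.abs(model.decision_function(X_pool[subset]))
    return int(subset[np.argmin(margins)])  # index into the full pool
```

Each iteration now costs L distance computations instead of one per pool instance, so the selection cost no longer grows with the size of the unlabeled pool.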







Experiments


Evaluation Metric


g-means

g = sqrt( sensitivity × specificity )

where sensitivity and specificity are the accuracies on the
positive and negative instances, respectively


Conclusions


This paper proposes a method to address the class
imbalance problem using an active learning technique



Experimental results show that this approach can achieve a
significant decrease in training time, while maintaining the same
or even higher g-means values using fewer training examples



Actively selecting informative examples from a
randomly chosen subset avoids searching the
whole unlabeled pool