A Short Discourse on Local Machine Learning



T. Wong
6/10/2008


Introduction


Traditional learning algorithms model biological human behaviors in perceiving and organizing data, and subsequently extracting information through heuristic experiments. In particular, the engineering process of imitating human analysis and inference of data often focuses on a specific problem and provides a customized solution for the issues at hand. This approach allows a great deal of freedom in the design and implementation of very diverse strategies, but often makes it difficult to formalize the algorithms analytically for generalization and optimization. At the opposite end of the spectrum, modern methods of statistical modeling of Machine Learning seek to generalize the adaptive process with algorithms that are not necessarily related to biological or social behavior, but are more tractable for quantitative analysis and optimization. In order to formalize the Machine Learning system in terms of the data and inference, and to quantify the training and decision processes analytically, the statistical approach is often more viable, although it is more restrictive in the choice of models and the actual design of the algorithm.


In recent years, with the advent of fast computers and sophisticated software, statistical modeling of Machine Learning has taken on a new life, and analytical algorithms are being applied successfully in many areas of industry and research: Data Mining, Classification, Bio-Informatics, Pattern Recognition, Intrusion Detection, etc. Fundamentally, the advantage of statistical modeling is its power in generalization and optimization. The formalization of the structure of data sets and of the algorithms for training and decision allows the learning process to be understood and quantified analytically, and statistical heuristics provide the direction and strategy for optimization. Moreover, the underlying mathematical structure makes it relatively efficient to extend the statistical model by combining and expanding mathematical properties of the data set and training process, or to transform decision algorithms by manipulating and adjusting parameters and weight factors. As a result, one of the notable recent developments in statistical Machine Learning is the Local Learning paradigm.


By Local Learning, we mean a methodology that does not use the complete, or Global, collection of data for training or testing. In essence, the Global Learning method processes and derives information from the complete data set, and conclusively arrives at a universal approximation function for predicting all future input patterns within the data space. In Global Learning, the training algorithm is run on the complete data set until the optimal decision function is obtained. Subsequently, unknown data patterns can be tested and classified by this decision function. In most cases, the Global Learning method implies a density distribution for the entire data space and a confidence level on recognizing new data points.


By contrast, the Local Learning model works on a subset of the available training data at a time. The purpose of Local Learning is not so much to find a universal decision function for all data patterns. Rather, it focuses on the specific task at hand and seeks a solution for a specific data point. Different areas in the data set might require different properties in the local algorithm. Thus, a local training algorithm based on the chosen subsets can render multiple and diverse decision functions, which can be optimized and combined to solve the complete data set. Moreover, in the case where a local algorithm is run for each new data pattern, the Local Learning process can be very computation-intensive.


The Motivation for Local Learning

In Machine Learning, the specific training process and method of generalization are often determined by the structure of the data set: the dimension and characteristics of the attributes, the size of the data, and its relationship with the decision space. In addition, there might be some a priori knowledge of the distribution of the training data. The traditional Global Learning approach intends to capture the characteristics of all the data attributes within the feature space, not only based on currently available data, but with some assumption on all future data patterns. In other words, Global Learning ordinarily does not accommodate generalization over changing hypotheses on account of unstable values in data attributes. One issue is that the training data might not be evenly distributed, causing the global training process to be skewed. Even with a strong learning algorithm, the unevenness in the data set can lead to good decisions near areas of well-represented training data and miscalculations in poorly sampled areas (high capacity). On the other hand, if the algorithm is tweaked to compensate for poorly sampled data to mitigate misclassification, it may lead to rejection of well-represented data (low capacity).


In general, the Global Learning approach implies a density distribution, associated with a decision function that is complete (all test data are accepted) and consistent (accurate decisions based on a pre-defined criterion, e.g., an acceptable loss rate). Intuitively, these cannot be realized unless the training data set is well-represented and evenly distributed, which is not the case most of the time. Moreover, requiring this type of "well-behaved" training data defeats the fundamental purpose of Machine Learning, which is to infer the characteristics of unknown data from whatever is available at hand.


As such, Global Learning models the trend of the data distribution in the data space and makes predictions accordingly. This is a very intuitive notion and is designed to work well if the density distribution and associated parameters are chosen optimally. To this end, this approach requires certain a priori or assumed knowledge about the training data set. As a simple example, the table below correlates temperature with traffic to the Cape, using a Naïve Bayesian algorithm:



Training Data Set:

                 No Traffic Jam    Traffic Jam
  Temperature         83               85
                      70               80
                      68               65
                      64               72
                      69               71
                      75
                      75
                      72
                      81

  Total                9                5
  Mean                73               74.6
  Std dev              6.2              7.9

Input Pattern:

  Temperature = 66

Assuming the distribution of the training data is normal with the above mean and standard deviation, the likelihoods of the input pattern are:

  P(temperature = 66 | no traffic jam) = 0.0340
  P(temperature = 66 | traffic jam)    = 0.0291

Using the Bayesian formula:

  P(traffic jam | temperature = 66) = 0.0291 / (0.0340 + 0.0291) = 0.461

Table 1


This simple machine learns from the statistics of the existing data and uses them to predict the new unknown pattern. The critical element here is the assumption of a normal distribution of the data.
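
The calculation can be sketched in a few lines of Python (a minimal sketch using only the numbers quoted in Table 1; the function and variable names are illustrative):

    import math

    def gaussian_pdf(x, mean, std):
        # Likelihood of x under a normal density with the given mean and std
        return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

    # Class statistics taken from Table 1
    x = 66
    p_no_jam = gaussian_pdf(x, 73.0, 6.2)   # ~0.034
    p_jam    = gaussian_pdf(x, 74.6, 7.9)   # ~0.028

    # Bayesian formula as in Table 1 (equal class weighting)
    print(p_jam / (p_jam + p_no_jam))       # ~0.45 (Table 1 reports 0.461 with its quoted likelihoods)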


Ultimately, it is the data set and data space that determine the learned distribution. The quality of the learning algorithm, then, will depend on how effective the arbitrary choice of prior is. A frequentist approach to the same problem might offer a more "reasonable" alternative for the estimate of the prior:






Training Data Set:

  Temperature    No Traffic Jam    Traffic Jam
  Hot                  2                2
  Mild                 4                2
  Cool                 3                1

  Total                9                5

Input Pattern:

  Temperature = cool

  P(temperature = cool | no traffic jam) = 3/9 = 0.33
  P(temperature = cool | traffic jam)    = 1/5 = 0.2

Using the Bayesian formula:

  P(traffic jam | temperature = cool) = 0.2 / (0.33 + 0.2) = 0.377

Table 2



Although the second model above applies the same Bayesian formula, the distribution of the training data is derived from the actual existing data, which might be considered a more valid prior. The above exercises demonstrate that by changing the way data attributes are represented, different models and results can be obtained.
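
The frequentist version reduces to simple counting, as in the following sketch (the category names follow Table 2; everything else is illustrative):

    # Frequency counts from Table 2, with temperature binned into categories
    counts = {
        "no_jam": {"hot": 2, "mild": 4, "cool": 3},   # total 9
        "jam":    {"hot": 2, "mild": 2, "cool": 1},   # total 5
    }

    def likelihood(cls, temp):
        # Relative frequency of the temperature bin within a class
        return counts[cls][temp] / sum(counts[cls].values())

    temp = "cool"
    p_no_jam = likelihood("no_jam", temp)  # 3/9 ~ 0.33
    p_jam    = likelihood("jam", temp)     # 1/5 = 0.2
    print(p_jam / (p_jam + p_no_jam))      # ~0.375 (Table 2's 0.377 uses the rounded 0.33)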


There are other issues with the structure of Global training data: the distribution can be non-linear, or the data can be drawn from several varied distributions; the training data can be high-dimensional and small in size; and there can be noise (unrecognizable attributes) in the data. While the quality of a learning algorithm is generally measured by the rate of error, the capacity (completeness and consistency) is a practical concern.


Local Learning methods, rather than modeling universal data to estimate the distribution, focus on a specific data pattern, given a subset of training data that is directly related to the task at hand. The simplest example of Local Learning applies the following procedure:

1) Given a test data point x0, find the closest points among the training data set through some distance or similarity measure.

2) Define a vicinity or neighborhood in n dimensions, with x0 as the center of the n-dimensional ball.

3) Run the learning algorithm using only the training points included inside the neighborhood of x0.

4) After the machine has been trained, run the decision algorithm on the input pattern x0.



Figure 1: A neighborhood in the data space centered on the test point x0, enclosing only the nearby training points x.
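
A minimal sketch of this four-step procedure, assuming numpy arrays and a nearest-centroid rule as a stand-in local learner (the choice of k and all names are illustrative):

    import numpy as np

    def local_predict(x0, X_train, y_train, k=5):
        # Step 1: distances from x0 to every training point
        dists = np.linalg.norm(X_train - x0, axis=1)
        # Step 2: the neighborhood is the k nearest points (an n-dimensional ball)
        neighborhood = np.argsort(dists)[:k]
        X_local, y_local = X_train[neighborhood], y_train[neighborhood]
        # Step 3: train a simple local model on the neighborhood only --
        # here, the mean of each class (nearest centroid)
        centroids = {c: X_local[y_local == c].mean(axis=0) for c in np.unique(y_local)}
        # Step 4: decide x0 with the freshly trained local model
        return min(centroids, key=lambda c: np.linalg.norm(x0 - centroids[c]))

Because training happens inside the call, the model is rebuilt from scratch for every test point, which is exactly the computational cost described below.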



Firstly, this is a Lazy Learning algorithm, as the machine is not trained until an input data pattern needs to be classified. Moreover, for each unclassified data pattern xn, a new neighborhood has to be defined accordingly, thus choosing a different set of training data, and the training algorithm has to be run again for each xn. This is labor- and computation-intensive, but it does solve several issues: by localizing the training data relative to the input data pattern, the attributes contributing to the geometric location of the data points can effectively be eliminated. Since the algorithm is not seeking a solution for all the unknown data points, the formulation of the hypothesis can be more relaxed and simplified. For example, the prior information of the local training data can be made arbitrary, without consideration of the universal data set. The idea is to focus on the distribution of a group of data similar to the single input pattern at hand. A local assumption may not be universally correct, but it still serves the purpose of predicting the input pattern. The capacity of the machine can be managed by manipulating the locality of the training data. Since local algorithms always focus on a specific point, intuitively, they should be more efficient and accurate than their global counterparts. Here is a toy example:


We collect training data on dress code from around the world and use it to train a machine to decide whether the subject is a man or a woman. A globally trained algorithm will probably always designate the test subject a woman if the subject is observed to wear a skirt. This decision is inaccurate if the subject happens to be Scottish. Given the small percentage of Scottish subjects, it incurs a small overall error rate globally, but produces very inaccurate results when the algorithm is applied to the population of Scots.


Alternatively, if Local Learning is applied, running the same learning algorithm, a Scottish subject will be predicted by a machine trained with the data set most similar to the subject, thus achieving a high rate of accuracy. Moreover, the attributes that determine the nationality, race, and locality of the subject can be eliminated to reduce the dimension, because all the chosen training data points (Scotsmen) around the subject share these same attributes.


Indeed, the "local" attributes in the training set can determine the local decision function. The well-known Local Learning method, the k-Nearest Neighbors algorithm, relies entirely on the geometric vicinity of the training data to the subject to decide the output.
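
A bare-bones k-Nearest Neighbors sketch along these lines (k and all names are arbitrary choices):

    import numpy as np
    from collections import Counter

    def knn_classify(x0, X_train, y_train, k=3):
        # Classify x0 by majority vote among its k geometrically nearest neighbors
        dists = np.linalg.norm(X_train - x0, axis=1)
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]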


Many other popular algorithms employ local methods of divide-and-conquer to reduce an intractable problem to several less complex sub-problems.


The practices of Bagging, Boosting, and Cross-Validation all run algorithms on subsets of the training data to produce local decision functions, which are ultimately used to predict universal data.
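
For instance, a stripped-down bagging loop might look like the following sketch (assuming base_fit is any routine that trains on a subset and returns a callable decision function; all names are illustrative):

    import numpy as np

    def bagging_predict(x0, X_train, y_train, base_fit, n_models=25, seed=0):
        # Train one model per bootstrap subset, then majority-vote their decisions on x0
        rng = np.random.default_rng(seed)
        votes = []
        for _ in range(n_models):
            idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
            model = base_fit(X_train[idx], y_train[idx])            # a local decision function
            votes.append(model(x0))
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]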


In some instances, local methods can be applied to problems that do not have a natural linear solution. The so-called "XOR" type of data points cannot be classified by a simple linear model, such as the Perceptron. This problem can be easily solved by a Local Learning model, because the training data are locally linearly separable.


Figure 2: The XOR configuration of X and O points on the unit square. There is no linearly separable hyperplane for the global algorithm, but a local algorithm can determine where the input pattern x0 belongs with a linear hyperplane.
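
A small demonstration under assumed coordinates (the classic XOR corners, with an arbitrary test point) shows that the neighborhood of x0 is linearly separable even though the full set is not:

    import numpy as np

    # Classic XOR layout: X at (0,0) and (1,1), O at (0,1) and (1,0);
    # no single hyperplane separates the two classes globally
    X_train = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
    y_train = np.array(["X", "X", "O", "O"])

    x0 = np.array([0.1, 0.15])  # test point near the (0, 0) corner

    # Keep only the 2 nearest training points: (0,0) labelled X and (0,1) labelled O.
    # Any two distinct points are trivially linearly separable, and the
    # nearest-point rule below is exactly their perpendicular-bisector hyperplane.
    dists = np.linalg.norm(X_train - x0, axis=1)
    near = np.argsort(dists)[:2]
    print(y_train[near][np.argmin(dists[near])])  # "X"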



Some local methods, instead of finding the locality strictly around the input pattern, seek selected local training points to generalize the global decision. The Radial Basis Function approach assumes density distributions can be found in clusters in the data set, with certain "attractor" points at the center of each cluster. Each cluster is then trained with a local algorithm suitable for the assumed distribution. The resulting decision functions are then combined and averaged out for predicting unseen global patterns.
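
A minimal sketch of such a combination (Gaussian local functions centered on the cluster "attractors" are an assumption; centers, weights, and width are illustrative inputs):

    import numpy as np

    def rbf_predict(x0, centers, weights, width=1.0):
        # Each cluster center ("attractor") carries a local Gaussian decision
        # function; their normalized weighted average gives the global prediction
        acts = np.exp(-np.linalg.norm(centers - x0, axis=1) ** 2 / (2 * width ** 2))
        return acts @ weights / acts.sum()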


In recent years, the Support Vector Machine has taken center stage in the Local Learning arena. The SVM applies the training algorithm on selected local data points in order to find the optimal hyperplane, in terms of the widest margin between the hyperplane and the closest local points to the hyperplane. The SVM is a linear model. If separating hyperplanes exist for the training data, there is one hyperplane that is optimal, in the sense that this hyperplane has the greatest orthogonal margin from any point in the training data set. The points closest to the hyperplane, which define this widest margin, are called Support Vectors, because based on the training algorithm, these points alone determine the hyperplane, and thus the decision function. No other training points besides the support vectors contribute to the decision function. The strength of the SVM is in the transformation of the linear model into non-linear ones, using the "kernel trick" to map the training data to a higher-dimensional feature space.
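
As a brief illustration, scikit-learn's SVC exposes this behavior directly (the data here are arbitrary blobs):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
    y = np.array([0] * 40 + [1] * 40)

    # Linear SVM: the maximum-margin hyperplane; only the support
    # vectors, a few points near the boundary, determine it
    clf = SVC(kernel="linear").fit(X, y)
    print(len(clf.support_vectors_), "of", len(X), "points determine the hyperplane")

    # Kernel trick: the same algorithm with an RBF kernel yields a
    # non-linear boundary via an implicit higher-dimensional feature space
    clf_rbf = SVC(kernel="rbf").fit(X, y)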


As mentioned above, the SVM uses only a few local training points for the decision function. Therefore, it does not retain global information or the general structure of the data set. The SVM is very stable, not affected by the variation of unknown data, but it is not well-suited for predicting future data trends. In this respect, the SVM is suitable for problems with data distributions that have a well-structured linear boundary, or that can be transformed to have a clear linear boundary in a higher feature space.



