A Short Discourse on Local Machine Learning
T. Wong
6/10/2008
Introduction
Traditional learning algorithms model biological human behaviors in perceiving and organizing data, and subsequently extract information through heuristic experimentation. In particular, the engineering process of imitating human analysis and inference of data often focuses on a specific problem and provides a customized solution for the issues at hand. This approach allows a great deal of freedom in the design and implementation of very diverse strategies, but often makes it difficult to formalize the algorithms analytically for generalization and optimization. At the opposite end of the spectrum, modern methods of statistical modeling in Machine Learning seek to generalize the adaptive process with algorithms that are not necessarily related to biological or social behavior, but are more tractable for quantitative analysis and optimization. In order to formalize a Machine Learning system in terms of its data and inference, and to quantify the training and decision processes analytically, the statistical approach is often more viable, although it is more restrictive in the choice of models and the actual design of the algorithm.
In recent years, with the advent of fast computers and sophisticated software, statistical modeling of Machine Learning has taken on a new life, and analytical algorithms are being applied successfully in many areas of industry and research: Data Mining, Classification, Bioinformatics, Pattern Recognition, Intrusion Detection, etc.
Fundamentally, the advantage of statistical modeling is its power in generalization and optimization. Formalizing the structure of the data sets and the algorithms for training and decision allows the learning process to be understood and quantified analytically, while statistical heuristics provide the direction and strategy for optimization. Moreover, the underlying mathematical structure makes it relatively efficient to extend a statistical model by combining and expanding mathematical properties of the data set and training process, or to transform decision algorithms by manipulating and adjusting parameters and weight factors. As a result, one of the notable recent developments in statistical Machine Learning is the Local Learning paradigm.
By Local Learning, we mean a methodology that does not use the complete, or Global, collection of data for training or testing. In essence, the Global Learning method processes and derives information from the complete data set, and conclusively arrives at a universal approximation function for predicting all future input patterns within the data space. In Global Learning, the training algorithm is run on the complete data set until the optimal decision function is obtained. Subsequently, unknown data patterns can be tested and classified by this decision function. In most cases, the Global Learning method implies a density distribution for the entire data space and a confidence level on recognizing new data points.

By contrast, the Local Learning model works on a subset of the available training data at a time. The purpose of Local Learning is not so much to find a universal decision function for all data patterns. Rather, it focuses on the specific task at hand and seeks a solution for a specific data point. Different areas in the data set might require different properties in the local algorithm. Thus, a local training algorithm based on the chosen subsets can render multiple and diverse decision functions, which can be optimized and combined to solve the complete data set. On the other hand, when a local algorithm is run for each new data pattern, the Local Learning process can be very computation-intensive.
The Motivation for Local Learning
In Machine Learning, the specific training process and method of generalization are often determined by the structure of the data set: the dimension and characteristics of the attributes, the size of the data, and its relationship with the decision space. In addition, there might be some a priori knowledge of the distribution of the training data. The traditional Global Learning approach intends to capture the characteristics of all the data attributes within the feature space, not only based on currently available data, but with some assumption about all future data patterns. In other words, Global Learning ordinarily does not accommodate generalization over changing hypotheses on account of unstable values in data attributes.
One issue is that the training data might not be evenly distributed, causing the global training process to be skewed. Even with a strong learning algorithm, unevenness in the data set can lead to good decisions near areas of well-represented training data and miscalculation in poorly sampled areas (high capacity). On the other hand, if the algorithm is tweaked to compensate for poorly sampled data and mitigate misclassification, it may lead to rejection of well-represented data (low capacity).
In general, the Global Learning approach implies a density distribution, associated with a decision function that is complete (all test data are accepted) and consistent (accurate decisions based on a pre-defined criterion, e.g. an acceptable loss rate). Intuitively, these cannot be realized unless the training data set is well-represented and evenly distributed, which is not the case most of the time. Moreover, requiring this type of “well-behaved” training data defeats the fundamental purpose of Machine Learning, which is to infer the characteristics of unknown data from whatever is available at hand.
As such, Global Learning models the trend of the data distribution in the data space and makes predictions accordingly. This is a very intuitive notion and is designed to work well if the density distribution and associated parameters are chosen optimally. To this end, this approach requires certain a priori or assumed knowledge about the training data set. As a simple example, the following Naïve Bayesian model correlates temperature with traffic to the Cape:
Training Data Set:

               No Traffic Jam                Traffic Jam
Temperature    83, 70, 68, 64, 69,           85, 80, 65, 72, 71
               75, 75, 72, 81
Total           9                             5
Mean           73                            74.6
Std dev         6.2                           7.9

Input Pattern: Temperature = 66

Assuming the distribution of the training data is normal with the above mean and standard deviation, the likelihoods of the input pattern are:

P(temperature = 66 | no traffic jam) = 0.0340
P(temperature = 66 | traffic jam) = 0.0291

Using the Bayesian formula:

P(traffic jam | temperature = 66) = 0.0291 / (0.0340 + 0.0291) = 0.461

Table 1
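As a quick check of Table 1's arithmetic, the likelihoods and posterior can be recomputed in a few lines of Python (a sketch, not part of the original exercise; note that recomputing from the rounded sample statistics gives a traffic-jam likelihood of about 0.028 and a posterior of about 0.45, slightly below the table's rounded figures):

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Sample statistics from Table 1
x = 66  # input pattern: temperature = 66
p_no_jam = normal_pdf(x, 73.0, 6.2)   # likelihood given no traffic jam, ~0.0340
p_jam = normal_pdf(x, 74.6, 7.9)      # likelihood given traffic jam

# Bayesian formula with equal priors, as in Table 1
posterior_jam = p_jam / (p_no_jam + p_jam)
print(round(p_no_jam, 4), round(p_jam, 4), round(posterior_jam, 3))
```
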
This simple machine learns from the statistics of the existing data and uses them to predict the new unknown pattern. The critical element here is the assumption of a normal distribution for the data. Ultimately, of course, it is the data set and data space that determine the learned distribution. The quality of the learning algorithm, then, will depend on how effective the arbitrary choice of the prior is. A frequentist approach to the same problem might offer a more “reasonable” alternative for the estimate of the prior:
Training Data Set:

Temperature    No Traffic Jam    Traffic Jam
Hot                  2                2
Mild                 4                2
Cool                 3                1
Total                9                5

Input Pattern: Temperature = cool

P(temperature = cool | no traffic jam) = 3/9 = 0.33
P(temperature = cool | traffic jam) = 1/5 = 0.2

Using the Bayesian formula:

P(traffic jam | temperature = cool) = 0.2 / (0.33 + 0.2) = 0.377

Table 2
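The frequentist version in Table 2 is just as direct to verify (a sketch; with unrounded fractions the posterior comes out to 0.375 rather than the 0.377 obtained from the rounded likelihood 0.33):

```python
# Frequency counts from Table 2
counts = {
    "no traffic jam": {"hot": 2, "mild": 4, "cool": 3},  # total 9
    "traffic jam":    {"hot": 2, "mild": 2, "cool": 1},  # total 5
}

def likelihood(temp, label):
    """P(temperature = temp | label), estimated from the observed frequencies."""
    table = counts[label]
    return table[temp] / sum(table.values())

p_no_jam = likelihood("cool", "no traffic jam")  # 3/9
p_jam = likelihood("cool", "traffic jam")        # 1/5 = 0.2

posterior_jam = p_jam / (p_no_jam + p_jam)
print(round(posterior_jam, 3))  # → 0.375
```
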
Although the second model above applies the same Bayesian formula, the distribution of the training data is derived from the actual existing data, which might be considered a more valid prior. These exercises demonstrate that by changing the way data attributes are represented, different models and results can be produced.
There are other issues with the structure of Global training data: the distribution can be non-linear, or derived from several varied distributions; the training data can be high-dimensional and small in size; or there can be noise (unrecognizable attributes) in the data. While the quality of a learning algorithm is generally measured by its rate of error, the capacity (completeness and consistency) is a practical concern.
Local Learning methods, rather than modeling universal data to estimate the distribution, focus on a specific data pattern, given a subset of training data directly related to the task at hand. The simplest example of Local Learning applies the following procedure:

1) Given a test data point x0, find the closest points among the training data set through some distance or similarity measure
2) Define a vicinity or neighborhood in n dimensions, with x0 as the center of the n-dimensional ball
3) Run the learning algorithm using only the training points included inside the neighborhood of x0
4) After the machine has been trained, run the decision algorithm on the input pattern x0.
Figure 1: A neighborhood of training points (x) surrounding the test point x0.
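The four-step procedure can be sketched in a few lines of Python. Here the "learning algorithm" run inside the neighborhood is simply a majority vote over the labels, and the data set is a hypothetical toy example:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance measure for step 1."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_classify(x0, training_data, k=5):
    # Steps 1-2: the k closest training points form the neighborhood ball around x0
    neighborhood = sorted(training_data, key=lambda pt: euclidean(pt[0], x0))[:k]
    # Step 3: run the learning algorithm on the neighborhood only
    # (here, a plain majority vote over the neighbors' labels)
    votes = Counter(label for _, label in neighborhood)
    # Step 4: the decision for the input pattern x0
    return votes.most_common(1)[0][0]

data = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((1.0, 1.0), "B"),
        ((0.9, 1.1), "B"), ((0.1, 0.2), "A")]
print(local_classify((0.05, 0.05), data, k=3))  # → A
```

Note that nothing is computed until a test point arrives; the sorting and voting happen per query, which is exactly the lazy, per-pattern cost discussed next.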
Firstly, this is a Lazy Learning algorithm, as the machine is not trained until an input data pattern needs to be classified. Moreover, for each unclassified data pattern xn, a new neighborhood has to be defined accordingly – thus choosing a different set of training data – and the training algorithm has to be run again for each xn.
This is labor- and computation-intensive, but it does solve several issues. By localizing the training data relative to the input data pattern, the attributes contributing to the geometric location of the data points can effectively be eliminated. Since the algorithm is not seeking a solution for all the unknown data points, the formulation of the hypothesis can be more relaxed and simplified. For example, the prior information of the local training data can be made arbitrary, without consideration of the universal data set. The idea is to focus on the distribution of a group of data similar to the single input pattern at hand. A local assumption may not be universally correct, but it still serves the purpose of predicting the input pattern. The capacity of the machine can be managed by manipulating the locality of the training data. Since local algorithms always focus on a specific point, intuitively, they should be more efficient and accurate than their global counterparts.
Here is a toy example:

We collect training data on dress code from around the world and use it to train a machine to decide whether the subject is a man or a woman. A globally trained algorithm will probably always designate the test subject a woman if the subject is observed to wear a skirt. This decision is inaccurate if the subject happens to be Scottish. Given the small percentage of Scottish subjects, this incurs a small overall error rate globally, but produces very inaccurate results when the algorithm is applied to the population of Scots.
Alternatively, if Local Learning is applied, running the same learning algorithm, a Scottish subject will be predicted by a machine trained with the data set most similar to the subject, thus achieving a high rate of accuracy. Moreover, the attributes that determine the nationality, race, and locality of the subject can be eliminated to reduce the dimension, because all the chosen training data points (Scotsmen) around the subject will share these same attributes.
Indeed, the “local” attributes in the training set can determine the local decision function. The best-known Local Learning method, the k-Nearest Neighbors algorithm, relies entirely on the geometric vicinity of the training data to the subject to decide the output. Many other popular algorithms employ local methods of divide-and-conquer to reduce an intractable problem to several less complex sub-problems.
The practices of Bagging, Boosting, and Cross-Validation all run algorithms on subsets of the training data to produce local decision functions, which are ultimately used to predict universal data.
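To make that concrete, here is a minimal bagging sketch (the toy data and the threshold "stump" learner are illustrative assumptions, not from the text): each decision function is trained on a bootstrap subset of the data, and the local functions are combined by majority vote to classify new points:

```python
import random

def train_stump(subset):
    """A trivial learner: threshold at the subset's mean feature value."""
    mean = sum(x for x, _ in subset) / len(subset)
    # Orientation: is the region above the threshold mostly labeled 1?
    above = [y for x, y in subset if x >= mean]
    positive_above = bool(above) and sum(above) >= len(above) / 2
    return lambda x: int((x >= mean) == positive_above)

def bagging_predict(x, data, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_models):
        # Each local decision function is trained on a bootstrap subset
        subset = [rng.choice(data) for _ in range(len(data))]
        votes += train_stump(subset)(x)
    # The local functions are combined by majority vote
    return int(votes > n_models / 2)

data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
print(bagging_predict(0.85, data))
```
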
In some instances, local methods can be applied to problems that do not have a natural linear solution. The so-called “XOR” type of data points cannot be classified by a simple linear model, such as the Perceptron. This problem can be easily solved by a Local Learning model, because the training data are locally linearly separable.
Figure 2: There is no linearly separable hyperplane for the global algorithm, but a local algorithm can determine, with a linear hyperplane, where the input pattern x0 belongs.
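A minimal sketch of Figure 2's point (the coordinates are assumed toy values): a perceptron trained on all eight XOR-style points would never converge, but a perceptron trained only on the neighborhood of the input pattern sees a linearly separable subset and classifies x0 correctly:

```python
# XOR-type data: opposite corners share a class, so no single line separates them
data = [((0.0, 0.0), 0), ((1.0, 1.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1),
        ((0.1, 0.1), 0), ((0.9, 0.9), 0), ((0.1, 0.9), 1), ((0.9, 0.1), 1)]

def perceptron(points, epochs=50):
    """Plain perceptron; returns (w1, w2, b), assuming the points are separable."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in points:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            w1 += (y - pred) * x1
            w2 += (y - pred) * x2
            b += (y - pred)
    return w1, w2, b

def local_linear_classify(x0, data, k=3):
    # Train the linear model only on the k points nearest to the input pattern
    near = sorted(data, key=lambda p: (p[0][0]-x0[0])**2 + (p[0][1]-x0[1])**2)[:k]
    w1, w2, b = perceptron(near)
    return 1 if w1 * x0[0] + w2 * x0[1] + b > 0 else 0

print(local_linear_classify((0.05, 0.95), data))  # → 1
print(local_linear_classify((0.95, 0.90), data))  # → 0
```
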
Some local methods, instead of finding the locality strictly around the input pattern, seek selected local training points to generalize the global decision. The Radial Basis Function approach assumes density distributions can be found in clusters in the data set, with certain “attractor” points at the center of each cluster. Each cluster is then trained with a local algorithm suitable for the assumed distribution. The resulting decision functions are then combined and averaged out for predicting unseen global patterns.
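A drastically simplified sketch of this idea (the cluster centers, labels, and width sigma are assumed toy values, and a full RBF network would learn weights over many basis functions rather than taking the strongest response): each “attractor” center contributes a Gaussian radial-basis activation, and the combined decision follows the dominant activation:

```python
import math

# Assumed toy clusters: (attractor center, class label)
clusters = [((0.0, 0.0), "A"), ((2.0, 2.0), "B")]

def rbf(x, center, sigma=0.5):
    """Gaussian radial basis activation of x around a cluster center."""
    d2 = (x[0] - center[0]) ** 2 + (x[1] - center[1]) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))

def classify(x):
    # Each cluster's local function contributes; the strongest activation wins
    scores = {label: rbf(x, center) for center, label in clusters}
    return max(scores, key=scores.get)

print(classify((0.3, 0.1)))  # → A
print(classify((1.8, 2.2)))  # → B
```
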
In recent years, the Support Vector Machine has taken center stage in the Local Learning arena. The SVM applies its training algorithm to selected local data points in order to find the optimal hyperplane, in terms of the widest margin between the hyperplane and the closest local points to it. The SVM is a linear model. If separating hyperplanes exist for the training data, there is one hyperplane that is optimal – in the sense that this hyperplane has the greatest orthogonal margin from any point in the training data set. The points closest to the hyperplane, which define this margin, are called Support Vectors, because based on the training algorithm, these points alone determine the hyperplane, and thus the decision function. No other training points besides the support vectors contribute to the decision function.
The strength of the SVM lies in the transformation of the linear model into non-linear ones using the “kernel trick” to map the training data into a higher-dimensional feature space.
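The kernel trick itself can be illustrated without a full SVM. The sketch below uses a kernel perceptron (not an SVM; the polynomial kernel and the data are illustrative assumptions): the dual-form update stores one coefficient per training point and never computes the higher-dimensional features explicitly, yet it separates the XOR pattern that defeats the linear perceptron:

```python
def poly_kernel(a, b):
    """Degree-2 polynomial kernel: an implicit map to a quadratic feature space."""
    return (1 + a[0] * b[0] + a[1] * b[1]) ** 2

# XOR with +/-1 labels: not linearly separable in the input space
X = [(0, 0), (1, 1), (0, 1), (1, 0)]
y = [-1, -1, 1, 1]

# Kernel perceptron: the feature map appears only inside kernel evaluations
alpha = [0, 0, 0, 0]
for _ in range(20):
    for i in range(4):
        f = sum(alpha[j] * y[j] * poly_kernel(X[j], X[i]) for j in range(4))
        if y[i] * f <= 0:  # misclassified (or on the boundary): strengthen this point
            alpha[i] += 1

def predict(x):
    f = sum(alpha[j] * y[j] * poly_kernel(X[j], x) for j in range(4))
    return 1 if f > 0 else -1

print([predict(x) for x in X])  # → [-1, -1, 1, 1]
```

The same dual form underlies the SVM decision function, except that the SVM chooses the coefficients to maximize the margin, leaving them nonzero only at the support vectors.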
As mentioned above, the SVM uses only a few local training points for the decision function. Therefore, it does not retain global information or the general structure of the data set. The SVM is very stable – not affected by the variation of unknown data – but not well-suited for predicting future data trends. In this respect, the SVM is suitable for problems with data distributions that have a well-structured linear boundary, or that can be transformed to have a clear linear boundary in a higher feature space.
Bibliography

Bontempi, Gianluca. Local Learning Techniques for Modeling, Prediction and Control.
Bottou, Léon; Vapnik, Vladimir. Local Learning Algorithms.
Burges, Christopher J. C. A Tutorial on Support Vector Machines for Pattern Recognition.
Frank, Eibe; Witten, Ian. Data Mining.
Huang, Kaizhu. Learning from Data Locally.
Scholkopf, Bernhard; Smola, Alexander. Learning with Kernels.
Vapnik, Vladimir. Introduction to Support Vector Machine.