Unsupervised learning
In machine learning, unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.
Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data. Many methods employed in unsupervised learning are based on data mining methods used to preprocess data.
Approaches to unsupervised learning include:
clustering (e.g., k-means, mixture models, hierarchical clustering),
blind signal separation using feature extraction techniques for dimensionality reduction (e.g., principal component analysis, independent component analysis, non-negative matrix factorization, singular value decomposition).
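As an illustration of the clustering approach, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points, and the choice to use the first k points as initial centroids are assumptions made to keep the example small and deterministic; practical implementations usually initialize more carefully.

```python
def kmeans(points, k, iterations=20):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Assumption: the first k points serve as initial centroids.
    """
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[idx] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Toy data: two well-separated groups, no labels given.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.1), (5.2, 4.9), (4.9, 5.1)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

No error or reward signal is used anywhere: the grouping comes only from distances between the unlabeled points.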
Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms.
The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties.
The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same cluster by means of a user-defined constant called the vigilance parameter. ART networks are also used for many pattern recognition tasks, such as automatic target recognition and seismic signal processing. The first version of ART was "ART1", developed by Carpenter and Grossberg (1988).
Supervised learning
Supervised learning is the machine learning task of inferring a function from supervised (labeled) training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete, see classification) or a regression function (if the output is continuous, see regression). The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias).
There are four major issues to consider in supervised learning:
1. Bias-variance tradeoff
2. Function complexity and amount of training data
3. Dimensionality of the input space
4. Noise in the output values
Generalizations of supervised learning
There are several ways in which the standard supervised learning problem can be generalized:
1. Semi-supervised learning: In this setting, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled.
2. Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.
3. Structured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, then standard methods must be extended.
4. Learning to rank: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.
Applications
Bioinformatics
Cheminformatics (quantitative structure–activity relationship)
Database marketing
Handwriting recognition
Information retrieval (learning to rank)
Object recognition in computer vision
Optical character recognition
Spam detection
Pattern recognition
Speech recognition
How supervised learning algorithms work
Given a set of training examples of the form $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, a learning algorithm seeks a function $g : X \to Y$, where $X$ is the input space and $Y$ is the output space. The function $g$ is an element of some space of possible functions $G$, usually called the hypothesis space. It is sometimes convenient to represent $g$ using a scoring function $f : X \times Y \to \mathbb{R}$ such that $g$ is defined as returning the $y$ value that gives the highest score: $g(x) = \arg\max_y f(x, y)$. Let $F$ denote the space of scoring functions.
Although $G$ and $F$ can be any space of functions, many learning algorithms are probabilistic models where $g$ takes the form of a conditional probability model $g(x) = P(y \mid x)$, or $f$ takes the form of a joint probability model $f(x, y) = P(x, y)$.
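The relationship between a scoring function and the classifier derived from it can be sketched as follows. Here the scoring function f is taken to be the empirical joint frequency of each (input, output) pair, a simple stand-in for a joint probability model, and the classifier g returns the output with the highest score. The helper names and the weather data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def make_scoring_function(training_examples):
    """Return f(x, y): the empirical joint frequency of the pair (x, y)."""
    counts = Counter(training_examples)
    total = len(training_examples)

    def f(x, y):
        return counts[(x, y)] / total

    return f


def make_classifier(f, output_space):
    """Return g(x): the output y in output_space that maximizes f(x, y)."""

    def g(x):
        return max(output_space, key=lambda y: f(x, y))

    return g


# Training examples are (input, desired output) pairs.
training_examples = [
    ("sunny", "play"), ("sunny", "play"), ("sunny", "stay"),
    ("rainy", "stay"), ("rainy", "stay"), ("rainy", "play"),
]
f = make_scoring_function(training_examples)
g = make_classifier(f, output_space=["play", "stay"])
print(g("sunny"), g("rainy"))
```

This table-lookup scorer cannot generalize to inputs it has never seen; a useful hypothesis space needs an inductive bias that lets scores carry over to new inputs.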
Reinforcement learning
Reinforcement learning is an area of machine learning in computer science, concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, statistics, and genetic algorithms.
In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are closely related to dynamic programming techniques.
The main difference from these classical techniques is that reinforcement learning algorithms do not need knowledge of the MDP, and they target large MDPs where exact methods become infeasible.
Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
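One common way to balance exploration and exploitation is an epsilon-greedy rule: with a small probability the agent tries a random action, otherwise it takes the action that currently looks best. The sketch below applies this rule to a hypothetical three-armed bandit with Bernoulli rewards; the success probabilities, step count, and function name are assumptions made for the example, not part of the text above.

```python
import random


def epsilon_greedy_bandit(success_probs, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit and return what was learned."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms  # running mean reward per arm
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # exploration
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploitation
        reward = 1 if rng.random() < success_probs[arm] else 0
        pulls[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        total_reward += reward
    return estimates, pulls, total_reward


estimates, pulls, total_reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_arm, pulls, total_reward)
```

The agent is never told which arm is correct; it only sees the reward of the arm it chose, which is what distinguishes this setting from supervised learning.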
Introduction
The basic reinforcement learning model consists of:
1. a set of environment states;
2. a set of actions;
3. rules of transitioning between states;
4. rules that determine the scalar immediate reward of a transition; and
5. rules that describe what the agent observes.
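The five components listed above can be made concrete with a toy environment. The sketch below defines a hypothetical five-state corridor (states, actions, a transition rule, a reward rule, and full observation of the state) and learns action values with tabular Q-learning, a well-known algorithm chosen here purely for illustration. The learner never reads the transition or reward rules directly; it only samples them through `step`.

```python
import random

N_STATES = 5               # environment states 0..4
ACTIONS = (-1, +1)         # actions: move left or move right
GOAL = N_STATES - 1        # reaching the rightmost state ends an episode


def step(state, action):
    """Transition and reward rules; the agent observes the new state fully."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            # No future value is available once the goal is reached.
            best_next = 0.0 if next_state == GOAL else max(
                q[(next_state, a)] for a in ACTIONS
            )
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q


q = q_learning()
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(policy)
```

Because Q-learning is off-policy, the behaviour here can be fully random while the learned values still describe the greedy policy, which in this corridor is to move right in every state.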
Reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon and checkers.
Two components make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in any of the following situations:
A model of the environment is known, but an analytic solution is not available;
Only a simulation model of the environment is given (the subject of simulation-based optimization);
The only way to collect information about the environment is by interacting with it.
The first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to machine learning problems.
Decision tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. If in practice decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as a best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.
A decision tree consists of 3 types of nodes:
1. Decision nodes, commonly represented by squares
2. Chance nodes, represented by circles
3. End nodes, represented by triangles
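The three node types can be evaluated by "rolling back" the tree from the end nodes: a chance node takes the probability-weighted average of its branches, and a decision node takes the best of its options. The sketch below assumes a dictionary-based tree representation and a made-up launch/hold example with invented payoffs and probabilities; it is an illustration, not a standard library interface.

```python
def rollback(node):
    """Return (expected value, best option label or None) for a tree node."""
    kind = node["type"]
    if kind == "end":
        # End node: the payoff is known.
        return node["payoff"], None
    if kind == "chance":
        # Chance node: probability-weighted average over outcomes.
        value = sum(p * rollback(child)[0] for p, child in node["branches"])
        return value, None
    # Decision node: pick the option with the highest expected value.
    best_label, best_value = None, float("-inf")
    for label, child in node["options"].items():
        value = rollback(child)[0]
        if value > best_value:
            best_label, best_value = label, value
    return best_value, best_label


tree = {
    "type": "decision",
    "options": {
        "launch": {
            "type": "chance",
            "branches": [
                (0.6, {"type": "end", "payoff": 100.0}),
                (0.4, {"type": "end", "payoff": -50.0}),
            ],
        },
        "hold": {"type": "end", "payoff": 20.0},
    },
}
value, choice = rollback(tree)
print(choice, value)
```

In this example "launch" has expected value 0.6 * 100 + 0.4 * (-50) = 40, which beats the certain payoff of 20 for "hold".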
Advantages
Decision trees:
Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
Use a white box model. If a given result is provided by a model, the explanation for the result is easily replicated by simple math.
Can be combined with other decision techniques. The following example uses Net Present Value calculations, PERT 3-point estimations (decision #1) and a linear distribution of expected outcomes.
Disadvantages
For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels.
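The bias toward attributes with many levels can be seen in a small calculation. The sketch below computes information gain for a made-up dataset in which a meaningless `id` column takes a different value on every row; the column names and rows are hypothetical. Because every `id` group contains a single example, its groups have zero entropy and the attribute receives the maximum possible gain.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


rows = [
    {"id": 1, "windy": True, "play": "yes"},
    {"id": 2, "windy": True, "play": "no"},
    {"id": 3, "windy": False, "play": "yes"},
    {"id": 4, "windy": False, "play": "yes"},
    {"id": 5, "windy": True, "play": "no"},
    {"id": 6, "windy": False, "play": "no"},
]
gain_id = information_gain(rows, "id", "play")        # six levels
gain_windy = information_gain(rows, "windy", "play")  # two levels
print(gain_id, gain_windy)
```

The `id` attribute scores a full bit of gain even though it cannot help classify any new example, while the two-level `windy` attribute scores well under a tenth of a bit.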
LEARNING WITH COMPLETE DATA
Our development of statistical learning methods begins with the simplest task: parameter learning with complete data. A parameter learning task involves finding the numerical parameters for a probability model whose structure is fixed. For example, we might be interested in learning the conditional probabilities in a Bayesian network with a given structure. Data are complete when each data point contains values for every variable in the probability model being learned. Complete data greatly simplify the problem of learning the parameters of a complex model.
1. Maximum-likelihood parameter learning: Discrete models
2. Naive Bayes models
Probably the most common Bayesian network model used in machine learning is the naive Bayes model. In this model, the “class” variable $C$ (which is to be predicted) is the root and the “attribute” variables $X_i$ are the leaves. The model is “naive” because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with just one attribute.) Assuming Boolean variables, the parameters are
$\theta = P(C = true)$, $\theta_{i1} = P(X_i = true \mid C = true)$, $\theta_{i2} = P(X_i = true \mid C = false)$.
Once the model has been trained in this way, it can be used to classify new examples for which the class variable $C$ is unobserved. With observed attribute values $x_1, \ldots, x_n$, the probability of each class is given by
$$P(C \mid x_1, \ldots, x_n) = \alpha \, P(C) \prod_i P(x_i \mid C).$$
A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from Chapter 18. The method learns fairly well but not as well as decision-tree learning; this is presumably because the true hypothesis, which is a decision tree, is not representable exactly using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide range of applications; the boosted version (Exercise 20.5) is one of the most effective general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with $n$ Boolean attributes, there are just $2n + 1$ parameters, and no search is required to find $h_{ML}$, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning has no difficulty with noisy data and can give probabilistic predictions when appropriate.
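A minimal sketch of maximum-likelihood naive Bayes learning with Boolean variables follows. The parameters are obtained by counting, with no search, and classification multiplies the class prior by the per-attribute conditional probabilities and then normalizes. The function names and the eight training examples are made up for illustration, and the sketch assumes both classes appear in the training data (otherwise a count would be divided by zero).

```python
def train_naive_bayes(examples):
    """Estimate the 2n + 1 parameters by counting.

    examples: list of (attributes, label), with attributes a tuple of bools
    and label a bool. Assumes both classes occur at least once.
    """
    n_attrs = len(examples[0][0])
    positives = [x for x, c in examples if c]
    negatives = [x for x, c in examples if not c]
    theta = len(positives) / len(examples)            # P(C = true)
    theta_true = [sum(x[i] for x in positives) / len(positives)
                  for i in range(n_attrs)]            # P(X_i = true | C = true)
    theta_false = [sum(x[i] for x in negatives) / len(negatives)
                   for i in range(n_attrs)]           # P(X_i = true | C = false)
    return theta, theta_true, theta_false


def posterior_true(model, x):
    """Return P(C = true | x) under the conditional independence assumption."""
    theta, theta_true, theta_false = model
    score_true, score_false = theta, 1.0 - theta
    for i, xi in enumerate(x):
        score_true *= theta_true[i] if xi else 1.0 - theta_true[i]
        score_false *= theta_false[i] if xi else 1.0 - theta_false[i]
    # Dividing by the sum plays the role of the normalizing constant alpha.
    return score_true / (score_true + score_false)


examples = [
    ((True, True), True), ((True, False), True),
    ((False, True), True), ((True, True), True),
    ((False, False), False), ((False, True), False),
    ((True, False), False), ((False, False), False),
]
model = train_naive_bayes(examples)
p_both = posterior_true(model, (True, True))
p_none = posterior_true(model, (False, False))
print(model, p_both, p_none)
```

With two attributes the model has 2 * 2 + 1 = 5 parameters, and a deterministic prediction is obtained by thresholding the posterior at 0.5.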
3. Maximum-likelihood parameter learning: Continuous models
4. Bayesian parameter learning
5. Learning Bayes net structures
LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM
The preceding section dealt with the fully observable case. Many real-world problems have hidden variables (sometimes called latent variables) which are not observable in the data that are available for learning.
1. Unsupervised clustering: Learning mixtures of Gaussians
2. Learning Bayesian networks with hidden variables
3. Learning hidden Markov models
The general form of the EM algorithm
We have seen several instances of the EM algorithm. Each involves computing expected values of hidden variables for each example and then recomputing the parameters, using the expected values as if they were observed values. Let $x$ be all the observed values in all the examples, let $Z$ denote all the hidden variables for all the examples, and let $\theta$ be all the parameters for the probability model. Then the EM algorithm is
$$\theta^{(i+1)} = \operatorname*{argmax}_{\theta} \sum_{z} P(Z = z \mid x, \theta^{(i)}) \, L(x, Z = z \mid \theta).$$
This equation is the EM algorithm in a nutshell. The E-step is the computation of the summation, which is the expectation of the log likelihood of the “completed” data with respect to the distribution $P(Z = z \mid x, \theta^{(i)})$, which is the posterior over the hidden variables, given the data. The M-step is the maximization of this expected log likelihood with respect to the parameters. For mixtures of Gaussians, the hidden variables are the $Z_{ij}$s, where $Z_{ij}$ is 1 if example $j$ was generated by component $i$. For Bayes nets, the hidden variables are the values of the unobserved variables for each example. For HMMs, the hidden variables are the $i \to j$ transitions. Starting from the general form, it is possible to derive an EM algorithm for a specific application once the appropriate hidden variables have been identified.
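For the mixture-of-Gaussians case, the two steps can be sketched in a few lines for one-dimensional data. The E-step computes, for each example, the posterior probability that each component generated it (the expected value of the hidden indicator); the M-step re-estimates the means, variances, and mixing weights as if those expectations were observed counts. The function names, toy data, starting values, and the small variance floor are assumptions made for this illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gaussian_mixture(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture by EM from the given starting parameters."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of component i for each example.
        resp = []
        for x in data:
            joint = [weights[i] * gaussian_pdf(x, means[i], variances[i])
                     for i in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: weighted maximum-likelihood estimates.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var, 1e-6)  # floor guards against collapse
            weights[i] = n_i / len(data)
    return means, variances, weights


data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
means, variances, weights = em_gaussian_mixture(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(means, variances, weights)
```

On this well-separated toy data the responsibilities become almost exactly 0 or 1, so the estimates settle near the per-group sample means and variances; with overlapping components the soft assignments matter and the result depends on the starting values.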
Learning Bayes net structures with hidden variables