AI Notes
Unsupervised learning

In machine learning, unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.

Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data.


Many methods employed in unsupervised learning are based on data mining methods used to preprocess data.

Approaches to unsupervised learning include:

- clustering (e.g., k-means, mixture models, hierarchical clustering; a minimal k-means sketch follows this list),

- blind signal separation using feature extraction techniques for dimensionality reduction (e.g., principal component analysis, independent component analysis, non-negative matrix factorization, singular value decomposition).
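As an illustration of the clustering approach, here is a minimal k-means sketch in Python with NumPy; the toy data and the choice of k = 2 are illustrative assumptions.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment
    and centroid update until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy unlabeled data: two loose blobs.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)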

Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms.

The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties.
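A minimal sketch of the SOM update rule, assuming a 1-D map of units with a Gaussian neighborhood; the map size, learning rate, and data are illustrative assumptions.

import numpy as np

def train_som(X, n_units=10, epochs=20, lr=0.5, sigma=2.0, seed=0):
    """1-D self-organizing map: pull the winning unit and its map
    neighbors toward each input, so nearby map locations end up
    representing similar inputs."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_units, X.shape[1]))  # unit weight vectors
    positions = np.arange(n_units)              # locations on the map
    for _ in range(epochs):
        for x in rng.permutation(X):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            # Gaussian neighborhood: units near the winner move more.
            h = np.exp(-((positions - winner) ** 2) / (2 * sigma ** 2))
            W += lr * h[:, None] * (x - W)
    return W

X = np.random.default_rng(1).uniform(size=(200, 2))
W = train_som(X)  # adjacent rows of W represent similar inputs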

The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same cluster by means of a user-defined constant called the vigilance parameter.


ART networks are also used for many pattern recognition tasks, such as automatic target recognition and seismic signal processing. The first version of ART was "ART1", developed by Carpenter and Grossberg (1988).
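A heavily simplified sketch of the ART1-style vigilance test on binary inputs; this omits most of the full ART1 architecture, and the vigilance values and data are assumptions for illustration.

import numpy as np

def art1_cluster(inputs, vigilance=0.7):
    """Simplified ART1-style clustering of binary vectors.
    An input joins the first category whose prototype matches it
    at the vigilance level; otherwise a new category is created,
    so the number of clusters grows with the data."""
    prototypes, labels = [], []
    for x in inputs:
        assigned = False
        for j, w in enumerate(prototypes):
            overlap = np.logical_and(x, w).sum()
            # Vigilance test: fraction of the input matched by the prototype.
            if overlap / max(x.sum(), 1) >= vigilance:
                prototypes[j] = np.logical_and(x, w)  # fast-learning update
                labels.append(j)
                assigned = True
                break
        if not assigned:
            prototypes.append(x.copy())
            labels.append(len(prototypes) - 1)
    return labels, prototypes

data = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]])
print(art1_cluster(data, vigilance=0.6)[0])  # [0, 0, 1]: lower vigilance merges similar inputs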

Supervised learning


Supervised learning is the machine learning task of inferring a function from supervised (labeled) training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete; see classification) or a regression function (if the output is continuous; see regression). The inferred function should predict the correct output value for any valid input object.

This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias).
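A minimal sketch of both cases, assuming toy data: a nearest-neighbor classifier for discrete outputs and least-squares regression for continuous outputs.

import numpy as np

# Toy labeled training data: input vectors paired with outputs.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_class = np.array([0, 0, 1, 1])        # discrete outputs -> classifier
y_reg = np.array([0.1, 0.9, 2.1, 2.9])  # continuous outputs -> regression

def nearest_neighbor_classify(X, y, query):
    """Predict the label of the closest training input."""
    i = np.argmin(np.linalg.norm(X - query, axis=1))
    return y[i]

# Least-squares fit of y = a*x + b for the regression case.
A = np.hstack([X, np.ones((len(X), 1))])
(a, b), *_ = np.linalg.lstsq(A, y_reg, rcond=None)

print(nearest_neighbor_classify(X, y_class, np.array([2.4])))  # -> 1
print(a * 2.4 + b)  # regression prediction near 2.3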

There are four major issues to consider in supervised learning:

1. Bias-variance tradeoff

2. Function complexity and amount of training data

3. Dimensionality of the input space

4. Noise in the output values

Generalizations of supervised learning

There are several ways in which the standard supervised learning problem can be generalized:

1. Semi-supervised learning: In this setting, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled.

2. Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often the queries are based on unlabeled data, a scenario that combines semi-supervised learning with active learning. (A minimal query-selection sketch follows this list.)

3. Structured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, standard methods must be extended.

4. Learning to rank: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.
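The query-selection sketch referenced above: one common heuristic, uncertainty sampling, queries the unlabeled example the current model is least sure about. The stand-in probability model and data here are assumptions for illustration.

import numpy as np

def least_confident_query(predict_proba, unlabeled):
    """Pick the unlabeled example whose most probable label
    has the lowest probability (uncertainty sampling)."""
    probs = predict_proba(unlabeled)   # shape (n, n_classes)
    confidence = probs.max(axis=1)
    return np.argmin(confidence)

# Stand-in model: a logistic curve over 1-D inputs (assumed).
def predict_proba(X):
    p1 = 1.0 / (1.0 + np.exp(-X[:, 0]))
    return np.column_stack([1 - p1, p1])

unlabeled = np.array([[-3.0], [-0.1], [2.5]])
i = least_confident_query(predict_proba, unlabeled)
print(i)  # -> 1: the point nearest the decision boundary is queried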

Applications

- Bioinformatics, Cheminformatics

- Quantitative structure-activity relationship

- Database marketing, Handwriting recognition, Information retrieval

- Learning to rank

- Object recognition in computer vision, Optical character recognition, Spam detection, Pattern recognition, Speech recognition

How supervised learning algorithms work

Given a set of training examples of the form {(x_1, y_1), ..., (x_N, y_N)}, a learning algorithm seeks a function g: X → Y, where X is the input space and Y is the output space. The function g is an element of some space of possible functions G, usually called the hypothesis space. It is sometimes convenient to represent g using a scoring function f: X × Y → R such that g is defined as returning the y value that gives the highest score: g(x) = argmax_y f(x, y). Let F denote the space of scoring functions.

Although G and F can be any space of functions, many learning algorithms are probabilistic models where g takes the form of a conditional probability model g(x) = P(y | x), or f takes the form of a joint probability model f(x, y) = P(x, y).
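A minimal sketch of the scoring-function view, assuming a linear score per class; the weights and data are illustrative assumptions.

import numpy as np

def g(x, W, b):
    """g(x) = argmax_y f(x, y), with a linear scoring
    function f(x, y) = W[y] . x + b[y]."""
    scores = W @ x + b  # one score per candidate output y
    return int(np.argmax(scores))

# Assumed toy parameters for a 3-class problem over 2-D inputs.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])
print(g(np.array([2.0, -1.0]), W, b))  # returns the class with the highest score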



Reinforcement learning


Reinforcement learning is an area of machine learning in computer science, concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward.

The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, statistics, and genetic algorithms.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are closely related to dynamic programming techniques.

The main difference from these classical techniques is that reinforcement learning algorithms do not need knowledge of the MDP, and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

Introduction

The basic reinforcement learning model consists of the following (a minimal sketch instantiating these components follows the list):

1. a set of environment states S;

2. a set of actions A;

3. rules of transitioning between states;

4. rules that determine the scalar immediate reward of a transition; and

5. rules that describe what the agent observes.
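A minimal Q-learning sketch on an assumed toy chain environment; the states, rewards, and hyperparameters are illustrative. The epsilon-greedy choice shows the exploration/exploitation balance mentioned above.

import random

random.seed(0)

# Toy 5-state chain: action 0 moves left, action 1 moves right;
# reaching the rightmost state pays reward 1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2  # assumed hyperparameters

for _ in range(500):
    s = random.randrange(N_STATES - 1)  # exploring starts
    for _ in range(20):
        # Epsilon-greedy: explore with probability eps, else exploit.
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r = step(s, a)
        # Q-learning update: move toward r + gamma * max_a' Q(s', a').
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if r == 1.0:
            break

print(max(ACTIONS, key=lambda act: Q[(0, act)]))  # learned policy: go right (1)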


Reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon, and checkers.

Two components make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in any of the following situations:



- A model of the environment is known, but an analytic solution is not available;

- Only a simulation model of the environment is given (the subject of simulation-based optimization);

- The only way to collect information about the environment is by interacting with it.

The first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to machine learning problems.



Decision tree

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.


Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal. If in practice decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as a best-choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.

A decision tree consists of 3 types of nodes (an evaluation sketch follows the list):

1. Decision nodes - commonly represented by squares

2. Chance nodes - represented by circles

3. End nodes - represented by triangles
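A minimal sketch of evaluating such a tree by expected value, rolling chance nodes back into decision nodes; the payoffs and probabilities are assumptions for illustration.

# Each node is a dict. Decision nodes (squares) pick the best option;
# chance nodes (circles) average outcomes by probability;
# end nodes (triangles) hold a payoff.

def evaluate(node):
    if node["type"] == "end":
        return node["value"]
    if node["type"] == "chance":
        return sum(p * evaluate(child) for p, child in node["branches"])
    # Decision node: choose the branch with the highest expected value.
    return max(evaluate(child) for child in node["options"])

tree = {
    "type": "decision",
    "options": [
        {"type": "chance", "branches": [
            (0.6, {"type": "end", "value": 100}),
            (0.4, {"type": "end", "value": -50}),
        ]},
        {"type": "end", "value": 20},  # safe option
    ],
}
print(evaluate(tree))  # 0.6*100 + 0.4*(-50) = 40 beats 20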

Advantages

Decision trees:

- Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.

- Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.

- Use a white box model. If a given result is provided by a model, the explanation for the result is easily replicated by simple math.

- Can be combined with other decision techniques. For example, a tree can combine Net Present Value calculations, PERT 3-point estimations, and a linear distribution of expected outcomes (a minimal PERT sketch follows this list).
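A minimal sketch of the PERT 3-point estimate and its use in an expected-value comparison; the optimistic/most-likely/pessimistic figures and discount rate are assumptions for illustration.

def pert_estimate(optimistic, most_likely, pessimistic):
    """Classic PERT 3-point estimate: weighted mean (O + 4M + P) / 6."""
    return (optimistic + 4 * most_likely + pessimistic) / 6

def npv(cashflows, rate):
    """Net Present Value of cashflows indexed by year (year 0 first)."""
    return sum(c / (1 + rate) ** t for t, c in enumerate(cashflows))

# Assumed payoff estimates (in $k) for one decision branch.
payoff = pert_estimate(optimistic=120, most_likely=80, pessimistic=30)
print(payoff)                    # about 78.3k expected payoff
print(npv([-50, payoff], 0.10))  # invest 50 now, receive the payoff in a year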

Disadvantages

For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels.
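A minimal sketch of information gain that makes this bias visible: an assumed many-level attribute (e.g., an ID-like column) gets maximal gain even though it does not generalize.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus entropy after splitting on the attribute."""
    n = len(labels)
    split = Counter(attribute_values)
    remainder = sum(
        (count / n) * entropy([l for v, l in zip(attribute_values, labels) if v == a])
        for a, count in split.items())
    return entropy(labels) - remainder

labels = ["yes", "no", "yes", "no"]
two_level = ["a", "a", "b", "b"]   # uninformative 2-level attribute
many_level = ["1", "2", "3", "4"]  # one level per example (ID-like)

print(information_gain(two_level, labels))   # 0.0
print(information_gain(many_level, labels))  # 1.0: maximal, but spurious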

LEARNING WITH COMPLETE DATA

Our development of statistical learning methods begins with the simplest task: parameter learning with complete data. A parameter learning task involves finding the numerical parameters for a probability model whose structure is fixed. For example, we might be interested in learning the conditional probabilities in a Bayesian network with a given structure. Data are complete when each data point contains values for every variable in the probability model being learned. Complete data greatly simplify the problem of learning the parameters of a complex model.
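With complete data, maximum-likelihood estimates of a discrete model's parameters reduce to counting; a minimal sketch, assuming a toy two-variable data set:

from collections import Counter

# Complete data: every record assigns a value to both variables.
data = [("cloudy", "rain"), ("cloudy", "rain"), ("cloudy", "dry"),
        ("clear", "dry"), ("clear", "dry"), ("clear", "rain")]

# ML estimate of P(weather | sky) = count(sky, weather) / count(sky).
joint = Counter(data)
marginal = Counter(sky for sky, _ in data)
cpt = {(sky, w): joint[(sky, w)] / marginal[sky] for sky, w in joint}

print(cpt[("cloudy", "rain")])  # 2/3: the relative frequency is the ML estimate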

1. Maximum-likelihood parameter learning: Discrete models

2. Naive Bayes models

Probably the most common Bayesian network model used in machine learning is the naive Bayes model. In this model, the "class" variable C (which is to be predicted) is the root and the "attribute" variables X_i are the leaves. The model is "naive" because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with just one attribute.) Assuming Boolean variables, the parameters are

θ = P(C = true), θ_i1 = P(X_i = true | C = true), θ_i2 = P(X_i = true | C = false).

Once the model has been trained in this way, it can be used to classify new examples for which the class variable C is unobserved. With observed attribute values x_1, ..., x_n, the probability of each class is given by

P(C | x_1, ..., x_n) = α P(C) ∏_i P(x_i | C).

A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from Chapter 18. The method learns fairly well but not as well as decision-tree learning; this is presumably because the true hypothesis (which is a decision tree) is not representable exactly using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide range of applications; the boosted version (Exercise 20.5) is one of the most effective general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with n Boolean attributes, there are just 2n + 1 parameters, and no search is required to find h_ML, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning has no difficulty with noisy data and can give probabilistic predictions when appropriate.
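A minimal sketch of naive Bayes training (just counting, per the parameters above) and classification; the Boolean data set is an assumption for illustration, and Laplace smoothing is added to avoid zero probabilities.

# Each example: (class value, tuple of Boolean attribute values).
data = [(True, (True, False)), (True, (True, True)),
        (False, (False, True)), (False, (False, False)),
        (True, (True, True)), (False, (True, False))]

n = len(data[0][1])
classes = [True, False]

# ML parameters by counting (with Laplace smoothing).
prior = {c: sum(1 for cl, _ in data if cl == c) / len(data) for c in classes}
theta = {(i, c): (sum(1 for cl, x in data if cl == c and x[i]) + 1)
                 / (sum(1 for cl, _ in data if cl == c) + 2)
         for i in range(n) for c in classes}

def posterior(x):
    """P(C | x) up to normalization: P(C) * prod_i P(x_i | C)."""
    scores = {}
    for c in classes:
        p = prior[c]
        for i, xi in enumerate(x):
            p *= theta[(i, c)] if xi else 1 - theta[(i, c)]
        scores[c] = p
    z = sum(scores.values())
    return {c: p / z for c, p in scores.items()}

print(posterior((True, True)))  # the most likely class gives the prediction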


3. Maximum-likelihood parameter learning: Continuous models

4. Bayesian parameter learning

5. Learning Bayes net structures

LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM

The preceding section dealt with the fully observable case. Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning.

1. Unsupervised clustering: Learning mixtures of Gaussians

2. Learning Bayesian networks with hidden variables

3. Learning hidden Markov models

The general form of the EM algorithm

We have seen several instances of the EM algorithm. Each involves computing expected values of hidden variables for each example and then recomputing the parameters, using the expected values as if they were observed values. Let x be all the observed values in all the examples, let Z denote all the hidden variables for all the examples, and let θ be all the parameters for the probability model. Then the EM algorithm is

θ^(i+1) = argmax_θ Σ_z P(Z = z | x, θ^(i)) L(x, Z = z | θ)

This equation is the EM algorithm in a nutshell. The E-step is the computation of the summation, which is the expectation of the log likelihood of the "completed" data with respect to the distribution P(Z = z | x, θ^(i)), which is the posterior over the hidden variables, given the data. The M-step is the maximization of this expected log likelihood with respect to the parameters. For mixtures of Gaussians, the hidden variables are the Z_ij's, where Z_ij is 1 if example j was generated by component i. For Bayes nets, the hidden variables are the values of the unobserved variables for each example. For HMMs, the hidden variables are the i→j transitions. Starting from the general form, it is possible to derive an EM algorithm for a specific application once the appropriate hidden variables have been identified.
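A minimal EM sketch for a 1-D mixture of two Gaussians, following the general form above (E-step: posterior over which component generated each point; M-step: re-estimate parameters from those soft assignments). The data and initialization are assumptions for illustration.

import math, random

random.seed(0)
# Toy 1-D data drawn near two centers (assumed for illustration).
data = [random.gauss(0, 1) for _ in range(100)] + \
       [random.gauss(5, 1) for _ in range(100)]

def pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Parameters theta: mixing weight, means, and standard deviations.
w, mu, sigma = 0.5, [-1.0, 6.0], [1.0, 1.0]

for _ in range(30):
    # E-step: P(Z_j = i | x_j, theta) for each point and component.
    r = [[], []]
    for x in data:
        p0 = w * pdf(x, mu[0], sigma[0])
        p1 = (1 - w) * pdf(x, mu[1], sigma[1])
        r[0].append(p0 / (p0 + p1))
        r[1].append(p1 / (p0 + p1))
    # M-step: maximize expected log likelihood (weighted ML estimates).
    for i in (0, 1):
        n_i = sum(r[i])
        mu[i] = sum(ri * x for ri, x in zip(r[i], data)) / n_i
        sigma[i] = max(1e-6, math.sqrt(
            sum(ri * (x - mu[i]) ** 2 for ri, x in zip(r[i], data)) / n_i))
    w = sum(r[0]) / len(data)

print(round(mu[0], 2), round(mu[1], 2))  # means recovered near 0 and 5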


Learning Bayes net structures with hidden variables