Machine Learning Tutorial

Oct 16, 2013

Amit Gruber

The Hebrew University of Jerusalem

Example: Spam Filter


Spam message: unwanted email message


Dozens or even hundreds per day



Goal: Automatically distinguish between spam and non-spam email messages

Spam message 1

Spam message 2


Spam message 3


Spam message 4


How to Distinguish?


Message contents?


Automatic semantic analysis is not yet a solved problem


Message sender?


What about unfamiliar or fake senders?


Collection of keywords?


Message length?


Mail server? Time of delivery?

How to Distinguish?


It’s hard to define an explicit set of rules to distinguish between spam and non-spam



Learn the concept of “spam” from examples!
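As a minimal sketch of learning "spam" from examples, the snippet below trains a tiny naive Bayes word-count classifier. The four training messages, the function names, and the choice of naive Bayes with add-one smoothing are assumptions made for illustration; the slides do not prescribe a specific method.

```python
import math
from collections import Counter

# Hypothetical labeled examples; a real filter would learn from thousands of messages.
train = [
    ("win money now", "spam"),
    ("cheap pills win prize", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("lunch on monday", "non-spam"),
]

def fit(examples):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "non-spam": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def predict(text, word_counts, class_counts):
    """Return the class with the highest naive Bayes log-score (add-one smoothing)."""
    vocab = set(word_counts["spam"]) | set(word_counts["non-spam"])
    total_msgs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_msgs)
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, class_counts = fit(train)
print(predict("win a prize now", word_counts, class_counts))
print(predict("agenda for lunch", word_counts, class_counts))
```

No rule such as "messages containing 'win' are spam" is written by hand; the word statistics of the examples play that role.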

Example: Gender Classification

The Power of Learning: Real-Life Example


How much time does it take you to get to work?


First approach: Analyze your route


Distance, traffic lights, traffic, etc.


Can be quite complicated…


Second approach: How much time does it usually take?


Despite some variance, this works remarkably well!


Requires “training” for different times


May fail in special cases
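The second approach can be sketched in a few lines: predict the travel time as the average of past trips that left at the same hour. The trip history and the function name `predict_minutes` are hypothetical.

```python
from statistics import mean

# Hypothetical past commutes: (departure hour, minutes taken).
history = [(8, 42), (8, 47), (8, 45), (10, 28), (10, 31), (8, 44), (10, 30)]

def predict_minutes(history, hour):
    """Predict travel time as the average of past trips that left at the same hour."""
    past = [minutes for h, minutes in history if h == hour]
    if not past:
        return None  # no "training" data for this departure time
    return mean(past)

print(predict_minutes(history, 8))   # average of the 8:00 trips
print(predict_minutes(history, 10))  # average of the 10:00 trips
print(predict_minutes(history, 15))  # None: this departure time was never observed
```

The `None` case mirrors the need for "training" at different times, and an unusual day (an accident, a holiday) is a special case the average cannot anticipate.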

Machine Translation


Collaborative Filtering


Collaborative Filtering: Prediction of user ratings based on the ratings of other users



Examples:


Movie ratings


Product recommendation
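One simple way to predict a rating from other users' ratings is a similarity-weighted average. The sketch below assumes cosine similarity over co-rated movies; the users, movies, and ratings are made up, and production systems use considerably more refined models.

```python
import math

# Hypothetical ratings (1-5 stars): user -> {movie: rating}.
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "bob":  {"A": 4, "B": 5, "C": 2, "D": 5},
    "cara": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def similarity(u, v):
    """Cosine similarity over the movies both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    weighted, total = 0.0, 0.0
    for other in ratings:
        if other != user and movie in ratings[other]:
            s = similarity(user, other)
            weighted += s * ratings[other][movie]
            total += s
    return weighted / total if total else None

print(similarity("ann", "bob"), similarity("ann", "cara"))
print(predict("ann", "D"))
```

Ann's tastes resemble Bob's more than Cara's, so the prediction for movie D is pulled toward Bob's rating.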



Is this of merely theoretical interest?

Netflix Prize

Over 100 million ratings from 480 thousand customers over 17,000 movie titles (sparsity: only about 0.0123 of all customer-movie pairs have a rating)
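The sparsity figure follows directly from the counts on the slide. The arithmetic below uses the rounded numbers as given, so it is approximate.

```python
ratings = 100_000_000   # "over 100 million ratings"
customers = 480_000     # "480 thousand customers"
movies = 17_000         # "17000 movie titles"

possible = customers * movies   # every customer-movie pair
density = ratings / possible    # fraction of pairs that actually have a rating
print(f"{density:.4f}")         # 0.0123
```

In other words, roughly 99% of the rating matrix is empty, and those empty entries are what the recommender has to predict.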

Recommendation system

Machine Learning Applications


Search Engines


Collaborative Filtering (Netflix, Amazon)


Face, speech, and pattern recognition


Machine Translation


Natural language processing


Medical diagnosis and treatment


Bioinformatics


Computer games


Many more!

Generalization: Train vs. Test


The central assumption we make is that the train set and the new examples are “similar”


Formally, the assumption is that samples are drawn from the same distribution



Is this assumption realistic?


Train vs. Test: Might Fail to Generalize
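A small sketch of how generalization can fail: a threshold learned on the train set works on test data drawn from a similar range, but breaks when the test values are shifted. All data points and the midpoint-of-means rule are assumptions chosen to keep the example tiny.

```python
from statistics import mean

def fit_threshold(samples):
    """Place the decision threshold halfway between the two class means."""
    neg = [x for x, y in samples if y == 0]
    pos = [x for x, y in samples if y == 1]
    return (mean(neg) + mean(pos)) / 2

def accuracy(samples, threshold):
    """Fraction of samples where 'x > threshold' matches the label."""
    return sum((x > threshold) == bool(y) for x, y in samples) / len(samples)

# Hypothetical 1-D data: (feature value, label).
train = [(1.0, 0), (1.5, 0), (2.0, 0), (4.0, 1), (4.5, 1), (5.0, 1)]
test_similar = [(1.2, 0), (1.8, 0), (4.2, 1), (4.8, 1)]
# Same concept, but every feature value is shifted up by 3 units.
test_shifted = [(x + 3.0, y) for x, y in test_similar]

t = fit_threshold(train)
print(accuracy(test_similar, t))  # 1.0
print(accuracy(test_shifted, t))  # 0.5
```

The learned rule is unchanged between the two evaluations; only the distribution of the test data differs.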

Acquiring a good train set


Have a huge train set


Train data might be available on the web


Use humans to collect data


Collect results (or aggregations thereof) of user actions


Unsupervised methods: require only raw data, no need for labels!


Machine Learning Strategies


Discriminative Approach


Feature selection: find the features that carry the most information for separation



Generative Approach


Model the data using a generative process


Estimate the parameters of the model

Supervised vs. Unsupervised


Supervised Machine Learning


Classification (learning)


Collecting a large, representative train set might not be simple


Unsupervised Machine Learning


Clustering


The number of clusters may be known or unknown


Usually plenty of train data is available
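As one illustration of clustering, here is a plain k-means sketch on 1-D data. It assumes the number of clusters is known (two) and uses made-up points and starting centers; k-means is only one of many clustering methods.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data; 'centers' holds the initial guesses (k is assumed known)."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]   # hypothetical unlabeled data
centers, clusters = kmeans_1d(points, centers=[0.0, 2.0])
print(centers)
print(clusters)
```

No labels are used anywhere: the two groups emerge from the raw values alone.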


Discriminative Learning


Data representation and feature selection: What is relevant for classification?


Gender classification: hair, ears, make-up, beard, moustache, etc.



Linear Separation


SVM, Fisher LDA, Perceptron and more


Different criteria for separation: what would generalize well?


Non-linear separation
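Of the linear methods named above, the perceptron is the shortest to write down. The sketch below is a textbook version of the update rule; the four 2-D points, the learning rate, and the epoch count are assumptions for illustration.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron rule: on a mistake, nudge the weights toward the sample."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in samples:          # y is +1 or -1
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def classify(w, b, x):
    """Return +1 or -1 depending on which side of the line w.x + b = 0 the point lies."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Hypothetical, linearly separable 2-D points.
samples = [((2.0, 3.0), 1), ((3.0, 3.0), 1), ((1.0, 0.0), -1), ((0.0, 1.0), -1)]
w, b = train_perceptron(samples)
print(w, b)
print([classify(w, b, x) for x, _ in samples])
```

The perceptron stops at the first separating line it finds. SVM and Fisher LDA pick the line by different criteria, which is where the question of what generalizes well comes in.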

Linear Separation

Nonlinear Separation (Kernel Trick)
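A sketch of the kernel trick on XOR-style data, which no straight line can separate. A perceptron is written so that it touches the data only through a kernel function, here a degree-2 polynomial kernel; the data, the kernel choice, and the function names are assumptions for illustration.

```python
def poly_kernel(a, b):
    """Degree-2 polynomial kernel: an inner product in a higher-dimensional feature space."""
    return (1.0 + a[0] * b[0] + a[1] * b[1]) ** 2

def score(alpha, samples, kernel, x):
    """Kernel-weighted vote of the training samples on point x."""
    return sum(a * yj * kernel(xj, x) for a, (xj, yj) in zip(alpha, samples))

def train_kernel_perceptron(samples, kernel, epochs=20):
    """Kernel perceptron: keep one mistake counter per training sample instead of weights."""
    alpha = [0] * len(samples)
    for _ in range(epochs):
        for i, (x, y) in enumerate(samples):
            if y * score(alpha, samples, kernel, x) <= 0:
                alpha[i] += 1
    return alpha

def classify(alpha, samples, kernel, x):
    """Return +1 or -1 from the sign of the kernel score."""
    return 1 if score(alpha, samples, kernel, x) > 0 else -1

# XOR-style labels: no straight line separates the two classes.
samples = [((1, 1), -1), ((-1, -1), -1), ((1, -1), 1), ((-1, 1), 1)]
alpha = train_kernel_perceptron(samples, poly_kernel)
print([classify(alpha, samples, poly_kernel, x) for x, _ in samples])
```

The feature space is never constructed explicitly; the separation is linear there, but only kernel values between pairs of points are ever computed.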

Generative Approach


Model the observations using a generative process


The generative process induces a distribution over the observations


Learn a set of parameters
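A minimal generative sketch: assume each class produces its observations from a 1-D Gaussian, estimate the Gaussian's parameters from data, and classify a new point by which fitted distribution makes it more likely. The Gaussian model, the equal class priors, and the message-length numbers are all assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(values):
    """Estimate the parameters (mean, standard deviation) of a 1-D Gaussian."""
    return mean(values), pstdev(values)

def log_density(x, mu, sigma):
    """Log of the Gaussian density at x."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical feature (message length in characters) observed for each class.
data = {"spam": [120.0, 150.0, 130.0, 140.0], "non-spam": [400.0, 380.0, 450.0, 410.0]}
params = {label: fit_gaussian(values) for label, values in data.items()}

def classify(x):
    """Pick the class whose fitted Gaussian gives x the highest density (equal priors assumed)."""
    return max(params, key=lambda label: log_density(x, *params[label]))

print(params)
print(classify(160.0))
print(classify(390.0))
```

Unlike the discriminative methods, nothing here searches for a boundary; the boundary is implied by where the two fitted densities cross.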


Statistical Approach


Real-Life Example


You’re stuck in traffic. Which lane is faster?


The complicated approach:


Consider the traffic, trucks, merging lanes, etc.


The statistical (Bayesian) approach:


Which lane is usually faster? (prior)


What are you seeing? (evidence)
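Combining the prior with what you see is Bayes' rule: the posterior is proportional to prior times likelihood. The numbers below are invented purely to show the mechanics.

```python
# Hypothetical numbers, for illustration only.
prior = {"left": 0.7, "right": 0.3}   # which lane is usually faster
# Probability of what you are seeing (say, a truck ahead in the left lane)
# given that each lane is in fact the faster one.
likelihood = {"left": 0.2, "right": 0.6}

unnormalized = {lane: prior[lane] * likelihood[lane] for lane in prior}
total = sum(unnormalized.values())
posterior = {lane: value / total for lane, value in unnormalized.items()}
print(posterior)
```

With these numbers the left lane is usually faster, yet the observation tips the balance toward the right lane.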

Summary


Machine Learning: Learn a concept from examples


For good generalization, train data has to faithfully represent test data


Many potential applications


Already in use, and it works remarkably well