# Machine Learning and Numerical Analysis

Oct 14, 2013

Francis Bach
Willow project, INRIA - Ecole Normale Supérieure
November 2010

## Outline
• Machine learning
– Supervised vs. unsupervised
• Convex optimization for supervised learning
– Sequence of linear systems
• Spectral methods for unsupervised learning
– Sequence of singular value decompositions
• Combinatorial optimization
– Polynomial-time algorithms and convex relaxations

## Statistical machine learning
Computer science and applied mathematics
• Modeling, prediction and control from training examples
• Theory
– Analysis of statistical performance
• Algorithms
– Numerical efficiency and stability
• Applications
– Computer vision, bioinformatics, neuro-imaging, text, audio

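## Statistical machine learning - Supervised learning
• Data $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$
• Goal: predict $y \in \mathcal{Y}$ from $x \in \mathcal{X}$, i.e., find $f : \mathcal{X} \to \mathcal{Y}$
• Empirical risk minimization

$$\underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))}_{\text{Data-fitting}} + \underbrace{\frac{\lambda}{2} \|f\|^2}_{\text{Regularization}}$$

• Scientific objectives:
– Studying generalization error
– Improving calibration
– Choosing appropriate representations - selection of appropriate loss

Two main types of norms: $\ell_2$ vs. $\ell_1$
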
## Usual losses
• Regression: $y \in \mathbb{R}$, prediction $\hat{y} = f(x)$, loss $\frac{1}{2} (y - f(x))^2$
• Classification: $y \in \{-1, 1\}$, prediction $\hat{y} = \operatorname{sign}(f(x))$
– loss of the form $\ell(y, f(x)) = \ell(y f(x))$
– “True” cost: $\ell(y f(x)) = 1_{y f(x) < 0}$
– Usual convex costs:

[Figure: 0-1, hinge, square and logistic losses]

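
The four costs named in the figure can be written as functions of the margin $u = y f(x)$. The following is a minimal NumPy sketch; the function names are illustrative, and the scalings (unit hinge margin, unnormalized square and logistic losses) are assumptions, since the slide does not say which versions were plotted.

```python
import numpy as np

def zero_one_loss(margin):
    # "True" cost: 1 when y * f(x) < 0, else 0
    return (np.asarray(margin) < 0).astype(float)

def hinge_loss(margin):
    # max(0, 1 - y f(x)); the unit margin is an assumed convention
    return np.maximum(0.0, 1.0 - np.asarray(margin))

def square_loss(margin):
    # For y in {-1, 1}: (y - f(x))^2 = (1 - y f(x))^2 (no 1/2 factor, assumed)
    return (1.0 - np.asarray(margin)) ** 2

def logistic_loss(margin):
    # log(1 + exp(-y f(x))), computed stably with logaddexp
    return np.logaddexp(0.0, -np.asarray(margin))

margins = np.linspace(-3.0, 4.0, 8)
table = np.column_stack([
    zero_one_loss(margins),
    hinge_loss(margins),
    square_loss(margins),
    logistic_loss(margins),
])

print("margin   0-1   hinge  square  logistic")
for u, row in zip(margins, table):
    print(f"{u:6.1f}  " + "  ".join(f"{v:6.3f}" for v in row))
```

With these scalings the hinge and square losses upper-bound the 0-1 cost everywhere, which is one reason they are used as convex surrogates.
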
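## Supervised learning - Parsimony and $\ell_1$-norm
• Data $(x_i, y_i) \in \mathbb{R}^p \times \mathcal{Y}$, $i = 1, \dots, n$

$$\min_{w \in \mathbb{R}^p} \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, w^\top x_i)}_{\text{Data-fitting}} + \underbrace{\lambda \sum_{j=1}^p |w_j|}_{\text{Regularization}}$$

• At the optimum, $w$ is in general sparse

[Figure: two panels with axes $w_1$ and $w_2$]
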
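## Sparsity in machine learning
• Assumption: $y = w^\top x + \varepsilon$, with $w \in \mathbb{R}^p$ sparse
– Proxy for interpretability
– Allow high-dimensional inference: $\log p = O(n)$
• Sparsity and convexity ($\ell_1$-norm regularization):

$$\min_{w \in \mathbb{R}^p} L(w) + \|w\|_1$$

[Figure: two panels with axes $w_1$ and $w_2$]
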
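
A one-dimensional worked example may help show why the $\ell_1$-norm produces exact zeros while the squared $\ell_2$-norm only shrinks. The quadratic data-fitting term $\frac{1}{2}(w - z)^2$ and the scalar $z$ are assumptions chosen for illustration; they do not appear on the slides.

$$\min_{w \in \mathbb{R}} \ \frac{1}{2} (w - z)^2 + \frac{\lambda}{2} w^2 \quad \Rightarrow \quad w^\star = \frac{z}{1 + \lambda}$$

$$\min_{w \in \mathbb{R}} \ \frac{1}{2} (w - z)^2 + \lambda |w| \quad \Rightarrow \quad w^\star = \operatorname{sign}(z) \max(|z| - \lambda, 0)$$

In the first case $w^\star$ is zero only when $z = 0$. In the second case $w^\star = 0$ for every $|z| \leq \lambda$, so small coefficients are set exactly to zero.
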
## Statistical machine learning - Unsupervised learning
• Data $x_i \in \mathcal{X}$, $i = 1, \dots, n$. Goal: “Find” structure within data
– Discrete: clustering
– Low-dimension: principal component analysis
• Matrix factorization: $X = DA$
– Structure on $D$ and/or $A$
– Algorithmic and theoretical issues
• Applications
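
One unstructured instance of the factorization $X = DA$ is a truncated singular value decomposition, which underlies principal component analysis. The sketch below is only an illustration: the synthetic data, the shapes of `D` and `A`, and the rank `k` are assumptions, and no structure (such as sparsity) is imposed on either factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 20, 3

# Synthetic data: a rank-k signal plus small noise (illustrative only)
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))
X += 0.01 * rng.standard_normal((n, p))

# Thin SVD: X = U diag(s) Vt, singular values in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k leading components: X is approximated by D @ A
D = U[:, :k] * s[:k]   # n x k factor
A = Vt[:k, :]          # k x p factor

rel_err = np.linalg.norm(X - D @ A) / np.linalg.norm(X)
print(f"relative Frobenius error of the rank-{k} factorization: {rel_err:.4f}")
```

The Frobenius error of this approximation equals the root sum of squares of the discarded singular values, and no other rank-`k` factorization achieves a smaller one.
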
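## Learning on matrices - Collaborative filtering
• Given $n_{\mathcal{X}}$ “movies” $x \in \mathcal{X}$ and $n_{\mathcal{Y}}$ “customers” $y \in \mathcal{Y}$,
• predict the “rating” $z(x, y) \in \mathcal{Z}$ of customer $y$ for movie $x$
• Training data: large $n_{\mathcal{X}} \times n_{\mathcal{Y}}$ incomplete matrix $Z$ that describes the known ratings of some customers for some movies
• Goal: complete the matrix.

[Figure: incomplete rating matrix with known entries 1, 2 and 3]
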
## Learning on matrices - Image denoising
• Simultaneously denoise all patches of a given image
• Example from Mairal et al. (2009)

## Learning on matrices - Source separation
• Single microphone (Févotte et al., 2009)

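## Supervised learning - Convex optimization
• Data $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$
• Goal: predict $y \in \mathcal{Y}$ from $x \in \mathcal{X}$, i.e., find $f : \mathcal{X} \to \mathcal{Y}$
• Empirical risk minimization

$$\underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))}_{\text{Data-fitting}} + \underbrace{\frac{\lambda}{2} \|f\|^2}_{\text{Regularization}}$$

• Typical problems
– $f$ in vector space (e.g., $\mathbb{R}^p$)
– $\ell$ convex with respect to second variable, potentially nonsmooth
– Norm may be nondifferentiable
– $p$ and/or $n$ large
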
## Convex optimization - Kernel methods
• Simplest case: least-squares

$$\min_{w \in \mathbb{R}^p} \frac{1}{2n} \|y - Xw\|_2^2 + \frac{\lambda}{2} \|w\|_2^2$$

– Solution: $w = (X^\top X + n\lambda I)^{-1} X^\top y$ in $O(p^3)$
• Kernel methods
– May be re-written as $w = X^\top (X X^\top + n\lambda I)^{-1} y$ in $O(n^3)$
– Replace $x_i^\top x_j$ by any positive definite kernel function $k(x_i, x_j)$, e.g., $k(x, x') = \exp(-\alpha \|x - x'\|_2^2)$
• General losses: Interior point vs. first order methods
• Manipulation of large structured matrices

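
The two closed forms for regularized least squares can be checked numerically. The sketch below assumes the objective $\frac{1}{2n} \|y - Xw\|_2^2 + \frac{\lambda}{2} \|w\|_2^2$, which is the scaling under which $(X^\top X + n\lambda I)^{-1} X^\top y$ is the exact minimizer. The data are random, and `gaussian_kernel` and its parameter `a` are illustrative names rather than anything taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 0.1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Primal form: one p x p linear system, O(p^3)
w_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# Dual (kernel) form: one n x n linear system on the Gram matrix, O(n^3)
K = X @ X.T
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
w_dual = X.T @ alpha

print("max |w_primal - w_dual| =", np.max(np.abs(w_primal - w_dual)))

def gaussian_kernel(A, B, a=1.0):
    """k(x, x') = exp(-a * ||x - x'||^2) for all rows of A against all rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-a * np.maximum(sq, 0.0))  # clip tiny negatives from rounding

# Same dual system, with inner products replaced by the kernel
K_rbf = gaussian_kernel(X, X)
alpha_rbf = np.linalg.solve(K_rbf + n * lam * np.eye(n), y)

# Predictions at new points only need kernel evaluations against the training set
X_new = rng.standard_normal((4, p))
y_pred = gaussian_kernel(X_new, X) @ alpha_rbf
```

The dual form never uses the feature vectors except through inner products, which is what makes the kernel substitution possible.
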
## Convex optimization - Low precision
• Empirical risk minimization

$$\underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))}_{\text{Data-fitting}} + \underbrace{\frac{\lambda}{2} \|f\|^2}_{\text{Regularization}}$$

• No need to optimize below precision $n^{-1/2}$
– Goal is to minimize test error

(Bottou and Bousquet, 2008)

## Convex optimization - Sequence of problems
• Empirical risk minimization

$$\underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))}_{\text{Data-fitting}} + \underbrace{\frac{\lambda}{2} \|f\|^2}_{\text{Regularization}}$$

• In practice: Needs to be solved for many values of $\lambda$
• Piecewise-linear paths
– In favorable situations
• Warm restarts

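## Convex optimization - First order methods
• Empirical risk minimization

$$\underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))}_{\text{Data-fitting}} + \underbrace{\lambda \Omega(f)}_{\text{Regularization}}$$

• […] losses
– Need to solve efficiently problems of the form

$$\min_f \|f - f_0\|^2 + \lambda \Omega(f)$$

• $\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))$ proxy for $\mathbb{E}\, \ell(y, f(x))$
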
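
Problems of the form $\min_f \|f - f_0\|^2 + \lambda \Omega(f)$ are the building block of proximal gradient methods. As a hedged illustration, the sketch below takes $\Omega$ to be the $\ell_1$-norm and the loss to be least squares, which are assumptions (the slide keeps both generic). The subproblem then has the closed form known as soft-thresholding, written here with a $\frac{1}{2}$ factor on the quadratic term, so it matches the slide's form up to a rescaling of $\lambda$. The names `soft_threshold` and `ista` are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form solution of argmin_f 0.5 * ||f - v||^2 + t * ||f||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, y, w, lam):
    return 0.5 * np.mean((y - X @ w) ** 2) + lam * np.sum(np.abs(w))

def ista(X, y, lam, n_iter=500):
    """Minimize (1/2n) ||y - Xw||^2 + lam * ||w||_1 by proximal gradient steps."""
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / n bounds the curvature of the smooth part
    step = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                      # gradient of the loss
        w = soft_threshold(w - step * grad, step * lam)   # proximal step
    return w

# Synthetic sparse regression problem (illustrative only)
rng = np.random.default_rng(0)
n, p = 100, 20
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, p))
y = X @ w_true + 0.1 * rng.standard_normal(n)

w_hat = ista(X, y, lam=0.1)
print("nonzero coefficients:", np.count_nonzero(w_hat), "out of", p)
print("objective:", objective(X, y, w_hat, 0.1))
```

Each iteration costs two matrix-vector products plus an elementwise operation, so the per-iteration cost is linear in the size of `X`.
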
## Unsupervised learning - Spectral methods
• Spectral clustering: given similarity matrix $W \in \mathbb{R}_+^{n \times n}$
– Compute Laplacian matrix $L = \operatorname{Diag}(W 1) - W = D - W$
– Compute generalized eigenvector of $(L, D)$
– May be seen as relaxation of normalized cuts
• Applications
– Computer vision
– Speech separation

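
The Laplacian and generalized eigenvector steps can be sketched on a toy similarity matrix. The code below assumes a symmetric $W$ with strictly positive degrees, uses the second-smallest generalized eigenvector of $(L, D)$, and splits the points by its sign. The sign threshold and the function name `spectral_bipartition` are assumptions made for illustration, not a procedure stated on the slide.

```python
import numpy as np

def spectral_bipartition(W):
    """Two-way split from the second generalized eigenvector of (L, D)."""
    d = W.sum(axis=1)                  # degrees, D = Diag(W 1)
    L = np.diag(d) - W                 # Laplacian L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(d)      # assumes every degree is positive
    # L v = lambda D v  <=>  (D^{-1/2} L D^{-1/2}) u = lambda u, with v = D^{-1/2} u
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    v = d_inv_sqrt * U[:, 1]            # second-smallest generalized eigenvector
    return (v > 0).astype(int), eigvals

# Toy similarity matrix: two groups of 4 points, strong within, weak across
W = np.full((8, 8), 0.01)
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)

labels, eigvals = spectral_bipartition(W)
print("labels:", labels)
print("two smallest eigenvalues:", eigvals[:2])
```

The smallest eigenvalue is zero, with a constant generalized eigenvector, which is why the second one carries the partition information.
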
## Application to computer vision
Co-segmentation (Joulin et al., 2010)

## Blind one-microphone speech separation (Bach and Jordan, 2005)
• […] $s_1, \dots, s_m$ - one microphone $x$
• Ideal acoustics $x = s_1 + s_2 + \cdots + s_m$
• Goal: recover $s_1, \dots, s_m$ from $x$
• Blind
• Formulation as spectrogram segmentation

## Spectrogram
• Spectrogram (a.k.a. Gabor analysis, windowed Fourier transforms)
– cut the signals in overlapping frames
– apply a window and compute the FFT

[Figure: waveform (amplitude vs. time) and its spectrogram (frequency vs. time); one frame (amplitude vs. time) shown in panels titled "Windowing" and "Hamming window"; panel titled "Fourier transform" (magnitude vs. frequency)]

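
The two steps of the spectrogram computation (overlapping frames, then window and FFT) can be sketched in a few lines of NumPy. The frame length, hop size, sampling rate and test signal are assumptions chosen for illustration; the slides do not give the values used in the experiments.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: overlapping frames -> Hamming window -> FFT."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames (one frame per row)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # Apply the window to every frame and take the FFT of the real signal
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.abs(spectrum).T          # shape: (frequency bins, frames)

fs = 8000                              # assumed sampling rate in Hz
t = np.arange(fs) / fs                 # one second of samples
x = np.sin(2 * np.pi * 1000.0 * t)     # pure 1000 Hz tone

S = spectrogram(x)
peak_bin = int(S.mean(axis=1).argmax())
print("spectrogram shape:", S.shape)
print("peak frequency (Hz):", peak_bin * fs / 256)
```

With these settings each frequency bin is `fs / frame_len` = 31.25 Hz wide, so the 1000 Hz tone falls in bin 32.
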
## Sparsity and superposition
$s_1 + s_2 = x$

## Building training set
[Figure: spectrogram of the mix; “optimal” segmentation]
• […] acceptable signals (e.g., take $\arg\max(|S_1|, |S_2|)$)
• Work as possibly large training datasets
– Requires new way of segmenting images...
– ...which can be learned from data

## Very large similarity matrices - Linear complexity
• Three different time scales ⇒ $W = \alpha_1 W_1 + \alpha_2 W_2 + \alpha_3 W_3$
• Small
– Fine scale structure (continuity, harmonicity)
– very sparse approximation
• Medium
– Medium scale structure (common fate cues)
– band-diagonal approximation, potentially reduced rank
• Large
– Global structure (e.g., speaker identification)
– low-rank approximation (rank is independent of duration)

## Experiments
• Two datasets of speakers: one for testing, one for training
• Left: optimal segmentation - right: blind segmentation

[Figure: two spectrogram panels (frequency vs. time)]

• Testing time (Matlab/C): $T$ duration of signal
– Building features $\approx 4 \times T$
– Separation $\approx 30 \times T$

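## Unsupervised learning - Convex relaxations
• Cuts: given any matrix $W \in \mathbb{R}^{n \times n}$, find $y \in \{-1, 1\}^n$ that minimizes

$$\sum_{i,j=1}^n W_{ij} 1_{y_i \neq y_j} = \frac{1}{2} \sum_{i,j=1}^n W_{ij} (1 - y_i y_j) = \frac{1}{2} 1^\top W 1 - \frac{1}{2} y^\top W y$$

– Let $Y = y y^\top$. We have $Y \succeq 0$, $\operatorname{diag}(Y) = 1$, $\operatorname{rank}(Y) = 1$
– Convex relaxation (Goemans and Williamson, 1997):

$$\max_{Y \succeq 0,\ \operatorname{diag}(Y) = 1} \operatorname{tr} WY$$

– May be solved as sequence of eigenvalue problems

$$\max_{Y \succeq 0,\ \operatorname{diag}(Y) = 1} \operatorname{tr} WY = \min_{\mu \in \mathbb{R}^n} \ n \lambda_{\max}(W + \operatorname{Diag}(\mu)) - 1^\top \mu$$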
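
The cut identity $\sum_{i,j} W_{ij} 1_{y_i \neq y_j} = \frac{1}{2} 1^\top W 1 - \frac{1}{2} y^\top W y$ and the eigenvalue upper bound $y^\top W y \leq n \lambda_{\max}(W + \operatorname{Diag}(\mu)) - 1^\top \mu$ can both be checked by brute force on a small instance. This is a sanity check under assumptions (a random symmetric $W$, $n = 6$, and a random rather than optimized $\mu$), not a solver for the relaxation; `cut_value` and `dual_bound` are illustrative names.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.standard_normal((n, n))
W = (W + W.T) / 2            # symmetric, so eigenvalues are real
ones = np.ones(n)

def cut_value(W, y):
    # sum over all (i, j) of W_ij * 1[y_i != y_j]
    return np.sum(W * (y[:, None] != y[None, :]))

def dual_bound(W, mu):
    # n * lambda_max(W + Diag(mu)) - 1^T mu, an upper bound on max_y y^T W y
    return len(mu) * np.linalg.eigvalsh(W + np.diag(mu)).max() - mu.sum()

# Enumerate all 2^n sign vectors
best_quad = -np.inf
for signs in itertools.product([-1.0, 1.0], repeat=n):
    y = np.array(signs)
    quad = y @ W @ y
    # Identity between the cut value and the quadratic form
    assert np.isclose(cut_value(W, y), 0.5 * ones @ W @ ones - 0.5 * quad)
    best_quad = max(best_quad, quad)

mu = rng.standard_normal(n)
print("max over y of y^T W y :", best_quad)
print("bound with mu = 0     :", dual_bound(W, np.zeros(n)))
print("bound with random mu  :", dual_bound(W, mu))
```

Every choice of $\mu$ gives a valid upper bound, so minimizing over $\mu$ only requires repeated largest-eigenvalue computations.
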


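## Submodular functions
• $F : 2^V \to \mathbb{R}$ is submodular if and only if

$$\forall A, B \subset V, \quad F(A) + F(B) \geq F(A \cap B) + F(A \cup B)$$

$$\Leftrightarrow \ \forall k \in V, \quad A \mapsto F(A \cup \{k\}) - F(A) \ \text{is non-increasing}$$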
• Intuition 1: defined like concave functions (“diminishing returns”)
– Example: $F : A \mapsto g(\operatorname{Card}(A))$ is submodular if $g$ is concave
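
The cardinality example can be verified by brute force on a small ground set, using the defining inequality $F(A) + F(B) \geq F(A \cap B) + F(A \cup B)$. The helper `is_submodular`, the choice $g = \sqrt{\cdot}$ as a concave function and $g(t) = t^2$ as a convex counterexample are all assumptions made for illustration.

```python
import itertools
import math

def is_submodular(F, V, tol=1e-12):
    """Brute-force check of F(A) + F(B) >= F(A & B) + F(A | B) for all A, B in 2^V."""
    items = list(V)
    subsets = [frozenset(c)
               for r in range(len(items) + 1)
               for c in itertools.combinations(items, r)]
    return all(F(A) + F(B) >= F(A & B) + F(A | B) - tol
               for A in subsets for B in subsets)

V = range(5)

def concave_card(A):
    return math.sqrt(len(A))    # g concave: diminishing returns

def convex_card(A):
    return len(A) ** 2          # g strictly convex: increasing returns

print("sqrt(Card(A)) submodular:", is_submodular(concave_card, V))
print("Card(A)^2 submodular:   ", is_submodular(convex_card, V))
```

The check enumerates all pairs of subsets, so it is only practical for very small ground sets.
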
• Intuition 2: behave like convex functions
– Polynomial-time minimization, conjugacy theory
• Used in several areas of signal processing and machine learning
– Total variation/graph cuts
– Optimal design - Structured sparsity

## Document modeling (Jenatton et al., 2010)

## Machine learning - Specificities
• Low-precision
– Objective functions are averages
• Large scale
– Practical impact only when complexity close to linear
• Online learning