No Free Lunch Theorems for Optimization

David H. Wolpert
IBM Almaden Research Center
Harry Road
San Jose, CA

William G. Macready
Santa Fe Institute
Hyde Park Road
Santa Fe, NM

December
Abstract
A framework is developed to explore the connection between effective optimization algorithms and the problems they are solving. A number of "no free lunch" (NFL) theorems are presented that establish that for any algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. These theorems result in a geometric interpretation of what it means for an algorithm to be well suited to an optimization problem. Applications of the NFL theorems to information theoretic aspects of optimization and benchmark measures of performance are also presented. Other issues addressed are time-varying optimization problems and a priori "head-to-head" minimax distinctions between optimization algorithms, distinctions that can obtain despite the NFL theorems' enforcing of a type of uniformity over all algorithms.
1 Introduction
The past few decades have seen increased interest in general-purpose "black-box" optimization algorithms that exploit little if any knowledge concerning the optimization problem on which they are run. In large part these algorithms have drawn inspiration from optimization processes that occur in nature. In particular, the two most popular black-box optimization strategies, evolutionary algorithms [OW, Hol] and simulated annealing [GV], mimic processes in natural selection and statistical mechanics, respectively.

In light of this interest in general-purpose optimization algorithms, it has become important to understand the relationship between how well an algorithm $a$ performs and the optimization problem $f$ on which it is run. In this paper we present a formal analysis that contributes towards such an understanding by addressing questions like the following. Given the plethora of black-box optimization algorithms and of optimization problems, how can we best match algorithms to problems (i.e., how best can we relax the black-box nature of the algorithms and have them exploit some knowledge concerning the optimization problem)? In particular, while serious optimization practitioners almost always perform such matching, it is usually on an ad hoc basis; how can such matching be formally analyzed? More generally, what is the underlying mathematical "skeleton" of optimization theory before the "flesh" of the probability distributions of a particular context and set of optimization problems are imposed? What can information theory and Bayesian analysis contribute to an understanding of these issues? How a priori generalizable are the performance results of a certain algorithm on a certain class of problems to its performance on other classes of problems? How should we even measure such generalization; how should we assess the performance of algorithms on problems so that we may programmatically compare those algorithms?
Broadly speaking, we take two approaches to these questions. First, we investigate what a priori restrictions there are on the pattern of performance of one or more algorithms as one runs over the set of all optimization problems. Our second approach is to instead focus on a particular problem and consider the effects of running over all algorithms. In the current paper we present results from both types of analyses but concentrate largely on the first approach. The reader is referred to the companion paper [MW] for more kinds of analysis involving the second approach.
We begin in Section 2 by introducing the necessary notation. Also discussed in this section is the model of computation we adopt, its limitations, and the reasons we chose it.
One might expect that there are pairs of search algorithms $A$ and $B$ such that $A$ performs better than $B$ on average, even if $B$ sometimes outperforms $A$. As an example, one might expect that hill-climbing usually outperforms hill-descending if one's goal is to find a maximum of the cost function. One might also expect it would outperform a random search in such a context.
One of the main results of this paper is that such expectations are incorrect. We prove two NFL theorems in Section 3 that demonstrate this and more generally illuminate the connection between algorithms and problems. Roughly speaking, we show that for both static and time-dependent optimization problems, the average performance of any pair of algorithms across all possible problems is exactly identical. This means in particular that if some algorithm $a_1$'s performance is superior to that of another algorithm $a_2$ over some set of optimization problems, then the reverse must be true over the set of all other optimization problems. (The reader is urged to read this section carefully for a precise statement of these theorems.) This is true even if one of the algorithms is random; any algorithm $a_1$ performs worse than randomly just as readily (over the set of all optimization problems) as it performs better than randomly. Possible objections to these results are also addressed later in the paper.
In Section 4 we present a geometric interpretation of the NFL theorems. In particular, we show that an algorithm's average performance is determined by how "aligned" it is with the underlying probability distribution over optimization problems on which it is run. This section is critical for anyone wishing to understand how the NFL results are consistent with the well-accepted fact that many search algorithms that do not take into account knowledge concerning the cost function work quite well in practice.
A later section demonstrates that the NFL theorems allow one to answer a number of what would otherwise seem to be intractable questions. The implications of these answers for measures of algorithm performance, and of how best to compare optimization algorithms, are also explored there.
We then discuss some of the ways in which, despite the NFL theorems, algorithms can have a priori distinctions that hold even if nothing is specified concerning the optimization problems. In particular, we show that there can be "head-to-head" minimax distinctions between a pair of algorithms, i.e., we show that considered one $f$ at a time, a pair of algorithms may be distinguishable, even if they are not when one looks over all $f$.
In a further section we present an introduction to the alternative approach to the formal analysis of optimization, in which problems are held fixed and one looks at properties across the space of algorithms. Since these results hold in general, they hold for any and all optimization problems, and in this are independent of what kinds of problems one is more or less likely to encounter in the real world. In particular, these results state that one has no a priori justification for using a search algorithm's behavior so far on a particular cost function to predict its future behavior on that function. In fact, when choosing between algorithms based on their observed performance it does not suffice to make an assumption about the cost function; some (currently poorly understood) assumptions are also being made about how the algorithms in question are related to each other and to the cost function. In addition to presenting results not found in [W], this section serves as an introduction to the perspective adopted in [W].
We conclude with a brief discussion, a summary of results, and a short list of open problems.
We have confined as many of our proofs to appendices as possible to facilitate the flow of the paper. A more detailed, and substantially longer, version of this paper (a version that also analyzes some issues not addressed in this paper) can be found in [M].
Finally, we cannot emphasize enough that no claims whatsoever are being made in this paper concerning how well various search algorithms work in practice. The focus of this paper is on what can be said a priori, without any assumptions and from mathematical principles alone, concerning the utility of a search algorithm.
2 Preliminaries
We restrict attention to combinatorial optimization in which the search space $\mathcal{X}$, though perhaps quite large, is finite. We further assume that the space of possible cost values $\mathcal{Y}$ is also finite. These restrictions are automatically met for optimization algorithms run on digital computers. For example, typically $\mathcal{Y}$ is some fixed-width bit representation of the real numbers in such a case.
The sizes of the spaces $\mathcal{X}$ and $\mathcal{Y}$ are indicated by $|\mathcal{X}|$ and $|\mathcal{Y}|$ respectively. Optimization problems $f$ (sometimes called "cost functions" or "objective functions" or "energy functions") are represented as mappings $f : \mathcal{X} \to \mathcal{Y}$. $\mathcal{F} = \mathcal{Y}^{\mathcal{X}}$ is then the space of all possible problems. $\mathcal{F}$ is of size $|\mathcal{Y}|^{|\mathcal{X}|}$, a very large but finite number. In addition to static $f$, we shall also be interested in optimization problems that depend explicitly on time. The extra notation needed for such time-dependent problems will be introduced as needed.
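To make the size of the space of problems concrete, the following Python sketch enumerates every cost function for a toy search space and checks that there are $|\mathcal{Y}|^{|\mathcal{X}|}$ of them. The particular sets `X` and `Y` and the helper `all_cost_functions` are illustrative assumptions, not part of the original formalism.

```python
from itertools import product

X = [0, 1, 2]   # toy search space (assumed for illustration)
Y = [0, 1]      # toy set of cost values (assumed for illustration)


def all_cost_functions(X, Y):
    """Yield every mapping f: X -> Y, represented as a dict."""
    for values in product(Y, repeat=len(X)):
        yield dict(zip(X, values))


F = list(all_cost_functions(X, Y))

# The number of distinct cost functions is |Y| ** |X|.
print(len(F), len(Y) ** len(X))
```

Even for modest spaces this count grows extremely quickly, which is why the analysis works with sums over the space of problems rather than explicit enumeration.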
It is common in the optimization community to adopt an oracle-based view of computation. In this view, when assessing the performance of algorithms, results are stated in terms of the number of function evaluations required to find a certain solution. Unfortunately though, many optimization algorithms are wasteful of function evaluations. In particular, many algorithms do not remember where they have already searched and therefore often revisit the same points. Although any algorithm that is wasteful in this fashion can be made more efficient simply by remembering where it has been (cf. tabu search [Glo, Glo]), many real-world algorithms elect not to employ this stratagem. Accordingly, from the point of view of the oracle-based performance measures, there are "artefacts" distorting the apparent relationship between many such real-world algorithms.
This difficulty is exacerbated by the fact that the amount of revisiting that occurs is a complicated function of both the algorithm and the optimization problem, and therefore cannot be simply "filtered out" of a mathematical analysis. Accordingly, we have elected to circumvent the problem entirely by comparing algorithms based on the number of distinct function evaluations they have performed. Note that this does not mean that we cannot compare algorithms that are wasteful of evaluations; it simply means that we compare algorithms by counting only their number of distinct calls to the oracle.
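One way to picture this counting convention is a wrapper around the cost function that records revisits but charges only for distinct points. This is a minimal sketch; the class name `CountingOracle`, its attributes, and the quadratic cost function are hypothetical choices made for illustration.

```python
class CountingOracle:
    """Wrap a cost function and distinguish total calls from distinct calls."""

    def __init__(self, f):
        self.f = f
        self.seen = {}        # x -> f(x) for every distinct point queried so far
        self.total_calls = 0  # every call, including revisits

    def __call__(self, x):
        self.total_calls += 1
        if x not in self.seen:
            self.seen[x] = self.f(x)
        return self.seen[x]

    @property
    def distinct_calls(self):
        """Number of distinct points evaluated: the quantity m used in the text."""
        return len(self.seen)


oracle = CountingOracle(lambda x: (x - 3) ** 2)
for x in [0, 1, 1, 5, 0, 3]:   # a wasteful visiting sequence with revisits
    oracle(x)

print(oracle.total_calls, oracle.distinct_calls)
```

Two algorithms that differ only in how often they revisit points would be assigned the same number of distinct calls under this convention.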
We call a time-ordered set of $m$ distinct visited points a "sample" of size $m$. Samples are denoted by
$$d_m \equiv \{(d_m^x(1), d_m^y(1)), \ldots, (d_m^x(m), d_m^y(m))\}.$$
The points in a sample are ordered according to the time at which they were generated. Thus $d_m^x(i)$ indicates the $\mathcal{X}$ value of the $i$th successive element in a sample of size $m$, and $d_m^y(i)$ is the associated cost or $\mathcal{Y}$ value. $d_m^y \equiv \{d_m^y(1), \ldots, d_m^y(m)\}$ will be used to indicate the ordered set of cost values. The space of all samples of size $m$ is $\mathcal{D}_m = (\mathcal{X} \times \mathcal{Y})^m$ (so $d_m \in \mathcal{D}_m$), and the set of all possible samples of arbitrary size is $\mathcal{D} \equiv \cup_{m \ge 0} \mathcal{D}_m$.
As an important clarification of this definition, consider a hill-descending algorithm. This is the algorithm that examines a set of neighboring points in $\mathcal{X}$ and moves to the one having the lowest cost. The process is then iterated from the newly chosen point. (Often, implementations of hill-descending stop when they reach a local minimum, but they can easily be extended to run longer by randomly jumping to a new unvisited point once the neighborhood of a local minimum has been exhausted.) The point to note is that because a sample contains all the previous points at which the oracle was consulted, it includes the $(\mathcal{X}, \mathcal{Y})$ values of all the neighbors of the current point, and not only the lowest cost one that the algorithm moves to. This must be taken into account when counting the value of $m$.
Optimization algorithms $a$ are represented as mappings from previously visited sets of points to a single new (i.e., previously unvisited) point in $\mathcal{X}$. Formally, $a : d \in \mathcal{D} \mapsto \{x \mid x \notin d^x\}$. Given our decision to only measure distinct function evaluations even if an algorithm revisits previously searched points, our definition of an algorithm includes all common black-box optimization techniques like simulated annealing and evolutionary algorithms. (Techniques like branch and bound [LW] are not included since they rely explicitly on the cost structure of partial solutions, and we are here interested primarily in black-box algorithms.)
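The definition of an algorithm as a map from the current sample to a new, unvisited point can be sketched in a few lines of Python. This is an illustration under stated assumptions: samples are stored as lists of `(x, y)` pairs, cost functions as dictionaries, and the names `run`, `left_to_right` and `greedy_neighbor` are hypothetical.

```python
def run(algorithm, f, X, m):
    """Build the sample d_m = [(x_1, y_1), ..., (x_m, y_m)] of m distinct points."""
    d = []
    for _ in range(m):
        x = algorithm(d, X)
        assert x not in [xi for xi, _ in d], "an algorithm must return an unvisited point"
        d.append((x, f[x]))
    return d


def left_to_right(d, X):
    """Visit unvisited points in increasing order, ignoring the observed costs."""
    visited = {x for x, _ in d}
    return min(x for x in X if x not in visited)


def greedy_neighbor(d, X):
    """Probe a neighbor of the best point seen so far (right first, then left);
    otherwise fall back to the largest unvisited point."""
    visited = {x for x, _ in d}
    if d:
        best_x = min(d, key=lambda xy: xy[1])[0]
        for candidate in (best_x + 1, best_x - 1):
            if candidate in X and candidate not in visited:
                return candidate
    return max(x for x in X if x not in visited)


X = [0, 1, 2, 3]
f = {0: 3, 1: 1, 2: 2, 3: 0}   # a toy cost function (assumed for illustration)

print(run(left_to_right, f, X, 3))
print(run(greedy_neighbor, f, X, 3))
```

Both functions are deterministic and never revisit a point, so each is an algorithm in the sense defined above; only the second uses the cost values in the sample.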
As defined above, a search algorithm is deterministic; every sample maps to a unique new point. Of course, essentially all algorithms implemented on computers are deterministic (footnote 1), and in this our definition is not restrictive. Nonetheless, it is worth noting that all of our results are extensible to non-deterministic algorithms, where the new point is chosen stochastically from the set of unvisited points. (This point is returned to below.)
Under the oracle-based model of computation, any measure of the performance of an algorithm after $m$ iterations is a function of the sample $d_m^y$. Such performance measures will be indicated by $\Phi(d_m^y)$. As an example, if we are trying to find a minimum of $f$, then a reasonable measure of the performance of $a$ might be the value of the lowest $\mathcal{Y}$ value in $d_m^y$: $\Phi(d_m^y) = \min_i \{d_m^y(i) : i = 1 \ldots m\}$. Note that measures of performance based on factors other than $d_m^y$ (e.g., wall clock time) are outside the scope of our results.
We shall cast all of our results in terms of probability theory. We do so for three reasons. First, it allows simple generalization of our results to stochastic algorithms. Second, even when the setting is deterministic, probability theory provides a simple consistent framework in which to carry out proofs.
The third reason for using probability theory is perhaps the most interesting. A crucial factor in the probabilistic framework is the distribution $P(f) = P(f(x_1), \ldots, f(x_{|\mathcal{X}|}))$. This distribution, defined over $\mathcal{F}$, gives the probability that each $f \in \mathcal{F}$ is the actual optimization problem at hand. An approach based on this distribution has the immediate advantage that often knowledge of a problem is statistical in nature and this information may be easily encodable in $P(f)$. For example, Markov or Gibbs random field descriptions [S] of families of optimization problems express $P(f)$ exactly.
However, exploiting $P(f)$ also has advantages even when we are presented with a single uniquely specified cost function. One such advantage is the fact that although it may be fully specified, many aspects of the cost function are effectively unknown (e.g., we certainly do not know the extrema of the function). It is in many ways most appropriate to have this effective ignorance reflected in the analysis as a probability distribution. More generally, we usually act as though the cost function is partially unknown. For example, we might use the same search algorithm for all cost functions in a class (e.g., all traveling salesman problems having certain characteristics). In so doing, we are implicitly acknowledging that we consider distinctions between the cost functions in that class to be irrelevant or at least unexploitable. In this sense, even though we are presented with a single particular problem from that class, we act as though we are presented with a probability distribution over cost functions, a distribution that is non-zero only for members of that class of cost functions. $P(f)$ is thus a prior specification of the class of the optimization problem at hand, with different classes of problems corresponding to different choices of what algorithms we will use, and giving rise to different distributions $P(f)$.

Footnote 1. In particular, note that random number generators are deterministic given a seed.
Given our choice to use probability theory, the performance of an algorithm $a$ iterated $m$ times on a cost function $f$ is measured with $P(d_m^y \mid f, m, a)$. This is the conditional probability of obtaining a particular sample $d_m$ under the stated conditions. From $P(d_m^y \mid f, m, a)$, performance measures $\Phi(d_m^y)$ can be found easily.
In the next section we will analyze $P(d_m^y \mid f, m, a)$, and in particular how it can vary with the algorithm $a$. Before proceeding with that analysis, however, it is worth briefly noting that there are other formal approaches to the issues investigated in this paper. Perhaps the most prominent of these is the field of computational complexity. Unlike the approach taken in this paper, computational complexity mostly ignores the statistical nature of search, and concentrates instead on computational issues. Much (though by no means all) of computational complexity is concerned with physically unrealizable computational devices (Turing machines) and the worst case amount of resources they require to find optimal solutions. In contrast, the analysis in this paper does not concern itself with the computational engine used by the search algorithm, but rather concentrates exclusively on the underlying statistical nature of the search problem. In this the current probabilistic approach is complementary to computational complexity. Future work involves combining our analysis of the statistical nature of search with practical concerns for computational resources.
3 The NFL theorems
In this section we analyze the connection between algorithms and cost functions. We have dubbed the associated results "No Free Lunch" (NFL) theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems. Additionally, the name emphasizes the parallel with similar results in supervised learning [Wola, Wolb].
The precise question addressed in this section is: "How does the set of problems $F_1 \subset \mathcal{F}$ for which algorithm $a_1$ performs better than algorithm $a_2$ compare to the set $F_2 \subset \mathcal{F}$ for which the reverse is true?" To address this question we compare the sum over all $f$ of $P(d_m^y \mid f, m, a_1)$ to the sum over all $f$ of $P(d_m^y \mid f, m, a_2)$. This comparison constitutes a major result of this paper: $P(d_m^y \mid f, m, a)$ is independent of $a$ when we average over all cost functions:
Theorem 1 For any pair of algorithms $a_1$ and $a_2$,
$$\sum_f P(d_m^y \mid f, m, a_1) = \sum_f P(d_m^y \mid f, m, a_2).$$
A proof of this result is found in Appendix A. An immediate corollary of this result is that for any performance measure $\Phi(d_m^y)$, the average over all $f$ of $P(\Phi(d_m^y) \mid f, m, a)$ is independent of $a$. The precise way that the sample is mapped to a performance measure is unimportant.
This theorem explicitly demonstrates that what an algorithm gains in performance on one class of problems it necessarily pays for on the remaining problems; that is the only way that all algorithms can have the same $f$-averaged performance.
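The statement of the theorem can be checked by brute force on a space small enough to enumerate. The sketch below is an illustration, not the proof from Appendix A: the toy spaces and the two algorithms `ascending` and `adaptive` are assumptions chosen for the example. Because both algorithms are deterministic, $P(d_m^y \mid f, m, a)$ is 0 or 1, so the sum over $f$ is just a count of cost functions.

```python
from collections import Counter
from itertools import product

X = [0, 1, 2]   # toy search space (assumed)
Y = [0, 1]      # toy cost values (assumed)


def run(algorithm, f, m):
    """Return the sample d_m produced by a non-revisiting algorithm on f."""
    d = []
    for _ in range(m):
        x = algorithm(d)
        d.append((x, f[x]))
    return d


def ascending(d):
    """Ignore costs and visit the smallest unvisited point."""
    visited = {x for x, _ in d}
    return min(x for x in X if x not in visited)


def adaptive(d):
    """Start at 2; then go low after observing cost 0, high after observing cost 1."""
    visited = {x for x, _ in d}
    unvisited = [x for x in X if x not in visited]
    if not d:
        return 2
    return min(unvisited) if d[-1][1] == 0 else max(unvisited)


def histogram(algorithm, m):
    """For each cost sequence d_m^y, count the cost functions f that produce it.

    For a deterministic algorithm this count equals sum_f P(d_m^y | f, m, a).
    """
    counts = Counter()
    for values in product(Y, repeat=len(X)):
        f = dict(zip(X, values))
        dy = tuple(y for _, y in run(algorithm, f, m))
        counts[dy] += 1
    return counts


for m in (1, 2, 3):
    assert histogram(ascending, m) == histogram(adaptive, m)

print(histogram(adaptive, 2))
```

On this toy space every cost sequence of length $m$ is produced by the same number of cost functions regardless of the algorithm, which is the content of the theorem in this special case.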



A result analogous to Theorem 1 holds for a class of time-dependent cost functions. The time-dependent functions we consider begin with an initial cost function $f_1$ that is present at the sampling of the first $x$ value. Before the beginning of each subsequent iteration of the optimization algorithm, the cost function is deformed to a new function, as specified by a mapping $T : \mathcal{F} \times \mathbb{N} \to \mathcal{F}$ (footnote 2). We indicate this mapping with the notation $T_i$. So the function present during the $i$th iteration is $f_{i+1} = T_i(f_i)$. $T_i$ is assumed to be a (potentially $i$-dependent) bijection between $\mathcal{F}$ and $\mathcal{F}$. We impose bijectivity because if it did not hold, the evolution of cost functions could narrow in on a region of $f$'s for which some algorithms may perform better than others. This would constitute an a priori bias in favor of those algorithms, a bias whose analysis we wish to defer to future work.
How best to assess the quality of an algorithm's performance on time-dependent cost functions is not clear. Here we consider two schemes based on manipulations of the definition of the sample. In scheme 1 the particular $\mathcal{Y}$ value in $d_m^y(j)$ corresponding to a particular $x$ value $d_m^x(j)$ is given by the cost function that was present when $d_m^x(j)$ was sampled. In contrast, for scheme 2 we imagine a sample $D_m^y$ given by the $\mathcal{Y}$ values from the present cost function for each of the $x$ values in $d_m^x$. Formally, if $d_m^x = \{d_m^x(1), \ldots, d_m^x(m)\}$, then in scheme 1 we have
$$d_m^y = \{f_1(d_m^x(1)), \ldots, f_m(d_m^x(m))\},$$
and in scheme 2 we have
$$D_m^y = \{f_m(d_m^x(1)), \ldots, f_m(d_m^x(m))\},$$
where $f_m = T_{m-1}(f_{m-1})$ is the final cost function.
In some situations it may be that the members of the sample "live" for a long time, on the time scale of the evolution of the cost function. In such situations it may be appropriate to judge the quality of the search algorithm by $D_m^y$; all those previous elements of the sample are still "alive" at time $m$, and therefore their current cost is of interest. On the other hand, if members of the sample live for only a short time on the time scale of evolution of the cost function, one may instead be concerned with things like how well the "living" member(s) of the sample track the changing cost function. In such situations, it may make more sense to judge the quality of the algorithm with the $d_m^y$ sample.
Results similar to Theorem 1 can be derived for both schemes. By analogy with that theorem, we average over all possible ways a cost function may be time-dependent, i.e., we average over all $T$ (rather than over all $f$). Thus we consider $\sum_T P(d_m^y \mid f_1, T, m, a)$, where $f_1$ is the initial cost function. Since $T$ only takes effect for $m > 1$, and since $f_1$ is fixed, there are a priori distinctions between algorithms as far as the first member of the population is concerned. However, after redefining samples to only contain those elements added after the first iteration of the algorithm, we arrive at the following result, proven in Appendix B:
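Theorem 2 For all $d_m^y$, $D_m^y$, $m > 1$, algorithms $a_1$ and $a_2$, and initial cost functions $f_1$,
$$\sum_T P(d_m^y \mid f_1, T, m, a_1) = \sum_T P(d_m^y \mid f_1, T, m, a_2)$$
and
$$\sum_T P(D_m^y \mid f_1, T, m, a_1) = \sum_T P(D_m^y \mid f_1, T, m, a_2).$$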

Footnote 2. An obvious restriction would be to require that $T$ doesn't vary with time, so that it is a mapping simply from $\mathcal{F}$ to $\mathcal{F}$. An analysis for $T$'s limited this way is beyond the scope of this paper.

So in particular, if one algorithm outperforms another for certain kinds of evolution operators, then the reverse must be true on the set of all other evolution operators.
Although this particular result is similar to the NFL result for the static case, in general the time-dependent situation is more subtle. In particular, with time-dependence there are situations in which there can be a priori distinctions between algorithms even for those members of the population arising after the first. For example, in general there will be distinctions between algorithms when considering the quantity $\sum_f P(d_m^y \mid f, T, m, a)$. To see this, consider the case where $\mathcal{X}$ is a set of contiguous integers and for all iterations $T$ is a shift operator, replacing $f(x)$ by $f(x - 1)$ for all $x$ (with $\min(x) - 1 \equiv \max(x)$). For such a case we can construct algorithms which behave differently a priori. For example, take $a$ to be the algorithm that first samples $f$ at $x_1$, next at $x_1 + 1$, and so on, regardless of the values in the population. Then for any $f$, $d_m^y$ is always made up of identical $\mathcal{Y}$ values. Accordingly, $\sum_f P(d_m^y \mid f, T, m, a)$ is non-zero only for $d_m^y$ for which all values $d_m^y(i)$ are identical. Other search algorithms, even for the same shift $T$, do not have this restriction on $\mathcal{Y}$ values. This constitutes an a priori distinction between algorithms.
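The shift-operator example can be traced step by step. In the sketch below, cost functions on $\mathcal{X} = \{0, \ldots, n-1\}$ are stored as lists, `shift` implements the operator $T$ described above, and samples are recorded under scheme 1. The names `shift`, `run_scheme1`, `walker` and `other`, and the particular values in `f1`, are hypothetical.

```python
def shift(f):
    """T: replace f(x) by f(x - 1), with min(x) - 1 identified with max(x)."""
    n = len(f)
    return [f[(x - 1) % n] for x in range(n)]


def run_scheme1(order, f1, m):
    """Record each y-value from the cost function present when its x is sampled."""
    f, ys = list(f1), []
    for i in range(m):
        ys.append(f[order[i]])
        f = shift(f)   # the cost function is deformed before the next iteration
    return ys


f1 = [4, 7, 1, 9, 3]       # initial cost function (assumed toy values)
walker = [0, 1, 2, 3, 4]   # samples x_1, x_1 + 1, ... regardless of observed costs
other = [0, 2, 4, 1, 3]    # a different fixed visiting order

print(run_scheme1(walker, f1, 4))
print(run_scheme1(other, f1, 4))
```

The first algorithm moves in step with the shift and so keeps observing the same cost value, while the second visiting order is not restricted in this way.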
3.1 Implications of the NFL theorems
As emphasized above, the NFL theorems mean that if an algorithm does particularly well on one class of problems then it must do more poorly over the remaining problems. In particular, if an algorithm performs better than random search on some class of problems then it must perform worse than random search on the remaining problems. Thus comparisons reporting the performance of a particular algorithm with a particular parameter setting on a few sample problems are of limited utility. While such results do indicate behavior on the narrow range of problems considered, one should be very wary of trying to generalize those results to other problems.
Note though that the NFL theorem need not be viewed this way, as a way of comparing function classes $F_1$ and $F_2$ (or classes of evolution operators $T_1$ and $T_2$, as the case might be). It can be viewed instead as a statement concerning any algorithm's performance when $f$ is not fixed, under the uniform prior over cost functions, $P(f) = 1/|\mathcal{F}|$. If we wish instead to analyze performance where $f$ is not fixed, as in this alternative interpretation of the NFL theorem, but in contrast with the NFL case $f$ is now chosen from a non-uniform prior, then we must analyze explicitly the sum
$$P(d_m^y \mid m, a) = \sum_f P(d_m^y \mid f, m, a) \, P(f).$$
Since it is certainly true that any class of problems faced by a practitioner will not have a flat prior, what are the practical implications of the NFL theorems when viewed as a statement concerning an algorithm's performance for non-fixed $f$? This question is taken up in greater detail in Section 4, but we make a few comments here.
First, if the practitioner has knowledge of problem characteristics but does not incorporate them into the optimization algorithm, then $P(f)$ is effectively uniform. (Recall that $P(f)$ can be viewed as a statement concerning the practitioner's choice of optimization algorithms.) In such a case, the NFL theorems establish that there are no formal assurances that the algorithm chosen will be at all effective.
Secondly, while most classes of problems will certainly have some structure which, if known, might be exploitable, the simple existence of that structure does not justify choice of a particular algorithm; that structure must be known and reflected directly in the choice of algorithm to serve as such a justification. In other words, the simple existence of structure per se, absent a specification of that structure, cannot provide a basis for preferring one algorithm over another. Formally, this is established by the existence of NFL-type theorems in which rather than average over specific cost functions $f$, one averages over specific "kinds of structure", i.e., theorems in which one averages $P(d_m^y \mid m, a)$ over distributions $P(f)$. That such theorems hold when one averages over all $P(f)$ means that the indistinguishability of algorithms associated with uniform $P(f)$ is not some pathological, outlier case. Rather, uniform $P(f)$ is a "typical" distribution as far as indistinguishability of algorithms is concerned. The simple fact that the $P(f)$ at hand is non-uniform cannot serve to determine one's choice of optimization algorithm.
Finally, it is important to emphasize that even if one is considering the case where $f$ is not fixed, performing the associated average according to a uniform $P(f)$ is not essential for NFL to hold. NFL can also be demonstrated for a range of non-uniform priors. For example, any prior of the form $\prod_{x \in \mathcal{X}} P'(f(x))$ (where $P'(y = f(x))$ is the distribution of $\mathcal{Y}$ values) will also give NFL. The $f$-average can also enforce correlations between costs at different $\mathcal{X}$ values and NFL still obtain. For example, if costs are rank ordered (with ties broken in some arbitrary way) and we sum only over all cost functions given by permutations of those orders, then NFL still holds.
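The product-prior case can also be checked numerically on a toy space. The sketch below computes $P(d_m^y \mid m, a) = \sum_f P(d_m^y \mid f, m, a) P(f)$ exactly, using rational arithmetic, for two algorithms under a prior of the form $\prod_x P'(f(x))$. The distribution `P_y` and the algorithms `ascending` and `adaptive` are assumptions chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

X = [0, 1, 2]
P_y = {0: Fraction(1, 4), 1: Fraction(3, 4)}   # hypothetical P'(y), deliberately non-uniform


def run(algorithm, f, m):
    d = []
    for _ in range(m):
        x = algorithm(d)
        d.append((x, f[x]))
    return d


def ascending(d):
    visited = {x for x, _ in d}
    return min(x for x in X if x not in visited)


def adaptive(d):
    """Start at 2; then go low after observing cost 0, high after observing cost 1."""
    visited = {x for x, _ in d}
    unvisited = [x for x in X if x not in visited]
    if not d:
        return 2
    return min(unvisited) if d[-1][1] == 0 else max(unvisited)


def distribution(algorithm, m):
    """Exact P(d_m^y | m, a) under the product prior prod_x P'(f(x))."""
    dist = defaultdict(Fraction)
    for values in product(P_y, repeat=len(X)):
        f = dict(zip(X, values))
        prior = Fraction(1)
        for y in values:
            prior *= P_y[y]
        dy = tuple(y for _, y in run(algorithm, f, m))
        dist[dy] += prior
    return dict(dist)


for m in (1, 2, 3):
    assert distribution(ascending, m) == distribution(adaptive, m)

print(distribution(adaptive, 2))
```

Under this prior the costs at distinct points are independent and identically distributed, so the order in which an algorithm visits points cannot change the distribution of observed cost sequences.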
The choice of uniform $P(f)$ was motivated more by theoretical than by pragmatic concerns, as a way of analyzing the theoretical structure of optimization. Nevertheless, the cautionary observations presented above make clear that an analysis of the uniform $P(f)$ case has a number of ramifications for practitioners.
3.2 Stochastic optimization algorithms
Thus far we have considered the case in which algorithms are deterministic. What is the situation for stochastic algorithms? As it turns out, NFL results hold even for such algorithms.
The proof of this is straightforward. Let $\sigma$ be a stochastic "non-potentially revisiting" algorithm. Formally, this means that $\sigma$ is a mapping taking any $d$ to a $d$-dependent distribution over $\mathcal{X}$ that equals zero for all $x \in d^x$. (In this sense $\sigma$ is what in the statistics community is known as a "hyper-parameter", specifying the function $P(d_{m+1}^x(m+1) \mid d_m, \sigma)$ for all $m$ and $d$.) One can now reproduce the derivation of the NFL result for deterministic algorithms, only with $a$ replaced by $\sigma$ throughout. In so doing all steps in the proof remain valid. This establishes that NFL results apply to stochastic algorithms as well as deterministic ones.
4 A geometric perspective on the NFL theorems
Intuitively, the NFL theorem illustrates that if knowledge of $f$ (perhaps specified through $P(f)$) is not incorporated into $a$, then there are no formal assurances that $a$ will be effective. Rather, effective optimization relies on a fortuitous matching between $f$ and $a$. This point is formally established by viewing the NFL theorem from a geometric perspective.
Consider the space $\mathcal{F}$ of all possible cost functions. As discussed in the preceding section for non-uniform priors, the probability of obtaining some $d_m^y$ is
$$P(d_m^y \mid m, a) = \sum_f P(d_m^y \mid m, a, f) \, P(f),$$
where $P(f)$ is the prior probability that the optimization problem at hand has cost function $f$. This sum over functions can be viewed as an inner product in $\mathcal{F}$. More precisely, defining the $\mathcal{F}$-space vectors $\vec{v}_{d_m^y, a, m}$ and $\vec{p}$ by their $f$ components $\vec{v}_{d_m^y, a, m}(f) \equiv P(d_m^y \mid m, a, f)$ and $\vec{p}(f) \equiv P(f)$ respectively,
$$P(d_m^y \mid m, a) = \vec{v}_{d_m^y, a, m} \cdot \vec{p}.$$
This equation provides a geometric interpretation of the optimization process. $d_m^y$ can be viewed as fixed to the sample that is desired, usually one with a low cost value, and $m$ is a measure of the computational resources that can be afforded. Any knowledge of the properties of the cost function goes into the prior over cost functions, $\vec{p}$. Then this inner product says the performance of an algorithm is determined by the magnitude of its projection onto $\vec{p}$, i.e., by how aligned $\vec{v}_{d_m^y, a, m}$ is with the problems $\vec{p}$. Alternatively, by averaging over $d_m^y$, it is easy to see that $E(d_m^y \mid m, a)$ is an inner product between $\vec{p}$ and $E(d_m^y \mid m, a, f)$. The expectation of any performance measure $\Phi(d_m^y)$ can be written similarly.
In any of these cases, $P(f)$ or $\vec{p}$ must "match" or be aligned with $a$ to get desired behavior. This need for matching provides a new perspective on how certain algorithms can perform well in practice on specific kinds of problems. For example, it means that the years of research into the traveling salesman problem (TSP) have resulted in algorithms aligned with the (implicit) $\vec{p}$ describing traveling salesman problems of interest to TSP researchers.
Taking the geometric view, the NFL result that $\sum_f P(d_m^y \mid f, m, a)$ is independent of $a$ has
the interpretation that for any particular $d_m^y$ and $m$, all algorithms $a$ have the same projection
onto the uniform $P(f)$, represented by the diagonal vector $\vec{1}$. Formally,
$\vec{v}_{d_m^y,a,m} \cdot \vec{1} = \mathrm{cst}(d_m^y, m)$. For deterministic algorithms the components of $\vec{v}_{d_m^y,a,m}$ (i.e., the probabilities
that algorithm $a$ gives sample $d_m^y$ on cost function $f$ after $m$ distinct cost evaluations) are
all either 0 or 1, so NFL also implies that $\sum_f P^2(d_m^y \mid m, a, f) = \mathrm{cst}(d_m^y, m)$. Geometrically,
this indicates that the length of $\vec{v}_{d_m^y,a,m}$ is independent of $a$. Different algorithms thus
generate different vectors $\vec{v}_{d_m^y,a,m}$, all having the same length and lying on a cone with constant
projection onto $\vec{1}$. (A schematic of this situation is shown in Figure 1 for the case where
$\mathcal{F}$ is 3-dimensional.) Because the components of $\vec{v}_{d_m^y,a,m}$ are binary, we might equivalently
view $\vec{v}_{d_m^y,a,m}$ as lying on the subset of the vertices of the Boolean hypercube having the same
Hamming distance from $\vec{0}$.


Figure 1: Schematic view of the situation in which function space $\mathcal{F}$ is 3-dimensional. The
uniform prior over this space, $\vec{1}$, lies along the diagonal. Different algorithms $a$ give different
vectors $\vec{v}$ lying in the cone surrounding the diagonal. A particular problem is represented by
its prior $\vec{p}$ lying on the simplex. The algorithm that will perform best will be the algorithm
in the cone having the largest inner product with $\vec{p}$.
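The geometric picture can be checked by brute force on a very small search space. The sketch below is illustrative only: the sizes ($|X| = 3$, $|Y| = 2$, $m = 2$), the two algorithms `sweep` and `branch`, and the non-uniform prior are assumptions made for the example, not constructions from the paper. It enumerates every cost function, builds the binary vectors $\vec{v}_{d_m^y,a,m}$ for both algorithms, and compares their projections onto $\vec{1}$ and onto one particular $\vec{p}$.

```python
from itertools import product

X_SIZE, Y_VALUES, M = 3, (0, 1), 2   # toy sizes, assumed for illustration

ALL_F = list(product(Y_VALUES, repeat=X_SIZE))   # every f: X -> Y as a tuple of costs


def run(algorithm, f, m=M):
    """Run a deterministic, non-retracing algorithm and return d_m^y."""
    xs, ys = [], []
    for _ in range(m):
        x = algorithm(xs, ys)   # the next point depends only on the sample so far
        xs.append(x)
        ys.append(f[x])
    return tuple(ys)


def sweep(xs, ys):
    """Hypothetical algorithm: visit points 0, 1, 2, ... in order."""
    return len(xs)


def branch(xs, ys):
    """Hypothetical algorithm: start at 2, then go to 0 if that cost was 0, else to 1."""
    if not xs:
        return 2
    return 0 if ys[0] == 0 else 1


def v_vector(algorithm, target):
    """Components v(f) = P(d_m^y = target | m, a, f); 0 or 1 for deterministic a."""
    return [1 if run(algorithm, f) == target else 0 for f in ALL_F]


def inner(v, p):
    """Plain dot product over the space of cost functions."""
    return sum(vi * pi for vi, pi in zip(v, p))


ones = [1] * len(ALL_F)
# A non-uniform prior (assumed): all mass on functions with f(0) = f(1) = 0.
support = [f for f in ALL_F if f[0] == 0 and f[1] == 0]
prior = [1 / len(support) if f in support else 0.0 for f in ALL_F]

for target in product(Y_VALUES, repeat=M):
    v1, v2 = v_vector(sweep, target), v_vector(branch, target)
    print(target, inner(v1, ones), inner(v2, ones), inner(v1, prior), inner(v2, prior))
```

In this toy setting the two algorithms produce different vectors with the same projection onto $\vec{1}$ (and, the components being binary, the same length), while their inner products with the non-uniform prior differ. That is the sense in which an algorithm can be better aligned with a particular $\vec{p}$.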
Now restrict attention to algorithms having the same probability of some particular $d_m^y$.
The algorithms in this set lie in the intersection of 2 cones: one about the diagonal, set by
the NFL theorem, and one set by having the same probability for $d_m^y$. This is in general an
$|\mathcal{F}| - 2$ dimensional manifold. Continuing, as we impose yet more $d_m^y$-based restrictions on a
set of algorithms, we will continue to reduce the dimensionality of the manifold by focusing
on intersections of more and more cones.
The geometric view of optimization also suggests alternative measures for determining
how "similar" two optimization algorithms are. Consider again the inner-product equation. In that the
algorithm directly only gives $\vec{v}_{d_m^y,a,m}$, perhaps the most straightforward way to compare two
algorithms $a_1$ and $a_2$ would be by measuring how similar the vectors $\vec{v}_{d_m^y,a_1,m}$ and $\vec{v}_{d_m^y,a_2,m}$ are
(e.g., by evaluating the dot product of those vectors). However, those vectors occur on the
right-hand side of the inner-product equation, whereas the performance of the algorithms, which is after
all our ultimate concern, instead occurs on the left-hand side. This suggests measuring
the similarity of two algorithms not directly in terms of their vectors $\vec{v}_{d_m^y,a,m}$, but rather in
terms of the dot products of those vectors with $\vec{p}$. For example, it may be the case that
algorithms behave very similarly for certain $P(f)$ but are quite different for other $P(f)$. In
many respects, knowing this about two algorithms is of more interest than knowing how
their vectors $\vec{v}_{d_m^y,a,m}$ compare.
As another example of a similarity measure suggested by the geometric perspective,
we could measure similarity between algorithms based on similarities between $P(f)$'s. For
example, for two different algorithms, one can imagine solving for the $P(f)$ that optimizes
$P(d_m^y \mid m, a)$ for those algorithms, in some non-trivial sense. We could then use some
measure of distance between those two $P(f)$ distributions as a gauge of how similar the
associated algorithms are.
Unfortunately, exploiting the inner product formula in practice, by going from a $P(f)$
to an algorithm optimal for that $P(f)$, appears to often be quite difficult. Indeed, even
determining a plausible $P(f)$ for the situation at hand is often difficult. Consider, for
example, TSP problems with $N$ cities. To the degree that any practitioner attacks all
$N$-city TSP cost functions with the same algorithm, that practitioner implicitly ignores
distinctions between such cost functions. In this, that practitioner has implicitly agreed
that the problem is one of how their fixed algorithm does across the set of all $N$-city TSP
cost functions. However, the detailed nature of the $P(f)$ that is uniform over this class of
problems appears to be difficult to elucidate.
On the other hand, there is a growing body of work that does rely explicitly on enumeration
of $P(f)$. For example, applications of Markov random fields [Gri76, KS80] to cost
landscapes yield $P(f)$ directly as a Gibbs distribution.
Calculational applications of the NFL theorems

In this section we explore some of the applications of the NFL theorems for performing
calculations concerning optimization. We will consider calculations of both practical and
theoretical interest, and begin with calculations of theoretical interest, in which information-theoretic
quantities arise naturally.
Information-theoretic aspects of optimization

For expository purposes, we simplify the discussion slightly by considering only the histogram
of number of instances of each possible cost value produced by a run of an algorithm, and
not the temporal order in which those cost values were generated. (Essentially all real-world
performance measures are independent of such temporal information.) We indicate
that histogram with the symbol $\vec{c}$; $\vec{c}$ has $|Y|$ components $(c_{Y_1}, c_{Y_2}, \ldots, c_{Y_{|Y|}})$, where $c_i \equiv c_{Y_i}$ is the
number of times cost value $Y_i$ occurs in the sample $d_m^y$.
Now consider any question like the following: "What fraction of cost functions give a
particular histogram $\vec{c}$ of cost values after $m$ distinct cost evaluations produced by using a
particular instantiation of an evolutionary algorithm [FOW66, Hol93]?"
At first glance this seems to be an intractable question. However, it turns out that the
NFL theorem provides a way to answer it. This is because, according to the NFL theorem,
the answer must be independent of the algorithm used to generate $\vec{c}$. Consequently we
can choose an algorithm for which the calculation is tractable.

Footnote (to the discussion of solving for the $P(f)$ that optimizes $P(d_m^y \mid m, a)$): In particular, one may want to impose restrictions on $P(f)$. For instance, one may wish to only consider
$P(f)$ that are invariant under at least partial relabelling of the elements in $X$, to preclude there being an
algorithm that will assuredly "luck out" and land on $\min_{x \in X} f(x)$ on its very first query.
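Theorem: For any algorithm, the fraction of cost functions that result in a particular
histogram $\vec{c} = m\vec{\alpha}$ is
$$\rho_f(\vec{\alpha}) = \binom{m}{c_1\; c_2\; \cdots\; c_{|Y|}} \frac{|Y|^{|X|-m}}{|Y|^{|X|}} = \binom{m}{c_1\; c_2\; \cdots\; c_{|Y|}} \frac{1}{|Y|^{m}}.$$
For large enough $m$ this can be approximated as
$$\rho_f(\vec{\alpha}) \cong C(m, |Y|)\, \frac{\exp[m\, S(\vec{\alpha})]}{\prod_{i=1}^{|Y|} \alpha_i^{1/2}},$$
where $S(\vec{\alpha})$ is the entropy of the distribution $\vec{\alpha}$, and $C(m, |Y|)$ is a constant that does not
depend on $\vec{\alpha}$.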
This theorem is derived in Appendix C. If some of the $\alpha_i$ are 0, the approximation still holds,
only with $Y$ redefined to exclude the $y$'s corresponding to the zero-valued $\alpha_i$. However $Y$
is defined, the normalization constant of the approximation can be found by summing over all
$\vec{\alpha}$ lying on the unit simplex.
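The exact expression in the theorem above is easy to check numerically on a small case. The following sketch is an illustration under assumed toy sizes; the helper names `multinomial`, `rho_f_formula`, and `rho_f_bruteforce` are hypothetical. It follows the reasoning in the text: since the fraction cannot depend on the algorithm, the brute-force count uses the most convenient one, which simply visits the first $m$ points in order.

```python
from collections import Counter
from itertools import product
from math import factorial


def multinomial(counts):
    """Multinomial coefficient m! / (c_1! c_2! ... c_|Y|!) with m = sum(counts)."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)   # every intermediate quotient is an integer
    return total


def rho_f_formula(hist, n_costs):
    """Fraction of cost functions giving histogram `hist`, from the closed form."""
    m = sum(hist)
    return multinomial(hist) / n_costs ** m


def rho_f_bruteforce(hist, n_points, n_costs):
    """Same fraction by enumerating every f, using a fixed-order algorithm."""
    m = sum(hist)
    hits = total = 0
    for f in product(range(n_costs), repeat=n_points):
        total += 1
        seen = Counter(f[:m])   # costs observed at the first m points
        if tuple(seen.get(y, 0) for y in range(n_costs)) == tuple(hist):
            hits += 1
    return hits / total


print(rho_f_formula((2, 1, 0), n_costs=3))
print(rho_f_bruteforce((2, 1, 0), n_points=4, n_costs=3))
```

The brute-force enumeration grows as $|Y|^{|X|}$, so it is only practical for very small spaces; the closed form has no such limit.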
A question related to one addressed in this theorem is the following: "For a given cost
function, what is the fraction $\rho_{alg}$ of all algorithms that give rise to a particular $\vec{c}$?" It turns
out that the only feature of $f$ relevant for this question is the histogram of its cost values
formed by looking across all $X$. Specify the fractional form of this histogram by $\vec{\beta}$: there
are $N_i = \beta_i |X|$ points in $X$ for which $f(x)$ has the $i$'th $Y$ value.
In Appendix D it is shown that to leading order, $\rho_{alg}(\vec{\alpha}, \vec{\beta})$ depends on yet another
information-theoretic quantity, the Kullback-Leibler distance [CT91] between $\vec{\alpha}$ and $\vec{\beta}$:
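Theorem: For a given $f$ with histogram $\vec{N} = |X|\vec{\beta}$, the fraction of algorithms that give
rise to a histogram $\vec{c} = m\vec{\alpha}$ is given by
$$\rho_{alg}(\vec{\alpha}, \vec{\beta}) = \frac{\prod_{i=1}^{|Y|} \binom{N_i}{c_i}}{\binom{|X|}{m}}.$$
For large enough $m$ this can be written as
$$\rho_{alg}(\vec{\alpha}, \vec{\beta}) \cong C(m, |X|, |Y|)\, \frac{e^{-m\, D_{KL}(\vec{\alpha}, \vec{\beta})}}{\prod_{i=1}^{|Y|} \alpha_i^{1/2}},$$
where $D_{KL}(\vec{\alpha}, \vec{\beta})$ is the Kullback-Leibler distance between the distributions $\vec{\alpha}$ and $\vec{\beta}$.
As before, $C$ can be calculated by summing $\vec{\alpha}$ over the unit simplex.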
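The exact form of $\rho_{alg}$ can be evaluated directly from binomial coefficients. The sketch below is a hedged illustration: the function names (`rho_alg`, `kl`, `histograms`) and the toy histogram $\vec{N} = (3, 2, 1)$ are assumptions for the example. It computes the exact fraction, checks that it sums to one over all admissible histograms, and prints the Kullback-Leibler distance alongside for comparison with the leading-order behavior.

```python
from itertools import product
from math import comb, log


def rho_alg(c, N):
    """Exact fraction of algorithms giving histogram c when f has histogram N."""
    m, size_x = sum(c), sum(N)
    numerator = 1
    for n_i, c_i in zip(N, c):
        numerator *= comb(n_i, c_i)   # comb returns 0 when c_i > n_i
    return numerator / comb(size_x, m)


def kl(alpha, beta):
    """Kullback-Leibler distance; assumes beta_i > 0 wherever alpha_i > 0."""
    return sum(a * log(a / b) for a, b in zip(alpha, beta) if a > 0)


def histograms(m, N):
    """All histograms c with sum(c) = m and 0 <= c_i <= N_i."""
    for c in product(*(range(n + 1) for n in N)):
        if sum(c) == m:
            yield c


N = (3, 2, 1)   # assumed toy cost-function histogram, |X| = 6
m = 3
beta = [n / sum(N) for n in N]
for c in histograms(m, N):
    alpha = [c_i / m for c_i in c]
    print(c, round(rho_alg(c, N), 4), round(kl(alpha, beta), 4))
```

For such a small $m$ the asymptotic form is only a rough guide, so the printout is best read qualitatively: histograms whose proportions are far from $\vec{\beta}$ tend to have larger $D_{KL}$ and smaller $\rho_{alg}$.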



Measures of performance
We now show how to apply the NFL framework to calculate certain benchmark performance
measures. These allow both the programmatic (rather than ad hoc) assessment of the efficacy
of any individual optimization algorithm and principled comparisons between algorithms.
Without loss of generality, assume that the goal of the search process is finding a minimum.
So we are interested in the $\epsilon$-dependence of $P(\min(\vec{c}) > \epsilon \mid f, m, a)$, by which we mean
the probability that the minimum cost an algorithm $a$ finds on problem $f$ in $m$ distinct
evaluations is larger than $\epsilon$. At least three quantities related to this conditional probability
can be used to gauge an algorithm's performance in a particular optimization run:
i) The uniform average of $P(\min(\vec{c}) > \epsilon \mid f, m, a)$ over all cost functions.
ii) The form $P(\min(\vec{c}) > \epsilon \mid f, m, a)$ takes for the random algorithm, which uses no information
from the sample $d_m$.
iii) The fraction of algorithms which, for a particular $f$ and $m$, result in a $\vec{c}$ whose minimum
exceeds $\epsilon$.
These measures give benchmarks which any algorithm run on a particular cost function
should surpass if that algorithm is to be considered as having worked well for that cost
function.
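Without loss of generality, assume that the $i$'th cost value (i.e., $Y_i$) equals $i$. So cost values
run from a minimum of 1 to a maximum of $|Y|$, in integer increments. The following results
are derived in Appendix E:
Theorem:
$$\sum_f P(\min(\vec{c}) > \epsilon \mid f, m) = \omega^m(\epsilon),$$
where $\omega(\epsilon) \equiv 1 - \epsilon/|Y|$ is the fraction of cost lying above $\epsilon$. (The sum over $f$ is to be read as the uniform average of item (i), i.e., normalized by the number of cost functions.) In the limit of $|Y| \to \infty$, this
distribution obeys the following relationship:
$$\frac{\sum_f E(\min(\vec{c}) \mid f, m)}{|Y|} = \frac{1}{m+1}.$$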
Unless one's algorithm has its best-cost-so-far drop faster than the drop associated with
these results, one would be hard-pressed indeed to claim that the algorithm is well-suited to
the cost function at hand. After all, for such performance the algorithm is doing no better
than one would expect it to for a randomly chosen cost function.
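The first benchmark can be verified by enumeration on a toy problem. In the sketch below the helper names and the sizes are assumptions made for illustration, and the sum over $f$ is taken as a uniform average (divided by the number of cost functions). Because the result does not depend on the algorithm, the code uses the simplest deterministic one, which evaluates the first $m$ points in order.

```python
from itertools import product


def omega(eps, n_costs):
    """Fraction of cost values lying above eps when costs are 1, ..., n_costs."""
    return 1 - eps / n_costs


def average_prob_min_exceeds(eps, n_points, n_costs, m):
    """Uniform average over all f of the indicator [min of first m costs > eps]."""
    costs = range(1, n_costs + 1)
    all_f = list(product(costs, repeat=n_points))
    hits = sum(1 for f in all_f if min(f[:m]) > eps)
    return hits / len(all_f)


def expected_min_scaled(n_costs, m):
    """E[min] / |Y| under the uniform average, via E[min] = sum_{eps>=0} P(min > eps)."""
    return sum(omega(eps, n_costs) ** m for eps in range(n_costs)) / n_costs


print(average_prob_min_exceeds(eps=1, n_points=4, n_costs=3, m=2), omega(1, 3) ** 2)
print(expected_min_scaled(n_costs=1000, m=3), 1 / (3 + 1))
```

The second line printed illustrates the large-$|Y|$ limit: the scaled expected minimum approaches $1/(m+1)$, with a discrepancy that shrinks as $|Y|$ grows.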
Unlike the preceding measure, the measures analyzed below take into account the actual
cost function at hand. This is manifested in the dependence of the values of those measures
on the vector $\vec{N}$ given by the cost function's histogram ($\vec{N} = |X|\vec{\beta}$):

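Theorem: For the random algorithm $\tilde{a}$,
$$P(\min(\vec{c}) > \epsilon \mid f, m, \tilde{a}) = \prod_{i=0}^{m-1} \frac{\Omega(\epsilon) - i/|X|}{1 - i/|X|},$$
where $\Omega(\epsilon) \equiv \sum_{i=\epsilon+1}^{|Y|} N_i/|X|$ is the fraction of points in $X$ for which $f(x) > \epsilon$. To first order
in $1/|X|$,
$$P(\min(\vec{c}) > \epsilon \mid f, m, \tilde{a}) = \Omega^m(\epsilon) \left( 1 - \frac{m(m-1)\,(1 - \Omega(\epsilon))}{2\,\Omega(\epsilon)} \frac{1}{|X|} + \cdots \right).$$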
This result allows the calculation of other quantities of interest for measuring performance,
for example the quantity
$$E(\min(\vec{c}) \mid f, m, \tilde{a}) = \sum_{\epsilon=1}^{|Y|} \epsilon \left[ P(\min(\vec{c}) \ge \epsilon \mid f, m, \tilde{a}) - P(\min(\vec{c}) \ge \epsilon + 1 \mid f, m, \tilde{a}) \right].$$
Note that for many cost functions of both practical and theoretical interest, cost values are
distributed Gaussianly. For such cases, we can use that Gaussian nature of the distribution
to facilitate our calculations. In particular, if the mean and variance of the Gaussian are
$\mu$ and $\sigma^2$ respectively, then we have $\Omega(\epsilon) = \mathrm{erfc}\big((\epsilon - \mu)/(\sqrt{2}\,\sigma)\big)/2$, where erfc is the complementary
error function.
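A short numerical sketch of the random-algorithm benchmark follows. It is an illustration only: the function names and the example cost values are assumptions. The product over $i$ is evaluated with numerator and denominator both scaled by $|X|$, which leaves each factor unchanged, and the result is compared with a direct count of $m$-point subsets lying entirely above $\epsilon$.

```python
from math import comb


def prob_min_exceeds_random(f_values, eps, m):
    """P(min(c) > eps | f, m) for random search without revisiting points."""
    n = len(f_values)
    above = sum(1 for y in f_values if y > eps)   # Omega(eps) * |X|
    prob = 1.0
    for i in range(m):
        # (Omega - i/|X|) / (1 - i/|X|), with both parts multiplied by |X|
        prob *= max(above - i, 0) / (n - i)
    return prob


def expected_min_random(f_values, m):
    """E(min(c) | f, m) from differences of P(min >= eps); costs are integers >= 1."""
    total = 0.0
    for eps in range(1, max(f_values) + 1):
        p_ge = prob_min_exceeds_random(f_values, eps - 1, m)      # P(min >= eps)
        p_ge_next = prob_min_exceeds_random(f_values, eps, m)     # P(min >= eps + 1)
        total += eps * (p_ge - p_ge_next)
    return total


f_values = [1, 1, 2, 3, 3, 3]   # assumed toy cost function on |X| = 6 points
print(prob_min_exceeds_random(f_values, eps=1, m=2), comb(4, 2) / comb(6, 2))
print(expected_min_random(f_values, m=2))
```

In use, one would compare the best cost found by one's own algorithm after $m$ evaluations against these values for the same $f$ and $m$.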
To calculate the third performance measure, note that for fixed $f$ and $m$, for any (deterministic)
algorithm $a$, $P(\min(\vec{c}) > \epsilon \mid f, m, a)$ is either 1 or 0. Therefore the fraction of algorithms
which result in a $\vec{c}$ whose minimum exceeds $\epsilon$ is given by
$$\frac{\sum_a P(\min(\vec{c}) > \epsilon \mid f, m, a)}{\sum_a 1}.$$
Expanding in terms of $\vec{c}$, we can rewrite the numerator of this ratio as
$\sum_{\vec{c}} P(\min(\vec{c}) > \epsilon \mid \vec{c}) \sum_a P(\vec{c} \mid f, m, a)$. However, the ratio of this quantity to $\sum_a 1$ is exactly what was calculated
when we evaluated measure (ii) (see the beginning of the argument deriving the product formula for the random algorithm).
This establishes the following:
Theorem: For fixed $f$ and $m$, the fraction of algorithms which result in a $\vec{c}$ whose minimum
exceeds $\epsilon$ is given by the quantity on the right-hand sides of the two equations in the preceding theorem on the random algorithm.
As a particular example of applying this result, consider measuring the value of $\min(\vec{c})$
produced in a particular run of your algorithm. Then imagine that when it is evaluated for
$\epsilon$ equal to this value, the quantity given by that theorem is less than 1/2. In such a situation
the algorithm in question has performed worse than over half of all search algorithms, for
the $f$ and $m$ at hand; hardly a stirring endorsement.
None of the discussion above explicitly concerns the dynamics of an algorithm's performance
as $m$ increases. Many aspects of such dynamics may be of interest. As an example, let
us consider whether, as $m$ grows, there is any change in how well the algorithm's performance
compares to that of the random algorithm.
To this end, let the sample generated by the algorithm $a$ after $m$ steps be $d_m$, and define
$y' \equiv \min(d_m^y)$. Let $k$ be the number of additional steps it takes the algorithm to find an
$x$ such that $f(x) < y'$. Now we can estimate the number of steps it would have taken the
random search algorithm to search $X - d_m^x$ and find a point whose $y$ was less than $y'$. The
expected value of this number of steps is $1/z(d) - 1$, where $z(d)$ is the fraction of $X - d_m^x$
for which $f(x) < y'$. Therefore $k + 1 - 1/z(d)$ is how much worse $a$ did than would have the
random algorithm, on average.
Next imagine letting $a$ run for many steps over some fitness function $f$ and plotting how
well $a$ did in comparison to the random algorithm on that run, as $m$ increased. Consider
the step where $a$ finds its $n$'th new value of $\min(\vec{c})$. For that step, there is an associated $k$
(the number of steps until the next $\min(d_m^y)$) and $z(d)$. Accordingly, indicate that step on
our plot as the point $(n, k + 1 - 1/z(d))$. Put down as many points on our plot as there are
successive values of $\min(\vec{c}(d))$ in the run of $a$ over $f$.
If throughout the run $a$ is always a better match to $f$ than is the random search algorithm,
then all the points in the plot will have their ordinate values lie below 0. If the random
algorithm won for any of the comparisons though, that would mean a point lying above 0.
In general, even if the points all lie to one side of 0, one would expect that as the search
progresses there is corresponding (perhaps systematic) variation in how far away from 0 the
points lie. That variation tells one when the algorithm is entering harder or easier parts of
the search.
Note that even for a fixed $f$, by using different starting points for the algorithm one
could generate many of these plots and then superimpose them. This allows a plot of
the mean value of $k + 1 - 1/z(d)$ as a function of $n$, along with an associated error bar.
Similarly, one could replace the single number $z(d)$ characterizing the random algorithm
with a full distribution over the number of required steps to find a new minimum. In these
and similar ways, one can generate a more nuanced picture of an algorithm's performance
than is provided by any of the single numbers given by the performance measures discussed
above.
Minimax distinctions between algorithms

The NFL theorems do not directly address minimax properties of search. For example, say
we are considering two deterministic algorithms, $a_1$ and $a_2$. It may very well be that there
exist cost functions $f$ such that $a_1$'s histogram is much better (according to some appropriate
performance measure) than $a_2$'s, but no cost functions for which the reverse is true. For the
NFL theorem to be obeyed in such a scenario, it would have to be true that there are many
more $f$ for which $a_2$'s histogram is better than $a_1$'s than vice versa, but it is only slightly
better for all those $f$. For such a scenario, in a certain sense $a_1$ has better "head-to-head"
minimax behavior than $a_2$; there are $f$ for which $a_1$ beats $a_2$ badly, but none for which $a_1$
does substantially worse than $a_2$.



Formally, we say that there exist head-to-head minimax distinctions between two algorithms
$a_1$ and $a_2$ iff there exists a $k$ such that for at least one cost function $f$, the difference
$E(\vec{c} \mid f, m, a_1) - E(\vec{c} \mid f, m, a_2) = k$, but there is no other $f$ for which
$E(\vec{c} \mid f, m, a_2) - E(\vec{c} \mid f, m, a_1) = k$. (A similar definition can be used if one is instead interested in $\Phi(\vec{c})$ or $d_m^y$
rather than $\vec{c}$.)
It appears that analyzing head-to-head minimax properties of algorithms is substantially
more difficult than analyzing average behavior (like in the NFL theorem). Presently, very
little is known about minimax behavior involving stochastic algorithms. In particular, it is
not known if there are any senses in which a stochastic version of a deterministic algorithm
has better/worse minimax behavior than that deterministic algorithm. In fact, even if we
stick completely to deterministic algorithms, only an extremely preliminary understanding
of minimax issues has been reached.
What we do know is the following. Consider the quantity
$$\sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z, z' \mid f, m, a_1, a_2)$$
for deterministic algorithms $a_1$ and $a_2$. (By $P_A(a)$ is meant the distribution of a random
variable $A$ evaluated at $A = a$.) For deterministic algorithms, this quantity is just the
number of $f$ such that it is both true that $a_1$ produces a population with $Y$ components $z$
and that $a_2$ produces a population with $Y$ components $z'$.
In Appendix F, it is proven by example that this quantity need not be symmetric under
interchange of $z$ and $z'$:
Theorem: In general,
$$\sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z, z' \mid f, m, a_1, a_2) \ne \sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z', z \mid f, m, a_1, a_2).$$
This means that under certain circumstances, even knowing only the $Y$ components of the
populations produced by two algorithms run on the same (unknown) $f$, we can infer something
concerning what algorithm produced each population.
Now consider the quantity
$$\sum_f P_{C_1, C_2}(z, z' \mid f, m, a_1, a_2),$$
again for deterministic algorithms $a_1$ and $a_2$. This quantity is just the number of $f$ such that
it is both true that $a_1$ produces a histogram $z$ and that $a_2$ produces a histogram $z'$. It too
need not be symmetric under interchange of $z$ and $z'$ (see Appendix F). This is a stronger
statement than the asymmetry of the $d^y$ statement, since any particular histogram corresponds
to multiple populations.
It would seem that neither of these two results directly implies that there are algorithms
$a_1$ and $a_2$ such that for some $f$ $a_1$'s histogram is much better than $a_2$'s, but for no $f$ is the
reverse true. To investigate this problem involves looking over all pairs of histograms (one
pair for each $f$) such that there is the same relationship between (the performances of the
algorithms, as reflected in) the histograms. Simply having an inequality between the sums
presented above does not seem to directly imply that the relative performances between
the associated pair of histograms is asymmetric. (To formally establish this would involve
creating scenarios in which there is an inequality between the sums, but no head-to-head
minimax distinctions. Such an analysis is beyond the scope of this paper.)
On the other hand, having the sums equal does carry obvious implications for whether
there are head-to-head minimax distinctions. For example, if both algorithms are deterministic,
then for any particular $f$, $P_{d_{m,1}^y,\, d_{m,2}^y}(z, z' \mid f, m, a_1, a_2)$ equals 1 for one $(z, z')$ pair, and 0
for all others. In such a case, $\sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z, z' \mid f, m, a_1, a_2)$ is just the number of $f$ that
result in the pair $(z, z')$. So
$\sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z, z' \mid f, m, a_1, a_2) = \sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z', z \mid f, m, a_1, a_2)$
implies that there are no head-to-head minimax distinctions between $a_1$ and $a_2$. The converse
does not appear to hold, however.
As a preliminary analysis of whether there can be head-to-head minimax distinctions, we
can exploit the result in Appendix F, which concerns the case where $|X| = |Y| = 3$. First,
define the following performance measures of two-element populations, $Q(d_2^y)$:
i) $Q(y_2, y_3) = Q(y_3, y_2) = 2$.
ii) $Q(y_1, y_2) = Q(y_2, y_1) = 0$.
iii) $Q$ of any other argument $= 1$.
In Appendix F we show that for this scenario there exist pairs of algorithms $a_1$ and $a_2$ such
that for one $f$, $a_1$ generates the histogram $\{y_1, y_2\}$ and $a_2$ generates the histogram $\{y_2, y_3\}$,
but there is no $f$ for which the reverse occurs (i.e., there is no $f$ such that $a_1$ generates the
histogram $\{y_2, y_3\}$ and $a_2$ generates $\{y_1, y_2\}$).

So in this scenario, with our defined performance measure, there are minimax distinctions
between $a_1$ and $a_2$. For one $f$ the performance measures of algorithms $a_1$ and $a_2$ are
respectively 0 and 2. The difference in the $Q$ values for the two algorithms is 2 for that $f$.
However, there are no other $f$ for which the difference is $-2$. For this $Q$ then, algorithm $a_2$ is
minimax superior to algorithm $a_1$.

It is not currently known what restrictions on $Q(d_m^y)$ are needed for there to be minimax
distinctions between the algorithms. As an example, it may well be that for
$Q(d_m^y) = \min_i \{ d_m^y(i) \}$ there are no minimax distinctions between algorithms.
More generally, at present nothing is known about "how big a problem" these kinds of
asymmetries are. All of the examples of asymmetry considered here arise when the set of
$X$ values $a_1$ has visited overlaps with those that $a_2$ has visited. Given such overlap, and
certain properties of how the algorithms generated the overlap, asymmetry arises. A precise
specification of those "certain properties" is not yet in hand. Nor is it known how generic
they are, i.e., for what percentage of pairs of algorithms they arise. Although such issues are
easy to state (see Appendix F), it is not at all clear how best to answer them.

Footnote (to the remark that the converse does not appear to hold): Consider the grid of all $(z, z')$ pairs. Assign to each grid point the number of $f$ that result in that grid
point's $(z, z')$ pair. Then our constraints are i) by the hypothesis that there are no head-to-head minimax
distinctions, if grid point $(z_1, z_2)$ is assigned a non-zero number, then so is $(z_2, z_1)$; and ii) by the no-free-lunch
theorem, the sum of all numbers in row $z$ equals the sum of all numbers in column $z$. These two
constraints do not appear to imply that the distribution of numbers is symmetric under interchange of rows
and columns. Although again, like before, to formally establish this point would involve explicitly creating
search scenarios in which it holds.
However, consider the case where we are assured that in $m$ steps the populations of two
particular algorithms have not overlapped. Such assurances hold, for example, if we are
comparing two hill-climbing algorithms that start far apart (on the scale of $m$) in $X$. It
turns out that given such assurances, there are no asymmetries between the two algorithms
for $m$-element populations. To see this formally, go through the argument used to prove
the NFL theorem, but apply that argument to the quantity $\sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z, z' \mid f, m, a_1, a_2)$
rather than $P(\vec{c} \mid f, m, a)$. Doing this establishes the following:
Theorem: If there is no overlap between $d_{m,1}^x$ and $d_{m,2}^x$, then
$$\sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z, z' \mid f, m, a_1, a_2) = \sum_f P_{d_{m,1}^y,\, d_{m,2}^y}(z', z \mid f, m, a_1, a_2).$$
An immediate consequence of this theorem is that under the no-overlap conditions, the
quantity $\sum_f P_{C_1, C_2}(z, z' \mid f, m, a_1, a_2)$ is symmetric under interchange of $z$ and $z'$, as are
all distributions determined from this one over $C_1$ and $C_2$ (e.g., the distribution over the
difference between those $C$'s extrema).
Note that with stochastic algorithms, if they give non-zero probability to all $d_m^x$, there
is always overlap to consider. So there is always the possibility of asymmetry between
algorithms if one of them is stochastic.
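Both the possible asymmetry and the no-overlap symmetry can be observed by counting cost functions on a tiny search space. The following sketch is an illustration only: the three algorithms are hypothetical choices made for this example (they are not the construction used in Appendix F), and cost values are labeled 0, 1, 2 rather than $y_1, y_2, y_3$. For each pair $(z, z')$ it counts the $f$ on which the first algorithm yields ordered costs $z$ and the second yields $z'$.

```python
from collections import Counter
from itertools import product


def run(algorithm, f, m):
    """Run a deterministic, non-retracing algorithm; return its ordered costs d_m^y."""
    xs, ys = [], []
    for _ in range(m):
        x = algorithm(xs, ys)
        xs.append(x)
        ys.append(f[x])
    return tuple(ys)


def pair_counts(a1, a2, n_points, n_costs, m):
    """For each (z, z'), count the f on which a1 yields z and a2 yields z'."""
    counts = Counter()
    for f in product(range(n_costs), repeat=n_points):
        counts[(run(a1, f, m), run(a2, f, m))] += 1
    return counts


def is_symmetric(counts):
    """True if swapping z and z' leaves every count unchanged."""
    return all(counts[(zp, z)] == n for (z, zp), n in list(counts.items()))


def sweep_up(xs, ys):
    """Hypothetical algorithm: visit points 0, 1, ... in order."""
    return len(xs)


def sweep_down_from_3(xs, ys):
    """Hypothetical algorithm: visit points 3, 2, ... in order."""
    return 3 - len(xs)


def branch_from_1(xs, ys):
    """Hypothetical algorithm: visit 1 first, then 0 if that cost was 0, else 2."""
    if not xs:
        return 1
    return 0 if ys[0] == 0 else 2


# Overlapping visits on |X| = |Y| = 3: the counts need not be symmetric.
overlap = pair_counts(sweep_up, branch_from_1, n_points=3, n_costs=3, m=2)
print(overlap[((1, 0), (0, 1))], overlap[((0, 1), (1, 0))])

# Disjoint visits on |X| = 4: the counts are symmetric.
disjoint = pair_counts(sweep_up, sweep_down_from_3, n_points=4, n_costs=3, m=2)
print(is_symmetric(overlap), is_symmetric(disjoint))
```

In the overlapping case the adaptive algorithm sometimes revisits a point already seen by the other, which is what allows the counts for $(z, z')$ and $(z', z)$ to differ; when the visited sets are disjoint the counts coincide, as the no-overlap theorem states.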
$P(f)$-independent results

All work to this point has largely considered the behavior of various algorithms across a wide
range of problems. In this section we introduce the kinds of results that can be obtained
when we reverse roles and consider the properties of many algorithms on a single problem.
More results of this type are found in [MW96]. The results of this section, although less
sweeping than the NFL results, hold no matter what the real world's distribution over cost
functions is.

Let $a$ and $a'$ be two search algorithms. Define a "choosing procedure" as a rule that
examines the samples $d_m$ and $d'_m$, produced by $a$ and $a'$ respectively, and based on those
populations, decides to use either $a$ or $a'$ for the subsequent part of the search. As an
example, one "rational" choosing procedure is to use $a$ for the subsequent part of the search
if and only if it has generated a lower cost value in its sample than has $a'$. Conversely, we
can consider an "irrational" choosing procedure that goes with the algorithm that had not
generated the sample with the lowest cost solution.
At the point that a choosing procedure takes effect, the cost function will have been
sampled at $d_\cup \equiv d_m \cup d'_m$. Accordingly, if $d_{>m}$ refers to the samples of the cost function that
come after using the choosing algorithm, then the user is interested in the remaining sample
$d_{>m}$. As always, without loss of generality it is assumed that the search algorithm chosen
by the choosing procedure does not return to any points in $d_\cup$.

The following theorem, proven in Appendix G, establishes that there is no a priori
justification for using any particular choosing procedure. Loosely speaking, no matter what
the cost function, without special consideration of the algorithm at hand, simply observing
how well that algorithm has done so far tells us nothing a priori about how well it would do
if we continue to use it on the same cost function. For simplicity, in stating the result we
only consider deterministic algorithms.

Theorem: Let $d_m$ and $d'_m$ be two fixed samples of size $m$ that are generated when the
algorithms $a$ and $a'$ respectively are run on the (arbitrary) cost function at hand. Let $A$ and
$B$ be two different choosing procedures. Let $k$ be the number of elements in $\vec{c}_{>m}$. Then
$$\sum_{a, a'} P(\vec{c}_{>m} \mid f, d, d', k, a, a', A) = \sum_{a, a'} P(\vec{c}_{>m} \mid f, d, d', k, a, a', B).$$

Implicit in this result is the assumption that the sum excludes those algorithms $a$ and $a'$
that do not result in $d$ and $d'$ respectively when run on $f$.
In the precise form in which it is presented above, the result may appear misleading, since it
treats all populations equally, when for any given $f$ some populations will be more likely
than others. However, even if one weights populations according to their probability of
occurrence, it is still true that, on average, the choosing procedure one uses has no effect on
likely $\vec{c}_{>m}$. This is established by the following result, proven in Appendix H:
Theorem: Under the conditions given in the preceding theorem,
$$\sum_{a, a'} P(\vec{c}_{>m} \mid f, m, k, a, a', A) = \sum_{a, a'} P(\vec{c}_{>m} \mid f, m, k, a, a', B).$$
These results show that no assumption for $P(f)$ alone justifies using some choosing
procedure as far as subsequent search is concerned. To have an intelligent choosing procedure,
one must take into account not only $P(f)$ but also the search algorithms one is choosing
among. This conclusion may be surprising. In particular, note that it means that there is no
intrinsic advantage to using a rational choosing procedure, which continues with the better
of $a$ and $a'$, rather than using an irrational choosing procedure which does the opposite.
These results also have interesting implications for degenerate choosing procedures
$A \equiv$ {always use algorithm $a$} and $B \equiv$ {always use algorithm $a'$}. As applied to this case, they
mean that for fixed $f_1$ and $f_2$, if $f_1$ does better (on average) with the algorithms in some set
$\mathcal{A}$, then $f_2$ does better (on average) with the algorithms in the set of all other algorithms.
In particular, if for some favorite algorithms a certain "well-behaved" $f$ results in better
performance than does the random $f$, then that well-behaved $f$ gives worse than random
behavior on the set of all remaining algorithms. In this sense, just as there are no universally
efficacious search algorithms, there are no universally benign $f$ which can be assured of
resulting in better than random performance regardless of one's algorithm.

Footnote (to the assumption that the chosen algorithm does not return to any points in $d_\cup$): $a$ can know to avoid the elements it has seen before. However, a priori, $a$ has no way to avoid the elements
it hasn't seen yet but that $a'$ has (and vice versa). Rather than have the definition of $a$ somehow depend
on the elements in $d' - d$ (and similarly for $a'$), we deal with this problem by defining $\vec{c}_{>m}$ to be set only by
those elements in $d_{>m}$ that lie outside of $d_\cup$. (This is similar to the convention we exploited above to deal
with potentially retracing algorithms.) Formally, this means that the random variable $\vec{c}_{>m}$ is a function of
$d_\cup$ as well as of $d_{>m}$. It also means there may be fewer elements in the histogram $\vec{c}_{>m}$ than there are in the
population $d_{>m}$.
In fact, things may very well be worse than this. In supervised learning, there is a
related result [Wol96a]. Translated into the current context, that result suggests that if one
restricts our sums to only be over those algorithms that are a good match to $P(f)$, then it is
often the case that "stupid" choosing procedures, like the irrational procedure of choosing
the algorithm with the less desirable $\vec{c}$, outperform "intelligent" ones. What the set of
algorithms summed over must be for a rational choosing procedure to be superior to an
irrational one is not currently known.
Conclusions

A framework has been presented in which to compare general-purpose optimization algorithms.
A number of NFL theorems were derived that demonstrate the danger of comparing
algorithms by their performance on a small sample of problems. These same results also
indicate the importance of incorporating problem-specific knowledge into the behavior of the
algorithm. A geometric interpretation was given showing what it means for an algorithm to
be well-suited to solving a certain class of problems. The geometric perspective also suggests
a number of measures to compare the similarity of various optimization algorithms.
More direct calculational applications of the NFL theorem were demonstrated by investigating
certain information-theoretic aspects of search, as well as by developing a number
of benchmark measures of algorithm performance. These benchmark measures should prove
useful in practice.
We provided an analysis of the ways that algorithms can differ a priori despite the
NFL theorems. We have also provided an introduction to a variant of the framework that
focuses on the behavior of a range of algorithms on specific problems (rather than specific
algorithms over a range of problems). This variant leads directly to reconsideration of many
issues addressed by computational complexity, as detailed in [MW96].
Much future work clearly remains; the reader is directed to [WM95] for a list of some
of it. Most important is the development of practical applications of these ideas. Can the
geometric viewpoint be used to construct new optimization techniques in practice? We believe
the answer to be yes. At a minimum, as Markov random field models of landscapes become
more widespread, the approach embodied in this paper should find wider applicability.
Acknowledgments

We would like to thank Raja Das, David Fogel, Tal Grossman, Paul Helman, Bennett Levitan,
Una-May O'Reilly and the reviewers for helpful comments and suggestions. WGM
thanks the Santa Fe Institute for funding and DHW thanks the Santa Fe Institute and TXN
Inc. for support.
References

[CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.
[FOW66] L. J. Fogel, A. J. Owens, and M. J. Walsh. Artificial Intelligence through Simulated Evolution. Wiley, New York, 1966.
[Glo89] F. Glover. ORSA J. Comput., 1:190, 1989.
[Glo90] F. Glover. ORSA J. Comput., 2:4, 1990.
[Gri76] D. Griffeath. Introduction to random fields. Springer-Verlag, New York, 1976.
[Hol93] J. H. Holland. Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, MA, 1993.
[KGV83] S. Kirkpatrick, D. C. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671, 1983.
[KS80] R. Kindermann and J. L. Snell. Markov Random Fields and Their Applications. American Mathematical Society, Providence, 1980.
[LW66] E. L. Lawler and D. E. Wood. Operations Research, 14:699-719, 1966.
[MW96] W. G. Macready and D. H. Wolpert. What makes an optimization problem? Complexity, 5:40-46, 1996.
[WM95] D. H. Wolpert and W. G. Macready. No free lunch theorems for search. Technical Report SFI-TR-95-02-010, ftp://ftp.santafe.edu/pub/dhw_ftp/nfl.search.TR.ps.Z, Santa Fe Institute, 1995.
[Wol96a] D. H. Wolpert. The lack of a priori distinctions between learning algorithms and the existence of a priori distinctions between learning algorithms. Neural Computation, 1996.
[Wol96b] D. H. Wolpert. On bias plus variance. Neural Computation, in press, 1996.
A NFL proof for static cost functions

We show that $\sum_f P(\vec{c} \mid f, m, a)$ has no dependence on $a$. Conceptually, the proof is quite
simple, but necessary bookkeeping complicates things, lengthening the proof considerably.
The intuition behind the proof is quite simple though: by summing over all $f$ we ensure that
the past performance of an algorithm has no bearing on its future performance. Accordingly,
under such a sum, all algorithms perform equally.
The proof is by induction. The induction is based on $m = 1$, and the inductive step is
based on breaking $f$ into two independent parts, one for $x \in d_m^x$ and one for $x \notin d_m^x$. These
are evaluated separately, giving the desired result.
For $m = 1$ we write the sample as $d_1 = \{d^x_1, f(d^x_1)\}$, where $d^x_1$ is set by $a$. The only possible value for $d^y_1$ is $f(d^x_1)$, so we have

$$\sum_f P(d^y_1 \mid f, m = 1, a) = \sum_f \delta(d^y_1, f(d^x_1)),$$

where $\delta$ is the Kronecker delta function.

Summing over all possible cost functions, $\delta(d^y_1, f(d^x_1))$ is 1 only for those functions that have cost $d^y_1$ at point $d^x_1$. Therefore that sum equals $|Y|^{|X|-1}$, independent of $d^x_1$:

$$\sum_f P(d^y_1 \mid f, m = 1, a) = |Y|^{|X|-1},$$

which is independent of $a$. This bases the induction.
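The inductive step requires that if $\sum_f P(d^y_m \mid f, m, a)$ is independent of $a$ for all $d^y_m$, then so also is $\sum_f P(d^y_{m+1} \mid f, m+1, a)$. Establishing this step completes the proof.

We begin by writing

$$\begin{aligned}
P(d^y_{m+1} \mid f, m+1, a) &= P(\{d^y_{m+1}(1), \ldots, d^y_{m+1}(m)\}, d^y_{m+1}(m+1) \mid f, m+1, a) \\
&= P(d^y_m, d^y_{m+1}(m+1) \mid f, m+1, a) \\
&= P(d^y_{m+1}(m+1) \mid d^y_m, f, m+1, a)\, P(d^y_m \mid f, m+1, a),
\end{aligned}$$

and thus

$$\sum_f P(d^y_{m+1} \mid f, m+1, a) = \sum_f P(d^y_{m+1}(m+1) \mid d^y_m, f, m+1, a)\, P(d^y_m \mid f, m+1, a).$$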
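The new $y$ value, $d^y_{m+1}(m+1)$, will depend on the new $x$ value, on $f$, and on nothing else. So we expand over the possible $x$ values, obtaining

$$\begin{aligned}
\sum_f P(d^y_{m+1} \mid f, m+1, a) &= \sum_{f, x} P(d^y_{m+1}(m+1) \mid f, x)\, P(x \mid d^y_m, f, m+1, a)\, P(d^y_m \mid f, m+1, a) \\
&= \sum_{f, x} \delta(d^y_{m+1}(m+1), f(x))\, P(x \mid d^y_m, f, m+1, a)\, P(d^y_m \mid f, m+1, a).
\end{aligned}$$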
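Next note that since $x = a(d^x_m, d^y_m)$, it does not depend directly on $f$. Consequently we expand in $d^x_m$ to remove the $f$ dependence in $P(x \mid d^y_m, f, m+1, a)$:

$$\begin{aligned}
\sum_f P(d^y_{m+1} \mid f, m+1, a) &= \sum_{f, d^x_m} \sum_x \delta(d^y_{m+1}(m+1), f(x))\, P(x \mid d_m, a)\, P(d^x_m \mid d^y_m, f, m+1, a)\, P(d^y_m \mid f, m+1, a) \\
&= \sum_{f, d^x_m} \delta(d^y_{m+1}(m+1), f(a(d_m)))\, P(d_m \mid f, m, a),
\end{aligned}$$

where use was made of the fact that $P(x \mid d_m, a) = \delta(x, a(d_m))$ and the fact that $P(d_m \mid f, m+1, a) = P(d_m \mid f, m, a)$.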
The sum over cost functions $f$ is done first. The cost function is defined both over those points restricted to $d^x_m$ and over those points outside of $d^x_m$. $P(d_m \mid f, m, a)$ will depend on the $f$ values defined over points inside $d^x_m$, while $\delta(d^y_{m+1}(m+1), f(a(d_m)))$ depends only on the $f$ values defined over points outside $d^x_m$. (Recall that $a(d_m) \notin d^x_m$.) So we have

$$\sum_f P(d^y_{m+1} \mid f, m+1, a) = \sum_{d^x_m} \sum_{f(x \in d^x_m)} P(d_m \mid f, m, a) \sum_{f(x \notin d^x_m)} \delta(d^y_{m+1}(m+1), f(a(d_m))).$$
The sum $\sum_{f(x \notin d^x_m)}$ contributes a constant, $|Y|^{|X|-m-1}$, equal to the number of functions defined over points not in $d^x_m$ that take the value $d^y_{m+1}(m+1)$ at the point $a(d_m)$. So

$$\begin{aligned}
\sum_f P(d^y_{m+1} \mid f, m+1, a) &= |Y|^{|X|-m-1} \sum_{d^x_m} \sum_{f(x \in d^x_m)} P(d_m \mid f, m, a) \\
&= \frac{1}{|Y|} \sum_{f, d^x_m} P(d_m \mid f, m, a) \\
&= \frac{1}{|Y|} \sum_f P(d^y_m \mid f, m, a).
\end{aligned}$$

By hypothesis, the right-hand side of this equation is independent of $a$, so the left-hand side must also be. This completes the proof.
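As an illustrative check of the static result (not part of the original proof), the following Python sketch enumerates every cost function on a small search space and tallies the samples $d^y_m$ produced by two deterministic, non-revisiting algorithms. The sizes of `X` and `Y`, the value of `M`, and both algorithms (`sweep` and `adaptive`) are assumptions made up for the example.

```python
from collections import Counter
from itertools import product

X = range(4)   # toy search space (assumed size)
Y = range(3)   # toy set of cost values (assumed size)
M = 3          # number of distinct points sampled


def run(algorithm, f, m):
    """Run a deterministic, non-revisiting algorithm on f and return d_m^y."""
    xs, ys = [], []
    for _ in range(m):
        x = algorithm(xs, ys)
        assert x not in xs, "algorithms must not revisit points"
        xs.append(x)
        ys.append(f[x])
    return tuple(ys)


def sweep(xs, ys):
    """Visit points in increasing order, ignoring the costs seen so far."""
    return min(x for x in X if x not in xs)


def adaptive(xs, ys):
    """Choose the next point from the last observed cost (a made-up rule)."""
    unvisited = [x for x in X if x not in xs]
    if not ys:
        return unvisited[-1]
    return unvisited[ys[-1] % len(unvisited)]


def sample_counts(algorithm, m=M):
    """Count, over all |Y|^|X| cost functions, how often each d_m^y occurs."""
    counts = Counter()
    for f in product(Y, repeat=len(X)):  # f[x] is the cost at point x
        counts[run(algorithm, f, m)] += 1
    return counts


counts_sweep = sample_counts(sweep)
counts_adaptive = sample_counts(adaptive)
```

Under these assumptions both tallies coincide, and every one of the $|Y|^m$ possible samples occurs for exactly $|Y|^{|X|-m}$ cost functions, which is what the induction above predicts.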
B NFL proof for time-dependent cost functions

In analogy with the proof of the static NFL theorem, the proof for the time-dependent case proceeds by establishing the $a$-independence of the sum $\sum_T P(c \mid f_1, T, m, a)$, where here $c$ is either $d^y_m$ or $D^y_m$.

To begin, replace each $T$ in this sum with a set of cost functions, $f_i$, one for each iteration of the algorithm. To do this, we start with the following:

$$\begin{aligned}
\sum_T P(c \mid f_1, T, m, a) &= \sum_T \sum_{d^x_m} \sum_{f_2 \cdots f_m} P(c \mid \vec{f}, d^x_m)\, P(f_2 \cdots f_m, d^x_m \mid f_1, T, m, a) \\
&= \sum_{d^x_m} \sum_{f_2 \cdots f_m} P(c \mid \vec{f}, d^x_m)\, P(d^x_m \mid \vec{f}, m, a) \sum_T P(f_2 \cdots f_m \mid f_1, T, m),
\end{aligned}$$
where the sequence of cost functions, $f_i$, has been indicated by the vector $\vec{f} = (f_1, \ldots, f_m)$. In the next step, the sum over all possible $T$ is decomposed into a series of sums. Each sum in the series is over the values $T$ can take for one particular iteration of the algorithm. More formally, using $f_{i+1} = T_i(f_i)$, we write

$$\begin{aligned}
\sum_T P(c \mid f_1, T, m, a) = {} & \sum_{d^x_m} \sum_{f_2 \cdots f_m} P(c \mid \vec{f}, d^x_m)\, P(d^x_m \mid \vec{f}, m, a) \\
& \times \sum_{T_1} \delta(f_2, T_1(f_1)) \cdots \sum_{T_{m-1}} \delta(f_m, T_{m-1}(T_{m-2}(\cdots T_1(f_1)))).
\end{aligned}$$
Note that $\sum_T P(c \mid f_1, T, m, a)$ is independent of the values of $T_{i > m-1}$, so those values can be absorbed into an overall $a$-independent proportionality constant.

Consider the innermost sum over $T_{m-1}$, for fixed values of the outer sum indices $T_1, \ldots, T_{m-2}$. For fixed values of the outer indices, $T_{m-2}(T_{m-3}(\cdots T_1(f_1)))$ is just a particular fixed cost function. Accordingly, the innermost sum over $T_{m-1}$ is simply the number of bijections of $F$ that map that fixed cost function to $f_m$. This is the constant $(|F| - 1)!$. Consequently, evaluating the $T_{m-1}$ sum yields

$$\begin{aligned}
\sum_T P(c \mid f_1, T, m, a) \propto {} & \sum_{d^x_m} \sum_{f_2 \cdots f_m} P(c \mid \vec{f}, d^x_m)\, P(d^x_m \mid \vec{f}, m, a) \\
& \times \sum_{T_1} \delta(f_2, T_1(f_1)) \cdots \sum_{T_{m-2}} \delta(f_{m-1}, T_{m-2}(T_{m-3}(\cdots T_1(f_1)))).
\end{aligned}$$
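The sum over $T_{m-2}$ can be accomplished in the same manner as the sum over $T_{m-1}$. In fact, all the sums over all $T_i$ can be done, leaving

$$\begin{aligned}
\sum_T P(c \mid f_1, T, m, a) &\propto \sum_{d^x_m} \sum_{f_2 \cdots f_m} P(c \mid \vec{f}, d^x_m)\, P(d^x_m \mid \vec{f}, m, a) \\
&= \sum_{d^x_m} \sum_{f_2 \cdots f_m} P(c \mid \vec{f}, d^x_m)\, P(d^x_m \mid f_1 \cdots f_{m-1}, m, a).
\end{aligned}$$

In this last step the statistical independence of $d^x_m$ and $f_m$ has been used: the $m$ points in $d^x_m$ are all chosen before any cost value of $f_m$ is observed.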
Further progress depends on whether $c$ represents $d^y_m$ or $D^y_m$. We begin with analysis of the $D^y_m$ case. For this case $P(c \mid \vec{f}, d^x_m) = P(D^y_m \mid f_m, d^x_m)$, since $D^y_m$ only reflects cost values from the last cost function, $f_m$. Using this result gives

$$\sum_T P(D^y_m \mid f_1, T, m, a) \propto \sum_{d^x_m} \sum_{f_2 \cdots f_{m-1}} P(d^x_m \mid f_1 \cdots f_{m-1}, m, a) \sum_{f_m} P(D^y_m \mid f_m, d^x_m).$$

The final sum over $f_m$ is a constant, equal to the number of ways of generating the sample $D^y_m$ from cost values drawn from $f_m$. The important point is that it is independent of the particular $d^x_m$. Because of this, the sum over $d^x_m$ can be evaluated, eliminating the $a$ dependence:

$$\sum_T P(D^y_m \mid f_1, T, m, a) \propto \sum_{f_2 \cdots f_{m-1}} \sum_{d^x_m} P(d^x_m \mid f_1 \cdots f_{m-1}, m, a).$$

The inner sum over $d^x_m$ equals 1 for every choice of $f_2, \ldots, f_{m-1}$, so the right-hand side does not depend on $a$. This completes the proof of the theorem for the case of $D^y_m$.
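The proof of the theorem is completed by turning to the $d^y_m$ case. This is considerably more difficult, since $P(c \mid \vec{f}, d^x_m)$ cannot be simplified so that the sums over the $f_i$ can be decoupled. Nevertheless, the NFL result still holds. This is proven by expanding over possible $d^y_m$ values the general expression obtained above for $\sum_T P(c \mid f_1, T, m, a)$:

$$\begin{aligned}
\sum_T P(c \mid f_1, T, m, a) &\propto \sum_{d^y_m} P(c \mid d^y_m) \sum_{d^x_m} \sum_{f_2 \cdots f_m} P(d^y_m \mid \vec{f}, d^x_m)\, P(d^x_m \mid f_1 \cdots f_{m-1}, m, a) \\
&= \sum_{d^y_m} P(c \mid d^y_m) \sum_{d^x_m} \sum_{f_2 \cdots f_m} P(d^x_m \mid f_1 \cdots f_{m-1}, m, a) \prod_{i=1}^{m} \delta(d^y_m(i), f_i(d^x_m(i))).
\end{aligned}$$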
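The innermost sum over $f_m$ only has an effect on the $\delta(d^y_m(i), f_i(d^x_m(i)))$ term with $i = m$, so it contributes $\sum_{f_m} \delta(d^y_m(m), f_m(d^x_m(m)))$. This is a constant, equal to $|Y|^{|X|-1}$. This leaves

$$\sum_T P(c \mid f_1, T, m, a) \propto \sum_{d^y_m} P(c \mid d^y_m) \sum_{d^x_m} \sum_{f_2 \cdots f_{m-1}} P(d^x_m \mid f_1 \cdots f_{m-1}, m, a) \prod_{i=1}^{m-1} \delta(d^y_m(i), f_i(d^x_m(i))).$$

The sum over $d^x_m(m)$ is now simple:

$$\begin{aligned}
\sum_T P(c \mid f_1, T, m, a) \propto {} & \sum_{d^y_m} P(c \mid d^y_m) \sum_{d^x_m(1)} \cdots \sum_{d^x_m(m-1)} \sum_{f_2 \cdots f_{m-1}} P(d^x_{m-1} \mid f_1 \cdots f_{m-2}, m-1, a) \\
& \times \prod_{i=1}^{m-1} \delta(d^y_m(i), f_i(d^x_m(i))).
\end{aligned}$$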
The above equation is of the same form as the expression obtained before the sum over $f_m$ was evaluated, only with a remaining population of size $m - 1$ rather than $m$. Consequently, in a manner analogous to the scheme used to evaluate the sums over $f_m$ and $d^x_m(m)$, the sums over $f_{m-1}$ and $d^x_m(m-1)$ can be evaluated. Doing so simply generates more $a$-independent proportionality constants. Continuing in this manner, all sums over the $f_i$ can be evaluated to find

$$\sum_T P(c \mid f_1, T, m, a) \propto \sum_{d^y_m} P(c \mid d^y_m) \sum_{d^x_m(1)} P(d^x_m(1) \mid m, a)\, \delta(d^y_m(1), f_1(d^x_m(1))).$$

There is algorithm-dependence in this result, but it is the trivial dependence discussed previously. It arises from how the algorithm selects the first $x$ point in its population, $d^x_m(1)$. Restricting interest to those points in the sample that are generated subsequent to the first, this result shows that there are no distinctions between algorithms. Alternatively, summing over the initial cost function $f_1$, all points in the sample could be considered while still retaining an NFL result.
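The counting fact used repeatedly in this appendix, that the number of bijections of $F$ mapping one fixed cost function to another is $(|F| - 1)!$, can be checked by brute force on a tiny example. The sketch below is only an illustration: the two-point `X` and two-valued `Y` are assumptions chosen so that $F$ has four elements and all $4!$ bijections can be enumerated.

```python
from itertools import permutations, product
from math import factorial

X = range(2)                          # toy search space (assumed)
Y = range(2)                          # toy cost values (assumed)
F = list(product(Y, repeat=len(X)))   # all |Y|^|X| = 4 cost functions


def count_bijections(source, target):
    """Number of bijections T of F onto itself with T(source) == target."""
    total = 0
    for image in permutations(F):
        T = dict(zip(F, image))       # one bijection of F
        if T[source] == target:
            total += 1
    return total


# Tabulate the count for every (source, target) pair of cost functions.
counts = {(g, h): count_bijections(g, h) for g in F for h in F}
```

Every entry of `counts` is the same, namely `factorial(len(F) - 1)`, regardless of which pair of cost functions is chosen. That independence is what lets each sum over $T_i$ be absorbed into a proportionality constant.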
C Proof of $\rho_f$ result

As noted in the discussion leading up to the theorem, the fraction of functions giving a specified histogram $\vec{c} = m\vec{\alpha}$ is independent of the algorithm. Consequently, a simple algorithm is used to prove the theorem. The algorithm visits points in $X$ in some canonical order, say $x_1, x_2, \ldots, x_m$. Recall that the histogram $\vec{c}$ is specified by giving the frequencies of occurrence, across the $x_1, x_2, \ldots, x_m$, for each of the $|Y|$ possible cost values. The number of $f$ giving the desired histogram under this algorithm is just the multinomial giving the number of ways of distributing the cost values in $\vec{c}$. At the remaining $|X| - m$ points in $X$ the cost can assume any of the $|Y|$ $f$ values, giving the first result of the theorem.

The expression of $\rho_f(\vec{\alpha})$ in terms of the entropy of $\vec{\alpha}$ follows from an application of Stirling's approximation to order $O(1/m)$, which is valid when all of the $c_i$ are large. In this case the multinomial is written

$$\begin{aligned}
\ln \binom{m}{c_1\, c_2 \cdots c_{|Y|}} &\cong m \ln m - \sum_{i=1}^{|Y|} c_i \ln c_i + \frac{1}{2}\left[\ln m - \sum_{i=1}^{|Y|} \ln c_i\right] \\
&= m S(\vec{\alpha}) + \frac{1}{2}\left[(1 - |Y|) \ln m - \sum_{i=1}^{|Y|} \ln \alpha_i\right],
\end{aligned}$$

where $S(\vec{\alpha}) = -\sum_i \alpha_i \ln \alpha_i$ is the entropy and additive terms that do not depend on $\vec{c}$ have been dropped. The theorem follows by exponentiating this result.
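The counting argument for the first result can be checked directly on a small example. The sketch below assumes toy sizes `X_SIZE`, `Y_SIZE`, and `M` and a made-up target histogram `c`. It uses the canonical-order algorithm from the proof, which simply reads off the first $m$ values of $f$, and compares a brute-force count against the multinomial times $|Y|^{|X|-m}$.

```python
from collections import Counter
from itertools import product
from math import factorial, prod

X_SIZE, Y_SIZE, M = 5, 3, 4   # toy sizes (assumptions)


def histogram(values, y_size=Y_SIZE):
    """Histogram (c_1, ..., c_|Y|) of a sequence of cost values in 0..y_size-1."""
    counts = Counter(values)
    return tuple(counts.get(y, 0) for y in range(y_size))


def multinomial(c):
    """Number of orderings of sum(c) items with c_i copies of value i."""
    return factorial(sum(c)) // prod(factorial(ci) for ci in c)


def count_functions(c, x_size=X_SIZE, y_size=Y_SIZE):
    """Brute-force count of f whose first m = sum(c) values have histogram c."""
    m = sum(c)
    return sum(
        1
        for f in product(range(y_size), repeat=x_size)
        if histogram(f[:m], y_size) == tuple(c)
    )


c = (2, 1, 1)                                         # made-up target histogram
brute = count_functions(c)
closed_form = multinomial(c) * Y_SIZE ** (X_SIZE - M)
```

Dividing `closed_form` by the total number of cost functions, `Y_SIZE ** X_SIZE`, gives the fraction $\rho_f$ for this histogram under the stated assumptions.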
D Proof of $\rho_{alg}$ result

In this section the proportion of all algorithms that give a particular $\vec{c}$ for a particular $f$ is calculated. The calculation proceeds in several steps.

Since $X$ is finite, there is a finite number of different samples. Therefore any (deterministic) $a$ is a huge, but finite, list indexed by all possible $d$'s. Each entry in the list is the $x$ that the $a$ in question outputs for that $d$-index.

Consider any particular unordered set of $m$ $(X, Y)$ pairs where no two of the pairs share the same $x$ value. Such a set is called an unordered path $\pi$. Without loss of generality, from now on we implicitly restrict the discussion to unordered paths of length $m$. A particular $\pi$ is in or from a particular $f$ if there is an unordered set of $m$ $(x, f(x))$ pairs identical to $\pi$. The numerator on the right-hand side of the equation being proved is the number of unordered paths in the given $f$ that give the desired $\vec{c}$.

The number of unordered paths in $f$ that give the desired $\vec{c}$ (the numerator on the right-hand side of that equation) is proportional to the number of $a$'s that give the desired $\vec{c}$ for $f$, and the proof of this claim constitutes a proof of the equation. Furthermore, the proportionality constant is independent of $f$ and $\vec{c}$.
Proof: The proof is established by constructing a mapping $\phi: a \mapsto \pi$, taking in an $a$ that gives the desired $\vec{c}$ for $f$, and producing a $\pi$ that is in $f$ and gives the desired $\vec{c}$. Showing that for any $\pi$ the number of algorithms $a$ such that $\phi(a) = \pi$ is a constant, independent of $\pi$, $f$, and $\vec{c}$, and that $\phi$ is single-valued, will complete the proof.

Recalling that every $x$ value in an unordered path is distinct, any unordered path $\pi$ gives a set of $m!$ different ordered paths. Each such ordered path $\pi_{ord}$ in turn provides a set of $m$ successive $d$'s (if the empty $d$ is included) and a following $x$. Indicate by $d(\pi_{ord})$ this set of the first $m$ $d$'s provided by $\pi_{ord}$.

From any ordered path $\pi_{ord}$ a "partial algorithm" can be constructed. This consists of the list of an $a$, but with only the $m$ $d(\pi_{ord})$ entries in the list filled in; the remaining entries are blank. Since there are $m!$ distinct partial $a$'s for each $\pi$ (one for each ordered path corresponding to $\pi$), there are $m!$ such partially filled-in lists for each $\pi$. A partial algorithm may or may not be consistent with a particular full algorithm. This allows the definition of the inverse of $\phi$: for any $\pi$ that is in $f$ and gives $\vec{c}$, $\phi^{-1}(\pi) \equiv$ (the set of all $a$ that are consistent with at least one partial algorithm generated from $\pi$ and that give $\vec{c}$ when run on $f$).
To complete the first part of the proof it must be shown that for all $\pi$ that are in $f$ and give $\vec{c}$, $\phi^{-1}(\pi)$ contains the same number of elements, regardless of $\pi$, $f$, or $\vec{c}$. To that end, first generate all ordered paths induced by $\pi$, and then associate each such ordered path with a distinct $m$-element partial algorithm. Now, how many full algorithm lists are consistent with at least one of these partial algorithm lists? How this question is answered is the core of this appendix. To answer it, reorder the entries in each of the partial algorithm lists by permuting the indices $d$ of all the lists. Obviously such a reordering won't change the answer to our question.
Reordering is accomplished by interchanging pairs of $d$ indices. First, interchange any $d$ index of the form $((d^x_m(1), d^y_m(1)), \ldots, (d^x_m(i \le m), d^y_m(i \le m)))$ whose entry is filled in in any of our partial algorithm lists with $d'(d) \equiv ((d^x_m(1), z), \ldots, (d^x_m(i), z))$, where $z$ is some arbitrary constant $Y$ value and $x_j$ refers to the $j$th element of $X$. Next, create some arbitrary but fixed ordering of all $x \in X$: $(x_1, \ldots, x_{|X|})$. Then interchange any $d'$ index of the form $((d^x_m(1), z), \ldots, (d^x_m(i \le m), z))$ whose entry is filled in in any of our (new) partial algorithm lists with $d''(d') \equiv ((x_1, z), \ldots, (x_m, z))$. Recall that all the $d^x_m(i)$ must be distinct.

By construction, the resultant partial algorithm lists are independent of $\pi$, $\vec{c}$, and $f$, as is the number of such lists (it is $m!$). Therefore the number of algorithms consistent with at least one partial algorithm list in $\phi^{-1}(\pi)$ is independent of $\pi$, $\vec{c}$, and $f$. This completes the first part of the proof.
For the second part, first choose any two unordered paths that differ from one another, $A$ and $B$. There is no ordered path $A_{ord}$ constructed from $A$ that equals an ordered path $B_{ord}$ constructed from $B$. So choose any such $A_{ord}$ and any such $B_{ord}$. If they disagree for the null $d$, then we know that there is no (deterministic) $a$ that agrees with both of them. If they agree for the null $d$, then since they are sampled from the same $f$, they have the same single-element $d$. If they disagree for that $d$, then there is no $a$ that agrees with both of them. If they agree for that $d$, then they have the same double-element $d$. Continue in this manner all the way up to the $(m-1)$-element $d$. Since the two ordered paths differ, they must have disagreed at some point by now, and therefore there is no $a$ that agrees with both of them. Since this is true for any $A_{ord}$ from $A$ and any $B_{ord}$ from $B$, we see that there is no $a$ in $\phi^{-1}(A)$ that is also in $\phi^{-1}(B)$. This completes the proof.
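To show the relation to the Kullback-Leibler distance, the product of binomials is expanded with the aid of Stirling's approximation when both $N_i$ and $c_i$ are large:

$$\ln \prod_{i=1}^{|Y|} \binom{N_i}{c_i} = \sum_{i=1}^{|Y|} \left[ -\frac{1}{2} \ln 2\pi + N_i \ln N_i - c_i \ln c_i - (N_i - c_i) \ln (N_i - c_i) + \frac{1}{2}\left( \ln N_i - \ln (N_i - c_i) - \ln c_i \right) \right].$$

It has been assumed that $c_i / N_i \ll 1$, which is reasonable when $m \ll |X|$. Expanding $\ln(1 - z) = -z - z^2/2 - \cdots$ to second order gives

$$\ln \prod_{i=1}^{|Y|} \binom{N_i}{c_i} \cong \sum_{i=1}^{|Y|} \left[ c_i \ln \frac{N_i}{c_i} - \frac{1}{2} \ln c_i + c_i - \frac{1}{2} \ln 2\pi - \frac{c_i}{2 N_i} (c_i - 1 + \cdots) \right].$$

Using $m / |X| \ll 1$, then in terms of $\vec{\alpha}$ and $\vec{\beta}$ (with $\alpha_i = c_i / m$ and $\beta_i = N_i / |X|$) one finds

$$\ln \prod_{i=1}^{|Y|} \binom{N_i}{c_i} \cong -m D_{KL}(\vec{\alpha}, \vec{\beta}) + m - m \ln \frac{m}{|X|} - \frac{|Y|}{2} \ln 2\pi - \sum_{i=1}^{|Y|} \left[ \frac{1}{2} \ln (\alpha_i m) - \frac{m}{2|X|} \frac{\alpha_i}{\beta_i} (1 - \alpha_i m + \cdots) \right],$$

where $D_{KL}(\vec{\alpha}, \vec{\beta}) \equiv \sum_i \alpha_i \ln (\alpha_i / \beta_i)$ is the Kullback-Leibler distance between the distributions $\vec{\alpha}$ and $\vec{\beta}$. Exponentiating this expression yields the second result in the theorem.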
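The numerator discussed in this appendix, the number of unordered paths in $f$ that give a desired histogram $\vec{c}$, can be counted by brute force and compared with the product of binomials $\prod_i \binom{N_i}{c_i}$. The sketch below does only that; it does not enumerate algorithms. The cost function `f` and the histogram `c` are made-up examples.

```python
from collections import Counter
from itertools import combinations
from math import comb, prod

f = (0, 0, 0, 1, 1, 2, 2, 2)   # f[x] is the cost at point x (made-up example)
Y_SIZE = 3

# Histogram N of the cost function itself: N[y] points have cost y.
N = tuple(Counter(f).get(y, 0) for y in range(Y_SIZE))


def paths_with_histogram(c):
    """Count unordered sets of m = sum(c) distinct points whose costs have histogram c."""
    m = sum(c)
    hits = 0
    for xs in combinations(range(len(f)), m):
        seen = Counter(f[x] for x in xs)
        if tuple(seen.get(y, 0) for y in range(Y_SIZE)) == tuple(c):
            hits += 1
    return hits


c = (1, 1, 2)                                        # made-up target histogram
brute = paths_with_histogram(c)
closed_form = prod(comb(n, k) for n, k in zip(N, c))
fraction = closed_form / comb(len(f), sum(c))        # product of binomials over C(|X|, m)
```

Here `fraction` is the ratio of the product of binomials to $\binom{|X|}{m}$, the quantity whose logarithm is expanded above.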
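E Benchmark measures of performance

The result for each benchmark measure is established in turn.

The first measure is $\sum_f P(\min(d^y_m) > \epsilon \mid f, m, a)$. Consider

$$\sum_f P(\min(d^y_m) > \epsilon \mid f, m, a) = \sum_{d^y_m} P(\min(d^y_m) > \epsilon \mid d^y_m) \sum_f P(d^y_m \mid f, m, a).$$

The summand $P(d^y_m \mid f, m, a)$ equals 0 or 1 for all $f$ and deterministic $a$. It is 1 only if

i) $f(d^x_m(1)) = d^y_m(1)$,

ii) $f(a[d_m(1)]) = d^y_m(2)$,

iii) $f(a[d_m(1), d_m(2)]) = d^y_m(3)$,

and so on. These restrictions fix the value of $f(x)$ at $m$ points, while $f$ remains free at all other points. Therefore

$$\sum_f P(d^y_m \mid f, m, a) = |Y|^{|X| - m}.$$

Using this result in the expansion above and dividing by the number $|Y|^{|X|}$ of cost functions, we find

$$\frac{1}{|Y|^{|X|}} \sum_f P(\min(d^y_m) > \epsilon \mid f, m) = \frac{1}{|Y|^m} \sum_{d^y_m} P(\min(d^y_m) > \epsilon \mid d^y_m) = \frac{1}{|Y|^m} \sum_{d^y_m :\, \min(d^y_m) > \epsilon} 1 = \frac{(|Y| - \epsilon)^m}{|Y|^m},$$

where the cost values are taken to be the integers $1, \ldots, |Y|$, so that $|Y| - \epsilon$ of them exceed $\epsilon$. This is the result quoted in the theorem.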
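In the limit as $|Y|$ gets large, write $\sum_f E(\min(\vec{c}) \mid f, m) = \sum_{\epsilon=1}^{|Y|} \epsilon\,[\omega^m(\epsilon - 1) - \omega^m(\epsilon)]$ and substitute in for $\omega(\epsilon) = 1 - \epsilon/|Y|$. Replacing $\epsilon$ with $\epsilon + 1$ turns the sum into

$$\sum_{\epsilon=0}^{|Y|-1} [\epsilon + 1]\left[\left(1 - \frac{\epsilon}{|Y|}\right)^m - \left(1 - \frac{\epsilon + 1}{|Y|}\right)^m\right].$$

Next, write $|Y| = b/\Delta$ for some $b$, and multiply and divide the summand by $\Delta$. Since $|Y| \to \infty$, then $\Delta \to 0$. To take the limit of $\Delta \to 0$, apply L'Hôpital's rule to the ratio in the summand. Next use the fact that $\Delta$ is going to 0 to cancel terms in the summand. Carrying through the algebra, and dividing by $b/\Delta$, we get a Riemann sum of the form $\frac{m}{b^2} \int_0^b dx\, x\,(1 - x/b)^{m-1}$. Evaluating the integral gives the second result in the theorem.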
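Both parts of the first benchmark can be checked numerically on small cases. The sketch below assumes cost values $1, \ldots, |Y|$, toy sizes `X_SIZE`, `Y_SIZE`, and `M`, and a fixed-order algorithm (the sum over $f$ does not depend on this choice). It counts the cost functions whose sampled minimum exceeds $\epsilon$, and separately evaluates the average minimum divided by $|Y|$ for a large $|Y|$.

```python
from itertools import product

X_SIZE, Y_SIZE, M = 4, 3, 2      # toy sizes (assumptions)
Y = range(1, Y_SIZE + 1)         # cost values 1..|Y|


def sample(f, m=M):
    """d_m^y for a fixed-order algorithm that visits x = 0, 1, ..., m-1."""
    return f[:m]


def count_above(eps):
    """Number of cost functions whose sampled minimum exceeds eps."""
    return sum(1 for f in product(Y, repeat=X_SIZE) if min(sample(f)) > eps)


results = {eps: count_above(eps) for eps in range(Y_SIZE + 1)}
predicted = {
    eps: Y_SIZE ** (X_SIZE - M) * (Y_SIZE - eps) ** M
    for eps in range(Y_SIZE + 1)
}


def expected_min_fraction(y_size, m):
    """Average of min(d_m^y) over all f, divided by |Y|.

    Uses E[min] = sum over eps >= 0 of P(min > eps), with P(min > eps) = (1 - eps/|Y|)^m.
    """
    return sum((1 - eps / y_size) ** m for eps in range(y_size)) / y_size


limit_estimate = expected_min_fraction(10_000, 3)   # close to 1 / (m + 1) = 0.25
```

Dividing each entry of `results` by `Y_SIZE ** X_SIZE` recovers $(1 - \epsilon/|Y|)^m$, and `limit_estimate` approaches the value $1/(m+1)$ of the integral above as $|Y|$ grows.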
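The second benchmark concerns the behavior of the random algorithm $\tilde{a}$. Marginalizing over the $Y$ values of different histograms $\vec{c}$, the performance of $\tilde{a}$ is

$$P(\min(\vec{c}) > \epsilon \mid f, m, \tilde{a}) = \sum_{\vec{c}} P(\min(\vec{c}) > \epsilon \mid \vec{c})\, P(\vec{c} \mid f, m, \tilde{a}).$$

Now $P(\vec{c} \mid f, m, \tilde{a})$ is the probability of obtaining histogram $\vec{c}$ in $m$ random draws from the histogram $\vec{N}$ of the function $f$. (This can be viewed as the definition of $\tilde{a}$.) This probability has been calculated previously as $\prod_{i=1}^{|Y|} \binom{N_i}{c_i} \big/ \binom{|X|}{m}$. So

$$\begin{aligned}
P(\min(\vec{c}) > \epsilon \mid f, m, \tilde{a}) &= \frac{1}{\binom{|X|}{m}} \sum_{c_1 = 0}^{m} \cdots \sum_{c_{|Y|} = 0}^{m} \delta\Big(\sum_{i=1}^{|Y|} c_i, m\Big)\, P(\min(\vec{c}) > \epsilon \mid \vec{c}) \prod_{i=1}^{|Y|} \binom{N_i}{c_i} \\
&= \frac{1}{\binom{|X|}{m}} \sum_{c_{\epsilon+1} = 0}^{m} \cdots \sum_{c_{|Y|} = 0}^{m} \delta\Big(\sum_{i=\epsilon+1}^{|Y|} c_i, m\Big) \prod_{i=\epsilon+1}^{|Y|} \binom{N_i}{c_i} \\
&= \binom{\sum_{i=\epsilon+1}^{|Y|} N_i}{m} \bigg/ \binom{|X|}{m},
\end{aligned}$$

which is the equation of the theorem. The upper index in the last binomial, $\sum_{i=\epsilon+1}^{|Y|} N_i$, is the number of points $x$ with $f(x) > \epsilon$.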
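The closed form for the random algorithm can be compared against direct enumeration. In the sketch below the random algorithm is modeled as a uniformly random choice of $m$ distinct points, and the cost function `f` is a made-up example with costs in $1, \ldots, 3$. Both are assumptions for illustration.

```python
from itertools import combinations
from math import comb

f = (1, 3, 2, 3, 1, 2, 3)   # made-up cost function; f[x] is the cost at point x


def prob_min_above(eps, m):
    """Exact probability that m distinct, uniformly chosen points all cost more than eps."""
    subsets = list(combinations(range(len(f)), m))
    hits = sum(1 for xs in subsets if min(f[x] for x in xs) > eps)
    return hits / len(subsets)


def closed_form(eps, m):
    """Binomial ratio: C(number of points with f(x) > eps, m) / C(|X|, m)."""
    above = sum(1 for y in f if y > eps)
    return comb(above, m) / comb(len(f), m)


# Compare the two for every threshold and a few sample sizes.
checks = {
    (eps, m): (prob_min_above(eps, m), closed_form(eps, m))
    for eps in range(4)
    for m in range(1, 4)
}
```

For instance, with $\epsilon = 1$ and $m = 2$, five of the seven points cost more than $\epsilon$, so both expressions give $\binom{5}{2} / \binom{7}{2} = 10/21$.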
F Proof related to minimax distinctions between algorithms

The proof is by example.

Consider three points in $X$, $x_1$, $x_2$, and $x_3$, and three points in $Y$, $y_1$, $y_2$, and $y_3$.

Let the first point $a_1$ visits be $x_1$, and the first point $a_2$ visits be $x_2$.

If at its first point $a_1$ sees a $y_1$ or a $y_2$, it jumps to $x_2$. Otherwise it jumps to $x_3$.

If at its first point $a_2$ sees a $y_1$, it jumps to $x_1$. If it sees a $y_2$, it jumps to $x_3$.

Consider the cost function that has as the $Y$ values for the three $X$ values $\{y_1, y_2, y_3\}$, respectively.

For $m = 2$, $a_1$ will produce a population $(y_1, y_2)$ for this function, and $a_2$ will produce $(y_2, y_3)$.

The proof is completed if we show that there is no cost function such that $a_1$ produces a population containing $y_2$ and $y_3$ and such that $a_2$ produces a population containing $y_1$ and $y_2$.

There are four possible pairs of populations to consider:

i) $[(y_2, y_3), (y_1, y_2)]$;

ii) $[(y_2, y_3), (y_2, y_1)]$;

iii) $[(y_3, y_2), (y_1, y_2)]$;

iv) $[(y_3, y_2), (y_2, y_1)]$.

Since if its first point is a $y_2$, $a_1$ jumps to $x_2$, which is where $a_2$ starts, when $a_1$'s first point is a $y_2$ its second point must equal $a_2$'s first point. This rules out possibilities i) and ii).

For possibilities iii) and iv), by $a_1$'s population we know that $f$ must be of the form $\{y_3, s, y_2\}$, for some variable $s$. For case iii), $s$ would need to equal $y_1$, due to the first point in $a_2$'s population. However, for that case the second point $a_2$ sees would be the value at $x_1$, which is $y_3$, contrary to hypothesis.

For case iv), we know that $s$ would have to equal $y_2$, due to the first point in $a_2$'s population. However, that would mean that $a_2$ jumps to $x_3$ for its second point, and would therefore see a $y_2$, contrary to hypothesis.

Accordingly, none of the four cases is possible. This is a case both where there is no symmetry under exchange of $d^y$'s between $a_1$ and $a_2$, and no symmetry under exchange of histograms. QED.
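Because the example involves only $3^3 = 27$ cost functions, the case analysis can be confirmed by exhaustive enumeration. In the sketch below points and costs are encoded as the integers 1, 2, 3. The text does not say where $a_2$ jumps after seeing $y_3$, so the code sends it to $x_1$; this is an assumption, and it cannot affect the conclusion because any such population contains $y_3$.

```python
from itertools import product


def a1(f):
    """Population (d^y) of a_1 for m = 2: start at x_1, then jump by the rule in the text."""
    first = f[1]
    second_x = 2 if first in (1, 2) else 3
    return (first, f[second_x])


def a2(f):
    """Population (d^y) of a_2 for m = 2: start at x_2, then jump by the rule in the text."""
    first = f[2]
    # Seeing y_1 -> x_1 and y_2 -> x_3 come from the text; y_3 -> x_1 is an assumption.
    second_x = {1: 1, 2: 3, 3: 1}[first]
    return (first, f[second_x])


# All cost functions, each stored as {x index: y index}.
functions = [dict(zip((1, 2, 3), ys)) for ys in product((1, 2, 3), repeat=3)]

# Functions where a_1 sees {y_1, y_2} and a_2 sees {y_2, y_3} ...
forward = [f for f in functions if set(a1(f)) == {1, 2} and set(a2(f)) == {2, 3}]
# ... and functions where the two histograms are exchanged.
reverse = [f for f in functions if set(a1(f)) == {2, 3} and set(a2(f)) == {1, 2}]
```

The list `forward` contains the cost function $\{y_1, y_2, y_3\}$ used in the text, while `reverse` is empty, in agreement with the four-case argument.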
G Fixed cost functions and choosing procedures

Since any deterministic search algorithm is a mapping from $d \in D$ to $x \in X$, any search algorithm is a vector in the space $X^D$. The components of such a vector are indexed by the possible populations, and the value for each component is the $x$ that the algorithm produces given the associated population.

Consider now a particular population $d$ of size $m$. Given $d$, we can say whether any other population of size greater than $m$ has the (ordered) elements of $d$ as its first $m$ (ordered) elements. The set of those populations that do start with $d$ this way defines a set of components of any algorithm vector $a$. Those components will be indicated by $a_{\supseteq d}$.

The remaining components of $a$ are of two types. The first is given by those populations that are equivalent to the first $M < m$ elements in $d$ for some $M$. The values of those components for the vector algorithm $a$ will be indicated by $a_{\subset d}$. The second type consists of those components corresponding to all remaining populations. Intuitively, these are populations that are not compatible with $d$. Some examples of such populations are populations that contain as one of their first $m$ elements an element not found in $d$, and populations that re-order the elements found in $d$. The values of $a$ for components of this second type will be indicated by $a_{\perp d}$.

Let $proc$ be either $A$ or $B$. We are interested in

$$\sum_{a, a'} P(c \mid f, d, d', k, a, a', proc) = \sum_{a_{\perp d},\, a'_{\perp d'}} \; \sum_{a_{\supseteq d},\, a'_{\supseteq d'}} \; \sum_{a_{\subset d},\, a'_{\subset d'}} P(c \mid f, d, d', k, a, a', proc).$$

The summand is independent of the values of $a_{\perp d}$ and $a'_{\perp d'}$ for either of our two $d$'s. In addition, the number of such values is a constant. (It is given by the product, over all populations not consistent with $d$, of the number of possible $x$ each such population could be mapped to.) Therefore, up to an overall constant independent of $d$, $d'$, $f$, and $proc$, the sum equals

$$\sum_{a_{\supseteq d},\, a'_{\supseteq d'}} \; \sum_{a_{\subset d},\, a'_{\subset d'}} P(c \mid f, d, d', k, a_{\supseteq d}, a'_{\supseteq d'}, a_{\subset d}, a'_{\subset d'}, proc).$$

By definition, we are implicitly restricting the sum to those $a$ and $a'$ so that our summand is defined. This means that we actually only allow one value for each component in $a_{\subset d}$ (namely, the value that gives the next $x$ element in $d$), and similarly for $a'_{\subset d'}$. Therefore the sum reduces to

$$\sum_{a_{\supseteq d},\, a'_{\supseteq d'}} P(c \mid f, d, d', k, a_{\supseteq d}, a'_{\supseteq d'}, proc).$$

Note that no component of $a_{\supseteq d}$ lies in $d^x$. The same is true of $a'_{\supseteq d'}$. So the sum over $a_{\supseteq d}$ is over the same components of $a$ as the sum over $a'_{\supseteq d'}$ is of $a'$. Now for fixed $d$ and $d'$, $proc$'s choice of $a$ or $a'$ is fixed. Accordingly, without loss of generality, the sum can be rewritten as

$$\sum_{a_{\supseteq d}} P(c \mid f, d, d', k, a_{\supseteq d}),$$

with the implicit assumption that $c$ is set by $a_{\supseteq d}$. This sum is independent of $proc$.
H Proof of Theorem

Let $proc$ refer to a choosing procedure. We are interested in

$$\sum_{a, a'} P(c \mid f, m, k, a, a', proc) = \sum_{a, a'} \sum_{d, d'} P(c \mid f, d, d', k, a, a', proc) \times P(d, d' \mid f, k, m, a, a', proc).$$

The sum over $d$ and $d'$ can be moved outside the sum over $a$ and $a'$. Consider any term in that sum (i.e., any particular pair of values of $d$ and $d'$). For that term, $P(d, d' \mid f, k, m, a, a', proc)$ is just 1 for those $a$ and $a'$ that result in $d$ and $d'$, respectively, when run on $f$, and 0 otherwise. (Recall the assumption that $a$ and $a'$ are deterministic.) This means that the $P(d, d' \mid f, k, m, a, a', proc)$ factor simply restricts our sum over $a$ and $a'$ to the $a$ and $a'$ considered in our theorem. Accordingly, our theorem tells us that the summand of the sum over $d$ and $d'$ is the same for choosing procedures $A$ and $B$. Therefore the full sum is the same for both procedures.