Estimating beta-mixing coefficients

Daniel J. McDonald, Cosma Rohilla Shalizi, Mark Schervish
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
danielmc@stat.cmu.edu, cshalizi@stat.cmu.edu, mark@cmu.edu
Abstract

The literature on statistical learning for time series assumes the asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, and there are no methods for estimating mixing rates from data. We give an estimator for the $\beta$-mixing rate based on a single stationary sample path and show it is $L_1$-risk consistent.
1 Introduction

Relaxing the assumption of independence is an active area of research in the statistics and machine learning literature. For time series, independence is replaced by the asymptotic independence of events far apart in time, or "mixing". Mixing conditions make the dependence of the future on the past explicit by quantifying the decay in dependence as the future moves farther from the past. There are many definitions of mixing of varying strength with matching dependence coefficients (see [9, 7, 4] for reviews), but most of the results in the learning literature focus on $\beta$-mixing or absolute regularity. Roughly speaking (see Definition 2.1 below for a precise statement), the $\beta$-mixing coefficient at lag $a$ is the total variation distance between the actual joint distribution of events separated by $a$ time steps and the product of their marginal distributions, i.e., the $L_1$ distance from independence.
Numerous results in the statistical machine learning literature rely on knowledge of the $\beta$-mixing coefficients. As Vidyasagar [25, p. 41] notes, $\beta$-mixing is "just right" for the extension of IID results to dependent data, and so recent work has consistently focused on it. Meir [15] derives generalization error bounds for nonparametric methods based on model selection via structural risk minimization. Baraud et al. [1] study the finite sample risk performance of penalized least squares regression estimators under $\beta$-mixing. Lozano et al. [13] examine regularized boosting algorithms under absolute regularity and prove consistency. Karandikar and Vidyasagar [12] consider "probably approximately correct" learning algorithms, proving that PAC algorithms for IID inputs remain PAC with $\beta$-mixing inputs under some mild conditions. Ralaivola et al. [20] derive PAC bounds for ranking statistics and classifiers using a decomposition of the dependency graph. Finally, Mohri and Rostamizadeh [16] derive stability bounds for $\varphi$-mixing and $\beta$-mixing inputs, generalizing existing stability results for IID data.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.
All these results assume not just mixing, but known mixing coefficients. In particular, the risk bounds in [15, 16] and [20] are incalculable without knowledge of the rates. This knowledge is never available. Unless researchers are willing to assume specific values for a sequence of mixing coefficients, the results mentioned in the previous paragraph are generally useless when confronted with data. To illustrate this deficiency, consider Theorem 18 of [16]:

Theorem 1.1 (Briefly). Assume a learning algorithm is stable.¹ Then, for any sample of size $n$ drawn from a stationary $\beta$-mixing distribution, and $\epsilon > 0$,

$P(|\hat{R} - R| > \epsilon) \le \delta(n, \epsilon, \beta, a, b) + \beta(a)(\mu_n - 1)$

where $n = (a + b)\mu_n$, $\delta$ has a particular functional form, and $\hat{R} - R$ is the difference between the true risk and the empirical risk.

¹ The literature on algorithmic stability refers to this as $\hat{\beta}$-stability (e.g. Bousquet and Elisseeff [3]).

Ideally, one could use this result for model selection or to control the size of the generalization error of
competing prediction algorithms (support vector machines, support vector regression, and kernel ridge regression are a few of the many algorithms known to satisfy stability). However, the bound depends explicitly on the $\beta$-mixing coefficient $\beta(a)$. To make matters worse, there are no methods for estimating the mixing coefficients. According to Meir [15, p. 7], "there is no efficient practical approach known at this stage for estimation of mixing parameters." We begin to rectify this problem by deriving the first method for estimating these coefficients. We prove that our estimator is consistent for arbitrary $\beta$-mixing processes. In addition, we derive rates of convergence for Markov approximations to these processes.

Application of statistical learning results to mixing data is highly desirable in applied work. Many common time series models are known to be $\beta$-mixing, and the rates of decay are known given the true parameters of the process. Among the processes for which such knowledge is available are ARMA models [17], GARCH models [5], and certain Markov processes; see [9] for an overview of such results. To our knowledge, only Nobel [18] approaches a solution to the problem of estimating mixing rates by giving a method to distinguish between different polynomial mixing rate regimes through hypothesis testing.

We present the first method for estimating the $\beta$-mixing coefficients for stationary time series data. Section 2 defines the $\beta$-mixing coefficient and states our main results on convergence rates and consistency for our estimator. Section 3 gives an intermediate result on the $L_1$ convergence of the histogram estimator with $\beta$-mixing inputs. Section 4 proves the main results from §2. Section 5 concludes and lays out some avenues for future research.
2 Estimation of $\beta$-mixing

In this section, we present one of many equivalent definitions of absolute regularity and state our main results, deferring proofs to §4.

To fix notation, let $X = \{X_t\}_{t=1}^{\infty}$ be a sequence of random variables where each $X_t$ is a measurable function from a probability space $(\Omega, \mathcal{F}, P)$ into a measurable space $\mathcal{X}$. A block of this random sequence will be given by $X_i^j \equiv \{X_t\}_{t=i}^{j}$, where $i$ and $j$ are integers and may be infinite. We use similar notation for the sigma fields generated by these blocks and their joint distributions. In particular, $\sigma_i^j$ will denote the sigma field generated by $X_i^j$, and the joint distribution of $X_i^j$ will be denoted $P_i^j$.
2.1 Definitions

There are many equivalent definitions of $\beta$-mixing (see for instance [9] or [4], as well as Meir [15] or Yu [28]); the most intuitive is that given in Doukhan [9].

Definition 2.1 ($\beta$-mixing). For each positive integer $a$, the coefficient of absolute regularity, or $\beta$-mixing coefficient, $\beta(a)$, is

$\beta(a) \equiv \sup_t \left\| P_1^t \otimes P_{t+a}^{\infty} - P_{t,a} \right\|_{TV} \qquad (1)$

where $\|\cdot\|_{TV}$ is the total variation norm, and $P_{t,a}$ is the joint distribution of $(X_1^t, X_{t+a}^{\infty})$. A stochastic process is said to be absolutely regular, or $\beta$-mixing, if $\beta(a) \to 0$ as $a \to \infty$.

Loosely speaking, Definition 2.1 says that the coefficient $\beta(a)$ measures the total variation distance between the joint distribution of random variables separated by $a$ time units and a distribution under which random variables separated by $a$ time units are independent. The supremum over $t$ is unnecessary for stationary random processes $X$, which is the only case we consider here.
Definition 2.2 (Stationarity). A sequence of random variables $X$ is stationary when all its finite-dimensional distributions are invariant over time: for all $t$ and all non-negative integers $i$ and $j$, the random vectors $X_t^{t+i}$ and $X_{t+j}^{t+i+j}$ have the same distribution.
Our main result requires the method of blocking used by Yu [27, 28]. The purpose is to transform a sequence of dependent variables into a subsequence of nearly IID ones. Consider a sample $X_1^n$ from a stationary $\beta$-mixing sequence with density $f$. Let $m_n$ and $\mu_n$ be non-negative integers such that $2 m_n \mu_n = n$. Now divide $X_1^n$ into $2\mu_n$ blocks, each of length $m_n$. Identify the blocks as follows:

$U_j = \{X_i : 2(j-1)m_n + 1 \le i \le (2j-1)m_n\},$
$V_j = \{X_i : (2j-1)m_n + 1 \le i \le 2jm_n\}.$

Let $U$ be the entire sequence of odd blocks $U_j$, and let $V$ be the sequence of even blocks $V_j$. Finally, let $U'$ be a sequence of blocks which are independent of $X_1^n$ but such that each block has the same distribution as a block from the original sequence:

$U'_j \stackrel{D}{=} U_j \stackrel{D}{=} U_1. \qquad (2)$

The blocks $U'$ are now an IID block sequence, so standard results apply. (See [28] for a more rigorous analysis of blocking.) With this structure, we can state our main result.
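The blocking scheme above is simple to mechanize. The following sketch (the function name and array layout are ours, not the paper's) splits a sample into the odd blocks $U_j$ and the even blocks $V_j$:

```python
import numpy as np

def make_blocks(x, m):
    """Split x into mu pairs of blocks of length m: odd blocks U_j and
    even blocks V_j, following the (1-based) indexing in the text."""
    mu = len(x) // (2 * m)          # number of block pairs: 2 * m * mu <= n
    U, V = [], []
    for j in range(1, mu + 1):
        # U_j = {X_i : 2(j-1)m + 1 <= i <= (2j-1)m}
        U.append(x[2 * (j - 1) * m:(2 * j - 1) * m])
        # V_j = {X_i : (2j-1)m + 1 <= i <= 2jm}
        V.append(x[(2 * j - 1) * m:2 * j * m])
    return np.array(U), np.array(V)

x = np.arange(12)                   # n = 12; with m = 2, mu = 3
U, V = make_blocks(x, 2)
```

For $n = 12$ and $m = 2$ this yields $U$ rows [0, 1], [4, 5], [8, 9] and $V$ rows [2, 3], [6, 7], [10, 11]: the two sequences interleave and together exhaust the sample.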
2.2 Results

Our main result emerges in two stages. First, we recognize that the distribution of a finite sample depends only on finite-dimensional distributions. This leads to an estimator of a finite-dimensional version of $\beta(a)$. Next, we let the finite dimension increase to infinity with the size of the observed sample.

For positive integers $t$, $d$, and $a$, define

$\beta_d(a) \equiv \left\| P_{t-d+1}^t \otimes P_{t+a}^{t+a+d-1} - P_{t,a,d} \right\|_{TV}, \qquad (3)$

where $P_{t,a,d}$ is the joint distribution of $(X_{t-d+1}^t, X_{t+a}^{t+a+d-1})$. Also, let $\hat{f}_d$ be the $d$-dimensional histogram estimator of the joint density of $d$ consecutive observations, and let $\hat{f}_a^{2d}$ be the $2d$-dimensional histogram estimator of the joint density of two sets of $d$ consecutive observations separated by $a$ time points. We construct an estimator of $\beta_d(a)$ based on these two histograms.² Define

$\hat{\beta}_d(a) \equiv \frac{1}{2} \int \left| \hat{f}_a^{2d} - \hat{f}_d \otimes \hat{f}_d \right|. \qquad (4)$
We show that, by allowing $d = d_n$ to grow with $n$, this estimator will converge on $\beta(a)$. This can be seen most clearly by bounding the $\ell_1$ risk of the estimator with its estimation and approximation errors:

$|\hat{\beta}_d(a) - \beta(a)| \le |\hat{\beta}_d(a) - \beta_d(a)| + |\beta_d(a) - \beta(a)|.$

The first term is the error of estimating $\beta_d(a)$ with a random sample of data. The second term is the non-stochastic error induced by approximating the infinite-dimensional coefficient, $\beta(a)$, with its $d$-dimensional counterpart, $\beta_d(a)$.
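For intuition, the estimator (4) can be computed directly in the simplest case $d = 1$: build a two-dimensional histogram of the pairs $(X_t, X_{t+a})$, form the product of its marginals, and take half the $L_1$ distance between the two. The sketch below (the function name, binning rule, and use of the pair histogram's marginals in place of a separate one-dimensional histogram are our simplifications, not the paper's) illustrates this on simulated data:

```python
import numpy as np

def beta_hat_1(x, a, n_bins=8):
    """Plug-in estimate of beta_1(a): half the L1 distance between the
    histogram of (X_t, X_{t+a}) and the product of its marginals."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max() + 1e-12, n_bins + 1)
    joint, _, _ = np.histogram2d(x[:-a], x[a:], bins=[edges, edges])
    joint /= joint.sum()                  # cell probabilities of the pairs
    prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    # integrating the density difference over a cell leaves |p_ij - p_i p_j|
    return 0.5 * np.abs(joint - prod).sum()

rng = np.random.default_rng(0)
iid = rng.normal(size=20000)              # independent data: beta(a) = 0
ar = np.zeros(20000)                      # AR(1), geometrically beta-mixing
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
b_iid, b_ar = beta_hat_1(iid, a=5), beta_hat_1(ar, a=5)
```

On draws like these the IID series gives an estimate near zero, while the strongly dependent AR(1) series gives a visibly larger one, as the definition suggests.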
Our first theorem in this section establishes consistency of $\hat{\beta}_{d_n}(a)$ as an estimator of $\beta(a)$ for all $\beta$-mixing processes, provided $d_n$ increases at an appropriate rate. Theorem 2.4 gives finite sample bounds on the estimation error, while some measure-theoretic arguments contained in §4 show that the approximation error must go to zero as $d_n \to \infty$.

Theorem 2.3. Let $X_1^n$ be a sample from an arbitrary $\beta$-mixing process. Let $d_n = O(\exp\{W(\log n)\})$, where $W$ is the Lambert W function.³ Then $\hat{\beta}_{d_n}(a) \xrightarrow{P} \beta(a)$ as $n \to \infty$.

² While it is clearly possible to replace histograms with other choices of density estimators (most notably kernel density estimators), histograms in this case are more convenient theoretically and computationally. See §5 for more details.

³ The Lambert W function is defined as the (multivalued) inverse of $f(w) = w\exp\{w\}$. Thus, $O(\exp\{W(\log n)\})$ is bigger than $O(\log \log n)$ but smaller than $O(\log n)$. See for example Corless et al. [6].
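The growth rate in Theorem 2.3 is easy to evaluate numerically. A small sketch (we implement the principal branch of $W$ by Newton's method rather than relying on any particular library):

```python
import math

def lambert_w(x, tol=1e-12):
    """Principal branch of the Lambert W function: solve w * exp(w) = x."""
    w = math.log(x + 1.0)                  # crude starting point for x > 0
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

# exp{W(log n)} lies strictly between log log n and log n
for n in (10**3, 10**6, 10**12):
    L = math.log(n)
    d_n = math.exp(lambert_w(L))
    assert math.log(L) < d_n < L
```

Since $w e^w = x$ gives $e^{W(x)} = x / W(x)$, the permitted dimension $d_n$ is $\log n$ deflated by the slowly growing factor $W(\log n)$, which is why it sits between $\log\log n$ and $\log n$.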
A finite sample bound for the estimation error is the first step to establishing consistency for $\hat{\beta}_d(a)$. This result gives convergence rates for estimation of the finite-dimensional mixing coefficient $\beta_d(a)$, and also for Markov processes of known order $d$, since in this case $\beta_d(a) = \beta(a)$.
Theorem 2.4. Consider a sample $X_1^n$ from a stationary $\beta$-mixing process. Let $\mu_n$ and $m_n$ be positive integers such that $2\mu_n m_n = n$ and $\mu_n \ge d > 0$. Then

$P\left( |\hat{\beta}_d(a) - \beta_d(a)| > \epsilon \right) \le 2\exp\left\{ -\frac{\mu_n \epsilon_1^2}{2} \right\} + 2\exp\left\{ -\frac{\mu_n \epsilon_2^2}{2} \right\} + 4(\mu_n - 1)\beta(m_n),$

where $\epsilon_1 = \epsilon/2 - E\left[ \int |\hat{f}_d - f_d| \right]$ and $\epsilon_2 = \epsilon - E\left[ \int |\hat{f}_a^{2d} - f_a^{2d}| \right]$.
Consistency of the estimator $\hat{\beta}_d(a)$ is guaranteed only for certain choices of $m_n$ and $\mu_n$. Clearly $\mu_n \to \infty$ and $\mu_n \beta(m_n) \to 0$ as $n \to \infty$ are necessary conditions. Consistency also requires convergence of the histogram estimators to the target densities. We leave the proof of this theorem for §4. As an example to show that this bound can go to zero with proper choices of $m_n$ and $\mu_n$, the following corollary proves consistency for first-order Markov processes. Consistency of the estimator for higher-order Markov processes can be proven similarly. These processes are geometrically $\beta$-mixing, as shown in e.g. Nummelin and Tuominen [19].
Corollary 2.5. Let $X_1^n$ be a sample from a first-order Markov process with $\beta(a) = \beta_1(a) = O(r^a)$ for some $0 \le r < 1$. Then under the conditions of Theorem 2.4, $\hat{\beta}_1(a) \xrightarrow{P} \beta(a)$ at a rate of $o(\sqrt{n})$ up to a logarithmic factor.

Proof. Recall that $n = 2\mu_n m_n$. Then

$4(\mu_n - 1)\beta(m_n) = 4\mu_n \beta(m_n) - 4\beta(m_n) = K_1 \frac{n}{m_n} r^{m_n} - K_2 r^{m_n} \to 0$

if $m_n = \Omega(\log n)$, for constants $K_1$ and $K_2$. But the exponential terms are

$\exp\left\{ -K_3 \frac{n \epsilon_j^2}{m_n} \right\}$

for $j = 1, 2$ and a constant $K_3$. Therefore, both exponential terms go to 0 as $n \to \infty$ for $m_n = o(n)$. Balancing the rates gives the optimal choice of $m_n = o(\sqrt{n})$, with corresponding rate of convergence (up to a logarithmic factor) of $o(\sqrt{n})$.
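The two competing terms in the corollary can be watched numerically. Assuming, for illustration, a geometric rate $\beta(m) = r^m$ with $r = 0.9$ and taking $m_n$ on the order of $\sqrt{n}$, the dependence penalty $4(\mu_n - 1)\beta(m_n)$ from Theorem 2.4 collapses quickly:

```python
import math

r = 0.9                                 # assumed geometric mixing rate
prev = float("inf")
for n in (10**2, 10**4, 10**6, 10**8):
    m = int(math.sqrt(n))               # m_n of order sqrt(n)
    mu = n // (2 * m)                   # mu_n from n = 2 * mu_n * m_n
    tail = 4 * (mu - 1) * r**m          # dependence penalty in Theorem 2.4
    assert tail < prev                  # strictly shrinking along this grid
    prev = tail
```

The penalty is driven by $r^{m_n}$, which beats the linearly growing factor $\mu_n$ as soon as $m_n$ grows at least logarithmically, exactly as in the proof.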
Proving Theorem 2.4 requires showing the $L_1$ convergence of the histogram density estimator with $\beta$-mixing data. We do this in the next section.
3 $L_1$ convergence of histograms

Convergence of density estimators is thoroughly studied in the statistics and machine learning literature. Early papers on the $L_1$ convergence of kernel density estimators (KDEs) include [26, 2, 22]; Freedman and Diaconis [10] look specifically at histogram estimators, and Yu [27] considered the $L_1$ convergence of KDEs for $\beta$-mixing data, showing that the optimal IID rates can be attained. Devroye and Györfi [8] argue that $L_1$ is a more appropriate metric for studying density estimation, and Tran [23] proves $L_1$ consistency of KDEs under $\alpha$- and $\beta$-mixing. As far as we are aware, ours is the first proof of $L_1$ convergence for histograms under $\beta$-mixing.

Additionally, the dimensionality of the target density is analogous to the order of the Markov approximation. Therefore, the convergence rates we give are asymptotic in the bandwidth $h_n$, which shrinks as $n$ increases, but also in the dimension $d$, which increases with $n$. Even under these asymptotics, histogram estimation in this sense is not a high-dimensional problem. The dimension of the target density considered here is on the order of $\exp\{W(\log n)\}$, a rate somewhere between $\log n$ and $\log \log n$.
Theorem 3.1. If $\hat{f}$ is the histogram estimator based on a (possibly vector-valued) sample $X_1^n$ from a $\beta$-mixing sequence with stationary density $f$, then for all $\epsilon > E\left[ \int |\hat{f} - f| \right]$,

$P\left( \int |\hat{f} - f| > \epsilon \right) \le 2\exp\left\{ -\frac{\mu_n \epsilon_1^2}{2} \right\} + 2(\mu_n - 1)\beta(m_n) \qquad (5)$

where $\epsilon_1 = \epsilon - E\left[ \int |\hat{f} - f| \right]$.
To prove this result, we use the blocking method of Yu [28] to transform the dependent $\beta$-mixing sequence into a sequence of nearly independent blocks. We then apply McDiarmid's inequality to the blocks to derive asymptotics in the bandwidth of the histogram as well as the dimension of the target density. For completeness, we state Yu's blocking result and McDiarmid's inequality before proving the doubly asymptotic histogram convergence for IID data. Combining these lemmas allows us to derive rates of convergence for histograms based on $\beta$-mixing inputs.
Lemma 3.2 (Lemma 4.1 in Yu [28]). Let $\phi$ be a measurable function with respect to the block sequence $U$, uniformly bounded by $M$. Then

$\left| E[\phi] - \tilde{E}[\phi] \right| \le M \beta(m_n)(\mu_n - 1), \qquad (6)$

where the first expectation is with respect to the dependent block sequence, $U$, and $\tilde{E}$ is with respect to the independent sequence, $U'$.

This lemma essentially gives a method of applying IID results to $\beta$-mixing data. Because the dependence decays as we increase the separation between blocks, widely spaced blocks are nearly independent of each other. In particular, the difference between expectations over these nearly independent blocks and expectations over blocks which are actually independent can be controlled by the $\beta$-mixing coefficient.
Lemma 3.3 (McDiarmid's Inequality [14]). Let $X_1, \ldots, X_n$ be independent random variables, with $X_i$ taking values in a set $A_i$ for each $i$. Suppose that the measurable function $f : \prod A_i \to \mathbb{R}$ satisfies

$|f(x) - f(x')| \le c_i$

whenever the vectors $x$ and $x'$ differ only in the $i$th coordinate. Then for any $\epsilon > 0$,

$P(f - Ef > \epsilon) \le \exp\left\{ \frac{-2\epsilon^2}{\sum c_i^2} \right\}.$
Lemma 3.4. For an IID sample $X_1, \ldots, X_n$ from some density $f$ on $\mathbb{R}^d$,

$E \int |\hat{f} - E\hat{f}|\,dx = O\left( 1 \Big/ \sqrt{n h_n^d} \right) \qquad (7)$

$\int |E\hat{f} - f|\,dx = O(d h_n) + O(d^2 h_n^2), \qquad (8)$

where $\hat{f}$ is the histogram estimate using a grid with sides of length $h_n$.
Proof of Lemma 3.4. Let $p_j$ be the probability of falling into the $j$th bin $B_j$. Then

$E \int |\hat{f} - E\hat{f}| = h_n^d \sum_{j=1}^{J} E\left| \frac{1}{n h_n^d} \sum_{i=1}^{n} I(X_i \in B_j) - \frac{p_j}{h_n^d} \right|$
$\le h_n^d \sum_{j=1}^{J} \frac{1}{n h_n^d} \sqrt{ V\left[ \sum_{i=1}^{n} I(X_i \in B_j) \right] }$
$= h_n^d \sum_{j=1}^{J} \frac{1}{n h_n^d} \sqrt{n p_j (1 - p_j)}$
$= \frac{1}{\sqrt{n}} \sum_{j=1}^{J} \sqrt{p_j (1 - p_j)}$
$= O(n^{-1/2})\, O(h_n^{-d/2}) = O\left( 1 \Big/ \sqrt{n h_n^d} \right).$
For the second claim, consider the bin $B_j$ centered at $c$. Let $I$ be the union of all bins $B_j$. Assume the following:

1. $f \in L_2$ and $f$ is absolutely continuous on $I$, with a.e. partial derivatives $f_i = \frac{\partial}{\partial y_i} f(y)$;
2. $f_i \in L_2$ and $f_i$ is absolutely continuous on $I$, with a.e. partial derivatives $f_{ik} = \frac{\partial}{\partial y_k} f_i(y)$;
3. $f_{ik} \in L_2$ for all $i, k$.

Using a Taylor expansion,

$f(x) = f(c) + \sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2),$

where $f_i(y) = \frac{\partial}{\partial y_i} f(y)$. Therefore, $p_j$ is given by

$p_j = \int_{B_j} f(x)\,dx = h_n^d f(c) + O(d^2 h_n^{d+2}),$

since the integral of the second term over the bin is zero. This means that for the $j$th bin,

$E\hat{f}_n(x) - f(x) = \frac{p_j}{h_n^d} - f(x) = -\sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2).$

Therefore,

$\int_{B_j} \left| E\hat{f}_n(x) - f(x) \right| = \int_{B_j} \left| \sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2) \right|$
$\le \int_{B_j} \left| \sum_{i=1}^{d} (x_i - c_i) f_i(c) \right| + \int_{B_j} O(d^2 h_n^2)$
$= \int_{B_j} \left| \sum_{i=1}^{d} (x_i - c_i) f_i(c) \right| + O(d^2 h_n^{2+d})$
$= O(d h_n^{d+1}) + O(d^2 h_n^{2+d}).$

Since each bin is bounded, we can sum over all $J$ bins. The number of bins is $J = h_n^{-d}$ by definition, so

$\int \left| E\hat{f}_n(x) - f(x) \right| dx = O(h_n^{-d}) \left[ O(d h_n^{d+1}) + O(d^2 h_n^{2+d}) \right] = O(d h_n) + O(d^2 h_n^2).$
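The rates in Lemma 3.4 can be eyeballed by Monte Carlo in one dimension. The sketch below (the grid range, bandwidths, and sample sizes are arbitrary choices of ours) compares the $L_1$ error of a histogram of $N(0,1)$ draws at a small and a large sample size:

```python
import numpy as np

def l1_error(n, h, rng):
    """Approximate integral of |f_hat - f| for a histogram of n N(0,1)
    draws with bandwidth h, evaluated bin-by-bin on [-6, 6]."""
    x = rng.normal(size=n)
    edges = np.arange(-6.0, 6.0 + h, h)
    counts, _ = np.histogram(x, bins=edges)
    f_hat = counts / (n * h)                     # histogram density values
    centers = 0.5 * (edges[:-1] + edges[1:])
    f = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(np.abs(f_hat - f)) * h         # piecewise-constant integral

rng = np.random.default_rng(0)
err_small = l1_error(200, 0.5, rng)
err_large = l1_error(200_000, 0.1, rng)          # more data, smaller h
```

With the bandwidth shrinking as the sample grows, both the variance term (7) and the bias term (8) fall, and the large-sample error comes out far below the small-sample one.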
We can now prove the main result of this section.

Proof of Theorem 3.1. Let $g$ be the $L_1$ loss of the histogram estimator, $g = \int |f - \hat{f}_n|$. Here $\hat{f}_n(x) = \frac{1}{n h_n^d} \sum_{i=1}^{n} I(X_i \in B_j(x))$, where $B_j(x)$ is the bin containing $x$. Let $\hat{f}_U$, $\hat{f}_V$, and $\hat{f}_{U'}$ be histograms based on the block sequences $U$, $V$, and $U'$ respectively. Clearly $\hat{f}_n = \frac{1}{2}(\hat{f}_U + \hat{f}_V)$. Now,

$P(g > \epsilon) = P\left( \int |f - \hat{f}_n| > \epsilon \right)$
$= P\left( \int \left| \frac{f - \hat{f}_U}{2} + \frac{f - \hat{f}_V}{2} \right| > \epsilon \right)$
$\le P\left( \frac{1}{2}\int |f - \hat{f}_U| + \frac{1}{2}\int |f - \hat{f}_V| > \epsilon \right)$
$= P(g_U + g_V > 2\epsilon)$
$\le P(g_U > \epsilon) + P(g_V > \epsilon)$
$= 2P(g_U - E[g_U] > \epsilon - E[g_U])$
$= 2P(g_U - E[g_{U'}] > \epsilon - E[g_{U'}])$
$= 2P(g_U - E[g_{U'}] > \epsilon_1),$

where $\epsilon_1 = \epsilon - E[g_{U'}]$. Here,

$E[g_{U'}] \le \tilde{E} \int |\hat{f}_{U'} - \tilde{E}\hat{f}_{U'}|\,dx + \int |\tilde{E}\hat{f}_{U'} - f|\,dx,$

so by Lemma 3.4, as long as $\mu_n \to \infty$, $h_n \downarrow 0$, and $\mu_n h_n^d \to \infty$, then for all $\epsilon$ there exists $n_0(\epsilon)$ such that for all $n > n_0(\epsilon)$, $\epsilon > E[g] = E[g_{U'}]$. Now applying Lemma 3.2 to the expectation of the indicator of the event $\{g_U - E[g_{U'}] > \epsilon_1\}$ gives

$2P(g_U - E[g_{U'}] > \epsilon_1) \le 2P(g_{U'} - E[g_{U'}] > \epsilon_1) + 2(\mu_n - 1)\beta(m_n),$

where the probability on the right is for the field generated by the independent block sequence $U'$. Since these blocks are independent, showing that $g_{U'}$ satisfies the bounded differences requirement allows for the application of McDiarmid's inequality (Lemma 3.3) to the blocks. For any two block sequences $u'_1, \ldots, u'_{\mu_n}$ and $u''_1, \ldots, u''_{\mu_n}$ with $u'_\ell = u''_\ell$ for all $\ell \ne j$,

$\left| g_{U'}(u'_1, \ldots, u'_{\mu_n}) - g_{U'}(u''_1, \ldots, u''_{\mu_n}) \right|$
$= \left| \int |\hat{f}(y; u'_1, \ldots, u'_{\mu_n}) - f(y)|\,dy - \int |\hat{f}(y; u''_1, \ldots, u''_{\mu_n}) - f(y)|\,dy \right|$
$\le \int |\hat{f}(y; u'_1, \ldots, u'_{\mu_n}) - \hat{f}(y; u''_1, \ldots, u''_{\mu_n})|\,dy$
$= \frac{2}{\mu_n h_n^d} \, h_n^d = \frac{2}{\mu_n}.$

Therefore,

$P(g > \epsilon) \le 2P(g_{U'} - E[g_{U'}] > \epsilon_1) + 2(\mu_n - 1)\beta(m_n)$
$\le 2\exp\left\{ -\frac{\mu_n \epsilon_1^2}{2} \right\} + 2(\mu_n - 1)\beta(m_n).$
4 Proofs

The proof of Theorem 2.4 relies on the triangle inequality and the relationship between total variation distance and the $L_1$ distance between densities.

Proof of Theorem 2.4. For any probability measures $\nu$ and $\lambda$ defined on the same probability space with associated densities $f_\nu$ and $f_\lambda$ with respect to some dominating measure $\pi$,

$\|\nu - \lambda\|_{TV} = \frac{1}{2} \int |f_\nu - f_\lambda|\,d\pi.$

Let $P$ be the $d$-dimensional stationary distribution of the $d$th order Markov process, i.e. $P = P_{t-d+1}^t = P_{t+a}^{t+a+d-1}$ in the notation of equation (3). Let $P_{a,d}$ be the joint distribution of the bivariate random process created by the initial process and itself separated by $a$ time steps. By the triangle inequality, we can upper bound $\beta_d(a)$ for any $d = d_n$. Let $\hat{P}$ and $\hat{P}_{a,d}$ be the distributions associated with the histogram estimators $\hat{f}_d$ and $\hat{f}_a^{2d}$ respectively. Then

$\beta_d(a) = \| P \otimes P - P_{a,d} \|_{TV}$
$= \| P \otimes P - \hat{P} \otimes \hat{P} + \hat{P} \otimes \hat{P} - \hat{P}_{a,d} + \hat{P}_{a,d} - P_{a,d} \|_{TV}$
$\le \| P \otimes P - \hat{P} \otimes \hat{P} \|_{TV} + \| \hat{P} \otimes \hat{P} - \hat{P}_{a,d} \|_{TV} + \| \hat{P}_{a,d} - P_{a,d} \|_{TV}$
$\le 2\| P - \hat{P} \|_{TV} + \| \hat{P} \otimes \hat{P} - \hat{P}_{a,d} \|_{TV} + \| \hat{P}_{a,d} - P_{a,d} \|_{TV}$
$= \int |f_d - \hat{f}_d| + \frac{1}{2}\int |\hat{f}_d \otimes \hat{f}_d - \hat{f}_a^{2d}| + \frac{1}{2}\int |f_a^{2d} - \hat{f}_a^{2d}|,$

where $\frac{1}{2}\int |\hat{f}_d \otimes \hat{f}_d - \hat{f}_a^{2d}|$ is our estimator $\hat{\beta}_d(a)$, and the remaining terms are $L_1$ distances between a density estimator and the target density. Thus,

$\beta_d(a) - \hat{\beta}_d(a) \le \int |f_d - \hat{f}_d| + \frac{1}{2}\int |f_a^{2d} - \hat{f}_a^{2d}|.$

A similar argument starting from $\hat{\beta}_d(a) = \| \hat{P} \otimes \hat{P} - \hat{P}_{a,d} \|_{TV}$ shows that

$\beta_d(a) - \hat{\beta}_d(a) \ge -\int |f_d - \hat{f}_d| - \frac{1}{2}\int |f_a^{2d} - \hat{f}_a^{2d}|,$

so we have that

$\left| \beta_d(a) - \hat{\beta}_d(a) \right| \le \int |f_d - \hat{f}_d| + \frac{1}{2}\int |f_a^{2d} - \hat{f}_a^{2d}|.$

Therefore,

$P\left( \left| \beta_d(a) - \hat{\beta}_d(a) \right| > \epsilon \right) \le P\left( \int |f_d - \hat{f}_d| + \frac{1}{2}\int |f_a^{2d} - \hat{f}_a^{2d}| > \epsilon \right)$
$\le P\left( \int |f_d - \hat{f}_d| > \frac{\epsilon}{2} \right) + P\left( \frac{1}{2}\int |f_a^{2d} - \hat{f}_a^{2d}| > \frac{\epsilon}{2} \right)$
$\le 2\exp\left\{ -\frac{\mu_n \epsilon_1^2}{2} \right\} + 2\exp\left\{ -\frac{\mu_n \epsilon_2^2}{2} \right\} + 4(\mu_n - 1)\beta(m_n),$

where $\epsilon_1 = \epsilon/2 - E\left[ \int |\hat{f}_d - f_d| \right]$ and $\epsilon_2 = \epsilon - E\left[ \int |\hat{f}_a^{2d} - f_a^{2d}| \right]$.
The proof of Theorem 2.3 requires two steps, which are given in the following lemmas. The first specifies the histogram bandwidth $h_n$ and the rate at which $d_n$ (the dimensionality of the target density) goes to infinity. If the dimensionality of the target density were fixed, we could achieve rates of convergence similar to those for histograms based on IID inputs. However, we wish to allow the dimensionality to grow with $n$, so the rates are much slower, as shown in the following lemma.

Lemma 4.1. For the histogram estimator in Lemma 3.4, let

$d_n \asymp \exp\{W(\log n)\}, \qquad h_n \asymp n^{-k_n},$

with

$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n \left( \frac{1}{2}\exp\{W(\log n)\} + 1 \right)}.$

These choices lead to the optimal rate of convergence.

Proof. Let $h_n = n^{-k_n}$ for some $k_n$ to be determined. Then we want $n^{-1/2} h_n^{-d_n/2} = n^{(k_n d_n - 1)/2} \to 0$, $d_n h_n = d_n n^{-k_n} \to 0$, and $d_n^2 h_n^2 = d_n^2 n^{-2k_n} \to 0$, all as $n \to \infty$. Call these A, B, and C. Taking A and B first gives

$n^{(k_n d_n - 1)/2} \asymp d_n n^{-k_n}$
$\Rightarrow \tfrac{1}{2}(k_n d_n - 1)\log n \asymp \log d_n - k_n \log n$
$\Rightarrow k_n \log n \left( \tfrac{1}{2} d_n + 1 \right) \asymp \log d_n + \tfrac{1}{2}\log n$
$\Rightarrow k_n \asymp \frac{\log d_n + \frac{1}{2}\log n}{\log n \left( \frac{1}{2} d_n + 1 \right)}. \qquad (9)$
Similarly, combining A and C gives

$k_n \asymp \frac{2\log d_n + \frac{1}{2}\log n}{\log n \left( \frac{1}{2} d_n + 2 \right)}. \qquad (10)$

Equating (9) and (10) and solving for $d_n$ gives

$d_n \asymp \exp\{W(\log n)\},$

where $W(\cdot)$ is the Lambert W function. Plugging back into (9) gives $h_n = n^{-k_n}$, where

$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n \left( \frac{1}{2}\exp\{W(\log n)\} + 1 \right)}.$
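Concretely, the choices in Lemma 4.1 can be computed for any given $n$, and the agreement of constraints (9) and (10) at $d_n = \exp\{W(\log n)\}$ checked numerically (the Newton-iteration Lambert $W$ below is our own helper, not part of the paper):

```python
import math

def lambert_w(x, tol=1e-12):
    """Principal branch of the Lambert W function: solve w * exp(w) = x."""
    w = math.log(x + 1.0)
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

n = 10**6
L = math.log(n)
d_n = math.exp(lambert_w(L))                  # dimension choice of Lemma 4.1
k_n = (lambert_w(L) + 0.5 * L) / (L * (0.5 * d_n + 1.0))
h_n = n ** (-k_n)                             # bandwidth choice of Lemma 4.1

# constraints (9) and (10) from the proof coincide at this d_n
k9 = (math.log(d_n) + 0.5 * L) / (L * (0.5 * d_n + 1.0))
k10 = (2.0 * math.log(d_n) + 0.5 * L) / (L * (0.5 * d_n + 2.0))
assert abs(k9 - k10) < 1e-9
assert math.log(L) < d_n < L                  # between log log n and log n
```

The assertion that (9) and (10) agree is exactly the balancing step in the proof: $\log n = d_n \log d_n$ holds precisely when $d_n = \exp\{W(\log n)\}$.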
It is also necessary to show that as $d$ grows, $\beta_d(a) \to \beta(a)$. We now prove this result.

Lemma 4.2. $\beta_d(a)$ converges to $\beta(a)$ as $d \to \infty$.

Proof. By stationarity, the supremum over $t$ is unnecessary in Definition 2.1, so without loss of generality, let $t = 0$. Let $P_{-\infty}^0$ be the distribution on $\sigma_{-\infty}^0 = \sigma(\ldots, X_{-1}, X_0)$, and let $P_{a+1}^\infty$ be the distribution on $\sigma_{a+1}^\infty = \sigma(X_{a+1}, X_{a+2}, \ldots)$. Let $P_a$ be the distribution on $\Sigma = \sigma_{-\infty}^0 \otimes \sigma_{a+1}^\infty$ (the product sigma field). Then we can rewrite Definition 2.1 using this notation as

$\beta(a) = \sup_{C \in \Sigma} \left| P_a(C) - [P_{-\infty}^0 \otimes P_{a+1}^\infty](C) \right|.$

Let $\sigma_{-d+1}^0$ and $\sigma_{a+1}^{a+d}$ be the sub-fields of $\sigma_{-\infty}^0$ and $\sigma_{a+1}^\infty$ consisting of the $d$-dimensional cylinder sets for the $d$ dimensions closest together. Let $\Sigma_d$ be the product field of these two. Then we can rewrite $\beta_d(a)$ as

$\beta_d(a) = \sup_{C \in \Sigma_d} \left| P_a(C) - [P_{-\infty}^0 \otimes P_{a+1}^\infty](C) \right|. \qquad (11)$

As such, $\beta_d(a) \le \beta(a)$ for all $a$ and $d$. We can rewrite (11) in terms of finite-dimensional marginals:

$\beta_d(a) = \sup_{C \in \Sigma_d} \left| P_{a,d}(C) - [P_{-d+1}^0 \otimes P_{a+1}^{a+d}](C) \right|,$

where $P_{a,d}$ is the restriction of $P$ to $\sigma(X_{-d+1}, \ldots, X_0, X_{a+1}, \ldots, X_{a+d})$. Because of the nested nature of these sigma fields, we have

$\beta_{d_1}(a) \le \beta_{d_2}(a) \le \beta(a)$

for all finite $d_1 \le d_2$. Therefore, for fixed $a$, $\{\beta_d(a)\}_{d=1}^\infty$ is a monotone increasing sequence which is bounded above, and it converges to some limit $L \le \beta(a)$. To show that $L = \beta(a)$ requires some additional steps.

Let $R = P_a - [P_{-\infty}^0 \otimes P_{a+1}^\infty]$, which is a signed measure on $\Sigma$. Let $R_d = P_{a,d} - [P_{-d+1}^0 \otimes P_{a+1}^{a+d}]$, which is a signed measure on $\Sigma_d$. Decompose $R$ into positive and negative parts as $R = Q^+ - Q^-$, and similarly for $R_d = Q^{+d} - Q^{-d}$. Notice that since $R_d$ is constructed using the marginals of $P$, $R(E) = R_d(E)$ for all $E \in \Sigma_d$. Now since $R$ is the difference of probability measures, we must have that

$0 = R(\Omega) = Q^+(\Omega) - Q^-(\Omega) = Q^+(D) + Q^+(D^c) - Q^-(D) - Q^-(D^c) \qquad (12)$

for all $D \in \Sigma$.

Define $Q = Q^+ + Q^-$. Let $\epsilon > 0$. Let $C \in \Sigma$ be such that

$Q(C) = \beta(a) = Q^+(C) = Q^-(C^c). \qquad (13)$

Such a set $C$ is guaranteed by the Hahn decomposition theorem (letting $C'$ be a set which attains the supremum in (11), we can throw away any subsets with negative $R$ measure) and (12), assuming without loss of generality that $P_a(C) > [P_{-\infty}^0 \otimes P_{a+1}^\infty](C)$. We can use the field $\Sigma_f = \bigcup_d \Sigma_d$ to approximate $C$ in the sense that, for all $\epsilon$, we can find $A \in \Sigma_f$ such that $Q(A \triangle C) < \epsilon/2$ (see Theorem D in Halmos [11, §13] or Lemma A.24 in Schervish [21]). Now,

$Q(A \triangle C) = Q(A \cap C^c) + Q(C \cap A^c) = Q^-(A \cap C^c) + Q^+(C \cap A^c)$

by (13), since $A \cap C^c \subseteq C^c$ and $C \cap A^c \subseteq C$. Therefore, since $Q(A \triangle C) < \epsilon/2$, we have

$Q^-(A \cap C^c) \le \epsilon/2, \qquad Q^+(A^c \cap C) \le \epsilon/2. \qquad (14)$

Also,

$Q(C) = Q(A \cap C) + Q(A^c \cap C) = Q^+(A \cap C) + Q^+(A^c \cap C) \le Q^+(A) + \epsilon/2$

since $A \cap C$ and $A^c \cap C$ are contained in $C$, and $A \cap C \subseteq A$. Therefore

$Q^+(A) \ge Q(C) - \epsilon/2.$

Similarly,

$Q^-(A) = Q^-(A \cap C) + Q^-(A \cap C^c) \le 0 + \epsilon/2 = \epsilon/2$

since $A \cap C \subseteq C$ with $Q^-(C) = 0$, and by (14). Finally,

$Q^{+d}(A) \ge Q^{+d}(A) - Q^{-d}(A) = R_d(A) = R(A) = Q^+(A) - Q^-(A)$
$\ge Q(C) - \epsilon/2 - \epsilon/2 = Q(C) - \epsilon = \beta(a) - \epsilon.$
And since $\beta_d(a) \ge Q^{+d}(A)$, we have that for all $\epsilon > 0$ there exists $d$ such that for all $d_1 > d$,

$\beta_{d_1}(a) \ge \beta_d(a) \ge Q^{+d}(A) \ge \beta(a) - \epsilon.$

Thus, we must have $L = \beta(a)$, so that $\beta_d(a) \to \beta(a)$ as desired.
Proof of Theorem 2.3. By the triangle inequality,

$\left| \hat{\beta}_{d_n}(a) - \beta(a) \right| \le \left| \hat{\beta}_{d_n}(a) - \beta_{d_n}(a) \right| + \left| \beta_{d_n}(a) - \beta(a) \right|.$

The first term on the right is bounded by the result in Theorem 2.4, where we have shown that $d_n = O(\exp\{W(\log n)\})$ is slow enough for the histogram estimator to remain consistent. That $\beta_{d_n}(a) \xrightarrow{d_n \to \infty} \beta(a)$ follows from Lemma 4.2.
5 Discussion

We have shown that our estimator of the $\beta$-mixing coefficients is consistent for the true coefficients $\beta(a)$ under some conditions on the data-generating process. There are numerous results in the statistics and machine learning literatures which assume knowledge of the $\beta$-mixing coefficients, yet as far as we know, this is the first estimator for them. An ability to estimate these coefficients will allow researchers to apply existing results to dependent data without the need to arbitrarily assume their values. Despite the obvious utility of this estimator, as a consequence of its novelty, it comes with a number of potential extensions which warrant careful exploration, as well as some drawbacks.

The reader will note that Theorem 2.3 does not provide a convergence rate. The rate in Theorem 2.4 applies only to the difference between $\hat{\beta}_d(a)$ and $\beta_d(a)$. In order to provide a rate in Theorem 2.3, we would need a better understanding of the non-stochastic convergence of $\beta_d(a)$ to $\beta(a)$. It is not immediately clear that this quantity can converge at any well-defined rate. In particular, it seems likely that the rate of convergence depends on the tail of the sequence $\{\beta(a)\}_{a=1}^\infty$.

Several other mixing and weak-dependence coefficients also have a total-variation flavor, perhaps most notably $\alpha$-mixing [9, 7, 4]. None of them have estimators, and the same trick might well work for them, too.

The use of histograms rather than kernel density estimators for the joint and marginal densities is surprising and perhaps not ultimately necessary. As mentioned above, Tran [23] proved that KDEs are consistent for estimating the stationary density of a time series with $\beta$-mixing inputs, so perhaps one could replace the histograms in our estimator with KDEs. However, this would need an analogue of the double asymptotic results proven for histograms in Lemma 3.4. In particular, we need to estimate increasingly higher dimensional densities as $n \to \infty$. This does not cause a problem of small-$n$-large-$d$, since $d$ is chosen as a function of $n$; however, it will lead to increasingly higher dimensional integration. For histograms, the integral is always trivial, but in the case of KDEs, the numerical accuracy of the integration algorithm becomes increasingly important. This issue could swamp any efficiency gains obtained through the use of kernels. However, this question certainly warrants further investigation.

The main drawback of an estimator based on a density estimate is its complexity. The $\beta$-mixing coefficients are functionals of the joint and marginal distributions derived from the stochastic process $X$; however, it is unsatisfying to estimate densities and solve integrals in order to estimate a single number. Vapnik's main principle for solving problems using a restricted amount of information is

  When solving a given problem, try to avoid solving a more general problem as an intermediate step [24, p. 30].

This principle is clearly violated here, but perhaps our seed will precipitate a more aesthetically pleasing solution.
Acknowledgements

The authors wish to thank Darren Homrighausen and two anonymous reviewers for helpful comments, and the Institute for New Economic Thinking for supporting this research.
References

[1] Baraud, Y., Comte, F., and Viennet, G. (2001), "Adaptive estimation in autoregression or β-mixing regression via model selection," Annals of Statistics, 29, 839–875.

[2] Bickel, P. and Rosenblatt, M. (1973), "On Some Global Measures of the Deviations of Density Function Estimates," The Annals of Statistics, 1, 1071–1095.

[3] Bousquet, O. and Elisseeff, A. (2002), "Stability and Generalization," The Journal of Machine Learning Research, 2, 499–526.

[4] Bradley, R. C. (2005), "Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions," Probability Surveys, 2, 107–144.
[5] Carrasco, M. and Chen, X. (2002), "Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models," Econometric Theory, 18, 17–39.

[6] Corless, R., Gonnet, G., Hare, D., Jeffrey, D., and Knuth, D. (1996), "On the Lambert W Function," Advances in Computational Mathematics, 5, 329–359.

[7] Dedecker, J., Doukhan, P., Lang, G., León R., J. R., Louhichi, S., and Prieur, C. (2007), Weak Dependence: With Examples and Applications, vol. 190 of Lecture Notes in Statistics, Springer Verlag, New York.

[8] Devroye, L. and Györfi, L. (1985), Nonparametric Density Estimation: The L1 View, Wiley, New York.

[9] Doukhan, P. (1994), Mixing: Properties and Examples, vol. 85 of Lecture Notes in Statistics, Springer Verlag, New York.

[10] Freedman, D. and Diaconis, P. (1981), "On the Maximum Deviation Between the Histogram and the Underlying Density," Probability Theory and Related Fields, 58, 139–167.

[11] Halmos, P. (1974), Measure Theory, Graduate Texts in Mathematics, Springer-Verlag, New York.

[12] Karandikar, R. L. and Vidyasagar, M. (2009), "Probably Approximately Correct Learning with Beta-Mixing Input Sequences," submitted for publication.

[13] Lozano, A., Kulkarni, S., and Schapire, R. (2006), "Convergence and Consistency of Regularized Boosting Algorithms with Stationary Beta-Mixing Observations," Advances in Neural Information Processing Systems, 18, 819.

[14] McDiarmid, C. (1989), "On the Method of Bounded Differences," in Surveys in Combinatorics, ed. J. Siemons, vol. 141 of London Mathematical Society Lecture Note Series, pp. 148–188, Cambridge University Press.

[15] Meir, R. (2000), "Nonparametric Time Series Prediction Through Adaptive Model Selection," Machine Learning, 39, 5–34.

[16] Mohri, M. and Rostamizadeh, A. (2010), "Stability Bounds for Stationary φ-mixing and β-mixing Processes," Journal of Machine Learning Research, 11, 789–814.

[17] Mokkadem, A. (1988), "Mixing properties of ARMA processes," Stochastic Processes and their Applications, 29, 309–315.

[18] Nobel, A. (2006), "Hypothesis Testing for Families of Ergodic Processes," Bernoulli, 12, 251–269.

[19] Nummelin, E. and Tuominen, P. (1982), "Geometric Ergodicity of Harris Recurrent Markov Chains with Applications to Renewal Theory," Stochastic Processes and Their Applications, 12, 187–202.

[20] Ralaivola, L., Szafranski, M., and Stempfel, G. (2010), "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes," Journal of Machine Learning Research, 11, 1927–1956.

[21] Schervish, M. (1995), Theory of Statistics, Springer Series in Statistics, Springer Verlag, New York.

[22] Silverman, B. (1978), "Weak and Strong Uniform Consistency of the Kernel Estimate of a Density and its Derivatives," The Annals of Statistics, 6, 177–184.

[23] Tran, L. (1989), "The L1 Convergence of Kernel Density Estimates under Dependence," The Canadian Journal of Statistics/La Revue Canadienne de Statistique, 17, 197–208.

[24] Vapnik, V. (2000), The Nature of Statistical Learning Theory, Statistics for Engineering and Information Science, Springer Verlag, New York, 2nd edn.

[25] Vidyasagar, M. (1997), A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems, Springer Verlag, Berlin.

[26] Woodroofe, M. (1967), "On the Maximum Deviation of the Sample Density," The Annals of Mathematical Statistics, 38, 475–481.

[27] Yu, B. (1993), "Density Estimation in the L1 Norm for Dependent Data with Applications to the Gibbs Sampler," Annals of Statistics, 21, 711–735.

[28] Yu, B. (1994), "Rates of Convergence for Empirical Processes of Stationary Mixing Sequences," The Annals of Probability, 22, 94–116.