
Estimating beta-mixing coefficients

Daniel J. McDonald
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
danielmc@stat.cmu.edu

Cosma Rohilla Shalizi
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
cshalizi@stat.cmu.edu

Mark Schervish
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
mark@cmu.edu

Abstract

The literature on statistical learning for time series assumes the asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, and there are no methods for estimating mixing rates from data. We give an estimator for the beta-mixing rate based on a single stationary sample path and show it is L_1-risk consistent.

1 Introduction

Relaxing the assumption of independence is an active area of research in the statistics and machine learning literature. For time series, independence is replaced by the asymptotic independence of events far apart in time, or "mixing". Mixing conditions make the dependence of the future on the past explicit by quantifying the decay in dependence as the future moves farther from the past. There are many definitions of mixing of varying strength, with matching dependence coefficients (see [9, 7, 4] for reviews), but most of the results in the learning literature focus on β-mixing or absolute regularity. Roughly speaking (see Definition 2.1 below for a precise statement), the β-mixing coefficient at lag a is the total variation distance between the actual joint distribution of events separated by a time steps and the product of their marginal distributions, i.e., the L_1 distance from independence.

[Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.]

Numerous results in the statistical machine learning literature rely on knowledge of the β-mixing coefficients. As Vidyasagar [25, p. 41] notes, β-mixing is "just right" for the extension of IID results to dependent data, and so recent work has consistently focused on it. Meir [15] derives generalization error bounds for nonparametric methods based on model selection via structural risk minimization. Baraud et al. [1] study the finite sample risk performance of penalized least squares regression estimators under β-mixing. Lozano et al. [13] examine regularized boosting algorithms under absolute regularity and prove consistency. Karandikar and Vidyasagar [12] consider "probably approximately correct" learning algorithms, proving that PAC algorithms for IID inputs remain PAC with β-mixing inputs under some mild conditions. Ralaivola et al. [20] derive PAC bounds for ranking statistics and classifiers using a decomposition of the dependency graph. Finally, Mohri and Rostamizadeh [16] derive stability bounds for β-mixing inputs, generalizing existing stability results for IID data.

All these results assume not just β-mixing, but known mixing coefficients. In particular, the risk bounds in [15, 16] and [20] are incalculable without knowledge of the rates. This knowledge is never available. Unless researchers are willing to assume specific values for a sequence of β-mixing coefficients, the results mentioned in the previous paragraph are generally useless when confronted with data. To illustrate this deficiency, consider Theorem 18 of [16]:

Theorem 1.1 (Briefly). Assume a learning algorithm is β̂-stable.¹ Then, for any sample of size n drawn from a stationary β-mixing distribution, and any ε > 0,

P( |R − R̂| > ε ) ≤ Φ(n, ε, β̂, a, b) + β(a)(μ_n − 1),

where n = (a + b)μ_n, Φ has a particular functional form, and R − R̂ is the difference between the true risk and the empirical risk.

Ideally, one could use this result for model selection or to control the size of the generalization error of competing prediction algorithms (support vector machines, support vector regression, and kernel ridge regression are a few of the many algorithms known to satisfy β̂-stability). However, the bound depends explicitly on the mixing coefficient β(a). To make matters worse, there are no methods for estimating the β-mixing coefficients. According to Meir [15, p. 7], "there is no efficient practical approach known at this stage for estimation of mixing parameters." We begin to rectify this problem by deriving the first method for estimating these coefficients. We prove that our estimator is consistent for arbitrary β-mixing processes. In addition, we derive rates of convergence for Markov approximations to these processes.

¹ The literature on algorithmic stability refers to this as β-stability (e.g., Bousquet and Elisseeff [3]).

Application of statistical learning results to β-mixing data is highly desirable in applied work. Many common time series models are known to be β-mixing, and the rates of decay are known given the true parameters of the process. Among the processes for which such knowledge is available are ARMA models [17], GARCH models [5], and certain Markov processes; see [9] for an overview of such results. To our knowledge, only Nobel [18] approaches a solution to the problem of estimating mixing rates, by giving a method to distinguish between different polynomial mixing rate regimes through hypothesis testing.

We present the first method for estimating the β-mixing coefficients for stationary time series data. Section 2 defines the β-mixing coefficient and states our main results on convergence rates and consistency for our estimator. Section 3 gives an intermediate result on the L_1 convergence of the histogram estimator with β-mixing inputs. Section 4 proves the main results from §2. Section 5 concludes and lays out some avenues for future research.

2 Estimation of β-mixing

In this section, we present one of many equivalent definitions of absolute regularity and state our main results, deferring proofs to §4.

To fix notation, let X = {X_t}_{t=1}^∞ be a sequence of random variables where each X_t is a measurable function from a probability space (Ω, F, P) into a measurable space 𝒳. A block of this random sequence will be given by X_i^j ≡ {X_t}_{t=i}^j, where i and j are integers and may be infinite. We use similar notation for the sigma fields generated by these blocks and their joint distributions. In particular, σ_i^j will denote the sigma field generated by X_i^j, and the joint distribution of X_i^j will be denoted P_i^j.

2.1 Definitions

There are many equivalent definitions of β-mixing (see for instance [9] or [4], as well as Meir [15] or Yu [28]); however, the most intuitive is that given in Doukhan [9].

Definition 2.1 (β-mixing). For each positive integer a, the coefficient of absolute regularity, or β-mixing coefficient, β(a), is

β(a) ≡ sup_t ‖ P_1^t ⊗ P_{t+a}^∞ − P_{t,a} ‖_TV,   (1)

where ‖·‖_TV is the total variation norm and P_{t,a} is the joint distribution of (X_1^t, X_{t+a}^∞). A stochastic process is said to be absolutely regular, or β-mixing, if β(a) → 0 as a → ∞.

Loosely speaking, Definition 2.1 says that the coefficient β(a) measures the total variation distance between the joint distribution of random variables separated by a time units and a distribution under which random variables separated by a time units are independent. The supremum over t is unnecessary for stationary random processes X, which is the only case we consider here.
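For intuition, β(a) can be computed exactly in simple cases. The sketch below is our own illustration, not from the paper; it uses the standard identity that, for a stationary finite-state Markov chain, β(a) is the π-weighted average total variation distance between the a-step transition law and the stationary distribution π:

```python
# Illustration (not from the paper): exact beta(a) for a two-state Markov
# chain, via beta(a) = sum_x pi(x) * ||P^a(x, .) - pi||_TV, an identity that
# holds for stationary Markov chains.
P = [[0.9, 0.1],
     [0.2, 0.8]]  # transition matrix; rows sum to 1

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def a_step(a):
    """a-step transition matrix P^a."""
    out = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(a):
        out = mat_mul(out, P)
    return out

# stationary distribution: pi proportional to (P[1][0], P[0][1])
z = P[0][1] + P[1][0]
pi = [P[1][0] / z, P[0][1] / z]

def beta(a):
    Pa = a_step(a)
    return sum(pi[x] * 0.5 * sum(abs(Pa[x][y] - pi[y]) for y in range(2))
               for x in range(2))
```

For this chain the result is β(a) = (4/9)(0.7)^a, a geometric decay of the kind discussed for Markov processes below.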

Definition 2.2 (Stationarity). A sequence of random variables X is stationary when all its finite-dimensional distributions are invariant over time: for all t and all non-negative integers i and j, the random vectors X_t^{t+i} and X_{t+j}^{t+i+j} have the same distribution.

Our main result requires the method of blocking used by Yu [27, 28]. The purpose is to transform a sequence of dependent variables into a subsequence of nearly IID ones. Consider a sample X_1^n from a stationary β-mixing sequence with density f. Let m_n and μ_n be non-negative integers such that 2 m_n μ_n = n. Now divide X_1^n into 2μ_n blocks, each of length m_n. Identify the blocks as follows:

U_j = { X_i : 2(j − 1)m_n + 1 ≤ i ≤ (2j − 1)m_n },
V_j = { X_i : (2j − 1)m_n + 1 ≤ i ≤ 2j m_n }.

Let U be the entire sequence of odd blocks U_j, and let V be the sequence of even blocks V_j. Finally, let U' be a sequence of blocks which are independent of X_1^n but such that each block has the same distribution as a block from the original sequence:

U'_j =_D U_j =_D U_1.   (2)

The blocks U' are now an IID block sequence, so standard results apply. (See [28] for a more rigorous analysis of blocking.) With this structure, we can state our main result.
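As a concrete sketch of the construction above (our own code, with hypothetical names), the sample path is cut into alternating odd and even blocks of length m_n:

```python
def blocking(x, m):
    """Split x, of length n = 2 * mu * m, into Yu's alternating blocks:
    U_j covers indices 2(j-1)m+1 .. (2j-1)m and V_j covers (2j-1)m+1 .. 2jm
    (1-indexed as in the text; returned here as 0-indexed slices)."""
    n = len(x)
    if n % (2 * m) != 0:
        raise ValueError("need 2 * mu_n * m_n = n")
    mu = n // (2 * m)
    U = [x[2 * j * m:(2 * j + 1) * m] for j in range(mu)]
    V = [x[(2 * j + 1) * m:(2 * j + 2) * m] for j in range(mu)]
    return U, V

U, V = blocking(list(range(12)), 2)
# U = [[0, 1], [4, 5], [8, 9]], V = [[2, 3], [6, 7], [10, 11]]
```

Widely spaced odd blocks are nearly independent of one another, which is what lets IID results be applied to U'.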


2.2 Results

Our main result emerges in two stages. First, we recognize that the distribution of a finite sample depends only on finite-dimensional distributions. This leads to an estimator of a finite-dimensional version of β(a). Next, we let the finite dimension increase to infinity with the size of the observed sample.

For positive integers t, d, and a, define

β_d(a) ≡ ‖ P_{t−d+1}^t ⊗ P_{t+a}^{t+a+d−1} − P_{t,a,d} ‖_TV,   (3)

where P_{t,a,d} is the joint distribution of (X_{t−d+1}^t, X_{t+a}^{t+a+d−1}). Also, let f̂_d be the d-dimensional histogram estimator of the joint density of d consecutive observations, and let f̂_a^{2d} be the 2d-dimensional histogram estimator of the joint density of two sets of d consecutive observations separated by a time points. We construct an estimator of β_d(a) based on these two histograms.² Define

β̂_d(a) ≡ (1/2) ∫ | f̂_a^{2d} − f̂_d ⊗ f̂_d |.   (4)
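A direct plug-in implementation of (4) is straightforward for scalar observations. The sketch below is our own code (the names, and the restriction to data in [0, 1), are ours): it bins consecutive d-blocks and a-separated pairs of d-blocks, then sums |f̂_a^{2d} − f̂_d ⊗ f̂_d| over histogram cells, where the common cell volume cancels so that densities can be replaced by cell probabilities:

```python
from collections import Counter

def beta_hat(x, a, d, h):
    """Plug-in estimate of beta_d(a), eq. (4), for a list x of scalars in
    [0, 1) with histogram bin width h. Sketch only: no boundary care, and the
    product-cell enumeration is exponential in d."""
    def cell(seq):
        return tuple(int(v / h) for v in seq)
    # d-dimensional histogram of consecutive d-blocks
    dblocks = [cell(x[i:i + d]) for i in range(len(x) - d + 1)]
    p_d = {k: c / len(dblocks) for k, c in Counter(dblocks).items()}
    # 2d-dimensional histogram of pairs of d-blocks separated by a time points
    pairs = [cell(x[i:i + d] + x[i + d - 1 + a:i + 2 * d - 1 + a])
             for i in range(len(x) - (2 * d + a - 1))]
    p_2d = {k: c / len(pairs) for k, c in Counter(pairs).items()}
    # cell volumes cancel in (1/2) * integral |fhat_a^{2d} - fhat_d x fhat_d|
    cells = set(p_2d) | {u + v for u in p_d for v in p_d}
    return 0.5 * sum(abs(p_2d.get(c, 0.0)
                         - p_d.get(c[:d], 0.0) * p_d.get(c[d:], 0.0))
                     for c in cells)
```

With d held fixed this targets β_d(a); Theorem 2.3 below makes the estimator consistent for β(a) by letting d = d_n grow slowly with n.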

We show that, by allowing d = d_n to grow with n, this estimator will converge on β(a). This can be seen most clearly by bounding the ℓ_1-risk of the estimator with its estimation and approximation errors:

| β̂_d(a) − β(a) | ≤ | β̂_d(a) − β_d(a) | + | β_d(a) − β(a) |.

The first term is the error of estimating β_d(a) with a random sample of data. The second term is the non-stochastic error induced by approximating the infinite-dimensional coefficient, β(a), with its d-dimensional counterpart, β_d(a).

Our first theorem in this section establishes consistency of β̂_{d_n}(a) as an estimator of β(a) for all β-mixing processes, provided d_n increases at an appropriate rate. Theorem 2.4 gives finite sample bounds on the estimation error, while some measure-theoretic arguments contained in §4 show that the approximation error must go to zero as d_n → ∞.

Theorem 2.3. Let X_1^n be a sample from an arbitrary β-mixing process. Let d_n = O(exp{W(log n)}), where W is the Lambert W function.³ Then

β̂_{d_n}(a) →^P β(a)

as n → ∞.

² While it is clearly possible to replace histograms with other choices of density estimators (most notably kernel density estimators), histograms in this case are more convenient theoretically and computationally. See §5 for more details.

³ The Lambert W function is defined as the (multivalued) inverse of f(w) = w exp{w}. Thus, O(exp{W(log n)}) is bigger than O(log log n) but smaller than O(log n). See, for example, Corless et al. [6].
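Since d = exp{W(log n)} is equivalently the solution of d log d = log n, the schedule is easy to compute. A sketch with our own helper names (Newton's method on w e^w = z; not code from the paper):

```python
import math

def lambert_w(z, tol=1e-12):
    """Principal branch of W: solves w * exp(w) = z for z >= 0, via Newton."""
    w = math.log(1.0 + z)  # reasonable starting point for z >= 0
    for _ in range(100):
        e = math.exp(w)
        step = (w * e - z) / (e * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def d_n(n):
    """Dimension schedule d_n = exp{W(log n)}, i.e. the root of d*log(d) = log n."""
    return math.exp(lambert_w(math.log(n)))

# exp{W(log n)} sits strictly between log log n and log n:
# for n = 10**6, log log n is about 2.6 and log n about 13.8, with d_n roughly 7.
```

This makes concrete how slowly the dimension of the estimated densities is allowed to grow.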

A finite sample bound for the estimation error is the first step to establishing consistency for β̂_d(a). This result gives convergence rates for estimation of the finite-dimensional mixing coefficient β_d(a), and also for Markov processes of known order d, since in this case β_d(a) = β(a).

Theorem 2.4. Consider a sample X_1^n from a stationary β-mixing process. Let μ_n and m_n be positive integers such that 2μ_n m_n = n and m_n − d > 0. Then

P( | β̂_d(a) − β_d(a) | > ε ) ≤ 2 exp{ −μ_n ε_1^2 / 2 } + 2 exp{ −μ_n ε_2^2 / 2 } + 4(μ_n − 1)β(m_n),

where ε_1 = ε/2 − E[ ∫ | f̂_d − f_d | ] and ε_2 = ε − E[ ∫ | f̂_a^{2d} − f_a^{2d} | ].

Consistency of the estimator β̂_d(a) is guaranteed only for certain choices of m_n and μ_n. Clearly, μ_n → ∞ and μ_n β(m_n) → 0 as n → ∞ are necessary conditions. Consistency also requires convergence of the histogram estimators to the target densities. We leave the proof of this theorem for Section 4. As an example to show that this bound can go to zero with proper choices of m_n and μ_n, the following corollary proves consistency for first-order Markov processes. Consistency of the estimator for higher-order Markov processes can be proven similarly. These processes are geometrically β-mixing, as shown in, e.g., Nummelin and Tuominen [19].

Corollary 2.5. Let X_1^n be a sample from a first-order Markov process with β(a) = β_1(a) = O(r^a) for some 0 ≤ r < 1. Then, under the conditions of Theorem 2.4, β̂_1(a) →^P β(a) at a rate of o(√n) up to a logarithmic factor.

Proof. Recall that n = 2μ_n m_n. Then,

4(μ_n − 1)β(m_n) = 4μ_n β(m_n) − 4β(m_n) = K_1 (n/m_n) r^{m_n} − K_2 r^{m_n} → 0

if m_n = Ω(log n), for constants K_1 and K_2. But the exponential terms are

exp{ −K_3 n ε_j^2 / m_n }

for j = 1, 2 and a constant K_3. Therefore, both exponential terms go to 0 as n → ∞ for m_n = o(n). Balancing the rates gives the optimal choice of m_n = o(√n), with corresponding rate of convergence (up to a logarithmic factor) of o(√n). ∎
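The balancing can also be seen numerically by evaluating the three terms of the Theorem 2.4 bound under an assumed geometric rate β(m) = r^m. The constants below (ε_1 = ε_2 = 0.2, r = 0.9) are purely illustrative, not from the paper:

```python
import math

def thm24_bound(n, m, eps1, eps2, r):
    """Right-hand side of Theorem 2.4 with beta(m) = r**m and mu = n / (2m)."""
    mu = n / (2 * m)
    return (2 * math.exp(-mu * eps1 ** 2 / 2)
            + 2 * math.exp(-mu * eps2 ** 2 / 2)
            + 4 * (mu - 1) * r ** m)

n = 10 ** 6
balanced = thm24_bound(n, int(n ** 0.5), 0.2, 0.2, 0.9)   # m near sqrt(n)
too_short = thm24_bound(n, 5, 0.2, 0.2, 0.9)              # mixing term dominates
too_long = thm24_bound(n, n // 4, 0.2, 0.2, 0.9)          # exponential terms dominate
```

Block lengths near √n keep both the exponential terms (which need μ_n large) and the mixing term (which needs m_n large) small; either extreme makes the bound vacuous.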


Proving Theorem 2.4 requires showing the L_1 convergence of the histogram density estimator with β-mixing data. We do this in the next section.

3 L_1 convergence of histograms

Convergence of density estimators is thoroughly studied in the statistics and machine learning literature. Early papers on the L_1 convergence of kernel density estimators (KDEs) include [26, 2, 22]; Freedman and Diaconis [10] look specifically at histogram estimators, and Yu [27] considers the L_1 convergence of KDEs for β-mixing data and shows that the optimal IID rates can be attained. Devroye and Györfi [8] argue that L_1 is a more appropriate metric for studying density estimation, and Tran [23] proves L_1 consistency of KDEs under α- and β-mixing. As far as we are aware, ours is the first proof of L_1 convergence for histograms under β-mixing.

Additionally, the dimensionality of the target density is analogous to the order of the Markov approximation. Therefore, the convergence rates we give are asymptotic in the bandwidth h_n, which shrinks as n increases, but also in the dimension d, which increases with n. Even under these asymptotics, histogram estimation in this sense is not a high-dimensional problem. The dimension of the target density considered here is on the order of exp{W(log n)}, a rate somewhere between log n and log log n.

Theorem 3.1. If f̂ is the histogram estimator based on a (possibly vector-valued) sample X_1^n from a β-mixing sequence with stationary density f, then for all ε > E[ ∫ | f̂ − f | ],

P( ∫ | f̂ − f | > ε ) ≤ 2 exp{ −μ_n ε_1^2 / 2 } + 2(μ_n − 1)β(m_n),   (5)

where ε_1 = ε − E[ ∫ | f̂ − f | ].

To prove this result, we use the blocking method of Yu [28] to transform the dependent β-mixing sequence into a sequence of nearly independent blocks. We then apply McDiarmid's inequality to the blocks to derive asymptotics in the bandwidth of the histogram as well as the dimension of the target density. For completeness, we state Yu's blocking result and McDiarmid's inequality before proving the doubly asymptotic histogram convergence for IID data. Combining these lemmas allows us to derive rates of convergence for histograms based on β-mixing inputs.

Lemma 3.2 (Lemma 4.1 in Yu [28]). Let φ be a measurable function with respect to the block sequence U, uniformly bounded by M. Then,

| E[φ] − Ẽ[φ] | ≤ M β(m_n)(μ_n − 1),   (6)

where the first expectation is with respect to the dependent block sequence, U, and Ẽ is with respect to the independent sequence, U'.

This lemma essentially gives a method of applying IID results to β-mixing data. Because the dependence decays as we increase the separation between blocks, widely spaced blocks are nearly independent of each other. In particular, the difference between expectations over these nearly independent blocks and expectations over blocks which are actually independent can be controlled by the β-mixing coefficient.

Lemma 3.3 (McDiarmid's inequality [14]). Let X_1, …, X_n be independent random variables, with X_i taking values in a set A_i for each i. Suppose that the measurable function f : ∏ A_i → R satisfies

| f(x) − f(x') | ≤ c_i

whenever the vectors x and x' differ only in the i-th coordinate. Then, for any ε > 0,

P( f − E f > ε ) ≤ exp{ −2ε^2 / Σ_i c_i^2 }.

Lemma 3.4. For an IID sample X_1, …, X_n from some density f on R^d,

E ∫ | f̂ − E f̂ | dx = O( 1 / √(n h_n^d) ),   (7)

∫ | E f̂ − f | dx = O(d h_n) + O(d^2 h_n^2),   (8)

where f̂ is the histogram estimate using a grid with sides of length h_n.

Proof of Lemma 3.4. Let p_j be the probability of falling into the j-th bin, B_j. Then,

E ∫ | f̂ − E f̂ | = h_n^d Σ_{j=1}^J E | (1/(n h_n^d)) Σ_{i=1}^n I(X_i ∈ B_j) − p_j / h_n^d |
 ≤ h_n^d Σ_{j=1}^J (1/(n h_n^d)) √( V[ Σ_{i=1}^n I(X_i ∈ B_j) ] )
 = h_n^d Σ_{j=1}^J (1/(n h_n^d)) √( n p_j (1 − p_j) )
 = (1/√n) Σ_{j=1}^J √( p_j (1 − p_j) )
 = O(n^{−1/2}) O(h_n^{−d/2}) = O( 1 / √(n h_n^d) ).


For the second claim, consider the bin B_j centered at c. Let I be the union of all bins B_j. Assume the following:

1. f ∈ L_2 and f is absolutely continuous on I, with a.e. partial derivatives f_i = ∂f(y)/∂y_i;

2. f_i ∈ L_2 and f_i is absolutely continuous on I, with a.e. partial derivatives f_{ik} = ∂f_i(y)/∂y_k;

3. f_{ik} ∈ L_2 for all i, k.

Using a Taylor expansion,

f(x) = f(c) + Σ_{i=1}^d (x_i − c_i) f_i(c) + O(d^2 h_n^2),

where f_i(y) = ∂f(y)/∂y_i. Therefore, p_j is given by

p_j = ∫_{B_j} f(x) dx = h_n^d f(c) + O(d^2 h_n^{d+2}),

since the integral of the second term over the bin is zero. This means that for the j-th bin,

E f̂_n(x) − f(x) = p_j / h_n^d − f(x) = − Σ_{i=1}^d (x_i − c_i) f_i(c) + O(d^2 h_n^2).

Therefore,

∫_{B_j} | E f̂_n(x) − f(x) | = ∫_{B_j} | − Σ_{i=1}^d (x_i − c_i) f_i(c) + O(d^2 h_n^2) |
 ≤ ∫_{B_j} | Σ_{i=1}^d (x_i − c_i) f_i(c) | + ∫_{B_j} O(d^2 h_n^2)
 = ∫_{B_j} | Σ_{i=1}^d (x_i − c_i) f_i(c) | + O(d^2 h_n^{2+d})
 = O(d h_n^{d+1}) + O(d^2 h_n^{2+d}).

Since each bin is bounded, we can sum over all J bins. The number of bins is J = h_n^{−d} by definition, so

∫ | E f̂_n(x) − f(x) | dx = O(h_n^{−d}) [ O(d h_n^{d+1}) + O(d^2 h_n^{2+d}) ] = O(d h_n) + O(d^2 h_n^2). ∎
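The O(d h_n) bias term in (8) can be checked directly in one dimension: for f(x) = 2x on [0, 1], E f̂ equals p_j/h on bin j, and ∫ |E f̂ − f| works out to exactly h/2. A numerical sketch with our own helper (the d = 1 case only, not code from the paper):

```python
def hist_bias_l1(J, grid=10000):
    """Numerically integrate |E fhat - f| for f(x) = 2x on [0, 1] with J equal
    bins of width h = 1/J. E fhat is p_j / h on bin j, where p_j = hi^2 - lo^2."""
    h = 1.0 / J
    total = 0.0
    for k in range(grid):
        x = (k + 0.5) / grid  # midpoint rule
        j = min(int(x / h), J - 1)
        lo = j * h
        e_fhat = ((lo + h) ** 2 - lo ** 2) / h  # equals 2 * (bin midpoint)
        total += abs(e_fhat - 2 * x) / grid
    return total
```

Halving the bin width halves the bias, matching the O(d h_n) term; the O(d^2 h_n^2) correction vanishes here because f is linear.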

We can now prove the main result of this section.

Proof of Theorem 3.1. Let g be the L_1 loss of the histogram estimator, g = ∫ | f − f̂_n |. Here f̂_n(x) = (1/(n h_n^d)) Σ_{i=1}^n I(X_i ∈ B_j(x)), where B_j(x) is the bin containing x. Let f̂_U, f̂_V, and f̂_{U'} be histograms based on the block sequences U, V, and U' respectively. Clearly, f̂_n = (1/2)(f̂_U + f̂_V). Now,

P(g > ε) = P( ∫ | f − f̂_n | > ε )
 = P( ∫ | (f − f̂_U)/2 + (f − f̂_V)/2 | > ε )
 ≤ P( (1/2) ∫ | f − f̂_U | + (1/2) ∫ | f − f̂_V | > ε )
 = P( g_U + g_V > 2ε )
 ≤ P( g_U > ε ) + P( g_V > ε )
 = 2 P( g_U − E[g_U] > ε − E[g_U] )
 = 2 P( g_U − E[g_{U'}] > ε − E[g_{U'}] )
 = 2 P( g_U − E[g_{U'}] > ε_1 ),

where ε_1 = ε − E[g_{U'}]. Here,

E[g_{U'}] ≤ Ẽ ∫ | f̂_{U'} − Ẽ f̂_{U'} | dx + ∫ | Ẽ f̂_{U'} − f | dx,

so by Lemma 3.4, as long as, as n → ∞, h_n ↓ 0 and μ_n h_n^d → ∞, then for all ε there exists n_0(ε) such that for all n > n_0(ε), ε > E[g] = E[g_{U'}]. Now, applying Lemma 3.2 to the expectation of the indicator of the event {g_U − E[g_{U'}] > ε_1} gives

2 P( g_U − E[g_{U'}] > ε_1 ) ≤ 2 P( g_{U'} − E[g_{U'}] > ε_1 ) + 2(μ_n − 1)β(m_n),

where the probability on the right is for the σ-field generated by the independent block sequence U'. Since these blocks are independent, showing that g_{U'} satisfies the bounded differences requirement allows for the application of McDiarmid's inequality (Lemma 3.3) to the blocks. For any two block sequences u'_1, …, u'_{μ_n} and ũ'_1, …, ũ'_{μ_n} with u'_ℓ = ũ'_ℓ for all ℓ ≠ j,

| g_{U'}(u'_1, …, u'_{μ_n}) − g_{U'}(ũ'_1, …, ũ'_{μ_n}) |
 = | ∫ | f̂(y; u'_1, …, u'_{μ_n}) − f(y) | dy − ∫ | f̂(y; ũ'_1, …, ũ'_{μ_n}) − f(y) | dy |
 ≤ ∫ | f̂(y; u'_1, …, u'_{μ_n}) − f̂(y; ũ'_1, …, ũ'_{μ_n}) | dy
 ≤ (2/μ_n) h_n^{−d} h_n^d = 2/μ_n.

Therefore,

P(g > ε) ≤ 2 P( g_{U'} − E[g_{U'}] > ε_1 ) + 2(μ_n − 1)β(m_n)
 ≤ 2 exp{ −μ_n ε_1^2 / 2 } + 2(μ_n − 1)β(m_n). ∎


4 Proofs

The proof of Theorem 2.4 relies on the triangle inequality and the relationship between total variation distance and the L_1 distance between densities.

Proof of Theorem 2.4. For any probability measures ν and λ defined on the same probability space, with associated densities f_ν and f_λ with respect to some dominating measure μ,

‖ ν − λ ‖_TV = (1/2) ∫ | f_ν − f_λ | dμ.

Let P be the d-dimensional stationary distribution of the d-th order Markov process, i.e. P = P_{t−d+1}^t = P_{t+a}^{t+a+d−1} in the notation of equation (3). Let P_{a,d} be the joint distribution of the bivariate random process created by the initial process and itself separated by a time steps. By the triangle inequality, we can upper bound β_d(a) for any d = d_n. Let P̂ and P̂_{a,d} be the distributions associated with the histogram estimators f̂_d and f̂_a^{2d} respectively. Then,

β_d(a) = ‖ P ⊗ P − P_{a,d} ‖_TV
 = ‖ P ⊗ P − P̂ ⊗ P̂ + P̂ ⊗ P̂ − P̂_{a,d} + P̂_{a,d} − P_{a,d} ‖_TV
 ≤ ‖ P ⊗ P − P̂ ⊗ P̂ ‖_TV + ‖ P̂ ⊗ P̂ − P̂_{a,d} ‖_TV + ‖ P̂_{a,d} − P_{a,d} ‖_TV
 ≤ 2 ‖ P − P̂ ‖_TV + ‖ P̂ ⊗ P̂ − P̂_{a,d} ‖_TV + ‖ P̂_{a,d} − P_{a,d} ‖_TV
 = ∫ | f_d − f̂_d | + (1/2) ∫ | f̂_d ⊗ f̂_d − f̂_a^{2d} | + (1/2) ∫ | f_a^{2d} − f̂_a^{2d} |,

where (1/2) ∫ | f̂_d ⊗ f̂_d − f̂_a^{2d} | is our estimator β̂_d(a), and the remaining terms are the L_1 distance between a density estimator and the target density. Thus,

β_d(a) − β̂_d(a) ≤ ∫ | f_d − f̂_d | + (1/2) ∫ | f_a^{2d} − f̂_a^{2d} |.

A similar argument starting from β̂_d(a) = ‖ P̂ ⊗ P̂ − P̂_{a,d} ‖_TV shows that

β_d(a) − β̂_d(a) ≥ − ∫ | f_d − f̂_d | − (1/2) ∫ | f_a^{2d} − f̂_a^{2d} |,

so we have that

| β_d(a) − β̂_d(a) | ≤ ∫ | f_d − f̂_d | + (1/2) ∫ | f_a^{2d} − f̂_a^{2d} |.

Therefore,

P( | β_d(a) − β̂_d(a) | > ε )
 ≤ P( ∫ | f_d − f̂_d | + (1/2) ∫ | f_a^{2d} − f̂_a^{2d} | > ε )
 ≤ P( ∫ | f_d − f̂_d | > ε/2 ) + P( (1/2) ∫ | f_a^{2d} − f̂_a^{2d} | > ε/2 )
 ≤ 2 exp{ −μ_n ε_1^2 / 2 } + 2 exp{ −μ_n ε_2^2 / 2 } + 4(μ_n − 1)β(m_n),

where ε_1 = ε/2 − E[ ∫ | f̂_d − f_d | ] and ε_2 = ε − E[ ∫ | f̂_a^{2d} − f_a^{2d} | ]. ∎

The proof of Theorem 2.3 requires two steps, which are given in the following lemmas. The first specifies the histogram bandwidth h_n and the rate at which d_n (the dimensionality of the target density) goes to infinity. If the dimensionality of the target density were fixed, we could achieve rates of convergence similar to those for histograms based on IID inputs. However, we wish to allow the dimensionality to grow with n, so the rates are much slower, as shown in the following lemma.

Lemma 4.1. For the histogram estimator in Lemma 3.4, let

d_n ≤ exp{ W(log n) },   h_n = n^{−k_n},

with

k_n = ( W(log n) + (1/2) log n ) / ( log n ( (1/2) exp{W(log n)} + 1 ) ).

These choices lead to the optimal rate of convergence.

These choices lead to the optimal rate of convergence.

Proof.Let h

n

= n

k

n

for some k

n

to be determined.

Then we want n

1=2

h

d

n

=2

n

= n

(k

n

d

n

1)=2

!0,

d

n

h

n

= d

n

n

k

!0,and d

2

n

h

2

n

= d

2

n

n

2k

!0 all

as n!1.Call these A,B,and C.Taking A and B

rst gives

n

(k

n

d

n

1)=2

d

n

n

k

n

)

1

2

(k

n

d

n

1) log n log d

n

k

n

log n

)k

n

log n

1

2

d

n

+1

log d

n

+

1

2

log n

)k

n

log d

n

+

1

2

log n

log n

1

2

d

n

+1

:(9)


Similarly, combining A and C gives

k_n = ( 2 log d_n + (1/2) log n ) / ( log n ( (1/2) d_n + 2 ) ).   (10)

Equating (9) and (10) and solving for d_n gives

d_n = exp{ W(log n) },

where W(·) is the Lambert W function. Plugging back into (9) gives h_n = n^{−k_n}, where

k_n = ( W(log n) + (1/2) log n ) / ( log n ( (1/2) exp{W(log n)} + 1 ) ). ∎

It is also necessary to show that, as d grows, β_d(a) → β(a). We now prove this result.

Lemma 4.2. β_d(a) converges to β(a) as d → ∞.

Proof. By stationarity, the supremum over t is unnecessary in Definition 2.1, so without loss of generality let t = 0. Let P_{−∞}^0 be the distribution on σ_{−∞}^0 = σ(…, X_{−1}, X_0), and let P_{a+1}^∞ be the distribution on σ_{a+1}^∞ = σ(X_{a+1}, X_{a+2}, …). Let P_a be the distribution on Σ = σ_{−∞}^0 ⊗ σ_{a+1}^∞ (the product sigma-field). Then we can rewrite Definition 2.1 using this notation as

β(a) = sup_{C ∈ Σ} | P_a(C) − [P_{−∞}^0 ⊗ P_{a+1}^∞](C) |.

Let σ_{−d+1}^0 and σ_{a+1}^{a+d} be the sub-σ-fields of σ_{−∞}^0 and σ_{a+1}^∞ consisting of the d-dimensional cylinder sets for the d dimensions closest together. Let Σ_d be the product σ-field of these two. Then we can rewrite β_d(a) as

β_d(a) = sup_{C ∈ Σ_d} | P_a(C) − [P_{−∞}^0 ⊗ P_{a+1}^∞](C) |.   (11)

As such, β_d(a) ≤ β(a) for all a and d. We can rewrite (11) in terms of finite-dimensional marginals:

β_d(a) = sup_{C ∈ Σ_d} | P_{a,d}(C) − [P_{−d+1}^0 ⊗ P_{a+1}^{a+d}](C) |,

where P_{a,d} is the restriction of P_a to σ(X_{−d+1}, …, X_0, X_{a+1}, …, X_{a+d}). Because of the nested nature of these sigma-fields, we have

β_{d_1}(a) ≤ β_{d_2}(a) ≤ β(a)

for all finite d_1 ≤ d_2. Therefore, for fixed a, {β_d(a)}_{d=1}^∞ is a monotone increasing sequence which is bounded above, and it converges to some limit L ≤ β(a). To show that L = β(a) requires some additional steps.

Let R = P_a − [P_{−∞}^0 ⊗ P_{a+1}^∞], which is a signed measure on Σ. Let R_d = P_{a,d} − [P_{−d+1}^0 ⊗ P_{a+1}^{a+d}], which is a signed measure on Σ_d. Decompose R into positive and negative parts as R = Q^+ − Q^−, and similarly R_d = Q_d^+ − Q_d^−. Notice that, since R_d is constructed using the marginals of P_a, we have R(E) = R_d(E) for all E ∈ Σ_d. Now, since R is the difference of probability measures, we must have that

0 = R(Ω) = Q^+(Ω) − Q^−(Ω) = Q^+(D) + Q^+(D^c) − Q^−(D) − Q^−(D^c)   (12)

for all D ∈ Σ.

Define Q = Q^+ + Q^−. Let ε > 0. Let C ∈ Σ be such that

Q(C) = β(a) = Q^+(C) = Q^−(C^c).   (13)

Such a set C is guaranteed by the Hahn decomposition theorem (letting C be a set which attains the supremum defining β(a), we can throw away any subsets with negative R measure) and by (12), assuming without loss of generality that P_a(C) > [P_{−∞}^0 ⊗ P_{a+1}^∞](C). We can use the field Σ_f = ∪_d Σ_d to approximate C in the sense that, for all ε, we can find A ∈ Σ_f such that Q(A △ C) < ε/2 (see Theorem D in Halmos [11, §13] or Lemma A.24 in Schervish [21]). Now,

Q(A △ C) = Q(A ∩ C^c) + Q(C ∩ A^c) = Q^−(A ∩ C^c) + Q^+(C ∩ A^c)

by (13), since A ∩ C^c ⊆ C^c and C ∩ A^c ⊆ C. Therefore, since Q(A △ C) < ε/2, we have

Q^−(A ∩ C^c) ≤ ε/2   and   Q^+(A^c ∩ C) ≤ ε/2.   (14)

Also,

Q(C) = Q(A ∩ C) + Q(A^c ∩ C) = Q^+(A ∩ C) + Q^+(A^c ∩ C) ≤ Q^+(A) + ε/2,

since A ∩ C and A^c ∩ C are contained in C and A ∩ C ⊆ A. Therefore,

Q^+(A) ≥ Q(C) − ε/2.

Similarly,

Q^−(A) = Q^−(A ∩ C) + Q^−(A ∩ C^c) ≤ 0 + ε/2 = ε/2,

since A ∩ C ⊆ C with Q^−(C) = 0 by (13), and Q^−(A ∩ C^c) ≤ ε/2 by (14). Finally,

Q_d^+(A) ≥ Q_d^+(A) − Q_d^−(A) = R_d(A) = R(A) = Q^+(A) − Q^−(A) ≥ Q(C) − ε/2 − ε/2 = Q(C) − ε = β(a) − ε.


And since β_d(a) ≥ Q_d^+(A), we have that for all ε > 0 there exists d such that for all d_1 > d,

β_{d_1}(a) ≥ β_d(a) ≥ Q_d^+(A) ≥ β(a) − ε.

Thus, we must have that L = β(a), so that β_d(a) → β(a), as desired. ∎

Proof of Theorem 2.3. By the triangle inequality,

| β̂_{d_n}(a) − β(a) | ≤ | β̂_{d_n}(a) − β_{d_n}(a) | + | β_{d_n}(a) − β(a) |.

The first term on the right is bounded by the result in Theorem 2.4, where we have shown that d_n = O(exp{W(log n)}) is slow enough for the histogram estimator to remain consistent. That β_{d_n}(a) → β(a) as d_n → ∞ follows from Lemma 4.2. ∎

5 Discussion

We have shown that our estimator of the β-mixing coefficients is consistent for the true coefficients β(a) under some conditions on the data-generating process. There are numerous results in the statistics and machine learning literatures which assume knowledge of the β-mixing coefficients, yet as far as we know, this is the first estimator for them. An ability to estimate these coefficients will allow researchers to apply existing results to dependent data without the need to arbitrarily assume their values. Despite the obvious utility of this estimator, as a consequence of its novelty, it comes with a number of potential extensions which warrant careful exploration, as well as some drawbacks.

The reader will note that Theorem 2.3 does not provide a convergence rate. The rate in Theorem 2.4 applies only to the difference between β̂_d(a) and β_d(a). In order to provide a rate in Theorem 2.3, we would need a better understanding of the non-stochastic convergence of β_d(a) to β(a). It is not immediately clear that this quantity can converge at any well-defined rate. In particular, it seems likely that the rate of convergence depends on the tail of the sequence {β(a)}_{a=1}^∞.

Several other mixing and weak-dependence coefficients also have a total-variation flavor, perhaps most notably α-mixing [9, 7, 4]. None of them have estimators, and the same trick might well work for them, too.

The use of histograms rather than kernel density estimators for the joint and marginal densities is surprising and perhaps not ultimately necessary. As mentioned above, Tran [23] proved that KDEs are consistent for estimating the stationary density of a time series with β-mixing inputs, so perhaps one could replace the histograms in our estimator with KDEs. However, this would need an analogue of the double asymptotic results proven for histograms in Lemma 3.4. In particular, we need to estimate increasingly higher-dimensional densities as n → ∞. This does not cause a problem of small-n-large-d, since d is chosen as a function of n; however, it will lead to increasingly higher-dimensional integration. For histograms, the integral is always trivial, but in the case of KDEs, the numerical accuracy of the integration algorithm becomes increasingly important. This issue could swamp any efficiency gains obtained through the use of kernels. However, this question certainly warrants further investigation.

The main drawback of an estimator based on a density estimate is its complexity. The mixing coefficients are functionals of the joint and marginal distributions derived from the stochastic process X; however, it is unsatisfying to estimate densities and solve integrals in order to estimate a single number. Vapnik's main principle for solving problems using a restricted amount of information is:

"When solving a given problem, try to avoid solving a more general problem as an intermediate step" [24, p. 30].

This principle is clearly violated here, but perhaps our seed will precipitate a more aesthetically pleasing solution.

Acknowledgements

The authors wish to thank Darren Homrighausen and two anonymous reviewers for helpful comments, and the Institute for New Economic Thinking for supporting this research.

References

[1] Baraud, Y., Comte, F., and Viennet, G. (2001), "Adaptive estimation in autoregression or β-mixing regression via model selection," Annals of Statistics, 29, 839–875.

[2] Bickel, P. and Rosenblatt, M. (1973), "On Some Global Measures of the Deviations of Density Function Estimates," The Annals of Statistics, 1, 1071–1095.

[3] Bousquet, O. and Elisseeff, A. (2002), "Stability and Generalization," The Journal of Machine Learning Research, 2, 499–526.

[4] Bradley, R. C. (2005), "Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions," Probability Surveys, 2, 107–144.


[5] Carrasco, M. and Chen, X. (2002), "Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models," Econometric Theory, 18, 17–39.

[6] Corless, R., Gonnet, G., Hare, D., Jeffrey, D., and Knuth, D. (1996), "On the Lambert W Function," Advances in Computational Mathematics, 5, 329–359.

[7] Dedecker, J., Doukhan, P., Lang, G., León R., J. R., Louhichi, S., and Prieur, C. (2007), Weak Dependence: With Examples and Applications, vol. 190 of Lecture Notes in Statistics, Springer Verlag, New York.

[8] Devroye, L. and Györfi, L. (1985), Nonparametric Density Estimation: The L_1 View, Wiley, New York.

[9] Doukhan, P. (1994), Mixing: Properties and Examples, vol. 85 of Lecture Notes in Statistics, Springer Verlag, New York.

[10] Freedman, D. and Diaconis, P. (1981), "On the Maximum Deviation Between the Histogram and the Underlying Density," Probability Theory and Related Fields, 58, 139–167.

[11] Halmos, P. (1974), Measure Theory, Graduate Texts in Mathematics, Springer-Verlag, New York.

[12] Karandikar, R. L. and Vidyasagar, M. (2009), "Probably Approximately Correct Learning with Beta-Mixing Input Sequences," submitted for publication.

[13] Lozano, A., Kulkarni, S., and Schapire, R. (2006), "Convergence and Consistency of Regularized Boosting Algorithms with Stationary Beta-Mixing Observations," Advances in Neural Information Processing Systems, 18, 819.

[14] McDiarmid, C. (1989), "On the Method of Bounded Differences," in Surveys in Combinatorics, ed. J. Siemons, vol. 141 of London Mathematical Society Lecture Note Series, pp. 148–188, Cambridge University Press.

[15] Meir, R. (2000), "Nonparametric Time Series Prediction Through Adaptive Model Selection," Machine Learning, 39, 5–34.

[16] Mohri, M. and Rostamizadeh, A. (2010), "Stability Bounds for Stationary φ-mixing and β-mixing Processes," Journal of Machine Learning Research, 11, 789–814.

[17] Mokkadem, A. (1988), "Mixing properties of ARMA processes," Stochastic Processes and their Applications, 29, 309–315.

[18] Nobel, A. (2006), "Hypothesis Testing for Families of Ergodic Processes," Bernoulli, 12, 251–269.

[19] Nummelin, E. and Tuominen, P. (1982), "Geometric Ergodicity of Harris Recurrent Markov Chains with Applications to Renewal Theory," Stochastic Processes and Their Applications, 12, 187–202.

[20] Ralaivola, L., Szafranski, M., and Stempfel, G. (2010), "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes," Journal of Machine Learning Research, 11, 1927–1956.

[21] Schervish, M. (1995), Theory of Statistics, Springer Series in Statistics, Springer Verlag, New York.

[22] Silverman, B. (1978), "Weak and Strong Uniform Consistency of the Kernel Estimate of a Density and its Derivatives," The Annals of Statistics, 6, 177–184.

[23] Tran, L. (1989), "The L_1 Convergence of Kernel Density Estimates under Dependence," The Canadian Journal of Statistics/La Revue Canadienne de Statistique, 17, 197–208.

[24] Vapnik, V. (2000), The Nature of Statistical Learning Theory, Statistics for Engineering and Information Science, Springer Verlag, New York, 2nd edn.

[25] Vidyasagar, M. (1997), A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems, Springer Verlag, Berlin.

[26] Woodroofe, M. (1967), "On the Maximum Deviation of the Sample Density," The Annals of Mathematical Statistics, 38, 475–481.

[27] Yu, B. (1993), "Density Estimation in the L_1 Norm for Dependent Data with Applications to the Gibbs Sampler," Annals of Statistics, 21, 711–735.

[28] Yu, B. (1994), "Rates of Convergence for Empirical Processes of Stationary Mixing Sequences," The Annals of Probability, 22, 94–116.
