Bayesian Model Estimation and

Comparison for Longitudinal

Categorical Data

Thu T. Tran

Submitted in total fulfilment of the requirements of the

degree of

Doctor of Philosophy

Statistics and Operational Research

School of Mathematical Sciences

Queensland University of Technology

Australia

March 2008

Abstract

In this thesis, we address issues of model estimation for longitudinal categorical data and of model selection for these data with missing covariates. Longitudinal survey data capture the responses of each subject repeatedly through time, allowing for the separation of variation in the measured variable of interest across time for one subject from the variation in that variable among all subjects. Questions concerning persistence, patterns of structure, interaction of events and stability of multivariate relationships can be answered through longitudinal data analysis.

Longitudinal data require special statistical methods because they must take into account the correlation between observations recorded on one subject. A further complication in analysing longitudinal data is accounting for the non-response or drop-out process. Potentially, the missing values are correlated with variables under study and hence cannot be totally excluded.

Firstly, we investigate a Bayesian hierarchical model for the analysis of categorical longitudinal data from the Longitudinal Survey of Immigrants to Australia. Data for each subject are observed on three separate occasions, or waves, of the survey. One of the features of the data set is that observations for some variables are missing for at least one wave. A model for the employment status of immigrants is developed by introducing, at the first stage of a hierarchical model, a multinomial model for the response; subsequent terms are then introduced to explain wave and subject effects. To estimate the model, we use the Gibbs sampler, which allows missing data for both the response and explanatory variables to be imputed at each iteration of the algorithm, given some appropriate prior distributions. After accounting for significant covariate effects in the model, results show that the relative probability of remaining unemployed diminished with time following arrival in Australia.

Secondly, we examine the Bayesian model selection techniques of the Bayes factor and Deviance Information Criterion for our regression models with missing covariates. Computing Bayes factors involves computing the often complex marginal likelihood $p(y \mid \text{model})$, and various authors have presented methods to estimate this quantity. Here, we take the approach of path sampling via power posteriors (Friel and Pettitt, 2006). The appeal of this method is that for hierarchical regression models with missing covariates, a common occurrence in longitudinal data analysis, it is straightforward to calculate and interpret, since integration over all parameters, including the imputed missing covariates and the random effects, is carried out automatically with minimal added complexities of modelling or computation. We apply this technique to compare models for the employment status of immigrants to Australia.

Finally, we also develop a model choice criterion based on the Deviance Information Criterion (DIC), similar to Celeux et al. (2006), but which is suitable for use with generalized linear models (GLMs) when covariates are missing at random. We define three different DICs: the marginal, where the missing data are averaged out of the likelihood; the complete, where the joint likelihood for response and covariates is considered; and the naive, where the likelihood is found assuming the missing values are parameters. These three versions have different computational complexities. We investigate through simulation the performance of these three different DICs for GLMs consisting of normally, binomially and multinomially distributed data with missing covariates having a normal distribution. We find that the marginal DIC and the estimate of the effective number of parameters, $p_D$, have desirable properties, appropriately indicating the true model for the response under differing amounts of missingness of the covariates. We find that the complete DIC is generally inappropriate in this context, as it is extremely sensitive to the degree of missingness of the covariate model. Our new methodology is illustrated by analysing the results of a community survey.

Keywords: Longitudinal data analysis; Generalized linear models; Bayesian hierarchical models; Bayesian model choice; Bayes factors; Deviance Information Criterion; Missing data

Declaration

This thesis comprises only my original work except where indicated in the Preface. Due acknowledgement has been made in the text to all other material used.

Preface

The work in Chapter 3 can be found in Pettitt, A., M. Haynes, T. Tran, and J. Hay (2002) "A Model for Longitudinal Employment Status of Immigrants to Australia", QUT e-Prints (available at http://eprints.qut.edu.au/). The research was carried out in collaboration with Tony Pettitt, Michele Haynes and John Hay. The original concepts were formed by Tony and John. Additional changes, improvements, implementation and write-up were carried out by Michele and me.

Chapter 4 appears as Pettitt, A. N., T. T. Tran, M. A. Haynes, and J. L. Hay (2006) "A Bayesian hierarchical model for categorical longitudinal data from a social survey of immigrants", Journal of the Royal Statistical Society, Series A 169, 97–114. I was in charge of all the Bayesian modelling and model selection analyses. I collaborated with the co-authors on the original concepts and shared in the writing and editing of the paper.

Part of Chapter 5 (Section 5.4.1) appears in Friel, N. and A. Pettitt (2006) "Marginal likelihood estimation via power posteriors", available from the authors. In Section 5.4.2, I extended the ideas of Friel and Pettitt (2006) for models with missing covariates. I carried out all the applications and analyses in this chapter.

Chapter 6 has been submitted as Tran, T. T. and Pettitt, A. N. (2006) "Deviance Information Criteria for Models with Imputed Missing Covariates", accepted subject to revision by the Australian and New Zealand Journal of Statistics. I was the main researcher responsible for the ideas, analyses and writing of the paper. Professor Pettitt contributed to the original ideas and to editing the paper.

Acknowledgements

The research carried out in this thesis is part of a project funded by the Australian Research Council in conjunction with the Queensland Treasury Department - Office of Economics and Statistical Research.

A big thank you to Tony Pettitt, Michele Haynes, Nancy Spencer, all the maths CSO's, staff and students for your help and support. I greatly appreciated your kindness.

Thank you Mum, Dad, my brothers and sisters for always being there for me.

Contents

Abstract
Declaration
Preface
Acknowledgements
List of Figures
List of Tables

1 Introduction
1.1 Introduction to Longitudinal Data Analysis
1.1.1 Parameter Estimation Methods
1.1.2 Missing Data Models
1.2 Bayesian methods, MCMC and BUGS
1.2.1 Bayesian methods
1.2.2 Markov Chain Monte Carlo
1.2.3 BUGS
1.3 Layout of the Thesis

2 Longitudinal Surveys - A brief review
2.1 Longitudinal Survey Designs
2.1.1 The Classic Design
2.1.2 Rotating Design
2.1.3 Proxy Interviews
2.1.4 Addressing dropout
2.1.5 Summary
2.2 The Goodna Community Longitudinal Survey
2.2.1 The Suburb of Goodna
2.2.2 Survey Design of GCLS
2.3 The Longitudinal Survey of Immigrants to Australia
2.3.1 Survey Design of LSIA
2.3.2 LSIA Data
2.3.3 Missing Data
2.3.4 Initial Univariate Analysis
2.4 Conclusions

3 Marginal Models of Categorical Longitudinal Data
3.1 Introduction
3.2 Modelling Employment Status
3.2.1 The Pooled Multinomial Logistic Model
3.2.2 Generalised Estimating Equations (GEE) Model
3.3 Results
3.3.1 Models Results for Complete Data Records
3.3.2 Model Results for Data with Incomplete Response and Covariate Entries
3.4 Conclusion

4 Bayesian Hierarchical Models
4.1 Introduction
4.2 Modelling Employment Status
4.2.1 Bayesian hierarchical model
4.2.2 Imputation of Missing Data
4.2.3 Alternative Models and Models Selection
4.3 Results
4.3.1 Model Results for Complete Data Records
4.3.2 Model Results for Data with Incomplete Response and Covariate Entries
4.4 Discussion

5 The Bayes Factor for Models with Missing Covariates
5.1 Introduction
5.2 Bayes Factor
5.3 Marginal Likelihood via Power Posteriors
5.3.1 Bayes Factor for Regression Models with Missing Covariates
5.4 Modelling Employment Status
5.4.1 Model Selection for Complete Data Records
5.4.2 Model Selection for Data with Incomplete Response and Covariates
5.5 Discussion and Conclusions

6 DIC for Models with Missing Covariates
6.1 Introduction
6.2 Deviance Information Criterion
6.3 DIC for missing data models
6.4 Simulation Study
6.4.1 Case 1: The Linear Model
6.4.2 Cases 2 and 3: The Binary and Multinomial Logit Models
6.5 Case Study: Goodna Community Survey
6.5.1 Modelling Community Participation
6.6 Discussions and Conclusions

7 Conclusions

A Goodna Questionnaire
B Results of Models A and B from Chapter 3
C Results of Models 3 to 6 from Chapter 4
D Code for Power Posteriors
D.1 R code for power posteriors of Model 2 with missing values
D.2 WinBUGS code (model.file) for Model 2 with missing values
E Code for DIC
E.1 R code for binary response data
E.2 WinBUGS code for model2.txt and model3.txt
E.2.1 model2.txt
E.2.2 Model3.txt

List of Figures

1.1 Graphical model for the 'Pumps' example in Spiegelhalter et al. (1998)

5.1 Expected deviances for the power posterior at temperature $t_i$ for model $k = 2$ (random effects model) using priors with variances of 9, 16 and 25

5.2 Expected deviances calculated at waves 2 and 3 only for the power posterior at temperature $t_i$ for models $k = 2$, 3 and 5 for data with incomplete entries

6.1 Plots of values of the marginal (∘) and naive (△) DICs and $p_D$s for varying amounts of missingness of the covariate $x_1$. (a) DICs for gaussian $y$, (b) $p_D$s for gaussian $y$, (c) DICs for binary $y$, (d) $p_D$s for binary $y$, (e) DICs for categorical $y$, (f) $p_D$s for categorical $y$

List of Tables

2.1 List of survey acronyms

2.2 Distribution of Principal Applicant Interview Status by Wave

2.3 Missing data patterns for the Longitudinal Survey of Immigrants to Australia

2.4 Counts (and proportions) of PAs by employment status and wave of interview for the two data sets with 'complete' and 'incomplete' cases, respectively

2.5 Summary percentage counts for Employment Status in Wave 1, Wave 2 and Wave 3, by explanatory variable

2.6 Percentages of immigrants in Survey by Visa Category and English-speaking ability at Wave 1

3.1 Estimates and standard errors from the binary generalised estimating equation models

3.2 Estimated effects and standard errors for the significant interactions in the binary model for employed/unemployed

3.3 Predicted marginal means and standard errors for the interactions in the binary model for employed/unemployed

3.4 Estimated effects and standard errors for the significant interactions in the binary model for non-participant/unemployed

3.5 Predicted marginal means and standard errors for the interactions in the binary model for non-participant/unemployed

4.1 Estimated posterior means and 95% intervals for the effects of selected explanatory variables on the log of the probability of an immigrant being unemployed relative to the probability of being employed, from Models 2, 3 and 5

4.2 Deviance Information Criteria calculated for all three waves corresponding to Models 1, 2, 3 and 5 from Table 4.1

4.3 Deviance Information Criteria calculated for waves 2 and 3 corresponding to Models 1, 2, 3 and 5 from Table 4.1

4.4 Estimated posterior means and 95% intervals for the effects of selected explanatory variables on the log of the probability of an immigrant being non-participant in the work force relative to the probability of being employed, from Models 2, 3 and 5

4.5 Estimated posterior means and empirical standard errors for parameters ($j = 2, 3$) in Model 5 fitted to complete case data and incomplete* case data

4.6 Increase in precision of estimated Model 5 parameters with inclusion of imputed missing values in data

5.1 Estimated $\log p(y \mid k)$ via Power Posteriors for model $k = 2$ using priors with variances of 9, 16 and 25

5.2 Estimated $\log p(y \mid k)$ by omitting $t_0 = 0$ and starting at $t_1 = 0.0016$ in 5.6 for model $k = 2$ using priors with variances of 9, 16 and 25

5.3 Expected deviances (and Monte Carlo standard errors) for the power posterior at temperature $t_i$ for models $k = 1, 2$ for complete data records

5.4 Estimated $\log p(y \mid k)$ via Power Posteriors based on all three waves for employment status models $k = 2$, 3 and 5 for complete data records

5.5 Estimated $\log p(y \mid k)$ via Power Posteriors based on waves 2 and 3 only for employment status models $k = 2$, 3 and 5 for complete data records

5.6 Expected $\log p(y \mid k)$ via Power Posteriors based on all three waves for employment status models $k = 2$, 3 and 5 for data with incomplete entries

5.7 Expected $\log p(y \mid k)$ via Power Posteriors based on waves 2 and 3 only for employment status models $k = 2$, 3 and 5 for data with incomplete entries

6.1 Values of MDIC, NDIC and CDIC for Models 1 and 2 for gaussian data

6.2 Values of MDIC, NDIC and CDIC for Models 1 and 2 of binary data

6.3 Values of MDIC, NDIC and CDIC for Models 1 and 2 of polytomous categorical data

6.4 DIC values for Community Involvement Binary Logit Models

6.5 Estimated posterior means and empirical standard errors for the effects of selected explanatory variables on the logit of the probability of community involvement of a resident of Goodna, from Models 1 and 2

B.1 Estimates and standard errors from the pooled multinomial logistic model

B.2 Estimates and standard errors from the binary generalised estimating equation models with an independent covariance matrix

C.1 Estimated posterior means and 95% intervals for the effects of selected explanatory variables from Models 3, 4, 5 and 6 (unemployed/employed)

C.2 Estimated posterior means and 95% intervals for the effects of selected explanatory variables from Models 3, 4, 5 and 6 (non-participant/employed)

Chapter 1

Introduction

Longitudinal investigations (defined broadly as studies in which the response of each individual is observed on two or more occasions) represent one of the principal research strategies employed in medical, health and social science research.

Government agencies around the world, in their policy making, have recently recognized the need for community longitudinal data to provide information on the dynamic nature of events and how they interact in influencing the changing behaviour and fortunes of households, families and individuals (Wooden, 2001). Surveys that have been conducted include the Los Angeles Family and Neighbourhood Survey (LAFANS) (Sastry et al., 2000), the Household, Income and Labour Dynamics in Australia (HILDA) Survey (Wooden, 2001) and the Longitudinal Survey of Immigrants to Australia (LSIA).

At a more local level, in Queensland, Australia, the Queensland Government has recently invested in the Goodna Service Integration Project (GSIP) in response to an identified need to improve the services offered or funded by government in Goodna. The project is not about additional sources of funding for service delivery; rather, it aims to ensure that services offered in Goodna are integrated, improve community lifestyle and strengthen the Goodna community. To assess the effectiveness of the GSIP, the Goodna Community Longitudinal Survey (GCLS) has been initiated to collect information on the Goodna community on the topics of social well-being, environmental well-being and economic well-being.

With the growing use of longitudinal data, methods of analysis need to be more widely understood. In this thesis, we investigate and develop methods, Bayesian in the main, to address issues of model estimation and model selection for longitudinal categorical data. The methods used for the analysis of longitudinal data differ from the traditional methods of multivariate analysis such as multiple regression. Longitudinal or panel data sets, as they are sometimes known, consist of repeated observations of an outcome and a set of covariates for each of many subjects; the covariates may be fixed at the outset, such as gender, or may change with time, such as English-speaking ability. In other words, longitudinal data sets consist of a collection of short time series taken on different units or individuals. As one would expect, such data sets are characterised by the fact that repeated observations for a subject tend to be correlated. Correct statistical analysis of such data therefore requires the modelling of this correlation (see, for example, Diggle et al. (2002)). We introduce existing methods for analysing longitudinal data in Section 1.1.

In addition to the modelling of the correlation structure, the analysis of longitudinal data is dependent on the nature of the response variable. When the response variable is approximately Gaussian, a large class of linear models is available for analysis. However, when the response variable is non-Gaussian, other techniques must be considered. Adding to this complexity is the tendency for some respondents to "drop-out" from the study or to participate at intermittent times, which may cause a bias in the inference if not appropriately allowed for. In Chapter 3, we discuss in more detail the complexities of analysing longitudinal nominal data and we apply two different methods to analyse the employment statuses of immigrants to Australia. We highlight the drawbacks of these methods when missing data are present and, in Chapter 4, we propose Bayesian modelling, which can accommodate these types of data well.

In the next section, we present an introduction to the analysis of longitudinal data and a literature review.

1.1 Introduction to Longitudinal Data Analysis

Before proceeding, let us define some notation as follows. Let $y_{it}$ denote the response variable observed at time $w_{it}$ for observation $t = 1, \ldots, T_i$ on subject $i = 1, \ldots, N$; $Y_i = (y_{i1}, \ldots, y_{iT_i})$. Let $x_{it}$ denote the $p$-vector of explanatory variables; $X_i = (x_{i1}, \ldots, x_{iT_i})$. $E(y_{it})$ denotes the mean of $y_{it}$, $\mathrm{Var}(y_{it})$ denotes the variance of $y_{it}$, $\mathrm{Cov}(y_{it}, y_{it'})$ denotes the covariance between $y_{it}$ and $y_{it'}$, and $\mathrm{Corr}(y_{it}, y_{it'})$ denotes the correlation between $y_{it}$ and $y_{it'}$.

Most longitudinal data analyses are based on a regression model of the form

$$E(y_{it}) = \mu_{it}, \qquad h(\mu_{it}) = x_{it}^T \beta \ \text{ for link function } h,$$

$$\mathrm{Var}(y_{it}) = v(\mu_{it})\,\phi,$$

$$\mathrm{Corr}(y_{it}, y_{it'}) = \rho(\mu_{it}, \mu_{it'}; \alpha),$$

where $v(\cdot)$ and $\rho(\cdot)$ are known functions, $\beta$ is the $p$-vector of unknown regression parameters and $\phi$ and $\alpha$ are additional parameters which may need to be estimated (see, for example, McCullagh and Nelder (1993)).
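
As a worked illustration of the three components above, the following lists standard generalised linear model choices of link, variance function and scale. These are textbook special cases given here for orientation; they are not taken from the thesis text, and $\sigma^2$ is introduced only for this example.

$$
\begin{aligned}
\text{Gaussian response:}\quad & h(\mu_{it}) = \mu_{it}, & v(\mu_{it}) &= 1, & \phi &= \sigma^2,\\
\text{binary response:}\quad & h(\mu_{it}) = \log\frac{\mu_{it}}{1-\mu_{it}}, & v(\mu_{it}) &= \mu_{it}(1-\mu_{it}), & \phi &= 1,\\
\text{Poisson count:}\quad & h(\mu_{it}) = \log \mu_{it}, & v(\mu_{it}) &= \mu_{it}, & \phi &= 1.
\end{aligned}
$$

For the correlation, two common choices that do not depend on the means are the exchangeable form $\rho = \alpha$ for all $t \neq t'$ and the first-order autoregressive form $\rho = \alpha^{|t - t'|}$.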

There are generally three approaches to modelling longitudinal data (Diggle et al., 2002).

1. Model the marginal mean as in cross-sectional data but account for correlation between repeated values. For example, the marginal mean linear model is given by

$$E(Y_i) = X_i \beta \quad \text{and} \quad \mathrm{Var}(Y_i) = V_i(\alpha)$$

with parameters $\beta$ and $\alpha$ to be estimated. This method is advantageous in allowing for separate modelling of the mean and covariance.

2. The subject-specific or random effects model. For example, the linear random effects model can be specified as

$$E(y_{it} \mid \beta_i) = x_{it}^T \beta_i$$

where we assume the person-specific vector of coefficients $\beta_i = \beta + U_i$, with $\beta$ constant and $U_i$ a zero-mean latent random variable. The $U_i$ can be interpreted as unobserved factors that are common to all responses for a given person (thus inducing the within-individual correlation) but which vary across people (thus the heterogeneous assumption can be restated for the $U_i$).

3. The transition model, which specifies a model for the conditional distribution of $y_{it}$ given past responses $y_{i,t-1}, \ldots, y_{i1}$ and $x_{it}$. For example, the first-order (Markov chain) logit model for binary responses is given by

$$\mathrm{logit}\, \Pr(y_{it} = 1 \mid y_{i,t-1}, \ldots, y_{i1}) = x_{it}^T \beta + \alpha\, y_{i,t-1}$$

where $\mathrm{logit}(p) = \log\{p / (1 - p)\}$. The model combines assumptions about the dependence of $Y$ on $X$ and the correlation among repeated $Y$'s.
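
The way a shared random effect induces within-individual correlation (approach 2 above) can be illustrated with a small simulation. The following is a minimal sketch, assuming the simplest possible case of a random intercept with two occasions per subject and no covariates; the function names, the mean of 2.0 and the standard deviations are hypothetical choices for illustration only.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def within_subject_correlation(n_subjects=20000, sigma_u=1.0, sigma_e=1.0, seed=1):
    """Simulate y_it = beta + U_i + e_it at two occasions and correlate them."""
    rng = random.Random(seed)
    beta = 2.0  # hypothetical common mean
    first, second = [], []
    for _ in range(n_subjects):
        u_i = rng.gauss(0.0, sigma_u)  # shared by both responses of subject i
        first.append(beta + u_i + rng.gauss(0.0, sigma_e))
        second.append(beta + u_i + rng.gauss(0.0, sigma_e))
    return pearson(first, second)


if __name__ == "__main__":
    # Theory: Corr = sigma_u^2 / (sigma_u^2 + sigma_e^2), i.e. 0.5 here.
    print(within_subject_correlation())
    # With no random effect the two occasions are uncorrelated.
    print(within_subject_correlation(sigma_u=0.0))
```

Under these assumptions the simulated correlation should sit near $\sigma_u^2 / (\sigma_u^2 + \sigma_e^2)$, and it vanishes when the random effect has zero variance.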

Traditionally, the most popular approach to modelling longitudinal response data is the marginal model. These models address net changes in the population and utilise various methodological strategies to account for the correlation between repeated measurements. In his discussion of Liang et al. (1992a), Gilks (1992) comments that marginal models can be thought of as approximations to conditional models containing random effects since marginal models are simply random effects models with the random effects integrated out (Zeger et al., 1988).
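
To make "integrating out the random effects" concrete, the sketch below takes a logistic model with a normal random intercept and computes the implied marginal probability by simple numerical integration on a grid. It is an illustration under assumed values (a conditional slope of 2 and a random-effect standard deviation of 2, both hypothetical) and is not a method used in the thesis.

```python
import math


def expit(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_probability(eta, sigma_u, n_grid=2001, width=8.0):
    """Average expit(eta + u) over u ~ N(0, sigma_u^2) on a symmetric grid."""
    if sigma_u == 0.0:
        return expit(eta)
    step = 2.0 * width * sigma_u / (n_grid - 1)
    total = weight_sum = 0.0
    for k in range(n_grid):
        u = -width * sigma_u + k * step
        w = math.exp(-0.5 * (u / sigma_u) ** 2)  # unnormalised normal density
        total += w * expit(eta + u)
        weight_sum += w
    return total / weight_sum


if __name__ == "__main__":
    beta1, sigma_u = 2.0, 2.0  # hypothetical conditional slope and random-effect sd
    p0 = marginal_probability(0.0, sigma_u)    # covariate value 0
    p1 = marginal_probability(beta1, sigma_u)  # covariate value 1
    # The population-averaged slope on the logit scale is smaller in magnitude
    # than the subject-specific slope beta1.
    print(beta1, logit(p1) - logit(p0))
```

The comparison shows why the two model classes answer different questions: the subject-specific coefficient and the population-averaged coefficient differ whenever the random-effect variance is non-zero and the link is non-linear.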

Random effects models are also known as generalised linear mixed models (GLMMs) (see, for example, Breslow and Clayton (1993)). Zeger et al. (1988) note that GLMMs are more appropriate when inferences about individuals are the scientific focus. Marginal models can only address the dependence of the population-averaged response on explanatory variables.

Other variations of these three main models include the dynamic conditionally linear mixed models developed by Pourahmadi and Daniels (2002). These models are an extension of the random effects model to include lagged observations as covariates. Heagerty (1999) developed the marginally specified logistic-normal models for binary data. Heagerty and Zeger (2000) extended this model to multi-level data, using an alternative parameterization of the random effects model in which the marginal mean, rather than the conditional mean given random effects, is regressed on covariates. This model allows a choice as to whether the marginal mean structure or the conditional mean structure is the focus of modelling when using a latent variable formulation.

1.1.1 Parameter Estimation Methods

Although it is relatively straightforward to obtain parameter estimates and inferences for linear models through methods such as Maximum Likelihood (ML) and Restricted Maximum Likelihood (REML) (see, for example, Diggle et al. (2002)), the resulting likelihood functions from non-linear models are often much more complex.

Marginal Models

For the case of marginal models, the likelihood functions from non-linear models often contain moments of higher order than mean and variance and involve many nuisance parameters. As a result, a quasi-likelihood method such as the generalised estimating equations (GEE) (Liang and Zeger, 1986) presents an attractive alternative. The GEE approach assumes that the marginal distribution of the dependent variable follows a generalised linear model. Specification of a "working" correlation matrix for the relationship between repeated observations on each individual gives consistent estimates of these correlation parameters using the method of moments. A score equation approach is used to calculate parameter estimates and standard errors. For linear models, Wu et al. (2001) present a comparative study of the GEE with other methods, such as ML and REML, for variance and covariance estimation.
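
The alternation between solving the estimating equation for the regression parameters and updating the working correlation by the method of moments can be sketched as follows. This is a deliberately reduced illustration, assuming an identity link, a single covariate, balanced data and an exchangeable working correlation; it omits the robust standard errors of Liang and Zeger (1986), and the simulated data and all function names are hypothetical.

```python
import random


def simulate(n_subjects=2000, n_times=3, seed=3):
    """Simulated data: y_it = 1 + 2 x_it + U_i + e_it (hypothetical values)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        u = rng.gauss(0.0, 1.0)
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_times)]
        ys = [1.0 + 2.0 * x + u + rng.gauss(0.0, 1.0) for x in xs]
        data.append((xs, ys))
    return data


def gee_exchangeable(data, n_iter=10):
    """Fit E(y_it) = b0 + b1 x_it with an exchangeable working correlation."""
    n_times = len(data[0][0])
    alpha = 0.0  # start from working independence
    for _ in range(n_iter):
        # Inverse of R = (1 - alpha) I + alpha J has the form a * I - b * J.
        a = 1.0 / (1.0 - alpha)
        b = alpha / ((1.0 - alpha) * (1.0 + (n_times - 1) * alpha))
        # Accumulate X' R^{-1} X (2 x 2) and X' R^{-1} y (2 x 1) over subjects.
        s00 = s01 = s11 = t0 = t1 = 0.0
        for xs, ys in data:
            sx, sy = sum(xs), sum(ys)
            s00 += a * n_times - b * n_times * n_times
            s01 += a * sx - b * n_times * sx
            s11 += a * sum(x * x for x in xs) - b * sx * sx
            t0 += a * sy - b * n_times * sy
            t1 += a * sum(x * y for x, y in zip(xs, ys)) - b * sx * sy
        det = s00 * s11 - s01 * s01
        b0 = (s11 * t0 - s01 * t1) / det
        b1 = (s00 * t1 - s01 * t0) / det
        # Method-of-moments update of the scale and the common correlation.
        ss = cross = 0.0
        for xs, ys in data:
            res = [y - b0 - b1 * x for x, y in zip(xs, ys)]
            sq = sum(r * r for r in res)
            ss += sq
            cross += (sum(res) ** 2 - sq) / 2.0  # sum of r_s * r_t over s < t
        n_obs = len(data) * n_times
        n_pairs = len(data) * n_times * (n_times - 1) / 2.0
        phi = ss / (n_obs - 2)
        alpha = cross / ((n_pairs - 2) * phi)
    return b0, b1, alpha


if __name__ == "__main__":
    print(gee_exchangeable(simulate()))
```

With the simulated design above, the estimates should land near the generating values (intercept 1, slope 2) and the working correlation near 0.5, the within-subject correlation implied by equal random-effect and error variances.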

Since its introduction, several authors including Prentice (1988), Lipsitz et al. (1994) and Liang et al. (1992b) have studied the use of the GEE. Prentice (1988) applies the methodology to binary data and introduces a second set of estimating equations for the covariance matrix parameters. Thall and Vail (1990) extend the research and suggest other covariance models in the context of count data. Dobbie and Welsh (2001) adapt the GEE for the analysis of correlated zero-inflated count data. Liang et al. (1992b) suggest an extension of GEE which they call GEE2. This further generalisation allows simultaneous modelling of both the marginal distributions and the dependency between repeated observations. Miller et al. (1993) and Lipsitz et al. (1994) develop the methodology for repeated nominal or ordinal categorical responses. Lipsitz et al. (1994) have extended the GEE approach to encompass the analysis of longitudinal polytomous data. Catalano and Ryan (1992) and Fitzmaurice and Laird (1995) develop GEE methodology for bivariate discrete and continuous data.

Although statistical packages such as SAS and Splus accommodate estimation well for repeated binary and ordinal data, the GEE cannot be applied to repeated nominal data using this software.

Random Effects Models

For random effects models (or GLMMs), numerical methods are required to solve the likelihood functions. Estimation for the GLMM can be undertaken in several different ways. Schall (1991) provides methods for maximizing the joint distribution of the observed data and the random effects with respect to the fixed effects (or parameters) and the random effects. McGilchrist (1994) suggests a generalization of the best linear unbiased predictor (BLUP) approach (Robinson, 1991) similar to that of Schall (1991). McCulloch (1997) labels these methods "joint maximization" algorithms. McCulloch (1997) also notes that similar algorithms have been suggested by Breslow and Clayton (1993) and Wolfinger (1994). Estimation is based on the iterative fit of a set of generalised estimating equations. The procedures used in McCullagh and Nelder (1993) are available in some statistical software packages (e.g. PROC GLIMMIX in SAS, GLMM in Genstat). Diggle et al. (2002) suggest an approach based on the use of numerical quadrature. Hedeker and Gibbons (1994) and Liu and Pierce (1994) also favour this approach. This method of estimation is available in SAS (PROC NLMIXED) and is based on the direct maximisation of the approximate integrated likelihood.

In contrast with these approaches, McCulloch (1997) examines a Monte Carlo EM (MCEM) algorithm approach and proposes a new procedure which he calls Monte Carlo Newton-Raphson (MCNR).

Recently, the approach to estimation of random effects models is often Bayesian. A major advantage of Bayesian methods is that they can be implemented using Markov Chain Monte Carlo (MCMC) methods and hence complicated models can be handled.
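
As an indication of how MCMC handles a random effects model, the sketch below runs a Gibbs sampler for a Gaussian random intercept model, alternating between the full conditional of the random effects and that of the overall mean. It is a toy illustration only: the variances are assumed known, the prior on the mean is flat, and the data are simulated with hypothetical values. It is not the multinomial model developed later in the thesis.

```python
import math
import random
import statistics


def simulate(n_subjects=200, n_times=3, mu=2.0, seed=7):
    """Simulated responses y_it = mu + U_i + e_it with unit variances."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        u = rng.gauss(0.0, 1.0)
        data.append([mu + u + rng.gauss(0.0, 1.0) for _ in range(n_times)])
    return data


def gibbs_random_intercept(y, sigma_u=1.0, sigma_e=1.0, n_iter=2000, burn_in=500, seed=4):
    """Gibbs sampler for y_it = mu + U_i + e_it with known variances.

    y is a list of per-subject response lists; a flat prior is placed on mu.
    Returns the post-burn-in draws of mu.
    """
    rng = random.Random(seed)
    n_total = sum(len(y_i) for y_i in y)
    mu = 0.0
    draws = []
    for it in range(n_iter):
        # Draw each random effect from its normal full conditional.
        u = []
        for y_i in y:
            precision = len(y_i) / sigma_e**2 + 1.0 / sigma_u**2
            mean = sum(v - mu for v in y_i) / sigma_e**2 / precision
            u.append(rng.gauss(mean, math.sqrt(1.0 / precision)))
        # Draw mu from its normal full conditional given the random effects.
        resid_mean = sum(v - u_i for y_i, u_i in zip(y, u) for v in y_i) / n_total
        mu = rng.gauss(resid_mean, sigma_e / math.sqrt(n_total))
        if it >= burn_in:
            draws.append(mu)
    return draws


if __name__ == "__main__":
    draws = gibbs_random_intercept(simulate())
    print(statistics.fmean(draws), statistics.stdev(draws))
```

The same alternating structure extends to models where the full conditionals are not available in closed form, at the cost of replacing the exact draws with other sampling steps.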

Albert and Jais (1998) use a Bayesian hierarchical model for longitudinal binary data to identify risk factors associated with allergic accidents during plasma exchange. Lunn et al. (2001) examine a Bayesian hierarchical model for the analysis of ordinal longitudinal data. They examine the efficacy of a treatment for allergic rhinitis (recorded on a four-point scale), controlling for various confounding variables. Cruz-Mesia and Marshall (2006) develop Bayesian non-linear random effects models with continuous time autoregressive errors for longitudinal medical studies with unequally spaced observations. Other applications of Bayesian hierarchical models for the analysis of longitudinal data include Hogan and Daniels (2002), Rasmussen (2004) and Congdon (2002).

1.1.2 Missing Data Models

A further complication in analyzing longitudinal data is accounting for the missing values due to non-response and attrition. Potentially, the missing values are correlated with the variable under study and hence cannot be totally excluded.

Missing data are generally characterised by one of three missingness patterns (Diggle et al., 2002):

• Missing completely at random (MCAR): Cases with complete data are indistinguishable from cases with incomplete data.

• Missing at random (MAR): Cases with complete data differ from cases with incomplete data, but the pattern of missingness is traceable or predictable from other independent variables rather than being due to the specific variable for which the data are missing.

• Nonignorable (NI): The pattern of data missingness is non-random and it is not predictable from other independent variables. In contrast to the MAR situation where missingness is explained by other independent variables in a study, nonignorable missing data arise due to the data missingness pattern being explained by the variable for which the data are missing.
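
The practical difference between the three patterns can be seen in a small simulation. The sketch below is an illustration under assumed settings: a single fully observed covariate x, a response y = x + e whose true mean is 0, and hypothetical logistic missingness probabilities. It reports the complete-case mean of y and a regression-adjusted mean that uses the observed x for every subject.

```python
import math
import random
import statistics


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def complete_case_summaries(mechanism, n=20000, seed=5):
    """Simulate y = x + e (true mean of y is 0) and delete some y values.

    mechanism: "MCAR", "MAR" (missingness depends on the observed x) or
    "NI" (missingness depends on the possibly unobserved y itself).
    Returns (complete-case mean of y, regression-adjusted mean of y).
    """
    rng = random.Random(seed)
    xs, obs_x, obs_y = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)
        if mechanism == "MCAR":
            p_missing = 0.5
        elif mechanism == "MAR":
            p_missing = expit(x)
        elif mechanism == "NI":
            p_missing = expit(y)
        else:
            raise ValueError(mechanism)
        xs.append(x)
        if rng.random() >= p_missing:  # y is observed for this subject
            obs_x.append(x)
            obs_y.append(y)
    # Least-squares fit of y on x among the complete cases only.
    mx, my = statistics.fmean(obs_x), statistics.fmean(obs_y)
    slope = sum((a - mx) * (b - my) for a, b in zip(obs_x, obs_y)) / sum(
        (a - mx) ** 2 for a in obs_x
    )
    intercept = my - slope * mx
    # Predict at the mean of x over all subjects, observed or not.
    adjusted = intercept + slope * statistics.fmean(xs)
    return my, adjusted


if __name__ == "__main__":
    for mechanism in ("MCAR", "MAR", "NI"):
        print(mechanism, complete_case_summaries(mechanism))
```

In this setting the complete-case mean is close to the truth only under MCAR; conditioning on x repairs the bias under MAR but not under nonignorable missingness, where the missingness depends on y itself.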

The GEE approach relies on the assumption that the data are MCAR. As noted by Liang and Zeger (1986) and many subsequent authors, the generalised estimating equations will generally yield biased estimators if the missing data are not MCAR. Several methods have been proposed to adjust the estimating equation to obtain unbiased estimators when missing data are MAR rather than MCAR. These methods include weighted generalised estimating equations (Robins et al., 1995), nonparametric estimation of the conditional estimating scores (Reilly and Pepe, 1995), modelling the conditional distributions (Murphy and Li, 1995) and multiple imputation (Rubin, 1987). All these methods except the second make additional parametric model assumptions that potentially reduce the robustness of the estimates; the second method has difficulties in dealing with continuous data. While these methods reduce bias under a missing at random data process, if the data are missing completely at random, they unnecessarily increase variance. Applications of the GEE for longitudinal studies with data missing at random include Reboussin et al. (2002).

Likelihood based tests to assess whether or not the missing data are MCAR have
been proposed for contingency tables (Fuchs, 1982) and for multivariate normal
data (Little, 1988). A nonparametric test has also been proposed by Diggle
(1989), mainly for preliminary screening. Ridout (1991) proposed a parametric
test based on modelling the missing data process. In recent times there have
been two tests suggested for the assessment of MCAR in conjunction with the
generalised estimating equations: Park and Lee (1997) and Chen and Little
(1999). Park and Lee (1997) extend an idea by Park and Davis (1993) who use
weighted least squares for repeated categorical data. Chen and Little (1999)
generalise the basic idea for constructing test statistics in Little (1988) to
the generalised estimating equation setting, avoiding distributional
assumptions. Qu and Song (2002) present an alternative approach that avoids the
exhaustive estimation of parameters for each missing data pattern necessary for
the method of Chen and Little (1999).

It is well known that maximum likelihood estimates ignoring the missing data
mechanism are valid when the data are MAR (Rubin, 1976). Hence the distinction
between MCAR and MAR is less pertinent to maximum likelihood estimation than to
estimating equation methods.

When the missing data mechanism is non-ignorable, it has to be incorporated
into the main model. The modelling of missing data processes has had
considerable attention in the biostatistics literature and, for example, Diggle
and Kenward (1994) give a good summary of contemporary issues. Modelling
drop-out can give a priori information about the biases and potential for
flawed designs (Fitzmaurice et al., 1996). Post-hoc modelling can effectively
account for informative drop-out and so remove biases; see, for example,
Fitzgerald et al. (1996) and Dionne et al. (1998). Yun and Lee (2005) examine a
hierarchical likelihood method for the analysis of longitudinal data allowing
for non-ignorable and non-monotone (or "intermittent") missingness. Other
authors who have presented ideas on informative missing data include Wu and
Follmann (1999), Crouchley and Ganjali (2002) and Gelman et al. (2003).

1.2 Bayesian methods, MCMC and BUGS

Recently, Bayesian methodology has been a popular choice for the analysis of
longitudinal data. Although it can be computer intensive, an advantage of the
Bayesian approach is that it can be implemented via Markov chain Monte Carlo
(MCMC) methods, and the ongoing development of software packages such as
WinBUGS (Spiegelhalter et al., 1998) has meant that complicated models, such as
those that account for informative missing values, can be readily handled.

1.2.1 Bayesian methods

In Bayesian parametric statistical modelling of data $y$, we require the
specification of $p(y \mid \theta)$, the probability model for $y$ conditional
on a set of parameters $\theta$, and $p(\theta)$, the joint prior for $\theta$.
Inferences for $\theta$ are made on the joint posterior given by

\[ p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta). \]

Bayesian inference can be based on any function of the posterior distribution.
For any function $f(\theta)$, its posterior expectation is given by

\[ E[f(\theta) \mid y] \propto \int f(\theta)\, p(y \mid \theta)\, p(\theta)\, d\theta. \tag{1.1} \]

It is often the case that the integration in (1.1) cannot be carried out
analytically. Thus simulation techniques such as Markov Chain Monte Carlo
(MCMC) are needed to perform the analysis.
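For contrast, a small conjugate case in which the integral can be evaluated in closed form is sketched here. It is purely illustrative and is not a model used in this thesis: the binomial likelihood, the Beta prior and the symbols $n$, $a$ and $b$ are assumptions introduced for the example. Suppose $y \mid \theta \sim \text{Binomial}(n, \theta)$ and $\theta \sim \text{Beta}(a, b)$. Then

\[ p(\theta \mid y) \propto \theta^{y}(1-\theta)^{n-y} \times \theta^{a-1}(1-\theta)^{b-1} = \theta^{a+y-1}(1-\theta)^{b+n-y-1}, \]

which is the kernel of a $\text{Beta}(a+y,\ b+n-y)$ distribution. Taking $f(\theta) = \theta$ and normalising gives

\[ E[\theta \mid y] = \frac{a+y}{a+b+n}. \]

For hierarchical models with several levels and missing data, no such closed form is generally available, which is the situation the simulation methods below are designed for.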

1.2.2 Markov Chain Monte Carlo

MCMC (see, for example, Gilks et al. (1998b) and Robert and Casella (2004)) is
essentially Monte Carlo integration using Markov chains. Monte Carlo
integration evaluates $E[f(X)]$ by drawing samples $\{X_t,\ t = 1, \ldots, n\}$
throughout the support of the target distribution, in our case
$p(\theta \mid y)$, in the correct proportion and then approximating

\[ E[f(X)] \approx \frac{1}{n} \sum_{t=1}^{n} f(X_t). \]
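The approximation above can be sketched in a few lines. This is an illustration only: the Beta(3, 9) target is a hypothetical choice, made because its mean (0.25) is known exactly and because independent draws are available directly, so no Markov chain is needed yet.

```python
import numpy as np


def monte_carlo_expectation(f, draws):
    """Approximate E[f(X)] by the sample average of f over the draws."""
    return float(np.mean(f(draws)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical target: a Beta(3, 9) distribution, whose mean is 3 / 12 = 0.25.
    draws = rng.beta(3.0, 9.0, size=50_000)
    estimate = monte_carlo_expectation(lambda x: x, draws)
    print(f"Monte Carlo estimate of E[X]: {estimate:.4f} (exact value 0.25)")
```

The same function applies unchanged when the draws come from a Markov chain instead of an independent sampler, provided the chain has the target as its limiting distribution.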


One way of generating these samples is by cleverly constructing a Markov chain
and running it for a long time. Gilks et al. (1998a) describe a Markov chain
as, essentially, a sequence of random variables, $\{X_t,\ t = 0, \ldots, n\}$,
where at each time $t \geq 0$, the next state $X_{t+1}$ is sampled from a
distribution $P(X_{t+1} \mid X_t)$ which depends only on the current state of
the chain, $X_t$. $P(\cdot \mid \cdot)$ is called the transition kernel of the
chain. For discrete state-spaces, Gilks et al. (1998a) discuss how convergence
of a chain can be assured through the following theorem.

Theorem 1.2.1 If a Markov chain, $X$, is irreducible, aperiodic and has $\pi$
as invariant distribution, then:

\[ P(X_t = x \mid X_0 = x_0) \to \pi(x) \quad \text{as } t \to \infty, \text{ for all } x \text{ and any } x_0. \]

Following the definitions from Roberts (1998), $X$ is irreducible if for all
$x, x_0$ there exists $t > 0$ such that
$P_{x, x_0}(t) = P(X_t = x \mid X_0 = x_0) > 0$. An irreducible chain is said
to be aperiodic if for all $x_0$,

\[ \text{greatest common divisor}\ \{t > 0 : P_{x_0, x_0}(t) > 0\} = 1. \]

Finally, $\pi$ is said to be the invariant distribution under the transition
kernel $P(\cdot \mid \cdot)$ if

\[ \sum_{y} \pi(y)\, P(z \mid y) = \pi(z) \]

for all $z$.
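The theorem and the invariance condition can be checked numerically on a small discrete chain. The sketch below uses a hypothetical three-state transition matrix; because all of its entries are positive, the chain is irreducible and aperiodic. The matrix values and function names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 3-state chain; entry P[y, z] is P(z | y), so each row sums to one.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])


def invariant_distribution(P):
    """Solve sum_y pi(y) P(z | y) = pi(z) together with sum_z pi(z) = 1."""
    k = P.shape[0]
    A = np.vstack([P.T - np.eye(k), np.ones(k)])
    b = np.append(np.zeros(k), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


def t_step_probabilities(P, t):
    """Row x0 holds P(X_t = x | X_0 = x0) for every state x."""
    return np.linalg.matrix_power(P, t)


if __name__ == "__main__":
    pi = invariant_distribution(P)
    print("invariant distribution:", np.round(pi, 4))
    print("rows of the 50-step matrix:")
    print(np.round(t_step_probabilities(P, 50), 4))
```

Every row of the 50-step matrix should agree with the invariant distribution, whatever the starting state, which is the statement of the theorem for this chain.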

Some common MCMC methods include the Metropolis-Hastings algorithm

and the Gibbs sampler.

Metropolis-Hastings algorithm

A sufficient condition for the invariance of $\pi$ under $P(\cdot \mid \cdot)$
is the reversibility condition (Roberts, 1998):

\[ \pi(y)\, P(z \mid y) = \pi(z)\, P(y \mid z) \quad \forall\, y, z. \tag{1.2} \]

Based on the general framework of Metropolis et al. (1953) and Hastings (1970),
the Metropolis-Hastings algorithm makes use of this reversibility condition.

At each time $t$, the next state $X_{t+1}$ in a Metropolis-Hastings algorithm
is chosen by first sampling a candidate point $Y$ from a proposal distribution
$q(\cdot \mid X_t)$. $Y$ is then accepted with probability $\alpha(X_t, Y)$
where

\[ \alpha(X, Y) = \min\left(1,\ \frac{\pi(Y)\, q(X \mid Y)}{\pi(X)\, q(Y \mid X)}\right). \]

If $Y$ is accepted, $X_{t+1} = Y$, otherwise $X_{t+1} = X_t$ and the chain does
not move. Thus the transition kernel is given by

\[ P(X_{t+1} \mid X_t) = q(X_{t+1} \mid X_t)\, \alpha(X_t, X_{t+1}) + I(X_{t+1} = X_t) \left[ 1 - \int q(Y \mid X_t)\, \alpha(X_t, Y)\, dY \right] \tag{1.3} \]

where $I(\cdot)$ takes the value 1 when the chain does not move and 0
otherwise. It can be shown that (1.3) satisfies the reversibility condition
(1.2) (see, for example, Robert and Casella (2004)).

Advantages of the Metropolis-Hastings algorithm are that the target
distribution $\pi(\cdot)$ needs only to be known up to a multiplicative factor
and that, for fixed $\pi(\cdot)$ and $q(\cdot \mid \cdot)$, the acceptance
probability can be shown to minimize the variance of the estimate in the Monte
Carlo integration.

Although theoretical convergence is guaranteed for $q(\cdot \mid X_t)$ of any
form, the convergence rate of the algorithm is highly dependent on the choice
of $q(\cdot \mid X_t)$ (Gilks et al., 1998a). Roberts (1998) briefly discussed
theoretical convergence rates but noted that in practice they are usually too
difficult to obtain. Instead, it is common to run several chains in parallel
from different starting values and compare their estimates. Graphically, it is
also useful to examine plots of the chain to assess convergence.
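The accept/reject step can be written down directly. The following is a minimal sketch, not the implementation used in this thesis: the function names, the random-walk proposal, the step size and the unnormalised standard normal target are all assumptions chosen for illustration. Densities are handled on the log scale for numerical stability.

```python
import numpy as np


def metropolis_hastings(log_target, propose, log_q, x0, n_iter, rng):
    """Generic Metropolis-Hastings sampler for a scalar state.

    log_target(x): log of pi(x), known only up to an additive constant.
    propose(x, rng): draws a candidate Y from q(. | x).
    log_q(a, b): log q(a | b), the proposal density of a given b.
    """
    chain = np.empty(n_iter + 1)
    chain[0] = x0
    accepted = 0
    for t in range(n_iter):
        x = chain[t]
        y = propose(x, rng)
        # log of pi(Y) q(X | Y) / (pi(X) q(Y | X))
        log_ratio = (log_target(y) + log_q(x, y)) - (log_target(x) + log_q(y, x))
        if np.log(rng.random()) < min(0.0, log_ratio):
            chain[t + 1] = y          # candidate accepted
            accepted += 1
        else:
            chain[t + 1] = x          # chain does not move
    return chain, accepted / n_iter


def run_example(n_iter=20000, seed=2, step=1.0):
    """Random-walk example with an unnormalised standard normal target."""
    rng = np.random.default_rng(seed)
    log_target = lambda x: -0.5 * x ** 2                  # normalising constant omitted
    propose = lambda x, rng: x + step * rng.normal()      # symmetric random walk
    log_q = lambda a, b: -0.5 * ((a - b) / step) ** 2     # constant omitted; it cancels
    return metropolis_hastings(log_target, propose, log_q, 5.0, n_iter, rng)


if __name__ == "__main__":
    chain, rate = run_example()
    kept = chain[2000:]               # discard an initial burn-in period
    print(f"acceptance rate {rate:.2f}, mean {kept.mean():.3f}, variance {kept.var():.3f}")
```

The target enters only through differences of log densities, so its normalising constant is never needed. Changing `step` alters the acceptance rate and how quickly the chain moves away from its starting value, which is the dependence on the proposal described above.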


Gibbs sampler

Instead of updating the whole of $X$, it is often more convenient to divide $X$
into components $\{X_{\cdot 1}, X_{\cdot 2}, \ldots, X_{\cdot h}\}$ and update
these one by one (Gilks et al., 1998a). The Gibbs sampler takes this approach.
Thus each iteration comprises $h$ updating steps. Let
$X_{\cdot -i} = \{X_{\cdot 1}, \ldots, X_{\cdot i-1}, X_{\cdot i+1}, \ldots, X_{\cdot h}\}$
and let $X_{t \cdot i}$ be the state of $X_{\cdot i}$ at the end of iteration
$t$. For step $i$ of iteration $t + 1$, candidate $Y_{\cdot i}$ is generated
from the proposal distribution
$q_i(Y_{\cdot i} \mid X_{t \cdot i}, X_{t \cdot -i})$ where

\[ X_{t \cdot -i} = \{X_{t+1 \cdot 1}, \ldots, X_{t+1 \cdot i-1}, X_{t \cdot i+1}, \ldots, X_{t \cdot h}\}. \]

The acceptance probability is given by
$\alpha(X_{t \cdot -i}, X_{t \cdot i}, Y_{\cdot i})$ where

\[ \alpha(X_{\cdot -i}, X_{\cdot i}, Y_{\cdot i}) = \min\left(1,\ \frac{\pi(Y_{\cdot i} \mid X_{\cdot -i})\, q_i(X_{\cdot i} \mid Y_{\cdot i}, X_{\cdot -i})}{\pi(X_{\cdot i} \mid X_{\cdot -i})\, q_i(Y_{\cdot i} \mid X_{\cdot i}, X_{\cdot -i})}\right). \]

If the proposal distribution

\[ q_i(Y_{\cdot i} \mid X_{\cdot i}, X_{\cdot -i}) = \pi(Y_{\cdot i} \mid X_{\cdot -i}) \]

is used, $\alpha(X_{\cdot -i}, X_{\cdot i}, Y_{\cdot i}) = 1$. This is the Gibbs
sampler. Thus Gibbs sampling consists purely in sampling from full conditional
distributions $\pi(Y_{\cdot i} \mid X_{\cdot -i})$ and candidates are always
accepted. See Robert and Casella (2004) and Gilks et al. (1998b) for further
details and discussions.
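As a minimal sketch of these updating steps, consider a target with h = 2 components whose full conditionals are available in closed form. The standard bivariate normal with correlation rho is an illustrative assumption here, as are the function name and its arguments; it is not one of the models fitted in this thesis.

```python
import numpy as np


def gibbs_bivariate_normal(rho, n_iter, seed=3, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    The full conditionals are X1 | X2 ~ N(rho * X2, 1 - rho^2) and
    X2 | X1 ~ N(rho * X1, 1 - rho^2); every draw is accepted.
    """
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)
    draws = np.empty((n_iter, 2))
    x1, x2 = start
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, sd)   # step 1: update component 1 given component 2
        x2 = rng.normal(rho * x1, sd)   # step 2: update component 2 given the new component 1
        draws[t] = (x1, x2)
    return draws


if __name__ == "__main__":
    draws = gibbs_bivariate_normal(rho=0.6, n_iter=20000)
    print("sample means:", np.round(draws.mean(axis=0), 3))
    print("sample correlation:", round(float(np.corrcoef(draws.T)[0, 1]), 3))
```

Note that the second step conditions on the value of the first component drawn in the same iteration, matching the definition of $X_{t \cdot -i}$ above.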

1.2.3 BUGS

BUGS is a computer package that carries out Bayesian inference on statistical
problems using Gibbs sampling. In BUGS, the crucial idea is that in order to
fully specify the model, the users only need to provide the parent-child
distributions (Spiegelhalter et al., 1998). For most models, this can be
represented by a Directed Acyclic Graph (DAG) such as Figure 1.1. In a directed
graphical model, all quantities are represented as nodes with arrows running
into nodes from their direct influences (parents). See Lauritzen (1996) for a
more detailed description of the graphical model and other deviations of the
DAG models.

Figure 1.1: Graphical model for the 'Pumps' example in Spiegelhalter et al.
(1998)

Essentially, the model represents the assumption that, given its parent nodes
$\text{parents}[v]$, each node $v$ is independent of all other nodes in the
graph except descendants of $v$. Thus the full joint distribution of all the
quantities $V$ has a simple factorisation in terms of the conditional
distribution $p(v \mid \text{parents}[v])$ such that

\[ p(V) = \prod_{v \in V} p(v \mid \text{parents}[v]). \]

BUGS then constructs the full conditional distribution of each node $v$,
$p(v \mid V \setminus v)$, given the remaining nodes $V \setminus v$, for the
Gibbs sampling algorithm as follows:

\begin{align*}
p(v \mid V \setminus v) &\propto p(v, V \setminus v) \\
&\propto \text{terms in } p(V) \text{ containing } v \\
&= p(v \mid \text{parents}[v]) \prod_{v \in \text{parents}[w]} p(w \mid \text{parents}[w]).
\end{align*}

Thus, the full conditional distribution for $v$ contains a prior component
$p(v \mid \text{parents}[v])$ and likelihood components arising from each child
of $v$ (Spiegelhalter et al., 1998).
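As a hedged illustration of this construction, suppose the Pumps example takes its usual conjugate form (an assumption here, since the model is not written out in this section): failure counts $x_i \sim \text{Poisson}(\theta_i t_i)$ with known exposure times $t_i$, and failure rates $\theta_i \sim \text{Gamma}(\alpha, \beta)$. The node $\theta_i$ then has parents $\alpha$ and $\beta$ and a single child $x_i$, so its full conditional is the product of one prior term and one likelihood term:

\begin{align*}
p(\theta_i \mid V \setminus \theta_i) &\propto p(\theta_i \mid \alpha, \beta)\, p(x_i \mid \theta_i) \\
&\propto \theta_i^{\alpha - 1} e^{-\beta \theta_i} \times \theta_i^{x_i} e^{-\theta_i t_i} \\
&= \theta_i^{\alpha + x_i - 1} e^{-(\beta + t_i)\theta_i},
\end{align*}

which is the kernel of a $\text{Gamma}(\alpha + x_i,\ \beta + t_i)$ distribution. Under these assumptions the node can be updated by a direct draw from a standard distribution, without any accept/reject step.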

1.3 Layout of the Thesis

In this chapter, we have introduced the basic methodology and reviewed the
literature in the analysis of longitudinal data. We present the results of our
research in the forthcoming chapters.

In chapter 2, we introduce the two surveys used in our analyses, the LSIA and
the GCLS, and we briefly review other international and national longitudinal
surveys that have been conducted. We discuss the general designs of these
surveys and, for the LSIA, we present an initial univariate analysis of the
data set.

Chapter 3 presents an application of the marginal model to categorical
longitudinal data using readily available software such as STATA (Statcorp.,
2001) and SAS (SAS Institute Inc., 1999). We consider two approaches in the
modelling of the employment status from the LSIA data set. We illustrate the
difficulties and drawbacks of using these methods when missing values are
present.

In chapter 4, we examine Bayesian hierarchical models for longitudinal
categorical data. The advantages of Bayesian modelling are that the parameters
may be estimated via Markov chain Monte Carlo (MCMC) methods and the ongoing
development of MCMC software packages such as WinBUGS (Spiegelhalter et al.,
1998) has meant that Bayesian hierarchical models are increasingly being used
to model longitudinal data. Also, in WinBUGS, missing data from both the
response and the explanatory variables can be routinely handled. We applied
various Bayesian models in the analysis of employment status using firstly, the
complete case LSIA data and then the data with incomplete response and
covariate entries. For the complete case data set, we employed the Deviance
Information Criterion (DIC) (Spiegelhalter et al., 2002) to compare our models.

In chapter 5, we apply an alternative model selection technique in the form of
Bayes factors calculated via power posteriors as presented by Friel and Pettitt
(2006). In Bayesian data analysis, there are numerous model comparison criteria
that one can use including DIC (Spiegelhalter et al., 2002), BIC (Kass and
Raftery, 1995) and Bayes factors. The choice of which to use often depends on
the focus of the analysis. For example, when prediction is important, DIC is
appropriate while the Bayes factor is used if the aim is to obtain a most
probable model. When missing covariates are present, Bayes factors are
straightforward to interpret and calculate. We compute marginal likelihoods
(and ultimately Bayes factors) for the models in chapter 4 for both the fully
observed data set and that with missing values. Chapters 4 and 6 also contain a
literature review.

In chapter 6, we present a simulation study to investigate the DIC for
regression models with missing covariates. Although the imputation of missing
values has had considerable attention in the literature, the problem of
building or choosing between regression models when covariates are missing
appears to be overlooked. For model selection, the use of the DIC as it is
defined can be unreliable when there are missing covariates that have to be
imputed. We investigate three versions of the DIC, a naive DIC (NDIC), a
complete DIC (CDIC) and a marginalised DIC (MDIC). Celeux et al. (2006) have
considered similar extensions of the DIC specifically for mixture modelling
where the missing data have been interpreted as the indicator for each
observation for the mixture component. That is, for their analysis, missingness
is part of the modelling process.

We present concluding remarks and discussions in Chapter 7.


Chapter 2

Longitudinal Surveys - A brief review with reference to LSIA and GCLS

Longitudinal designs are uniquely suited to the study of individual change over
time, including the effects of development, aging and other factors that affect
change, in contrast to cross-sectional studies in which a single outcome is
measured for each individual.

In this chapter, we briefly review international and national longitudinal
surveys that have been conducted. We discuss the overall designs of these
surveys and the schemes used to address problems such as attrition and
geographical movements of the population. Brief descriptions of the GCLS and
LSIA are given in sections 2.2 and 2.3 respectively with an initial univariate
analysis of the LSIA in section 2.3.4. Table 2.1 contains a list of some of the
longitudinal surveys that have been conducted around the world and their
acronyms.


Table 2.1: List of survey acronyms.

Acronym   Survey
GCLS      Goodna Community Longitudinal Survey
GSOEP     German Socio-economic Panel
HILDA     Household Income and Labour Dynamics in Australia
LAFANS    Los Angeles Family and Neighbourhood Survey
LSIA      Longitudinal Survey of Immigrants to Australia
NLSY      National Longitudinal Survey of Youth
PSID      Panel Study of Income Dynamics

2.1 Longitudinal Survey Designs

As opposed to standard cross-sectional surveys where interest is in population
characteristics at a single point in time, longitudinal (or panel) surveys are
concerned with capturing the pathways of the target population through time.

There are many aspects associated with implementing a longitudinal survey.
These include: survey instruments; interview mode; sample size; frequency of
the survey; and life span of the survey. Longitudinal studies vary markedly in
terms of design and collection method and there is no single approach that is
universally accepted as the best. Primarily, the design of a large scale panel
study is dependent on the key research objectives (Wooden, 2001).

There are two common designs for collecting longitudinal data, the classic
design and the repeated medium life or rotating design.

2.1.1 The Classic Design

The classic design collects information on the same sample of units over the
entire life of the survey. Examples of this type of survey are the LSIA and the
National Longitudinal Survey of Youth (NLSY). Research objectives of indefinite
life surveys are in a sense more purely longitudinal. For example, the aim of
the NLSY is to gather information such as labour force activity, marital
status, fertility and participation in government assistance programs such as
unemployment insurance, in an event history format, in which dates are
collected for the beginning and ending of important life events.

For household based surveys similar to the GCLS and HILDA, the most popular
design is the classic life design with replenishment. In this design, the
sample is automatically extended over time by following rules that add to the
sample any new children of members of the selected households (including both
biological and adopted children) as well as new household members resulting
from changes in composition of the original households.

For the Panel Study of Income Dynamics (PSID), Fitzgerald et al. (1998), cited
in Wooden (2001), have shown that 21 years on, and despite a loss of 50 per
cent of the original sample, the sample still retained its cross-sectional
representativeness by adopting the following replenishment rules.

1. A child is born to, or is adopted by, an 'original' or 'continuing sample
   member'. This child automatically counts as an original sample member and
   information about that child will be collected from parents until age 15
   (after which they too will become eligible for interview).

2. An original sample member moves into a different household with one or more
   new people. These new people will now become eligible for interview, but
   are only treated as 'temporary sample members'.

3. One or more new people move in with an original sample member. Again, these
   new people will now become eligible for interview, and are counted as
   temporary sample members.

4. All temporary sample members remain in the sample for as long as they
   remain in the same household as the original sample member. Temporary
   sample members, however, are converted to continuing sample members if they
   become the parent of a new birth with the original sample member.

Similarly, the replenishment rules adopted by the LAFANS are:

1. A randomly selected adult and randomly selected child are the primary
   respondents. Once they join the sample they will be followed throughout the
   study, whether they live together or apart.

2. In each wave, interview sampled respondents who have remained in the
   neighbourhood as well as those who have left.

3. Also select a sample of "new entrants" into the neighbourhood, that is,
   people who have moved into the neighbourhood between the preceding wave and
   the current wave.

4. The new entrants become part of the sample and will be followed in
   subsequent waves.

2.1.2 Rotating Design

The repeated medium life or the rotating design collects information on units
over a fixed life (say 5 to 10 years). Portions of the sample are then
gradually dropped from the survey and replaced with new but comparable samples
drawn from the current population. How long a unit remains in the sample before
rotation should be determined by operational constraints.

Rotation sampling offers a compromise for surveys with the dual purposes of
measuring level and change. Measures of change are best achieved by keeping the
same sample while for estimates of means or totals, it is best to draw a new
sample at each wave. Binder and Hidiroglou (1988) discuss approaches to
rotation design and methods of estimation for level and change. Surveys which
employed a rotating design include the Survey of Labour and Income Dynamics
(SLID).

2.1.3 Proxy Interviews

A design aspect specific to a household panel survey is whether to use proxy
interviews. In surveys including PSID, one household member answers on behalf
of other household members. In other studies such as the German Socio-economic
Panel (GSOEP) survey, interviews are expected from all members of the
household. The disadvantages of using proxy interviews are that they are
subject to more measurement error and that they render these studies less
receptive to more subjective questions about satisfaction and aspirations. On
the other hand, interviewing all members may be associated with higher
attrition and non-response rates. It has been suggested that what matters most
for response and attrition is not so much the length of a questionnaire, but
the total time spent in the household (Wooden, 2001). Nonetheless, collection
of data from all household members permits more complicated analyses of family
effects and intra-household dynamics.

2.1.4 Addressing dropout

Another major difficulty introduced in the conduct and analyses of longitudinal
surveys is due to attrition or dropout. Potentially, the dropout process is
correlated with variables under study, thus increasing bias. Attrition should
be addressed in fieldwork strategies and sample design (Wooden, 2001). For
example,

• The initial sample size can be inflated to achieve a desired final sample
  size;

• Unequal probability sampling at subsequent waves can be adopted to replenish
  sub-populations exhibiting high attrition.

International experience tends to suggest that attrition is highest in the
first two years of the survey and then stabilizes. In the PSID, the response
rates were 76 percent and 88.5 percent in the first and second waves
respectively. Since the second wave, annual response rates have ranged between
96.9 and 98.5 percent (Hill, 1991).
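The first strategy, inflating the initial sample, amounts to simple arithmetic. The sketch below is illustrative only: it assumes that each wave's response rate applies to those who responded at the previous wave, and the target of 1000 respondents is a hypothetical figure, not one taken from any of the surveys discussed.

```python
import math


def cumulative_retention(wave_response_rates):
    """Proportion of the initial sample still responding after the listed waves."""
    retained = 1.0
    for rate in wave_response_rates:
        retained *= rate
    return retained


def required_initial_sample(target_final, wave_response_rates):
    """Smallest initial sample expected to leave target_final respondents."""
    return math.ceil(target_final / cumulative_retention(wave_response_rates))


if __name__ == "__main__":
    psid_rates = [0.76, 0.885]   # first and second wave rates quoted above
    print(f"retained after two waves: {cumulative_retention(psid_rates):.4f}")
    print("initial sample for 1000 wave-2 respondents:",
          required_initial_sample(1000, psid_rates))
```

Under these assumptions about two thirds of the initial sample (0.76 × 0.885 = 0.6726) would still be responding at the second wave, so roughly 1.5 times the desired final sample would need to be approached at the outset.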

2.1.5 Summary

In this section, we have discussed how the design of a longitudinal survey
depends on its objectives. For more traditional longitudinal surveys, the
classic design is best whereas for household based surveys, replenishment
schemes are necessary to adjust for changes in the household structures. Design
should also address potential bias due to dropout.

In the next section, we introduce two Australian longitudinal surveys which are
the focus of our analyses. The LSIA uses a classic design while the GCLS is
household based, thus introducing more complex design issues. For the LSIA
dataset, we also present a univariate analysis of the variable of interest,
Employment Status, and examine the missingness patterns of this variable over
time.

2.2 The Goodna Community Longitudinal Survey

The Queensland Government has recently invested in the Goodna Service
Integration Project (GSIP) in response to an identified need to improve the
services offered or funded by government in Goodna (OESR, 2002). The GSIP is
not about additional sources of funding for service delivery but about ensuring
that services offered in Goodna are integrated, improve community lifestyle and
strengthen the Goodna community. Specific aims include reducing crime,
increasing school retention rates, stabilizing households, improving community
health, reducing unemployment, providing opportunities for community pride and
improving community relations.

To assess the GSIP, a survey was conducted to collect information on the Goodna
community in the areas of social well-being, environmental well-being and
economic well-being. A pilot survey commenced in 2002 and the questionnaire
(see Appendix A) consists of items (mainly Likert-scale and nominal categorical
items) to gauge:

• community participation in group activities and in individual activities;

• community perceptions on crime, local job opportunities and other issues;

• service usage;

• movement along child pathways;

• volunteerism within the community; and

• changes in individuals' labour market statuses as well as how these changes
  occur.

2.2.1 The Suburb of Goodna

The suburb of Goodna is located between the major cities of Brisbane and
Ipswich. It has a diverse population of newly settled immigrants and Aboriginal
and Torres Strait Islanders. Based on the 1996 Census of Population and
Housing, a third of Goodna's 6,963 residents were born overseas and 6.1% were
Aboriginal and Torres Strait Islanders. It has one of the lowest socio-economic
indexes (863.9) in Queensland and an unemployment rate (percentage of those in
the labour market that are unemployed) of 18.5%. With nearly half (42.09%) of
the residents having left school at an age between fourteen and fifteen, the
most common occupation among workers in Goodna was intermediate clerical, sales
and service worker (17.2%).

Almost all (99.3%) of the residents in Goodna lived in a private dwelling and
more than half (51.1%) had changed address in the last five years. Goodna had
an estimated 2,272 households and an average household size of 3.1 persons. Of
the households within the Goodna suburb, 28.1% were fully owned, 43.0% rented
and 24.6% were being purchased or paid off (Domrow, 2002).

2.2.2 Survey Design of GCLS

In the first wave of the pilot study for the GCLS, all residents in private
dwellings in Goodna aged 18 years or over were in the survey scope. 244
households were chosen at random from the frame which consisted of addresses of
all properties located in the suburb of Goodna. For each dwelling with one or
more usual residents aged 18 years or over, one of those residents aged 18
years or over was randomly selected to complete the questionnaire face to face.
A total sample of 243 households and 13 caravans within the local caravan park
was selected for the survey. The sample was designed to achieve at least 150
completed interviews. This figure accounts for 6.6% of the households in
Goodna.

The subsequent waves of the survey involved interviewing the following three
groups:

• Respondents to the 2002 survey who had agreed to a follow-up survey in 2003
  and were still living in Goodna. For these respondents, a combination of
  telephone and face to face interviews was conducted. Three interviewers were
  used for the telephone phase and one interviewer for the face to face
  follow-up of non-contacts from the telephone phase;

• Respondents to the 2002 survey who had agreed to a follow-up survey in 2003
  and had moved from Goodna. Information for these respondents was collected
  through telephone and mail out interviews;

• A new (replenishment) sample designed to maintain the total sample at or
  above the 2002 level. Fourteen interviewers carried out face to face
  interviews for this sub-sample.

263 households were chosen to supplement the 152 respondents that had agreed to
a further interview in 2003 (see OESR (2002) and OESR (2003)).

The main objectives of the pilot study were to determine people's perceptions
of community wellbeing and the quality of services available to the Goodna
community and to determine the success rate for re-interviewing respondents 12
months later. Other secondary information such as respondent reaction to the
survey, time per interview, response rate and factors which might impact on the
future repeats of this survey were also collected.

2.3 The Longitudinal Survey of Immigrants to Australia

The Longitudinal Survey of Immigrants to Australia (LSIA) is the most
comprehensive survey of immigrants ever to be undertaken in Australia. The
survey seeks to provide government and other agencies with reliable data to
monitor and improve immigration and settlement policies, programs and services.

2.3.1 Survey Design of LSIA

The sampling unit for the LSIA was the person upon whom the approval to
immigrate was based - the Principal Applicant (PA). The sample of 5192 PAs was
drawn to represent the entire population of offshore visaed PAs aged 15 years
and older migrating to Australia between September 1993 and August 1995. New
Zealand citizens, those under 15 years of age and people who were granted a
visa while in Australia were excluded from the scope of the survey.

The sample was randomly selected and stratified by Visa eligibility category
and region/country of birth. There are five visa eligibility categories used in
the stratification. These are Humanitarian, Preferential Family, Concessional
Family, Business Skills and Employer Nomination, and Independent. Preferential
Family immigration is based on close family relationships. The Concessional
Family program lies between family-based and skill-based migration streams,
assessing potential migrants on both skills and more distant family
relationships. Skill-based migration includes independent migrants without
family relationships who are points tested (Independents), migrants with
pre-arranged offers of employment (ENS or Employer Nomination Scheme) and
migrants intending to establish businesses in Australia who meet certain
capital requirements (Business Skills).

There were about 50 region/country of birth categories used in stratification.
A mixture of categories was required because some individual countries provide
relatively few migrants and therefore for the purposes of stratification they
have to be aggregated into regions. For example, Peru, Chile and Argentina have
their own individual country of birth category. Other South American countries
are aggregated into the category 'Other South America'.

The selected PAs, together with any accompanying spouse or partner who migrated
with them on the same visa application, were interviewed three times. Wave 1
interviews were designed to take place approximately 6 months after the PA
entered Australia; Wave 2 interviews were designed to occur 12 months after the
Wave 1 interviews. The third and final Wave was designed to occur a further 24
months after the Wave 2 interviews. To assist PAs in providing accurate
responses, the time between arrival and the first interview was minimised.

2.3.2 LSIA Data

A relationship of particular interest for immigration policies is the influence
of Visa Category on employment status of the immigrants. There are several
reasons for this emphasis. Immigration policy in Australia makes a clear
distinction between skill-based migration and family-based migration. In recent
years, the Australian government has moved to increase the number of places for
skilled migrants while at the same time cutting the overall size of the
immigration program. This shift in policy is consistent with that of other
nations such as the United States where the number of U.S. visas reserved for
skill-based immigrants has increased substantially since the introduction of
the U.S. Immigration Act of 1990. Similarly, Canada increased its intake of
skilled independent migrants almost five-fold between 1984 and 1995. These
policy changes stem largely from the view that immigrants selected on the basis
of their labour market skills find the transition into the labour market easier
and make a greater contribution than immigrants selected on the basis of their
family relationships.

In Australia, several studies have been undertaken to analyse labour market
status at various stages of the settlement process. Williams et al. (1997)
analyse data from the first wave of the LSIA and conclude that there is a
significant association between Visa Category and labour market status six
months after arrival. Cobb-Clark (2000) examines the first two waves of the
LSIA and suggests that migrants selected for their skills have better labour
market outcomes. The author also notes that labour market outcomes are better
for native English speakers and for those who visited Australia prior to
migration. We add to this growing body of work by examining the relationship
between selection criteria and employment status of immigrants entering
Australia using data from all three waves of the Longitudinal Survey of
Immigrants to Australia.

The data for employment status was collected through a question on the survey which utilised show cards. The PA was asked to identify which category best describes their current main activity in Australia. The show card has the following categories:

Employed

1 - A wage or salary earner

2 - Conducting own business but not employing others

3 - Conducting own business and employing others

4 - Other employed

Unemployed

5 - Unemployed looking for full time work

6 - Unemployed looking for part time work

Non-Participant

7 - Student

8 - Home duties

9 - Retired

10 - Aged pensioner

11 - Other pensioner

Two codes were allowed for other responses: 88 - Other and 98 - Refused/Not stated. Principal applicants who fell into these categories were removed from the analysis. The multinomial variable created by the three categories Employed, Unemployed and Non-Participant is referred to as "employment status."
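The grouping of show card codes into the three-level response can be sketched as follows. This is a minimal Python illustration rather than the code used for the thesis; the function name `recode_employment_status`, the label strings and the example responses are hypothetical.

```python
from typing import Optional

# Show card codes grouped into the three response categories described above.
EMPLOYED = {1, 2, 3, 4}
UNEMPLOYED = {5, 6}
NON_PARTICIPANT = {7, 8, 9, 10, 11}
EXCLUDED = {88, 98}  # "Other" and "Refused/Not stated": removed from the analysis


def recode_employment_status(code: int) -> Optional[str]:
    """Map a show card code to a response category (None means excluded)."""
    if code in EMPLOYED:
        return "Employed"
    if code in UNEMPLOYED:
        return "Unemployed"
    if code in NON_PARTICIPANT:
        return "Non-Participant"
    if code in EXCLUDED:
        return None
    raise ValueError(f"unexpected show card code: {code}")


# Hypothetical responses, for illustration only.
responses = [1, 5, 8, 98, 3]
recoded = [recode_employment_status(c) for c in responses]
kept = [r for r in recoded if r is not None]
print(kept)
```

Returning `None` for codes 88 and 98 keeps the exclusion step explicit, so the number of removed records can be counted before they are dropped.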

The explanatory variables chosen to model the response variable are gender, age (+ age$^2$), marital status, wave, self-reported English speaking ability, Visa Category, qualification, state of residence, region of birth, pre-migration employment status, and pre-migration visit to Australia. Of particular interest in our analyses are the explanatory variables of self-reported English speaking ability and Visa Category.

Self-reported English speaking ability is derived from a show card question which asks the principal applicant to nominate how well they speak English. This is coded: 1 - English Only; 2 - Very well; 3 - Well; 4 - Not well; 5 - Not at all. Visa Category was also recorded at the time of interview. The five visa eligibility categories were as follows:

• Preferential Family/Family Stream

• Concessional Preferential Family/Family Stream

• Business Skills and Employer Nomination

• Independent

• Humanitarian

The variable for qualification is derived from a show card question which asks the principal applicant to nominate their highest qualification. This is coded as 1 - Higher degree; 2 - Postgraduate Diploma; 3 - Bachelor degree or equivalent; 4 - Technical/professional qualification/diploma/certificate; 5 - Trade; 6 - 12 or more years of schooling; 7 - 10-11 years of schooling; 8 - 7-9 years of schooling; 9 - 6 or fewer years of schooling; 88 - Other. These categories were reduced to 1 - Tertiary qualification; 2 - Technical/trade qualification; 3 - 10 or more years of schooling; 4 - <10 years of schooling.

The place of interview was recorded in terms of the state or territory in which the interview took place. The states and territories were categorised as 1 - New South Wales; 2 - Victoria; 3 - Queensland; 4 - South Australia; 5 - Western Australia; 6 - Tasmania, Northern Territory, Australian Capital Territory. Region of birth was categorised into five categories: 1 - Oceania/Other Africa; 2 - Middle East/North Africa; 3 - Asia; 4 - Americas; 5 - Europe and the former USSR. Two variables giving pre-migration information were included in the analysis. The variable prior employment status records the employment status (0 - employed, 1 - unemployed, 2 - non-participant) of the principal applicant prior to migration, and the variable Visit (1 - yes, 2 - no) indicates whether the principal applicant had visited Australia prior to migration.

2.3.3 Missing Data

Wave 1 of the Longitudinal Survey of Immigrants comprised 5192 principal applicants. These were drawn to represent the entire population of offshore visaed PAs aged 15 years and older migrating to Australia between September 1993 and August 1995. Not all persons interviewed at Wave 1 were interviewed at Wave 2. Similarly, not all persons interviewed at Wave 2 were interviewed at Wave 3. At Wave 2, 4468 (86%) of PAs were interviewed while at Wave 3, 3752 (72%) of PAs were interviewed. Table 2.2 shows the distribution of PAs by interview status.

Table 2.2: Distribution of Principal Applicant Interview Status by Wave, beginning with 5192 PAs at Wave 1

STATUS                      WAVE 2   WAVE 3
Interviewed                   4468     3752
Unable to track                251      563
Refused                        109      225
Overseas temporarily           204      289
Overseas permanently            78      234
Australia - Out of Scope        27       41
Deceased                         4       19
Other                           51       69
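As a quick consistency check on the interview-status counts, the short Python sketch below confirms that each wave accounts for all 5192 Wave 1 PAs and recomputes the interview rates. It uses the Wave 3 interviewed count of 3752 quoted in the text; the dictionary names are illustrative only.

```python
WAVE1_TOTAL = 5192

# Counts by interview status. The Wave 3 "Interviewed" count is the 3752
# quoted in the text.
wave2 = {
    "Interviewed": 4468,
    "Unable to track": 251,
    "Refused": 109,
    "Overseas temporarily": 204,
    "Overseas permanently": 78,
    "Australia - Out of Scope": 27,
    "Deceased": 4,
    "Other": 51,
}
wave3 = {
    "Interviewed": 3752,
    "Unable to track": 563,
    "Refused": 225,
    "Overseas temporarily": 289,
    "Overseas permanently": 234,
    "Australia - Out of Scope": 41,
    "Deceased": 19,
    "Other": 69,
}

for label, counts in (("Wave 2", wave2), ("Wave 3", wave3)):
    total = sum(counts.values())
    rate = counts["Interviewed"] / WAVE1_TOTAL
    print(f"{label}: total = {total}, interviewed = {rate:.0%}")
```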

It is important to note that the description of interview status by Wave does not necessarily describe a "drop-out" process. Drop-out is characterised by the fact that once an individual has left the study, no more measurements are obtained on that individual. For the LSIA, an attempt was made to interview the PA in Wave 3 even when the PA was not interviewed at Wave 2. If a PA was interviewed at Waves 1 and 3 but not Wave 2, this missing data pattern is referred to as "intermittent" rather than a drop-out pattern (see Table 2.3).

Table 2.3: Missing data patterns for the Longitudinal Survey of Immigrants to Australia; O - observed, M - missing

Missing data pattern   Wave 1   Wave 2   Wave 3
0                      O        O        O
1                      O        O        M
2                      O        M        M
3                      O        M        O
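The distinction between drop-out and intermittent missingness can be made concrete with a small Python sketch. The function name `classify_pattern` and the boolean encoding (True for an observed wave) are assumptions made for illustration; the four patterns are those listed in Table 2.3.

```python
from typing import Sequence


def classify_pattern(observed: Sequence[bool]) -> str:
    """Classify a subject's wave-by-wave response pattern.

    observed[t] is True if the subject was interviewed at wave t + 1.
    """
    if all(observed):
        return "complete"
    first_missing = list(observed).index(False)
    # Drop-out: nothing is observed from the first missing wave onwards.
    if not any(observed[first_missing:]):
        return "drop-out"
    # Otherwise the subject returned after a missed wave.
    return "intermittent"


# Patterns 0-3 from Table 2.3 (O = True, M = False).
patterns = {
    0: (True, True, True),
    1: (True, True, False),
    2: (True, False, False),
    3: (True, False, True),
}
for index, observed in patterns.items():
    print(index, classify_pattern(observed))
```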

In addition to these missing data patterns, there may be missing data due to a principal applicant's refusal to answer a specific question or questions of the survey. The reasons for non-response to particular questions are many and varied and include not knowing the answer to the question, not understanding the wording of the question and unwillingness to provide information which they feel is of a sensitive or private nature. This question-specific missing data coupled with the missing data due to non-interview can cause considerable problems when the data comes to be analysed and interpreted.

Table 2.4 shows the distribution of PAs by employment status and wave of survey interviews, for the data set containing the 3234 complete cases and for the full incomplete data set, which includes missing response and covariate observations. The incomplete data set includes records for the 4950 PAs with age ranging from 19 to 64. For the complete case data, Table 2.4 shows that the percentage of employed PAs increases substantially with wave while the percentages of unemployed and non-participant PAs each decrease more moderately with wave. Because the same individuals are being interviewed at each wave, this implies that more people are entering the work force from unemployed and non-participant states as time progresses since arrival in Australia.

For the larger incomplete case data set, this pattern at Wave 3 is not obvious. During the first wave, 39% of PA immigrants are employed, 34% are unemployed, 23% are non-participant in the work force while a reasonably small 4% of responses are missing. By Wave 3, the percentage of missing responses has risen to 30%, which is double the percentage of missing responses in Wave 2. While the percentage of PAs employed has remained at an approximate level of 45%, and the percentages of PAs unemployed and non-participant have appeared to fall across Waves 2 and 3 of interviews, it is unknown whether these patterns in the data beyond Wave 1 are masked by the high percentage of missing responses. This is a typical scenario in longitudinal survey data. The incomplete case data contains records for 53% more PAs than the complete case data, and so these incomplete cases should be included in the analyses so as not to exclude valuable information. Relationships determined by the complete cases can be used to impute missing response and covariate data for the incomplete cases.
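The proportions in Table 2.4 and the "53% more PAs" figure can be re-derived from the reported counts. The Python sketch below does this; the variable and function names are illustrative, and values rounded here may differ from the printed table in the last digit.

```python
# Counts from Table 2.4, keyed by wave.
# Complete cases: (employed, unemployed, non-participant).
complete = {
    1: (1325, 1143, 766),
    2: (1770, 1009, 455),
    3: (2072, 837, 325),
}
# Incomplete cases: (employed, unemployed, non-participant, missing).
incomplete = {
    1: (1918, 1676, 1151, 205),
    2: (2267, 1324, 616, 743),
    3: (2227, 906, 363, 1454),
}


def proportions(counts):
    """Return each count as a proportion of the row total, to 2 decimals."""
    total = sum(counts)
    return [round(c / total, 2) for c in counts]


for wave in (1, 2, 3):
    print(wave, proportions(complete[wave]), proportions(incomplete[wave]))

n_complete = sum(complete[1])
n_incomplete = sum(incomplete[1])
extra = n_incomplete / n_complete - 1  # relative increase in number of PAs
print(f"{n_incomplete} vs {n_complete}: {extra:.0%} more PAs")
```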

2.3.4 Initial Univariate Analysis

A summary of immigrant employment status, categorised by interview wave and selected variables associated with the immigrant, is presented in Table 2.5. Table 2.5 shows strong associations between the response variable employment status and the explanatory variables.

Table 2.4: Counts (and proportions) of PAs by employment status and wave of interview for the two data sets with 'complete' and 'incomplete' cases, respectively

Employment Status for Complete Cases
Wave   Employed      Unemployed    Non-participant   Total
1      1325 (0.41)   1143 (0.35)   766 (0.24)        3234
2      1770 (0.55)   1009 (0.31)   455 (0.14)        3234
3      2072 (0.64)    837 (0.26)   325 (0.10)        3234

Employment Status for Incomplete Cases
Wave   Employed      Unemployed    Non-participant   Missing       Total
1      1918 (0.39)   1676 (0.34)   1151 (0.23)        205 (0.04)   4950
2      2267 (0.46)   1324 (0.27)    616 (0.12)        743 (0.15)   4950
3      2227 (0.45)    906 (0.18)    363 (0.07)       1454 (0.30)   4950

For each level of the variables recorded for an immigrant, Table 2.5 shows the percentages of immigrants who are employed, unemployed and non-participant in the work force at the specified Wave of interviews. As discussed in section 3.4, the influence of Visa Category on employment status is of particular interest. Table 2.5 shows that approximately 90% of immigrants with a Business Skills/ENS visa are employed across all 3 Waves. Of the immigrants with an Independent visa, a lower 59% are employed at Wave 1; however, this rises to 87% employed at Wave 3. For each of the remaining Preferential Family, Concessional Family and Humanitarian Visa Categories, the percentage of immigrants employed is lower in Wave 1 but has increased significantly by Wave 3. For all Visa Categories the percentage of immigrants who are unemployed is considerably lower by Wave 3 interviews. Approximately 54% of immigrants with a Preferential Family or Humanitarian visa are non-participant in the work force at Wave 1 and this falls to 45% and 40% respectively, by Wave 3. These univariate summary results indicate that immigrants with a Business Skills/ENS visa are most likely to be employed shortly after arriving in Australia and that a very high percentage of immigrants remain employed over three years later. However, it also appears that the potential for employment is reasonably high for immigrants with Independent and Concessional Family visas, after several years following arrival in Australia.

English-speaking ability is time-variant and may be improved as the immigrant settles into a new country. It is of interest to investigate to what extent English-speaking ability influences employment status. Table 2.5 shows that for immigrants who speak English only, 70% are employed shortly after arrival and that this increases to 80% by Wave 3 of interviews. Of those who speak English very well, 51% are employed shortly after arrival and a high 74% are employed by Wave 3. For immigrants who do not speak English well or at all, a very low percentage are employed after arrival (19% and 8% respectively). This percentage approximately doubles by Wave 3; however, it should be noted that a very high percentage of immigrants who do not speak English at all (71%) are classified as non-participant at Wave 1, and this percentage increases to 81% at Wave 3. Approximately half of immigrants who do not speak English well are classified as non-participant. These univariate summary results show that immigrants who speak English well are more likely to participate in the work force, and to be employed sooner, than those who do not.

Table 2.5 also shows that female immigrants are much more likely to be non-participant in the work force than males. Approximately half of the female immigrants are non-participant while only 14% of males are non-participant at Wave 3. Although only 25% of female immigrants are employed at Wave 1, this increases to 42% by Wave 3. A higher 75% of males are employed at Wave 3.

Table 2.5 also shows that immigrants with a tertiary education are just as likely to be employed at all Waves as those who have a technical or trade qualification. The percentage of immigrants employed shortly after arrival is greater for those who visited Australia at a previous time (55%) compared to those who didn't (26%). For all variables listed in Table 2.5, the percentage of immigrants employed at Wave 3 is considerably higher than the percentage employed at Wave 1. It therefore appears that Wave (time since arrival in Australia) has a strong influence on employment status.

Table 2.5: Summary percentage counts for Employment Status in Wave 1 (ES_1), Wave 2 (ES_2) and Wave 3 (ES_3), by explanatory variable. Employment Status categories are: E - employed, N-P - non-participant, U - unemployed. Percentage counts have been computed after excluding missing values in Employment Status due to non-response as well as failure to interview (Waves 2 and 3).

                                            % ES_1 (n=4983)  % ES_2 (n=4414)  % ES_3 (n=3680)
VARIABLE          LEVEL                         E  N-P    U      E  N-P    U      E  N-P    U
Gender            Female                       25   58   17     35   55   10     42   50    8
                  Male                         49   23   28     64   19   17     75   14   11
English Speaking  Only                         70   15   15     78   16    6     80   17    3
                  Very Well                    51   22   27     64   25   11     74   18    8
                  Well                         38   33   29     48   35   17     59   29   12
                  Not well                     19   57   24     29   50   21     37   44   19
                  Not at all                    8   71   21     15   73   12     13   81    6
Qualification     Tertiary                     48   26   26     62   24   14     72   20    8
                  Technical/Trade              46   34   20     59   29   12     68   23    9
                  ≥ 10 yrs school              25   52   23     38   46   16     46   42   12
                  < 10 yrs school              16   60   24     23   58   19     35   51   14
State             NSW                          40   38   22     54   31   15     63   27   10
                  VIC                          29   38   33     46   37   17     55   33   12
                  QLD                          52   37   11     60   30   10     69   25    6
                  SA                           31   46   23     42   44   14     52   34   14
                  WA                           44   36   20     56   35    9     65   27    8
                  TAS, NT, ACT                 43   41   16     51   40    9     64   29    7
Visa              Preferential Family          28   53   19     39   50   11     46   45    9
                  Concessional Family          48   23   29     63   19   18     77   12   11
                  Business Skills/ENS          89    8    3     92    5    3     92    6    2
                  Independent                  59   15   26     77   12   11     87    9    4
                  Humanitarian                  7   54   39     22   49   29     37   40   23
Region of Birth   Oceania/Other Africa         48   31   21     62   28   10     65   26    9
                  Middle East/Nth Africa       17   47   36     28   43   29     39   39   22
                  Asia                         38   38   24     51   36   13     62   30    8
                  Americas                     40   43   17     52   39    9     63   30    7
                  Europe and former USSR       48   36   16     60   30   10     67   25    8
Marital Status    Unmarried                    41   37   22     53   35   12     57   30   13
                  Married                      38   38   24     51   34   15     62   29    9
Visit             Yes                          55   29   16     67   26    7     72   23    5
                  No                           26   45   29     40   40   20     53   33   14
Pre-Migration ES  Employed                     46   30   24     60   26   14     70   21    9
                  Non-participant              16   67   17     26   61   13     32   56   12
                  Unemployed                   30   36   34     38   36   26     56   26   18

With a cohort study such as the LSIA it is extremely likely that the sample design will be unbalanced with respect to the categorical variables of interest. The sample is chosen to be representative of the population which will often be unbalanced in reality. From Table 2.5, two important variables that appear to be strongly associated with employment status (other than Wave) are Visa Category and English-speaking ability. To assess whether the data for these two variables are unbalanced we produced a cross-tabulation of the five levels of Visa Category with the five levels of English-speaking ability. The results of this cross-tabulation are reported in Table 2.6 as percentages of immigrants by English-speaking ability within each Visa Category. The data is very unbalanced with 78% of immigrants with a Humanitarian visa being unable to speak English well, while the majority of immigrants from the remaining four Visa Categories are able to speak English well or better. This means, for example, that we are unable to make any inferences about the associations of employment status with immigrants on a Humanitarian visa who speak English only. It is important in this case to assess the significance of an interaction term for Visa Category and English-speaking ability in the model as the main effects of these variables may be confounded with each other. Some of the counts associated with cells in Table 2.6 are very small. Thus, to estimate appropriate model coefficients in Section 3.5.2 we have chosen to combine categories of English-speaking ability to form the three new categories of Very Well, Well and Not Very Well.
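The imbalance described above can be read directly from the row percentages of Table 2.6. The following Python sketch, with illustrative variable names, sums the "Well or better" and "Not Well or worse" shares within each Visa Category.

```python
LEVELS = ("English Only", "Very Well", "Well", "Not Well", "Not at All")

# Row percentages from Table 2.6 (Wave 1), in the order given by LEVELS.
table_2_6 = {
    "Preferential Family": (18, 11, 25, 32, 14),
    "Concessional Family": (32, 13, 26, 25, 4),
    "Business Skills/ENS": (48, 15, 17, 15, 5),
    "Independent": (41, 21, 30, 7, 1),
    "Humanitarian": (1, 4, 17, 53, 25),
}

summary = {}
for visa, row in table_2_6.items():
    well_or_better = sum(row[:3])  # English Only, Very Well, Well
    not_well = sum(row[3:])        # Not Well, Not at All
    summary[visa] = (well_or_better, not_well)
    print(f"{visa:<22} well or better: {well_or_better:>3}%   not well: {not_well:>3}%")
```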

Table 2.6: Percentages of immigrants in Survey by Visa Category and English-speaking ability at Wave 1.

                                     English-speaking Ability
Visa Category          English Only   Very Well   Well   Not Well   Not at All
Preferential Family              18          11     25         32           14
Concessional Family              32          13     26         25            4
Business Skills/ENS              48          15     17         15            5
Independent                      41          21     30          7            1
Humanitarian                      1           4     17         53           25

With the exception of marital status at waves 1 and 2, all univariate associations between the response and explanatory variables are statistically significant at the $\alpha = 0.01$ level. In Section 3.5.2 we investigate these associations in the context of a multivariate model which allows the data for all waves to be modelled simultaneously. Throughout this section the "adjusted log-odds" for a term refers to the estimated log-odds which have been "adjusted" for the effects of other explanatory variables and their interactions that have been included in the model.

2.4 Conclusions

Government agencies around the world have recently invested in longitudinal surveys to provide information for their policy making. In this chapter, we have introduced two such surveys that have been conducted in Australia: the GCLS and the LSIA. The GCLS was conducted to collect information on the Goodna community in the areas of social, environmental and economic well-being while the LSIA seeks to provide government with data to monitor and improve immigration and settlement policies, programs and services.

In section 2.1, we introduced the basic issues in longitudinal survey designs. As with any survey design, there is no single 'best' approach; ultimately, the design depends on the key research objectives and should incorporate mechanisms to reduce potential bias due to drop-out. However, there are two common basic design structures, namely the classic design and the rotation design. Surveys such as the LSIA use the classic design to collect information on the same sampling units over the life of the survey, whilst household-based surveys, such as the GCLS, require replenishment schemes to reflect changes in household composition.

For the LSIA dataset, we examined the missingness patterns of the variable of interest, employment status, and observed slight transitional differences with respect to time between the dataset containing 3234 complete cases (PAs between the ages of 19 and 64) and the dataset with 4950 incomplete cases (PAs between the ages of 19 and 64). Additionally, the incomplete case data contains records for 53% more PAs than the complete case data, and so these incomplete cases should be included in the analyses so as not to exclude valuable information.

In the next chapter, we begin our multivariate analysis by applying frequentist marginal modelling to the employment status of immigrants to Australia. The GCLS dataset will be used in Chapter 6.

Chapter 3

Marginal Models of Categorical Longitudinal Data

The work in Chapter 3 can be found in Pettitt, A., M. Haynes, T. Tran, and J. Hay (2002) "A Model for Longitudinal Employment Status of Immigrants to Australia" QUT e-Prints (available at http://eprints.qut.edu.au/). The research was carried out in collaboration with Tony Pettitt, Michele Haynes and John Hay. The original concepts were formed by Tony and John. Additional changes, improvements, implementation and write-up were carried out by Michele and me.

3.1 Introduction

In recent years, considerable effort has gone into the development of statistical methods for the analysis of longitudinal categorical response data. While much of this effort has focused on techniques for binary or Poisson data, relatively little attention has been given to nominal categorical variables.

A significant contribution to the analysis of nominal categorical data in general is by McFadden (1976). By introducing a latent variable (also referred to as a utility function), $Z$ say, the discrete choice model of McFadden (1976) can be formulated as follows. Let $Y_i$ denote the state of individual $i$. $Y_i$ can be in any state $j$ $(= 1, \ldots, J)$. The model is specified in terms of the so-called utility, $Z_{ij}$, of the individual $i$ for choice $j$ and is given by

$$Z_{ij} = X_i^{T} \beta_j + e_{ij}, \tag{3.1}$$

where $X_i^{T} \beta_j$ is the deterministic part of the utility and $e_{ij}$ is the stochastic part capturing the uncertainty. Then $Y_i = j$ if and only if $Z_{ij} = \max_{1 \le k \le J} \{Z_{ik}\}$ $(j = 1, \ldots, J)$.
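The latent utility formulation in (3.1) can be illustrated by simulation. The Python sketch below draws $Y_i$ as the index of the largest utility. The text above does not specify a distribution for $e_{ij}$; the sketch assumes independent standard Gumbel errors, under which the choice probabilities are known to take the multinomial logit form. The covariate vector, coefficients and function names are hypothetical.

```python
import math
import random


def linear_predictor(x, beta_j):
    """Deterministic part of the utility, x' beta_j."""
    return sum(xk * bk for xk, bk in zip(x, beta_j))


def simulate_choice(x, betas, rng):
    """Draw Y for one individual: the state j with the largest utility Z_j."""
    utilities = []
    for beta_j in betas:
        u = max(rng.random(), 1e-300)  # guard against log(0)
        error = -math.log(-math.log(u))  # standard Gumbel draw (assumption)
        utilities.append(linear_predictor(x, beta_j) + error)
    # States are indexed from 0 here, rather than from 1 as in the text.
    return max(range(len(betas)), key=utilities.__getitem__)


def logit_probabilities(x, betas):
    """Multinomial logit probabilities implied by Gumbel errors."""
    scores = [linear_predictor(x, beta_j) for beta_j in betas]
    largest = max(scores)
    weights = [math.exp(s - largest) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
x = [1.0, 0.5]                                    # hypothetical covariates
betas = [[0.0, 0.0], [0.5, -1.0], [-0.3, 0.8]]    # hypothetical beta_j, J = 3
n_draws = 50_000

counts = [0] * len(betas)
for _ in range(n_draws):
    counts[simulate_choice(x, betas, rng)] += 1

frequencies = [c / n_draws for c in counts]
probabilities = logit_probabilities(x, betas)
print("simulated:", [round(f, 3) for f in frequencies])
print("logit:    ", [round(p, 3) for p in probabilities])
```

With a different error distribution (for example, normal errors) the same argmax rule would give a different model, so the agreement checked here is specific to the Gumbel assumption.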
