Bayesian Model Estimation and Comparison for Longitudinal Categorical Data

Thu T. Tran
Submitted in total fulfilment of the requirements of the
degree of
Doctor of Philosophy
Statistics and Operational Research
School of Mathematical Sciences
Queensland University of Technology
Australia
March 2008
Abstract
In this thesis, we address issues of model estimation for longitudinal
categorical data and of model selection for these data with missing
covariates. Longitudinal survey data capture the responses of each subject
repeatedly through time, allowing for the separation of variation in the
measured variable of interest across time for one subject from the
variation in that variable among all subjects. Questions concerning
persistence, patterns of structure, interaction of events and stability of
multivariate relationships can be answered through longitudinal data
analysis.
Longitudinal data require special statistical methods because they must
take into account the correlation between observations recorded on one
subject. A further complication in analysing longitudinal data is
accounting for the non-response or drop-out process. Potentially, the
missing values are correlated with variables under study and hence cannot
be totally excluded.
Firstly, we investigate a Bayesian hierarchical model for the analysis of
categorical longitudinal data from the Longitudinal Survey of Immigrants to
Australia. Data for each subject are observed on three separate occasions,
or waves, of the survey. One of the features of the data set is that
observations for some variables are missing for at least one wave. A model
for the employment status of immigrants is developed by introducing, at the
first stage of a hierarchical model, a multinomial model for the response;
subsequent terms are then introduced to explain wave and subject effects.
To estimate the model, we use the Gibbs sampler, which allows missing data
for both the response and explanatory variables to be imputed at each
iteration of the algorithm, given some appropriate prior distributions.
After accounting for significant covariate effects in the model, results
show that the relative probability of remaining unemployed diminished with
time following arrival in Australia.
Secondly, we examine the Bayesian model selection techniques of the Bayes
factor and Deviance Information Criterion for our regression models with
missing covariates. Computing Bayes factors involves computing the often
complex marginal likelihood $p(y \mid \text{model})$, and various authors
have presented methods to estimate this quantity. Here, we take the
approach of path sampling via power posteriors (Friel and Pettitt, 2006).
The appeal of this method is that for hierarchical regression models with
missing covariates, a common occurrence in longitudinal data analysis, it
is straightforward to calculate and interpret since integration over all
parameters, including the imputed missing covariates and the random
effects, is carried out automatically with minimal added complexities of
modelling or computation. We apply this technique to compare models for the
employment status of immigrants to Australia.
Finally, we also develop a model choice criterion based on the Deviance
Information Criterion (DIC), similar to Celeux et al. (2006), but which is
suitable for use with generalized linear models (GLMs) when covariates are
missing at random. We define three different DICs: the marginal, where the
missing data are averaged out of the likelihood; the complete, where the
joint likelihood for response and covariates is considered; and the naive,
where the likelihood is found assuming the missing values are parameters.
These three versions have different computational complexities. We
investigate through simulation the performance of these three different
DICs for GLMs consisting of normally, binomially and multinomially
distributed data with missing covariates having a normal distribution. We
find that the marginal DIC and the estimate of the effective number of
parameters, $p_D$, have desirable properties, appropriately indicating the
true model for the response under differing amounts of missingness of the
covariates. We find that the complete DIC is inappropriate generally in
this context as it is extremely sensitive to the degree of missingness of
the covariate model. Our new methodology is illustrated by analysing the
results of a community survey.
Keywords: Longitudinal data analysis; Generalized linear models; Bayesian
hierarchical models; Bayesian model choice; Bayes factors; Deviance
Information Criterion; Missing data
Declaration
This thesis comprises only my original work except where indicated in the
Preface. Due acknowledgement has been made in the text to all other
material used.
Preface
The work in Chapter 3 can be found in Pettitt, A., M. Haynes, T. Tran, and
J. Hay (2002) "A Model for Longitudinal Employment Status of Immigrants to
Australia", QUT e-Prints (available at http://eprints.qut.edu.au/). The
research was carried out in collaboration with Tony Pettitt, Michele Haynes
and John Hay. The original concepts were formed by Tony and John.
Additional changes, improvements, implementation and write-up were carried
out by Michele and me.
Chapter 4 appears as Pettitt, A. N., T. T. Tran, M. A. Haynes, and J. L.
Hay (2006) "A Bayesian hierarchical model for categorical longitudinal data
from a social survey of immigrants", Journal of the Royal Statistical
Society, Series A 169, 97-114. I was in charge of all the Bayesian
modelling and model selection analyses. I collaborated with the co-authors
on the original concepts and shared in the writing and editing of the
paper.
Part of Chapter 5 (Section 5.4.1) appears in Friel, N. and A. Pettitt
(2006) "Marginal likelihood estimation via power posteriors", available
from the authors. In Section 5.4.2, I extended the ideas of Friel and
Pettitt (2006) for models with missing covariates. I carried out all the
applications and analyses in this chapter.
Chapter 6 has been submitted as Tran, T. T. and Pettitt, A. N. (2006)
"Deviance Information Criteria for Models with Imputed Missing Covariates",
accepted subject to revision by the Australian and New Zealand Journal of
Statistics. I was the main researcher responsible for the ideas, analyses
and writing of the paper. Professor Pettitt contributed to the original
ideas and to editing the paper.
Acknowledgements
The research carried out in this thesis is part of a project funded by the
Australian Research Council in conjunction with the Queensland Treasury
Department - Office of Economics and Statistical Research.
A big thank you to Tony Pettitt, Michele Haynes, Nancy Spencer, all the
maths CSO's, staff and students for your help and support. I greatly
appreciated your kindness.
Thank you Mum, Dad, my brothers and sisters for always being there for me.
Contents
Abstract
Declaration
Preface
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Introduction to Longitudinal Data Analysis
1.1.1 Parameter Estimation Methods
1.1.2 Missing Data Models
1.2 Bayesian methods, MCMC and BUGS
1.2.1 Bayesian methods
1.2.2 Markov Chain Monte Carlo
1.2.3 BUGS
1.3 Layout of the Thesis
2 Longitudinal Surveys - A brief review
2.1 Longitudinal Survey Designs
2.1.1 The Classic Design
2.1.2 Rotating Design
2.1.3 Proxy Interviews
2.1.4 Addressing dropout
2.1.5 Summary
2.2 The Goodna Community Longitudinal Survey
2.2.1 The Suburb of Goodna
2.2.2 Survey Design of GCLS
2.3 The Longitudinal Survey of Immigrants to Australia
2.3.1 Survey Design of LSIA
2.3.2 LSIA Data
2.3.3 Missing Data
2.3.4 Initial Univariate Analysis
2.4 Conclusions
3 Marginal Models of Categorical Longitudinal Data
3.1 Introduction
3.2 Modelling Employment Status
3.2.1 The Pooled Multinomial Logistic Model
3.2.2 Generalised Estimating Equations (GEE) Model
3.3 Results
3.3.1 Models Results for Complete Data Records
3.3.2 Model Results for Data with Incomplete Response and Covariate Entries
3.4 Conclusion
4 Bayesian Hierarchical Models
4.1 Introduction
4.2 Modelling Employment Status
4.2.1 Bayesian hierarchical model
4.2.2 Imputation of Missing Data
4.2.3 Alternative Models and Models Selection
4.3 Results
4.3.1 Model Results for Complete Data Records
4.3.2 Model Results for Data with Incomplete Response and Covariate Entries
4.4 Discussion
5 The Bayes Factor for Models with Missing Covariates
5.1 Introduction
5.2 Bayes Factor
5.3 Marginal Likelihood via Power Posteriors
5.3.1 Bayes Factor for Regression Models with Missing Covariates
5.4 Modelling Employment Status
5.4.1 Model Selection for Complete Data Records
5.4.2 Model Selection for Data with Incomplete Response and Covariates
5.5 Discussion and Conclusions
6 DIC for Models with Missing Covariates
6.1 Introduction
6.2 Deviance Information Criterion
6.3 DIC for missing data models
6.4 Simulation Study
6.4.1 Case 1: The Linear Model
6.4.2 Cases 2 and 3: The Binary and Multinomial Logit Models
6.5 Case Study: Goodna Community Survey
6.5.1 Modelling Community Participation
6.6 Discussions and Conclusions
7 Conclusions
A Goodna Questionnaire
B Results of Models A and B from Chapter 3
C Results of Models 3 to 6 from Chapter 4
D Code for Power Posteriors
D.1 R code for power posteriors of Model 2 with missing values
D.2 WinBugs code (model.file) for Model 2 with missing values
E Code for DIC
E.1 R code for binary response data
E.2 WinBUGS code for model2.txt and model3.txt
E.2.1 model2.txt
E.2.2 Model3.txt
List of Figures
1.1 Graphical model for the 'Pumps' example in Spiegelhalter et al. (1998)
5.1 Expected deviances for the power posterior at temperature $t_i$ for
model k = 2 (random effects model) using priors with variances of 9, 16
and 25
5.2 Expected deviances calculated at waves 2 and 3 only for the power
posterior at temperature $t_i$ for models k = 2, 3 and 5 for data with
incomplete entries
6.1 Plots of values of the marginal ($\circ$) and naive ($\triangle$) DICs
and $p_D$s for varying amounts of missingness of the covariate $x_1$.
(a) DICs for gaussian y, (b) $p_D$s for gaussian y, (c) DICs for binary y,
(d) $p_D$s for binary y, (e) DICs for categorical y, (f) $p_D$s for
categorical y
List of Tables
2.1 List of survey acronyms
2.2 Distribution of Principal Applicant Interview Status by Wave
2.3 Missing data patterns for the Longitudinal Survey of Immigrants to
Australia
2.4 Counts (and proportions) of PAs by employment status and wave of
interview for the two data sets with 'complete' and 'incomplete' cases,
respectively
2.5 Summary percentage counts for Employment Status in Wave 1, Wave 2 and
Wave 3, by explanatory variable
2.6 Percentages of immigrants in Survey by Visa Category and
English-speaking ability at Wave 1
3.1 Estimates and standard errors from the binary generalised estimating
equation models
3.2 Estimated effects and standard errors for the significant interactions
in the binary model for employed/unemployed
3.3 Predicted marginal means and standard errors for the interactions in
the binary model for employed/unemployed
3.4 Estimated effects and standard errors for the significant interactions
in the binary model for non-participant/unemployed
3.5 Predicted marginal means and standard errors for the interactions in
the binary model for non-participant/unemployed
4.1 Estimated posterior means and 95% intervals for the effects of selected
explanatory variables on the log of the probability of an immigrant being
unemployed relative to the probability of being employed, from Models 2, 3
and 5
4.2 Deviance Information Criteria calculated for all three waves
corresponding to Models 1, 2, 3 and 5 from Table 4.1
4.3 Deviance Information Criteria calculated for waves 2 and 3
corresponding to Models 1, 2, 3 and 5 from Table 4.1
4.4 Estimated posterior means and 95% intervals for the effects of selected
explanatory variables on the log of the probability of an immigrant being
non-participant in the work force relative to the probability of being
employed, from Models 2, 3 and 5
4.5 Estimated posterior means and empirical standard errors for parameters
(j = 2, 3) in Model 5 fitted to complete case data and incomplete* case
data
4.6 Increase in precision of estimated Model 5 parameters with inclusion of
imputed missing values in data
5.1 Estimated $\log p(y \mid k)$ via Power Posteriors for model k = 2 using
priors with variances of 9, 16 and 25
5.2 Estimated $\log p(y \mid k)$ by omitting $t_0 = 0$ and starting at
$t_1 = 0.0016$ in 5.6 for model k = 2 using priors with variances of 9, 16
and 25
5.3 Expected deviances (and Monte Carlo standard errors) for the power
posterior at temperature $t_i$ for models k = 1, 2 for complete data
records
5.4 Estimated $\log p(y \mid k)$ via Power Posteriors based on all three
waves for employment status models k = 2, 3 and 5 for complete data records
5.5 Estimated $\log p(y \mid k)$ via Power Posteriors based on waves 2 and
3 only for employment status models k = 2, 3 and 5 for complete data
records
5.6 Expected $\log p(y \mid k)$ via Power Posteriors based on all three
waves for employment status models k = 2, 3 and 5 for data with incomplete
entries
5.7 Expected $\log p(y \mid k)$ via Power Posteriors based on waves 2 and 3
only for employment status models k = 2, 3 and 5 for data with incomplete
entries
6.1 Values of MDIC, NDIC and CDIC for Models 1 and 2 for gaussian data
6.2 Values of MDIC, NDIC and CDIC for Models 1 and 2 of binary data
6.3 Values of MDIC, NDIC and CDIC for Models 1 and 2 of polytomous
categorical data
6.4 DIC values for Community Involvement Binary Logit Models
6.5 Estimated posterior means and empirical standard errors for the effects
of selected explanatory variables on the logit of the probability of
community involvement of a resident of Goodna, from Models 1 and 2
B.1 Estimates and standard errors from the pooled multinomial logistic
model
B.2 Estimates and standard errors from the binary generalised estimating
equation models with an independent covariance matrix
C.1 Estimated posterior means and 95% intervals for the effects of selected
explanatory variables from Models 3, 4, 5 and 6 (unemployed/employed)
C.2 Estimated posterior means and 95% intervals for the effects of selected
explanatory variables from Models 3, 4, 5 and 6 (non-participant/employed)
Chapter 1
Introduction
Longitudinal investigations (defined broadly as studies in which the
response of each individual is observed on two or more occasions) represent
one of the principal research strategies employed in medical, health and
social science research.
Government agencies around the world, in their policy making, have recently
recognized the need for community longitudinal data to provide information
on the dynamic nature of events and how they interact in influencing the
changing behaviour and fortunes of households, families and individuals
(Wooden, 2001). Surveys that have been conducted include the Los Angeles
Family and Neighbourhood Survey (LAFANS) (Sastry et al., 2000), the
Household, Income and Labour Dynamics in Australia (HILDA) Survey (Wooden,
2001) and the Longitudinal Survey of Immigrants to Australia (LSIA).
At a more local level, in Queensland, Australia, the Queensland Government
has recently invested in the Goodna Service Integration Project (GSIP) in
response to an identified need to improve the services offered or funded by
government in Goodna. The project is not about additional sources of
funding for service delivery but about ensuring that services offered in
Goodna are integrated, improve community lifestyle and strengthen the
Goodna community. To assess the effectiveness of the GSIP, the Goodna
Community Longitudinal Survey (GCLS) has been initiated to collect
information on the Goodna community on the topics of social well-being,
environmental well-being and economic well-being.
With the growing use of longitudinal data, methods of analysis need to be
more widely understood. In this thesis, we investigate and develop methods,
Bayesian in the main, to address issues of model estimation and model
selection for longitudinal categorical data. The methods used for the
analysis of longitudinal data differ from the traditional methods of
multivariate analysis such as multiple regression. Longitudinal or panel
data sets, as they are sometimes known, consist of repeated observations,
for each of many subjects, of an outcome and a set of covariates. The
covariates may be fixed at the outset, such as gender, or may change with
time, such as English-speaking ability. In other words, longitudinal data
sets consist of a collection of short time series taken on different units
or individuals. As one would expect, such data sets are characterised by
the fact that repeated observations for a subject tend to be correlated.
Correct statistical analysis of such data therefore requires the modelling
of this correlation (for example, see Diggle et al. (2002)). We introduce
existing methods for analysing longitudinal data in Section 1.1.
In addition to the modelling of the correlation structure, the analysis of
longitudinal data is dependent on the nature of the response variable. When
the response variable is approximately Gaussian, a large class of linear
models is available for analysis. However, when the response variable is
non-Gaussian, other techniques must be considered. Adding to this
complexity is the tendency for some respondents to "drop out" from the
study or to participate at intermittent times, which may cause a bias in
the inference if not appropriately allowed for. In Chapter 3, we discuss in
more detail the complexities of analysing longitudinal nominal data and we
apply two different methods to analyse the employment statuses of
immigrants to Australia. We highlight the drawbacks of these methods when
missing data are present and, in Chapter 4, we propose Bayesian modelling,
which accommodates these types of data well.
In the next section, we present an introduction to the analysis of
longitudinal data and a literature review.
1.1 Introduction to Longitudinal Data Analysis
Before proceeding, let us define some notation as follows. Let $y_{it}$
denote the response variable observed at time $w_{it}$ for observation
$t = 1, \ldots, T_i$ on subject $i = 1, \ldots, N$;
$Y_i = (y_{i1}, \ldots, y_{iT_i})$. Let $x_{it}$ denote the $p$-vector of
explanatory variables; $X_i = (x_{i1}, \ldots, x_{iT_i})$. $E(y_{it})$
denotes the mean of $y_{it}$, $\mathrm{Var}(y_{it})$ denotes the variance
of $y_{it}$, $\mathrm{Cov}(y_{it}, y_{it'})$ denotes the covariance between
$y_{it}$ and $y_{it'}$, and $\mathrm{Corr}(y_{it}, y_{it'})$ denotes the
correlation between $y_{it}$ and $y_{it'}$.
Most longitudinal data analyses are based on a regression model of the form
$$
\begin{aligned}
E(y_{it}) &= \mu_{it}, \\
h(\mu_{it}) &= x_{it}^T \beta \quad \text{for link function } h, \\
\mathrm{Var}(y_{it}) &= v(\mu_{it})\,\phi, \\
\mathrm{Corr}(y_{it}, y_{it'}) &= \rho(\mu_{it}, \mu_{it'}; \alpha),
\end{aligned}
$$
where $v(\cdot)$ and $\rho(\cdot)$ are known functions, $\beta$ is the
$p$-vector of unknown regression parameters, and $\phi$ and $\alpha$ are
additional parameters which may need to be estimated (see, for example,
McCullagh and Nelder (1993)).
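As a concrete instance of the mean, link and variance specification above,
the following minimal sketch evaluates $\mu_{it}$ and $\mathrm{Var}(y_{it})$
for a single binary observation. The logit link, the binomial variance
function $v(\mu) = \mu(1 - \mu)$, and the covariate and coefficient values
are illustrative assumptions and are not taken from the thesis.

```python
import math


def inverse_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def binary_glm_moments(x, beta, phi=1.0):
    """Return (mu, variance) for one observation under a logit-link model.

    x and beta are equal-length sequences; phi is the dispersion parameter,
    which is conventionally fixed at 1 for binary data.
    """
    eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor x^T beta
    mu = inverse_logit(eta)                        # E(y) = h^{-1}(x^T beta)
    variance = mu * (1.0 - mu) * phi               # Var(y) = v(mu) * phi
    return mu, variance


# Hypothetical covariate vector (intercept, one covariate) and coefficients.
mu, variance = binary_glm_moments(x=[1.0, 2.0], beta=[-1.0, 0.5])
print(mu, variance)
```

Other response types would change only the link and the variance function;
the correlation function $\rho$ is not represented in this sketch.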
There are generally three approaches to modelling longitudinal data (Diggle
et al., 2002).
1. Model the marginal mean as in cross-sectional data but account for
correlation between repeated values. For example, the marginal mean linear
model is given by
$$
E(Y_i) = X_i \beta \quad \text{and} \quad \mathrm{Var}(Y_i) = V_i(\alpha)
$$
with parameters $\beta$ and $\alpha$ to be estimated. This method is
advantageous in allowing for separate modelling of the mean and covariance.
2. The subject-specific or random effects model. For example, the linear
random effects model can be specified as
$$
E(y_{it} \mid \beta_i) = x_{it}^T \beta_i
$$
where we assume the person-specific vector of coefficients
$\beta_i = \beta + U_i$, with $\beta$ constant and $U_i$ a zero-mean latent
random variable. The $U_i$ can be interpreted as unobserved factors that
are common to all responses for a given person (thus inducing the
within-individual correlation) but which vary across people (thus the
heterogeneous assumption can be restated for the $U_i$).
3. The transition model, which specifies a model for the conditional
distribution of $y_{it}$ given past responses $y_{i,t-1}, \ldots, y_{i1}$
and $x_{it}$. For example, the first-order (Markov chain) logit model for
binary responses is given by
$$
\mathrm{logit}\,\Pr(y_{it} = 1 \mid y_{i,t-1}, \ldots, y_{i1})
  = x_{it}^T \beta + \alpha y_{i,t-1}
$$
where $\mathrm{logit}(p) = \log\{p/(1-p)\}$. The model combines assumptions
about the dependence of Y on X and the correlation among repeated Y's.
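To illustrate how a shared subject-level effect induces within-individual
correlation, as described for the random effects model in item 2, the
following sketch simulates the simplest special case: a random intercept
with no covariates, $y_{it} = \beta_0 + U_i + e_{it}$. The normal
distributions, the variance values and the sample size are assumptions made
purely for illustration. Under these assumptions the correlation between two
responses on the same subject is
$\sigma_U^2 / (\sigma_U^2 + \sigma_e^2)$.

```python
import math
import random


def simulate_random_intercepts(n_subjects, sigma_u, sigma_e, beta0=0.0, seed=1):
    """Simulate two repeated responses per subject, y_it = beta0 + U_i + e_it."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_subjects):
        u = rng.gauss(0.0, sigma_u)  # subject effect shared by both responses
        y1 = beta0 + u + rng.gauss(0.0, sigma_e)
        y2 = beta0 + u + rng.gauss(0.0, sigma_e)
        pairs.append((y1, y2))
    return pairs


def correlation(pairs):
    """Sample Pearson correlation between the first and second responses."""
    n = len(pairs)
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs) / n
    var1 = sum((a - mean1) ** 2 for a, _ in pairs) / n
    var2 = sum((b - mean2) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(var1 * var2)


sigma_u, sigma_e = 1.0, 1.0  # hypothetical standard deviations
pairs = simulate_random_intercepts(20000, sigma_u, sigma_e)
rho_hat = correlation(pairs)
rho_theory = sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)
print(rho_hat, rho_theory)
```

The empirical correlation should be close to the theoretical value, and
setting `sigma_u` to zero removes the dependence between repeated responses.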
Traditionally, the most popular approach to modelling longitudinal response
data is the marginal model. These models address net changes in the
population and utilise various methodological strategies to account for the
correlation between repeated measurements. In his discussion of Liang et
al. (1992a), Gilks (1992) comments that marginal models can be thought of
as approximations to conditional models containing random effects since
marginal models are simply random effects models with the random effects
integrated out (Zeger et al., 1988).
Random effects models are also known as generalised linear mixed models
(GLMMs) (see, for example, Breslow and Clayton (1993)). Zeger et al. (1988)
note that GLMMs are more appropriate when inferences about individuals are
the scientific focus. Marginal models can only address the dependence of
the population-averaged response on explanatory variables.
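The distinction between subject-specific and population-averaged
interpretations can be made concrete numerically. The sketch below takes a
random-intercept logit model, $\mathrm{logit}\,\Pr(y = 1 \mid U) = \eta + U$
with $U \sim N(0, \sigma_U^2)$, and integrates the random effect out with a
simple trapezoidal rule to obtain the marginal probability. The logit link,
the normal random effect, the integration grid and the numerical values are
all assumptions chosen for illustration.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def normal_pdf(u, sigma):
    """Density of N(0, sigma^2) at u."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def marginal_probability(eta, sigma_u, half_width=8.0, steps=4000):
    """Average inverse_logit(eta + u) over u ~ N(0, sigma_u^2).

    Uses the trapezoidal rule on [-half_width * sigma_u, half_width * sigma_u].
    """
    if sigma_u == 0.0:
        return inverse_logit(eta)
    lo = -half_width * sigma_u
    h = 2.0 * half_width * sigma_u / steps
    total = 0.0
    for k in range(steps + 1):
        u = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * inverse_logit(eta + u) * normal_pdf(u, sigma_u)
    return total * h


eta, sigma_u = 1.0, 2.0  # hypothetical linear predictor and random-effect s.d.
conditional = inverse_logit(eta)             # probability for a subject with U = 0
marginal = marginal_probability(eta, sigma_u)  # population-averaged probability
print(conditional, marginal)
```

With a non-zero random-effect variance the marginal probability lies closer
to 0.5 than the conditional one, which is one way to see why coefficients
from the two model classes are not directly comparable.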
Other variations of these three main models include those of Pourahmadi and
Daniels (2002), who developed the dynamic conditionally linear mixed
models. These models are an extension of the random effects model to
include lagged observations as covariates. Heagerty (1999) developed the
marginally specified logistic-normal models for binary data. Heagerty and
Zeger (2000) extended this model to multi-level data, using an alternative
parameterization of the random effects model in which the marginal mean,
rather than the conditional mean given the random effects, is regressed on
covariates. This model allows a choice as to whether the marginal mean
structure or the conditional mean structure is the focus of modelling when
using a latent variable formulation.
1.1.1 Parameter Estimation Methods
Although it is relatively straightforward to obtain parameter estimates and
inferences for linear models through methods such as Maximum Likelihood
(ML) and Restricted Maximum Likelihood (REML) (see, for example, Diggle et
al. (2002)), the resulting likelihood functions from non-linear models are
often much more complex.
Marginal Models
For the case of marginal models, the likelihood functions from non-linear
models often contain moments of higher order than the mean and variance and
involve many nuisance parameters. As a result, a quasi-likelihood method
such as the generalised estimating equations (GEE) (Liang and Zeger, 1986)
presents an attractive alternative. The GEE approach assumes that the
marginal distribution of the dependent variable follows a generalised
linear model. Specification of a "working" correlation matrix for the
relationship between repeated observations on each individual gives
consistent estimates of these correlation parameters using the method of
moments. A score equation approach is used to calculate parameter estimates
and standard errors. For linear models, Wu et al. (2001) present a
comparative study of the GEE with other methods, such as ML and REML, for
variance and covariance estimation.
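The method-of-moments step can be sketched for one particular choice of
working correlation. The function below assumes an exchangeable structure
(a single correlation $\alpha$ shared by every pair of observations on a
subject) and uses a commonly quoted form of the moment estimators based on
Pearson residuals, with a degrees-of-freedom correction for the number of
regression parameters. The exchangeable structure, the correction and the
residual values are assumptions for illustration; this is not code from the
thesis and it omits the iteration between updating $\beta$ and updating
$\alpha$.

```python
def exchangeable_moment_estimates(residuals, n_params):
    """Moment estimates of the scale phi and exchangeable correlation alpha.

    residuals: list with one inner list of Pearson residuals per subject.
    n_params:  number of regression parameters p (degrees-of-freedom correction).
    """
    n_obs = sum(len(r) for r in residuals)
    # Scale: average squared Pearson residual.
    phi = sum(e * e for r in residuals for e in r) / (n_obs - n_params)

    # Correlation: average cross-product over all within-subject pairs.
    pair_sum = 0.0
    n_pairs = 0
    for r in residuals:
        for j in range(len(r)):
            for k in range(j + 1, len(r)):
                pair_sum += r[j] * r[k]
                n_pairs += 1
    alpha = pair_sum / ((n_pairs - n_params) * phi)
    return phi, alpha


# Hypothetical Pearson residuals for four subjects observed at three waves.
example_residuals = [
    [0.5, 0.7, 0.2],
    [-0.9, -0.4, -0.6],
    [1.1, 0.8, 1.0],
    [-0.3, -0.8, -0.5],
]
phi_hat, alpha_hat = exchangeable_moment_estimates(example_residuals, n_params=1)
print(phi_hat, alpha_hat)
```

In a full GEE fit these estimates would be recomputed after each update of
the regression parameters until convergence.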
Since its introduction, several authors including Prentice (1988), Lipsitz
et al. (1994) and Liang et al. (1992b) have studied the use of the GEE.
Prentice (1988) applies the methodology to binary data and introduces a
second set of estimating equations for the covariance matrix parameters.
Thall and Vail (1990) extend the research and suggest other covariance
models in the context of count data. Dobbie and Welsh (2001) adapt the GEE
for the analysis of correlated zero-inflated count data. Liang et al.
(1992b) suggest an extension of GEE which they call GEE2. This further
generalisation allows simultaneous modelling of both the marginal
distributions and the dependency between repeated observations. Miller et
al. (1993) and Lipsitz et al. (1994) develop the methodology for repeated
nominal or ordinal categorical responses. Lipsitz et al. (1994) have
extended the GEE approach to encompass the analysis of longitudinal
polytomous data. Catalano and Ryan (1992) and Fitzmaurice and Laird (1995)
develop GEE methodology for bivariate discrete and continuous data.
Although statistical packages such as SAS and Splus accommodate GEE
estimation for repeated binary and ordinal data well, the GEE cannot be
applied to repeated nominal data using this software.
Random Effects Models
For random effects models (or GLMMs), numerical methods are required to
solve the likelihood functions. Estimation for the GLMM can be undertaken
in several different ways. Schall (1991) provides methods for maximizing
the joint distribution of the observed data and the random effects with
respect to the fixed effects (or parameters) and the random effects.
McGilchrist (1994) suggests a generalization of the best linear unbiased
predictor (BLUP) approach (Robinson, 1991) similar to that of Schall
(1991). McCulloch (1997) labels these methods "joint maximization"
algorithms. McCulloch (1997) also notes that similar algorithms have been
suggested by Breslow and Clayton (1993) and Wolfinger (1994). Estimation is
based on the iterative fit of a set of generalised estimating equations.
The procedures used in McCullagh and Nelder (1993) are available in some
statistical software packages (e.g. PROC GLIMMIX in SAS, GLMM in Genstat).
Diggle et al. (2002) suggest an approach based on the use of numerical
quadrature. Hedeker and Gibbons (1994) and Liu and Pierce (1994) also
favour this approach. This method of estimation is available in SAS (PROC
NLMIXED) and is based on the direct maximisation of the approximate
integrated likelihood.
In contrast with these approaches,McCulloch (1997) examines a Monte
Carlo EM (MCEM) algorithm approach and proposes a new procedure which
he calls Monte Carlo Newton-Raphson (MCNR).
Recently, for estimation of random effects models, the approach is often
Bayesian. The major advantage of Bayesian methods is that they can be
implemented using Markov Chain Monte Carlo (MCMC) methods and hence
complicated models can be handled.
Albert and Jais (1998) use a Bayesian hierarchical model for longitudinal
binary data to identify risk factors associated with allergic accidents
during plasma exchange. Lunn et al. (2001) examine a Bayesian hierarchical
model for the analysis of ordinal longitudinal data. They examine the
efficacy of a treatment for allergic rhinitis (recorded on a four-point
scale), controlling for various confounding variables. Cruz-Mesia and
Marshall (2006) develop Bayesian non-linear random effects models with
continuous time autoregressive errors for longitudinal medical studies with
unequally spaced observations. Other applications of Bayesian hierarchical
models for the analysis of longitudinal data include Hogan and Daniels
(2002), Rasmussen (2004) and Congdon (2002).
1.1.2 Missing Data Models
A further complication in analyzing longitudinal data is accounting for the
missing values due to non-response and attrition.Potentially,the missing
values are correlated with the variable under study and hence cannot be totally
excluded.
Missing data are generally categorised into three missingness patterns
(Diggle et al., 2002):
• Missing completely at random (MCAR): Cases with complete data are
  indistinguishable from cases with incomplete data.
• Missing at random (MAR): Cases with complete data differ from cases with
  incomplete data, but the pattern of missingness is traceable or
  predictable from other independent variables rather than being due to the
  specific variable for which the data are missing.
• Nonignorable (NI): The pattern of data missingness is non-random and it
  is not predictable from other independent variables. In contrast to the
  MAR situation where missingness is explained by other independent
  variables in a study, nonignorable missing data arise due to the data
  missingness pattern being explained by the variable for which the data
  are missing.
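The three mechanisms listed above can be contrasted in a small simulation.
In the sketch below a covariate $x$ is always observed and a response
$y = x + e$ is subject to missingness; the probability that $y$ is missing
is constant under MCAR, depends on $x$ under MAR, and depends on $y$ itself
under NI. The data-generating model, the logistic form of the missingness
probabilities and the sample size are illustrative assumptions only.

```python
import math
import random


def inverse_logit(z):
    """Map a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def complete_case_means(n=20000, seed=1):
    """Mean of the observed y values under three missingness mechanisms.

    The true mean of y is 0, so any systematic departure from 0 reflects
    bias from analysing only the cases where y is observed.
    """
    rng = random.Random(seed)
    observed = {"MCAR": [], "MAR": [], "NI": []}
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)      # covariate, always observed
        y = x + rng.gauss(0.0, 1.0)  # response, possibly missing
        p_missing = {
            "MCAR": 0.5,                    # unrelated to the data
            "MAR": inverse_logit(2.0 * x),  # depends on the observed covariate
            "NI": inverse_logit(2.0 * y),   # depends on the missing value itself
        }
        for mechanism, p in p_missing.items():
            if rng.random() >= p:  # y is observed under this mechanism
                observed[mechanism].append(y)
    return {m: sum(v) / len(v) for m, v in observed.items()}


means = complete_case_means()
for mechanism, value in means.items():
    print(mechanism, round(value, 3))
```

Under these assumptions the complete-case mean stays near zero only for
MCAR. Under MAR the bias can in principle be corrected using $x$, whereas
under NI it cannot be corrected without a model for the missingness process.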
The GEE approach relies on the assumption that the data are MCAR. As noted
by Liang and Zeger (1986) and many other authors subsequent to this, the
generalised estimating equations will generally yield biased estimators if
the missing data are not MCAR. Several methods have been proposed to adjust
the estimating equation to obtain unbiased estimators when missing data are
MAR rather than MCAR. These methods include weighted generalised estimating
equations (Robins et al., 1995), nonparametric estimation of the
conditional estimating scores (Reilly and Pepe, 1995), modelling the
conditional distributions (Murphy and Li, 1995) and multiple imputation
(Rubin, 1987). All these methods except the second make additional
parametric model assumptions that potentially reduce the robustness of the
estimates; the second method has difficulties in dealing with continuous
data. While these methods reduce bias under a missing at random data
process, if the data are missing completely at random, they unnecessarily
increase variance. Applications of the GEE for longitudinal studies with
data missing at random include Reboussin et al. (2002).
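The idea behind weighting can be sketched in its simplest form, for the
estimation of a mean rather than a full set of estimating equations. Each
observed response is weighted by the inverse of its probability of being
observed, so that observed cases also stand in for similar cases that are
missing. The sketch assumes that the observation probabilities are known
functions of an always-observed covariate $x$; in practice they would have
to be estimated from a model for the missingness. The data-generating model
and all numerical values are assumptions for illustration, and the code is
not the estimator of Robins et al. (1995) itself.

```python
import math
import random


def inverse_logit(z):
    """Map a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def naive_and_weighted_means(n=20000, seed=2):
    """Compare the complete-case mean with an inverse-probability-weighted mean.

    y = x + e has true mean 0 and is observed with probability
    inverse_logit(1 - x), which depends only on the observed x (MAR).
    """
    rng = random.Random(seed)
    n_observed = 0
    y_sum = 0.0           # unweighted sum of observed y
    weight_sum = 0.0      # sum of inverse-probability weights
    weighted_y_sum = 0.0  # weighted sum of observed y
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)
        p_observed = inverse_logit(1.0 - x)  # assumed known here
        if rng.random() < p_observed:
            n_observed += 1
            y_sum += y
            weight_sum += 1.0 / p_observed
            weighted_y_sum += y / p_observed
    return y_sum / n_observed, weighted_y_sum / weight_sum


naive_mean, weighted_mean = naive_and_weighted_means()
print(naive_mean, weighted_mean)
```

Under these assumptions the unweighted mean is biased downwards because
subjects with large $x$ are less likely to be observed, while the weighted
mean is approximately unbiased at the cost of some additional variance.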
Likelihood based tests to assess whether or not the missing data are MCAR
have been proposed for contingency tables (Fuchs, 1982) and for
multivariate normal data (Little, 1988). A nonparametric test has also been
proposed by Diggle (1989), mainly for preliminary screening. Ridout (1991)
proposed a parametric test based on modelling the missing data process. In
recent times there have been two tests suggested for the assessment of MCAR
in conjunction with the generalised estimating equations: Park and Lee
(1997) and Chen and Little (1999). Park and Lee (1997) extend an idea by
Park and Davis (1993) who use weighted least squares for repeated
categorical data. Chen and Little (1999) generalise the basic idea for
constructing test statistics in Little (1988) to the generalised estimating
equation setting, avoiding distributional assumptions. Qu and Song (2002)
present an alternative approach that avoids the exhaustive estimation of
parameters for each missing data pattern necessary for the method of Chen
and Little (1999).
It is well known that maximum likelihood estimates ignoring the missing
data mechanism are valid when data are MAR (Rubin, 1976). Hence the
distinction between MCAR and MAR is less pertinent to maximum likelihood
estimation than to estimating equation methods.
When the missing data mechanism is non-ignorable, it has to be incorporated
into the main model. The modelling of missing processes has had
considerable attention in the biostatistics literature and, for example, Diggle
and Kenward (1994) give a good summary of contemporary issues. Modelling
drop-out can give a priori information about the biases and potential for flawed
designs (Fitzmaurice et al., 1996). Post-hoc modelling can effectively account
for informative drop-out and so remove biases; see, for example, Fitzgerald
et al. (1996) and Dionne et al. (1998). Yun and Lee (2005) examine a hierarchical
likelihood method for the analysis of longitudinal data allowing for non-ignorable
and non-monotone (or "intermittent") missingness. Other authors who have
presented ideas on informative missing data include Wu and Follmann (1999),
Crouchley and Ganjali (2002) and Gelman et al. (2003).
1.2 Bayesian methods, MCMC and BUGS
Recently, Bayesian methodology has been a popular choice for the analysis of
longitudinal data. Although it can be computer intensive, an advantage of
the Bayesian approach is that it can be implemented via Markov chain Monte
Carlo (MCMC) methods, and the ongoing development of software packages such
as WinBUGS (Spiegelhalter et al., 1998) has meant that complicated models
such as those to account for informative missing values can be readily handled.
1.2.1 Bayesian methods
In Bayesian parametric statistical modelling of data $y$, we require the
specification of $p(y \mid \theta)$, the probability model for $y$ conditional
on a set of parameters $\theta$, and $p(\theta)$, the joint prior for $\theta$.
Inferences for $\theta$ are made on the joint posterior given by
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta).$$
Bayesian inference can be based on any function of the posterior distribution.
For any function $f(\theta)$, its posterior expectation is given by
$$E[f(\theta) \mid y] \propto \int f(\theta)\, p(y \mid \theta)\, p(\theta)\, d\theta. \tag{1.1}$$
It is often the case that the integration in (1.1) cannot be carried out
analytically. Thus simulation techniques such as Markov Chain Monte Carlo (MCMC)
are needed to perform the analysis.
1.2.2 Markov Chain Monte Carlo
MCMC (see, for example, Gilks et al. (1998b) and Robert and Casella (2004))
is essentially Monte Carlo integration using Markov chains. Monte Carlo
integration evaluates $E[f(X)]$ by drawing samples $\{X_t,\ t = 1, \ldots, n\}$
throughout the support of the target distribution, in our case
$p(\theta \mid y)$, in the correct proportion and then approximating
$$E[f(X)] \approx \frac{1}{n} \sum_{t=1}^{n} f(X_t).$$
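The approximation above can be checked in a case where the posterior is available in closed form. The following is a minimal sketch under assumed, hypothetical data (7 successes in 10 Bernoulli trials with a uniform prior, giving a Beta(8, 4) posterior); it uses independent draws rather than a Markov chain, so it illustrates only the Monte Carlo averaging step.

```python
import random


def monte_carlo_expectation(f, sampler, n):
    """Approximate E[f(X)] by the average of f over n draws from sampler."""
    return sum(f(sampler()) for _ in range(n)) / n


# Hypothetical data: 7 successes in 10 Bernoulli trials with a uniform
# Beta(1, 1) prior, so the posterior for theta is Beta(8, 4).
rng = random.Random(42)
a_post, b_post = 8.0, 4.0


def draw_theta():
    """One independent draw from the Beta(a_post, b_post) posterior."""
    return rng.betavariate(a_post, b_post)


estimate = monte_carlo_expectation(lambda theta: theta, draw_theta, 50000)
exact = a_post / (a_post + b_post)  # analytic mean of a Beta(a, b) distribution
print(f"Monte Carlo estimate: {estimate:.4f}, exact: {exact:.4f}")
```

With 50,000 draws the Monte Carlo standard error of the posterior mean is well below 0.001 here, so the estimate and the analytic value should agree to about two decimal places.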
One way of generating these samples is by cleverly constructing a Markov
chain and running it for a long time. Gilks et al. (1998a) describe a Markov
chain as, essentially, a sequence of random variables
$\{X_t,\ t = 0, \ldots, n\}$ where, at each time $t \ge 0$, the next state
$X_{t+1}$ is sampled from a distribution $P(X_{t+1} \mid X_t)$ which depends
only on the current state of the chain, $X_t$. $P(\cdot \mid \cdot)$ is called
the transition kernel of the chain. For discrete state-spaces, Gilks et al.
(1998a) note that convergence of a chain can be assured through the following
theorem.
Theorem 1.2.1
If a Markov chain, $X$, is irreducible, aperiodic and has $\pi$ as
invariant distribution, then
$$P(X_t = x \mid X_0 = x_0) \to \pi(x) \quad \text{as } t \to \infty, \text{ for all } x \text{ and any } x_0.$$
Following the definitions from Roberts (1998), $X$ is irreducible if for all
$x, x_0$ there exists $t > 0$ such that
$$P_{x, x_0}(t) = P(X_t = x \mid X_0 = x_0) > 0.$$
An irreducible chain is said to be aperiodic if, for all $x_0$,
$$\gcd\{t > 0 : P_{x_0, x_0}(t) > 0\} = 1.$$
Finally, $\pi$ is said to be the invariant distribution under the transition
kernel $P(\cdot \mid \cdot)$ if
$$\sum_{y} \pi(y)\, P(z \mid y) = \pi(z)$$
for all $z$.
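Theorem 1.2.1 and the invariance condition can be checked numerically on a small discrete chain. The sketch below uses a hypothetical 3-state transition matrix (not taken from any of the cited references) whose entries are all positive, so the chain is irreducible and aperiodic; it propagates the distribution of $X_t$ from each possible starting state and compares the results.

```python
def step(dist, P):
    """One transition: new[z] = sum over y of dist[y] * P[y][z]."""
    k = len(P)
    return [sum(dist[y] * P[y][z] for y in range(k)) for z in range(k)]


# Hypothetical 3-state transition matrix; P[y][z] = P(next = z | current = y).
# Every entry is positive, so the chain is irreducible and aperiodic.
P = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]


def distribution_after(start_state, t):
    """Distribution of X_t given X_0 = start_state."""
    dist = [0.0] * len(P)
    dist[start_state] = 1.0
    for _ in range(t):
        dist = step(dist, P)
    return dist


# The distribution after many steps should not depend on the starting state.
limits = [distribution_after(x0, 200) for x0 in range(len(P))]
pi = limits[0]
for x0, dist in enumerate(limits):
    print(f"start {x0}:", [round(p, 6) for p in dist])

# Invariance: applying one more transition leaves pi unchanged.
print("pi after one more step:", [round(p, 6) for p in step(pi, P)])
```

All three starting states lead to the same limiting vector, and one further transition leaves that vector unchanged, which is the invariance condition stated above.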
Some common MCMC methods include the Metropolis-Hastings algorithm
and the Gibbs sampler.
Metropolis-Hastings algorithm
A sufficient condition for the invariance of $\pi$ under $P(\cdot \mid \cdot)$
is the reversibility condition
$$\pi(y)\, P(z \mid y) = \pi(z)\, P(y \mid z) \quad \forall\, y, z \tag{1.2}$$
(Roberts, 1998). Based on the general framework of Metropolis et al. (1953)
and Hastings (1970), the Metropolis-Hastings algorithm makes use of this
reversibility condition.
At each time $t$, the next state $X_{t+1}$ in a Metropolis-Hastings algorithm
is chosen by first sampling a candidate point $Y$ from a proposal distribution
$q(\cdot \mid X_t)$. $Y$ is then accepted with probability $\alpha(X_t, Y)$ where
$$\alpha(X, Y) = \min\left(1, \frac{\pi(Y)\, q(X \mid Y)}{\pi(X)\, q(Y \mid X)}\right).$$
If $Y$ is accepted, $X_{t+1} = Y$; otherwise $X_{t+1} = X_t$ and the chain does
not move.
Thus the transition kernel is given by
$$P(X_{t+1} \mid X_t) = q(X_{t+1} \mid X_t)\, \alpha(X_t, X_{t+1}) + I(X_{t+1} = X_t) \left[1 - \int q(Y \mid X_t)\, \alpha(X_t, Y)\, dY\right] \tag{1.3}$$
where $I(\cdot)$ takes the value 1 when the chain does not move and 0 otherwise.
It can be shown that (1.3) satisfies the reversibility condition (1.2) (see, for
example, Robert and Casella (2004)).
Advantages of the Metropolis-Hastings algorithm are that the target
distribution $\pi(\cdot)$ need only be known up to a multiplicative factor and
that, for fixed $\pi(\cdot)$ and $q(\cdot \mid \cdot)$, the acceptance
probability can be shown to minimise the variance of the estimates in the
Monte Carlo integration.
Although theoretical convergence is guaranteed for $q(\cdot \mid X_t)$ of any
form, the convergence rate of the algorithm is highly dependent on the choice
of $q(\cdot \mid X_t)$ (Gilks et al., 1998a). Roberts (1998) briefly discussed
theoretical convergence rates but noted that, in practice, these are usually
too difficult to obtain. Instead, it is common to run several chains in
parallel from different starting values and compare their estimates.
Graphically, it is also useful to examine plots of the chain to determine
convergence.
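The steps above can be sketched with a random-walk proposal. This is a minimal illustration rather than a production sampler: the target (a standard normal known only up to its normalising constant), the proposal standard deviation, the burn-in length and the two starting values are all assumptions chosen for the example. Because the normal proposal is symmetric, the ratio $q(X \mid Y) / q(Y \mid X)$ equals 1 and drops out of the acceptance probability.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_iter, step_sd, rng):
    """Random-walk Metropolis-Hastings with a Normal(x, step_sd^2) proposal.

    The proposal is symmetric, so the acceptance probability reduces to
    min(1, pi(Y) / pi(X)); it is evaluated on the log scale.
    """
    chain = [x0]
    x = x0
    accepted = 0
    for _ in range(n_iter):
        y = rng.gauss(x, step_sd)                      # candidate point Y
        log_alpha = min(0.0, log_target(y) - log_target(x))
        u = 1.0 - rng.random()                         # uniform on (0, 1]
        if math.log(u) <= log_alpha:                   # accept with prob. alpha
            x = y
            accepted += 1
        chain.append(x)                                # chain stays put if rejected
    return chain, accepted / n_iter


def log_target(x):
    """Unnormalised log density of a standard normal (constant omitted)."""
    return -0.5 * x * x


rng = random.Random(2008)
burn_in = 1000
summaries = []
# Two chains from deliberately dispersed (hypothetical) starting values.
for start in (-10.0, 10.0):
    chain, acc_rate = metropolis_hastings(log_target, start, 30000, 2.0, rng)
    kept = chain[burn_in:]
    mean = sum(kept) / len(kept)
    var = sum((v - mean) ** 2 for v in kept) / len(kept)
    summaries.append((start, mean, var, acc_rate))
    print(f"start={start:6.1f} mean={mean:.3f} var={var:.3f} accept={acc_rate:.2f}")
```

Comparing the two chains' means and variances is a crude version of the parallel-chain check described above; in practice trace plots of the chains would be inspected as well.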
Gibbs sampler
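Instead of updating the whole of $X$, it is often more convenient to divide $X$
into components $\{X_{\cdot 1}, X_{\cdot 2}, \ldots, X_{\cdot h}\}$ and update
these one by one (Gilks et al., 1998a). The Gibbs sampler takes this approach.
Thus each iteration comprises $h$ updating steps. Let
$X_{\cdot -i} = \{X_{\cdot 1}, \ldots, X_{\cdot i-1}, X_{\cdot i+1}, \ldots, X_{\cdot h}\}$
and let $X_{t \cdot i}$ be the state of $X_{\cdot i}$ at the end of iteration $t$.
For step $i$ of iteration $t + 1$, candidate $Y_{\cdot i}$ is generated from the
proposal distribution $q_i(Y_{\cdot i} \mid X_{t \cdot i}, X_{t \cdot -i})$ where
$$X_{t \cdot -i} = \{X_{t+1 \cdot 1}, \ldots, X_{t+1 \cdot i-1}, X_{t \cdot i+1}, \ldots, X_{t \cdot h}\}.$$
The acceptance probability is given by
$\alpha(X_{t \cdot -i}, X_{t \cdot i}, Y_{\cdot i})$ where
$$\alpha(X_{\cdot -i}, X_{\cdot i}, Y_{\cdot i}) = \min\left(1, \frac{\pi(Y_{\cdot i} \mid X_{\cdot -i})\, q_i(X_{\cdot i} \mid Y_{\cdot i}, X_{\cdot -i})}{\pi(X_{\cdot i} \mid X_{\cdot -i})\, q_i(Y_{\cdot i} \mid X_{\cdot i}, X_{\cdot -i})}\right).$$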
If the proposal distribution
$$q_i(Y_{\cdot i} \mid X_{\cdot i}, X_{\cdot -i}) = \pi(Y_{\cdot i} \mid X_{\cdot -i})$$
is used, then $\alpha(X_{\cdot -i}, X_{\cdot i}, Y_{\cdot i}) = 1$. This is the Gibbs
sampler. Thus Gibbs sampling consists purely in sampling from full conditional
distributions $\pi(Y_{\cdot i} \mid X_{\cdot -i})$ and candidates are always
accepted. See Robert and Casella (2004) and Gilks et al. (1998b) for further
details and discussions.
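A two-component example makes the updating scheme concrete. The sketch below assumes a standard bivariate normal target with correlation rho = 0.8 (a hypothetical choice), for which each full conditional is Normal(rho x other component, 1 - rho^2). Each iteration comprises h = 2 updating steps, and the second step conditions on the value just drawn in the first.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is Normal(rho * other, 1 - rho^2), so every
    candidate is drawn from the full conditional and always accepted.
    """
    x1, x2 = start
    cond_sd = math.sqrt(1.0 - rho * rho)
    draws = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, cond_sd)  # step 1: update X.1 given current X.2
        x2 = rng.gauss(rho * x1, cond_sd)  # step 2: update X.2 given the new X.1
        draws.append((x1, x2))
    return draws


rng = random.Random(7)
rho = 0.8  # hypothetical target correlation
draws = gibbs_bivariate_normal(rho, 20000, rng, start=(5.0, -5.0))[500:]

n = len(draws)
m1 = sum(d[0] for d in draws) / n
m2 = sum(d[1] for d in draws) / n
v1 = sum((d[0] - m1) ** 2 for d in draws) / n
v2 = sum((d[1] - m2) ** 2 for d in draws) / n
cov = sum((d[0] - m1) * (d[1] - m2) for d in draws) / n
sample_corr = cov / math.sqrt(v1 * v2)
print(f"sample correlation: {sample_corr:.3f} (target {rho})")
```

The first 500 iterations are discarded because the chain is started away from the bulk of the target; the sample correlation of the remaining draws should be close to the assumed value of rho.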
1.2.3 BUGS
BUGS is a computer package that carries out Bayesian inference on statistical
problems using Gibbs sampling. In BUGS, the crucial idea is that in order to
fully specify the model, the user need only provide the parent-child
distributions (Spiegelhalter et al., 1998). For most models, this can be
represented by a Directed Acyclic Graph (DAG) such as Figure 1.1. In a directed
graphical model, all quantities are represented as nodes with arrows running into
nodes from their direct influences (parents). See Lauritzen (1996) for a more
detailed description of the graphical model and other deviations of the DAG
models.
Figure 1.1: Graphical model for the 'Pumps' example in Spiegelhalter et al.
(1998)
Essentially, the model represents the assumption that, given its parent
nodes $\text{parents}[v]$, each node $v$ is independent of all other nodes in
the graph except descendants of $v$. Thus the full joint distribution of all the
quantities $V$ has a simple factorisation in terms of the conditional
distributions $p(v \mid \text{parents}[v])$ such that
$$p(V) = \prod_{v \in V} p(v \mid \text{parents}[v]).$$
BUGS then constructs the full conditional distribution of each node $v$,
$p(v \mid V \setminus v)$, given the remaining nodes $V \setminus v$, for the
Gibbs sampling algorithm as follows:
$$\begin{aligned}
p(v \mid V \setminus v) &\propto p(v, V \setminus v) \\
&\propto \text{terms in } p(V) \text{ containing } v \\
&= p(v \mid \text{parents}[v]) \prod_{v \in \text{parents}[w]} p(w \mid \text{parents}[w]).
\end{aligned}$$
Thus, the full conditional distribution for $v$ contains a prior component
$p(v \mid \text{parents}[v])$ and likelihood components arising from each child
of $v$ (Spiegelhalter et al., 1998).
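The bookkeeping behind this construction can be sketched symbolically. The code below does not reproduce how BUGS is implemented; it is an illustration on a small hypothetical DAG (the node names alpha, beta, theta and x are assumptions) that lists the factors of the joint distribution and then collects those that contain a given node.

```python
# Hypothetical DAG: each node maps to the list of its parents.
parents = {
    "alpha": [],
    "beta": [],
    "theta": ["alpha", "beta"],
    "x": ["theta"],
}


def format_factor(v, parents):
    """Text form of the factor p(v | parents[v])."""
    pa = parents[v]
    return f"p({v} | {', '.join(pa)})" if pa else f"p({v})"


def children(node, parents):
    """Nodes w that have `node` among parents[w]."""
    return [w for w, pa in parents.items() if node in pa]


def joint_factors(parents):
    """Factors p(v | parents[v]) whose product is the joint p(V)."""
    return [format_factor(v, parents) for v in parents]


def full_conditional_factors(v, parents):
    """Factors of p(V) containing v: its own 'prior' term followed by one
    'likelihood' term for each child of v."""
    return [format_factor(v, parents)] + [
        format_factor(w, parents) for w in children(v, parents)
    ]


print("joint:", " * ".join(joint_factors(parents)))
for node in parents:
    print(f"{node}:", " * ".join(full_conditional_factors(node, parents)))
```

For the node theta, for instance, the collected factors are its own term given alpha and beta together with the term for its single child x, matching the prior-times-likelihood structure described above.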
1.3 Layout of the Thesis
In this chapter, we have introduced the basic methodology and reviewed the
literature in the analysis of longitudinal data. We present the results of our
research in the forthcoming chapters.
In chapter 2, we introduce the two surveys used in our analyses, the LSIA
and the GCLS, and we briefly review other international and national
longitudinal surveys that have been conducted. We discuss the general designs of
these surveys and, for the LSIA, we present an initial univariate analysis of the
data set.
Chapter 3 presents an application of the marginal model to categorical
longitudinal data using readily available software such as STATA (Statcorp.,
2001) and SAS (SAS Institute Inc., 1999). We consider two approaches in the
modelling of the employment status from the LSIA data set. We illustrate
the difficulties and drawbacks of using these methods when missing values are
present.
In chapter 4, we examine Bayesian hierarchical models for longitudinal
categorical data. The advantages of Bayesian modelling are that the parameters
may be estimated via Markov chain Monte Carlo (MCMC) methods and that the
ongoing development of MCMC software packages such as WinBUGS (Spiegelhalter
et al., 1998) has meant that Bayesian hierarchical models are increasingly
being used to model longitudinal data. Also, in WinBUGS, missing data from
both the response and the explanatory variables can be routinely handled. We
applied various Bayesian models in the analysis of employment status using,
firstly, the complete case LSIA data and then the data with incomplete
response and covariate entries. For the complete case data set, we employed the
Deviance Information Criterion (DIC) (Spiegelhalter et al., 2002) to compare
our models.
In chapter 5, we apply an alternative model selection technique in the form
of Bayes factors calculated via power posteriors as presented by Friel and
Pettitt (2006). In Bayesian data analysis, there are numerous model comparison
criteria that one can use, including DIC (Spiegelhalter et al., 2002), BIC (Kass
and Raftery, 1995) and Bayes factors. The choice of which to use often
depends on the focus of the analysis. For example, when prediction is important,
DIC is appropriate, while the Bayes factor is used if the aim is to obtain a
most probable model. When missing covariates are present, Bayes factors are
straightforward to interpret and calculate. We compute marginal likelihoods
(and ultimately Bayes factors) for the models in chapter 4 for both the fully
observed data set and that with missing values. Chapters 4 and 6 also contain
a literature review.
In chapter 6, we present a simulation study to investigate the DIC for
regression models with missing covariates. Although the imputation of missing
values has had considerable attention in the literature, the problem of building
or choosing between regression models when covariates are missing appears to
have been overlooked. For model selection, the use of the DIC as it is defined can
be unreliable when there are missing covariates that have to be imputed. We
investigate three versions of the DIC: a naive DIC (NDIC), a complete DIC
(CDIC) and a marginalised DIC (MDIC). Celeux et al. (2006) have considered
similar extensions of the DIC specifically for mixture modelling, where the
missing data have been interpreted as the indicator for each observation for
the mixture component. That is, for their analysis, missingness is part of the
modelling process.
We present concluding remarks and discussions in Chapter 7.
Chapter 2
Longitudinal Surveys - A brief review with reference to LSIA and GCLS
Longitudinal designs are uniquely suited to the study of individual change
over time, including the effects of development, aging and other factors that
affect change, in contrast to cross-sectional studies in which a single outcome
is measured for each individual.
In this chapter, we briefly review international and national longitudinal
surveys that have been conducted. We discuss the overall designs of these
surveys and the schemes used to address problems such as attrition and
geographical movements of the population. Brief descriptions of the GCLS and
LSIA are given in sections 2.2 and 2.3 respectively, with an initial univariate
analysis of the LSIA in section 2.3.4. Table 2.1 contains a list of some of
the longitudinal surveys that have been conducted around the world and their
acronyms.
Table 2.1: List of survey acronyms.
Acronym   Survey
GCLS      Goodna Community Longitudinal Survey
GSOEP     German Socio-economic Panel
HILDA     Household Income and Labour Dynamics in Australia
LAFANS    Los Angeles Family and Neighbourhood Survey
LSIA      Longitudinal Survey of Immigrants to Australia
NLSY      National Longitudinal Study of Youth
PSID      Panel Study of Income and Dynamics
2.1 Longitudinal Survey Designs
As opposed to standard cross-sectional surveys where interest is in population
characteristics at a single point in time, longitudinal (or panel) surveys are
concerned with capturing the pathways of the target population through time.
There are many aspects associated with implementing a longitudinal survey.
These include: survey instruments; interview mode; sample size; frequency of
the survey; and life span of the survey. Longitudinal studies vary markedly in
terms of design and collection method and there is no single approach that is
universally accepted as the best. Primarily, the design of a large scale panel
study is dependent on the key research objectives (Wooden, 2001).
There are two common designs for collecting longitudinal data: the classic
design and the repeated medium life or rotating design.
2.1.1 The Classic Design
The classic design collects information on the same sample of units over the
entire life of the survey. Examples of this type of survey are the LSIA and
the National Longitudinal Study of Youth (NLSY). Research objectives of
indefinite life surveys are in a sense more purely longitudinal. For example, the
aim of the NLSY is to gather information such as labour force activity, marital
status, fertility and participation in government assistance programs such
as unemployment insurance, in an event history format, in which dates are
collected for the beginning and ending of important life events.
For household based surveys similar to the GCLS and HILDA, the most
popular design is the classic life design with replenishment. In this design, the
sample is automatically extended over time by following rules that add to the
sample any new children of members of the selected households (including both
biological and adopted children) as well as new household members resulting
from changes in composition of the original households.
For the Panel Study of Income and Dynamics (PSID), Fitzgerald et al.
(1998), cited in Wooden (2001), have shown that 21 years on, and despite a loss
of 50 per cent of the original sample, the sample still retained its
cross-sectional representativeness by adopting the following replenishment rules.
1. A child is born to, or is adopted by, an 'original' or 'continuing sample
   member'. This child automatically counts as an original sample member
   and information about that child will be collected from parents until age
   15 (after which they too will become eligible for interview).
2. An original sample member moves into a different household with one
   or more new people. These new people will now become eligible for
   interview, but are only treated as 'temporary sample members'.
3. One or more new people move in with an original sample member. Again,
   these new people will now become eligible for interview, and are counted
   as temporary sample members.
4. All temporary sample members remain in the sample for as long as they
   remain in the same household as the original sample member. Temporary
   sample members, however, are converted to continuing sample members if
   they become the parent of a new birth with the original sample member.
Similarly, the replenishment rules adopted by the LAFANS are:
1. A randomly selected adult and randomly selected child are the primary
   respondents. Once they join the sample they will be followed throughout
   the study, whether they live together or apart.
2. In each wave, interview sampled respondents who have remained in the
   neighbourhood as well as those who have left.
3. Also select a sample of "new entrants" into the neighbourhood, that is,
   people who have moved into the neighbourhood between the preceding
   wave and the current wave.
4. The new entrants become part of the sample and will be followed in
   subsequent waves.
2.1.2 Rotating Design
The repeated medium life or the rotating design collects information on units
over a fixed life (say 5 to 10 years). Portions of the sample are then gradually
dropped from the survey and replaced with new but comparable samples drawn
from the current population. How long a unit remains in the sample before
rotation should be determined by operational constraints.
Rotation sampling offers a compromise for surveys with the dual purposes
of measuring level and change. Measures of change are best achieved by keeping
the same sample while, for estimates of means or totals, it is best to draw a
new sample at each wave. Binder and Hidiroglou (1988) discuss approaches
to rotation design and methods of estimation for level and change. Surveys
which employed a rotating design include the Survey of Labour and Income
Dynamics (SLID).
2.1.3 Proxy Interviews
A design aspect specific to a household panel survey is whether to use proxy
interviews. In surveys including PSID, one household member answers on
behalf of other household members. In other studies such as the German
Socio-economic Panel (GSOEP) survey, interviews are expected from all members of
the household. The disadvantages of using proxy interviews are that they are
subject to more measurement error and that they render these studies less
receptive to more subjective questions about satisfaction and aspirations. On
the other hand, interviewing all members may be associated with higher attrition
and non-response rates. It has been suggested that what matters most for response
and attrition is not so much the length of a questionnaire, but the total time
spent in the household (Wooden, 2001). Nonetheless, collection of data from
all household members permits more complicated analyses of family effects and
intra-household dynamics.
2.1.4 Addressing dropout
Another major difficulty introduced in the conduct and analyses of longitudinal
surveys is due to attrition or dropout. Potentially, the dropout process is
correlated with variables under study, thus increasing bias. Attrition should
be addressed in fieldwork strategies and sample design (Wooden, 2001). For
example,
• The initial sample size can be inflated to achieve a desired final sample
  size;
• Unequal probability sampling at subsequent waves can be adopted to
  replenish sub-populations exhibiting high attrition.
International experience tends to suggest that attrition is highest in the first
two years of the survey and then stabilizes. In the PSID, the response rates
were 76 percent and 88.5 percent in the first and second waves respectively.
Since the second wave, annual response rates have ranged between 96.9 and
98.5 percent (Hill, 1991).
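The idea of inflating the initial sample can be made concrete with a short calculation. The sketch below assumes that each quoted response rate is conditional on having responded at the previous wave (an interpretation, not something stated in the source) and uses a hypothetical target of 1,000 respondents at the second wave.

```python
import math


def cumulative_retention(wave_response_rates):
    """Proportion of the initial sample still responding after each wave,
    assuming each rate is conditional on responding at the previous wave."""
    retained = []
    current = 1.0
    for rate in wave_response_rates:
        current *= rate
        retained.append(current)
    return retained


def inflated_initial_sample(target_final_size, wave_response_rates):
    """Initial sample size needed to expect target_final_size respondents
    at the last wave."""
    final_retention = cumulative_retention(wave_response_rates)[-1]
    return math.ceil(target_final_size / final_retention)


psid_rates = [0.76, 0.885]  # first- and second-wave response rates quoted above
retention = cumulative_retention(psid_rates)
needed = inflated_initial_sample(1000, psid_rates)  # 1000 is a hypothetical target

print(f"retained after wave 2: {retention[-1]:.4f}")
print(f"initial sample needed for 1000 wave-2 respondents: {needed}")
```

Under this reading, about 67 percent of the initial sample responds at both waves, so roughly 1,487 units would need to be selected initially to expect 1,000 respondents at the second wave.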
2.1.5 Summary
In this section, we have discussed how the design of a longitudinal survey
depends on its objectives. For more traditional longitudinal surveys, the classic
design is best, whereas for household based surveys, replenishment schemes are
necessary to adjust for changes in the household structures. Design should also
address potential bias due to dropout.
In the next section, we introduce two Australian longitudinal surveys which
are the focus of our analyses. The LSIA uses a classic design while the GCLS
is household based, which introduces more complex design issues. For
the LSIA dataset, we also present a univariate analysis of the variable of
interest, Employment Status, and examine the missingness patterns of this
variable over time.
2.2 The Goodna Community Longitudinal Survey
The Queensland Government has recently invested in the Goodna Service
Integration Project (GSIP) in response to an identified need to improve the
services offered or funded by government in Goodna (OESR, 2002). The GSIP
is not about additional sources of funding for service delivery but about
ensuring that services offered in Goodna are integrated, improve community
lifestyle and strengthen the Goodna community. Specific aims include reducing
crime, increasing school retention rates, stabilizing households, improving
community health, reducing unemployment, providing opportunities for community
pride and improving community relations.
To assess the GSIP, a survey was conducted to collect information on the
Goodna community in the areas of social well-being, environmental well-being
and economic well-being. A pilot survey commenced in 2002 and the
questionnaire (see Appendix A) consists of items (mainly Likert-scale and nominal
categorical items) to gauge:
• community participation in group activities and in individual activities;
• community perceptions on crime, local job opportunities and other issues;
• service usage;
• movement along child pathways;
• volunteerism within the community; and
• changes in individuals' labour market statuses as well as how these changes
  occur.
2.2.1 The Suburb of Goodna
The suburb of Goodna is located between the major cities of Brisbane and
Ipswich. It has a diverse population of newly settled immigrants and Aboriginal
and Torres Strait Islanders. Based on the 1996 Census of Population and
Housing, a third of Goodna's 6,963 residents were born overseas and 6.1% were
Aboriginal and Torres Strait Islanders. It has one of the lowest socio-economic
indexes (863.9) in Queensland and an unemployment rate (percentage of those
in the labour market that are unemployed) of 18.5%. With nearly half (42.09%)
of the residents having left school at an age between fourteen and fifteen, the
most common occupation among workers in Goodna was intermediate clerical,
sales and service worker (17.2%).
Almost all (99.3%) of the residents in Goodna lived in a private dwelling and
more than half (51.1%) had changed address in the last five years. Goodna had
an estimated 2,272 households and an average household size of 3.1 persons.
Of the households within the Goodna suburb, 28.1% were fully owned, 43.0%
rented and 24.6% were being purchased or paid off (Domrow, 2002).
2.2.2 Survey Design of GCLS
In the first wave of the pilot study for the GCLS, all residents in private
dwellings in Goodna aged 18 years or over were in the survey scope. 244
households were chosen at random from the frame which consisted of addresses
of all properties located in the suburb of Goodna. For each dwelling with one
or more usual residents aged 18 years or over, one of those residents aged 18
years or over was randomly selected to complete the questionnaire face to face.
A total sample of 243 households and 13 caravans within the local caravan
park was selected for the survey. The sample was designed to achieve at least
150 completed interviews. This figure accounts for 6.6% of the households in
Goodna.
The subsequent waves of the survey involved interviewing the following
three groups:
• Respondents to the 2002 survey who had agreed to a follow-up survey
  in 2003 and were still living in Goodna. For these respondents, a
  combination of telephone and face to face interviews was conducted. Three
  interviewers were used for the telephone phase and one interviewer for
  the face to face follow-up of non-contacts from the telephone phase;
• Respondents to the 2002 survey who had agreed to a follow-up survey
  in 2003 and had moved from Goodna. Information for these respondents
  was collected through telephone and mail out interviews;
• A new (replenishment) sample designed to maintain the total sample at
  or above the 2002 level. Fourteen interviewers carried out face to face
  interviews for this sub-sample.
263 households were chosen to supplement the 152 respondents that had
agreed to a further interview in 2003 (see OESR (2002) and OESR (2003)).
The main objectives of the pilot study were to determine people's
perceptions of community wellbeing and the quality of services available to the
Goodna community and to determine the success rate for re-interviewing
respondents 12 months later. Other secondary information such as respondent
reaction to the survey, time per interview, response rate and factors which
might impact on the future repeats of this survey were also collected.
2.3 The Longitudinal Survey of Immigrants to Australia
The Longitudinal Survey of Immigrants to Australia (LSIA) is the most
comprehensive survey of immigrants ever to be undertaken in Australia. The survey
seeks to provide government and other agencies with reliable data to monitor
and improve immigration and settlement policies, programs and services.
2.3.1 Survey Design of LSIA
The sampling unit for the LSIA was the person upon whom the approval to
immigrate was based - the Principal Applicant (PA). The sample of 5192 PAs
was drawn to represent the entire population of offshore visaed PAs aged 15
years and older migrating to Australia between September 1993 and August
1995. New Zealand citizens, those under 15 years of age and people who were
granted a visa while in Australia were excluded from the scope of the survey.
The sample was randomly selected and stratified by Visa eligibility category
and region/country of birth. There are five visa eligibility categories used in the
stratification. These are Humanitarian, Preferential Family, Concessional
Family, Business Skills and Employer Nomination, and Independent. Preferential
Family immigration is based on close family relationships. The Concessional
Family program lies between family-based and skill-based migration streams,
assessing potential migrants on both skills and more distant family relationships.
Skill-based migration includes independent migrants without family
relationships who are points tested (Independents), migrants with pre-arranged offers
of employment (ENS or Employer Nomination Scheme) and migrants intending
to establish businesses in Australia who meet certain capital requirements
(Business Skills).
There were about 50 region/country of birth categories used in
stratification. A mixture of categories was required because some individual countries
provide relatively few migrants and therefore, for the purposes of
stratification, they have to be aggregated into regions. For example, Peru, Chile and
Argentina have their own individual country of birth category. Other South
American countries are aggregated into the category 'Other South America'.
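Stratified random selection of this kind can be sketched generically. The code below is not the LSIA sampling procedure: the sampling frame, the three region labels and the equal allocation of four units per stratum are all hypothetical, since the actual frame and allocation are not given here. Only the five visa category names are taken from the text.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample without replacement inside each stratum.

    frame         : list of sampling units
    stratum_of    : function mapping a unit to its stratum label
    n_per_stratum : dict mapping stratum label -> sample size
    """
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for label, units in strata.items():
        n = min(n_per_stratum.get(label, 0), len(units))
        sample.extend(rng.sample(units, n))
    return sample


visas = [
    "Humanitarian",
    "Preferential Family",
    "Concessional Family",
    "Business Skills and Employer Nomination",
    "Independent",
]
regions = ["Region A", "Region B", "Region C"]  # hypothetical region labels

# Hypothetical frame of 600 units, 40 in each visa-by-region stratum.
frame = [
    {"id": i, "visa": visas[i % 5], "region": regions[i % 3]}
    for i in range(600)
]


def stratum_of(unit):
    return (unit["visa"], unit["region"])


allocation = {(v, r): 4 for v in visas for r in regions}  # assumed equal allocation
sample = stratified_sample(frame, stratum_of, allocation, random.Random(1995))
counts = Counter(stratum_of(u) for u in sample)
print(len(sample), "units sampled from", len(counts), "strata")
```

In a real design the allocation would typically differ across strata, for example to over-sample small visa categories.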
The selected PAs, together with any accompanying spouse or partner who
migrated with them on the same visa application, were interviewed three times.
Wave 1 interviews were designed to take place approximately 6 months after
the PA entered Australia; Wave 2 interviews were designed to occur 12 months
after the Wave 1 interviews. The third and final Wave was designed to occur
a further 24 months after the Wave 2 interviews. To assist PAs in providing
accurate responses, the time between arrival and the first interview was
minimised.
2.3.2 LSIA Data
A relationship of particular interest for immigration policies is the influence
of Visa Category on employment status of the immigrants. There are several
reasons for this emphasis. Immigration policy in Australia makes a clear
distinction between skill-based migration and family-based migration. In
recent years, the Australian government has moved to increase the number of
places for skilled migrants while at the same time cutting the overall size of
the immigration program. This shift in policy is consistent with that of other
nations such as the United States, where the number of U.S. visas reserved for
skill-based immigrants has increased substantially since the introduction of the
U.S. Immigration Act of 1990. Similarly, Canada increased its intake of skilled
independent migrants almost five-fold between 1984 and 1995. These policy
changes stem largely from the view that immigrants selected on the basis of
their labour market skills find the transition into the labour market easier and
make a greater contribution than immigrants selected on the basis of their
family relationships.
In Australia, several studies have been undertaken to analyse labour market
status at various stages of the settlement process. Williams et al. (1997)
analyse data from the first wave of the LSIA and conclude that there is a
significant association between Visa Category and labour market status six months
after arrival. Cobb-Clark (2000) examines the first two waves of the LSIA
and suggests that migrants selected for their skills have better labour market
outcomes. The author also notes that labour market outcomes are better for
native English speakers and for those who visited Australia prior to migration.
We add to this growing body of work by examining the relationship between
selection criteria and employment status of immigrants entering Australia using
data from all three waves of the Longitudinal Survey of Immigrants to
Australia.
The data for employment status were collected through a question on the
survey which utilised show cards. The PA was asked to identify which category
best describes their current main activity in Australia. The show card has the
following categories:
Employed
1 - A wage or salary earner
2 - Conducting own business but not employing others
3 - Conducting own business and employing others
4 - Other employed
Unemployed
5 - Unemployed looking for full time work
6 - Unemployed looking for part time work
Non-Participant
7 - Student
8 - Home duties
9 - Retired
10 - Aged pensioner
11 - Other pensioner
Two codes were allowed for other responses: 88 - Other and 98 - Refused/Not
stated. Principal applicants who fell into these categories were removed
from the analysis. The multinomial variable created by the three categories
Employed, Unemployed and Non-Participant is referred to as "employment status."
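The recoding described above can be written down directly. The sketch below maps the show-card codes to the three-level employment status and drops the two excluded codes; the function names and the example responses are hypothetical, and dropping a response here simply stands in for removing that principal applicant from the analysis.

```python
EMPLOYED = "Employed"
UNEMPLOYED = "Unemployed"
NON_PARTICIPANT = "Non-Participant"

# Show-card codes grouped as listed in the text above.
STATUS_BY_CODE = {}
for code in (1, 2, 3, 4):
    STATUS_BY_CODE[code] = EMPLOYED
for code in (5, 6):
    STATUS_BY_CODE[code] = UNEMPLOYED
for code in (7, 8, 9, 10, 11):
    STATUS_BY_CODE[code] = NON_PARTICIPANT

EXCLUDED_CODES = {88, 98}  # "Other" and "Refused/Not stated"


def employment_status(code):
    """Return the three-level status, or None for an excluded code."""
    if code in EXCLUDED_CODES:
        return None
    return STATUS_BY_CODE[code]


def recode(codes):
    """Recode a list of show-card codes, dropping excluded responses."""
    statuses = (employment_status(c) for c in codes)
    return [s for s in statuses if s is not None]


example_codes = [1, 5, 8, 98, 3, 11, 88, 6]  # hypothetical responses
print(recode(example_codes))
```

A code outside the show card (anything other than 1 to 11, 88 or 98) raises a KeyError in this sketch, which makes data-entry errors visible rather than silently recoding them.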
The explanatory variables chosen to model the response variable are gender,
age (+ $\text{age}^2$), marital status, wave, self-reported English speaking
ability, Visa Category, qualification, state of residence, region of birth,
pre-migration employment status, and pre-migration visit to Australia. Of
particular interest in our analyses are the explanatory variables of
self-reported English speaking ability and Visa Category.
Self-reported English speaking ability is derived from a show card question
which asks the principal applicant to nominate how well they speak English.
This is coded: 1 - English only; 2 - Very well; 3 - Well; 4 - Not well; 5 - Not
at all. Visa category was also recorded at the time of interview. The five visa
eligibility categories were as follows:
• Preferential Preferential Family/Family Stream
• Concessional Preferential Family/Family Stream
• Business Skills and Employer Nomination
• Independent
• Humanitarian
The variable for qualification is derived from a show card question which
asks the principal applicant to nominate their highest qualification. This is
coded as 1 - Higher degree; 2 - Postgraduate Diploma; 3 - Bachelor degree
or equivalent; 4 - Technical/professional qualification/diploma/certificate;
5 - Trade; 6 - 12 or more years of schooling; 7 - 10-11 years of schooling;
8 - 7-9 years of schooling; 9 - 6 or fewer years of schooling; 88 - other.
These categories were reduced to 1 - Tertiary qualification; 2 - Technical/trade
qualification; 3 - 10 or more years of schooling; 4 - <10 years of schooling.
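The reduction of the qualification codes can be sketched as a simple lookup. The text does not state which show-card codes feed each reduced category, nor how code 88 was treated, so the mapping below is an assumption inferred from the category labels, and the names QUALIFICATION_GROUP and recode_qualification are illustrative only.

```python
# Assumed mapping (not stated in the text) from show-card code to reduced group.
QUALIFICATION_GROUP = {
    1: 1, 2: 1, 3: 1,  # higher degree, postgraduate diploma, bachelor degree
    4: 2, 5: 2,        # technical/professional qualification, trade
    6: 3, 7: 3,        # 12 or more years, 10-11 years of schooling
    8: 4, 9: 4,        # 7-9 years, 6 or fewer years of schooling
}

GROUP_LABEL = {
    1: "Tertiary qualification",
    2: "Technical/trade qualification",
    3: "10 or more years of schooling",
    4: "<10 years of schooling",
}


def recode_qualification(code):
    """Return the reduced group for a show-card code, or None if unmapped."""
    # Code 88 ("other") has no stated destination, so it is left unmapped here.
    return QUALIFICATION_GROUP.get(code)


for code in (1, 5, 7, 9, 88):
    group = recode_qualification(code)
    print(code, "->", GROUP_LABEL.get(group, "unmapped"))
```

Returning None for unmapped codes keeps the treatment of "other" explicit rather than silently assigning it to a group.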
The place of interview was recorded in terms of the state or territory in
which the interview took place. The states and territories were categorised as
1 - New South Wales; 2 - Victoria; 3 - Queensland; 4 - South Australia;
5 - Western Australia; 6 - Tasmania, Northern Territory, Australian Capital
Territory. Region of birth was categorised into five categories: 1 -
Oceania/Other Africa; 2 - Middle East/North Africa; 3 - Asia; 4 - Americas;
5 - Europe and the former USSR. Two variables giving pre-migration information
were included in the analysis. The variable prior employment status records the
employment status (0 - employed, 1 - unemployed, 2 - non-participant) of the
principal applicant prior to migration and the variable Visit (1 - yes, 2 - no)
indicates whether the principal applicant had visited Australia prior to
migration.
2.3.3 Missing Data
Wave 1 of the Longitudinal Survey of Immigrants comprised 5192 principal
applicants. These were drawn to represent the entire population of offshore
visaed PAs aged 15 years and older migrating to Australia between September
1993 and August 1995. Not all persons interviewed at Wave 1 were interviewed
at Wave 2. Similarly, not all persons interviewed at Wave 2 were interviewed
at Wave 3. At Wave 2, 4468 (86%) of PAs were interviewed while at Wave 3,
3752 (72%) of PAs were interviewed. Table 2.2 shows the distribution of PAs
by interview status.
Table 2.2: Distribution of Principal Applicant Interview Status by Wave,
beginning with 5192 PAs at Wave 1

STATUS                      WAVE 2   WAVE 3
Interviewed                   4468     3752
Unable to track                251      563
Refused                        109      225
Overseas temporarily           204      289
Overseas permanently            78      234
Australia - Out of Scope        27       41
Deceased                         4       19
Other                           51       69
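As a quick arithmetic check on the interview status counts, the following Python sketch totals each wave and recomputes the response rates. The Wave 3 "Interviewed" count is taken as the 3752 quoted in the text, since that is the value for which the Wave 3 column sums to the 5192 PAs of Wave 1. The variable names are ours.

```python
WAVE1_TOTAL = 5192

# (Wave 2 count, Wave 3 count) for each interview status.
# The Wave 3 "Interviewed" entry uses the 3752 quoted in the text.
status_counts = {
    "Interviewed": (4468, 3752),
    "Unable to track": (251, 563),
    "Refused": (109, 225),
    "Overseas temporarily": (204, 289),
    "Overseas permanently": (78, 234),
    "Australia - Out of Scope": (27, 41),
    "Deceased": (4, 19),
    "Other": (51, 69),
}

# Every Wave 1 PA should be accounted for at each later wave.
wave2_total = sum(w2 for w2, _ in status_counts.values())
wave3_total = sum(w3 for _, w3 in status_counts.values())

# Response rates relative to the Wave 1 sample.
wave2_rate = status_counts["Interviewed"][0] / WAVE1_TOTAL
wave3_rate = status_counts["Interviewed"][1] / WAVE1_TOTAL

print(f"Wave 2: total {wave2_total}, interviewed {wave2_rate:.0%}")
print(f"Wave 3: total {wave3_total}, interviewed {wave3_rate:.0%}")
```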
It is important to note that the description of interview status by Wave
does not necessarily describe a "drop-out" process. Drop-out is characterised
by the fact that once an individual has left the study, no more measurements
are obtained on that individual. For the LSIA, an attempt was made to interview
the PA in Wave 3 even when the PA was not interviewed at Wave 2. If a PA
was interviewed at Waves 1 and 3 but not Wave 2, this missing data pattern
is referred to as "intermittent" rather than a drop-out pattern (see Table 2.3).
Table 2.3: Missing data patterns for the Longitudinal Survey of Immigrants to
Australia; O - observed, M - missing

Missing data          Wave
pattern            1   2   3
0                  O   O   O
1                  O   O   M
2                  O   M   M
3                  O   M   O
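The distinction between drop-out and intermittent missingness can be made concrete with a short Python sketch. It encodes the four patterns of Table 2.3 as tuples of booleans (True for an observed wave) and labels each one. The function name classify_pattern and the label strings are illustrative and not part of the survey documentation.

```python
def classify_pattern(observed):
    """Label a wave-by-wave record, where True means the PA was interviewed."""
    if all(observed):
        return "complete"
    first_missing = observed.index(False)
    # Drop-out: after the first missed wave, no further wave is observed.
    if not any(observed[first_missing:]):
        return "drop-out"
    # Otherwise the PA returned after a missed wave.
    return "intermittent"


# Patterns 0-3 of Table 2.3 (Waves 1, 2, 3).
patterns = {
    0: (True, True, True),
    1: (True, True, False),
    2: (True, False, False),
    3: (True, False, True),
}

labels = {key: classify_pattern(obs) for key, obs in patterns.items()}
for key, label in labels.items():
    print(key, label)
```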
In addition to these missing data patterns, there may be missing data due
to a principal applicant's refusal to answer a specific question or questions
of the survey. The reasons for non-response to particular questions are many
and varied and include not knowing the answer to the question, not
understanding the wording of the question and unwillingness to provide
information which they feel is of a sensitive or private nature. This
question-specific missing data coupled with the missing data due to
non-interview can cause considerable problems when the data comes to be
analysed and interpreted.
Table 2.4 shows the distribution of PAs by employment status and wave of
survey interviews, for the data set containing the 3234 complete cases, and for
the full incomplete data set which includes missing response and covariate
observations, respectively. The incomplete data set includes records for the
4950 PAs with age ranging from 19 to 64. For the complete case data, Table 2.4
shows that the percentage of employed PAs increases substantially with wave
while the percentages of unemployed and non-participant PAs each decrease
more moderately with wave. Because the same individuals are being interviewed
at each wave, this implies that more people are entering the work force
from unemployed and non-participant states, as time progresses since arrival
in Australia.
For the larger incomplete case data set, this pattern at Wave 3 is not
obvious. During the first wave, 39% of PA immigrants are employed, 34% are
unemployed and 23% are non-participant in the work force, while a reasonably
small 4% of responses are missing. By Wave 3, the percentage of missing
responses has risen to 30%, which is double the percentage of missing responses
in Wave 2. While the percentage of PAs employed has remained at approximately
45% across Waves 2 and 3, and the percentages of PAs unemployed and
non-participant appear to have fallen across Waves 2 and 3 of interviews, it is
unknown whether these patterns in the data beyond Wave 1 are masked by the high
percentage of missing responses. This is a typical scenario in longitudinal
survey data. The incomplete case data contains records for 53% more PAs than
the complete case data, and so these incomplete cases should be included in the
analyses so as not to exclude valuable information. Relationships determined by
the complete cases can be used to impute missing response and covariate data
for the incomplete cases.
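The proportions quoted above follow directly from the counts reported in Table 2.4. The Python sketch below recomputes the row shares for both data sets and the relative size of the incomplete case data. The variable and function names are ours.

```python
# Counts from Table 2.4: (employed, unemployed, non-participant) by wave.
complete = {
    1: (1325, 1143, 766),
    2: (1770, 1009, 455),
    3: (2072, 837, 325),
}
# Incomplete-case counts carry a fourth entry for missing responses.
incomplete = {
    1: (1918, 1676, 1151, 205),
    2: (2267, 1324, 616, 743),
    3: (2227, 906, 363, 1454),
}


def proportions(counts):
    """Return each count as a share of its row total."""
    total = sum(counts)
    return [c / total for c in counts]


n_complete = sum(complete[1])
n_incomplete = sum(incomplete[1])
# Relative excess of PAs in the incomplete data over the complete cases.
extra_share = n_incomplete / n_complete - 1.0

for wave in (1, 2, 3):
    full = ", ".join(f"{p:.2f}" for p in proportions(complete[wave]))
    part = ", ".join(f"{p:.2f}" for p in proportions(incomplete[wave]))
    print(f"wave {wave}: complete [{full}]  incomplete [{part}]")

print(f"incomplete data hold {extra_share:.0%} more PAs than complete cases")
```

Computed directly, the Wave 3 missing share is 1454/4950, which rounds to 0.29; the tabulated 0.30 presumably reflects rounding chosen so that the row proportions sum to one.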
2.3.4 Initial Univariate Analysis
A summary of immigrant employment status, categorised by interview wave
and selected variables associated with the immigrant, is presented in Table 2.5.
Table 2.5 shows strong associations between the response variable employment
status and the explanatory variables.
Table 2.4: Counts (and proportions) of PAs by employment status and wave
of interview for the two data sets with 'complete' and 'incomplete' cases,
respectively

Employment Status for Complete Cases
Wave   Employed      Unemployed    Non-participant   Total
1      1325 (0.41)   1143 (0.35)    766 (0.24)       3234
2      1770 (0.55)   1009 (0.31)    455 (0.14)       3234
3      2072 (0.64)    837 (0.26)    325 (0.10)       3234

Employment Status for Incomplete Cases
Wave   Employed      Unemployed    Non-participant   Missing       Total
1      1918 (0.39)   1676 (0.34)   1151 (0.23)        205 (0.04)   4950
2      2267 (0.46)   1324 (0.27)    616 (0.12)        743 (0.15)   4950
3      2227 (0.45)    906 (0.18)    363 (0.07)       1454 (0.30)   4950

For each level of the variables recorded for an immigrant, Table 2.5 shows
the percentages of immigrants who are employed, unemployed and non-participant
in the work force at the specified Wave of interviews. As discussed in section
3.4, the influence of Visa Category on employment status is of particular
interest. Table 2.5 shows that approximately 90% of immigrants with a Business
Skills/ENS visa are employed across all 3 Waves. Of the immigrants with an
Independent visa, a lower 59% are employed at Wave 1; however, this rises
to 87% employed at Wave 3. For each of the remaining Preferential Family,
Concessional Family and Humanitarian Visa Categories, the percentage of
immigrants employed is lower in Wave 1 but has increased significantly by Wave
3. For all Visa Categories the percentage of immigrants who are unemployed
is considerably lower by Wave 3 interviews. Approximately 54% of immigrants
with a Preferential Family or Humanitarian visa are non-participant in the
work force at Wave 1 and this falls to 45% and 40% respectively, by Wave 3.
These univariate summary results indicate that immigrants with a Business
Skills/ENS visa are most likely to be employed shortly after arriving in
Australia and that a very high percentage of immigrants remain employed over
three years later. However, it also appears that the potential for employment
is reasonably high for immigrants with Independent and Concessional Family
visas, after several years following arrival in Australia.
English-speaking ability is time-variant and may be improved as the immigrant
settles into a new country. It is of interest to investigate to what extent
English-speaking ability influences employment status. Table 2.5 shows that
for immigrants who speak English only, 70% are employed shortly after arrival
and that this increases to 80% by Wave 3 of interviews. Of those who speak
English very well, 51% are employed shortly after arrival and a high 74% are
employed by Wave 3. For immigrants who do not speak English well or at all, a
very low percentage are employed after arrival (19% and 8%, respectively). This
percentage approximately doubles by Wave 3; however, it should be noted that a
very high percentage of immigrants who do not speak English at all (71%) are
classified as non-participant at Wave 1, and this percentage increases to 81%
at Wave 3. Approximately half of immigrants who do not speak English well
are classified as non-participant. These univariate summary results show that
immigrants who speak English well are more likely to participate in the
work force, and to be employed sooner, than those who do not.
Table 2.5 also shows that female immigrants are much more likely to be
non-participant in the work force than males. Approximately half of the female
immigrants are non-participant while only 14% of males are non-participant
at Wave 3. Although only 25% of female immigrants are employed at Wave 1,
this increases to 42% by Wave 3. A higher 75% of males are employed at Wave 3.
Table 2.5 also shows that immigrants with a tertiary education are just
as likely to be employed at all Waves as those who have a technical or trade
qualification. The percentage of immigrants employed shortly after arrival is
greater for those who visited Australia at a previous time (55%) compared to
those who did not (26%). For all variables listed in Table 2.5, the percentage
of immigrants employed at Wave 3 is considerably higher than the percentage
employed at Wave 1. It therefore appears that Wave (time since arrival in
Australia) has a strong influence on employment status.

Table 2.5: Summary percentage counts for Employment Status in Wave 1 (ES$_1$),
Wave 2 (ES$_2$) and Wave 3 (ES$_3$), by explanatory variable. Employment Status
categories are: E - employed, N-P - non-participant, U - unemployed.
Percentage counts have been computed after excluding missing values in
Employment Status due to non-response as well as failure to interview
(Waves 2 and 3).

VARIABLE          LEVEL                   % ES$_1$ (n=4983)   % ES$_2$ (n=4414)   % ES$_3$ (n=3680)
                                            E    N-P      U     E    N-P      U     E    N-P      U
Gender            Female                   25     58     17    35     55     10    42     50      8
                  Male                     49     23     28    64     19     17    75     14     11
English Speaking  Only                     70     15     15    78     16      6    80     17      3
                  Very Well                51     22     27    64     25     11    74     18      8
                  Well                     38     33     29    48     35     17    59     29     12
                  Not well                 19     57     24    29     50     21    37     44     19
                  Not at all                8     71     21    15     73     12    13     81      6
Qualification     Tertiary                 48     26     26    62     24     14    72     20      8
                  Technical/Trade          46     34     20    59     29     12    68     23      9
                  ≥ 10 yrs school          25     52     23    38     46     16    46     42     12
                  < 10 yrs school          16     60     24    23     58     19    35     51     14
State             NSW                      40     38     22    54     31     15    63     27     10
                  VIC                      29     38     33    46     37     17    55     33     12
                  QLD                      52     37     11    60     30     10    69     25      6
                  SA                       31     46     23    42     44     14    52     34     14
                  WA                       44     36     20    56     35      9    65     27      8
                  TAS, NT, ACT             43     41     16    51     40      9    64     29      7
Visa              Preferential Family      28     53     19    39     50     11    46     45      9
                  Concessional Family      48     23     29    63     19     18    77     12     11
                  Business Skills/ENS      89      8      3    92      5      3    92      6      2
                  Independent              59     15     26    77     12     11    87      9      4
                  Humanitarian              7     54     39    22     49     29    37     40     23
Region of Birth   Oceania/Other Africa     48     31     21    62     28     10    65     26      9
                  Middle East/Nth Africa   17     47     36    28     43     29    39     39     22
                  Asia                     38     38     24    51     36     13    62     30      8
                  Americas                 40     43     17    52     39      9    63     30      7
                  Europe and former USSR   48     36     16    60     30     10    67     25      8
Marital Status    Unmarried                41     37     22    53     35     12    57     30     13
                  Married                  38     38     24    51     34     15    62     29      9
Visit             Yes                      55     29     16    67     26      7    72     23      5
                  No                       26     45     29    40     40     20    53     33     14
Pre-Migration ES  Employed                 46     30     24    60     26     14    70     21      9
                  Non-participant          16     67     17    26     61     13    32     56     12
                  Unemployed               30     36     34    38     36     26    56     26     18
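To illustrate the apparent Wave effect, the Python sketch below computes the change in the percentage employed between Wave 1 and Wave 3 for each Visa Category, using the values reported in Table 2.5. This is a descriptive summary only: the percentages at each wave are based on different numbers of respondents, and the names used are ours.

```python
# Percent employed at Waves 1, 2 and 3 by Visa Category (Table 2.5).
employed_pct = {
    "Preferential Family": (28, 39, 46),
    "Concessional Family": (48, 63, 77),
    "Business Skills/ENS": (89, 92, 92),
    "Independent": (59, 77, 87),
    "Humanitarian": (7, 22, 37),
}

# Percentage-point change in employment from Wave 1 to Wave 3.
gain = {visa: w3 - w1 for visa, (w1, _, w3) in employed_pct.items()}

# List categories from largest to smallest gain.
for visa, points in sorted(gain.items(), key=lambda item: item[1], reverse=True):
    print(f"{visa:<22}{points:>4} percentage points")
```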
With a cohort study such as the LSIA it is extremely likely that the sample
design will be unbalanced with respect to the categorical variables of interest.
The sample is chosen to be representative of the population which will often
be unbalanced in reality. From Table 2.5, two important variables that appear
to be strongly associated with employment status (other than Wave) are Visa
Category and English-speaking ability. To assess whether the data for these
two variables are unbalanced we produced a cross-tabulation of the five levels
of Visa Category with the five levels of English-speaking ability. The results
of this cross-tabulation are reported in Table 2.6 as percentages of immigrants
by English-speaking ability within each Visa Category. The data is very
unbalanced with 78% of immigrants with a Humanitarian visa being unable to speak
English well, while the majority of immigrants from the remaining four Visa
Categories are able to speak English well or better. This means, for example,
that we are unable to make any inferences about the associations of employment
status with immigrants on a Humanitarian visa who speak English only.
It is important in this case to assess the significance of an interaction term
for Visa Category and English-speaking ability in the model as the main effects
of these variables may be confounded with each other. Some of the counts
associated with cells in Table 2.6 are very small. Thus, to estimate appropriate
model coefficients in Section 3.5.2 we have chosen to combine categories of
English-speaking ability to form the three new categories of Very Well, Well
and Not Very Well.
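The imbalance and the effect of combining categories can be illustrated with the row percentages reported in Table 2.6. The text does not state which of the five original levels feed each of the three new categories, so the grouping in COLLAPSE below is an assumption (English Only and Very Well merged; Not Well and Not at All merged), and all names are ours.

```python
LEVELS = ("English Only", "Very Well", "Well", "Not Well", "Not at All")

# Row percentages from Table 2.6 (within Visa Category, at Wave 1).
row_pct = {
    "Preferential Family": (18, 11, 25, 32, 14),
    "Concessional Family": (32, 13, 26, 25, 4),
    "Business Skills/ENS": (48, 15, 17, 15, 5),
    "Independent": (41, 21, 30, 7, 1),
    "Humanitarian": (1, 4, 17, 53, 25),
}

# Assumed grouping of the five levels into three (not stated in the text).
COLLAPSE = {
    "English Only": "Very Well",
    "Very Well": "Very Well",
    "Well": "Well",
    "Not Well": "Not Very Well",
    "Not at All": "Not Very Well",
}


def collapse(row):
    """Sum the row percentages within each combined category."""
    combined = {"Very Well": 0, "Well": 0, "Not Very Well": 0}
    for level, pct in zip(LEVELS, row):
        combined[COLLAPSE[level]] += pct
    return combined


collapsed = {visa: collapse(row) for visa, row in row_pct.items()}
for visa, combined in collapsed.items():
    print(f"{visa:<22}{combined}")
```

Under this assumed grouping, the Humanitarian row places 78% of immigrants in the Not Very Well category, matching the figure quoted in the text.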
Table 2.6: Percentages of immigrants in Survey by Visa Category and
English-speaking ability at Wave 1.

                                      English-speaking Ability
Visa Category          English Only   Very Well   Well   Not Well   Not at All
Preferential Family              18          11     25         32           14
Concessional Family              32          13     26         25            4
Business Skills/ENS              48          15     17         15            5
Independent                      41          21     30          7            1
Humanitarian                      1           4     17         53           25

With the exception of marital status at waves 1 and 2, all univariate
associations between the response and explanatory variables are statistically
significant at the $\alpha = 0.01$ level. In Section 3.5.2 we investigate these
associations in the context of a multivariate model which allows the data for
all waves to be modelled simultaneously. Throughout this section the "adjusted
log-odds" for a term refers to the estimated log-odds which have been
"adjusted" for the effects of other explanatory variables and their
interactions that have been included in the model.
2.4 Conclusions
Government agencies around the world have recently invested in longitudinal
surveys to provide information for their policy making. In this chapter, we
have introduced two such surveys that have been conducted in Australia: the
GCLS and the LSIA. The GCLS was conducted to collect information on the
Goodna community in the areas of social, environmental and economic well-being
while the LSIA seeks to provide government with data to monitor and
improve immigration and settlement policies, programs and services.
In section 2.1, we introduced the basic issues in longitudinal survey designs.
As with any survey design, there is no single 'best' approach; ultimately, the
design depends on the key research objectives and should incorporate
mechanisms to reduce potential bias due to drop-out. However, there are two
common basic design structures, namely the classic design and the rotation
design. Surveys such as LSIA use the classic design to collect information on
the same sampling units over the life of the survey whilst household based
surveys, such as GCLS, require replenishment schemes to reflect changes in
the household compositions.
For the LSIA dataset, we examined the missingness patterns of the variable
of interest, employment status, and observed slight transitional differences
with respect to time between the dataset containing 3234 complete cases (PAs
between the ages of 19 and 64) and the dataset with 4950 incomplete cases
(PAs between the ages of 19 and 64). Additionally, the incomplete case data
contains records for 53% more PAs than the complete case data, and so these
incomplete cases should be included in the analyses so as not to exclude
valuable information.
In the next chapter, we begin our multivariate analysis by applying
frequentist marginal modelling to the employment status of immigrants to
Australia. The GCLS dataset will be used in Chapter 6.
Chapter 3
Marginal Models of Categorical
Longitudinal Data
The work in Chapter 3 can be found in Pettitt, A., M. Haynes, T. Tran, and
J. Hay (2002) "A Model for Longitudinal Employment Status of Immigrants
to Australia" QUT e-Prints (available at http://eprints.qut.edu.au/). The
research was carried out in collaboration with Tony Pettitt, Michele Haynes
and John Hay. The original concepts were formed by Tony and John. Additional
changes, improvements, implementation and write-up were carried out
by Michele and me.
3.1 Introduction
In recent years, considerable effort has gone into the development of
statistical methods for the analysis of longitudinal categorical response data.
While much of this effort has focused on techniques for binary or Poisson data,
relatively little attention has been given to nominal categorical variables.
A significant contribution to the analysis of nominal categorical data in
general is by McFadden (1976). By introducing a latent variable (also referred
to as a utility function), $Z$ say, the discrete choice model of McFadden (1976)
can be formulated as follows. Let $Y_i$ denote the state of individual $i$.
$Y_i$ can be in any state $j$ $(= 1, \ldots, J)$. The model is specified in
terms of the so-called utility, $Z_{ij}$, of the individual $i$ for choice $j$
and is given by

\[
  Z_{ij} = X_i^T \beta_j + e_{ij}, \tag{3.1}
\]

where $X_i^T \beta_j$ is the deterministic part of the utility and $e_{ij}$ is
the stochastic part capturing the uncertainty. Then $Y_i = j$ if and only if
$Z_{ij} = \max_{1 \le k \le J} \{Z_{ik}\}$, $(j = 1, \ldots, J)$.
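The latent utility formulation can be illustrated by simulation. The passage above does not specify the distribution of the stochastic part, so the Python sketch below assumes independent standard Gumbel (type I extreme value) errors, which is the assumption under which the maximum-utility rule yields multinomial logit choice probabilities. The covariate vector, the coefficient values and the function names are hypothetical and chosen only for illustration; states are indexed from 0 in the code.

```python
import math
import random


def simulate_choice(x, betas, rng):
    """Draw one Y_i: the index j maximising Z_ij = x' beta_j + e_ij."""
    utilities = []
    for beta in betas:
        deterministic = sum(xk * bk for xk, bk in zip(x, beta))
        # Standard Gumbel draw by inverse CDF: -log(-log(U)), U in (0, 1).
        u = rng.random()
        while u == 0.0:
            u = rng.random()
        noise = -math.log(-math.log(u))
        utilities.append(deterministic + noise)
    return max(range(len(betas)), key=utilities.__getitem__)


def logit_probabilities(x, betas):
    """Multinomial logit probabilities implied by Gumbel errors."""
    scores = [sum(xk * bk for xk, bk in zip(x, beta)) for beta in betas]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(2008)
x = (1.0, 0.5)                                  # hypothetical covariates
betas = [(0.0, 0.0), (0.4, -1.0), (-0.3, 0.8)]  # hypothetical beta_j, J = 3
n_draws = 100_000

counts = [0] * len(betas)
for _ in range(n_draws):
    counts[simulate_choice(x, betas, rng)] += 1

empirical = [c / n_draws for c in counts]
theoretical = logit_probabilities(x, betas)
for j, (emp, theo) in enumerate(zip(empirical, theoretical)):
    print(f"state {j}: simulated {emp:.3f}, logit {theo:.3f}")
```

With a large number of draws the simulated state frequencies should lie close to the logit probabilities, which gives a simple check that the maximum-utility rule has been coded as described.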