RATIO MEAN AND WEIGHTED REGRESSION METHODOLOGY FOR ANALYSIS OF CATEGORICAL DATA



Eliana H. Marques, Rosemeire L. Fiaccone [1]



ABSTRACT



Primary response criteria of interest in health and social research studies often have a categorical nature. Nowadays, a common feature of these studies is a non-standard data structure with some degree of clustering. The ratio statistic, of interest here, appears in the literature in the context of sample surveys. The theory presented by Snyder (1993) on ratio mean regression is extended in this paper to analyse discrete bivariate responses with a stratified two-stage cluster sampling structure. The technique, which fits naturally within the Weighted Least Squares (WLS) methodology, is used to estimate incidence density and to model its variation. An illustration is presented with data from the city of Salvador (State of Bahia), from an evaluation of the impact of environmental sanitation in poor areas of the city.


KEY WORDS: Ratio statistic; Incidence density; Complex structures.




[1] Eliana H. Marques, Departamento de Estatística da Universidade Estadual de Campinas - Unicamp, Brasil, eliana@ime.unicamp.br
Rosemeire L. Fiaccone, Pós-graduação do Departamento de Estatística da Unicamp / Departamento de Estatística da Universidade Federal da Bahia - Ufba, Brasil, fiaccone@ufba.br



1. INTRODUCTION


1.1 Description of the Problem


Nowadays researchers have a large number of statistical procedures and computational tools for analysing their data. Without a careful choice of methodology, inferences based on the results may be wrong or less accurate than they could or should be. In particular, the role of sampling needs to be taken into consideration when choosing a method of analysis for a set of data, since the complexities of the sampling design are usually connected with the estimation procedure. In recent years, a common feature of many studies in research areas such as medicine, epidemiology and economics has been a non-standard data structure in which the response of interest is categorical and the sampling design is complex, for example, with clusters at one or more levels.



Ratio methodology appears in the literature mainly in the context of survey sampling with complex structures. Koch et al. (1975) and Freeman et al. (1976) presented the Weighted Least Squares (WLS) methodology for analysing multivariate complex data, taking the effect of the sampling plan into account when calculating the statistic of interest. Landis et al. (1987) used the same methodology to model cumulative logits, also under a complex sampling plan.


WLS is based on the model $E_A(\mathbf{F}) = \mathbf{X}\boldsymbol{\beta}$, where $\mathbf{X}$ is the design matrix, $\boldsymbol{\beta}$ is the regression parameter vector and $\mathbf{F}$ is a vector of functions of the response data. More generally, $\mathbf{F}$ can be any vector of functions obtained from combinations of linear, logarithmic, or exponential transformations of the response data. For complex samples, $\mathbf{F}$ may be the vector of ratio estimates which are functions of Horvitz-Thompson estimators for population totals (Davies, 1994).


The strategy of the WLS methodology is a two-step process. First, the measure of interest is estimated along with its covariance matrix, which is obtained by linear Taylor series approximation methods; then a linear model for the measure of interest is fitted using WLS to study the effects of the independent variables. In this paper, interest is in ratio measures. The methodology for ratio mean regression in a setting of cluster samples of binary outcomes will be extended to accommodate discrete outcomes. A vector of ratio means (or a vector of logs of ratio means) corresponding to the cross-classification of categorical covariates will be modelled with WLS.



2. METHODOLOGY



2.1 Two-Stage Cluster Sampling Notation


Let $N$ clusters be randomly selected with replacement from a population of $N^{*}$ clusters. Let $\pi_{\alpha}$ be the probability of selecting cluster $\alpha$, $\alpha = 1, 2, \ldots, N^{*}$. Let $n_{\alpha}$ be the number of elements selected by simple random sampling without replacement from cluster $\alpha$. The $j$-th sampled element from the $i$-th sampled cluster is represented by $y_{ij}$, where $i = 1, 2, 3, \ldots, N$ and $j = 1, 2, 3, \ldots, n_i$. Note that $n_i = n_{\alpha}$ when $\alpha$ is the $i$-th selected cluster. An unbiased estimator of the mean of cluster $i$ is the sample mean

$$\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}.$$


Note that, in general, $\bar{y}_i$ is not an unbiased estimator of the mean over all clusters. Another estimate, suggested by Sukhatme (1953), is

$$z_i = \frac{\nu_i}{\pi_i\,\nu_+}\,\bar{y}_i,$$

where $\nu_i$ is the total number of elements in the $i$-th sampled cluster, $\pi_i$ is the probability of selection of the $i$-th cluster, and $\nu_+$ is the total number of elements in the population. This estimator is biased for the cluster means and unbiased for the overall mean. If $\nu_i / (\pi_i\,\nu_+) = 1$, then $\bar{y}_i = z_i$ and both are unbiased estimators of the cluster means and of the overall mean. The mean $z_i$ may be written as a mean of weighted sample elements, i.e., $z_i = w_i\,\bar{y}_i$, where $w_i = \nu_i / (\pi_i\,\nu_+)$ is the sampling weight.

The $z_i$ are independent and identically distributed, since clusters are selected with replacement and sampling within each cluster is independent of sampling within the others.
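The weighting step can be illustrated numerically. The following is a minimal sketch, assuming the weight has the form $w_i = \nu_i / (\pi_i\,\nu_+)$; the data, the cluster sizes, the number of population clusters and all function names are hypothetical and are not taken from the study.

```python
from statistics import mean

# Hypothetical sampled data: element values y_ij for N = 3 sampled clusters.
samples = [[1, 0, 1, 1], [0, 0, 1], [1, 1, 1, 0, 1]]
cluster_sizes = [40, 30, 50]       # nu_i: elements in each sampled cluster
population_size = 1200             # nu_+: elements in the whole population
population_clusters = 30           # assumed number of clusters in the population
selection_prob = 1.0 / population_clusters  # pi_i: equal for every cluster here


def cluster_mean(values):
    """Sample mean of the elements observed in one cluster."""
    return sum(values) / len(values)


def sampling_weight(nu_i, pi_i, nu_plus):
    """Weight w_i = nu_i / (pi_i * nu_+)."""
    return nu_i / (pi_i * nu_plus)


y_bar = [cluster_mean(values) for values in samples]
weights = [sampling_weight(nu_i, selection_prob, population_size)
           for nu_i in cluster_sizes]
z = [w_i * y_i for w_i, y_i in zip(weights, y_bar)]

# The overall mean is estimated by averaging the weighted cluster means z_i.
overall_mean = mean(z)
print(weights, overall_mean)
```

With equal selection probabilities, larger clusters receive weights above 1 and smaller clusters weights below 1, so the average of the $z_i$ differs from the plain average of the cluster means.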





2.2 Ratio Mean Definition


When interest is in estimating ratios or proportions for population subgroups defined by the cross-classification of explanatory variables, methods for ratio estimation have been used. For example, Lavange et al. (1994) proposed the use of the multivariate ratio method for analysing incidence density in an observational study of lower respiratory infection disease in children in their first year of life.



The ratio mean estimator for the overall population mean per element of an attribute of interest is defined in this section for observations of a binary response from subjects in a two-stage cluster sample. The assumed sampling method for clusters is simple random sampling with replacement (or, equivalently, without replacement from a very large population).


Let $i = 1, 2, \ldots, N$ be the index referring to the sampled clusters, $j = 1, 2, \ldots, n_i$ the index for the elements in the $i$-th cluster, and $t = 1, 2, \ldots, v_{ij}$ the index for multiple observations for the $j$-th subject in the $i$-th cluster. $N$ represents the number of sampled clusters, $n_i$ the total number of subjects in the $i$-th cluster and $v_{ij}$ the total number of potential observations for the $j$-th subject in the $i$-th cluster.


Let $X_{ijt2}$ be a binary response which takes the value 1 if the $t$-th observation for the $j$-th subject in the $i$-th cluster is relevant (or observed), and 0 otherwise, and let $Y_{ijt2}$ be a binary response which takes the value 1 if the $t$-th observation for the $j$-th subject in the $i$-th cluster is relevant and has the attribute of interest, and 0 otherwise. The index 2 in $Y_{ijt2}$ and $X_{ijt2}$ emphasizes that these random variables result from a two-stage cluster sampling procedure.

Define

$$\bar{Y}_{i\cdot\cdot 2} = \frac{1}{n_i} \sum_{j=1}^{n_i} \sum_{t=1}^{v_{ij}} Y_{ijt2} \quad \text{and} \quad \bar{X}_{i\cdot\cdot 2} = \frac{1}{n_i} \sum_{j=1}^{n_i} \sum_{t=1}^{v_{ij}} X_{ijt2}$$

as the sample mean of the number of relevant observations with the attribute per subject in the $i$-th cluster and the sample mean of the number of relevant observations per subject in the $i$-th cluster, respectively. Both averages, $\bar{Y}_{i\cdot\cdot 2}$ and $\bar{X}_{i\cdot\cdot 2}$, come from two-stage cluster sampling, so they need to be adjusted; i.e., weights need to be applied to them in order to avoid bias, since clusters may differ greatly in size.

So,

$$\bar{Y}_{i\cdot\cdot 2w} = w_i\,\bar{Y}_{i\cdot\cdot 2}, \qquad \bar{X}_{i\cdot\cdot 2w} = w_i\,\bar{X}_{i\cdot\cdot 2},$$

where $w_i = \nu_i / (\pi_i\,\nu_+)$ is the sampling weight previously defined.


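Since the sampling method for clusters is simple random sampling with replacement, the pairs $(\bar{Y}_{i\cdot\cdot 2w}, \bar{X}_{i\cdot\cdot 2w})$ are independent and identically distributed. So, the ratio mean estimator of the measure of interest over all clusters is:

$$R = \frac{\sum_{i=1}^{N} \bar{Y}_{i\cdot\cdot 2w}}{\sum_{i=1}^{N} \bar{X}_{i\cdot\cdot 2w}} = \frac{\bar{Y}_{2w}}{\bar{X}_{2w}}, \qquad (2.1)$$

where $\bar{Y}_{2w}$ and $\bar{X}_{2w}$ are the means of the weighted cluster averages over the $N$ sampled clusters.

$R$ is an estimate of the number of relevant observations with the attribute over all subjects and clusters, divided by the number of relevant observations over all subjects and clusters.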

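Using matrix notation, $R$ may be written in terms of

$$\mathbf{f}_{ijt2} = (Y_{ijt2}, X_{ijt2})', \qquad \mathbf{f}_{i2} = \frac{1}{n_i} \sum_{j=1}^{n_i} \sum_{t=1}^{v_{ij}} \mathbf{f}_{ijt2},$$

$$\mathbf{f}_{i2w} = w_i\,\mathbf{f}_{i2} \quad \text{and} \quad \bar{\mathbf{f}}_{2w} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{f}_{i2w}.$$

Then, $R$ is given by

$$R = \exp\left( \mathbf{A} \ln \bar{\mathbf{f}}_{2w} \right), \quad \text{where } \mathbf{A} = [1, -1]. \qquad (2.2)$$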


Expression (2.1) depends on $w_i$; more specifically, it depends on knowledge of the total number of study subjects in cluster $i$ ($\nu_i$), of the probability of selecting cluster $i$ ($\pi_i$) and of the total number of study subjects in the population ($\nu_+$). In practice, some of these quantities may not be known. Some sampling schemes do not require knowledge of some or all of these quantities. One such scheme is discussed next.


This sampling plan, called the self-weighting design, avoids specification of population quantities in the analysis altogether. Let $\pi_{\alpha} = \nu_{\alpha} / \nu_+$ be the probability of selecting cluster $\alpha$ in the first stage, i.e., clusters are selected with probability proportional to their sizes. Let also $n$ be a constant number of subjects selected from each sampled cluster $i$, $i = 1, 2, \ldots, N$, in the second stage. Then, the common probability of selection for each subject is given by $\pi_i\,n / \nu_i = n / \nu_+$. In this scheme it is necessary to know the size of each cluster in order to determine $\pi_i$.
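The ratio mean in (2.1) and its matrix form in (2.2) can be compared on a small example. This is a sketch only: the cluster averages and weights below are hypothetical, the function names are illustrative, and the self-weighting case is represented simply by setting every weight to 1.

```python
import math


def ratio_mean(y_means, x_means, weights):
    """Ratio of the weighted sums of cluster averages, as in expression (2.1)."""
    numerator = sum(w * y for w, y in zip(weights, y_means))
    denominator = sum(w * x for w, x in zip(weights, x_means))
    return numerator / denominator


def ratio_mean_matrix_form(y_means, x_means, weights):
    """Same estimator written as exp(A ln f) with A = [1, -1], as in (2.2)."""
    n_clusters = len(weights)
    f_bar = (
        sum(w * y for w, y in zip(weights, y_means)) / n_clusters,
        sum(w * x for w, x in zip(weights, x_means)) / n_clusters,
    )
    a = (1.0, -1.0)
    return math.exp(sum(a_k * math.log(f_k) for a_k, f_k in zip(a, f_bar)))


# Hypothetical cluster averages: events per subject and observations per subject.
y_means = [0.6, 0.9, 0.4]
x_means = [2.0, 3.0, 1.6]

unequal_weights = [1.0, 0.75, 1.25]   # clusters drawn with equal probability
self_weights = [1.0, 1.0, 1.0]        # self-weighting design: w_i = 1

r_weighted = ratio_mean(y_means, x_means, unequal_weights)
r_matrix = ratio_mean_matrix_form(y_means, x_means, unequal_weights)
r_self = ratio_mean(y_means, x_means, self_weights)
print(r_weighted, r_matrix, r_self)
```

The two forms agree up to rounding error. Under the self-weighting design the weights drop out, so the estimator reduces to the ratio of the unweighted sums.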



2.3 Extension of Ratio Mean Regression for Discrete Outcomes from Cluster Sampling


Motivated by the structure of the dataset and by the importance in epidemiologic studies of quantities such as the incidence of an event during a follow-up period and the prevalence of an illness, other definitions for a ratio mean are proposed, and the technique developed by Snyder (1993) and presented in the previous section is extended to analyse bivariate discrete responses. Since the quantities of interest are ratios of sums of variables, the method of ratio mean regression is a way of estimating such measures while adjusting them for the explanatory variables or risk factors.

The methodology for the ratio of discrete random variables is summarized below. The dataset comes from a research study carried out by the Hydraulic and Sanitation Department of the Federal University of Bahia. The study, named AISAM ("Avaliação do Impacto de Medidas de Saneamento Ambiental em Áreas Pauperizadas de Salvador", i.e., Evaluation of the Impact of Environmental Sanitation Measures in Very Poor Urban Areas of the City of Salvador), was conducted from August 1989 to December 1990. The objective was to evaluate the impact of sanitation actions on the health of the population living on the outskirts of Salvador.




The epidemiologic measure of interest in this ratio mean analysis is the incidence density, which can be represented as a ratio of sums of random variables. The cluster sampling design includes stratification and is a two-stage procedure. Clusters are selected by stratified simple random sampling with replacement at stage one; elements are then selected without replacement from the selected clusters at stage two, and all subjects with the characteristics of interest are included.

Ratios of means are calculated for subgroups of observations defined by the cross-classification of levels of characteristics of the clusters (strata represented by groups of communities with different sanitation conditions), of the elements (households), and of the subjects (children) simultaneously. Since there is an adequate number of observations in each subpopulation of interest, (log) normality of the ratio mean estimators is assumed.


So, let

$h = 1, 2, \ldots, H$ index the levels of the cluster characteristics, i.e., the levels of sanitation conditions of the communities (three levels, i.e., three types of sanitation conditions);

$k = 1, 2, \ldots, K$ index the characteristics of the household;

$l = 1, 2, \ldots, L$ index the levels of the characteristics of the child;

$i = 1, 2, \ldots, N_h$ index the sampled clusters (communities) in the $h$-th stratum (each of the three strata was formed by three communities);

$j = 1, 2, \ldots, n_{hi}$ index the selected households in the $i$-th community of the $h$-th stratum;

$t = 1, 2, \ldots, v_{hij}$ index the children of the $j$-th household in the $i$-th community of the $h$-th stratum.



$N_h$ is the number of selected communities from the $h$-th stratum, $n_{hi}$ is the number of households in the $i$-th community of the $h$-th stratum, and $v_{hij}$ is the number of children in the $j$-th household in the $i$-th community of the $h$-th stratum. $H$ is the number of sanitation condition classes for the communities, $K$ is the number of characteristics for the households and $L$ is the number of potential levels of characteristics of the child.


Define

$Y_{hikjlt2}$ as the number of episodes of diarrhea for the $t$-th child with characteristic level $l$ of the $j$-th household of type $k$ in the $i$-th community from stratum $h$;

$X_{hikjlt2}$ as the total number of fourteen-day periods observed for the $t$-th child with characteristic level $l$ of the $j$-th household of type $k$ in the $i$-th community in stratum $h$.


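Let

$$\mathbf{f}_{hikl2} = \frac{1}{n_{hi}} \sum_{j=1}^{n_{hi}} \sum_{t=1}^{v_{hij}} \left( Y_{hikjlt2},\; 14\,X_{hikjlt2} \right)',$$

$$\mathbf{f}_{hi2} = \left( \mathbf{f}'_{hi112}, \mathbf{f}'_{hi122}, \ldots, \mathbf{f}'_{hiKL2} \right)',$$

$$\mathbf{f}_{hi2w} = w_{hi}\,\mathbf{f}_{hi2} \quad \text{and} \quad \bar{\mathbf{f}}_{h2w} = \frac{1}{N_h} \sum_{i=1}^{N_h} \mathbf{f}_{hi2w},$$

where $w_{hi} = \nu_{hi} / (\pi_{hi}\,\nu_+)$.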


It is important to note that, in this context, the sampling weight takes into account not only the selected cluster but also the stratum from which it was selected. Nevertheless, it is easy to show that its form remains the same, the argument being that the strata are equally represented, since the selection of clusters within strata was uniform.




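A consistent estimator for the variance of $\bar{\mathbf{f}}_{h2w}$ is given by

$$\mathbf{V}_{\bar{\mathbf{f}}_{h2w}} = \frac{1}{N_h (N_h - 1)} \sum_{i=1}^{N_h} \left( \mathbf{f}_{hi2w} - \bar{\mathbf{f}}_{h2w} \right) \left( \mathbf{f}_{hi2w} - \bar{\mathbf{f}}_{h2w} \right)'. \qquad (2.6)$$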


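Let also

$$\bar{\mathbf{f}}_{2W} = \left( \bar{\mathbf{f}}'_{12w}, \ldots, \bar{\mathbf{f}}'_{H2w} \right)' \qquad (2.7)$$

and

$$\mathbf{V}_{\bar{\mathbf{f}}_{2W}} = \mathrm{BLOCK}\left[ \mathbf{V}_{\bar{\mathbf{f}}_{h2w}} \right],$$

the block-diagonal matrix whose $h$-th diagonal block is $\mathbf{V}_{\bar{\mathbf{f}}_{h2w}}$.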


So, the vector of ratio means resulting from the cross-classification of the levels of clusters, households and children may be written as

$$\mathbf{R} = \exp\left( \mathbf{A} \ln \bar{\mathbf{f}}_{2W} \right), \qquad (2.8)$$

where $\mathbf{A} = \mathbf{I}_{HKL} \otimes [1 \;\; -1]$, $\mathbf{I}_{HKL}$ is the $HKL \times HKL$ identity matrix and $\otimes$ is the Kronecker product.

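Using a Taylor series approximation, the estimated variance of $\mathbf{R}$ is given by

$$\mathbf{V}_{\mathbf{R}} = \mathbf{H}\,\mathbf{V}_{\bar{\mathbf{f}}_{2W}}\,\mathbf{H}', \quad \text{where } \mathbf{H} = \mathbf{D}_{\mathbf{R}}\,\mathbf{A}\,\mathbf{D}^{-1}_{\bar{\mathbf{f}}_{2W}} \qquad (2.9)$$

and $\mathbf{D}_{\mathbf{x}}$ denotes the diagonal matrix with the elements of the vector $\mathbf{x}$ on its diagonal.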
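For a single subpopulation, the variance computation in (2.6) and (2.9) reduces to a few lines of arithmetic. The sketch below assumes one cell only, so that the linearization matrix has the two entries $R / \bar{Y}$ and $-R / \bar{X}$; the weighted cluster vectors and all names are hypothetical.

```python
def mean_vector(pairs):
    """Mean of the weighted cluster vectors (Y, X) over the sampled clusters."""
    n = len(pairs)
    return (sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n)


def covariance_of_mean(pairs):
    """Between-cluster estimate of the covariance of the mean vector, as in (2.6)."""
    n = len(pairs)
    y_bar, x_bar = mean_vector(pairs)
    scale = 1.0 / (n * (n - 1))
    v_yy = scale * sum((y - y_bar) ** 2 for y, _ in pairs)
    v_xx = scale * sum((x - x_bar) ** 2 for _, x in pairs)
    v_yx = scale * sum((y - y_bar) * (x - x_bar) for y, x in pairs)
    return [[v_yy, v_yx], [v_yx, v_xx]]


def ratio_and_variance(pairs):
    """Ratio mean and its linearized variance H V H' for a single cell."""
    y_bar, x_bar = mean_vector(pairs)
    v = covariance_of_mean(pairs)
    r = y_bar / x_bar
    # Single-cell version of H = D_R A D_f^{-1} with A = [1, -1].
    h = (r / y_bar, -r / x_bar)
    var_r = (h[0] * h[0] * v[0][0]
             + 2.0 * h[0] * h[1] * v[0][1]
             + h[1] * h[1] * v[1][1])
    return r, var_r


# Hypothetical weighted cluster vectors (events, observation time) for one cell.
cluster_vectors = [(2.0, 10.0), (3.0, 12.0), (1.0, 8.0)]
r_hat, var_r_hat = ratio_and_variance(cluster_vectors)
print(r_hat, var_r_hat)
```

Because numerator and denominator are positively correlated across clusters in this example, the covariance term reduces the variance of the ratio relative to treating them as independent.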



2.4 Ratio Mean Regression


In order to study simultaneously the effects of the levels of the characteristics of the clusters, households and children using ratio means, a ratio mean regression model is proposed. Interactions between two or more of the characteristics are also considered.

Regression analysis of the ratio means (or of functions of ratio means) is possible using the methodology of Grizzle, Starmer and Koch (1969) (GSK) and extensions of it, as discussed in Koch, Imrey et al. (1985). This method is presented in the next section.



2.4.1 A Linear Model for R


Let $\mathbf{R} = (R_{111}, \ldots, R_{HKL})'$ be a vector of ratio means for subpopulations defined by the cross-classification of the levels of the characteristics for $H$ strata, $K$ elements and $L$ subjects. A linear model for $\mathbf{R}$ is given by:

$$E_A(\mathbf{R}) = \mathbf{X}\boldsymbol{\beta} \qquad (2.10)$$


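where $E_A(\cdot)$ denotes the asymptotic expected value, $\mathbf{X}$ is an $HKL \times p$ design matrix and $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters to be estimated. The covariance matrix of $\mathbf{R}$ may be consistently estimated by $\mathbf{V}_{\mathbf{R}}$, as in (2.9). The weighted least squares estimator $\mathbf{b}$ of $\boldsymbol{\beta}$ is given by:

$$\mathbf{b} = \left( \mathbf{X}' \mathbf{V}_{\mathbf{R}}^{-1} \mathbf{X} \right)^{-1} \mathbf{X}' \mathbf{V}_{\mathbf{R}}^{-1} \mathbf{R}. \qquad (2.11)$$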

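In large samples, the estimator $\mathbf{b}$ is approximately normally distributed with

$$E_A(\mathbf{b}) = \boldsymbol{\beta} \qquad (2.12)$$

and a consistent estimator for the corresponding covariance matrix is

$$\mathbf{V}_{\mathbf{b}} = \left( \mathbf{X}' \mathbf{V}_{\mathbf{R}}^{-1} \mathbf{X} \right)^{-1}. \qquad (2.13)$$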

The goodness of fit of the model in (2.10) can be evaluated with the residual weighted least squares statistic

$$Q = \left( \mathbf{R} - \mathbf{X}\mathbf{b} \right)' \mathbf{V}_{\mathbf{R}}^{-1} \left( \mathbf{R} - \mathbf{X}\mathbf{b} \right). \qquad (2.14)$$

If the model in (2.10) applies, $Q$ has approximately a chi-squared distribution with $(HKL - p)$ degrees of freedom, assuming that all $HKL$ subgroups have at least moderately large sample sizes.
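The estimator (2.11), its covariance (2.13) and the residual statistic (2.14) can be sketched with the standard library alone. The example below is illustrative only: the four ratio means, their diagonal covariance matrix and the main-effects design matrix are hypothetical, and the small Gauss-Jordan inverse is a convenience for the sketch, not a recommendation for production use.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def wls_fit(r, x, v_r):
    """Return (b, V_b, Q) following (2.11), (2.13) and (2.14)."""
    r_col = [[r_i] for r_i in r]
    v_inv = inverse(v_r)
    xt_v_inv = matmul(transpose(x), v_inv)
    v_b = inverse(matmul(xt_v_inv, x))
    b = matmul(v_b, matmul(xt_v_inv, r_col))
    fitted = matmul(x, b)
    resid = [[r_i - f_i[0]] for r_i, f_i in zip(r, fitted)]
    q = matmul(matmul(transpose(resid), v_inv), resid)[0][0]
    return [b_j[0] for b_j in b], v_b, q


# Hypothetical 2 x 2 layout: intercept plus two main effects, four ratio means.
design = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
ratios = [0.10, 0.20, 0.15, 0.25]
cov_ratios = [[0.0004, 0, 0, 0], [0, 0.0009, 0, 0],
              [0, 0, 0.0004, 0], [0, 0, 0, 0.0016]]

b_hat, v_b_hat, q_stat = wls_fit(ratios, design, cov_ratios)
degrees_of_freedom = len(ratios) - len(design[0])
print(b_hat, q_stat, degrees_of_freedom)
```

The hypothetical ratios were chosen to be exactly additive, so the fitted coefficients reproduce them and $Q$ is zero up to rounding error; with real data, $Q$ would be compared with a chi-squared distribution on the residual degrees of freedom.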



2.4.2 A Linear Model for log(R)


Let $\mathbf{F} = (F_{111}, \ldots, F_{HKL})' = \log \mathbf{R}$ be a vector of the logs of the ratio means corresponding to the subpopulations defined by the classification of the $H$ types of strata, $K$ types of households and $L$ types of characteristics of the children. An estimate of the covariance matrix of $\mathbf{F}$, based on a Taylor series approximation, is given by:

$$\mathbf{V}_{\mathbf{F}} = \mathbf{D}_{\mathbf{R}}^{-1}\,\mathbf{V}_{\mathbf{R}}\,\mathbf{D}_{\mathbf{R}}^{-1} \qquad (2.15)$$

where $\mathbf{D}_{\mathbf{R}}$ is an $HKL \times HKL$ matrix with the elements of the vector $\mathbf{R}$ on the diagonal and $\mathbf{V}_{\mathbf{R}}$ is an estimate of the covariance matrix of $\mathbf{R}$ given by (2.9). The function vector $\mathbf{F}$ may be modelled using the GSK methodology by directly forming $\mathbf{F}$ and its covariance matrix.


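Depending on the research question of interest, a different vector of functions of the ratios may be proposed; for example, interest may be in modelling the logit of $\mathbf{R}$. In this case $\mathbf{F} = (F_{111}, \ldots, F_{HKL})' = \operatorname{logit} \mathbf{R} = \ln(\mathbf{R}) - \ln(\mathbf{1} - \mathbf{R})$ and the estimate of the covariance matrix is given by

$$\mathbf{V}_{\mathbf{F}} = \left[ \mathbf{D}_{\mathbf{R}}^{-1} + \mathbf{D}_{\mathbf{1}-\mathbf{R}}^{-1} \right] \mathbf{V}_{\mathbf{R}} \left[ \mathbf{D}_{\mathbf{R}}^{-1} + \mathbf{D}_{\mathbf{1}-\mathbf{R}}^{-1} \right].$$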
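Both transformations amount to pre- and post-multiplying the covariance matrix of the ratios by a diagonal matrix of derivatives. A minimal sketch, with hypothetical ratios and covariance values and illustrative function names, is given below; the logit version assumes every ratio lies strictly between 0 and 1.

```python
import math


def scale_covariance(v_r, d):
    """Return D V D, where D is the diagonal matrix with diagonal d."""
    size = len(d)
    return [[d[i] * v_r[i][j] * d[j] for j in range(size)] for i in range(size)]


def log_transform(r, v_r):
    """F = log(R) with covariance D_R^{-1} V_R D_R^{-1}, as in (2.15)."""
    f = [math.log(r_i) for r_i in r]
    d = [1.0 / r_i for r_i in r]
    return f, scale_covariance(v_r, d)


def logit_transform(r, v_r):
    """F = ln(R) - ln(1 - R) with the corresponding linearized covariance."""
    f = [math.log(r_i) - math.log(1.0 - r_i) for r_i in r]
    d = [1.0 / r_i + 1.0 / (1.0 - r_i) for r_i in r]
    return f, scale_covariance(v_r, d)


# Hypothetical ratio means and their estimated covariance matrix.
ratios = [0.2, 0.5]
cov_ratios = [[0.0004, 0.0001], [0.0001, 0.0025]]

f_log, v_log = log_transform(ratios, cov_ratios)
f_logit, v_logit = logit_transform(ratios, cov_ratios)
print(v_log, v_logit)
```

The transformed vector and its covariance matrix can then be passed to the same weighted least squares step used for the untransformed ratios.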




3. EXAMPLE


The AISAM project involved 9 communities and was carried out on the outskirts of Salvador. The communities were stratified according to sanitation conditions; a random selection of communities was then obtained from each stratum and, at the second sampling stage, households were sampled from the communities. All children less than five years of age living in the selected households were followed for diarrhea episodes.


This was a longitudinal study with 1162 children. The objective was to collect daily diarrhea information over fourteen-day periods. Criteria were established for defining the severity of diarrhea episodes.


The vector of ratios (incidences of moderate and/or severe episodes of diarrhea) was constructed according to the cross-classification of characteristics of the communities, households and children. A first regression model for the ratios was fitted by WLS, and then a second, simplified model was tried. The second model fitted the data adequately (chi-square = 5.55 with 8 degrees of freedom, p-value = 0.6980). Results showed that children from an area with no sanitation intervention had 2.5 times the risk of moderate and/or severe episodes of diarrhea compared with children from the area with good basic sanitation intervention. Children whose mothers had up to 5 years of schooling had 1.9 times the risk of moderate and/or severe episodes of diarrhea compared with children whose mothers had more than 5 years of schooling.
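The reported goodness-of-fit p-value can be checked from the chi-square statistic and its degrees of freedom. The sketch below uses the closed-form upper-tail probability that holds for an even number of degrees of freedom; the function name is illustrative, and the only inputs taken from the text are the statistic 5.55 and the 8 degrees of freedom.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-squared probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum of (x/2)^k / k! for k = 0, ..., df/2 - 1.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi2_upper_tail_even_df(5.55, 8)
print(round(p_value, 4))
```

The result is about 0.697, which agrees with the reported 0.6980 to within what rounding of the statistic to two decimals would explain.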


Note to the reader: This is a preliminary version of the article.




4. REFERENCES


Davies, C. S. (1994). Applications of sample methodology to repeated measures data structures in dentistry. Unpublished doctoral dissertation, Department of Biostatistics, University of North Carolina, Chapel Hill.

Freeman, D. H., Freeman, J. L., Brock, D. B. and Koch, G. G. (1976). Strategies in the multivariate analysis of data from complex surveys II: An application to the United States National Health Interview Survey. International Statistical Review, 44, 317-330.

Grizzle, J. E., Starmer, C. F. and Koch, G. G. (1969). Analysis of categorical data by linear models. Biometrics, 25, 489-504.

Koch, G. G., Freeman, D. H. and Freeman, J. L. (1975). Strategies in the multivariate analysis of data from complex surveys. International Statistical Review, 43, 59-78.

Koch, G. G., Imrey, P. B., Singer, J. M., Atkinson, S. S. and Stokes, M. E. (1985). Analysis of categorical data. In: Collection Séminaire de Mathématiques Supérieures 96, G. Sabidussi (ed.). Montréal: Les Presses de l'Université de Montréal.

Landis, R., Lepkowski, J. M., Davis, C. S. and Miller, M. (1987). Cumulative logit models for weighted data from complex sample surveys. Proceedings of the Social Statistics Section of the American Statistical Association, 165-170.

Lavange, L. M., Keys, L. L., Koch, G. G. and Margolis, P. A. (1994). Application of sample survey methods for modelling ratios to incidence densities. Statistics in Medicine, 13, 343-355.

Snyder, E. S. (1993). The analysis of binary data with large, unbalanced, and incomplete clusters using ratio means weight regression methods. Unpublished doctoral dissertation, Department of Biostatistics, University of North Carolina, Chapel Hill.

Stanish, W. M., Gillings, D. B. and Koch, G. G. (1978). An application of multivariate ratio methods for the analysis of a longitudinal clinical trial with missing data. Biometrics, 34, 305-317.