Methods and software for editing and imputation: recent advancements at Istat
M. Di Zio, U. Guarnera, O. Luzi, A. Manzari
ISTAT – Italian Statistical Institute
UN/ECE Work Session on Statistical Data Editing
Ottawa, 16–18 May 2005
Outline
• Introduction
• Editing: Finite Mixture Models for continuous data
• Imputation: Bayesian Networks for categorical data
• Imputation: the Quis system for continuous data
• E&I: Data Clustering for improving the search for donors in the Diesis system
Recent advancements at Istat
In
order
to
reduce
waste
of
resources
and
to
disseminate
best
practices
,
efforts
were
addressed
in
two
directions
:
–
identifying methodological solutions for
some common types of errors
–
providing survey practitioners with
generalized
tools
in order to
facilitate the
adoption of new methods and increase the
processes standardization
Editing
Identifying systematic unit of measure errors (UME)
A UME occurs when the “true” value of a variable Xj is reported in a wrong scale (e.g. Xj·C, with C = 100, C = 1,000, and so on)
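The key property exploited later in the mixture model is that a multiplicative scale error becomes an additive shift in log scale. A minimal sketch, with hypothetical figures:

```python
import math

# A UME multiplies the "true" value of Xj by a constant C (hypothetical figures).
true_value = 1234.5
C = 100
reported = true_value * C

# In log scale the multiplicative error becomes an additive shift of log(C),
# which is why the mixture model is fitted to log-transformed data.
shift = math.log(reported) - math.log(true_value)
print(math.isclose(shift, math.log(C)))  # → True
```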
Finite Mixture Models of Normal Distributions
Probabilistic clustering based on the assumption that observations come from a mixture of a finite number of populations or groups Gg in various proportions pg
Given some parametric form for the density function in each group, maximum likelihood estimates can be obtained for the unknown parameters
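The maximum likelihood estimation referred to above is typically carried out with the EM algorithm. The following is a minimal univariate two-component sketch (the Istat application is multivariate; the data, initialisation, and iteration count here are illustrative assumptions):

```python
import math

def em_gmm_1d(data, iters=50):
    """Minimal EM sketch for a two-component 1-D normal mixture: given a
    parametric form for each group's density, maximum likelihood estimates
    of the proportions p_g, means and variances are obtained iteratively."""
    # Deterministic initialisation at the data extremes (an assumption
    # that works for well-separated groups).
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    p = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each observation belongs to each group
        resp = []
        for y in data:
            w = [p[g] / math.sqrt(2 * math.pi * var[g]) *
                 math.exp(-(y - mu[g]) ** 2 / (2 * var[g])) for g in (0, 1)]
            s = w[0] + w[1]
            resp.append([w[0] / s, w[1] / s])
        # M-step: update mixing proportions, means and variances
        for g in (0, 1):
            ng = sum(r[g] for r in resp)
            p[g] = ng / len(data)
            mu[g] = sum(r[g] * y for r, y in zip(resp, data)) / ng
            var[g] = sum(r[g] * (y - mu[g]) ** 2
                         for r, y in zip(resp, data)) / ng + 1e-9
    return p, mu, var

# Two well-separated groups: the fitted means recover the group centres.
data = [0.0, 0.1, -0.1, 0.05, 10.0, 10.1, 9.9, 10.05]
p, mu, var = em_gmm_1d(data)
print(round(mu[0], 2), round(mu[1], 2))  # → 0.01 10.01
```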
Finite Mixture Models for UME
Given q variables X1,..,Xq, the h = 2^q possible clusters (mixture components) correspond to groups of units with different subsets of items affected by UME (error patterns)
Assuming that valid data are normally distributed and using a log scale, each cluster is characterized by a p.d.f. fg(y; θg) = MN(μg, Σ), where μg is translated by a known vector and Σ is constant for all clusters
Units are assigned to clusters based on their posterior probability τg(yi; θ, π)
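The posterior assignment step can be sketched as follows for a single log-scale variable, where the "error" component mean is the valid-data mean shifted by the known vector log(C). The numeric values (means, standard deviation, proportions) are illustrative assumptions:

```python
import math

def posterior_cluster(y, mus, sigma, props):
    """Posterior probability tau_g(y) of each mixture component for one unit,
    and the resulting cluster assignment. In the UME model the component
    means differ only by known shifts such as log(100)."""
    dens = [p * math.exp(-(y - m) ** 2 / (2 * sigma ** 2))
            for m, p in zip(mus, props)]
    s = sum(dens)
    tau = [d / s for d in dens]
    return tau, tau.index(max(tau))

# Hypothetical log-scale data: correct reports centred at 7; reports inflated
# by a factor 100 appear shifted by log(100) ~ 4.605.
mu_ok = 7.0
mus = [mu_ok, mu_ok + math.log(100)]
tau, g = posterior_cluster(11.5, mus, sigma=0.5, props=[0.9, 0.1])
print(g)  # → 1 (the unit is assigned to the "error" cluster)
```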
Model diagnostics used to prioritise units for manual checking
Atypicality Index: allows identifying outliers w.r.t. the defined model (e.g. units possibly affected by errors other than the UME)
Classification probabilities τg(yi; θ, π): allow identifying possibly misclassified units. They can be used directly to identify misclassifications that are possibly influential on target estimates (significance editing)
Main findings
Finite Mixture Modelling allows multivariate, non-hierarchical data analyses. Costs for developing ad hoc procedures are saved
Finite Mixture Modelling produces highly reliable automatic data clustering/error localization
Model diagnostics can be used to reduce costs due to manual editing
The approach is robust to moderate departures from normality
The number of model parameters is limited by the model constraints on μ and Σ
Imputation
Bayesian Networks for categorical variables
The first idea of using BNs for imputation is due to Thibaudeau and Winkler (2002)
• Let C1,…,Cn be a set of categorical variables, each having a finite set of mutually exclusive states
• BNs allow representing the joint distribution of the variables graphically and numerically:
– a BN can be viewed as a Directed Acyclic Graph, and
– an inferential engine that allows performing inferences on the distribution parameters
Graphical representation of BNs
To each variable Cj with parents Pa(Cj) there is attached a conditional probability P(Cj | Pa(Cj))
BNs allow factorizing the joint probability distribution P(C1,...,Cn) so that
P(C1,…,Cn) = Π j=1,…,n P(Cj | Pa(Cj))
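The factorization above can be made concrete on a small hypothetical network, a three-node chain C1 → C2 → C3 with binary states (the probability values are invented for illustration):

```python
# BN factorization P(C1,...,Cn) = prod_j P(Cj | Pa(Cj)) on a hypothetical
# chain C1 -> C2 -> C3: each node's factor conditions only on its parents.
p_c1 = {0: 0.6, 1: 0.4}
p_c2_given_c1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c3_given_c2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(c1, c2, c3):
    # Product of one conditional probability table entry per node.
    return p_c1[c1] * p_c2_given_c1[c1][c2] * p_c3_given_c2[c2][c3]

# The factorization yields a proper distribution: the 2^3 terms sum to one.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # → 1.0
```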
BNs and imputation: method 1
1. Order variables according to their “reliability”
2. Estimate the network conditioned on this order
3. Estimate the conditional probabilities for each node according to (2)
4. Impute each missing item by a random draw from its conditional prob. distribution
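Step 4 above can be sketched as a weighted random draw from the node's conditional probability table given its parents. The table structure and values here are hypothetical, not the Istat implementation:

```python
import random

def impute_from_conditional(parent_values, cpt, rng):
    """Method 1, step 4 as a sketch: impute a missing item by a random draw
    from its conditional distribution given its (observed or already
    imputed) parents. `cpt` maps a parent configuration to a
    {state: probability} dict (hypothetical structure)."""
    dist = cpt[parent_values]
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

# Hypothetical CPT for a binary item with one binary parent.
cpt = {(0,): {"a": 0.9, "b": 0.1}, (1,): {"a": 0.2, "b": 0.8}}
rng = random.Random(42)
imputed = impute_from_conditional((0,), cpt, rng)
print(imputed in ("a", "b"))  # → True
```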
BNs and imputation: methods 2/3
In a multivariate context it is more convenient to use not only the information coming from the parents, but also that coming from the children. This can be done by using the Markov Blanket (Mb):
Mb(X) = Pa(X) + Ch(X) + Pa(X’s children)
In this case, for each node the conditional probabilities are estimated w.r.t. its Mb
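The Markov blanket definition above translates directly into code. A minimal sketch on a hypothetical four-node DAG, representing the graph as a node-to-parents map:

```python
def markov_blanket(node, parents):
    """Markov blanket of a node in a DAG given a node -> parents map:
    Mb(X) = Pa(X) + Ch(X) + Pa(children of X), excluding X itself."""
    mb = set(parents.get(node, ()))
    for child, ps in parents.items():
        if node in ps:
            mb.add(child)   # children of X
            mb.update(ps)   # co-parents of those children
    mb.discard(node)
    return mb

# Hypothetical DAG: A -> X, X -> B, C -> B
parents = {"X": {"A"}, "B": {"X", "C"}, "A": set(), "C": set()}
print(sorted(markov_blanket("X", parents)))  # → ['A', 'B', 'C']
```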
Main findings
BNs allow expressing the joint probability distributions with a dramatic decrease in the number of parameters to be estimated (reduction of complexity)
BNs may estimate the relationships between variables that are really informative for predicting values
Parametric models like BNs are efficient in terms of preservation of joint distributions
The graphical representation facilitates modelling
BNs and hot deck methods have the same behaviour only when the hot deck is stratified according to variables explaining exactly the missingness mechanism
Imputation
The Quis system for continuous variables
Quis (QUick Imputation System) is a generalized SAS tool developed at Istat to impute continuous survey data in a unified environment
Given a set of variables subject to non-response, different methods can be used in a completely integrated way:
Regression Imputation via the EM algorithm
Nearest Neighbour Donor Imputation (NND)
Multivariate Predictive Mean Matching (PMM)
Regression imputation via EM
In the context of imputation, the EM algorithm is used to obtain Maximum Likelihood estimates of the parameters of the model assumed for the data in the presence of missing values
Assumptions:
MAR mechanism
Normality
Regression imputation via EM
Once ML estimates of the parameters have been obtained, missing data can be imputed in two different ways:
directly, through the expectations of missing values conditional on observed ones (predictive means)
by adding a normal random residual to the predictive means (i.e. drawing values from the conditional distributions of the missing values)
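The two imputation modes can be sketched for a single missing variable regressed on one observed covariate. The regression coefficients and residual standard deviation are assumed already estimated by EM (the values here are hypothetical):

```python
import random

def impute_regression(x, beta0, beta1, sigma, stochastic, rng=None):
    """Sketch of the two modes: the predictive mean beta0 + beta1 * x,
    optionally plus a normal random residual so that the imputed value is
    a draw from the conditional distribution of the missing item."""
    mean = beta0 + beta1 * x
    if not stochastic:
        return mean
    return mean + rng.gauss(0.0, sigma)

rng = random.Random(1)
det = impute_regression(10.0, beta0=2.0, beta1=0.5, sigma=1.0, stochastic=False)
sto = impute_regression(10.0, beta0=2.0, beta1=0.5, sigma=1.0,
                        stochastic=True, rng=rng)
print(det)  # → 7.0
```

The deterministic mode minimises prediction error for each record, while the stochastic mode better preserves the distribution of the imputed variable.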
Multivariate Predictive Mean Matching (PMM)
Let Y = (Y1,...,Yq) be a set of variables subject to non-response
ML estimates of the parameters θ of the joint distribution of Y are derived via EM
For each pattern of missing data ymiss, the parameters of the corresponding conditional distribution are estimated starting from θ (sweep operator)
For each unit ui the predictive mean based on the estimated parameters is computed
For each unit with missing data, imputation is done using the nearest donor w.r.t. the predictive mean
The Mahalanobis distance is adopted to find donors
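The matching step can be sketched as follows: given predictive means for the recipient and for the complete units (donors), pick the donor closest in Mahalanobis distance. The inverse covariance matrix and the predictive-mean vectors below are hypothetical placeholders:

```python
def mahalanobis2(u, v, inv_cov):
    """Squared Mahalanobis distance between two predictive-mean vectors."""
    d = [a - b for a, b in zip(u, v)]
    n = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j]
               for i in range(n) for j in range(n))

def nearest_donor(recipient_mean, donor_means, inv_cov):
    """PMM matching step as a sketch: among complete units, find the one
    whose predictive mean is closest to the recipient's predictive mean."""
    return min(range(len(donor_means)),
               key=lambda i: mahalanobis2(recipient_mean, donor_means[i],
                                          inv_cov))

# Hypothetical 2-d predictive means; identity inverse covariance for brevity.
inv_cov = [[1.0, 0.0], [0.0, 1.0]]
donors = [[1.0, 1.0], [5.0, 5.0], [2.0, 2.1]]
print(nearest_donor([2.0, 2.0], donors, inv_cov))  # → 2
```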
Data clustering for improving the search for donors in the Diesis system
• The DIESIS system has been developed at ISTAT for treating the demographic variables of the 2001 Population Census
• Diesis uses both the data driven and the minimum change approaches for editing and imputation
• For each failed household, the set of potential donors contains only the nearest passed households
• The adopted distance function is a weighted sum of the distances for each demographic variable over all the individuals within the household
The currently used approach for donor search
• For each failed household e, the identification of potential donors should be made by searching within the set of all passed households D
• When D is very large, as in the case of a Census, computing the distance between each e and all d ∈ D (exhaustive search) could require unacceptable computational time
• The currently used sub-optimal search consists in stopping the search before examining the entire set D, according to some stopping criteria. This solution does not guarantee the selection of the potential donors having the actual minimum distance from e
The new approach for donor search
• In order to reduce the number of passed households to examine, the set of passed households D is preliminarily divided into smaller homogeneous subsets {D1, …, Dn} (D1 ∪ … ∪ Dn = D)
• Such a subdivision is obtained by solving an unsupervised clustering problem (donor search guided by clustering)
• The search for the potential donors is then conducted, for each failed household e, by examining only the households within the cluster(s) most similar to e
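The steps above can be sketched as follows, assuming households have already been encoded as numeric vectors and the clusters (with their centroids) computed in the preliminary unsupervised step; all values here are hypothetical:

```python
def assign_cluster(x, centroids):
    """Pick the cluster whose centroid is closest to household x."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])))

def guided_donor_search(e, clusters, centroids):
    """Donor search guided by clustering, as a sketch: instead of scanning
    all passed households D, examine only the cluster most similar to the
    failed household e, and return the nearest donor within it."""
    k = assign_cluster(e, centroids)
    return min(clusters[k],
               key=lambda d: sum((a - b) ** 2 for a, b in zip(e, d)))

# Two hypothetical clusters of passed households with their centroids.
clusters = [[[1.0, 1.0], [1.2, 0.9]], [[8.0, 8.0], [7.5, 8.2]]]
centroids = [[1.1, 0.95], [7.75, 8.1]]
print(guided_donor_search([7.9, 8.1], clusters, centroids))  # → [8.0, 8.0]
```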
Main findings
The donor search guided by clustering reduces computational times while preserving the E&I quality obtained by the exhaustive search
The donor search guided by clustering increases the proportion of actual minimum-distance donors selected with respect to the sub-optimal search (this is especially useful for households with an uncommon structure, for which few passed households are generally available)