Methods and software for editing and imputation: recent advancements at Istat

M. Di Zio, U. Guarnera, O. Luzi, A. Manzari

ISTAT (Italian Statistical Institute)

UN/ECE Work Session on Statistical Data Editing

Ottawa, 16-18 May 2005

Outline

- Introduction
- Editing: Finite Mixture Models for continuous data
- Imputation: Bayesian Networks for categorical data
- Imputation: Quis system for continuous data
- E&I: Data Clustering for improving the search of donors in the Diesis system

Recent advancements at Istat

In order to reduce waste of resources and to disseminate best practices, efforts were directed in two directions:

- identifying methodological solutions for some common types of errors
- providing survey practitioners with generalized tools, in order to facilitate the adoption of new methods and increase process standardization

Editing

Identifying systematic unity measure errors (UME)

A UME occurs when the "true" value of a variable Xj is reported in a wrong scale (e.g. Xj · C, with C = 100, C = 1,000, and so on)


Finite Mixture Models of Normal Distributions

- Probabilistic clustering based on the assumption that observations come from a mixture of a finite number of populations or groups Gg, in various proportions pg
- Given a parametric form for the density function in each group, maximum likelihood estimates can be obtained for the unknown parameters


Finite Mixture Models for UME

- Given q variables X1, ..., Xq, the h = 2^q possible clusters (mixture components) correspond to groups of units with different subsets of items affected by UME (error patterns)
- Assuming that valid data are normally distributed and using a log scale, each cluster is characterized by a p.d.f. fg(y; θg) ~ MN(μg, Σ), where μg is translated by a known vector and Σ is constant for all clusters


- Units are assigned to clusters based on their posterior probability τg(yi; θ, π)


- Model diagnostics are used to prioritise units for manual checking:

  - Atypicality Index: allows identifying outliers w.r.t. the defined model (e.g. units possibly affected by errors other than the UME)
  - Classification probabilities τg(yi; θ, π) allow identifying possibly misclassified units. They can be directly used to identify misclassifications that are possibly influential on target estimates (significance editing)
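The model above can be sketched with a small EM routine. This is a minimal illustration, not the Istat implementation: it assumes a single variable, a single known scale factor C = 100 (so the mean translation in log10 scale is 2), and a variance shared by the two clusters; all names and the simulated data are invented.

```python
import math
import random

def fit_ume_mixture(logs, shift, n_iter=100):
    """Two-component normal mixture in log10 scale.
    Component 0: correctly reported values ~ N(mu, s2)
    Component 1: values with a UME        ~ N(mu + shift, s2)
    The translation `shift` = log10(C) is known and the variance s2 is
    common to the clusters, as in the model sketched above.
    Returns (mu, s2, p_err, posteriors)."""
    n = len(logs)
    mu = sum(logs) / n
    s2 = sum((x - mu) ** 2 for x in logs) / n
    p_err = 0.5

    def phi(x, m):
        return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

    for _ in range(n_iter):
        # E-step: posterior probability that each unit carries the UME
        post = [p_err * phi(x, mu + shift)
                / ((1 - p_err) * phi(x, mu) + p_err * phi(x, mu + shift))
                for x in logs]
        # M-step: update mixing proportion, common mean and variance
        p_err = sum(post) / n
        mu = sum(x - t * shift for x, t in zip(logs, post)) / n
        s2 = sum((1 - t) * (x - mu) ** 2 + t * (x - mu - shift) ** 2
                 for x, t in zip(logs, post)) / n
    return mu, s2, p_err, post

# Simulated example: 200 units, the first 40 reported 100 times too large
random.seed(0)
logs = [random.gauss(3.0, 0.2) + (2.0 if i < 40 else 0.0) for i in range(200)]
mu, s2, p_err, post = fit_ume_mixture(logs, shift=2.0)
flagged = [i for i, t in enumerate(post) if t > 0.5]
```

Units with a high posterior for the error component would be corrected by dividing by C, while borderline posteriors could be routed to manual review, in the spirit of the significance-editing idea above.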


Main findings

- Finite Mixture Modelling allows multivariate, non-hierarchical data analyses. Costs for developing ad hoc procedures are saved
- Finite Mixture Modelling produces highly reliable automatic data clustering/error localization
- Model diagnostics can be used to reduce the costs due to manual editing
- The approach is robust to moderate departures from normality
- The number of model parameters is limited by the model constraints on μ and Σ

Imputation

Bayesian Networks for categorical variables

- The first idea of using BNs for imputation is due to Thibaudeau and Winkler (2002)
- Let C1, ..., Cj be a set of categorical variables, each having a finite set of mutually exclusive states
- BNs allow representing, graphically and numerically, the joint distribution of the variables:

  - a BN can be viewed as a Directed Acyclic Graph, and
  - an inferential engine that allows performing inferences on the distribution parameters

Graphical representation of BNs

- To each variable Cj with parents Pa(Cj) there is attached a conditional probability P(Cj | Pa(Cj))
- BNs allow factorizing the joint probability distribution P(C1, ..., Cj) so that

  P(C1, ..., Cj) = Π_j P(Cj | Pa(Cj))
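As an illustration of the factorization, here is a toy three-variable network; the structure and the conditional probability tables are invented, not taken from the paper.

```python
# Hypothetical DAG A -> B, A -> C with made-up conditional probability tables
parents = {"A": [], "B": ["A"], "C": ["A"]}
cpt = {
    "A": {(): {"a0": 0.6, "a1": 0.4}},
    "B": {("a0",): {"b0": 0.7, "b1": 0.3}, ("a1",): {"b0": 0.2, "b1": 0.8}},
    "C": {("a0",): {"c0": 0.9, "c1": 0.1}, ("a1",): {"c0": 0.5, "c1": 0.5}},
}

def joint(assignment):
    """P(C1, ..., Cj) = prod_j P(Cj | Pa(Cj)) -- the BN factorization."""
    p = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[q] for q in pa)
        p *= cpt[var][key][assignment[var]]
    return p

print(joint({"A": "a0", "B": "b1", "C": "c0"}))  # 0.6 * 0.3 * 0.9 ≈ 0.162
```

Each CPT row only conditions on the node's parents, so the number of parameters grows with the local parent sets rather than with the full joint table; this is the reduction of complexity the factorization buys.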

BNs and imputation: method 1

1. Order the variables according to their "reliability"
2. Estimate the network conditioned on this order
3. Estimate the conditional probabilities for each node according to (2)
4. Impute each missing item by a random draw from its conditional probability distribution
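The four steps might look like this on a toy two-variable example, where C1 is assumed more reliable than C2, so the order (and hence the network C1 → C2) is fixed; the records are invented.

```python
import random

random.seed(1)
# Toy records (C1, C2); C1 is complete, C2 has missing items (None)
records = [("m", "yes"), ("m", "yes"), ("m", "no"), ("f", "no"),
           ("f", "no"), ("f", "yes"), ("m", "yes"), ("f", "no"),
           ("m", None), ("f", None)]

# Steps 1-3: with the order (C1, C2) fixed by reliability, estimate the
# conditional probabilities P(C2 | C1) by counting over complete records
counts = {}
for c1, c2 in records:
    if c2 is not None:
        counts.setdefault(c1, {}).setdefault(c2, 0)
        counts[c1][c2] += 1

def draw_c2(c1):
    """Step 4: random draw from the conditional distribution of C2 given C1."""
    dist = counts[c1]
    r = random.random() * sum(dist.values())
    for value, k in dist.items():
        r -= k
        if r <= 0:
            return value

imputed = [(c1, c2 if c2 is not None else draw_c2(c1))
           for c1, c2 in records]
```

Drawing at random, rather than taking the modal category, preserves the estimated conditional distribution of C2 in the imputed data.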


BNs and imputation: methods 2/3

In a multivariate context it is more convenient to use not only the information coming from the parents, but also that coming from the children. This can be done by using the Markov blanket (Mb):

Mb(X) = Pa(X) ∪ Ch(X) ∪ Pa(Ch(X))

In this case, for each node the conditional probabilities are estimated w.r.t. its Mb
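A small helper computing Mb(X) from a parent map; the DAG is hypothetical and purely illustrative.

```python
def markov_blanket(x, parents):
    """Mb(X) = Pa(X) + Ch(X) + Pa(Ch(X)): the node's parents, its
    children, and the children's other parents, as defined above."""
    children = {v for v, pa in parents.items() if x in pa}
    co_parents = {p for ch in children for p in parents[ch]}
    return (set(parents[x]) | children | co_parents) - {x}

# Hypothetical DAG: A -> C, B -> C, C -> D
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
print(sorted(markov_blanket("A", parents)))  # ['B', 'C']
print(sorted(markov_blanket("C", parents)))  # ['A', 'B', 'D']
```

Given the Markov blanket, a node is conditionally independent of every other variable, which is why conditioning an imputation draw on Mb(X) uses all of the locally relevant information.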

Main findings

- BNs allow expressing the joint probability distribution with a dramatic decrease in the number of parameters to be estimated (reduction of complexity)
- BNs may estimate the relationships between variables that are really informative for predicting values
- Parametric models like BNs are efficient in terms of preservation of joint distributions
- The graphical representation facilitates modelling
- BNs and hot-deck methods have the same behaviour only when the hot deck is stratified according to variables explaining exactly the missingness mechanism

Imputation

Quis system for continuous variables

Quis (QUick Imputation System) is a generalized SAS tool developed at Istat to impute continuous survey data in a unified environment.

Given a set of variables subject to non-response, different methods can be used in a completely integrated way:

- Regression imputation via the EM algorithm
- Nearest Neighbour Donor imputation (NND)
- Multivariate Predictive Mean Matching (PMM)

Regression imputation via EM

In the context of imputation, the EM algorithm is used to obtain Maximum Likelihood estimates, in the presence of missing data, of the parameters of the model assumed for the data.

Assumptions:

- MAR mechanism
- Normality

Regression imputation via EM

Once ML estimates of the parameters have been obtained, missing data can be imputed in two different ways:

- directly, through expectations of missing values conditional on observed ones (predictive means)
- by adding a normal random residual to the predictive means (i.e. drawing values from the conditional distributions of the missing values)
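A minimal bivariate sketch of the two options on simulated data. For brevity a single OLS fit on the observed pairs stands in for the EM estimates (with only Y missing and X fully observed the two coincide under MAR); all names and numbers are invented.

```python
import math
import random

random.seed(0)
# Simulated data: Y depends linearly on X; every fifth Y is missing
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]
missing = set(range(0, 200, 5))

obs = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in missing]
n = len(obs)
mx = sum(x for x, _ in obs) / n
my = sum(y for _, y in obs) / n
beta = (sum((x - mx) * (y - my) for x, y in obs)
        / sum((x - mx) ** 2 for x, _ in obs))
alpha = my - beta * mx
sigma = math.sqrt(sum((y - alpha - beta * x) ** 2 for x, y in obs) / (n - 2))

# Option 1: impute the predictive mean E[Y | X]
pred = {i: alpha + beta * xs[i] for i in missing}
# Option 2: add a normal random residual to the predictive mean
stoch = {i: pred[i] + random.gauss(0.0, sigma) for i in missing}
```

Option 1 minimizes the prediction error of each imputed value but shrinks the variance of the imputed variable; option 2 draws from the estimated conditional distribution and so better preserves distributional shape, which matters for any estimate beyond means.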

Multivariate Predictive Mean Matching (PMM)

Let Y = (Y1, ..., Yq) be a set of variables subject to non-response.

- ML estimates of the parameters θ of the joint distribution of Y are derived via EM
- For each pattern of missing data ymiss, the parameters of the corresponding conditional distribution are estimated starting from θ (sweep operator)
- For each unit ui, the predictive mean based on the estimated parameters is computed
- For each unit with missing data, imputation is done using the nearest donor w.r.t. the predictive mean

The Mahalanobis distance is adopted to find donors.
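The matching step can be sketched as follows. This is a 1-D toy: with a single predictive-mean dimension, a plain absolute distance plays the role of the Mahalanobis distance, and all numbers are invented.

```python
def pmm_impute(recipient_means, donors):
    """For each recipient, impute the observed value of the donor whose
    predictive mean is closest (nearest donor w.r.t. the predictive mean)."""
    imputed = []
    for r in recipient_means:
        best = min(donors, key=lambda d: abs(d[0] - r))
        imputed.append(best[1])  # donate the donor's observed value
    return imputed

# donors as (predictive mean, observed value) pairs
donors = [(1.0, 1.1), (2.0, 2.3), (3.0, 2.9)]
print(pmm_impute([1.2, 2.8], donors))  # [1.1, 2.9]
```

Because only observed values are donated, PMM cannot produce implausible values (e.g. negative amounts), unlike direct regression imputation.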

Data clustering for improving the search for donors in the Diesis system

- The DIESIS system has been developed at ISTAT for treating the demographic variables of the 2001 Population Census
- Diesis uses both the data-driven and the minimum-change approach for editing and imputation
- For each failed household, the set of potential donors contains only the nearest passed households
- The adopted distance function is a weighted sum of the distances for each demographic variable over all the individuals within the household

The in-use approach for donor search

- For each failed household e, the identification of potential donors should be made by searching within the set of all passed households D
- When D is very large, as in the case of a Census, the computation of the distance between each e and all d ∈ D (exhaustive search) could require unacceptable computational time
- The in-use sub-optimal search consists in stopping the search before examining the entire set D, according to some stopping criteria. This solution does not guarantee the selection of the potential donors having the actual minimum distance from e

The new approach for donor search

- In order to reduce the number of passed households to examine, the set of passed households D is preliminarily divided into smaller homogeneous subsets {D1, ..., Dn} (D1 ∪ ... ∪ Dn = D)
- Such a subdivision is obtained by solving an unsupervised clustering problem (donor search guided by clustering)
- The search for the potential donors is then conducted, for each failed household e, by examining only the households within the cluster(s) most similar to e
Main findings

- The donor search guided by clustering reduces computational times while preserving the E&I quality obtained by the exhaustive search
- The donor search guided by clustering increases the proportion of actual minimum-distance donors selected with respect to the sub-optimal search (this is especially useful for households having an uncommon structure, for which few passed households are generally available)