24-26 September 2012

UNECE CONFERENCE OF EUROPEAN STATISTICIANS


Work Session on Statistical Data Editing

Use of Machine Learning Methods
to Impute Categorical Data

Pilar Rey del Castillo*

EUROSTAT, Unit B1: Quality, Research and Methodology


Problem: non-response in statistical surveys / missing information in machine learning

    different approaches
    different evaluation criteria

Aim: show that the commitment to the almost exclusive use of probabilistic data
models prevents statisticians from using the most convenient technologies


Case of categorical variables: practical recommendations from the
statistical approach just reuse procedures designed for numeric variables


Outline of the presentation


1. Review of non-response treatments
    imputation procedures: evaluation criteria

2. Recommendations for categorical data imputation from the
   statistical community: why these are not appropriate

3. Results of comparisons with two machine learning methods

4. Final remarks


Non-response treatments



Deletion procedures: using only the units with complete data for further analysis

Tolerance procedures: handling missing values internally, neither removing
incomplete records nor completing them

Imputation procedures: replacing each missing value by an estimate
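
The deletion and imputation treatments listed above can be contrasted in a minimal Python sketch. The toy records, the field names `age` and `vote`, and the helpers `delete_incomplete` and `impute_mode` are all hypothetical and do not come from the presentation. Mode imputation stands in here only as the crudest possible estimate.

```python
from collections import Counter

# Toy survey records; None marks a non-response (hypothetical data).
records = [
    {"age": 34, "vote": "A"},
    {"age": 51, "vote": None},
    {"age": 27, "vote": "B"},
    {"age": 45, "vote": "A"},
]

def delete_incomplete(rows):
    """Deletion: keep only the units with complete data."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mode(rows, field):
    """Imputation: replace each missing value of `field` by the most
    frequent observed category (a deliberately crude estimate)."""
    observed = [r[field] for r in rows if r[field] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [
        dict(r, **{field: mode}) if r[field] is None else dict(r)
        for r in rows
    ]

complete_cases = delete_incomplete(records)   # drops the second record
imputed = impute_mode(records, "vote")        # keeps all four records
```

Deletion shrinks the data set, while imputation keeps every unit at the cost of introducing estimated values.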


Imputation procedures



Algorithmic methods: use an algorithm to produce the imputations (cold and
hot-deck, nearest-neighbour, mean, machine learning classification & prediction
techniques…)

Model-based methods: the predictive distributions have a formal statistical model

    state of the art: multiple imputation (MI)
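
As an illustration of an algorithmic method, here is a rough sketch of a random hot-deck within imputation classes. The function `hot_deck_impute`, the field names `region` and `vote`, and the toy data are assumptions made for this example. Real hot-deck procedures differ in how donors and classes are chosen.

```python
import random

def hot_deck_impute(rows, target, class_key, rng):
    """Random hot-deck: fill each missing `target` with the value of a
    randomly chosen donor from the same imputation class."""
    donors = {}
    for r in rows:
        if r[target] is not None:
            donors.setdefault(r[class_key], []).append(r[target])
    all_donors = [v for values in donors.values() for v in values]
    result = []
    for r in rows:
        r = dict(r)  # copy so the input records are left untouched
        if r[target] is None:
            # Fall back to all respondents if the class has no donor.
            pool = donors.get(r[class_key]) or all_donors
            r[target] = rng.choice(pool)
        result.append(r)
    return result

rows = [
    {"region": "north", "vote": "A"},
    {"region": "north", "vote": None},
    {"region": "south", "vote": "B"},
    {"region": "south", "vote": "C"},
    {"region": "south", "vote": None},
]
filled = hot_deck_impute(rows, "vote", "region", random.Random(0))
```

No statistical model is specified anywhere in the sketch, which is what separates this family from the model-based methods.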


Criteria for evaluating the imputation results



Statistical surveys: valid & efficient inferences, the missing-data treatment
being part of the overall procedure

"…Judging the quality of missing data procedures by their ability to recreate
the individual missing values (according to hit-rate, mean square error, etc.)
does not lead to choosing procedures that result in valid inference, which is
our objective" (Rubin, 1996)

Machine learning: general artificial intelligence framework (empirical results
through simulating missing data and measuring the closeness between real &
imputed values)


Categorical data imputation in statistical surveys


State of the art: MI or other model-based methods

    Log-linear model: not always possible

    Logistic regression models: sometimes problems at the estimation step

    Binary case: Rubin & Schenker (1986), Schafer (1997): to approximate by
    using a Gaussian distribution

    Non-binary case: Yucel & Zaslavsky (2003), Van Ginkel et al. (2007):
    rounding multivariate normal distribution

Criticisms from the practical perspective (Horton et al. (2003), Ake (2005),
Allison (2006), Demirtas (2008))

Contradiction
    (theoretical framework: focus on model adequacy)
    (practical recommendations: models clearly not adequate)
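
The practical criticisms of rounding can be made concrete with a simplified calculation. It is not taken from any of the cited papers. Assume a binary variable with true proportion p is imputed by drawing from a normal distribution with the same mean and variance, N(p, p(1-p)), and each draw is rounded to 0 or 1 at the 0.5 threshold. The expected share of ones after rounding is then 1 - Φ((0.5 - p) / sqrt(p(1-p))). The function names below are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def share_of_ones_after_rounding(p):
    """Expected proportion of 1s when draws from N(p, p(1-p)) are
    rounded to {0, 1} at the 0.5 threshold (simplifying assumption)."""
    sd = math.sqrt(p * (1.0 - p))
    return 1.0 - normal_cdf((0.5 - p) / sd)

for p in (0.05, 0.2, 0.5):
    rounded = share_of_ones_after_rounding(p)
    print(f"true p = {p:.2f}  share of ones after rounding = {rounded:.4f}")
```

In this simplified setting the rounded share matches p at 0.5 but falls below it at 0.05 and above it at 0.2, so the imputed category frequencies are distorted even when the mean and variance are correct.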


Problem of categorical data imputation to be solved



Survey microdata file: opinion poll (no. 2750 in CIS catalogue)

Quantitative variables (8): ideological self-location; rating of three specific
political figures; likelihood to vote; likelihood to vote for three specific
political parties…

Ordered categorical variables (2): government and opposition party ratings
(converted to quantitative)

Categorical variables with non-ordered categories (7): voting intention; voting
memory; the autonomous community; the political party the respondent would
prefer to see win…

Voting intention to be imputed: 11 categories (biggest political parties,
"blank vote", "abstention", "others")

13,280 interviews with no missing values


Imputation methods to be compared



MI logistic regression

Classifiers (matching each class with one of the Voting intention categories)

    Fuzzy min-max neural network classifier recently extended to deal with
    mixed numeric & categorical data as inputs (Rey del Castillo & Cardeñosa,
    2012)

    Bayesian network classifier: not Naïve Bayes classifier but a more complex
    architecture learnt with a score + search paradigm


Comparison criterion




The classical survey inference criterion is not applicable because there are
no models

EUREDIT project: Wald statistic for categorical variables, but none of the
methods passes the proposed test!

Correctly imputed rate is used (ten-fold cross-validation)
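
The correctly imputed rate under ten-fold cross-validation can be sketched as follows. The complete records are split into ten folds. Each fold in turn has its target values hidden, a classifier trained on the other nine folds predicts them, and the share of exact matches is accumulated. The classifier below is a deliberately simple stand-in (the most frequent target category among training records with the same predictor value). It is none of the three methods compared in the presentation, and all names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def train(rows, predictor, target):
    """Stand-in classifier: most frequent target per predictor value."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[predictor]][r[target]] += 1
    overall = Counter(r[target] for r in rows).most_common(1)[0][0]
    table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
    # Unseen predictor values fall back to the overall majority class.
    return lambda r: table.get(r[predictor], overall)

def correctly_imputed_rate(rows, predictor, target, k=10, seed=0):
    """k-fold cross-validated share of hidden target values that the
    classifier reproduces exactly."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    hits = total = 0
    for i, test_fold in enumerate(folds):
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        predict = train(training, predictor, target)
        hits += sum(predict(r) == r[target] for r in test_fold)
        total += len(test_fold)
    return hits / total

# Hypothetical complete records: voting memory used to predict intention.
data = ([{"memory": "A", "intention": "A"}] * 40
        + [{"memory": "A", "intention": "B"}] * 10
        + [{"memory": "B", "intention": "B"}] * 45
        + [{"memory": "B", "intention": "A"}] * 5)
rate = correctly_imputed_rate(data, "memory", "intention")
```

Because every record is used for testing exactly once, the rate is computed over all complete interviews rather than over a single held-out sample.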


Results of the comparison


Imputation method                          Correctly imputed rate (%)
MI logistic regression                     66.0
Fuzzy min-max neural network classifier    86.1
Bayesian network classifier                87.4


Conclusions & final remarks


1. Always similar differences between machine learning / MI logistic

2. Simplest case with missing data exclusively on one variable

3. Extensible to numeric variables?

4. Machine learning procedures easier to automate
    Non-dependence on model assumptions
    Don't break down when large number of variables?
    More robust to outliers?

5. Machine learning may be used for massive imputation tasks




Thank you !!!



References (1)



Ake, C. F. (2005), Rounding After Multiple Imputation with Non-Binary
Categorical Covariates, SAS Conference Proceedings: SAS User Group
International 30, Philadelphia, PA, April 2005.

Allison, P. (2006), Multiple Imputation of Categorical Variables under the
Multivariate Normal Model, paper presented at the Annual Meeting of the
American Sociological Association, Montreal Convention Center, Montreal,
Quebec, Canada, August 2006.

Demirtas, H. (2008), On Imputing Continuous Data When the Eventual Interest
Pertains to Ordinalized Outcomes Via Threshold Concept, Computational
Statistics and Data Analysis, vol. 52, pp. 2261-2271.

Horton, N. J., Lipsitz, S. R. and Parzen, M. (2003), A Potential for Bias
when Rounding in Multiple Imputation, The American Statistician, vol. 57,
no. 4, pp. 229-232, November 2003.

Rey-del-Castillo, P. and Cardeñosa, J. (2012), Fuzzy Min-Max Neural Networks
for Categorical Data: Application to Missing Data Imputation, Neural
Computing and Applications, vol. 21, no. 6, pp. 1349-1362,
DOI 10.1007/s00521-011-0574-x, Springer-Verlag London.

Rubin, D. B. (1996), Multiple Imputation After 18+ Years, Journal of the
American Statistical Association, vol. 91, no. 434, Applications and Case
Studies, June 1996.


References (2)



Rubin, D. B. and Schenker, N. (1986), Multiple Imputation for Interval
Estimation from Simple Random Samples with Ignorable Nonresponse,
Journal of the American Statistical Association, vol. 81, no. 394, Survey
Research Methods, June 1986.

Schafer, J. L. and Graham, J. W. (2002), Missing Data: Our View of the
State of the Art, Psychological Methods, vol. 7, no. 2, pp. 147-177.

Van Ginkel, J. R., Van der Ark, L. A. and Sijtsma, K. (2007), Multiple
Imputation of Item Scores when Test Data are Factorially Complex, British
Journal of Mathematical and Statistical Psychology, vol. 60, pp. 315-337.

Yucel, R. M. and Zaslavsky, A. M. (2003), Practical Suggestions on
Rounding in Multiple Imputation, Proceedings of the Joint American
Statistical Association Meeting, Section on Survey Research Methods,
Toronto, Canada, August 2003.