Curating Metadata for Social Science Data Applications






Simon Jones*, Guy Warner*, Paul Lambert**, Jesse Blum*



* Computing Science, ** Applied Social Science

University of Stirling, Stirling, Scotland, UK


NCRM Research Methods Festival 2010

Session: Resources I: resources for data management

Oxford, UK 6 July 2010

DAMES: Data Management through e-Social Science

http://www.dames.org.uk


DAMES: Background


DAMES: Case studies, provision and support for data
management in the social sciences


This talk: focusing on "support for data management"


Infrastructure/tools


Driven by social science needs for support for advanced
data management operations



“In practice, social researchers often spend more time
on data management than any other part of the
research process” (Lambert)


A ‘methodology’ of data management is relevant to
‘harmonisation’, ‘comparability’, ‘reproducibility’ in
quantitative social science


DAMES: Themes


Enabling the (social science) researcher:


To deposit, search and process heterogeneous data
resources


To access online services/‘tools’ that enable
researchers to carry out repeatable and challenging
data management techniques such as:




fusion


matching


imputation …


Facilitating access is an important goal


Underlying computer science research themes


Metadata


Data curation


Data management/processing


Portals


Data management/processing
scenarios


Curation scenarios include:


Uploading occupational data to distribute across
academic community


Recording data properties prior to undertaking data
fusion involving a survey and an aggregate dataset


Fusion scenarios include:


Linking a micro-social survey with aggregate
occupational information (deterministic link)


Enhancing a survey dataset with ‘nearest match’
explanatory variables (probabilistic link)


Other processes: recoding, operationalising, linking,
cleaning…
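
The two fusion scenarios above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layouts, the field names (`occ_code`, `age`, `income`, `cam_score`) and the data are invented, and a simple nearest-neighbour donor match on squared distance stands in for the 'nearest match' link. It is not the DAMES implementation.

```python
# Illustrative sketch only: field names and data are invented for this example.

def deterministic_link(survey, aggregate, key):
    """Attach aggregate information to each survey record by exact key match."""
    lookup = {row[key]: row for row in aggregate}
    linked = []
    for record in survey:
        merged = dict(record)
        match = lookup.get(record[key])
        if match is not None:
            # Copy every aggregate field except the key itself.
            merged.update({k: v for k, v in match.items() if k != key})
        linked.append(merged)
    return linked


def nearest_match(recipient, donors, common, impute):
    """Copy the `impute` variables from the donor closest on the `common` variables."""
    def distance(a, b):
        # Squared Euclidean distance over the common (numeric) variables.
        return sum((a[v] - b[v]) ** 2 for v in common)

    fused = []
    for record in recipient:
        donor = min(donors, key=lambda d: distance(record, d))
        merged = dict(record)
        merged.update({v: donor[v] for v in impute})
        fused.append(merged)
    return fused


survey = [{"id": 1, "occ_code": "2310", "age": 34},
          {"id": 2, "occ_code": "9999", "age": 51}]
occupations = [{"occ_code": "2310", "cam_score": 71.5}]
donors = [{"age": 30, "income": 21000}, {"age": 55, "income": 32000}]

linked = deterministic_link(survey, occupations, key="occ_code")
fused = nearest_match(survey, donors, common=["age"], impute=["income"])
```

In the deterministic case a survey record with no matching occupation code is left unchanged; in the nearest-match case every recipient record receives a value, which is why the choice of common variables matters.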


Generic data flows

Data sets are deposited → Data set store → Data sets are selected → Processing is configured → Processing → Result is saved

Data set selection, and the configuration of processing jobs, must be
informed by knowledge about the data sets: metadata


Key role for metadata


Metadata records are absolutely core to the functioning
of the portal infrastructure


For adequate, searchable records for the
heterogeneous resources (data tables, command
files, notes and documentation)


To connect the resources and the data management tools


To document the data sets resulting from application
of the data management tools: inputs, process, rationale,…


DAMES requirements:


(Micro-)data based, very general, lifecycle oriented,
Grid friendly



DDI 3

DDI = Data Documentation Initiative


DDI 3


supports data lifecycle


Thanks to DDI Alliance


DDI 3


An XML language

<ns1:DDIInstance versionDate="2009-05-02"
    id="Instance_SimonsStudy" xmlns:ns1="ddi:instance:3_0" ...>
  <s:StudyUnit id="HIS-CAM-scale">
    <r:Citation>
      <r:Title>HIS-CAM scale for all countries</r:Title>
      <r:Creator affiliation="University of Stirling">
        Lambert, Paul, paul.lambert@stir.ac.uk</r:Creator>
      <r:Publisher>University of Stirling,
        www.camsis.stir.ac.uk</r:Publisher>
    </r:Citation>
    <s:Abstract id="Abstract">
      <r:Content> This Occupational Information Resource ...
        HIS-CAM is "Historical CAMSIS", and CAMSIS is
        "Cambridge Social Interaction and Stratification".
        See http://www.camsis.stir.ac.uk/hiscam/
      </r:Content>
    </s:Abstract>
    ...
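
Because DDI 3 is an XML language, a record like the one above can be read with standard XML tooling. The following is a minimal sketch using Python's standard library. It works on a cut-down, self-contained version of the record (not a schema-valid DDI instance), and the namespace URIs bound to the `s:` and `r:` prefixes are assumptions, since only `ns1` is declared on the slide. The function name `summarise` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Only the ns1 URI appears on the slide; the s: and r: URIs are assumed here.
NS = {
    "ns1": "ddi:instance:3_0",
    "s": "ddi:studyunit:3_0",
    "r": "ddi:reusable:3_0",
}

# A cut-down version of the record, closed off so that it parses.
DDI_RECORD = """\
<ns1:DDIInstance xmlns:ns1="ddi:instance:3_0"
                 xmlns:s="ddi:studyunit:3_0"
                 xmlns:r="ddi:reusable:3_0"
                 id="Instance_SimonsStudy" versionDate="2009-05-02">
  <s:StudyUnit id="HIS-CAM-scale">
    <r:Citation>
      <r:Title>HIS-CAM scale for all countries</r:Title>
      <r:Publisher>University of Stirling, www.camsis.stir.ac.uk</r:Publisher>
    </r:Citation>
  </s:StudyUnit>
</ns1:DDIInstance>
"""


def summarise(xml_text):
    """Pull a few searchable fields out of a DDI-style record."""
    root = ET.fromstring(xml_text)
    study = root.find("s:StudyUnit", NS)
    return {
        "instance_id": root.get("id"),
        "study_id": study.get("id"),
        "title": study.findtext("r:Citation/r:Title", namespaces=NS),
    }


summary = summarise(DDI_RECORD)
print(summary)
```

Fields extracted this way are the kind of information a search or selection tool would need from each record.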


The metadata "cycle"

[Diagram: a cycle of Deposit/curate, Search, Select, Configure/process and Processing, centred on Metadata]

Data is mirrored by metadata


DAMES portal architecture overview

[Architecture diagram. Labels: User; Portal; Services; Search; Enact Fusion; File Access; DAMES Resources; Compute Resources; Metadata; Local Datasets; External Dataset Repositories]

(Note: Security omitted)


Tools


Since metadata must have a key role in data
management…


…tools for managing and exploiting the metadata have a
key role in the use and operation of the DAMES portal:


At deposit/curation


For searching


For informing the configuration of processing steps


The following slides show prototypes of our tools




Curation Tool prototype


The source data:

[Screenshots of the Curation Tool prototype]


Also automatically uploaded to searchable eXist database


Metadata searching prototype

Browsing the search results

Fusion Tool prototype


Scenario: A social science researcher wishes to fuse Scottish
Household Survey data with privately collected study data:


Uses the data curation tool to upload the data


Uses the data fusion/imputation tool to select the data,
identify corresponding variables, and to generate a
derived dataset (held in the portal)


The metadata about this derived dataset is stored and
(may be) made public through the portal


Another researcher can now search the portal
(metadata) for SHS data and find the derived dataset


DAMES metadata handling must facilitate this process
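
The discovery step in this scenario can be sketched as a search over metadata records. The sketch below is in Python and is purely hypothetical: the record fields (`title`, `keywords`, `derived_from`, `public`), the identifiers and the `search` function are assumptions made for illustration. The portal itself stores DDI records in an eXist database and is not modelled here.

```python
# Hypothetical metadata records; identifiers and fields are invented.
records = [
    {"id": "shs-original", "title": "Scottish Household Survey",
     "keywords": ["SHS"], "derived_from": [], "public": True},
    {"id": "private-study", "title": "Privately collected study data",
     "keywords": [], "derived_from": [], "public": False},
    {"id": "fused-001", "title": "Derived dataset: SHS fused with study data",
     "keywords": ["SHS", "fusion"],
     "derived_from": ["shs-original", "private-study"], "public": True},
]


def search(records, term):
    """Return ids of public records whose title or keywords mention `term`."""
    term = term.lower()
    hits = []
    for rec in records:
        if not rec["public"]:
            continue  # unpublished metadata is not discoverable
        text = " ".join([rec["title"]] + rec["keywords"]).lower()
        if term in text:
            hits.append(rec["id"])
    return hits


hits = search(records, "SHS")
print(hits)
```

The derived dataset is found only because its metadata records both the search term and its provenance, which is the requirement the scenario places on metadata handling.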



The Fusion Tool prototype

Select datasets (recipient and donor)

Select "common variables"

Select variables to be imputed

Select data fusion method

Submit to fusion "enactor"


Skipped

Metadata for result dataset
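
The choices collected by the wizard steps (datasets, common variables, variables to be imputed, fusion method) amount to a small job configuration that can be checked against the metadata before submission. The following Python sketch shows one way this might look. The class `FusionJob`, its fields, the `problems` method and the method name "hot-deck" are all hypothetical and are not taken from the DAMES tools.

```python
from dataclasses import dataclass


@dataclass
class FusionJob:
    """Hypothetical container for the choices made in the wizard."""
    recipient: str
    donor: str
    common_variables: list
    imputed_variables: list
    method: str

    def problems(self, variables_by_dataset):
        """Return a list of configuration problems; an empty list means none found."""
        issues = []
        recipient_vars = set(variables_by_dataset.get(self.recipient, []))
        donor_vars = set(variables_by_dataset.get(self.donor, []))
        if not self.common_variables:
            issues.append("no common variables selected")
        for var in self.common_variables:
            if var not in recipient_vars or var not in donor_vars:
                issues.append(f"common variable '{var}' is not in both datasets")
        for var in self.imputed_variables:
            if var not in donor_vars:
                issues.append(f"imputed variable '{var}' is not in the donor dataset")
        return issues


# Variable lists as they might be read from each dataset's metadata record.
variables = {"survey": ["age", "sex"], "study": ["age", "sex", "income"]}

good = FusionJob("survey", "study", ["age", "sex"], ["income"], "hot-deck")
bad = FusionJob("survey", "study", ["age", "region"], ["income"], "hot-deck")

print(good.problems(variables))
print(bad.problems(variables))
```

Checking the configuration against variable lists drawn from metadata is one example of metadata informing the configuration of a processing step.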


Job submission: Information flow

[Information flow diagram. Components: Wizard; Enactor; Compute resources (Condor), with subjob1 and subjob2; User's local file store; Further infrastructure. Artefacts: JFDL/JSDL description.xml; Resultant data; DDI record. Flow labels: notify (job id); fetch job; submit]


Fusion job flow metadata


<g:Concepts>
  <!-- A conceptual component describing the data that are being fused is
       given for each mapped and imputed variable -->
  ...
<g:DataCollection>
  <d:DataCollection>
    <d:ProcessingEvent isIdentifiable="true" id="[Fusion Method Id]" …>
      <d:Coding isIdentifiable="true" id="[Fusion Method Submit File]">
        <d:GenerationInstruction>
          <d:SourceVariable ...> <r:URN… <d:Mnemonic>Donor Dataset
          <d:SourceVariable ...> <r:URN… <d:Mnemonic>Recipient Dataset
          ...
          <r:Command>
            <r:CommandFile formalLanguage="[Language such as SPSS or STATA]">
              ...
              <r:URI>[URI for the command file]

A fragmentary outline!
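
To show how a fragment of this shape might be generated programmatically, here is a minimal Python sketch that builds the `ProcessingEvent` part of the outline with the standard library. It rests on several assumptions: the namespace URIs bound to `d:` and `r:` are assumed, the function `processing_event` and all argument values (including the example URI) are hypothetical, and the `SourceVariable` elements are left out because their content is elided in the outline. It does not reproduce the actual generation of DDI 3 from JFDL.

```python
import xml.etree.ElementTree as ET

D = "ddi:datacollection:3_0"  # assumed URI for the d: prefix
R = "ddi:reusable:3_0"        # assumed URI for the r: prefix
ET.register_namespace("d", D)
ET.register_namespace("r", R)


def processing_event(method_id, submit_file_id, language, command_uri):
    """Build a ProcessingEvent element following the nesting in the outline."""
    event = ET.Element(f"{{{D}}}ProcessingEvent",
                       {"isIdentifiable": "true", "id": method_id})
    coding = ET.SubElement(event, f"{{{D}}}Coding",
                           {"isIdentifiable": "true", "id": submit_file_id})
    instruction = ET.SubElement(coding, f"{{{D}}}GenerationInstruction")
    # SourceVariable elements for the donor and recipient would go here.
    command = ET.SubElement(instruction, f"{{{R}}}Command")
    command_file = ET.SubElement(command, f"{{{R}}}CommandFile",
                                 {"formalLanguage": language})
    ET.SubElement(command_file, f"{{{R}}}URI").text = command_uri
    return event


# All values below are placeholders invented for the example.
event = processing_event("fusion-method-1", "fusion-submit-1",
                         "STATA", "http://example.org/fusion.do")
xml_text = ET.tostring(event, encoding="unicode")
print(xml_text)
```

Recording the command file's language and location in this way is what lets a later user see how a derived dataset was produced.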


Summary/Work in progress


The DAMES infrastructure is still under development


We have a prototype outline portal, with prototype
curation and fusion tools


DDI 3 has been identified as highly appropriate, and
adopted as our metadata standard, but we are still:


Refining the JFDL


Refining the DDI 3


Improving generation of DDI 3 from JFDL


Improving searching and discovery of datasets

Thank you!