Data Management Best Practices


Ryan Womack and Aletia Morgan

Rutgers University Libraries

Research Data Services



October 29, 2013


Why Data Management?


Worst case scenarios


Lost data


Stolen data


Incomprehensible data


Unverifiable data

Why Data Management?


Best Case Scenarios, Data is…


Robust


Recoverable


Reliable


Reusable


Reproducible


Reputable


Renowned


The Data Lifecycle

Understand your data


Quantity (MB, GB, TB)


will affect your decisions about how to store, package, transport, and back up your data


Large numbers of discrete files, regardless of size, may be harder to handle


Multiple versions? Frequent updates?


Format


Seek out open formats or at least commonly used formats (.xlsx is ok, consider .csv)


Both for preservation and reuse, dependency on a single option can be a problem, even
if it is the most convenient in the short term.


Rights and permissions


Reuse of other data sources (licensed or otherwise) may require investigation and
restrict data sharing options


Confidentiality of human subjects or guarding of information for patent protection may
be reasons to restrict data


Give appropriate credit to others’ contributions
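
The quantity and format questions above can be answered with a quick inventory of a project folder. The following is a minimal sketch, not a prescribed tool: the function name inventory is hypothetical, and it only counts files, total bytes, and files per extension.

```python
from collections import Counter
from pathlib import Path


def inventory(root):
    """Return (file_count, total_bytes, files_per_extension) for a directory tree."""
    file_count = 0
    total_bytes = 0
    by_extension = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            file_count += 1
            total_bytes += path.stat().st_size
            # Extensions are lower-cased; files without one are grouped as "(none)".
            by_extension[path.suffix.lower() or "(none)"] += 1
    return file_count, total_bytes, dict(by_extension)
```

Calling inventory on a data folder shows at a glance whether you are dealing with a few large files or many small ones, and which formats dominate.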



Preserving your Data


What types of data need to be managed and stored over the
course of your project?


Raw Data


Working Data


Processed or Final Data


Preserved Data for Possible Re-Use



Data Preservation Considerations


Access Control




Versioning




Backup


Local


Shared


Campus


Offsite
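
A backup is only useful if it matches the original. One common way to check this is to compare checksums of the files in both locations. The sketch below assumes the backup is reachable as an ordinary directory; the function names are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_backup(original_root, backup_root):
    """Return (missing, changed): files absent from, or different in, the backup."""
    original = checksum_manifest(original_root)
    backup = checksum_manifest(backup_root)
    missing = sorted(set(original) - set(backup))
    changed = sorted(
        name for name in original
        if name in backup and original[name] != backup[name]
    )
    return missing, changed
```

Two empty lists from compare_backup mean every original file is present in the backup with identical content.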

Security

Security includes both


Physical Security


Logical/Network/Password Security





Organizing your Data


Who controls the data for which parts of the project?


A logical structure and plan will help during and after data creation, for directories
and filenames (e.g. project directory has standard locations for raw data, analysis,
code, graphs, etc.)


Project folders with well-documented naming conventions (e.g., date of data
creation as part of the file name in a structured format; the ISO format is
yyyy-mm-dd). See the ARM example.


Versioning scheme should be agreed on and documented.


Goal is to have unique and understandable identifiers so that different parts of a
project can merge and be handled easily over time, even if participants and
conditions change.
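
A naming convention is easier to keep when file names are built and checked by code. The sketch below assumes a hypothetical convention of project, description, ISO date, and two-digit version; the pattern and function names are illustrations, not a standard.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<yyyy-mm-dd>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<description>[a-z0-9-]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, description, created, version, ext):
    """Build a file name that follows the convention above."""
    return f"{project}_{description}_{created.isoformat()}_v{version:02d}.{ext}"


def parse_name(filename):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        # Rejects strings that look like dates but are not, e.g. 2013-13-40.
        fields["date"] = date.fromisoformat(fields["date"])
    except ValueError:
        return None
    fields["version"] = int(fields["version"])
    return fields
```

Because ISO dates sort correctly as plain text, files named this way list in chronological order within each project and description.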



Documenting your Data


Documentation helps


with your own reuse of the data at a later date,


others in your research group working with the data


with the long-term reuse potential of your data



A Codebook is a structured way of explaining the contents of your data file


Readme files are a well-understood way of communicating information
about the contents and setup of your data


Documentation can also take the form of a simple text or Word document


Include any explanations of experimental methods, software code run on
the data, and any other tools needed to work with the data


Your previous work on creating and describing a consistent organization
for your data will help here.
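
A codebook can be started from the data file itself. The sketch below reads the header of a CSV file and produces a plain-text skeleton with one entry per variable; the layout of the entries is an assumption made for illustration, and the descriptions still have to be written by hand.

```python
import csv
import io


def codebook_skeleton(csv_file, sample_rows=5):
    """List each column with a few example values and blanks to fill in by hand."""
    reader = csv.DictReader(csv_file)
    examples = {name: [] for name in (reader.fieldnames or [])}
    for index, row in enumerate(reader):
        if index >= sample_rows:
            break
        for name in examples:
            if row[name]:  # skip empty or missing cells
                examples[name].append(row[name])
    lines = []
    for name, values in examples.items():
        lines.append(f"Variable: {name}")
        lines.append("  Description: (fill in)")
        lines.append("  Units / coding: (fill in)")
        lines.append(f"  Example values: {', '.join(values)}")
    return "\n".join(lines)


# Usage with an in-memory file; a real project would pass open("data.csv", newline="").
print(codebook_skeleton(io.StringIO("site,temp_c\nA,12.5\nB,\n")))
```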


Reproducible Research


Ideally, someone else can grab your data project as a complete
bundle of data, documentation, and software code, and recreate
the analysis to get exactly the same results


Reports and data can be integrated so that live analysis run on
actual data can be placed in reports (some R packages do this).


Many initiatives are advancing the concept of reproducible
research.


This high standard of evidence and validation is an assurance that
data and conclusions are not flawed (or faked).


Good data management practices lay the groundwork for success in
reproducible research
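
Getting exactly the same results on a rerun usually requires fixing any source of randomness and recording the software environment. The sketch below is one small illustration in Python rather than R: a bootstrap of the mean with a fixed seed. The function name, the default seed, and the choice of analysis are all assumptions made for the example.

```python
import platform
import random
import statistics


def run_analysis(values, seed=20131029, resamples=1000):
    """Bootstrap the mean with a fixed seed so reruns give identical output."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    means = []
    for _ in range(resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    return {
        "seed": seed,
        "resamples": resamples,
        # Recorded so a later reader knows which interpreter produced the result.
        "python_version": platform.python_version(),
        "bootstrap_mean": statistics.fmean(means),
    }
```

Running the function twice on the same data, with the same seed and interpreter, returns the same dictionary; storing that dictionary next to the report documents how the number was produced.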

Data Management Plans


Many sponsored research agencies now require a
"Data Management Plan" (DMP) as a component
of any proposal



The DMP is a formal document that outlines what
the PI will do with data during and after
completion of a funded research project



The specific requirements for the DMP vary by
funder, and by research subject


Why the DMP Requirement?


Ensure the preservation of important research data


Support the potential re-use of grant-funded data by
other researchers to validate and potentially extend
the value of the data


Improve accountability for use of public revenue to
support research


Provide reasonable access to research data,
consistent with the Freedom of Information Act

Benefits of a Good DMP


Improved competitiveness for grant programs; a clear
and complete Data Management Plan will support the
project’s evaluation plan



Enhanced PI efficiency: writing a DMP encourages the
creation of a structured plan for managing research
data throughout the life of the project and beyond



Long-term protection and preservation of RU research
data



Who’s Asking for a DMP?



National Science Foundation
http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4

Centers for Disease Control and Prevention
http://www.cdc.gov/od/foia/policies/sharing.htm

Department of Energy
http://www.cio.energy.gov/policy-guidance/federal_regulations.htm

Department of Defense
http://www.dtic.mil/whs/directives/corres/pdf/320014p.pdf

Environmental Protection Agency
http://www.epa.gov/quality/informationguidelines/documents/EPA_InfoQualityGuidelines.pdf

Institute of Museum and Library Services
http://www.imls.gov/applicants/forms/DigitalProducts.pdf

NASA
http://nasascience.nasa.gov/earth-science/earth-science-data-centers/data-and-information-policy

National Endowment for the Humanities
http://www.neh.gov/grants/guidelines/pdf/DataManagementPlans.pdf

National Institute of Justice
http://www.ojp.usdoj.gov/nij/funding/data-resources-program/welcome.htm

National Institute of Standards and Technology
http://www.nist.gov/director/quality_standards.htm

United States Department of Agriculture
http://www.csrees.usda.gov/

United States Department of Education
http://ies.ed.gov/funding/datasharing_policy.asp



Many academic journals have also begun to require data sharing as part of the submission process.


For a partial list of journals with data sharing mandates for their published articles, see the following:


http://oad.simmons.edu/oadwiki/Journal_open-data_policies

http://gking.harvard.edu/pages/data-sharing-and-replication






As described by Portland State University Library

http://library.pdx.edu/digital-scholarship/data-mgmt-plans/who-requires-dmps.html



The NSF DMP Should Include


[data attributes] The types of data, samples, physical collections, software,
curriculum materials, and other materials to be produced in the course of the
project;


[metadata] The standards to be used for data and metadata format and content;


[security policies] Policies for access and sharing including provisions for
appropriate protection of privacy, confidentiality, security, intellectual property, or
other rights or requirements, including the right to embargo data for a specified
time period to allow first publication and thorough use of the data;


[use policies] Policies and provisions for fair re-use, re-distribution, and the
production of derivatives; and


[preservation] Plans for archiving data, samples, and other research products, and
for preservation of access to them.


DMP Web Support

Operated by the California Digital Library, the DMPTool is
a site supporting general DMP development with some
school-specific guidance






RUL Data Resources offers information about services and
experts to support your data management efforts

Discoverable Data


Publicly available archives such as Dryad, Dataverse, ICPSR,
and more allow other researchers to easily discover and
reuse data


Metadata provide standardized terminology for searching
and discovering data matching defined characteristics.
Unlike a plain data upload to a website, metadata give
computers a well-structured basis for indexing the data.


Your discipline may have well-defined metadata standards
(or not). Check with your librarian if you are in doubt.


Databib and re3data are directories of research data
repositories that can be used to discover possible locations
for you to deposit and share your data.
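
To make the idea of well-structured metadata concrete, here is a minimal sketch of a machine-readable record. The field names are the 15 element names of the Dublin Core Metadata Element Set, used here only as a familiar example; your discipline or repository may require a different standard. The function name, the choice of required fields, and the sample values (including the placeholder DOI) are assumptions.

```python
import json

# The 15 element names of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Assumption for this sketch: a minimal record needs these four fields.
REQUIRED = ("title", "creator", "date", "identifier")


def make_record(**fields):
    """Return a JSON metadata record, rejecting unknown or missing fields."""
    unknown = sorted(set(fields) - DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    missing = [name for name in REQUIRED if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return json.dumps(fields, indent=2, sort_keys=True)


# Placeholder values for illustration only.
print(make_record(
    title="Example soil moisture readings",
    creator="Doe, Jane",
    date="2013-10-29",
    identifier="doi:10.0000/example",
    format="text/csv",
))
```

Because every record uses the same field names, an index can search across many datasets by creator, date, or format without understanding the data files themselves.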

Citing Data


By publicly sharing your data via an established
repository, you will typically get a DOI (Digital Object
Identifier) or other persistent URL. This will serve as a
permanent pointer to the data


You can cite your data, and others’ data, with DOIs.
This is more precise and easier for others to work with
than a vague reference to the data by title, author, or
even citing a paper that uses the data.


Data citation increases the impact of your research!
This completes the data lifecycle.
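
A data citation built around a DOI can be assembled mechanically. The sketch below uses a creator, year, title, repository, identifier layout, which is one common pattern and not a required style; check what your repository or journal recommends. The function name and the sample values, including the placeholder DOI, are hypothetical.

```python
def format_data_citation(creators, year, title, repository, doi):
    """Format a data citation that ends with a resolvable DOI link."""
    doi = doi.strip()
    # Accept the DOI with or without a common prefix.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    authors = "; ".join(creators)
    return f"{authors} ({year}). {title}. {repository}. https://doi.org/{doi}"


# Placeholder values for illustration only.
print(format_data_citation(
    ["Doe, J.", "Roe, R."], 2013, "Example dataset",
    "Example Repository", "doi:10.0000/example",
))
```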

The Data Lifecycle

Further References


Australian National Data Service: Data Management for Researchers


Australian National University: Data Management


CIESIN: Geospatial Electronic Records


ICPSR Guide to Social Science Data Preparation and Archiving (pdf)


Oak Ridge National Laboratory: Best Practices for Preparing Environmental Data Sets to Share and Archive


UK Data Archive: Create & Manage Data and Managing and Sharing Data: a Best Practice Guide for Researchers (pdf).