Oct 1, 2013 (3 years and 10 months ago)

DC3A: Melbourne Neuropsychiatry Centre (MNC) Bioinformatics Development Project

1 Project Purpose and Activities

The MNC has one of the largest databases of brain scans and associated neuropsychiatric research data in the world. It has national and international collaborators using and contributing to the database.



Build workflow for automating documentation of dataset segments used in individual studies
and publications. This will include researchers, datasets, associated projects and publications.



Build workflow for automating creation of citable persistent identifiers for unique studies and
linking with publications.



Build software to automate capture of public facing metadata to University of Melbourne Registry which will deliver collections metadata to the ARDC.



MNC has 270+ publications resulting from datasets stored in the MNC database. Completing the work above will result in ~100 dataset descriptions in the ARDC by June 2011, with an expected 25+ being registered each year after that.

2 Deliverables


Deliverable

D1  Project plan agreed by ANDS.

D2  Five sample Collection records in ARDC, with associated Party and Activity records, of agreed standard.

D3  High level design documents:
    a. Metadata mappings between each pair of metadata formats where mappings are required, including to RIF-CS.
    b. Process descriptions for capturing MNC dataset metadata and storing it in the VITRO registry.
    c. Process descriptions of integration between pre-print register, dataset metadata and IP information.
    d. Design document of overall system.

D4  Deployed system that:
    a. Extracts metadata from datasets in the MNC database.
    b. Automatically enriches dataset with pre-print register metadata, and copyright/IP metadata, by connecting with those systems.
    c. Allows extracted metadata to be enriched by user input.
    d. Generates RIF-CS Collection, Party and Activity records from metadata.
    e. Allows authorised external users to access datasets from MNC database, using a query builder.
    f. Stores datasets in VITRO registry.
    g. Allows users to develop advanced queries to find datasets.
    h. Deposits RIF-CS metadata in VITRO registry for ARDC harvest, including Service descriptions.
    i. Automatically assigns persistent identifiers to datasets where required.

D5  Collection descriptions for 100 datasets, with associated Party, Service and Activity descriptions, produced by deployed system and made visible to ARDC.
    a. As many descriptions as possible should contain links to, or access information for, immediately shareable data.

D6  Source code for all developed software, with developer’s manuals (to facilitate reuse), deposited in agreed open-source repository.

D7  Deployed, permanent, operational feed of Collection, Party, Service and Activity description records to ARDC, with output of agreed quality.
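
Deliverable D4(d) of DC3A asks the deployed system to generate RIF-CS Collection, Party and Activity records from extracted metadata. The following is a minimal Python sketch of that step for a single Collection record; it is not the project's implementation. The function name, the sample keys and the choice of the hasCollector relation are illustrative assumptions, and the output has not been validated against the RIF-CS schema.

```python
import xml.etree.ElementTree as ET

# RIF-CS registryObjects namespace.
RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"


def build_collection_record(key, group, source, title, description, party_key):
    """Return a RIF-CS document (as a string) describing one collection.

    All argument values are supplied by the caller; nothing here is read
    from a real MNC database (that part is outside this sketch).
    """
    ET.register_namespace("", RIFCS_NS)

    def tag(name):
        return f"{{{RIFCS_NS}}}{name}"

    root = ET.Element(tag("registryObjects"))
    obj = ET.SubElement(root, tag("registryObject"), {"group": group})
    ET.SubElement(obj, tag("key")).text = key
    ET.SubElement(obj, tag("originatingSource")).text = source

    coll = ET.SubElement(obj, tag("collection"), {"type": "dataset"})
    name = ET.SubElement(coll, tag("name"), {"type": "primary"})
    ET.SubElement(name, tag("namePart")).text = title
    ET.SubElement(coll, tag("description"), {"type": "brief"}).text = description

    # Link the collection to a Party record by key (assumed relation type).
    related = ET.SubElement(coll, tag("relatedObject"))
    ET.SubElement(related, tag("key")).text = party_key
    ET.SubElement(related, tag("relation"), {"type": "hasCollector"})

    return ET.tostring(root, encoding="unicode")
```

Party and Activity records would follow the same pattern, with a party or activity element in place of collection; each record type needs its own key so that relatedObject links resolve.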



DC3B: Longitudinal qualitative and quantitative survey data capture and re-use, Youth Research Centre

1 Project Purpose and Activities

The Youth Research Centre’s Life Patterns Research Program maintains an extensive qualitative and quantitative database on a cohort of 2000 young Australians who left secondary school in 1991 and on a second cohort of 3000 who left school in 2005. With ARC funding through to 2014 for annual quantitative and qualitative data capture for the second cohort (Gen Y) and biannual data capture for the first cohort (Gen X), this activity aims to enable wider access to and use of the data by developing the infrastructure to:

a) make sets of the existing data available for re-use,

b) streamline capture of new data so that it is more readily available for re-use, and

c) build the capacity to efficiently respond to future requests for derived data sets.

Appropriate structures for the capture of relevant metadata (compliant with DDI2, DDI3 and RIF-CS schemas) and tools to extract this metadata from workflows will be developed.

2 Deliverables


Deliverable

D1  Project plan agreed by ANDS, in ANDS standard template format.

D2  Statement of ethical issues policy, indicating how future datasets will be released.

D3  Five sample collection descriptions (with associated activity, service and party descriptions) submitted to the ARDC, including one for the Life Patterns Research Program Project.

D4  High level design descriptions:
    a. Process descriptions for deriving, describing and publishing re-usable data sets.
    b. High level software design document, showing data flows and links between components.

D5  Deployed system that:
    a. Automates the deriving and publishing of re-usable data sets.
    b. Automatically captures and extracts metadata in quantitative and qualitative data capture workflows.
    c. Allows extracted metadata to be enriched by human input.
    d. Allows authorised external users to access datasets.
    e. Automatically assigns persistent identifiers where required.

D6  Twenty Collection descriptions, with linked Party, Service and Activity descriptions, produced by deployed system and available for harvest by ARDC. These describe linked, comparative longitudinal case studies on young people’s life trajectories from each of cohort 1 and cohort 2, illustrative of the underlying quantitative and qualitative data.
    a. As many descriptions as possible should contain links to, or access information for, immediately shareable data.

D7  Source code for all developed software deposited in agreed open-source repository, with developer’s manuals to facilitate reuse.

D8  Deployed, permanent, operational feed of collection, party and activity records to ARDC, with output of agreed quality.



DC3C: Optimising Metadata Capture, Data Sharing Procedures and Long-term Re-use of Video data in the Social Sciences

3 Project Purpose and Activities

The University of Melbourne has an especially rich humanities and social science research community that utilises video as its primary form of data capture. The increasing use of video as a research tool poses particular challenges for aggregated data storage initiatives. This project will integrate metadata capture facilities at selected sites within the University of Melbourne as part of facilitating sharing and re-use. The project will address current metadata issues associated with large-scale audio-visual repositories and workflows to enable efficient generation of metadata, ensuring that stored video data is accessible and searchable through the ARDC. The project will:

Develop software to automate the capture of metadata from existing mature video storage systems developed by the ICCR (International Centre for Classroom Research),

Develop and, where possible, utilise existing infrastructure to identify generic workflow tools that will enable rich knowledge of data sets, access services and parties to the research to be systematically (RIF-CS) captured from the researchers,

Develop standards compliant video data and metadata deposit services.

These are generic goals which are broadly applicable to activities elsewhere within the university, for example in the Faculty of Architecture, Building and Planning and the Faculty of the VCA and Music.

4 Deliverables


Deliverable

D1  Project plan agreed by ANDS, in ANDS standard template format.

D2  High level design documents:
    a. Business process description and high level design for publication of video dataset descriptions, including case-specific protocols and ethics considerations for data access locally, nationally and internationally.
    b. Metadata schema for video dataset descriptions.
    c. Mapping of schema to RIF-CS.
    d. Web services design documents and validation against existing research projects.

D3  Five sample Collection descriptions, with linked Activity, Party and Service descriptions, representing the selected active video intensive projects across the University, in ARDC.

D4  Deployed system to be used by research staff and data librarians that:
    a. Allows video data to be deposited.
    b. Automatically extracts metadata from video data.
    c. Allows extracted metadata to be enriched by user input.
    d. Generates RIF-CS Collection, Party and Activity descriptions from metadata.
    e. Ingests metadata into the University of Melbourne VITRO registry.
    f. Automatically assigns persistent identifiers where required.

D5  Operational automatic feeds of Collection descriptions and associated Party and Activity information to ARDC.

D6  Agreed number (to be specified in project plan) of Collection descriptions, with associated Party, Service and Activity descriptions, produced by deployed system and available to ARDC.
    a. As many descriptions as possible should contain links to, or access information for, immediately shareable data.

D7  Deposit of all developed software in agreed open source repository, accompanied by developer manuals to facilitate reuse.



DC3D: Human and mouse neuroimaging collections in the national data commons

5 Project Purpose and Activities

DaRIS is a raw data management system based on the Mediaflux digital asset management platform and has been in operation for the last 3 years at the Neuroimaging Computational and Data Management Facility (CDMF). There it has been used to routinely receive MR images from researchers and organise them into a subject-centric data model, ready for access by project members. It hosts over 70 mouse and human projects, each with many tens of subjects and some with time-dependent data.



Map DaRIS project metadata to the ANDS schema,

Write a DaRIS service to populate ANDS-compliant metadata,

Develop an adapter to harvest the ANDS-compliant metadata from DaRIS,

Connect identifiers within DaRIS to ANDS persistent identifiers (PIDs).
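
The first activity above, mapping DaRIS project metadata to the ANDS schema, is essentially a field-by-field crosswalk. A minimal sketch in Python is shown here under stated assumptions: the DaRIS-side field names (project_name and so on) are hypothetical placeholders rather than the real DaRIS schema, and the target names are simplified labels for RIF-CS elements.

```python
# Hypothetical crosswalk: keys are placeholder DaRIS field names,
# values are simplified labels for the RIF-CS element they feed.
CROSSWALK = {
    "project_name": "name",
    "project_description": "description",
    "project_id": "identifier",
}


def map_project_metadata(daris_record, crosswalk=CROSSWALK):
    """Split a source record into mapped fields and a list of unmapped field names.

    Reporting unmapped fields makes gaps in the crosswalk visible instead of
    silently dropping metadata.
    """
    mapped = {}
    unmapped = []
    for field, value in daris_record.items():
        target = crosswalk.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, sorted(unmapped)
```

Keeping the crosswalk as data rather than code means the mapping document and the running service can be reviewed side by side, and a new domain can supply its own table.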


Relationship to the National Imaging Facility (NIF) ANDS proposal

The DaRIS system has been selected by the NCRIS NIF to provide its data management capability. The NIF will manage collections of data from a range of domains which are primarily but not only neuroimaging (e.g. plant imaging, microscopy, etc.). This is possible because the general DaRIS framework can be tailored to a number of domains. Nonetheless, each domain requires different metadata definition design, data capture protocols and workflows; therefore the metadata capture process is inherently different for each domain.

The University of Melbourne ANDS proposal focuses on neuroimaging metadata exposure (with collections managed by DaRIS held at UoM) whereas the separate NIF ANDS proposal focuses on operationalising the DaRIS system to multiple nodes of the NIF as well as exposing NIF collections with DaRIS. There is thus no dependency between these two proposals.

6 Deliverables


Deliverable

D1  Project plan agreed by ANDS, in ANDS standard template format.

D2  Mapping of DaRIS project metadata to RIF-CS.

D3  High level design of:
    a. DaRIS RIF-CS generation service.
    b. DaRIS-ARDC OAI-PMH feed.
    c. Integration with ANDS Persistent Identifier Service.
    d. Automated extraction of metadata from datasets.

D4  Ethics policy for re-use of data collections. (If a suitable number of collections cannot be shared with other researchers on an ongoing basis, the project cannot proceed.)

D5  Five sample Collection descriptions, accompanied by Party, Service and Activity descriptions, representing a range of different dataset types, in ARDC.

D6  Deployed, tested, documented system that:
    a. Extracts metadata from datasets.
    b. Allows users to enhance metadata for datasets.
    c. Generates ANDS-compliant RIF-CS from datasets.
    d. Exposes RIF-CS as OAI-PMH feed, with controls to prevent harvesting of non-shareable collections.
    e. Provides direct download access to datasets with appropriate authentication and authorisation controls.
    f. Automatically assigns persistent identifiers where needed.

D7  Collection descriptions for 100 datasets, including Service, Activity and Party description links, produced by deployed system and available for harvest by ARDC.
    a. As many descriptions as possible should contain links to, or access information for, immediately shareable data.

D8  Deposit of all developed software in agreed open source repository, accompanied by developer manuals to facilitate reuse.
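
Deliverable D6(d) of DC3D requires the OAI-PMH feed to withhold non-shareable collections from harvest. One way to read that control is as a default-deny filter applied before records reach the feed. The Python sketch below illustrates this under stated assumptions: the shareable flag on each record, the function names and the example base URL are hypothetical, and the metadata prefix "rif" is assumed rather than taken from the source.

```python
from urllib.parse import urlencode


def harvestable_records(records):
    """Keep only records explicitly marked shareable.

    A record with a missing or non-boolean flag is withheld, so an
    unreviewed collection can never leak into the feed by omission.
    """
    return [record for record in records if record.get("shareable") is True]


def list_records_url(base_url, metadata_prefix="rif"):
    """Build the OAI-PMH ListRecords request a harvester would issue.

    The prefix value is an assumption; the feed's Identify and
    ListMetadataFormats responses state what it actually supports.
    """
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"
```

The filter belongs on the provider side: a harvester only sees what the feed exposes, so the shareability decision has to be made before a record is served.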



DC3E: Humanities and Social Science Data at the University of Melbourne

7 Project Purpose and Activities

The University of Melbourne has one of the richest and most diverse humanities and social science (HASS) research communities in Australia and is well ranked internationally. HASS researchers at Melbourne generate and hold valuable data sets and associated materials that are currently not easily discoverable, accessible or configured for further research purposes. This project will build infrastructure (tools and services) to connect this diverse community with the UoM Registry (Vitro) which will in turn communicate the relevant metadata to the ARDC. The project will:

Develop and utilize existing (OHRM-based) infrastructure to identify generic workflow tools that will enable rich knowledge of data sets and related materials, access services and parties to the research to be systematically (RIF-CS) captured from the researchers.

Develop a generic web services-based data capture tool to be used by researcher staff, data librarians or other staff in the data management fabric. This will be based on the ‘pre-register’ work done for the Australian Women’s Register in 2009.

Develop standards compliant ‘access service’ descriptions.

Ensure project, data, party and service descriptions concord with Data Documentation Initiative (v2&3) requirements.

Inform the development and utilisation of digital and analogue archival preservation, curation and access systems for the University.

8 Deliverables


Deliverable

D1  Project plan agreed by ANDS, in ANDS standard template format.

D2  Sample descriptions in ARDC as follows:
    a. Collection, Activity, Party and Service descriptions representing five selected active HASS projects across the University.
    b. Five sample Collection descriptions, with associated Party, Service and Activity records, drawn from those projects and made available in the ARDC.

D3  Mapping of one (or more, if applicable) data set formats held in OHRM to RIF-CS.

D4  Design document for web services to be built.

D5  Design document for a generic web service-based data entry, ingest and metadata management tool (henceforth “web data capture tool”) to be used by researcher staff, data librarians or other staff.

D6  Deployed, tested, documented system that:
    a. Allows data related to humanities projects to be input, managed, browsed and searched.
    b. Provides a RIF-CS feed into the University of Melbourne’s VITRO registry.
    c. Is integrated with data from a number of existing OHRM databases, to be specified in the project plan.
    d. Can be controlled through web services, including the bulk retrieval of data.

D7  Agreed number of Collection descriptions (as determined and specified in project plan), with associated Service, Party and Activity descriptions, produced by deployed system and available for harvest by ARDC.
    a. As many descriptions as possible should contain links to, or access information for, immediately shareable data.

D8  Deposit of all developed software in agreed open source repository, accompanied by developer manuals to facilitate reuse.

D9  All descriptions available for harvest from VITRO.



DC3F: Capturing multi-modal data to support research in cardiovascular and neurological medicine

9 Project Purpose and Activities

Complex physiological data is routinely collected on patients as part of clinical care (echocardiography, intravascular ultrasound, x-ray angiography, optical computerised tomography, patient clinical data, etc.). However, this rich multi-modal data is not usually subjected to subsequent analysis, nor is it made available to researchers from other disciplines for novel analysis. Making this multi-modal data available along with patient outcomes such as morbidities will provide the opportunity for collaborative groups to employ novel strategies to develop assessments and models based on this data. This project will form the necessary base for making multi-modal data collections available, enabling the establishment of new links between biomedical research groups in engineering, physics and bioinformatics. This project will occur in collaboration with BioGrid Australia, where it will use the access, de-identification and privacy protection protocols already established there.

The major activities will be:



Map BioGrid metadata to the ANDS schema,

Write a service to populate ANDS-compliant metadata,

Develop a service to harvest ANDS-compliant metadata from multiple BioGrid data sets which form a single study,

Enable the assignment of globally unique identifiers that link to the source of multi-modal datasets.
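
The last activity above, assigning globally unique identifiers that link back to the source of each multi-modal dataset, can be sketched as two small steps: mint an identifier only where one is missing, then record what each identifier resolves to. The Python below is an illustration only. The record fields (pid, source), the example-pid prefix and the use of random UUIDs are assumptions; a deployed system would presumably call a persistent identifier minting service instead.

```python
import uuid


def assign_identifiers(datasets, mint=None):
    """Return copies of the dataset records, adding a 'pid' only where one is missing.

    'mint' is any zero-argument function returning a new identifier. The
    default is a stand-in (random UUID under a made-up prefix), not a real
    persistent identifier service.
    """
    if mint is None:
        mint = lambda: f"example-pid/{uuid.uuid4()}"
    assigned = []
    for record in datasets:
        updated = dict(record)  # copy so the caller's records are left untouched
        if not updated.get("pid"):
            updated["pid"] = mint()
        assigned.append(updated)
    return assigned


def build_resolver_table(datasets):
    """Map each identifier to the source location it should resolve to."""
    return {record["pid"]: record["source"] for record in datasets}
```

Leaving existing identifiers untouched matters here: an identifier that has already been cited must keep resolving to the same source.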

10 Deliverables


Deliverable

D1  Project plan agreed by ANDS, in ANDS standard template format.

D2  Ethics policy quantifying which datasets will be shareable and under which conditions, and which will never be shareable.

D3  High level design documents:
    a. Mapping of BioGrid datasets to RIF-CS.
    b. Design of service to generate RIF-CS.
    c. Process descriptions, including ethics approvals, ETL processes, deidentification and metadata annotation.

D4  Five sample Collection descriptions, with associated Service, Party and Activity descriptions, in ARDC.

D5  Deployed, tested, documented system that:
    a. Extracts metadata from datasets in BioGrid.
    b. Allocates persistent identifiers where needed.
    c. Allows that metadata to be enriched by user input.
    d. Provides access for authorised external users to the datasets.
    e. Provides an OAI-PMH feed of ANDS-compliant RIF-CS.
    f. Allows datasets of different levels of shareability to be managed.

D6  Collection descriptions with associated Service, Party and Activity descriptions for ten multi-modal datasets, with all descriptions produced by deployed system and available for harvest by ARDC.
    a. As many descriptions as possible should contain links to, or access information for, immediately shareable data.

D7  Deposit of all developed software in agreed open source repository, accompanied by developer manuals to facilitate reuse.



DC3G: Founders and Survivors Project

11 Project Purpose and Activities

The Founders and Survivors Project (http://www.foundersandsurvivors.org/) has brought together a number of research data sets created from records relating to the 73,000 convicts transported to Tasmania in the 19th century and their descendants to create a population database of national and international significance for historical, demographic and population health researchers.


This project will:



Develop a toolkit based around the project’s XML/TEI workflow for further relevant record sets to be systematically ingested into the population database,

Build the infrastructure to enable persistent identification and descriptions of derived data sets produced on request from the population database to be made available to the ARDC.
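
The ingestion toolkit described above is built around an XML/TEI workflow. As a rough illustration of one ingest step, the Python sketch below pulls person names out of a TEI document. It assumes the records use the standard TEI namespace and mark names with persName elements; the project's actual encoding conventions are not given in the source, and the sample document and names are invented.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; whether the project's records use it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_person_names(tei_xml):
    """Return the text of every persName element, with whitespace normalised."""
    root = ET.fromstring(tei_xml)
    names = []
    for element in root.iterfind(".//tei:persName", TEI_NS):
        # itertext() also collects text from nested elements such as forename/surname.
        text = " ".join("".join(element.itertext()).split())
        if text:
            names.append(text)
    return names


# Invented sample record, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName><forename>Jane</forename> <surname>Example</surname></persName> arrived with
       <persName>John Placeholder</persName>.</p>
  </body></text>
</TEI>"""
```

Calling extract_person_names(SAMPLE) returns ["Jane Example", "John Placeholder"]. A full ingest step would go on to match each extracted name against existing entries in the population database, which is outside this sketch.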


12 Deliverables


Deliverable

D1  Project plan agreed by ANDS, in ANDS standard template format.

D2  High level design documents:
    a. Description of automated processes for deriving, describing and publishing ingest data sets.
    b. Description of automated processes for deriving, describing and publishing derived data sets.
    c. Mapping of data set metadata to RIF-CS.

D3  Five sample Collection descriptions, with associated Party and Activity descriptions, representing both ingest and derived datasets, in ARDC.

D4  Deployed system that:
    a. Generates RIF-CS Collection, Party and Activity descriptions from derived and ingest datasets.
    b. Includes an extraction and ingestion toolkit for researchers to incorporate in their data collection workflows to facilitate the production of ingest data sets for the population database.

D5  User documentation for all types of users.

D6  Collection, party and activity descriptions for 20 ingest data sets that meet ANDS requirements, produced by the deployed system and available for harvest by the ARDC.
    a. As many descriptions as possible should contain links to, or access information for, immediately shareable data.

D7  Source code for all developed software published to open source repository, with developer documentation to facilitate reuse.