Standardizing Metadata Associated with NIAID ... - Ird-dev.org


4 Oct 2013


Richard H. Scheuermann, Ph.D.

Department of Pathology

Division of Biomedical Informatics

U.T. Southwestern Medical Center

Standardizing Metadata Associated with NIAID Genome Sequencing Center
Projects and their Implementation in NIAID Bioinformatics Resource Centers

N01AI2008038

N01AI40041

Richard H. Scheuermann, Ph.D.

Director of Informatics

J. Craig Venter Institute


Genome Sequencing Centers for Infectious Disease (GSCID)

Bioinformatics Resource Centers (BRC)

www.viprbrc.org

www.fludb.org

High Throughput Sequencing


Enabling technology


Epidemiology of outbreaks


Pathogen evolution


Host range restriction


Genetic determinants of virulence and pathogenicity


Metadata requirements


Temporal-spatial information about isolates


Selective pressures


Host species of specimen source


Disease severity and clinical manifestations

Metadata Submission Spreadsheets

[Figure: metadata submission spreadsheets, with numbered callouts 1 to 4]

Complex Query Interface


Metadata Inconsistencies


Each project was providing different types of metadata

No consistent nomenclature being used

Impossible to perform reliable comparative genomics analysis

Required extensive custom bioinformatics system development

GSC-BRC Metadata Standards Working Group

NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)

Develop metadata standards for pathogen isolate sequencing projects


Bottom up approach


Assemble into a semantic framework

GSC-BRC Metadata Working Groups


Metadata Standards Process


Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors

Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)

Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific

For each data field, provide a common set of attributes, including definitions, synonyms, allowed value sets (preferably using controlled vocabularies), and expected syntax

Merge subgroup core elements into a common set of core metadata fields and attributes

Assemble set of pathogen-specific and project-specific metadata fields to be used in conjunction with core fields

Compare, harmonize, and map to other relevant initiatives, including OBI, MIGS, MIxS, BioProjects, BioSamples (ongoing)

Assemble all metadata fields into a semantic network (ongoing)

Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)

Draft data submission spreadsheets to be used for all white paper and BRC-associated projects

Finalize version 1.0 metadata standard and version 1.0 data submission spreadsheet

Beta test version 1.0 standard with new white paper projects, collecting feedback
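
The field-definition and core-identification steps above can be sketched in code. This is a minimal illustration, not the working group's actual schema: the `FieldSpec` class, the `split_core_fields` helper, the example field names, the allowed values, and the date pattern are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldSpec:
    """Attributes attached to one metadata field (hypothetical structure)."""
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    allowed_values: Optional[set] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None          # regular expression, if any

    def is_valid(self, value: str) -> bool:
        """Check a submitted value against the vocabulary and the syntax."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


def split_core_fields(projects: dict) -> tuple:
    """Return (core, project_specific) field names for one pathogen subgroup.

    Core fields appear in every project; all others are project specific.
    """
    field_sets = [set(fields) for fields in projects.values()]
    core = set.intersection(*field_sets)
    project_specific = set.union(*field_sets) - core
    return core, project_specific


# Illustrative field definitions (assumed values, not taken from the standard).
collection_date = FieldSpec(
    name="collection_date",
    definition="Date on which the specimen was collected",
    synonyms=["sampling date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
)
host_sex = FieldSpec(
    name="host_sex",
    definition="Sex of the host organism",
    synonyms=["gender"],
    allowed_values={"male", "female", "unknown"},
)

# Illustrative metadata sets from two hypothetical virus projects.
virus_projects = {
    "project_a": ["collection_date", "host_species", "vaccination_status"],
    "project_b": ["collection_date", "host_species", "passage_history"],
}
core, specific = split_core_fields(virus_projects)
print(sorted(core))      # ['collection_date', 'host_species']
print(sorted(specific))  # ['passage_history', 'vaccination_status']
```

Keeping the vocabulary and syntax on the field definition itself means a submission spreadsheet could be checked column by column against the same specification that documents it.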

Data Fields:

Core Project

Core Sample

Attributes

[Figure: semantic network for Specimen Isolation.
Node labels: organism, environmental material, equipment, person, specimen, specimen source role, specimen capture role, specimen collector role, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, specimen X, specimen isolation procedure X, isolation protocol, name, geographic location, affiliation, ID, specimen type, specimen isolation procedure type, organism part, hypothesis, IRB/IACUC approval, environment, pathogenic disposition, gender, age, health status.
Relation labels: plays, has_specification, has_part, located_in, denotes, has_affiliation, instance_of, is_about, has_authorization, has_quality.
Core Sample field callouts: CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]
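
One way to read the Specimen Isolation network is as a set of subject-relation-object statements. The sketch below is illustrative only. The relation names (`instance_of`, `denotes`, `plays`, `has_quality`, `has_specification`, `located_in`) appear in the figure, but the pairings of nodes are assumptions, because the figure's edges are not recoverable from the text. The `Triple` type and the `objects_of` helper are hypothetical.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "relation", "obj"])

# Assumed pairings of figure labels; the real edges may differ.
network = [
    Triple("specimen X", "instance_of", "specimen type"),
    Triple("ID", "denotes", "specimen X"),
    Triple("organism", "plays", "specimen source role"),
    Triple("organism", "has_quality", "age"),
    Triple("organism", "has_quality", "health status"),
    Triple("specimen isolation procedure X", "has_specification",
           "isolation protocol"),
    Triple("specimen isolation procedure X", "located_in",
           "temporal-spatial region"),
]


def objects_of(triples, subject, relation):
    """Return every object linked to `subject` by `relation`, in list order."""
    return [t.obj for t in triples
            if t.subject == subject and t.relation == relation]


print(objects_of(network, "organism", "has_quality"))
# ['age', 'health status']
```

Storing each core field as a statement about an entity, rather than as a bare column name, is what lets project-specific fields attach to the same entities later.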

Metadata Processes

[Figure: semantic network for the metadata processes.
Panels: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Quality Assessment, Investigation.
Node labels: specimen source (organism or environmental), specimen collector, specimen, specimen isolation process, isolation protocol, microorganism, sample processing, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly), sequence data, data transformations (variant detection, serotype marker detection, gene detection), genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, quality assessment assay, type, ID, qualities, temporal-spatial region.
Relation labels: has_specification, has_part, has_input, is_about, denotes, located_in, has_quality, instance_of.]
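
The process panels in the Metadata Processes figure suggest a chain in which each process consumes the product of the previous one. The sketch below is one possible reading, not the figure's confirmed structure. The ordering of steps is an assumption, and so is the `output` attribute, since only `has_input` appears among the relation labels. The `Process` class and `trace_back` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    name: str
    has_input: str  # relation label taken from the figure
    output: str     # assumed attribute, not a label in the figure


# Assumed ordering of the process steps named in the figure.
pipeline = [
    Process("specimen isolation process", "specimen source", "specimen"),
    Process("sample processing", "specimen", "enriched NA sample"),
    Process("sequencing assay", "enriched NA sample", "primary data"),
    Process("data transformations", "primary data", "sequence data"),
    Process("data archiving process", "sequence data", "sequence data record"),
]


def trace_back(processes, artifact):
    """List the processes that led to `artifact`, most recent first."""
    produced_by = {p.output: p for p in processes}
    chain = []
    while artifact in produced_by:
        step = produced_by[artifact]
        chain.append(step.name)
        artifact = step.has_input
    return chain


print(trace_back(pipeline, "sequence data record"))
# ['data archiving process', 'data transformations', 'sequencing assay',
#  'sample processing', 'specimen isolation process']
```

With this kind of chain, a sequence record can be linked back to the specimen and its collection metadata without repeating those fields at every step.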

Outcome of Metadata Standards WG


Consistent metadata captured across GSCIDs

Guidance to collaborators regarding metadata expectations for sequencing and analysis services

Support more standardized BRC interface development

Harmonization with related stakeholders: Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioSample

Represented in the context of an extensible semantic framework

Conclusions


Metadata standards for microorganism sequencing projects

Bottom up approach focuses standard on important features

Harmonizing with related standards from the Genomic Standards Consortium, OBO Foundry and NCBI

Being beta-tested by GSCIDs for adoption by all NIAID-sponsored sequencing projects

Utility of semantic representation

Identified gaps in data field list (e.g. temporal components)

Includes logical structure for other, project-specific, data fields (extensible)

Identified gaps in ontology data standards (use case-driven standard development)

Identified commonalities in data structures (reusable)

Support for semantic queries and inferential analysis in future

Ontology-based framework is extensible

Sequencing => "omics"
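
As a hedged illustration of the semantic queries and inferential analysis mentioned in the conclusions, the sketch below answers a country-level query for isolates recorded only at city level. It assumes `located_in` is treated as a transitive relation. The isolate identifiers, the `located_in_closure` and `isolates_in` helpers, and the data layout are hypothetical.

```python
def located_in_closure(direct, start):
    """Return every region reachable from `start` by following located_in."""
    seen = set()
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for parent in direct.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen


def isolates_in(direct, region, isolates):
    """Return the isolates whose inferred locations include `region`."""
    return [i for i in isolates if region in located_in_closure(direct, i)]


# Directly asserted located_in statements (illustrative data).
direct_located_in = {
    "isolate_001": ["Dallas"],
    "isolate_002": ["Toronto"],
    "Dallas": ["Texas"],
    "Texas": ["United States"],
    "Toronto": ["Ontario"],
    "Ontario": ["Canada"],
}

print(isolates_in(direct_located_in, "United States",
                  ["isolate_001", "isolate_002"]))
# ['isolate_001']
```

No record states that `isolate_001` is in the United States; the answer follows from chaining the asserted statements, which a flat spreadsheet column cannot do.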