Metadata Activities in Biology



INIGO SAN GIL

LTER Network Office, Department of Biology, University of New Mexico, Albuquerque, New Mexico, USA

VIVIAN HUTCHISON

National Biological Information Infrastructure, USGS Western Fisheries Research Center,
Seattle, Washington, USA

MIKE FRAME

National Biological Information Infrastructure, USGS Center for Biological Informatics,
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA

GIRI PALANISAMY

Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA


The National Biological Information Infrastructure program has advanced the biological
sciences' ability to standardize, share, integrate and synthesize data by making the metadata
program a core of its activities. Through strategic partnerships, a series of crosswalks for
the main biological metadata specifications has enabled data providers and international
clearinghouses to aggregate and disseminate tens of thousands of metadata sets describing
petabytes of data records. New efforts at the National Biological Information Infrastructure
are focusing on better metadata creation and curation tools, semantic mediation for data
discovery and other related initiatives.


KEYWORDS: metadata, metadata creation, metadata curation, metadata quality control, data
management

INTRODUCTION


The National Biological Information Infrastructure (NBII) [1] facilitates access to and use of
biological information. The NBII relies heavily on quality metadata to serve biological
information, and it follows the metadata directives outlined in the nineties (United States
Executive Order 12906 [2]). Ever since, the NBII has been constantly adapting to the profound
changes in technologies related to communication and to digital data processing and storage.
The NBII achieved its goal of unifying diverse, high-quality biological databases, information
products, and analytical tools by forging partnerships with government agencies, academic
institutions, non-government organizations, and private industry. In this paper, we review
some of the metadata-related achievements and experiences of the NBII, with a particular focus
on the case example of the partnership with NSF's Long Term Ecological Research Program
(LTER) [3].


Background

The United States Geological Survey (USGS), home to the NBII program, has been a leader in
establishing global infrastructures, standards, and collaborations in support of biological
data management and delivery. The NBII, begun in 1993 (Sepic and Kase, 2002), is a broad,
collaborative program to provide increased access to data and information on the nation's
biological resources. Coordinated by the USGS, the NBII links diverse, high-quality biological
databases, information products, and analytical tools maintained by NBII partners and other
contributors in government agencies, academic institutions, non-government organizations, and
private industry. NBII partners and collaborators also work on new standards, tools, and
technologies that make it easier to find, integrate, and apply biological resources
information. Resource managers, scientists, educators, and the general public use the NBII to
answer a wide range of questions related to the management, use, or conservation of this
nation's biological resources.

For the past several years, the NBII program has centered its work on forging collaborative
partnerships with biology and ecology community stakeholders, with special emphasis on
facilitating interoperability among all organizations that produce biological data. Scientific
data is the key component to advance science, but data without context lacks value (Michener,
2005). Here, we call metadata the context of scientific data (see Hodge (2001) for an
introduction to metadata). There was no nationally coordinated effort to capture biological
metadata until the NBII program started. Before NBII, some collectives, groups and
organizations captured mostly unstructured biological metadata. These early uncoordinated
metadata efforts sometimes preserved the value of the data, but achieved little
interoperability and data sharing. Interoperable data means, in this context, that the data
and metadata are available to and usable by applications and services developed by other
agencies, organizations, and the public in general. Interoperable metadata should therefore be
expressed in a common specification used by many groups, entities, agencies and even
countries. The technological advances of the last decade provided us with affordable
infrastructure for documenting and preserving data beyond the realm of scientific
publications. However, metadata awareness, and the culture shift it entails, needs more than
infrastructure. NBII's efforts are focused both on deploying the infrastructure needed to
nurture the metadata lifecycle and on providing training, quality control, and other metadata
services for organizations. NBII's metadata program has a strong educational component on the
value of metadata and the importance of scientific data management.



The Long Term Ecological Research network has been conducting ecological and biological
research for over twenty-five years. Over two thousand scientists have conducted ecological
and biological research (Hobbie, 2003) and monitored changes in our ecosystems. Arctic,
marine, alpine, desert, prairie and other habitats host a variety of LTER long-term
observations and scientific projects. The LTER network is centralized through a coordinating
office [4] and it covers 26 different locations [5] in the western hemisphere, with special
representation in the United States mainland. The long-term aspect of the LTER studies called
for a plan to preserve data and metadata for the long term. LTER has had data preservation
plans and mandates for a long time, and recently the LTER network adopted a metadata
specification as the network metadata standard [6]. The LTER information managers committee
wrote a metadata best practices document [7] to coordinate the metadata process.


NBII and LTER partnered in 2004 to foster interoperability between the programs and to
leverage and share resources and knowledge. A liaison position was created by the NBII Program
to make the interaction fluid enough to accomplish the goals set in the agencies' cooperative
agreement. Specific and immediate goals included sharing the wealth of metadata holdings of
both programs, harmonizing data standards, leveraging joint training activities, and jointly
supporting biological/ecological community data management initiatives.


Status at a glance

The NBII hosts a wealth of information in standardized metadata records in the NBII
Clearinghouse, a system for which Oak Ridge National Laboratory (ORNL) provides both technical
support and hosting services. Over seventy metadata providers account for over seventy-two
thousand metadata records placed in the NBII metadata clearinghouse. All records are publicly
accessible through a user-friendly interface and various web services. In the NBII metadata
Clearinghouse, term-based searches can be combined with advanced searches or catalog-style
browsing. Filtering on results is available to narrow the searches further. Metadata records
are pre-tagged using the NBII Biocomplexity thesaurus, and key metadata sections are indexed
to enhance the search results. Metadata records are classified into several categories
according to the potential functionality they can provide (San Gil, 2008). Revisions of the
current metadata and proper curation processes are underway. Also, a whole suite of tools to
ease the processes of metadata editing and entry is being deployed. Semantic mediation is also
being worked on at several levels through the incorporation of the NBII thesaurus
(http://thesaurus.nbii.gov) and its services, and also through the re-tagging of content with
thesaurus terms to enhance the metadata discovery process. Training for metadata curators,
principal investigators and information managers is also an integral part of the decade-long
NBII metadata program.



Paper outline

After this brief introduction, we offer details on the metadata program today, and we present
the current NBII data providers and an introduction to metadata quality levels. In the next
section, we provide an overview of the metadata tools being deployed and the steps we are
taking to make the metadata creation process simpler: administrative dashboards for the
metadata providers, linkages to related resources, and new editing tools. We discuss the
convergence of metadata standards. We conclude with a discussion of current challenges and
future tasks, along with plans to serve science with the best tools to inform the public.


BIOLOGICAL METADATA PROGRAM


In this section we present an overview of the metadata program in biological sciences, from
the point of view of the activities of the National Biological Information Infrastructure. We
present the goals of the program, its achievements, some technological details such as the
metadata standards used, some details about the biological metadata clearinghouse, the
metadata training programs and the projects on integrative biology that have spun off from
these efforts.


Metadata diversity in Biology


There are four major metadata specifications that are widely used by the ecological
community. These four specifications are implemented using the Extensible Markup Language
(XML). XML is an optimal vehicle for digital data interoperability among data centers. The
Federal Geographic Data Committee (FGDC [8]) maintains several specifications: the Content
Standard for Digital Geospatial Metadata and the FGDC profiles. The profiles are
customizations; for biology we use the Biological Data Profile, or BDP [9]. These are used
primarily by federal and state government agencies in the United States. ESRI [10] produces
FGDC-affine specifications. This commercial specification (used extensively by Geographic
Information Systems) is a superset of the FGDC. All these profiles and supersets are quite
interoperable, presenting few technical impediments to integrating most of the metadata
content across the diverse platforms using ESRI, FGDC and BDP. The Ecological Metadata
Language (EML [11]) is another comprehensive specification. EML is used throughout the ecology
research community, for example by the Ecological Society of America, the LTER Network and the
Organization of Biological Field Stations [12], among other organizations. Note that we have
used the word specification instead of standard to refer to these XML-based sets of contextual
rules. Technically, a standard needs to be vetted by an internationally recognized
organization devoted to standards.


Centralized Holdings for Biological Metadata


The NBII program serves over seventy-two thousand biological metadata records on the NBII
clearinghouse. More than seventy data providers (mostly organizations) regularly provide
metadata through a weekly harvesting program. Table 1 shows a summary of metadata holdings
divided by provider. In Figure 1, we show the relative composition of the NBII metadata
clearinghouse holdings as a function of the number of records per provider.

[Table 1] [Figure 1]


There are four main XML-based standards accepted for harvesting metadata: FGDC, EML, and
Dublin Core [13], with the addition of FGDC's Biological Data Profile and ESRI's FGDC-based
superset. Also, metadata records that are compliant with the Darwin Core [14] and with the ISO
19115 [15] and ISO 19137 [16] standards are accepted at the NBII clearinghouse.


This disparity of common metadata specifications and standards to structure and share metadata
poses a barrier to interoperability. The NBII, through a partnership with the LTER, released a
number of crosswalks [17] (specification translations) to operate seamlessly with any of these
major metadata standards. Some metadata crosswalks are not bi-directional. Some
specifications, such as the Dublin Core, are not as comprehensive as those mentioned above
when it comes to describing a scientific data resource. The Dublin Core Metadata Standard
typically collects such elements as Title, Author, Publisher, Contributor, and other elements.
However, a crosswalk from the FGDC to the Dublin Core may yield a complete Dublin Core record
when translating an FGDC metadata record. On the other hand, an FGDC translation of a Dublin
Core record may yield an incomplete metadata record. There are other problems with crosswalks
that require human intervention, for example resolving differences in granularity (San Gil et
al., 2010).
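
The asymmetry of a crosswalk can be illustrated with a minimal sketch. The record below is a
hypothetical, heavily abridged FGDC-style document, and the mapping table, the list of
required sections and the function names are assumptions made for illustration; they are not
the crosswalks released by NBII and LTER.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abridged FGDC-style record (illustration only).
FGDC_SAMPLE = """
<metadata>
  <idinfo>
    <citation>
      <citeinfo>
        <origin>Example Field Station</origin>
        <pubdate>2009</pubdate>
        <title>Stream temperature observations</title>
        <pubinfo><publish>Example Publisher</publish></pubinfo>
      </citeinfo>
    </citation>
    <descript><abstract>Hourly stream temperatures.</abstract></descript>
  </idinfo>
</metadata>
"""

# Assumed mapping: Dublin Core element -> path inside the FGDC-style record.
FGDC_TO_DC = {
    "title": "idinfo/citation/citeinfo/title",
    "creator": "idinfo/citation/citeinfo/origin",
    "publisher": "idinfo/citation/citeinfo/pubinfo/publish",
    "date": "idinfo/citation/citeinfo/pubdate",
    "description": "idinfo/descript/abstract",
}

# Assumed list of sections a fuller FGDC-style record is expected to carry.
FGDC_REQUIRED = list(FGDC_TO_DC.values()) + [
    "idinfo/spdom/bounding",  # spatial bounding coordinates
    "eainfo",                 # entity and attribute information
]


def fgdc_to_dublin_core(xml_text):
    """Translate the FGDC-style record into a flat Dublin Core-style dict."""
    root = ET.fromstring(xml_text)
    record = {}
    for dc_name, path in FGDC_TO_DC.items():
        node = root.find(path)
        if node is not None and node.text:
            record[dc_name] = node.text.strip()
    return record


def missing_after_reverse(dc_record):
    """List the FGDC paths that a Dublin Core record cannot populate."""
    reverse = {path: dc_name for dc_name, path in FGDC_TO_DC.items()}
    return [path for path in FGDC_REQUIRED if reverse.get(path) not in dc_record]
```

Under these assumptions, `fgdc_to_dublin_core(FGDC_SAMPLE)` fills every mapped Dublin Core
element, while `missing_after_reverse` applied to the result still reports the spatial and
attribute sections as empty: the reverse translation has nothing to draw on for them.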


The NBII hosts the largest clearinghouse for biological metadata, but it is not the only
clearinghouse. The Knowledge Network for Biodiversity also offers a network of biological
clearinghouses with structured metadata. Both of these clearinghouses allow faceted search,
term-based searches with geo-temporal qualifiers and filtering of results. In addition, all
the information resources hosted at the NBII clearinghouse are available to the following
specialized services: Geospatial One Stop, data.gov, Dryad, GEOSS and the raptor the NBII
federated search. The clearinghouse is open to the Google indexing services and other web
crawlers.

Trainings and outreach

A metadata program becomes stronger when combined with a training program. The NBII offers
metadata training through a dedicated Metadata Program Director, who delivers on average six
to eight trainings per year. There are two types of training. The first is a user training,
where the user becomes familiar with the concept of metadata, the many sections of biological
metadata, and the tools available for the metadata entry and editing process. The NBII also
offers another series of trainings focused on creating new metadata trainers ("train the
trainer").


The NBII supports numerous outreach opportunities, as the program mission is executed by
forging partnerships. NBII participates in many new biological initiatives, and networks with
most of the stakeholders in the biological/ecological communities. A typical outreach effort
consists of reaching out to a biological scientific organization or group to offer NBII
services and look for synergistic opportunities. Examples of such NBII services are long-term
storage for metadata, a centralized repository that gives NBII partners added visibility,
metadata trainings, and data management capabilities.



Metadata Driven data integration and data analysis


The NBII biological metadata program makes biological data discoverable for scientists and the
public in general. However, the NBII metadata program goes beyond data discovery. The program
is making efforts to enhance the quality of the metadata holdings to provide further
functionality to the data. Comprehensive specifications, such as the EML specification, enable
machine mediation for the interpretation and manipulation of the associated data. Some early
prototypes (San Gil et al., 2010) show the functional value of rich-content metadata. These
models inspired information management architectures where the data flow is based on the
underlying metadata. One of the frameworks proposed for LTER is the Provenance Aware Synthesis
Tracking Architecture, or PASTA (Servilla et al., 2006). One early instance that implemented
part of the PASTA framework is the EcoTrends project (Servilla et al., 2008). EcoTrends allows
the public to explore the time series data produced by the LTER sites since the corresponding
projects were conceived. EcoTrends provides access to more than twenty thousand time series
through a web portal [18]. EcoTrends has implemented only a small part of the metadata-driven
structure: the web portal application and a few metadata components. A full PASTA
implementation for EcoTrends would have required a thorough revision of the metadata and data
holdings. However, it is foreseen that most of the data applications at LTER will be based on
the PASTA framework. The success of these prototypes relies mostly on the foundations: the
quality and fidelity of the metadata. We discuss the challenges ahead, and what we are doing
to meet them, in the last section of this paper.



The Dryad repository (Greenberg et al., 2009) is a new initiative that has the long-term goal
of being semantically mediated while meeting the needs of today's community demands. Dryad
takes a de novo approach, avoiding the obstacles observed in legacy systems.



DataONE [19] is another large-scale project that relies on functional metadata. DataONE is a
virtual data center for biology, ecology, and the environmental sciences. DataONE uses
metadata and the data libraries that parse metadata to harvest data into a long-term
repository network. Quality, rich-content metadata will provide the functionality that will
contribute to the success of the workflows that ingest data into this distributed network of
repositories for the long term. DataONE is jointly supported by LTER and the USGS NBII
Programs.




METADATA TOOLS AND SERVICES


Metadata Entry and Metadata quality control



Today, the metadata creation or metadata entry process requires a substantial amount of human
resources. Many organizations, including the NBII and LTER, devote some resources to assist
with the metadata record creation process. It is not an easy task to evaluate the cost of
metadata. Lytras and Sicilia (2007) provide a framework to evaluate the metadata lifecycle;
however, accurate total costs remain elusive in large-scale programs. The NBII also manually
curates and revises metadata records for some of its partners, but mostly the task of
preserving and maintaining metadata is left to the providers. Quality metadata editing tools
are essential to the metadata maintenance process.


Metadata Creation Tools


Some in the biological community have used native XML editing tools such as XMLSpy [19] and
oXygen [20]. There are also customized metadata entry tools available to users. These tools
have an advantage over raw XML editing tools, as they are oriented to the actual rules and
conventions adopted by the back-end standard.



For the FGDC family of standards, we have several metadata editors. ArcCatalog is tightly
coupled with all the GIS tools served by ESRI; however, the editor can be used to produce
metadata records outside ESRI software applications. Other resourceful editors are TKME and
XTKME, which are easily coupled with the metadata analysis tool (validator and quality
control) called MP (for Metadata Parser). Metavist is a popular editor created by Rugg (2005),
and the National Park Service has released the NPS metadata editing tools. There are more
editing tools oriented to the FGDC standards; among those is MetaDoor, developed at the Baruch
Institute, which is web-based and has a good number of functionalities that facilitate the
metadata entry process.


Morpho [21] is a wizard-like tool for the creation of EML-compliant metadata records. Morpho
also has a tree-based (hierarchical) editor that loosely resembles the XMLSpy functionality.
Morpho was developed under NSF's SEEK program, and it is an open source editor. The Morpho
tool is tied to Metacat, a hybrid metadata server. Before a record is inserted in the Metacat
repositories, there is a validation process that goes a bit beyond mere XML-schema compliance
tests. Some human intervention is needed to assess the quality and accuracy of the content
entered.
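
A minimal sketch can illustrate what validation "beyond schema compliance" might look like.
It is not Metacat's actual process: the required paths, the abstract-length threshold and the
simplified, namespace-free record layout are assumptions. The Python standard library does not
validate against an XML Schema, so the first stage here only checks well-formedness.

```python
import xml.etree.ElementTree as ET

# Hypothetical content rules; a real repository would define its own.
REQUIRED_PATHS = ["dataset/title", "dataset/creator", "dataset/abstract"]
MIN_ABSTRACT_WORDS = 5


def check_record(xml_text):
    """Return a list of problems; an empty list means the automated checks passed."""
    # Stage 1: the document must at least be well-formed XML.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    # Stage 2: content checks that a schema alone would not enforce.
    problems = []
    for path in REQUIRED_PATHS:
        node = root.find(path)
        text = "".join(node.itertext()).strip() if node is not None else ""
        if not text:
            problems.append(f"missing or empty: {path}")

    abstract = root.find("dataset/abstract")
    if abstract is not None:
        words = "".join(abstract.itertext()).split()
        if 0 < len(words) < MIN_ABSTRACT_WORDS:
            problems.append("abstract too short to be informative")
    return problems


# Simplified, hypothetical records used only to exercise the checks.
GOOD = (
    "<eml><dataset><title>Lake ice phenology</title>"
    "<creator><surName>Doe</surName></creator>"
    "<abstract>Ice-on and ice-off dates for one hypothetical lake.</abstract>"
    "</dataset></eml>"
)
BAD = "<eml><dataset><title> </title><abstract>Data.</abstract></dataset></eml>"
```

Checks of this kind only flag candidates for review. Whether a title is accurate or an
abstract is truthful still requires the human intervention described above.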


The ISO standards are proprietary; this places another barrier in the way of the already
difficult task of establishing an interoperable metadata program and a shift in culture in the
ecological-biological community. Yet the ISO 19115 North American Profile will be required by
the Federal Government as the next iteration of a metadata standard, replacing the current
FGDC Standard.


NBII is always looking for better tools to ease the metadata lifecycle experience. Renewed
joint efforts at NBII, ORNL and LTER produced (Aguilar et al., 2009) the seminal work for a
more integrative tool set.


There is a difference between content management system based editors and all the editors
mentioned above. A content management system provides the organization with a tool set that
relates all the information managed, whether photos, videos, data, projects, personnel
directories or publication lists. The comprehensive metadata fields covered by EML and FGDC
intersect much of the typical information content (see Figure 2) managed at biological field
stations, LTER sites, NBII Regional/Thematic nodes and the like. This content overlap raises
the question: where does metadata begin, and where does it end? This question is better
addressed when we look at the expected functionality of the record. For example, a metadata
record can be defined as the set of descriptive fields that encompasses the bibliographical
reference of a journal citation: minimal personal information (such as first name, last name)
and a title and locator for the journal, volume, number and pages. However, we have seen here
that from the data integration and synthesis point of view, a metadata record needs to be
extensive. In the context of an information system, the metadata record concept may be
replaced by all the information that is related to a dataset, extending the functionality
(data mashups) and improving it (expanded and enhanced ability to discover the data). In this
context, we developed a basic content-management-based system to handle biological metadata.


The content management system approach (Van de Weerd et al., 2006) enables us to connect all
the contextual information that relates directly and indirectly to each dataset hosted by the
system. We chose Drupal because 1) it is a widely used Content Management System, 2) it has
all the functionality we need to serve the information of any biological field station, 3) it
is free, and 4) it is very easy to configure, maintain and use. This set of user and server
requirements is often found when surveying LTER sites (San Gil et al., 2008). Furthermore, by
decoupling the metadata from the particulars of a standard, we are able to serve the metadata
in any of the flavors mentioned above (FGDC, EML, Dublin Core, etc.). Standards convergence is
desirable for interoperability, as stated repeatedly in this paper. Our system provides
built-in semantic mediation (tagging) that enables us to uncover relations with data and
metadata that are not directly connected to a particular person, publication or geo-temporal
constraint. Details on this tool will be published soon, although some demos [22] are already
available for testing and in production. Controlled vocabularies and taxonomies are another
advantage of this system. The system does not implement a full-fledged ontology; however, it
is well suited to accommodate all the folksonomies created over the life of LTER sites, and it
gives us an opportunity to advance a network-wide controlled vocabulary. Rather than trying to
impose a top-down system on a community with inertia, we work towards a unified vocabulary
that emerges from the local terminologies, which are gradually merged through tools such as
the NBII thesaurus [23].
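
The bottom-up merging of local terminologies can be sketched as follows. The thesaurus
fragment, the site names and the function names are hypothetical; the sketch does not use the
NBII thesaurus services, and only illustrates how local tags could be folded into preferred
terms while unmatched tags are set aside for review.

```python
from collections import Counter

# Hypothetical thesaurus fragment: preferred term -> local variants seen at sites.
THESAURUS = {
    "precipitation": {"rainfall", "rain", "precip"},
    "primary production": {"npp", "net primary productivity", "primary productivity"},
}


def build_lookup(thesaurus):
    """Index every variant (and the preferred term itself) by its preferred term."""
    lookup = {}
    for preferred, variants in thesaurus.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant] = preferred
    return lookup


def merge_local_tags(site_tags, thesaurus=THESAURUS):
    """Map local tags to preferred terms; return (usage counts, unmatched tags)."""
    lookup = build_lookup(thesaurus)
    counts, unmatched = Counter(), set()
    for tags in site_tags.values():
        for tag in tags:
            key = " ".join(tag.lower().split())  # normalize case and spacing
            if key in lookup:
                counts[lookup[key]] += 1
            else:
                unmatched.add(key)
    return counts, sorted(unmatched)
```

For example, with the tags `{"site_a": ["Rainfall", "NPP"], "site_b": ["precip", "soil  moisture"]}`
the two rainfall variants collapse into "precipitation", and "soil moisture" is returned as
unmatched, that is, as a candidate term for the vetting step.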




CHALLENGES


Establishing a comprehensive metadata program is expensive (Lytras and Sicilia, 2007), and
large-scale metadata solutions that come from idealized scenarios have fallen short of their
initial goals in ecological and biological programs. We have sought pragmatic approaches to
standardize legacy holdings (San Gil et al., 2008), a strategy that has given us success where
earlier approaches to metadata tools yielded just a vanished promise (Berkley, 2003). For
years metadata clearinghouses have been considered “black holes” (Schnase et al., 2003; Maier
et al., 2001). Data providers, PIs, and data managers have contributed metadata to various
clearinghouses with little or no rationale as to why, for what benefits, or with what
recognition.


However, an analysis of the standardization processes showed what remains to be done in order
to take advantage of metadata-driven applications (San Gil and Baker, 2007). The automated
metadata standardization processes inherit all the flaws of the legacy metadata sources. No
ontology or optimal metadata schema can fix a poor-quality metadata record; the metadata entry
process and metadata conversion process need to be revised. Specialized ontologies may fail to
classify incomplete records. Specifications with inadequate granularity hamper the integration
process.



Ambiguous measurement classifications (Velleman and Wilkinson, 1993) adopted by EML hinder the
proper use of the standard. FGDC's poor measurement classification ("unrepresentable domain")
is a popular measurement classification option that leads to similar results. Is there a need
to define another XML-based metadata transport specification? Consider instead the following
approaches to reduce the measurement classification entropy.

1) Adopt some Best or Common Practices (San Gil et al., 2010), a practical guide for metadata
users with clear examples that clarify ambiguous classifications.

2) Evolve existing standards. Improve the current standard by leveraging the successful,
coherent practical uses of such problematic specification sections.

In practice, implementing and revising Best Practices is often easier and faster than revising
established standards. In our experience, evolving a standard may require anywhere from two
years (EML, minor revision) to many more years (new ISO North American Profile).


Ontologies, thesauri, controlled vocabularies or folksonomies? We advanced in the tools
section our bottom-up approach to the semantic web implementation. There have been some
substantial efforts in LTER-Europe to create an ontology-based system for metadata cataloguing
and management (van der Werf, 2009). Efforts at NCEAS, and some other explorations at LTER,
have yet to reach maturity. A usable mainstream tool with a generally applicable purpose in
ecology may be still some time away, but we can easily implement discovery functionality with
simple controlled vocabularies. The LTER has had an active controlled vocabulary group since
2005. The group was not successful at finding the vetting mechanism (an authority) to elevate
a selected set of terms to a network-level vocabulary. Porter's efforts (2009) led to a
candidate set that has undergone a series of selection criteria. Perhaps the absence of a
specific application that would test the validity of a vocabulary (or an application thereof)
is another impediment to the advance of the semantic tools at LTER and in ecology.


The completeness factor: Metadata records describing a data resource as a bibliographical
reference can be used for basic local cataloging. Richer records, containing keywords,
abstracts and geo-temporal information, have better discovery functionality. Geo-tagged
records enable data discovery through geo-fielded searches. Dated annotations (when the study
was conducted, when the study was published, and when the metadata was created and last
edited) help narrow searches in large catalogs. From the point of view of data integration,
any incomplete or inaccurate record prevents automatic aggregation.
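
The progression from cataloging to discovery to integration can be expressed as a small
classification sketch. The field names and level labels below are hypothetical and do not
reproduce the categories of San Gil (2008); the sketch only shows how completeness could be
mapped to the functionality a record supports.

```python
# Hypothetical field names and level labels, for illustration only.
LEVELS = [
    ("catalog", ("title", "creator")),
    ("discovery", ("keywords", "abstract")),
    ("geo-temporal discovery", ("bounding_box", "study_dates")),
    ("integration", ("attributes", "units")),
]


def functionality_level(record):
    """Return the highest level whose fields, and those of all lower levels, are filled."""
    reached = "incomplete"
    for label, fields in LEVELS:
        if not all(record.get(field) for field in fields):
            break  # a gap at this level blocks every higher level
        reached = label
    return reached
```

A record with only a title and a creator is classified as "catalog". A record that documents
attributes and units but lacks geo-temporal coverage stops at "discovery", since in this
sketch each level builds on the ones below it.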



Another important challenge is data access mediation. In genetics, a researcher needs to
upload any new sequences to a centralized repository (such as GenBank [24] or DDBJ [25])
before publication of the sequence-related scientific results in peer-reviewed journals. This
sharing policy has been key to the rapid advancement of research in genetics. Data sharing has
been shown to increase the number of citations for those publications associated with the open
data (Piwowar et al., 2007). However, in ecology, the practice of unfettered data access is
still not mainstream. Some organizations still place login barriers to access data [26],
obstructing the path to open, semi-automated data synthesis.


Scientific Units: Disparity of units affects all sciences (e.g., NASA's Mars orbiter [27]). At
LTER, the surprising diversity of units makes the data integration process laborious and
expensive. The LTER has a unit task force working on reducing the unit spread and creating a
unit best practice document and a central unit repository [28].
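
A central unit repository could, in its simplest form, behave like the lookup sketched below.
The registry entries, the canonical unit names and the function name are assumptions for
illustration and are not taken from the LTER unit task force.

```python
# Hypothetical registry: local spelling -> (canonical unit, multiplier to canonical).
UNIT_REGISTRY = {
    "mm": ("meter", 0.001),
    "millimeter": ("meter", 0.001),
    "millimeters": ("meter", 0.001),
    "cm": ("meter", 0.01),
    "m": ("meter", 1.0),
    "meters": ("meter", 1.0),
    "g/m2": ("kilogramPerMeterSquared", 0.001),
    "kg/ha": ("kilogramPerMeterSquared", 0.0001),
}


def to_canonical(value, unit):
    """Convert a value to the canonical unit, or fail loudly on unknown units."""
    key = unit.strip().lower()
    if key not in UNIT_REGISTRY:
        raise KeyError(f"unit not in registry: {unit!r}")
    canonical, factor = UNIT_REGISTRY[key]
    return value * factor, canonical
```

Raising an error for unregistered units, rather than passing values through unchanged, keeps
undocumented units from silently entering an integrated dataset.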


Perhaps another grand challenge is the dispersion of information. We are witnessing the birth
of new metadata repositories, data repositories [18], project databases [29], social networks
and the like. Potentially, all the information resources can be interconnected down to the
atomic level (the record level), using SOAP [30] web services or, perhaps better, RESTful
services (Pautasso et al., 2007). In practice, many of these instances wind up disconnected,
with similar content, lacking minimal integration even with local, available information.


Versioning system: As time passes, information changes in content, representation and format.
Van de Sompel (2009) proposes an enhanced protocol to facilitate temporal access on the web;
however, it would be up to the resource providers to include snapshots and versions of
previous metadata.

Genomics and Ecology: Novel affordable sequencing technologies (Amaral-Zettler et al., 2008)
are tackling projects in situ rather than in the lab, creating a need for added metadata that
transcends the classic GenBank minimum metadata requirements. The Genomic Standards Consortium
(GSC [31]) is coordinating the flurry of newly needed metadata activities in genomics and
ecology. Often, what is considered data in Ecology is classified as metadata in Genomics,
which is reflected in the metadata specifications produced (Field et al., 2008). San Gil
(2008) describes some challenges ahead in making the genomics and ecological standards
interoperable.





THE METADATA ROAD AHEAD


Immediate plans call for actions to improve the NBII Clearinghouse functionality and overall
community metadata management. The NBII plans to add a data provider administrative dashboard,
establish a process to enhance the metadata records (quality and content volume), support the
metadata tools that facilitate editing, curation and data discovery, continue the aggregation
of new metadata records into the central repository, and improve the user experience. We are
also harmonizing the existing ecological and biological standards with the wealth of standards
in the genomics fields. We will test the functionality of the genomics GCDML and EML
convergence for a multi-location aquatic microbial census (Amaral-Zettler et al., 2009). Plans
to integrate the metadata for the scientific data with other contextual information, such as
the NBII Thesaurus, the NBII LIFE (Library of Images from the Environment) photo gallery and
other content, are being considered. Integration with such activities as the Integrated
Taxonomic Information System (ITIS) will establish relationships between metadata and species
information. New metadata tools will be deployed for the community to use and will also be
available to organizations that would like to benefit from them.


Additionally, social and cultural changes are needed to provide rewards and recognition for
those researchers and investigators who are incorporating proper data management practices and
policies within their research activities. The USGS NBII Program is leading this effort, along
with other organizations, in establishing metadata activities as core components of the
research process.


ACKNOWLEDGMENTS


Inigo San Gil acknowledges the support of the National Biological Information Infrastructure and Long Term Ecological Research Cooperative Agreement. Cross-disciplinary (genomics & ecology) metadata initiatives are sponsored through the support of a National Science Foundation Research Coordination Network Program (RCN4GSC).

NOTES

1 See the National Biological Information Infrastructure main portal at http://www.nbii.gov

2 Executive Order pertinent to US metadata mandates, resource at http://www.archives.gov/federal-register/executive-orders/pdf/12906.pdf

3 See the Long Term Ecological Research Network website at http://lternet.edu

4 See the LTER Network Office services at http://lno.lternet.edu

5 LTER sites (or nodes) at http://www.lternet.edu/sites/

6 Harmon, M. Motion to adopt EML at the Coordinating Committee. (2003) http://intranet.lternet.edu/archives/documents/reports/Minutes/lter_cc/Spring2003CCmtng/Spring_03_CC.htm

7 EML Best Practices document can be found at:
http://harvardforest.fas.harvard.edu/data/doc/emlbestpractices_oct2004.pdf

8 ORNL DAAC. The Oak Ridge National Laboratory Distributed Active Archive Center, at http://daac.ornl.gov/

8 FGDC: The Federal Geographic Data Committee, resource at http://fgdc.gov

9 BDP. The Biological Data Profile, an expanded XML document type definition of the FGDC standard. Find it at: http://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/biometadata/biodatap.pdf

10 ESRI. A dominant software application company that occupies the Geographic Information Systems sector. http://esri.com

11 EML. The guidelines for EML use are at http://knb.ecoinformatics.org/software/eml/eml-2.0.1/index.html

12 OBFS. The Organization of Biological Field Stations portal is at http://obfs.org/

13 Dublin Core. Explore this metadata resource at http://dublincore.org

14 Darwin Core. One of the hosts of this biodiversity-oriented metadata schema can be found at http://digir.net/schema/conceptual/darwin/manis/1.21/darwin2.xsd

15 ISO 19115. An international metadata standard. http://www.iso.org/iso/catalogue_detail.htm?csnumber=260205

16 ISO 19137. Another ISO standard to cover geographic metadata. http://www.iso.org/iso/catalogue_detail.htm?csnumber=32555

17 See for example the ESRI and BDP to EML crosswalk page at http://intranet.lternet.edu/im/project/Esri2Eml

18 DataONE portal at http://dataone.org

19 The XMLSpy XML editor (2010). Obtained at http://www.altova.com/products/xmlspy/xml_editor.html

20 The oXygen XML editor (2010). Obtained at http://www.oxygenxml.com

21 Peter Schweitzer (USGS) developed the TKME and XTKME editors. Free download at http://www.sco.wisc.edu/wisclinc/metatool/tkme.htm

22 Examples of systems implementing a Drupal-based editor at http://tierra.unm.edu, http://inigo.lternet.edu/drupal614 and http://gorilla.ites.upr.edu

23 NBII Thesaurus tool and web client accessible at http://thesaurus.nbii.gov

24 Genbank. http://www.ncbi.nlm.nih.gov/Genbank/

25 DDBJ, the DNA data bank of Japan. http://www.ddbj.nig.ac.jp/

26 The Data Access Server. Description and resource at http://lno.lternet.edu/projects/das

27 The NASA mishap. Resource at http://instruct.westvalley.edu/vaughn/p10/homework/000_p4a_metric_mishap_caused_loss_of_nasa_orbiter.pdf

28 See proposal for the Unit Task Force at http://lno.lternet.edu/files/lno/LTER%20Unit%20Registry%2009uwg_postAsmProposal_final.pdf

29 The LTER project DB. http://www.lternet.edu/news/Article224.html

30 SOAP. http://www.w3.org/TR/soap/

31 Genomic Standards Consortium. http://gensc.org/

REFERENCES


Aguilar, R., Pan, J., Gries, C., San Gil, I. and Palanisamy, G. (2009). A flexible online metadata editing and management system. Ecological Informatics, 5(1), 26-31.


Amaral-Zettler, L., Peplies, J., Ramette, A., Fuchs, B., Ludwig, W. and Glöckner, F.O. (2008). Proceedings of the international workshop on Ribosomal RNA technology, April 7-9, 2008, Bremen, Germany. Syst and Appl Microbiol. 31, 258-268


Amaral-Zettler, L.A., McCliment, E.A., Ducklow, H.W. and Huse, S.M. (2009). A Method for Studying Protistan Diversity Using Massively Parallel Sequencing of V9 Hypervariable Regions of Small-Subunit Ribosomal RNA Genes. Public Library of Science ONE. 4(7). Resource at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2711349/


Berkley, C. (2003). Monarch: metadata-driven analytical processing. Databits. Spring 2003, 8-15


Cardoso, J. (2009). Metadata and Semantics. Springer. Ed: Sicilia, M-A, and Lytras, M.D. p.254-255.


Field, D., Garrity, G., Gray, T., Morrison, N., Selengut, J., Sterk, P., Tatusova, T., Thomson, N., Allen, M.J., Angiuoli, S.V., Ashburner, M., Axelrod, N., Baldauf, S., Ballard, S., Boore, J., Cochrane, G., Cole, J., Dawyndt, P., De Vos, P., dePamphilis, C., Edwards, R., Faruque, N., Feldman, R., Gilbert, J., Gilna, P., Glöckner, F.O., Goldstein, P., Guralnick, R., Haft, D., Hancock, D., Hermjakob, H., Hertz-Fowler, C., Hugenholtz, P., Joint, I., Kagan, L., Kane, M., Kennedy, J., Kowalchuk, G., Kottmann, R., Kolker, E., Kravitz, S., Kyrpides, N., Leebens-Mack, J., Lewis, S.E., Li, K., Lister, A.E., Lord, P., Maltsev, N., Markowitz, V., Martiny, J., Methe, B., Mizrachi, I., Moxon, R., Nelson, R., Parkhill, J., Proctor, L., White, O., Sansone, S., Spiers, A., Stevens, R., Swift, P., Taylor, C., Tateno, Y., Tett, A., Turner, S., Ussery, D., Vaughan, B., Ward, N., Whetzel, T., San Gil, I., Wilson, G. and Wipat, A. (2008). Towards a richer description of our complete collection of genomes and metagenomes: the “Minimum Information about a Genome Sequence” (MIGS) specification. Nature Biotechnology 26, 541-547

Greenberg, J., White, C. H., Carrier, S. and Scherle, R. (2009). A Metadata Best Practice for a Scientific Data Repository. Journal of Library Metadata. 9(3), 194-212.


Harmon, M. (2003). Motion to adopt EML at the Coordinating Committee. http://intranet.lternet.edu/archives/documents/reports/Minutes/lter_cc/Spring2003CCmtng/Spring_03_CC.htm


Hobbie, J.E. (2003). Scientific Accomplishments of the Long Term Ecological Research Program: An Introduction. BioScience, 53(1), 17-20


Hodge, G.M. (2001). Metadata made simpler. NISO Press, Bethesda, MD


Lytras, M. D. and Sicilia, M.A. (2007). Where is the value in metadata? International Journal of Metadata, Semantics and Ontologies. 2(4), 235-241

Maier, D., Landis, E., Cushing, J., Frondorf, A., Silberschatz, A., Frame, M. and Schnase, J.L. (2001). Research directions in biodiversity and ecosystem informatics. Report of an NSF, USGS, NASA Workshop on Biodiversity and Ecosystem Informatics (NASA Goddard Space Flight Center, June 22-23, 2000, Greenbelt, Maryland), 30 pp.

Michener, W. K. (2005). Meta-information concepts for ecological data management. Ecological Informatics 1(1), 3-7

Pautasso, C., Zimmermann, O. and Leymann, F. (2008). RESTful Web Services vs. Big Web Services: Making the Right Architectural Decision. Conference paper. 17th International World Wide Web Conference (WWW2008), Beijing, China. Resource available at http://doi.acm.org/10.1145/1367497.1367606

Piwowar, H.A., Day, R.S. and Fridsma, D.B. (2007). Sharing detailed research data is
associated with increased citation rate. PLoS One. 2(3). Resource at
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1817752/

Porter, J. (2009). Developing a Controlled Vocabulary for LTER Data. Databits. Fall 2009 issue. Resource at http://databits.lternet.edu/node/70

Rugge, D. J. (2005) Creating FGDC and NBII metadata using metavist 2005. Technical
Report http://ncrs.fs.fed.us/pubs/gtr/gtr_nc255.pdf


Sepic, R. and Kase, K. (2002). The national biological information infrastructure as an E-government tool. Government Information Quarterly 19(4), 407-424

San Gil, I. and Baker, K. (2007). The Ecological Metadata Language Milestones, Community Work Force, and Change. Databits. Fall 2007. 4-7


San Gil, I., Sheldon, W., Schmidt, T., Servilla, M., Aguilar, R., Gries, C., Gray, T., Field, D., Cole, J., Pan, J.Y., Palanisamy, G., Henshaw, D., O'Brien, M., Kinkel, L., McMahon, K., Kottmann, R., Amaral-Zettler, L., Hobbie, J., Goldstein, P., Guralnick, R.P., Brunt, J.W. and Michener, W.K. (2008). Defining linkages between the GSC and the NSF’s LTER program: How the Ecological Metadata Language relates to GCDML and other outcomes. Omics: Journal of Integrative Biology, 12(2), 151-156

San Gil, I., Baker, K., Campbell, J., Denny, E. G., Vanderbilt, K., Riordan, B., Koskela, R., Downing, J., Grabner, S., Melendez, E., Walsh, J. M., Kortz, M., Conner, J., Yarmey, L., Kaplan, N., Boose, E. R., Powell, L., Gries, C., Schroeder, R., Ackerman, T., Ramsey, K., Benson, B., Chipman, J., Laundre, J., Garritt, H., Henshaw, D., Collins, B., Gardner, C., Bohm, S., O'Brien, M., Gao, J., Sheldon, W., Lyon, S., Bahauddin, D., Servilla, M., Costa, D. and Brunt, J. (2009). The Long Term Ecological Network metadata standardisation implementation process: A progress report. International Journal of Metadata, Semantics and Ontologies. 4(3), 141-153


San Gil, I., Vanderbilt, K. and Harrington, S. (2010). Examples of ecological data synthesis driven by rich metadata, and practical guidelines to use the Ecological Metadata Language specification to this end. Submitted to International Journal of Metadata, Semantics and Ontologies.


Schnase, J. L., Cushing, J., Frame, M., Frondorf, A., Landis, E., Maier, D., and Silberschatz, A. (2003). Information technology challenges of biodiversity and ecosystems informatics. Inf. Syst. 28, 4 (Jun. 2003), 339-345. DOI=http://dx.doi.org/10.1016/S0306-4379(02)00070-4

Servilla, M.S., Brunt, J.W., San Gil, I., and Costa, D. (2006). Pasta: A Network-level Architecture Design for Generating Synthetic Data Products in the LTER Network. Databits. Fall 2006. Long Term Ecological Research Network

Servilla, M.W., Costa, C., Laney, C., San Gil, I. and Brunt, J. W. (2008). The EcoTrends Web Portal: An Architecture for Data Discovery and Exploration. Environmental Information Management Conference, Albuquerque, NM USA.


Van de Sompel, H., Nelson, M.L., Sanderson, R., Balakireva, L.L., Ainsworth, S. and Shankar, H. (2009). Memento: Time Travel for the Web. Arxiv preprint arXiv:0911.1112. http://arxiv.org/PS_cache/arxiv/pdf/0911/0911.1112v2.pdf


Van der Werf, B., Adamescu, M., Ayromlou, M., Bertrand, N., Borovec, J., Boussard, H., Cazacu, C., van Daele, T., Datcu, S., Frenzel, M. (200 ) A Long-Term Biodiversity, Ecosystem and Awareness Research Network. See resource of a report at http://anet.vbnlive.com/SITE/UPLOAD/DOCUMENT/outputs/WPI6_2009_07_Experimental_Data_Project.pdf


Van de Weerd, I., Brinkkemper, S., Souer, J. and Versendaal, J. (2006). A situational implementation method for web-based content management system-applications: method engineering and validation in practice. Software Process: Improvement and Practice, 11(5), 521-538.


Velleman, P.F. and Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. American Statistician. 65-72