Kitchen Sinks, Plumbing and Virtual Observatories
Peter Fox
pfox@cs.rpi.edu
June 4, 2010, CSIRO Aspendale
Introduction
• Systems compared to frameworks?
• The need, and shifting the burden
• Virtual Observatories
• Architectures of VOs and semantics
• In the lower layers of VOs
  – Data access and transport
  – Formats, formats, formats
  – Sensor streams
• How do you/would you participate?
Tetherless World Constellation
Frameworks vs. Systems
• Rough definitions
  – Systems have very well-defined entry and exit points. A user tends to know when they are using one. Options for extensions are limited and usually require engineering
  – Frameworks have many entry and use points. A user often does not know when they are using one. Extension points are part of the design
• Treat this as a working definition
Diversity, Integration, Size, …
• Not just large (well organized, long-lived, well-funded) projects/programs want to make their data available
• Data policies are emerging but are still highly variable (or non-existent)
  – How does a user deal with this?
• Need to manage data to solve challenging scientific or societal problems without the continued need for a scientist to know every detail of complex data management systems
• Large-scale, scientific data repositories:
  – Most data still created in a manner to simplify generation, not access or use
  – Very diverse organization of data; files, directories, metadata, emails, etc.
  – Source/origin management is driven by meta-mechanisms for integration, interoperability (but still need performance)
• Virtual Observatories
• Data Grids
• Increasing realization: need management for all forms of ‘data’, i.e. virtual data products are becoming the norm
Shifting the Burden from the User to the Provider (with the help of VOs)
Terminology
• Workshop: A Virtual Observatory (VO) is a suite of software applications on a set of computers that allows users to uniformly find, access, and use resources (data, software, document, and image products and services using these) from a collection of distributed product repositories and service providers. A VO is a service that unites services and/or multiple repositories.
• VxOs - x is one discipline, domain, community, country
• NB: VO also refers to Virtual Organization
What should a VO do?
• Make “standard” scientific research much more efficient.
  – Even the principal investigator (PI) teams should want to use them.
  – Must improve on existing services (mission and PI sites, etc.). VOs will not replace these, but will use them in new ways.
• Enable new, global problems to be solved.
  – Rapidly gain integrated views from the solar origin to the terrestrial effects of an event.
  – Find data related to any particular observation.
  – (Ultimately) answer “higher-order” queries such as “Show me the data from cases where a large coronal mass ejection observed by the Solar-Orbiting Heliospheric Observatory was also observed in situ.” (science-speak) or “What happens when the Sun disrupts the Earth’s environment” (general public)
Virtual Observatories
• Conceptual examples:
• In-situ: Virtual measurements
  – Related measurements
• Remote sensing: Virtual, integrative measurements
  – Data integration
• Both usage patterns lead to additional data management challenges at the source and for users; now managing virtual ‘datasets’
Virtual Observatories
Make data and tools quickly and easily accessible to a wide audience.
Operationally, virtual observatories need to find the right balance of data/model holdings, portals and client software that researchers can use without effort or interference, as if all the materials were available on his/her local computer using the user’s preferred language: i.e. appear to be local and integrated
Likely to provide controlled vocabularies that may be used for interoperation in appropriate domains along with database interfaces for access and storage
Early days of VxOs
[Figure: VO 1, VO 2 and VO 3 above DB 1, DB 2, DB 3, …, DB n, with a “?” between them]
Federation
[Figure: VO 1, VO 2, VO 3 and VO 4 with DB 1, DB 2, DB 3, …, DB n]
The Astronomy approach; data-types as a service
[Figure: VO App 1, VO App 2 and VO App 3; DB 1, DB 2, DB 3, …, DB n; a VO layer labelled VOTable, Simple Image Access Protocol, Simple Spectrum Access Protocol, Simple Time Access Protocol. Annotations: Limited interoperability; Lightweight semantics; Limited meaning, hard coded; Limited extensibility; Under review]
OGC: {WFS, WCS, WMS} and SWE {SOS, SPS, SAS} use the same approach
Similarities to Astronomy
• Some disciplines have chosen a data format (some even use FITS)
• Common applications, community standards appearing
• Images, spectra (incl. multi-band), …
• More and more data is on-line, some (near) real-time
• Data flood - synoptic measurements, spatial/spectral resolution, number of instruments, cadence - all increasing (peta-byte to exa-byte is real), data mining and knowledge extraction are now real needs
• Don’t move (or replicate?) the data when possible
• Means for interoperation are being demanded - service-oriented architectures
• Some VOs are even implementing IVOA standards (primarily heliophysics and space physics)
Differences with astronomy
• Data types (+station/point, irregular, multi-resolution, ragged arrays, swath, …)
• Data formats - many
• Lots of VOs
• Metadata conventions range from strict to non-existent
• Provenance, derivation and semantics being applied in (more) formal ways
• Geo-spatial dominates (cf. helio-spatial), some standards but little/no enforcement - efforts at conventions/standards are at data model level
• New to the theme of integration and inter-disciplinary
• Number and complexity of projects, systems, frameworks - need to interoperate at many levels
• Social, political and mission forces are immense
Fox - APAC 2007, Driving e-research: Grids and Semantics
[Figure: layered architecture. Labels: VO Portal, Web Serv., VO API; DB 1, DB 2, DB 3, …, DB n; Semantic mediation layer - VSTO - low level; Semantic mediation layer - mid-upper-level; Education, clearinghouses, other services, disciplines, etc.; Metadata, schema, data; Query, access and use of data; Semantic query, hypothesis and inference; Semantic interoperability; “Added value” (four times)]
Mediation Layer
• Ontology - capturing concepts of Parameters, Instruments, Date/Time, Data Product (and associated classes, properties) and Service Classes
• Maps queries to underlying data
• Generates access requests for metadata, data
• Allows queries, reasoning, analysis, new hypothesis generation, testing, explanation, etc.
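The mapping from a query to access requests can be sketched in a few lines of Python. This is only an illustration of the idea, not the VSTO implementation: the toy "ontology", the instrument and dataset names, and the function `plan_access_requests` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical toy "ontology": which instruments measure which parameter.
MEASURES = {
    "temperature": ["instrument_a", "instrument_b"],
    "density": ["instrument_b"],
}
# Hypothetical holdings: (instrument, dataset id, first day, last day).
HOLDINGS = [
    ("instrument_a", "a_2001", date(2001, 1, 1), date(2001, 12, 31)),
    ("instrument_b", "b_2001", date(2001, 3, 1), date(2001, 9, 30)),
]


@dataclass(frozen=True)
class AccessRequest:
    dataset: str
    parameter: str
    start: date
    end: date


def plan_access_requests(parameter, start, end):
    """Map a (parameter, date range) query to requests on underlying datasets."""
    requests = []
    for instrument in MEASURES.get(parameter, []):
        for held_by, dataset, first, last in HOLDINGS:
            if held_by != instrument:
                continue
            # Clip the requested range to what the dataset actually covers.
            lo, hi = max(start, first), min(end, last)
            if lo <= hi:
                requests.append(AccessRequest(dataset, parameter, lo, hi))
    return requests
```

The user names a parameter and a date range; the lookup tables decide which instruments and datasets are relevant, so only requests that can return data are generated.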
Semantic Web Benefits
• Unified/abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the number of selections from eight to three
• Generates only syntactically correct queries, which could not always be ensured in previous implementations without semantics
• Semantic query support: by using background ontologies and a reasoner, our application has the opportunity to expose only coherent queries (portal and services)
• Semantic integration: in the past users had to remember (and maintain codes) to account for numerous different ways to combine and plot the data, whereas now semantic mediation provides the level of sensible data integration required, exposed as smart web services
  – understanding of coordinate systems, relationships, data synthesis, transformations.
  – returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional research associates and those from outside the fields)
Virtual Carbon Observatory
Environmental Assessment
Understand Communities Of Stakeholders
Multi-domain Knowledge Base
[Figure labels: Provenance, Science, Data, Processing, Science]
Vocabularies and Ontologies
• An underlying aspect of all VOs is the need to develop/agree on a common presentation of the (virtual) holdings, aka a catalog
• As discipline boundaries are crossed… (ecology)
• Vocabularies are increasingly important in this provision
• And, interestingly, there is a real push toward more explicit representations of semantics in the form of ontologies
• … and provision of vocabulary services*
Let’s turn to plumbing
• Data formats are of resurgent interest but not so much for exchange
  – For structural representation and efficiency
  – For transparency and preservation
  – However, a lot of end-users still care about formats immensely
• Data access and transport
• Implications of computing closer to the data
netCDF and similar
• Version 3 (classic) vs. version 4 (aka CDM)
• V4 - slow adoption to date (no specific reason)
• Conventions (e.g. units, CF-1) make it work
• Traditional focus on grids is now evolving as in-situ data and model comparisons are becoming common, i.e. unstructured data
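As a hedged illustration of why conventions matter, the sketch below checks plain Python dictionaries that stand in for a file's global and per-variable attributes. `Conventions` and `units` are attribute names CF uses; the helper `missing_cf_metadata` and the sample metadata are hypothetical, and a real check would read the attributes through a netCDF library and cover far more of the convention.

```python
def missing_cf_metadata(global_attrs, variables):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    conventions = str(global_attrs.get("Conventions", ""))
    if not conventions.startswith("CF-"):
        problems.append("global attribute 'Conventions' does not name CF")
    for name, attrs in variables.items():
        if "units" not in attrs:
            problems.append(f"variable '{name}' has no 'units' attribute")
    return problems


# Invented metadata, for illustration only.
sample_globals = {"Conventions": "CF-1.4"}
sample_variables = {
    "air_temperature": {"units": "K", "standard_name": "air_temperature"},
    "air_pressure": {},  # no units: a client cannot interpret the values
}
print(missing_cf_metadata(sample_globals, sample_variables))
```

A client that can rely on such attributes being present does not need per-dataset knowledge to label axes or convert units.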
Discipline neutral access
• One such approach, since 1993, is the DAP
  – Data Access Protocol (NASA, NOAA standard)
• opendap.org (U.S. not-for-profit)
• OPeNDAP is the software
  – Core, server (version 4 – Hyrax), client, services
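A DAP2 request is an ordinary URL: a suffix on the dataset URL selects the response, and an optional constraint expression after `?` selects variables and index ranges. The sketch below builds such URLs. The helper name `dap2_url` and the server address are hypothetical, and only the simplest form of constraint (index ranges without a stride) is covered.

```python
def dap2_url(dataset_url, response="dods", projections=()):
    """Build a DAP2 request URL.

    response: 'dds' (structure), 'das' (attributes), 'dods' (binary data)
              or 'ascii' (text data).
    projections: iterable of (variable, [(start, stop), ...]) pairs, with
                 inclusive zero-based index ranges, one per dimension.
    """
    if response not in {"dds", "das", "dods", "ascii"}:
        raise ValueError(f"unsupported response type: {response}")
    parts = []
    for variable, ranges in projections:
        hyperslab = "".join(f"[{start}:{stop}]" for start, stop in ranges)
        parts.append(variable + hyperslab)
    url = f"{dataset_url}.{response}"
    if parts:
        url += "?" + ",".join(parts)
    return url


# Hypothetical server and variable name, for illustration only.
base = "http://example.org/opendap/sst.nc"
print(dap2_url(base, "dds"))
print(dap2_url(base, "dods", [("sst", [(0, 0), (10, 20), (30, 40)])]))
```

Because the subset is described in the URL, the server can return only the requested slice instead of the whole file.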
OPeNDAP Hyrax Architecture
[Figure: Client, OLFS, BES, Data]
OPeNDAP Lightweight Front end Server (OLFS)
Receives requests and asks the BES to fill them
Uses Java Servlets
Does not directly ‘touch’ data
Multi-protocol
Back End Server (BES)
Reads data files, databases, etc., returns info
May return DAP2 objects or other data
Does not require web server
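The division of labour between the two servers can be illustrated with a toy sketch. It is not Hyrax code: the class names, the in-memory data store and the two request suffixes handled here are assumptions chosen only to show a front end that parses requests and a back end that is the sole reader of data.

```python
class BackEnd:
    """Stands in for the BES: the only component that reads the data store."""

    def __init__(self, store):
        self._store = store  # dataset name -> list of values (invented)

    def describe(self, dataset):
        return {"name": dataset, "length": len(self._store[dataset])}

    def read(self, dataset):
        return list(self._store[dataset])


class FrontEnd:
    """Stands in for the OLFS: parses requests and delegates to the back end."""

    def __init__(self, back_end):
        self._back_end = back_end

    def handle(self, path):
        # The suffix after the last dot names the kind of response wanted.
        dataset, _, kind = path.rpartition(".")
        if kind == "info":
            return self._back_end.describe(dataset)
        if kind == "ascii":
            return ", ".join(str(v) for v in self._back_end.read(dataset))
        raise ValueError(f"unsupported request: {path}")


front_end = FrontEnd(BackEnd({"sst.nc": [1, 2, 3]}))
print(front_end.handle("sst.nc.info"))
print(front_end.handle("sst.nc.ascii"))
```

Keeping data access behind one interface is what lets the front end speak several protocols without any of them needing to know how the data are stored.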
[Figure: OPeNDAP Lightweight Front end Server. Labels: Request from client; Request Formulation**; BES; Response to client; THREDDS; DAP2 (GridFTP, HTTP); SOAP-DAP (HTTP); RDF, OWL, JSON (HTTP); ASCII output; HTML form; Info output; PML output]
Hyrax/Back-end Server
[Figure: BES Framework. Labels: Network Protocol and Process start/stop activities; PPT*; Initialization/Termination; BES Commands/XML Documents; Commands**; Data Store Interfaces; DAP2 Access; NetCDF3; HDF4; RDF/SPARQL; …; Provenance; Data (four times); Catalogs]
*PPT is built in (other protocols)
**Some commands are built in
Status of the Community OPeNDAP Server Software
• Hyrax 1.6 provides support for NcML-based aggregation
• Faster THREDDS implementation (but not full featured)
• Full security audit and static code analysis certification to comply with NOAA and NASA requirements
• DAP4 (which includes netCDF 4 support) is not available yet
• AND other things
Earth System Grid Center for Enabling Technologies: (ESG-CET)
Earth System Grid Center for Enabling Technologies
• Large data sets, numbers and sizes
  – High performance
  – Flexible architecture, both client and several types and numbers of servers
  – Aggregation
  – Server side operations
  – Multiple transport protocol options
• Full ESG security support as well as loose federation
• Full function client access via API (netCDF/CDM)
To satisfy the new goals, the OPeNDAP services for ESG have been re-architected. We now use parts of the standard OPeNDAP framework Hyrax, focusing on high performance for the client side and extended flexibility.
Requirements leading to OPeNDAP-g
• Separation of the core Data Access Protocol (DAP) from the transport protocol (HTTP).
• High Performance Computing. The previous CGI based servers did not have the capacity required by ESG. Error and memory handling added.
• Security. Once OPeNDAP was independent of the transport protocol, adding security was possible by relying on the Globus gsiFTP system.
• Aggregation. OPeNDAP 3.0 did not operate on aggregated datasets. OPeNDAP-g does.
• Transport protocol independence and HPC were incorporated back into OPeNDAP leading to the current version. Security and aggregation initially were ESG only features.
The Remote NetCDF Invocation (RNI)
The client is the netCDF library. It has exactly the same API as the standard C library netCDF, but it can deal with local files or files reachable via HTTP, PPT or gridFTP.
The third tier, the BES server, can be reached only via PPT. NetCDF services for all NetCDF calls are implemented as a BES module.
The middle tier acts like a proxy between the RNI client and server and deals with security.
RNI Architecture
[Figure labels: CLIENT; NetCDF Library; RNI Library; GridFTP; OPeNDAP BES; RNI Module; DATA; “connection acts like”]
Characteristics of the RNI as part of a data access system
• Full support of standard OPeNDAP URLs. RNI is being developed with the integrated Unidata/OPeNDAP netCDF library (and CDM)
• Transparent access to either standard netCDF files or aggregated datasets via the NetCDF Markup Language (NcML).
• For remote containers, all write operations are disabled for security. That is, for HTTP/HTTPS, PPT and gridFTP/gsiFTP the RNI system is a read only API.
• RNI utilizes Just in Time access. Caching is only for metadata. No pre-fetching of data.
• RNI transparently accesses secure (gsiFTP, HTTPS) or insecure (gridFTP, HTTP) remote data.
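The access rules listed above can be summarised as a small decision function. This is a sketch under stated assumptions, not RNI code: the function `access_policy`, the URL scheme spellings (`gsiftp`, `gridftp`, `ppt`) and the treatment of local files as writable are assumptions made for the illustration.

```python
from urllib.parse import urlparse

SECURE_SCHEMES = {"https", "gsiftp"}
INSECURE_SCHEMES = {"http", "gridftp"}
# The slides list PPT as remote but do not classify it as secure or insecure.
OTHER_REMOTE_SCHEMES = {"ppt"}


def access_policy(location):
    """Classify a dataset location as local or remote and decide writability."""
    scheme = urlparse(location).scheme.lower()
    if scheme == "":
        # Plain path: a local file (assumed writable, as with the ordinary library).
        return {"remote": False, "secure": None, "writable": True}
    if scheme in SECURE_SCHEMES:
        return {"remote": True, "secure": True, "writable": False}
    if scheme in INSECURE_SCHEMES:
        return {"remote": True, "secure": False, "writable": False}
    if scheme in OTHER_REMOTE_SCHEMES:
        return {"remote": True, "secure": None, "writable": False}
    raise ValueError(f"unsupported scheme: {scheme}")


# Hypothetical locations, for illustration only.
print(access_policy("/data/model_run.nc"))
print(access_policy("gsiftp://example.org/data/model_run.nc"))
```

Every remote scheme maps to a read only result, which mirrors the rule that write operations are disabled for remote containers.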
Other DAP client/API library status
• OPeNDAP-Unidata project to fold ‘libnc-dap’ into the standard netCDF distribution, i.e. you get ‘DAP’ for free
• New C-API for DAP
  – ‘oc’ replaces ocapi and will be the basis for rewrites of the IDL and Matlab (and other) client interfaces
NOAA/IOOS
• DAP adopted by DMAC
• Gateway project for OPeNDAP
  – Support for WCS/WFS as source and response type in Hyrax
  – Implementation of AIS (Ancillary Information Service) for RDF return prototype
  – Initial DAP ontology data model
Cloud
• Microsoft ported OPeNDAP Hyrax to their Azure cloud
  – http://opendap.cloudapp.net/dap
  – Web-client/form is at http://opendap.cloudapp.net/dap/data/nc/contents.html
• Work on Azure Drive (Xdrive) underway
• No decisions on future or other cloud environments
Security (authn/z)
• Developed with Bryan Lawrence (BADC/STFC) for federation of OPeNDAP security
• Spec'd in May 2009; implementations presented at EGU in 2010
• Will appear in ESG and community OPeNDAP releases
• AAF compatible?
Sensors
• Due to the increasing demand to process off the sensor:
  – Sky surveys – volume
  – Monitoring – for rapid response and decision support
  – As part of a network, or on the internet, a web
• There is a corresponding increase in need to ingest/publish data much earlier than has previously been needed
• Trend toward treating them as RT/NRT sensors
Directions for sensor and spatial standards (my view)
• Has grown out of a limited set of semantic constructs
  – Geography, features, coverages, maps, streams
• Integration needs are driving different (good) developments, e.g. WCS 2 v WFS 2.
• Transparency requirements are going to drive very different approaches, e.g. encapsulation can be a barrier
• Refactoring of standards: much as is happening in astronomy will be required
Who is developing? Your participation?
• VOs
  – U.S. – NASA, NSF, NOAA are developing/funding
  – EU – many, e.g. HELIO, SOTERIA
• DAP/OPeNDAP
  – World-wide community, strong Australian contributions/use
• Sensors
  – W3 recent – incubator for semantic sensor web – very, very important work
• Vocabulary servers (more than the vocabularies)
  – Interest in community-based (or W3) effort
Issues for Virtual Observatories - Geo
• Scaling to large numbers of data providers
• Security, policy enforcement
• Data quality
• Branding and attribution (where did this data come from and who gets the credit, is it the correct version, is this an authoritative source?)
• Provenance/derivation (propagating key information as it passes through a variety of services, copies of processing algorithms, …)
• Sustainability
Summary/Discussion
• The VO paradigm is in wide-spread use in Earth and Space Sciences
  – Successful implementations in production and use (some even have evaluations)
  – New science is being enabled and performed
  – There are active programs at the agency level
  – Active communities; meeting, publishing, developing, implementing
• Data access and transport is an active field
• New attention to spatio-temporal standards and vocabularies in the context of services
• Substantial re-visiting of architectures due to the need to accommodate explicit semantics (esp. in regard to sensors)
Further Information
• http://tw.rpi.edu/
• http://www.opendap.org and http://docs.opendap.org
• Lots of others (ask me)
• Contact:
  – pfox@cs.rpi.edu