Kitchen Sinks, Plumbing and Virtual Observatories

Peter Fox

pfox@cs.rpi.edu

June 4, 2010


CSIRO Aspendale

Introduction


Systems compared to frameworks?


The need, and shifting the burden


Virtual Observatories


Architectures of VOs and semantics


In the lower layers of VOs


Data access and transport


Formats, formats, formats


Sensor streams


How do you/ would you participate?


Tetherless World Constellation

Frameworks vs. Systems


Rough definitions


Systems have very well-defined entry and exit
points. A user tends to know when they are using
one. Options for extensions are limited and
usually require engineering


Frameworks have many entry and use points. A
user often does not know when they are using
one. Extension points are part of the design


Treat this as a working definition




Diversity, Integration, Size, …


It is not just large (well organized, long-lived, well-funded) projects/programs
that want to make their data available


Data policies are emerging but are still highly variable (or non-existent)


How does a user deal with this?


Need to manage data
to solve challenging scientific or societal problems
without the continued need for a scientist to know every detail of complex
data management systems


Large-scale, scientific data repositories:


Most data still created in a manner to simplify generation, not access or use


Very diverse organization of data: files, directories, metadata, emails, etc.


Source/origin management is driven by meta-mechanisms for integration,
interoperability (but still need performance)


Virtual Observatories


Data Grids


Increasing realization: need management for all forms of ‘data’, i.e. virtual
data products are becoming the norm

Shifting the Burden from the User to the Provider (with the help of VOs)

Terminology


Workshop: A Virtual Observatory (VO) is a suite of
software applications on a set of computers that
allows users to uniformly find, access, and use
resources (data, software, document, and image
products and services using these) from a
collection of distributed product repositories and
service providers. A VO is a service that unites
services and/or multiple repositories.


VxOs - x is one discipline, domain, community, or country


NB: VO also refers to Virtual Organization
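
The workshop definition above describes a VO as a service that unites multiple repositories behind one uniform way to find resources. A minimal sketch of that idea follows; the `Repository` and `VirtualObservatory` classes, the `find` method, and the record fields are all hypothetical names chosen for illustration, not part of any VO software.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """A hypothetical product repository holding simple metadata records."""
    name: str
    records: list = field(default_factory=list)

    def search(self, parameter):
        # Each repository answers in its own way; here, a plain filter.
        return [r for r in self.records if r["parameter"] == parameter]


class VirtualObservatory:
    """Unites several repositories behind one uniform 'find' call."""

    def __init__(self, repositories):
        self.repositories = list(repositories)

    def find(self, parameter):
        results = []
        for repo in self.repositories:
            for record in repo.search(parameter):
                # Tag each hit with its origin so the user can see the source.
                results.append({**record, "source": repo.name})
        return results


repo_a = Repository("archive-a", [{"parameter": "temperature", "file": "t1.nc"}])
repo_b = Repository("archive-b", [
    {"parameter": "temperature", "file": "t2.nc"},
    {"parameter": "density", "file": "d1.nc"},
])
vo = VirtualObservatory([repo_a, repo_b])
hits = vo.find("temperature")
print([h["source"] for h in hits])
```

The user issues one query and never needs to know how many repositories sit behind it, which is the "shifting the burden" point made earlier.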



What should a VO do?


Make “standard” scientific research much more efficient.


Even the principal investigator (PI) teams should want to use them.


Must improve on existing services (mission and PI sites, etc.). VOs will
not replace these, but will use them in new ways.


Enable new, global problems to be solved.


Rapidly gain integrated views from the solar origin to the terrestrial
effects of an event.


Find data related to any particular observation.


(Ultimately) answer “higher-order” queries such as “Show me the
data from cases where a large coronal mass ejection observed by the
Solar and Heliospheric Observatory was also observed in situ.”
(science-speak) or “What happens when the Sun disrupts the Earth’s
environment” (general public)


Virtual Observatories


Conceptual examples:


In-situ: Virtual measurements


Related measurements



Remote sensing: Virtual, integrative measurements


Data integration








Both usage patterns lead to additional data management challenges at the source
and for users; now managing virtual ‘datasets’

Virtual Observatories

Make data and tools quickly and easily accessible to a
wide audience.

Operationally, virtual observatories need to find the
right balance of data/model holdings, portals and
client software that researchers can use without
effort or interference, as if all the materials were
available on his/her local computer using the user’s
preferred language: i.e. appear to be local and
integrated

Likely to provide controlled vocabularies that may be
used for interoperation in appropriate domains along
with database interfaces for access and storage

Early days of VxOs

[Diagram labels: VO 1, VO 2, VO 3; DB 1, DB 2, DB 3, …, DB n; ?]

Federation

[Diagram labels: VO 1, VO 2, VO 3, VO 4; DB 1, DB 2, DB 3, …, DB n]

The Astronomy approach; data-types as a service

[Diagram labels: VO App 1, VO App 2, VO App 3; VO layer; DB 1, DB 2, DB 3, …, DB n]


VOTable


Simple Image Access Protocol


Simple Spectrum Access Protocol


Simple Time Access Protocol


Limited interoperability

Lightweight semantics

Limited meaning, hard coded

Limited extensibility

Under review

OGC: {WFS, WCS, WMS} and SWE {SOS, SPS, SAS} use the same approach

Similarities to Astronomy


Some disciplines have chosen a data format (some even use FITS)


Common applications, community standards appearing


Images, spectra (incl. multi-band), …


More and more data is on-line, some (near) real-time


Data flood - synoptic measurements, spatial/spectral resolution,
number of instruments, cadence - all increasing (peta-byte to
exa-byte is real), data mining and knowledge extraction are now real
needs


Don’t move (or replicate?) the data when possible


Means for interoperation is being demanded - service-oriented
architectures


Some VOs even implementing IVOA standards (primarily heliophysics
and space physics)

Differences with astronomy


Data types (+station/point, irregular, multi-resolution, ragged
arrays, swath, …)


Data formats - many


Lots of VOs


Metadata conventions range from strict to non-existent


Provenance, derivation and semantics being applied in
(more) formal ways


Geo-spatial dominates (cf. helio-spatial), some standards but
little/no enforcement - efforts at conventions/standards are
at data model level


New to the theme of integration and inter-disciplinary


Number and complexity of projects, systems, frameworks -
need to interoperate at many levels


Social, political and mission forces are immense


Fox - APAC 2007, Driving e-research: Grids and Semantics

[Diagram labels: VO Portal; Web Serv.; VO API; DB 1, DB 2, DB 3, …, DB n; Semantic mediation layer - VSTO - low level; Semantic mediation layer - mid-upper-level; Education, clearinghouses, other services, disciplines, etc.; Metadata, schema, data; Query, access and use of data; Semantic query, hypothesis and inference; Semantic interoperability; Added value (four times)]

Mediation Layer


Ontology - capturing concepts of Parameters,
Instruments, Date/Time, Data Product (and
associated classes, properties) and Service Classes


Maps queries to underlying data


Generates access requests for metadata, data


Allows queries, reasoning, analysis, new hypothesis
generation, testing, explanation, etc.
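
The mediation steps listed above (ontology concepts, mapping a query to underlying data, generating access requests) can be sketched in a few lines. This is an illustrative toy under stated assumptions: the `MEASURES` and `HOLDINGS` tables, the instrument and dataset names, and the `mediate` function are all hypothetical stand-ins for a real ontology and catalog.

```python
# Toy "ontology": every name below is hypothetical and for illustration only.
MEASURES = {
    # instrument -> parameters it measures
    "InstrumentA": {"NeutralTemperature", "NeutralWind"},
    "InstrumentB": {"ElectronDensity"},
}
HOLDINGS = {
    # instrument -> (dataset identifier, first year, last year)
    "InstrumentA": ("dataset-a", 1995, 2005),
    "InstrumentB": ("dataset-b", 2000, 2010),
}


def instruments_for(parameter):
    """Return, in sorted order, the instruments that measure a parameter."""
    return sorted(i for i, params in MEASURES.items() if parameter in params)


def mediate(parameter, year):
    """Map a semantic query (parameter, date) to concrete access requests."""
    requests = []
    for instrument in instruments_for(parameter):
        dataset, first, last = HOLDINGS[instrument]
        if first <= year <= last:
            requests.append(
                {"dataset": dataset, "variable": parameter, "year": year}
            )
    return requests


print(mediate("NeutralTemperature", 2001))
```

The user names a parameter and a date; which instrument and dataset satisfy the request is worked out by the layer rather than by the user.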



Semantic Web Benefits


Unified/abstracted query workflow: Parameters, Instruments, Date-Time


Decreased input requirements for query: in one case reducing the number of
selections from eight to three


Generates only syntactically correct queries, which could not always be ensured in
previous implementations without semantics


Semantic query support: by using background ontologies and a reasoner, our
application has the opportunity to expose only coherent queries (portal and
services)


Semantic integration: in the past users had to remember (and maintain codes)
to account for numerous different ways to combine and plot the data, whereas
now semantic mediation provides the level of sensible data integration
required, exposed as smart web services


understanding of coordinate systems, relationships, data synthesis,
transformations


returns independent variables and related parameters


A broader range of potential users (PhD scientists, students, professional
research associates and those from outside the fields)
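
One way to picture "exposing only coherent queries" is a reasoner that infers, from a class hierarchy, which parameters may be offered once an instrument class is chosen. The sketch below uses simple superclass traversal in place of a real reasoner; the class names and the `MEASURED_BY_CLASS` assertions are assumptions made up for the example.

```python
# Hypothetical class hierarchy: child -> parent.
SUBCLASS_OF = {
    "FabryPerotInterferometer": "Interferometer",
    "Interferometer": "OpticalInstrument",
    "OpticalInstrument": "Instrument",
}
# Parameters asserted at some level of the hierarchy (assumed for the example).
MEASURED_BY_CLASS = {
    "OpticalInstrument": {"Intensity"},
    "Interferometer": {"NeutralWind"},
}


def ancestors(cls):
    """Yield cls and every superclass, following SUBCLASS_OF upward."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def coherent_parameters(instrument_class):
    """Infer the parameters a class can offer, including inherited ones."""
    params = set()
    for cls in ancestors(instrument_class):
        params |= MEASURED_BY_CLASS.get(cls, set())
    return sorted(params)


print(coherent_parameters("FabryPerotInterferometer"))
```

A portal built this way would list only the inferred parameters for the selected class, so a user cannot compose a query that has no possible answer.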

Virtual Carbon Observatory


Environmental Assessment



Understand Communities Of Stakeholders

Multi-domain Knowledge Base

[Diagram labels: Provenance; Science Data Processing; Science]

Vocabularies and Ontologies


An underlying aspect of all VOs is the need to
develop/agree on a common presentation of the
(virtual) holdings, aka a catalog


As discipline boundaries are crossed… (ecology)


Vocabularies are increasingly important in this
provision


And, interestingly, there is a real push toward
more explicit representations of semantics in the
form of ontologies



… and provision of vocabulary services*
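
A vocabulary service, as opposed to a static vocabulary, answers questions such as "what does this community call the concept I know by another name?". The following is a minimal sketch assuming a hypothetical mapping table; the vocabulary names, terms, and concept identifiers are invented for illustration.

```python
# Hypothetical term mappings: (vocabulary, term) -> shared concept identifier.
MAPPINGS = {
    ("ocean-vocab", "sea_temp"): "concept:SeaWaterTemperature",
    ("ecology-vocab", "water temperature"): "concept:SeaWaterTemperature",
    ("ocean-vocab", "salinity"): "concept:SeaWaterSalinity",
}


def resolve(vocabulary, term):
    """Return the shared concept for a term, or None if it is unmapped."""
    return MAPPINGS.get((vocabulary, term.strip().lower()))


def translate(term, source, target):
    """List the target-vocabulary terms that share a concept with the source term."""
    concept = resolve(source, term)
    if concept is None:
        return []
    return sorted(
        t for (vocab, t), c in MAPPINGS.items() if vocab == target and c == concept
    )


print(translate("Sea_Temp", "ocean-vocab", "ecology-vocab"))
```

Routing every term through a shared concept is what lets a catalog stay usable when discipline boundaries are crossed.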


Let’s turn to plumbing


Data formats are of resurgent interest but not
so much for exchange


For structural representation and efficiency


For transparency and preservation


However, a lot of end-users still care about
formats immensely


Data access and transport


Implications of computing closer to the data


netCDF and similar


Version 3 (classic) vs. version 4 (aka CDM)


V4 - slow adoption to date (no specific reason)


Conventions (e.g. units, CF-1) make it work


Traditional focus on grids is now evolving as
in-situ data and model comparisons are
becoming common, i.e. unstructured data
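
To illustrate why conventions "make it work", here is a small sketch that checks whether variables carry the attributes a consumer would rely on. The attribute names `units` and `standard_name` come from the CF conventions; treating both as mandatory is an assumption of this example rather than a statement of the CF rules, and the `variables` dictionary is a made-up stand-in for the contents of a netCDF file.

```python
# Attribute names "units" and "standard_name" come from the CF conventions;
# the variables below are made-up stand-ins for what a netCDF file might hold.
REQUIRED_ATTRIBUTES = ("units", "standard_name")

variables = {
    "tas": {"units": "K", "standard_name": "air_temperature"},
    "pr": {"standard_name": "precipitation_flux"},  # units missing
}


def missing_attributes(variables):
    """Return {variable: [missing attribute names]} for incomplete variables."""
    report = {}
    for name, attrs in variables.items():
        missing = [a for a in REQUIRED_ATTRIBUTES if a not in attrs]
        if missing:
            report[name] = missing
    return report


print(missing_attributes(variables))
```

Without such agreed attributes, a file in a perfectly valid format can still be unusable by software that did not produce it.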


Discipline neutral access


One such approach, since 1993, is the DAP


Data Access Protocol (NASA, NOAA standard)


opendap.org (U.S. not-for-profit)


OPeNDAP is the software


Core, server (version 4: Hyrax), client, services
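
For readers who have not seen a DAP request, the sketch below builds DAP2-style URLs: a dataset URL, a response suffix, and an optional constraint that selects a variable and index ranges written as `[start:stride:stop]`. The helper names `hyperslab` and `dap2_url`, the server `example.org`, and the variable `sst` are hypothetical and used only to show the URL shape.

```python
def hyperslab(start, stop, stride=1):
    """Format one DAP2 array index range as [start:stride:stop]."""
    return f"[{start}:{stride}:{stop}]"


def dap2_url(dataset_url, response, variable=None, ranges=()):
    """Build a DAP2 request URL.

    response is a DAP2 suffix such as "dds", "das", "dods" or "ascii".
    """
    url = f"{dataset_url}.{response}"
    if variable is not None:
        url += "?" + variable + "".join(hyperslab(*r) for r in ranges)
    return url


# Hypothetical server and dataset, used only to show the URL shape.
base = "http://example.org/opendap/data/sst.nc"
print(dap2_url(base, "dds"))
print(dap2_url(base, "ascii", "sst", [(0, 0), (10, 20, 2)]))
```

Because the subset is expressed in the URL, the server can return only the requested slice, which is part of what makes the protocol discipline neutral: the client never needs to know the file format on disk.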


OPeNDAP Hyrax Architecture

[Diagram labels: Client; OLFS; BES; Data]


OPeNDAP Lightweight Front end Server (OLFS)


Receives requests and asks the BES to fill them


Uses Java Servlets


Does not directly ‘touch’ data


Multi-protocol


Back End Server (BES)


Reads data files, databases, etc., returns info


May return DAP2 objects or other data


Does not require web server
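
The division of labour between the two Hyrax tiers can be sketched as follows: a front end that interprets requests and never reads data, and a back end that is the only component touching the store. This is a toy model, not Hyrax code; the class names, the command names `get_info` and `get_data`, and the in-memory store are hypothetical (the real BES is driven by its own command documents).

```python
class BackEnd:
    """Stands in for the BES: the only tier that reads data."""

    def __init__(self, store):
        self._store = store  # hypothetical in-memory "files"

    def execute(self, command, dataset):
        if dataset not in self._store:
            raise KeyError(dataset)
        values = self._store[dataset]
        if command == "get_info":
            return {"dataset": dataset, "length": len(values)}
        if command == "get_data":
            return list(values)
        raise ValueError(f"unknown command: {command}")


class FrontEnd:
    """Stands in for the OLFS: parses requests, never touches data itself."""

    SUFFIX_TO_COMMAND = {"info": "get_info", "ascii": "get_data"}

    def __init__(self, backend):
        self._backend = backend

    def handle(self, path):
        dataset, _, suffix = path.rpartition(".")
        command = self.SUFFIX_TO_COMMAND.get(suffix)
        if command is None:
            return 400, f"unsupported response type: {suffix}"
        try:
            return 200, self._backend.execute(command, dataset)
        except KeyError:
            return 404, f"no such dataset: {dataset}"


olfs = FrontEnd(BackEnd({"sst.nc": [1.5, 2.5, 3.5]}))
print(olfs.handle("sst.nc.info"))
print(olfs.handle("sst.nc.ascii"))
```

Keeping the data-reading tier free of any web dependency is what allows it to sit behind different front ends and protocols.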

[Diagram labels: OPeNDAP Lightweight Front end Server; Request from client; Response to client; Request Formulation**; GridFTP; HTTP; DAP2; SOAP-DAP (HTTP); DAP2 (GridFTP, HTTP); RDF, OWL, JSON (HTTP); ASCII output; HTML form; Info output; PML output; THREDDS; BES]

Hyrax/Back-end Server

[Diagram labels: BES Framework; Network Protocol and Process start/stop activities; PPT*; Initialization/Termination; BES Commands/XML Documents; Commands**; DAP2 Access; Data Store Interfaces; NetCDF3; HDF4; RDF/SPARQL; Provenance; Catalogs; Data (four times)]

*PPT is built in (other protocols)

**Some commands are built in

Status of the Community OPeNDAP Server Software


Hyrax 1.6 provides support for NcML-based aggregation


Faster THREDDS implementation (but not full featured)


Full security audit and static code analysis
certification to comply with NOAA and NASA
requirements


DAP4 (which includes netCDF 4 support) is not
available yet


AND other things


Earth System Grid Center for Enabling Technologies (ESG-CET)


Large data sets, numbers and sizes


High performance


Flexible architecture, both client and several types and numbers of
servers


Aggregation


Server side operations


Multiple transport protocol options


Full ESG security support as well as loose federation


Full function client access via API (netCDF/CDM)


To satisfy the new goals, the OPeNDAP services for ESG have been re-architected.


We now use parts of the standard OPeNDAP framework Hyrax, focusing on
high performance for the client side and extended flexibility.


Requirements leading to OPeNDAP-g


Separation of the core Data Access Protocol (DAP) from the
transport protocol (HTTP).


High Performance Computing. The previous CGI based servers
did not have the capacity required by ESG. Error and memory
handling added.


Security. Once OPeNDAP was independent of the
transport protocol, adding security was possible by relying on
the Globus gsiFTP system.


Aggregation. OPeNDAP 3.0 did not operate on aggregated
datasets. OPeNDAP-g does.


Transport protocol independence and HPC were incorporated
back into OPeNDAP, leading to the current version. Security
and aggregation initially were ESG-only features.



The Remote NetCDF Invocation (RNI)


The client is the netCDF library. It has exactly the
same API as the standard C library netCDF, but it
can deal with local files or files reachable via HTTP,
PPT or gridFTP.


The third tier, the BES server, can be reached only
via PPT. NetCDF services for all NetCDF calls are
implemented as a BES module.


The middle tier acts like a proxy between the RNI
client and server and deals with security.


RNI Architecture

[Diagram labels: CLIENT; NetCDF Library; RNI Library; GridFTP; OPeNDAP; BES; RNI Module; DATA; connection acts like]


Characteristics of the RNI as part of a data access system


Full support of standard OPeNDAP URLs. RNI is being
developed with the integrated Unidata/OPeNDAP netCDF
library (and CDM)


Transparent access to both standard netCDF files and
aggregated datasets via the NetCDF Markup Language (NcML).


For remote containers, all write operations are disabled for
security. That is, for HTTP/HTTPS, PPT and gridFTP/gsiFTP the
RNI system is a read-only API.


RNI utilizes Just in Time access. Caching is only for metadata.
No pre-fetching of data.


RNI transparently accesses secure (gsiFTP, HTTPS) or insecure
(gridFTP, HTTP) remote data.
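
The characteristics above (one API for local and remote data, read-only remote access, metadata-only caching) can be sketched with a small stand-in class. This is not the RNI library: the `Container` class, its methods, and the example locations are hypothetical, and no network access is performed.

```python
from urllib.parse import urlparse

REMOTE_SCHEMES = {"http", "https", "ppt", "gridftp", "gsiftp"}
SECURE_SCHEMES = {"https", "gsiftp"}


class Container:
    """Hypothetical handle that mimics one API for local and remote data."""

    def __init__(self, location):
        scheme = urlparse(location).scheme.lower()
        self.location = location
        self.remote = scheme in REMOTE_SCHEMES
        self.secure = scheme in SECURE_SCHEMES
        self._metadata = None   # cached after the first request
        self.fetches = 0        # counts simulated metadata round trips

    def metadata(self):
        # Only metadata is cached; data values are never pre-fetched.
        if self._metadata is None:
            self.fetches += 1
            self._metadata = {"location": self.location}
        return self._metadata

    def write(self, values):
        if self.remote:
            raise PermissionError("remote containers are read-only")
        return len(values)  # pretend the values were written locally


local = Container("/data/model.nc")
remote = Container("gsiftp://example.org/data/model.nc")
remote.metadata()
remote.metadata()
print(local.remote, remote.remote, remote.secure, remote.fetches)
```

The calling code is identical for both containers; only the location string decides whether the access is local, remote, secure, or writable.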

Other DAP client/API library status


OPeNDAP-Unidata project to fold ‘libnc-dap’
into the standard netCDF distribution, i.e. you
get ‘DAP’ for free


New C-API for DAP


‘oc’ replaces ocapi and
will be the basis for rewrites of the IDL and
Matlab (and other) client interfaces


NOAA/IOOS


DAP adopted by DMAC


Gateway project for OPeNDAP


Support for WCS/WFS as source and response
type in Hyrax


Implementation of AIS (Ancillary Information
Service) for RDF return prototype


Initial DAP ontology data model


Cloud


Microsoft ported OPeNDAP Hyrax to their
Azure cloud


http://opendap.cloudapp.net/dap


Web-client/form is at
http://opendap.cloudapp.net/dap/data/nc/contents.html


Work on Azure Drive (Xdrive) underway


No decisions on future or other cloud
environments


Security (authn/z)


Developed with Bryan Lawrence (BADC/STFC)
for federation of OPeNDAP security


Spec'd in May 2009; implementations
presented at EGU in 2010


Will appear in ESG and community OPeNDAP
releases


AAF compatible?


Sensors


Due to the increasing demand to process off
the sensor:


Sky surveys - volume


Monitoring - for rapid response and decision
support


As part of a network, or on the internet, a
web


There is a corresponding increase in need to
ingest/ publish data much earlier than has
previously been needed


Trend toward treating them as RT/NRT sensors


Directions for sensor and
spatial standards (my view)


Have grown out of a limited set of semantic
constructs


Geography, features, coverages, maps, streams


Integration needs are driving different (good)
developments, e.g. WCS 2 vs. WFS 2


Transparency requirements are going to drive
very different approaches, e.g. encapsulation can
be a barrier


Refactoring of standards, much as is happening in
astronomy, will be required


Who is developing?

Your participation?


VOs


U.S.


NASA, NSF, NOAA are developing/funding


EU


many, e.g. HELIO, SOTERIA


DAP/OPeNDAP


World-wide community, strong Australian
contributions/use


Sensors


W3C recent incubator for semantic sensor web


very, very important work


Vocabulary servers (more than the vocabularies)


Interest in community-based (or W3C) effort



Issues for Virtual Observatories - Geo


Scaling to large numbers of data providers


Security, policy enforcement


Data quality


Branding and attribution (where did this data come
from and who gets the credit, is it the correct version, is
this an authoritative source?)


Provenance/derivation (propagating key information as
it passes through a variety of services, copies of
processing algorithms, …)


Sustainability
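
The provenance and attribution issues on this slide amount to carrying key information along as data passes through services. A minimal sketch of that propagation follows; the `Product` record, the `apply_step` helper, and the step names are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """A data product that carries its own derivation history."""
    values: list
    source: str
    history: list = field(default_factory=list)


def apply_step(product, name, func):
    """Run one processing step and append it to the provenance trail."""
    return Product(
        values=[func(v) for v in product.values],
        source=product.source,                # attribution survives every step
        history=product.history + [name],     # new list; the input is not mutated
    )


raw = Product(values=[1.0, 2.0, 3.0], source="station-x")  # hypothetical source
calibrated = apply_step(raw, "calibrate(+0.5)", lambda v: v + 0.5)
scaled = apply_step(calibrated, "scale(x2)", lambda v: v * 2)
print(scaled.values, scaled.source, scaled.history)
```

If every service in a chain behaves this way, the final product can still answer "where did this come from and what was done to it?".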

Summary/ Discussion


The VO paradigm is in widespread use in Earth and
Space Sciences


Successful implementations in production
and use (some even have evaluations)


New science is being enabled and performed


There are active programs at the agency level


Active communities: meeting, publishing, developing,
implementing


Data access and transport is an active field


New attention to spatio-temporal standards and
vocabularies in the context of services


Substantial revisiting of architectures due to the need to
accommodate explicit semantics (esp. in regard to
sensors)

Further Information


http://tw.rpi.edu/


http://www.opendap.org and http://docs.opendap.org


Lots of others (ask me)


Contact: pfox@cs.rpi.edu
