Climate Knowledge Discovery Workshop Report


Reinhard Budich [1], MPI-M; Peter Fox [2], RPI; Auroop Ganguly [3], ORNL; Jim Kinter [4], COLA; Per Nyberg [5], Cray; Tobias Weigel [6], DKRZ

1. Background

Numerical simulation-based science follows a new paradigm: its knowledge discovery process rests upon massive amounts of data. We are entering the age of data-intensive science. Data is either generated by equipment such as satellite sensors, microscopes and particle colliders, or is born digital, i.e. generated by intensive use of high-performance computing.

One of the largest repositories of scientific data in any discipline is the model-generated and observational geoscience data produced and used in climate science. Climate scientists gather data faster than they can be interpreted. Current approaches to data volumes are primarily focused on traditional methods, best suited for large-scale phenomena and coarse-resolution data sets. The data volumes from climate modeling will increase dramatically due to both increasing resolution and number of processes described. What is needed is a suite of new techniques interpreting and linking phenomena on and between different time and length scales as well as realms and processes. Such tools could provide unique insights into challenging features of the Earth system, including extreme events, nonlinear dynamics and chaotic regimes. Based on experience, including that from other sciences, the breakthroughs needed to address these challenges will come from well-organized collaborative efforts involving several disciplines, including end-users and scientists in climate and related areas (atmosphere, ocean, land surface, …), computer and computational scientists, computing engineers, and mathematicians [xii].

This report summarizes the findings from the Climate Knowledge Discovery workshop organized by Deutsches Klimarechenzentrum GmbH (DKRZ), Max-Planck-Institut für Meteorologie (MPI-M) and Cray Inc. to bring together experts from various domains to investigate the use and application of large-scale graph analytics, semantic technologies and knowledge discovery algorithms in climate science. The workshop was held from 30 March to 1 April 2011 at DKRZ in Hamburg, Germany.




[1] Corresponding Author Address: Reinhard Budich, reinhard.budich@zmaw.de
[2] Corresponding Author Address: Peter Fox, pfox@cs.rpi.edu
[3] Corresponding Author Address: Auroop Ganguly, gangulyar@ornl.gov
[4] Corresponding Author Address: Jim Kinter, kinter@cola.iges.org
[5] Corresponding Author Address: Per Nyberg, nyberg@cray.com
[6] Corresponding Author Address: Tobias Weigel, weigel@dkrz.de

CKD Workshop Report, October 15, 2013

2. Workshop Objective

The aim of the workshop was to formulate science and technology strategies to further develop climate knowledge discovery methods and tools to enable a data-intensive approach to climate science. Should such advanced data analytical tools be proven viable, their methods should be used in future Coupled Model Inter-comparison Project (CMIP [7]) efforts, including the 5th round that is underway now (CMIP5).

Three basic questions were posed to pursue this aim:

1) What steps are required to realize the potential of graph analytics, semantic technologies and knowledge discovery algorithms in climate science?
2) Where methods and technologies already exist, how can they be leveraged?
3) Where gaps are identified, what steps must be taken to address them?

3. Data Driven Opportunities for Climate Science

Climate science faces long-standing gaps in the understanding of key processes like convection, land-atmosphere interaction, and teleconnections. A combination of improved models, observations and computing power has been suggested as the way towards improving this understanding. Our postulate: Confronting model simulations with observations in new ways and learning functional relations within climate variables from observations and models (which may generalize to predictive insights under non-stationary climate) would be invaluable for basic climate research, and key to addressing adaptation and resource management issues in a timely manner. This provides the motivation for novel and scalable methodologies in data-guided descriptive and predictive analysis, which can handle massive volumes of "5-dimensional" (space x, y and z, time and parameters) data generated from nonlinear processes with complex dependencies and feedback loops.

Many application areas [vi] have recognized the need for representing and reasoning about domain and multi-disciplinary knowledge, and the reality of science today: that knowledge is distributed over many resources. Computer science research areas being put into practice include: information integration, distributed knowledge management, the semantic web (including the Resource Description Framework (RDF) and the Web Ontology Language (OWL), which are both World Wide Web Consortium Recommendations, i.e. standards), multi-agent and distributed reasoning systems.

[7] http://www-pcmdi.llnl.gov/projects/cmip/index.php

Methods of automated reasoning and advanced analytics have been used in a number of domains by employing formal ontologies expressed in logic [viii]. Ontologies are essential components of modern knowledge management systems, and are becoming more prevalent in distributed computing environments based on Web Services, and in applications such as e-Commerce and e-Science. In information systems science, an ontology [vii] is a formal representation of knowledge as a set of concepts within a domain and the relationships among those concepts. Formalized ontologies are used to enable automated reasoning about the entities within that domain, in addition to describing the domain. Ontologies are useful in articulating shared vocabularies, which can be used to model a domain based on types of objects and/or concepts that exist, as well as their properties and relations among objects. More broadly, semantic technologies comprising query and reasoning capabilities, semantic storage and retrieval capabilities, have already been implemented in community efforts such as the Earth System Curator [8] and the METAFOR project [9], the National Aeronautics and Space Administration (NASA) Semantic Web for Earth and Environmental Terminology (SWEET) [10], and the International Research Institute (IRI) Lamont-Doherty Earth Observatory (LDEO) Climate Data Library [11].
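
The idea of an ontology as concepts plus relationships, with automated reasoning on top, can be illustrated with a minimal sketch. It assumes a tiny, hypothetical vocabulary stored as subject-predicate-object triples; the class names, the instance `pond_42` and the helper functions are illustrative only and are not taken from SWEET, the METAFOR project or any other published ontology.

```python
# Hypothetical mini-vocabulary stored as (subject, predicate, object) triples.
# None of these terms come from a published ontology.
TRIPLES = {
    ("SeaIce", "subClassOf", "Cryosphere"),
    ("SeaIceFeature", "subClassOf", "SeaIce"),
    ("MeltPond", "subClassOf", "SeaIceFeature"),
    ("Cryosphere", "subClassOf", "EarthSystemComponent"),
    ("pond_42", "type", "MeltPond"),
}


def superclasses(concept, triples):
    """Return every class reachable from `concept` via subClassOf edges."""
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def inferred_types(instance, triples):
    """Return the asserted types of `instance` plus all types implied by subclassing."""
    asserted = {o for s, p, o in triples if s == instance and p == "type"}
    implied = set()
    for cls in asserted:
        implied |= superclasses(cls, triples)
    return asserted | implied


types_of_pond = inferred_types("pond_42", TRIPLES)
```

Only one type is asserted for `pond_42`; the remaining four are derived by following the subclass hierarchy, which is the simplest form of the reasoning that a formal ontology enables.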

4. Complex Network Solutions

Data-intensive science in general has been called the fourth paradigm of scientific exploration [ix], in addition to theory, experimentation and the more recently added computational modeling and simulation. As the meeting progressed there was a growing awareness that a combination of high-performance analytics from model-simulated and observed data, with algorithms motivated from network science and graph theory, nonlinear dynamics and statistics, as well as data mining and machine learning, may be one way forward for the community. Recent developments in climate networks [i, ii, iii, iv, v], while impressive, have barely scratched the surface in terms of what may be eventually possible. While statistical and dynamical downscaling have become widely applied in the climate community, challenging issues about bias-variance tradeoffs, long-range dependence or teleconnections and impacts of boundary or initial conditions remain largely unsolved. Here complex networks [12] may offer the next generation of solutions in climate science and climate change consequence management.

Another emergent research area is to explore ways to connect network solutions from climate change science all the way to impacts science like urban sustainability and endangered natural ecosystems, where both the climate and impacted systems may be represented through loosely coupled network paradigms. The ability to interpret the results from the climate science perspective and develop hybrid models that blend available process knowledge with complex networks or other data-guided approaches will remain issues of key importance.
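
One common way to obtain a network from climate data is to link locations whose time series are strongly correlated. The following is a minimal sketch of that idea under stated assumptions: the three series are synthetic, the node names and the threshold of 0.5 are arbitrary, and plain Pearson correlation stands in for whatever similarity measure a real study would choose. It is not the specific method of any of the cited papers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_network(series, threshold):
    """Connect two nodes when the absolute correlation of their series meets the threshold."""
    names = sorted(series)
    edges = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(series[a], series[b])) >= threshold:
                edges.add((a, b))
    return edges


# Synthetic anomaly series at three hypothetical grid points.
steps = range(48)
series = {
    "A": [math.sin(2 * math.pi * t / 12) for t in steps],
    "B": [0.8 * math.sin(2 * math.pi * t / 12) + 0.1 * math.cos(2 * math.pi * t / 5) for t in steps],
    "C": [math.cos(2 * math.pi * t / 7) for t in steps],
}
edges = correlation_network(series, threshold=0.5)
```

With these inputs only the pair A and B shares a dominant 12-step cycle, so it is the only pair linked. The all-pairs loop grows quadratically with the number of nodes, which is one reason the scalability questions raised in this report matter.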

5. Construction of Graphs from Climate Data

Graphs are mathematical structures used to model pairwise relations between objects. A graph is represented by a collection of vertices or 'nodes' and a collection of edges that connect pairs of vertices. The workshop provided a venue for an open and frank discussion of the potential role of graph theory and graph analytics in the analysis of climate data. It was noted that the first step, getting community information into graph form, has not been widely adopted. The construction of graphs is an information modeling process. It is important to distinguish this ‘modeling’ from the familiar modeling of climate data using mathematical equations, and increasingly with the implementation of algorithms in computer software and execution on modern computer systems. To invoke this process, the climate community needs to define what is important to the graph: the types of nodes in the graph and their first-order connections. Beyond that, the construction of climate science domain ontologies would be a key next step, but to apply these ontologies, it is also necessary to define the process of semantically annotating climate data and representing it in graphs.

[8] http://www.earthsystemcurator.org/ontologies/
[9] http://metaforclimate.eu
[10] http://sweet.jpl.nasa.gov/ontology/
[11] http://iridl.ldeo.columbia.edu/ontologies/
[12] A complex network is a network (graph) with non-trivial topological features, i.e. features that do not occur in simple networks such as lattices or random graphs but often occur in real graphs. (Wikipedia)


Two basic approaches were described to construct graphs from climate data. First, the gridded values of geophysical fields representing physical phenomena such as sea-ice or melt pond can be described as nodes in the graphs. The edge between two nodes can describe domain processes. Alternatively, the edge can describe the physical phenomena and the nodes describe the processes. This bi-directional approach seems to have a lot of potential. It was noted, however, that the graph community does not currently have good knowledge representation forms for physical processes. Encoding of science knowledge has traditionally been more successful when describing objects that can be measured. Both approaches rely on a rich set of feature detection algorithms to transform the grid-based primary data into objects in a graph as the first step.
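
As an illustration of that first step, the sketch below turns a small gridded field into candidate graph nodes by thresholding and grouping adjacent cells. It assumes the simplest possible detector (a fixed threshold with 4-connected flood fill); the field values, the threshold and the `feature_N` naming are hypothetical, and real feature detection for sea-ice or melt ponds would be considerably more elaborate.

```python
def detect_features(grid, threshold):
    """Group 4-connected cells with values >= threshold into features.

    Returns a list of features, each a sorted list of (row, col) cells.
    """
    rows, cols = len(grid), len(grid[0])
    seen = set()
    features = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or grid[r][c] < threshold:
                continue
            # Flood fill from this cell to collect one connected region.
            stack = [(r, c)]
            seen.add((r, c))
            cells = []
            while stack:
                cr, cc = stack.pop()
                cells.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and (nr, nc) not in seen and grid[nr][nc] >= threshold:
                        seen.add((nr, nc))
                        stack.append((nr, nc))
            features.append(sorted(cells))
    return features


# Hypothetical sea-ice concentration field (fraction of each cell covered by ice).
ice = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.9, 0.8],
]
features = detect_features(ice, threshold=0.5)
# Each detected region becomes one node of the graph.
nodes = {f"feature_{i}": cells for i, cells in enumerate(features)}
```

Each region found this way is an object that can carry properties (area, centroid, time stamp) and be connected to other objects by edges describing processes.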

More complex graphs can also be constructed with a named definition of the network where node types do not need to be consistent in definition. For example, one node could define a geographic area (Pacific Region) and be connected to a node with physical measurements (time series of temperatures) with a named and typed relationship (measured location). Sub-graphs can be constructed to represent different components currently implemented in coupled Earth system models (atmosphere, ocean, land surface, …). Different algorithmic approaches to climate modeling can then be manifest in multiple graph instantiations.
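
The Pacific Region example can be written down as a small typed graph. This is a sketch only: the `Node` and `TypedGraph` classes, the identifiers and the temperature values are hypothetical and chosen for illustration, not drawn from any workshop data set or existing graph library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str  # e.g. "GeographicArea" or "TimeSeries"
    properties: dict = field(default_factory=dict)


@dataclass
class TypedGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source_id, relation, target_id):
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source_id, relation, target_id))

    def neighbors(self, source_id, relation):
        """Nodes reached from `source_id` over edges named `relation`."""
        return [self.nodes[t] for s, r, t in self.edges if s == source_id and r == relation]


graph = TypedGraph()
graph.add_node(Node("pacific", "GeographicArea", {"label": "Pacific Region"}))
# Made-up temperature values, for illustration only.
graph.add_node(Node("sst_series", "TimeSeries", {"variable": "temperature", "values": [26.1, 26.4, 27.0]}))
graph.add_edge("sst_series", "measured_location", "pacific")
```

Because node types and relation names are explicit, the same structure can hold sub-graphs for different model components without forcing all nodes into one definition.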

6. Potential Areas of Application

a. Model Inter-comparison:

Model inter-comparison projects are a standard procedure that enables a diverse community of scientists to analyze climate models in a systematic fashion, which serves to facilitate model improvement. In a model inter-comparison, many models with different assumptions are subjected to the same experimental protocol, and the output of all the models is made available to a community of researchers. This procedure requires a community-based infrastructure in support of climate model diagnosis, validation, inter-comparison, documentation and data access.

The construction of graphs using model output provides another method, in addition to more traditional methods, by which to compare outputs either from different models for the same simulated period or the same model using perturbed conditions. Graph algorithms such as graph isomorphism allow for powerful exploration of the encoded knowledge base (isomorphism is an equivalence relation between graphs).
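
To make the notion of isomorphism concrete, the sketch below tests whether two small undirected graphs have the same structure by trying every relabeling of the nodes. It is a brute-force illustration that is only feasible for a handful of nodes; the example graphs are hypothetical stand-ins for feature graphs derived from two model runs, and production work would rely on a dedicated graph library and far more scalable algorithms.

```python
from itertools import permutations


def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test for two small undirected graphs."""
    set_a = {frozenset(e) for e in edges_a}
    set_b = {frozenset(e) for e in edges_b}
    if len(nodes_a) != len(nodes_b) or len(set_a) != len(set_b):
        return False
    nodes_a = list(nodes_a)
    # Try every one-to-one relabeling of graph A's nodes onto graph B's nodes.
    for candidate in permutations(nodes_b):
        mapping = dict(zip(nodes_a, candidate))
        mapped = {frozenset(mapping[n] for n in edge) for edge in set_a}
        if mapped == set_b:
            return True
    return False


# Hypothetical feature graphs: two chains with different labels, and one star.
chain_model_1 = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
chain_model_2 = (["w", "x", "y", "z"], [("x", "w"), ("w", "z"), ("z", "y")])
star_model_3 = (["p", "q", "r", "s"], [("p", "q"), ("p", "r"), ("p", "s")])

same_structure = are_isomorphic(*chain_model_1, *chain_model_2)
different_structure = are_isomorphic(*chain_model_1, *star_model_3)
```

The two chains match despite different node labels, while the star does not, even though all three graphs have four nodes and three edges.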

b. Climate Teleconnections:

Climate science features challenging characteristics such as nonlinear dynamics, chaotic regimes and multi-physics complexity. Anomalies can be related to each other at large distances (typically thousands of kilometers), a feature referred to as a teleconnection. Information is propagated between the distant points through the atmosphere or ocean by transport processes or wave dispersion. In addition, the climate system is dynamic such that remote atmospheric and ocean responses to large-scale fluctuations in the climate system will never occur exactly the same way twice. From a computational perspective, this problem is extremely challenging, requiring the search for patterns and relationships at large distances on a graph within and between time-slices. Traversing a graph looking for what is connected to what causes every memory reference to be global and random, since every reference will lead to new connections that are not located in the same sections of memory.
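
One simple way to look for teleconnection candidates in a climate graph is to keep only those edges whose endpoints are far apart on the globe. The sketch below assumes edges have already been derived (for example from correlations) and that each node has a latitude and longitude; the node names, coordinates, edge list and the 2000 km cut-off are hypothetical values chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def long_range_edges(edges, positions, min_km):
    """Keep only edges whose endpoints are at least `min_km` apart."""
    kept = []
    for a, b in edges:
        if great_circle_km(*positions[a], *positions[b]) >= min_km:
            kept.append((a, b))
    return kept


# Hypothetical node positions as (latitude, longitude) in degrees.
positions = {
    "tropical_pacific": (0.0, -140.0),
    "neighbor_cell": (0.0, -137.5),
    "north_america": (45.0, -100.0),
}
# Hypothetical edges, e.g. pairs with strongly correlated anomalies.
edges = [("tropical_pacific", "neighbor_cell"), ("tropical_pacific", "north_america")]
teleconnections = long_range_edges(edges, positions, min_km=2000.0)
```

The link between adjacent cells is discarded as a local relationship, and the link spanning thousands of kilometers is retained as a teleconnection candidate.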

c. Scale Interactions:

It is also the case that climate phenomena are observed at a wide range of spatial (10^3 to 10^7 m) and temporal scales (10^-2 to 10^8 years). Because of the fluid dynamics of the atmosphere and oceans and because the processes governing phenomena on different time scales are co-dependent, there are strong interactions among the spatial and temporal scales of climate. This complicates the analysis of climate data and presents challenges for data processing and automating inference.

7. Technology Requirements

Applications that aim to combine semantic annotation and reasoning with graph algorithms and climate data are more successful when combined with sound ontologies. Many science ontologies have been developed in the past, but their consistency and methodological soundness (and completeness) are highly variable. Most importantly, realistic use cases must be defined even before starting to engineer ontologies. The time frame for such developments is therefore often multiple years.

In the METAFOR project, an ontology for model metadata has been developed, the METAFOR Common Information Model (CIM [13]). Like the Earth System Grid (ESG) ontology, it is intended to replace existing model metadata. While ESG uses Semantic Web standards such as RDF and OWL to encode the ontology, the METAFOR project uses eXtensible Markup Language (XML) Schema for the application-level encoding and CIM support tools. An encoding of the CIM with RDF and OWL has been proposed, but there are open issues. The encoding methodology is unclear so far and the feasibility of automated translation is disputed. There are also differences between the meta-model of Semantic Web standards and that of the International Organization for Standardization (ISO) standards on which the CIM relies, and these impede a direct transition.
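
To make the translation question more tangible, the sketch below maps a small XML metadata record onto subject-predicate-object triples in the most naive way possible. It is an assumption-laden illustration: the record, its element and attribute names, and the `hasComponent` predicate are hypothetical and do not follow the actual CIM XML Schema, and the mapping is not a proposed encoding of the CIM in RDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; element names are illustrative only and do
# not follow the actual CIM XML Schema.
RECORD = """
<model id="model-A">
  <component name="atmosphere"><resolution>T63</resolution></component>
  <component name="ocean"><resolution>1.5deg</resolution></component>
</model>
"""


def xml_to_triples(xml_text):
    """Naively map a metadata record to (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    model = root.get("id")
    triples = []
    for component in root.findall("component"):
        name = f"{model}/{component.get('name')}"
        triples.append((model, "hasComponent", name))
        # Every child element becomes a predicate with its text as the object.
        for child in component:
            triples.append((name, child.tag, child.text))
    return triples


triples = xml_to_triples(RECORD)
```

Even in this toy case the mapping embeds choices (how identifiers are built, whether nesting means "has component") that an XML Schema does not state explicitly, which hints at why a fully automated translation is contested.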

METAFOR has also progressed on the important aspect of the development of controlled vocabularies for community terms. While the CIM captures information related to the scientific workflow, the controlled vocabularies describe the physical phenomena in the data, which provides a foundation for developing domain ontologies and any subsequent graph representation of the data.

Graphs constructed from climate data could reach millions or billions of nodes. Graphs of such size are well beyond simple comprehension or visualization. Parallel software and hardware technologies will be essential to enable complex data analytics on today’s multi-terabyte and tomorrow’s petabyte datasets that future climate analysis will require. Investigation of existing and new parallel graph algorithms and data structures capable of analyzing spatio-temporal, and possibly dynamic, data at massive scale on parallel and multithreaded systems is required.

[13] http://metaforclimate.eu/

From a contemporary computing perspective, the performance of graph algorithms is typically limited by memory latency with access patterns being highly data dependent.
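
The data-dependent access pattern can be seen in a minimal breadth-first traversal over a graph stored in compressed sparse row (CSR) form, a common layout for large graphs. The sketch below is illustrative only: the function names and the eight-node example graph are hypothetical, and a serial Python version says nothing about the performance of parallel implementations.

```python
from collections import deque


def to_csr(num_nodes, edges):
    """Build compressed sparse row arrays for an undirected graph."""
    degree = [0] * num_nodes
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    offsets = [0] * (num_nodes + 1)
    for node in range(num_nodes):
        offsets[node + 1] = offsets[node] + degree[node]
    targets = [0] * offsets[-1]
    cursor = offsets[:-1].copy()
    for a, b in edges:
        targets[cursor[a]] = b
        cursor[a] += 1
        targets[cursor[b]] = a
        cursor[b] += 1
    return offsets, targets


def bfs_levels(offsets, targets, source):
    """Return the hop distance from `source` to every node (-1 if unreachable)."""
    level = [-1] * (len(offsets) - 1)
    level[source] = 0
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # The neighbor ids read here decide which entries of `level` are
        # touched next, so the access pattern depends on the data itself.
        for neighbor in targets[offsets[node]:offsets[node + 1]]:
            if level[neighbor] == -1:
                level[neighbor] = level[node] + 1
                queue.append(neighbor)
    return level


edges = [(0, 5), (5, 2), (2, 7), (1, 4)]
offsets, targets = to_csr(8, edges)
levels = bfs_levels(offsets, targets, source=0)
```

Starting from node 0 the traversal jumps to nodes 5, 2 and 7 in turn, positions that are scattered through the arrays rather than adjacent, which is the behavior that makes memory latency the limiting factor on large graphs.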

8. Next Steps

There exists a set of highly articulated tools in both the graph and climate communities. Investigation is required to explore and, in some cases, jumpstart the use of knowledge discovery algorithms in climate science. These approaches would augment the traditional methods of Earth system modeling and further leverage the volumes of observational and model data.


Graph and data-driven technologies have been successfully applied to social and bioinformatics areas. Work to date by groups including Potsdam Institute for Climate Impact Research (PIK), University of Wisconsin and Oak Ridge National Laboratory has demonstrated initial applicability to climate science. The workshop concluded that it would be valuable to further stimulate uptake and evaluation within the climate research community.

Many of the traditional methods of climate analysis originated in meteorology: these methods are used to understand the weather. The equations used in numerical weather prediction and climate simulation are nearly the same insofar as geophysical fluid dynamics are at the core of atmospheric and oceanic motions. Weather and climate analysis, however, are distinct in many ways, because climate feedback processes occur on time scales that are long compared to the variations of the weather.

Climate varies on time scales from weeks to millions of years. Processes exist that impact the full range of time scales, but it is not yet fully understood which of the processes are most important on what time scales and for what areas.

As an example of an area of interest, the Arctic will experience changes in climate more rapidly than other parts of the globe. Possible reasons include positive feedbacks that amplify the general warming trend due to increasing greenhouse gas concentrations: snow-ice albedo, stabilization of atmosphere, cloud influence and sea-ice influenced by wind. What cannot be determined in climate models today or in the traditional analysis of climate observations is which of those is most relevant to the current changing climate.

Another critical area of concern is the degree of uncertainty in climate predictions and projections. High confidence exists in the projections of climate change at the global scale and in the understanding of climate processes at large scales, but confidence decreases with decreasing spatial scale. Decision-making, in contrast, occurs at the scale of human institutions, which is much more local or regional in scope. Furthermore, there is a growing interest in attribution of climate events, either to proximate causes or to their ultimate origins, in particular, whether those origins are natural or related to human activity. Attribution depends on an understanding of the predictability of climate phenomena. It is critical, therefore, to assess predictability, confidence and uncertainty at scales relevant for decision support.

Two areas for development were identified at the workshop: ontologies to describe the relationship between features and events and a CKD test-bed based on a subset of the CMIP5 data. A key outcome of discussions among the attendees was to pursue a systematic methodology and begin by focusing on a specific use case or two that will provide some instance-level modeling. Understanding the principal processes of a specific use case (such as Arctic climate change) is key to defining representations and will have the highest probability for success.

From a methodology perspective, the use case, and the questions requiring answers that arise from it, define the required vocabulary terms, i.e. semantics and their relations. Since the ultimate knowledge base construction arises from a graph structure, the associated information modeling is best conducted in a group setting. Initially a small multidisciplinary team including information/knowledge modelers and domain scientists, working within a conceptual model framework and suitably influenced by familiarity with the underlying data of interest, develops the knowledge model. This model can be discussed and vetted by a larger group of domain experts, and a prototype knowledge base can be constructed and analyzed to assess the suitability, consistency and completeness of the model. Subsequently a more formal evaluation process can be applied to determine iterations and improvements. This approach has been used successfully in a number of other semantically-based projects [x, xi]. The primary aim of this approach is to obtain suitably structured ‘data’, i.e. data in graph model form, upon which graph and complex network analyses can be applied.

Further examination was also proposed on a set of science topics that will bring the climate science analysis and graph analytic communities together on a given number of concrete examples. Specific test cases are to be developed but are expected to include model inter-comparison, feature detection, and teleconnections. These test cases will expose the detailed technical implementation (graph construction, graph isomorphism approaches, etc.) and develop a common understanding of the concepts and terms used by the various communities (graphs, semantics, etc.). It is anticipated that such studies can advise on the right mix of automation, human control and abstraction that work best for analyzing new climate data sets.

9. CKD Workshop Attendees

Name                    Affiliation
David A. Bader          Georgia Institute of Technology
Venkatramani Balaji     NOAA
Joachim Biercamp        DKRZ
Benno Blumenthal        Columbia IRI
Michael Böttinger       DKRZ
Reinhard Budich         MPI-M
Alexey Cheptsov         HLRS
Kendall Clark           Clark & Parsia
Traute Crüger           MPI-M
Gerry Devine            University of Reading
Andi Drebes             Uni Hamburg
John Feo                PNNL
Peter Fox               RPI
Bernadette Fritzsch     AWI
Steven Haflich          Franz
Illia Horenko           University of Lugano
Heike Jänicke           University of Heidelberg
Elke Keup-Thiel         CSC
Stephan Kindermann      DKRZ
Jim Kinter              COLA and George Mason University
Ingo Kirchner           FU Berlin
Kerstin Kleese van Dam  PNNL
Luis Kornblueh          MPI-M
Uwe Kuester             HLRS
Jürgen Kurths           PIK
Michael Lautenschlager  DKRZ
Tobias Lippert          ParStream
Alexander Löw           MPI-M
Thomas Ludwig           DKRZ
Jim Maltby              Cray
Jochem Marotzke         MPI-M
Norbert Marwan          PIK
Thorsten Mauritsen      MPI-M
Philipp Metzner         University of Lugano
Shoaib Mufti            Cray
Craig Norvell           Franz
Per Nyberg              Cray
Christian Pagé          CERFACS
Stephen Pascoe          BADC
Michael Ponater         DLR
Hans Ramthun            DKRZ
Nurcan Rasig            Cray
David Rogers            Sandia
Will Sawyer             CSCS
Gary Stanbridge         Cray
Karsten Steinhaeuser    Univ. of Notre Dame / ORNL
Anastasios Tsonis       University of Wisconsin
Tobias Weigel           DKRZ





10. References




[i] K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (in press, available at doi:10.1002/sam.10100). Complex Networks as a Unified Framework for Descriptive Analysis and Predictive Modeling in Climate Science. Statistical Analysis and Data Mining.

[ii] K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (2010). An Exploration of Climate Data Using Complex Networks. SIGKDD Explorations 12(1), 25-32.

[iii] Zou, Y.; Donges, J. F.; Kurths, J. Recent advances in complex climate network analysis, Complex Systems and Complexity Science, 2011, pp. 27-38.

[iv] J. F. Donges, Y. Zou, N. Marwan, and J. Kurths. Complex networks in climate dynamics. Europhysics Letters, 87:48007, 2009.

[v] A. A. Tsonis, K. L. Swanson, and P. J. Roebber. What Do Networks Have to Do with Climate? Bulletin of the American Meteorological Society, 87(5):585-595, 2006.

[vi] Deborah L. McGuinness, Peter Fox, Boyan Brodaric, Elisa F. Kendall, The Emerging Field of Semantic Scientific Knowledge Integration, in IEEE Intelligent Systems, 24(1): 25-26, 2009.

[vii] T. R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, 5(2):199-220, 1993.

[viii] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, P. F. Patel-Schneider: The Description Logic Handbook: Theory, Implementation, Applications. Cambridge University Press, Cambridge, UK, 2003.

[ix] The Fourth Paradigm: Data Intensive Scientific Discovery, Eds. Tony Hey, Stewart Tansley and Kristin Tolle, Microsoft External Research, 2009.

[x] Benedict, J.L., McGuinness, D.L., & Fox, P. 2007, A Semantic Web-based Methodology for Building Conceptual Models of Scientific Information, EOS Trans. AGU, 88(52), Fall Meeting Suppl., Abstract IN53A-0950.

[xi] Fox, P. and McGuinness, D. L., An Open-World Iterative Methodology for the Development of Semantically-enabled Applications, in preparation, 2011.

[xii] Navarra, A., J. L. Kinter III, and J. Tribbia, 2010: Crucial Experiments in Climate Science. Bull. Amer. Meteor. Soc., 91, 343-352.