Climate Applications in the EELA
C. Baeza, M. Carrillo, J. Casado, I. Dutra, F. Echeverría, R. Miguel and R. Mayo
(on behalf of the EELA Project)
The EELA Project (E-infrastructure shared between Europe and Latin America) aims to
build a digital bridge between the existing initiatives in Europe and Latin America through
the creation of a collaborative network which will share a Grid infrastructure for supporting
the deployment and testing of applications.
Of main interest in the Project are the Climate applications and, in particular, the analysis
and understanding of the most important patterns and phenomena of atmospheric and
oceanic variability in Latin America (El Niño). This problem requires efficient access to, and
analysis of, climate databases and simulations from numerical oceanic and atmospheric
models, which are large in both number and size. It is also necessary to perform appropriate
statistical analyses (data mining) to discover significant variability patterns and the
relationships among them.
In addition, the long-term effects of climate change on these phenomena will be studied.
To do this, existing global predictions will be combined with the results of new
high-resolution predictions for the region of interest, obtained by integrating complex
numerical climate models under different forcing scenarios.
The use of Grid technologies allows storage and computing resources to be shared in a
transparent way, so that problems which currently require days of processing can be
solved in a short time.
One of the objectives of the EELA project (E-infrastructure shared between Europe and
Latin America) is to identify and promote a sustainable framework for e-Science. This
objective is reachable not only by deploying mature Grid applications, but also by
deploying new ones coming from other European and Latin American scientific
communities. This document presents the first applications concerning Climate that are
going to be deployed in the EELA infrastructure.
The EELA project deals with four scientific areas organized over three tasks (Biomedical
Applications (Task 3.1), High Energy Physics Applications (Task 3.2), and Additional
Applications (Task 3.3)). The applications covered in this document belong to the last one;
that is, Climate.
The selection criteria were the prior maturity of the application and the interest
of the Latin American communities. Three applications in the area of
Climate have been selected for their future use in the Grid. At least two of these Climate
applications will be ready for demonstrations by March 2007.
The main goal of this document is to provide a general overview of the applications
selected and the criteria used for their consideration. Thus, the climate applications which
are going to be deployed on the pilot EELA infrastructure for both production and
dissemination purposes will be described. Applications have been identified from the
expertise and research activity of the current LA and EU partners in EELA. The next
sections will describe the relevance of Grids in climate and the objectives of using such
Grids within the frame of EELA.
Modern climate science deals with different sources of geographically distributed
observational data (surface, atmosphere, ocean, etc.) stored in different platforms and
formats. Moreover, an increasing number of global climate simulations and predictions is
available from numerical atmospheric and oceanic models (reanalysis projects, ensemble
model and multi-model experiments, etc.). These sources of data can jointly help to solve
many important problems, such as regional climate change projections, i.e., the effects of
climate change on different regions of interest. To this aim, efficient problem
statistical analysis tools are required for discovering knowledge, or useful information,
within the huge amount of information. Data mining and machine learning techniques
have been developed in the last decades to deal with this task, and different alternatives
have been studied to ease the process in a distributed environment such as the Grid.
The El Niño phenomenon is a key factor for Latin American climate prediction. El Niño is of
special interest due to its direct effects on the Pacific coast of South America, in particular
on Peru and Chile. Moreover, research institutes from Peru and Chile (EELA LA partners) run
global and regional climate models and need to compare their results with other
simulations performed by international centres in the El Niño area.
The increasing need of computing power required for climate applications is currently
addressed by using either supercomputers or high-performance computer networks. In
each case, the development of parallel computing techniques and algorithms faces
different problems, due to the respective hardware and network constraints. Moreover, the
huge amount of data involved in this process is locally stored. Although grid technology
offers a solution to these problems, only a limited number of initiatives using grid
technology have appeared in the last few years. For instance, the Earth System Grid
(ESG) project provides distributed and transparent data access to climate simulations
using the grid infrastructure formed by five USA research centres. Moreover, some
climate research centres have done some work to adapt and deploy their numerical
weather and climate parallel models in grid environments.
However, typical climate applications required by end-users (agriculture, energy, etc.)
usually require a set of processes to be run in cascade:
- Simulation of atmosphere/ocean models.
- Efficient data access to observations and previous simulations.
- Data analysis and mining applications.
The climate applications involved in EELA are organized around these three tasks and
focus on their interconnection to solve typical end-user problems:
- Global and regional climate simulations: Global simulations will be carried out with
  the open-source numerical climate model CAM (Community Atmosphere
  Model) [R1]. These simulations will be used as boundary conditions to run in
  cascade the regional model MM5 (PSU/NCAR Mesoscale Model) [R2] (and/or the
  new version WRF), focusing on the area of the El Niño phenomenon.
- Distributed access to climate datasets: The output of climate simulations can be
  efficiently accessed using standard formats and middleware such as OpenDAP.
  This middleware will be gridified in order to access simulation datasets stored in a
  distributed form in the EELA grid.
- Development of data mining applications: Clustering data mining techniques will
  be deployed to test the climate cascade, obtaining weather classes from global
  models in the El Niño area.
The main goal of this EELA task is the integration of these components of the climate
analysis cascade into the EELA testbed having in mind the EELA partners as end users.
LA partners (SENAMHI and UDEC) will work in the first
step of the cascade, and U
will work on the distributed data access and on the data mining applications.
This application meets the requirements for GRID processing for three reasons: climate
simulation requires running the same model over different initial data sets (production) or
running different configurations (parameterizations) of the model over the same initial
data; datasets are distributed among different weather services and research laboratories;
and data mining algorithms are computing intensive, thus requiring parallel processing. An
efficient development of this category of applications for the GRID environment requires
middleware components for application performance monitoring, efficient distributed data
access, and specific resource management. Users should be able to run their applications
on the GRID without needing to know details of the GRID structure and operation.
GLOBAL CLIMATE SIMULATIONS
The Community Atmosphere Model (CAM) is the latest in a series of global atmosphere
models developed at NCAR for the weather and climate research communities [R3]. CAM
also serves as the atmospheric component of the Community Climate System Model
(CCSM). CAM is a numerical model that uses the governing equations of the atmosphere to
produce climate forecasts for long periods of time (centuries). The proposal of SENAMHI is
to run the CAM model at global scale with a resolution of 300 km, simulating the climate
system of the last fifty years. An ensemble of different simulations from
different initial conditions can be produced as different GRID jobs to characterize the
model climatology (see Fig. 1). This model can also be run with different forcing emission
scenarios to analyze climate change.
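The ensemble described above can be sketched as a list of independent job descriptions, one per combination of initial condition and forcing scenario. This is only an illustration: the `EnsembleJob` class, the scenario labels and the 1950-2000 period are hypothetical names chosen for the example, not part of CAM or of the EELA middleware.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EnsembleJob:
    """One independent GRID job: a single model run of the ensemble."""
    member: int       # index of the perturbed initial condition
    scenario: str     # forcing scenario label (hypothetical)
    start_year: int
    end_year: int

    def name(self) -> str:
        return (f"cam_m{self.member:02d}_{self.scenario}"
                f"_{self.start_year}-{self.end_year}")


def build_ensemble(n_members, scenarios, start_year, end_year):
    """Return one job per (initial condition, scenario) pair.

    The runs never communicate with each other, which is why an
    ensemble maps naturally onto independent GRID jobs.
    """
    return [EnsembleJob(m, s, start_year, end_year)
            for m, s in product(range(1, n_members + 1), scenarios)]


# Hypothetical setup: 4 initial conditions, 2 forcing scenarios.
jobs = build_ensemble(n_members=4, scenarios=["control", "scenario_a"],
                      start_year=1950, end_year=2000)
for job in jobs:
    print(job.name())
```

Each name printed here would correspond to one job submission; how a job is actually described and submitted depends on the grid middleware and is not shown.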
Fig. 1. Monthly global rainfall generated running the CAM model at SENAMHI in a
REGIONAL CLIMATE APPLICATIONS
The PSU/NCAR mesoscale model (known as MM5), and the recent Weather Research
and Forecasting (WRF) version, are limited-area models designed to simulate or
predict regional atmospheric circulation (Fig. 2). These models can work with nested domains
with different resolutions and require as input the boundary conditions from a global model
(e.g., the CAM model). The model is supported by several pre- and post-processing
programs, which are referred to collectively as the MM5 modelling system, and has been
used as a computing performance benchmark in several tests.
Regional models are highly dependent on the specific parameterizations chosen for
sub-grid physical phenomena that are not resolved explicitly. Therefore, an optimal tuning of
the model for a given region requires running an ensemble of simulations with different
combinations of model parameters (e.g. running slightly different models with the same
initial data).
Fig. 2. Four nested domains used to simulate the atmospheric evolution over central Chile.
The orography of the inner domain (above) and the surface temperature (below) resulting
from a simulation performed at UDEC are shown in the right panels.
DISTRIBUTED DATASETS AND ACCESS
The output of global and regional meteorological models is stored in particular binary
meteorological formats (netCDF, GRIB, BUFR, HDF5, etc.) and different simulations are
distributed among different centres. Moreover, most end-user
applications only need partial access to the datasets, since only a subset of the original
data is used (e.g. a certain geographical region). These characteristics make it difficult for
users to locate and obtain the desired data to perform statistical and data
mining analysis using standard techniques. To address these problems, a distributed
inventory system is required that allows data producers (weather services,
research laboratories, etc.) to expose metadata to be harvested by users through
any common protocol such as HTTP, allowing universal access. The THREDDS (Thematic
Realtime Environmental Distributed Data Services) [R4] project is developing middleware
to bridge the gap between data providers and data users; the goal is to simplify the
discovery and use of scientific data. THREDDS Dataset Inventory Catalogs are used to
provide virtual directories of available data and their associated metadata. These catalogs
can be generated dynamically or statically. The THREDDS Data Server (TDS) [R5] is a
web server that provides access to the subsets of data included in the catalog based on
the metadata information. To this aim, a number of existing technologies have been
extended (NetCDF, OpenDAP, etc.).
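As an illustration of how a user might harvest such an inventory, the sketch below parses a small THREDDS-style catalog with the Python standard library and lists the access URL of every dataset. The catalog content, the host `http://example.org` and the dataset paths are invented for the example; only the XML namespace and the `service`/`dataset`/`urlPath` vocabulary follow the THREDDS InvCatalog 1.0 convention.

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hypothetical catalog: names, base path and dataset paths are made up.
SAMPLE_CATALOG = """<catalog
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    name="Example climate simulations">
  <service name="dap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="Global runs">
    <dataset name="CAM member 01" urlPath="cam/member01.nc"/>
    <dataset name="CAM member 02" urlPath="cam/member02.nc"/>
  </dataset>
</catalog>
"""


def list_datasets(catalog_xml, host):
    """Return (name, access URL) for every dataset that has a urlPath."""
    root = ET.fromstring(catalog_xml)
    base = root.find("t:service", NS).get("base")
    found = []
    # Container datasets (no urlPath) only group other datasets.
    for ds in root.iter("{%s}dataset" % NS["t"]):
        path = ds.get("urlPath")
        if path is not None:
            found.append((ds.get("name"), host + base + path))
    return found


datasets = list_datasets(SAMPLE_CATALOG, "http://example.org")
for name, url in datasets:
    print(name, url)
```

A real catalog would be fetched over HTTP from the data producer and may declare several services; this sketch assumes a single one.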
A recent initiative, the Earth System Grid (ESG) project, has made an initial attempt to
gridify this technology. To this aim, OpenDAP data servers are included within the grid
infrastructure, and data enter the grid storage elements when they are first requested
from OpenDAP servers. This is an initial solution to the problem which can be further
analyzed to improve efficiency and integration into the grid testbed. As a first task, the
solution developed in ESG will be adopted in the EELA testbed, but it will be further analyzed
to make THREDDS catalogs compatible with grid catalogs and to improve data subsetting.
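Data subsetting with OpenDAP can be illustrated by building a request URL that asks the server for only part of a variable. The sketch below converts a latitude/longitude box into index ranges and appends a DAP2-style hyperslab constraint (`[start:stride:stop]` per dimension). It assumes a regular 2.5-degree grid and a variable laid out as (time, lat, lon); the dataset URL and the variable name `PRECT` are hypothetical.

```python
def index_range(coords, lo, hi):
    """Inclusive (first, last) indices of the axis values within [lo, hi]."""
    inside = [i for i, c in enumerate(coords) if lo <= c <= hi]
    if not inside:
        raise ValueError("no grid points inside the requested interval")
    return inside[0], inside[-1]


def subset_url(dataset_url, variable, time_idx, lats, lons, box):
    """Build a URL requesting one variable restricted to a time range
    and a geographical box, assuming dimensions (time, lat, lon)."""
    lat_min, lat_max, lon_min, lon_max = box
    j0, j1 = index_range(lats, lat_min, lat_max)
    i0, i1 = index_range(lons, lon_min, lon_max)
    t0, t1 = time_idx
    slab = f"[{t0}:1:{t1}][{j0}:1:{j1}][{i0}:1:{i1}]"
    return f"{dataset_url}.dods?{variable}{slab}"


# Assumed regular global grid: 73 latitudes and 144 longitudes.
lats = [-90 + 2.5 * k for k in range(73)]
lons = [2.5 * k for k in range(144)]

# Equatorial Pacific box (5S-5N, 190E-240E), first 12 time steps.
url = subset_url("http://example.org/thredds/dodsC/cam/member01.nc",
                 "PRECT", (0, 11), lats, lons, (-5.0, 5.0, 190.0, 240.0))
print(url)
```

Only 5 x 21 grid points per time step are requested instead of the full 73 x 144 field, which is the kind of saving that makes partial access attractive for end-user applications.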
DATA MINING APPLICATIONS: SELF-ORGANIZING MAPS
Due to the high-dimensional character of the data involved in the climate simulations, it is
necessary to first analyze and simplify the data in order to extract some useful knowledge.
Some data mining techniques are appropriate for this context. Unsupervised clustering
techniques allow partitioning the simulation databases, producing realistic weather or
climate models of great variability governing the global dynamics. Self-Organizing Maps
(SOM) are amongst the most popular clustering algorithms, and are especially suitable
for high-dimensional data visualization and modelling. A SOM uses unsupervised learning (no
domain knowledge is needed and no human intervention is required) to create a set of
prototype vectors representing the data (Fig. 3). Moreover, a topology-preserving
projection of the prototypes from the original input space onto a low-dimensional grid is
carried out. Thus, the resulting ordered grid can be efficiently used for extracting data
features, clustering the data, etc. Self-Organizing Maps have recently been applied to
meteorological problems, such as classifying climate modes and anomalies for the El
Niño phenomenon in the area of Peru [R6].
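A minimal sketch of sequential SOM training may help to fix ideas. It uses only the Python standard library and a toy two-dimensional data set; the linear decay of the learning rate and neighbourhood width, and all parameter values, are assumptions made for the example rather than the configuration used by the authors.

```python
import math
import random


def best_matching_unit(protos, x):
    """Lattice node whose prototype is closest to x (squared distance)."""
    return min(protos,
               key=lambda n: sum((a - b) ** 2 for a, b in zip(protos[n], x)))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns {(row, col): prototype vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    # Prototypes start as randomly chosen data vectors.
    protos = {(r, c): list(rng.choice(data))
              for r in range(rows) for c in range(cols)}
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # learning rate decays
            sigma = sigma0 * (1.0 - frac) + 0.01     # neighbourhood shrinks
            bmu = best_matching_unit(protos, x)
            for node, w in protos.items():
                # Neighbours of the winner on the lattice move too, which
                # is what preserves the topology of the map.
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return protos


# Toy data: two well-separated groups of points.
data = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1),
        (9.8, 10.1), (10.2, 9.9), (10.0, 10.3)]
som = train_som(data, rows=2, cols=2)
for node in sorted(som):
    print(node, [round(v, 2) for v in som[node]])
```

In a climate setting each data vector would be a flattened atmospheric field rather than a point in the plane, so the same loop runs over vectors with thousands of components.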
Fig. 3. SOM lattice projected onto the space spanned by the first two principal components
of a reanalysis database (left) and 1000 mb temperature fields of some of the resulting
prototypes (right).
The suitability of different scalable parallel implementations of this algorithm for parallel
computers with predetermined resources and fast communications has been analyzed in
[R7]. In contrast, a first attempt to design an adaptive scheme for distributing data
and computational load according to the changing resources available for each GRID job
submitted was analyzed in the CROSSGRID project [R8]. The simplest way of
parallelizing the SOM algorithm is to split the data between different processors, as
shown in Fig. 4(a). However, in this case, after each complete cycle the slaves must send
the prototypes to the master, which combines them and sends the final centres back to
the slaves. This is not an efficient implementation for the GRID environment, since it
requires intensive message passing of high-dimensional data. Figures 4(b) and (c) show
two different alternatives for distributing computational resources with replicated (or
centralized) prototype vectors. The different messages required for each of the schemes
are shown using dashed lines, which may correspond to either an iteration of the
algorithm or a whole cycle.
An MPI implementation was considered in this work.
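The data-splitting scheme can be illustrated without MPI by simulating the slaves in a single process. The sketch below assumes a batch formulation of the SOM update, in which each slave accumulates partial sums over its share of the data and the master adds them up; this is a stand-in chosen for clarity, not necessarily the update rule used in [R7] or [R8]. It also counts the values each cycle sends to the master, which is the cost the text identifies as the weakness of this scheme.

```python
import math


def partial_sums(chunk, protos, grid, sigma):
    """Work of one slave: per-node numerator and denominator over its data."""
    dim = len(protos[0])
    num = [[0.0] * dim for _ in protos]
    den = [0.0] * len(protos)
    for x in chunk:
        bmu = min(range(len(protos)),
                  key=lambda j: sum((a - b) ** 2
                                    for a, b in zip(protos[j], x)))
        for j, (r, c) in enumerate(grid):
            d2 = (r - grid[bmu][0]) ** 2 + (c - grid[bmu][1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))
            den[j] += h
            for k in range(dim):
                num[j][k] += h * x[k]
    return num, den


def master_combine(partials, protos):
    """Work of the master: add the partial sums and form new prototypes."""
    new = []
    for j, old in enumerate(protos):
        den = sum(p[1][j] for p in partials)
        if den == 0.0:
            new.append(list(old))
        else:
            new.append([sum(p[0][j][k] for p in partials) / den
                        for k in range(len(old))])
    return new


def batch_cycle(data, protos, grid, sigma, n_slaves):
    """One complete cycle; returns new prototypes and values sent upward."""
    chunks = [data[i::n_slaves] for i in range(n_slaves)]
    partials = [partial_sums(ch, protos, grid, sigma) for ch in chunks]
    # Each slave sends (dim + 1) numbers per node to the master; the
    # master then sends every prototype back to every slave (not counted).
    floats_sent = n_slaves * len(protos) * (len(protos[0]) + 1)
    return master_combine(partials, protos), floats_sent


# Toy setup: 2 x 2 lattice, two-dimensional data, arbitrary start values.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
protos = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
        [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]

serial, _ = batch_cycle(data, protos, grid, 1.0, n_slaves=1)
split, sent = batch_cycle(data, protos, grid, 1.0, n_slaves=3)
print(sent, "values sent to the master in one cycle")
```

Splitting the data does not change the result of a cycle, but the traffic grows with the number of slaves, the number of nodes and the dimension of the prototypes. For climate fields with thousands of components per prototype this is the message load that motivates the alternative schemes.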
Fig. 4. Three different parallel schemes for the SOM training algorithm: (a) distributing
data, (b) distributing computational resources with replicated prototype vectors, (c)
distributing computational resources with centralized prototype vectors.
The above algorithms will be deployed and further analyzed within the EELA testbed,
running in cascade with the global and/or regional models described above, using the
data stored in the grid in previous simulations.
IDENTIFIED CLIMATE APPLICATIONS
In conclusion, three different climate applications have been identified to be
deployed in the EELA testbed:
- CAM. Global climate simulations.
- MM5/WRF. Regional climate simulations.
- SOM. Clustering data mining technique.
Moreover, some extra work will be needed to deploy data access middleware in order to
couple the above applications in a cascade, accessing the data through grid catalogs.
EELA climate partners have experience running these applications on local clusters and
have already started to build up their Resource Centers to start the gridification process.
REFERENCES
Collins, W.D. et al. The Community Climate System Model version 3 (CCSM3). Journal of Climate, 2006.
THREDDS Data service, http://jodi.ecs.soton.ac.uk/Articles/v02/i04/Domenico/
Cofiño, A.S., Gutiérrez, J.M. and Cano, R. Analysis and downscaling multi-model seasonal forecasts in Perú using self-organizing maps. Tellus A, 2006.
Lawrence, R.D., Almasi, G.S. and Rushmeier, H.E. A scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems. Data Mining and Knowledge Discovery, 1999.
Luengo, F., Cofiño, A.S. and Gutiérrez, J.M. GRID oriented implementation of Self-Organizing Maps for data mining in Meteorology, in GRID Computing, Proceedings of 1st European Across GRIDs Conference, Rivera, F., Bubak, M., Gomez Tato, A. and Doallo, R., Eds., 2003, 163