
Cyberinfrastructure and Its Application

CReSIS REU Presentation
July 12, 2011

Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org
http://www.futuregrid.org

Director, Digital Science Center, Pervasive Technology Institute
Associate Dean for Research and Graduate Studies,
School of Informatics and Computing
Indiana University Bloomington

Important Trends

Data Deluge in all fields of science

Multicore implies parallel computing is important again
  Performance comes from extra cores, not extra clock speed
  GPU-enhanced systems can give a big power boost

Clouds: a new commercially supported data center model replacing compute grids (and your general purpose computer center)

Lightweight clients: sensors, smartphones and tablets accessing, and supported by, backend services in the cloud

Commercial efforts are moving much faster than academia in both innovation and deployment

Jaliya Ekanayake, School of Informatics and Computing

Big Data in Many Domains

According to one estimate, we created 150 exabytes (billion gigabytes) of data in 2005. This year we will create 1,200 exabytes.
  PCs have ~100 gigabytes of disk and 4 gigabytes of memory

Size of the web is ~3 billion web pages: MapReduce at Google was processing on average 20 PB per day in January 2008

During 2009, American drone aircraft flying over Iraq and Afghanistan sent back around 24 years' worth of video footage (http://www.economist.com/node/15579717)
  New models being deployed this year will produce ten times as many data streams as their predecessors, and those in 2011 will produce 30 times as many

~108 million sequence records in GenBank in 2009, doubling every 18 months

~20 million purchases at Wal-Mart a day

90 million Tweets a day

Astronomy, Particle Physics, Medical Records …
  The Fourth Paradigm: Data-Intensive Scientific Discovery
  Large Hadron Collider at CERN; record ~100 petabytes of data to find the Higgs Boson

What is Cyberinfrastructure?

Cyberinfrastructure is (from NSF) infrastructure that supports distributed research and learning (e-Science, e-Research, e-Education)
  Links data, people and computers
  Exploits Internet technology (Web 2.0 and Clouds), adding (via Grid technology) management, security, supercomputers etc.

It has two aspects: parallel (low latency, microseconds, between nodes) and distributed (highish latency, milliseconds, between nodes)
  Parallel is needed to get high performance on individual large simulations, data analysis etc.; one must decompose the problem
  The distributed aspect integrates already distinct components, and is especially natural for data (as in biology databases etc.)

e-moreorlessanything

'e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it', from the inventor of the term, John Taylor, Director General of Research Councils UK, Office of Science and Technology

e-Science is about developing tools and technologies that allow scientists to do 'faster, better or different' research

Similarly, e-Business captures the emerging view of corporations as dynamic virtual organizations linking employees, customers and stakeholders across the world.

This generalizes to e-moreorlessanything, including e-DigitalLibrary, e-FineArts, e-HavingFun and e-Education

A deluge of data of unprecedented and inevitable size must be managed and understood.

People (virtual organizations), computers and data (including sensors and instruments) must be linked via hardware and software networks

The Span of Cyberinfrastructure

High definition videoconferencing linking people across the globe
Digital Library of music, curriculum, scientific papers
Flickr, YouTube, Amazon ….
Simulating a new battery design (an exascale problem)
Sharing data from the world's telescopes
Using the cloud to analyze your personal genome
Enabling all to be equal partners in creating knowledge and converting it to wisdom
Analyzing Tweets … documents to discover which stocks will crash; how disease is spreading; linguistic inference; ranking of institutions

Data Centers, Clouds & Economies of Scale I

Data centers range in size from "edge" facilities to megascale.

Economies of scale: approximate costs for a small center (1K servers) and a larger, 50K server center. Each data center is 11.5 times the size of a football field.

Technology       Cost in small-sized Data Center    Cost in Large Data Center      Ratio
Network          $95 per Mbps/month                 $13 per Mbps/month             7.1
Storage          $2.20 per GB/month                 $0.40 per GB/month             5.7
Administration   ~140 servers/Administrator         >1000 servers/Administrator    7.1

2 Google warehouses of computers on the banks of the Columbia River, in The Dalles, Oregon. Such centers use 20 MW - 200 MW (future), each with 150 watts per CPU. Save money from large size, positioning with cheap power, and access via the Internet.



Data Centers, Clouds & Economies of Scale II

Builds giant data centers with 100,000's of computers; ~200-1000 to a shipping container with Internet access

"Microsoft will cram between 150 and 220 shipping containers filled with data center gear into a new 500,000 square foot Chicago facility. This move marks the most significant, public use of the shipping container systems popularized by the likes of Sun Microsystems and Rackable Systems to date."

Gartner 2009 Hype Curve: Clouds, Web 2.0, Service Oriented Architectures

[Figure: Gartner chart with impact ratings Transformational, High, Moderate and Low; entries include Cloud Computing, Cloud Web Platforms and Media Tablet]

Clouds and Jobs

Clouds are a major industry thrust with a growing fraction of IT expenditure that IDC estimates will grow to $44.2 billion direct investment in 2013, while 15% of IT investment in 2011 will be related to cloud systems, with 30% growth in the public sector.

Gartner also rates cloud computing high on its list of critical emerging technologies, with for example "Cloud Computing" and "Cloud Web Platforms" rated as transformational (their highest rating for impact) in the next 2-5 years.

Correspondingly, there are and will continue to be major opportunities for new jobs in cloud computing, with a recent European study estimating there will be 2.4 million new cloud computing jobs in Europe alone by 2015.

Cloud computing is attractive for projects focusing on workforce development. Note that the recently signed "America Competes Act" calls out the importance of economic development in the broader impact of NSF projects.

UNIVERSITY OF CALIFORNIA, SAN DIEGO
SAN DIEGO SUPERCOMPUTER CENTER
Fran Berman

Tracking the Heavens

Hubble Telescope, Palomar Telescope, Sloan Telescope

"The Universe is now being explored systematically, in a panchromatic way, over a range of spatial and temporal scales that lead to a more complete, and less biased understanding of its constituents, their evolution, their origins, and the physical processes governing them."
  Towards a National Virtual Observatory

Virtual Observatory Astronomy Grid: Integrate Experiments

[Figure: sky survey images to be integrated, spanning Radio, Far-Infrared, Visible, and Visible + X-ray wavelengths, plus a Dust Map and a Galaxy Density Map]


Particle Physics at the CERN LHC

UA1 at CERN, 1981-1989: "hermetic detector"
ATLAS at the LHC, 2006-2020: 150 x 10^6 sensors
LHC experimental collaborations (e.g. ATLAS) typically involve over 100 institutes and over 1000 physicists worldwide

European Grid Infrastructure (www.egi.eu, EGI-InSPIRE RI-261323)

Status April 2010 (yearly increase)

10,000 users: +5%
243,020 LCPUs (cores): +75%
40 PB disk: +60%
61 PB tape: +56%
15 million jobs/month: +10%
317 sites: +18%
52 countries: +8%
175 VOs: +8%
29 active VOs: +32%

TeraGrid Example: Astrophysics

Science: MHD and star formation; cosmology at galactic scales (6-1500 Mpc) with various components: star formation, radiation diffusion, dark matter

Application: Enzo (loosely similar to: GASOLINE, etc.)

Science Users: Norman, Kritsuk (UCSD), Cen, Ostriker, Wise (Princeton), Abel (Stanford), Burns (Colorado), Bryan (Columbia), O'Shea (Michigan State), Kentucky, Germany, UK, Denmark, etc.

DNA Sequencing Pipeline

Sequencers (Illumina/Solexa, Roche/454 Life Sciences, Applied Biosystems/SOLiD) send data over the Internet for read alignment.

Pipeline: FASTA file (N sequences) -> form block pairings -> blocking -> sequence alignment -> dissimilarity matrix (N(N-1)/2 values) -> MDS and pairwise clustering -> visualization (Plotviz). The stages run under MapReduce or MPI.

~300 million base pairs per day, leading to ~3000 sequences per day per instrument
~500 instruments at ~$0.5M each
Example dataset: 100,043 metagenomics sequences
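As a concrete illustration of the dissimilarity-matrix stage, the sketch below computes the N(N-1)/2 pairwise values for a toy set of reads. It is only a sketch: a simple edit distance stands in for the real alignment scores, and the real pipeline computes these pairs in parallel (MapReduce or MPI) rather than in a double loop.

# Minimal sketch (not the production pipeline): fill the N(N-1)/2
# dissimilarity values using a simple edit distance as the score.
import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two sequences."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1,         # deletion
                           dp[i, j - 1] + 1,         # insertion
                           dp[i - 1, j - 1] + cost)  # substitution
    return int(dp[len(a), len(b)])

def dissimilarity_matrix(seqs):
    """Symmetric N x N matrix; only the N(N-1)/2 upper-triangle pairs are computed."""
    n = len(seqs)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = edit_distance(seqs[i], seqs[j])
    return d

if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "ACGTTCGT", "TTGTACGA", "ACGAACGT"]  # made-up reads
    print(dissimilarity_matrix(toy_reads))

The resulting matrix then feeds MDS or pairwise clustering, exactly as in the pipeline above.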

Lightweight Cyberinfrastructure to support mobile data gathering expeditions, plus classic central resources (as a cloud)

See talk by Je'aime Powell, ECSU

Cyberinfrastructure

Supports the expeditions with a lightweight field system: hardware and system support
Then perform offline processing at Kansas, Indiana and ECSU
Indiana and ECSU facilities and initial field work funded by the NSF PolarGrid MRI, which is now (essentially) completed
Initial basic processing to Level 1B
Extension to L3 with image processing and a data exploration environment
Data is archived at NSIDC

Prasad Gogineni: "With the on-site processing capabilities provided by PolarGrid, we are able to quickly identify Radio Frequency Interference (RFI) related problems and develop appropriate mitigation techniques. Also, the on-site processing capability allows us to process and post data to our website within 24 hours after a flight is completed. This enables scientific and technical personnel in the continental United States to evaluate the results and provide the field team with near real-time feedback on the quality of the data. The review of results also allows us to re-plan and re-fly critical areas of interest in a timely manner."

IU Field Support, Spring 2011

OIB and Twin Otter flights simultaneously, two engineers in the field
The most equipment IU has sent to the field in any season
  Processing and data transfer server at each site
  Two arrays at each field site
Largest set of data capture/backup jobs yet between CReSIS/IU

Supporting Higher Level Data Products

Image Processing
Data Browsing Portal from Cloud
Standalone Data Access in the field
Visualization

Hidden Markov Method based Layer Finding

P. Felzenszwalb, O. Veksler, "Tiered Scene Labeling with Dynamic Programming", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010
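To make the idea concrete, here is a minimal sketch of dynamic-programming layer tracking in an echogram: choose one row per column so that image cost plus a smoothness penalty is minimized. It only illustrates the underlying technique; it is not the tiered-labeling method of the cited paper nor the CReSIS implementation, and the cost weights and synthetic echogram are assumptions.

# Sketch: trace a single bright layer through an echogram column by column.
import numpy as np

def track_layer(cost_img: np.ndarray, smooth: float = 1.0) -> np.ndarray:
    """cost_img[r, c] is the per-pixel cost (e.g. negative intensity).
    Returns the row chosen for each column along the cheapest smooth path."""
    rows, cols = cost_img.shape
    total = np.full((rows, cols), np.inf)
    back = np.zeros((rows, cols), dtype=int)
    total[:, 0] = cost_img[:, 0]
    row_idx = np.arange(rows)
    for c in range(1, cols):
        for r in range(rows):
            # transition cost penalizes jumping many rows between columns
            trans = total[:, c - 1] + smooth * np.abs(row_idx - r)
            back[r, c] = int(np.argmin(trans))
            total[r, c] = trans[back[r, c]] + cost_img[r, c]
    path = np.zeros(cols, dtype=int)
    path[-1] = int(np.argmin(total[:, -1]))
    for c in range(cols - 1, 0, -1):           # backtrack the optimal path
        path[c - 1] = back[path[c], c]
    return path

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((40, 60))
    img[20, :] -= 2.0                          # synthetic bright layer at row 20
    print(track_layer(img))                    # should stay close to row 20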

Current CReSIS Data Organization

The data are organized by season. Seasons are broken into data segments, which are contiguous blocks of data where the radar parameters do not change.

Data segments are broken into frames (typically 50 km in length). Associated data for each frame are stored in different file formats: CSV (flight path), MAT (depth sounder data) and PDF (image products).

The CReSIS data products website lists direct download links for individual files.
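As a hypothetical illustration of consuming that season/segment/frame layout, the sketch below walks a season directory and loads one frame's flight-path CSV and depth-sounder MAT file. The directory layout and file names here are assumptions for illustration, not the actual CReSIS naming scheme.

# Sketch only: the season -> segment -> frame hierarchy mirrors the slide,
# but the names below are hypothetical, not the CReSIS convention.
import csv
from pathlib import Path
from scipy.io import loadmat          # reads the MAT depth-sounder files

def frames(season_dir: Path):
    """Yield (segment name, frame stem) for every frame CSV under a season."""
    for seg_dir in sorted(p for p in season_dir.iterdir() if p.is_dir()):
        for csv_path in sorted(seg_dir.glob("*.csv")):
            yield seg_dir.name, csv_path.stem

def load_frame(season_dir: Path, segment: str, frame: str):
    """Load one frame: flight path rows (CSV) plus depth-sounder arrays (MAT)."""
    seg_dir = season_dir / segment
    with open(seg_dir / f"{frame}.csv", newline="") as fh:
        flight_path = list(csv.DictReader(fh))
    sounder = loadmat(str(seg_dir / f"{frame}.mat"))
    return flight_path, sounder

if __name__ == "__main__":
    season = Path("2009_Antarctica")   # hypothetical season directory
    for segment, frame in frames(season):
        print(segment, frame)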

PolarGrid Data Browser Goals

Organize the data files by their spatial attributes
Support multiple protocols for different user groups, such as a KML service and direct spatial database access
Support efficient access methods in different computing and network environments
Cloud and Field (standalone) versions
Support high-level spatial analysis functions powered by a spatial database


PolarGrid Data Browser Architecture

Two main components: a cloud distribution service and a special service for the PolarGrid field crew.
Data synchronization is supported among multiple spatial databases.

[Architecture diagram: Cloud Access through a GIS Cloud Service (GeoServer and a spatial database on a virtual appliance with a virtual storage service) serving WMS to Matlab/GIS clients, KML to Google Earth, and a Data Portal; Field Access through a Field Service (SpatiaLite spatial database on SQLite, packaged as a virtual appliance) for a single user or multiple users on a local network]

PolarGrid Data Browser: Cloud GIS Distribution Service

Google Earth example: 2009 Antarctica season
Left image: overview of 2009 flight paths
Right image: data access for a single frame


Technologies in Cloud GIS Distribution Service

The geospatial server is based on GeoServer and PostgreSQL (spatial database), configured inside an Ubuntu virtual machine.

A virtual storage service attaches terabyte storage to the virtual machine.

The Web Map Service (WMS) protocol enables users to access the original data set from Matlab and GIS software. KML distribution is aimed at general users. The data portal is built with Google Maps and can be embedded into any website.

PolarGrid data distribution on Google Earth, processed on the cloud using MapReduce
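The WMS access mentioned above is the standard OGC interface, so a Matlab or GIS client fetches map images with an ordinary GetMap request. The sketch below shows such a request from Python; the server URL, layer name and bounding box are placeholders, not the actual PolarGrid endpoints.

# Sketch of a standard WMS 1.1.1 GetMap request; URL, layer and bbox are
# placeholders only, not the real PolarGrid service.
from urllib.parse import urlencode
from urllib.request import urlopen

WMS_BASE = "http://example.org/geoserver/wms"   # hypothetical endpoint

def get_map(layer, bbox, width=800, height=600, out_file="map.png"):
    """Fetch a rendered map image for one layer over a lon/lat bounding box."""
    params = {
        "service": "WMS",
        "version": "1.1.1",
        "request": "GetMap",
        "layers": layer,
        "styles": "",
        "srs": "EPSG:4326",                      # plain longitude/latitude
        "bbox": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "width": width,
        "height": height,
        "format": "image/png",
    }
    with urlopen(f"{WMS_BASE}?{urlencode(params)}") as resp, open(out_file, "wb") as fh:
        fh.write(resp.read())
    return out_file

if __name__ == "__main__":
    # hypothetical layer name for the 2009 Antarctica flight paths
    get_map("polargrid:flight_paths_2009", (-180, -90, 180, -60))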

PolarGrid Field Access Service

The field crew has limited computing resources and internet connection.

The essential data set is downloaded from the Cloud GIS distribution service and packed as a spatial database virtual appliance with SpatiaLite. The whole system can be carried around on a USB flash drive.

The virtual appliance is built on Ubuntu JeOS (just enough operating system); it has almost identical functions to the GIS Cloud service, works on a local network with VirtualBox, and runs with 256 MB of virtual memory.

The SpatiaLite database is a lightweight spatial database based on SQLite. It aims at a single user; the data can be accessed through GIS software, and a native API for Matlab has also been developed.
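Because the field copy is just a SpatiaLite file, it can also be queried directly from a script. The sketch below opens such a database from Python via the mod_spatialite extension; the table and column names are assumptions for illustration, not the PolarGrid schema.

# Sketch: query a SpatiaLite file in the field with plain Python.
# Requires the mod_spatialite extension library to be installed locally;
# the table/column names below are assumptions, not the PolarGrid schema.
import sqlite3

def frames_in_box(db_path, minx, miny, maxx, maxy):
    """Return frames whose geometry intersects a lon/lat bounding box."""
    conn = sqlite3.connect(db_path)
    conn.enable_load_extension(True)
    conn.load_extension("mod_spatialite")   # adds the spatial SQL functions
    sql = """
        SELECT frame_id, AsText(geometry)
        FROM flight_frames
        WHERE MbrIntersects(geometry, BuildMbr(?, ?, ?, ?))
    """
    rows = conn.execute(sql, (minx, miny, maxx, maxy)).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    for frame_id, wkt in frames_in_box("polargrid_field.sqlite",
                                       -120.0, -80.0, -100.0, -75.0):
        print(frame_id, wkt[:60])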

PolarGrid Field Access Service

SpatiaLite data access with the Quantum GIS interface
Left image: 2009 Antarctica season vector data, originally stored in 828 separate files
Right image: visual crossover analysis for quality control (work in progress)



Use of Tiled Screens

C4 = Continuous Collaborative Computational Cloud

C4 EMERGING VISION

While the internet has changed the way we communicate and get entertainment, we need to empower the next generation of engineers and scientists with technology that enables interdisciplinary collaboration for lifelong learning.

Today, the cloud is a set of services that people explicitly have to access (from laptops, desktops, etc.). In 2020 the C4 will be part of our lives, as a larger, pervasive, continuous experience. The measure of success will be how "invisible" it becomes.

We are no prophets and can't anticipate what exactly will work, but we expect to have high bandwidth and ubiquitous connectivity for everyone everywhere, even in rural areas (using power-efficient micro data centers the size of shoe boxes). Here the cloud will enable business, fun, and the destruction and creation of regimes (societies).

C4 Society Vision

Education should also embrace C4

[Diagram: C4 (Continuous Collaborative Computational Cloud) Intelligence supporting a C4 Intelligent Economy, C4 Intelligent People and a C4 Intelligent Society. Motivating issues: job/education mismatch; Higher Ed rigidity; interdisciplinary work; Engineering vs. Science, Little vs. Big science. NSF and education themes: Modeling & Simulation and C(DE)SE; educate the "Net Generation"; re-educate the pre-"Net Generation" in Science and Engineering; exploiting and developing C4; C4 curricula and programs; C4 experiences (delivery mechanism); C4 REUs, Internships, Fellowships; Computational Thinking; Internet & Cyberinfrastructure; Higher Education 2020]

CDESE is Computational and Data-enabled Science and Engineering

ADMI Cloudy View on Computing Workshop, June 2011

Concept and delivery by Jerome Mitchell: Undergraduate ECSU, Masters Kansas, PhD Indiana
Jerome took two courses from IU in this area, Fall 2010 and Spring 2011
ADMI: Association of Computer and Information Science/Engineering Departments at Minority Institutions
Offered on FutureGrid (see later)
10 faculty and graduate students from ADMI universities
The workshop provided information from cloud programming models to case studies of scientific applications on FutureGrid.
At the conclusion of the workshop, the participants indicated that they would incorporate cloud computing into their courses and/or research.

ADMI Cloudy View on Computing Workshop Participants

Workshop Purpose
Introduce ADMI to the basics of the emerging Cloud Computing paradigm
  Learn how it came about
  Understand its enabling technologies
  Understand the computer systems constraints, tradeoffs, and techniques of setting up and using clouds
Teach ADMI how to implement algorithms in the Cloud
  Gain competence in cloud programming models for distributed processing of large datasets
  Understand how different algorithms can be implemented and executed on cloud frameworks
  Evaluate the performance and identify bottlenecks when mapping applications to the clouds




3-way Cyberinfrastructure

Use it in faculty, graduate student and undergraduate research
  ~12 students each summer at IU from ADMI
Teach it, as it involves areas of Information Technology with lots of job opportunities
Use it to support a distributed learning environment
  A cloud backend for course materials and collaboration
  Tiled display for visualization
  Green computing infrastructure

Some Next Steps

Develop Appliances (virtual machine based preconfigured computer systems) to support programming laboratories
Offer a Cloud Computing course with
  Web portal support
  FutureGrid or Appliances locally
  Distance delivery
Deliver first to ECSU, then other MSIs
Write proposals with Linda Hayden at ECSU and …
Develop Cloud Computing Certificates and other degree offerings
  Masters, Undergraduate, Continuing education …

US Cyberinfrastructure Context

There is a rich set of facilities:
  Production TeraGrid facilities with distributed and shared memory
  Experimental "Track 2D" Awards
    FutureGrid: distributed systems experiments, cf. Grid5000
    Keeneland: powerful GPU cluster
    Gordon: large (distributed) shared memory system with SSD aimed at data analysis/visualization
  Open Science Grid, aimed at high throughput computing and strong campus bridging

TeraGrid: 3 petaflops, high performance networks (40 GB/sec). The TeraGrid currently delivers an average of 420,000 CPU-hours per day, with 50 petabytes of online and archival data storage.

https://portal.futuregrid.org

FutureGrid key Concepts I

FutureGrid is a 4 year, $15M project with 7 clusters at 5 sites across the country and 8 funded partners

FutureGrid is a flexible testbed supporting Computer Science and Computational Science experiments in
  Innovation and scientific understanding of distributed computing (cloud, grid) and parallel computing paradigms
  The engineering science of middleware that enables these paradigms
  The use and drivers of these paradigms by important applications
  The education of a new generation of students and workforce on the use of these paradigms and their applications
  Interoperability, functionality, performance or evaluation



FutureGrid key Concepts II

Rather than loading images onto VMs, FutureGrid supports Cloud, Grid and Parallel computing environments by dynamically provisioning software as needed onto "bare-metal"
  Image library for MPI, OpenMP, Hadoop, Dryad, gLite, Unicore, Globus, Xen, ScaleMP (distributed shared memory), Nimbus, Eucalyptus, OpenNebula, KVM, Windows …
  Growth comes from users depositing novel images in the library
Each use of FutureGrid is an experiment that is reproducible
Developing novel software to support these goals, building on Grid5000 in France

[Diagram: Choose an image (Image1, Image2, … ImageN) from the library, Load it, Run it]

FutureGrid Partners

Indiana University (Architecture, core software, Support)
Purdue University (HTC Hardware)
San Diego Supercomputer Center at University of California San Diego (INCA, Monitoring)
University of Chicago/Argonne National Labs (Nimbus)
University of Florida (ViNe, Education and Outreach)
University of Southern California Information Sciences (Pegasus to manage experiments)
University of Tennessee Knoxville (Benchmarking)
University of Texas at Austin/Texas Advanced Computing Center (Portal)
University of Virginia (OGF, Advisory Board and allocation)
Center for Information Services and GWT-TUD from Technische Universität Dresden (VAMPIR)

Red institutions have FutureGrid hardware

FutureGrid: a Grid/Cloud/HPC Testbed

[Diagram: FutureGrid hardware on private and public networks connected by the FG Network; NID = Network Impairment Device]

5 Use Types for FutureGrid

Training, Education and Outreach
  Semester and short events; promising for outreach
Interoperability test-beds
  Grids and Clouds; OGF really needed this
Domain Science applications
  Life science highlighted
Computer science
  Largest current category
Computer Systems Evaluation
  TeraGrid (TIS, TAS, XSEDE), OSG, EGI

Education & Outreach on FutureGrid

Build up tutorials on supported software
Support development of curricula requiring privileges and systems destruction capabilities that are hard to provide on the conventional TeraGrid
Offer a suite of appliances (customized VM based images) supporting online laboratories
Supported several workshops including the Virtual Summer School on "Big Data", July 26-30 2010; TeraGrid '10 "Cloud technologies, data-intensive science and the TG", August 2010; CloudCom conference tutorials, Nov 30 - Dec 3 2010; and the ADMI Cloudy View of Computing workshop, June 2011
Experimental class use at Indiana, Florida and LSU