Programming models for data-intensive computing

A multi-dimensional problem

- Sophistication of the target user
  - N(data analysts) > N(computational scientists)
- Level of expressivity
  - A high level is important for interactive analysis
- Volume of data
  - The complex gigabyte vs. the enormous petabyte
- Scale and nature of platform
  - How important are reliability, failure handling, etc.?
  - What QoS is needed? Where is it enforced?


Separating concerns

- What things carry over from conventional HPC?
  - Parallel file systems, collective I/O, workflow, MPI, OpenMP, PETSc, ESMF, etc.
- What things carry over from conventional data management?
  - Need for abstractions and data-level APIs: R, SPSS, MATLAB, SQL, NetCDF, HDF, Kepler, Taverna
  - Streaming databases, streaming data systems
- What is unique to "data HPC"?
  - New needs at the platform level
  - New tradeoffs between the high-level model and the platform


Current models

- Data-parallel
  - A space of data objects
  - A set of operators on those objects (a minimal sketch follows this list)
- Streaming
- Scripting
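
A minimal sketch of the data-parallel model above: a "space" of data objects (here, a few in-memory partitions) plus a small set of operators applied uniformly across them, with an associative combine to merge per-partition results. The partitioning, the toy data, and the sum-of-squares operator are illustrative assumptions, not taken from the notes.

    # Data-parallel sketch: independent operators over data objects,
    # followed by a global reduction. Illustrative only.
    from functools import reduce
    from concurrent.futures import ProcessPoolExecutor

    partitions = [
        [1.2, 3.4, 5.6],   # data object 0
        [2.1, 0.3],        # data object 1
        [9.9, 4.4, 7.7],   # data object 2
    ]

    def local_sum_of_squares(part):
        # Operator applied independently to each data object.
        return sum(x * x for x in part)

    def combine(a, b):
        # Associative combiner used to merge per-partition results.
        return a + b

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            partials = list(pool.map(local_sum_of_squares, partitions))  # parallel map
        total = reduce(combine, partials)                                # global reduce
        print(total)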

Conclusions

- Current HPC programming models fail to address important data-intensive needs
- An urgent need for a careful gap analysis aimed at identifying important things that cannot [easily] be done with current tools
  - Ask people for their "top 20" questions
  - Ethnographic studies
- A need to revisit the "stack" from the perspective of data-intensive HPC apps

Programming models for data-intensive computing

- Will the flat message-passing model scale to >1M cores?
- How does multi-level parallelism (e.g., GPUs) impact DIC?
- MapReduce, Dryad, Swift: what apps do they support? How well suited are they for PDEs?
- How will 1K-core PCs change DIC?
- Powerful data-centric programming primitives to express high-level parallelism in a natural way while shielding physical configuration issues: what do we need?
- If we design a supercomputer for DIC, what are the requirements?
- What if storage controllers allowed application-level control? Should they permit cross-layer control?
- New frameworks for reliability and availability (going beyond checkpointing)
- How will different models and frameworks interoperate?
- How do we support people who want large shared memory?


Programming models

- Data parallel
  - MapReduce
- Loosely synchronized chunks of work
  - Dryad, Swift, scripting
- Libraries
  - e.g., Ntropy
- Expressive power vs. scale
  - BigTable (HBase)
- Streaming, online
- Dataflow
- What operators for data-intensive computing (beyond map/reduce)?
  - Sum, Average, ...
- Two main models
  - Data parallel
  - Streaming
- Goal: "use within 30 minutes; still discovering new power in 2 years' time"
- Integration with programming environments
- Working remotely with large datasets
- Dataset
  - Put it in the time domain, then the frequency domain, and plot the result (see the sketch after this list)
- Multiple levels of abstraction? All-pairs.
- Note that there are many ways to express things at the high level; the challenge is implementing them
- "Users don't want to compile anymore"
- Who are we targeting? Specialists or generalists?
- Focus on the need for rapid decision making
- Composable models
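
A minimal sketch of the dataset task mentioned above (time domain, then frequency domain, then plot), using NumPy and Matplotlib on a synthetic signal. The sampling rate and the two-tone signal are illustrative assumptions.

    # Time-domain signal -> frequency-domain spectrum -> plot both.
    import numpy as np
    import matplotlib.pyplot as plt

    fs = 1000.0                            # sampling rate in Hz (assumed)
    t = np.arange(0, 1.0, 1.0 / fs)        # one second of samples
    signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

    spectrum = np.fft.rfft(signal)         # frequency-domain representation
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

    fig, (ax_time, ax_freq) = plt.subplots(2, 1)
    ax_time.plot(t, signal)                # time domain
    ax_time.set_xlabel("time (s)")
    ax_freq.plot(freqs, np.abs(spectrum))  # frequency domain
    ax_freq.set_xlabel("frequency (Hz)")
    plt.tight_layout()
    plt.show()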


Dimensions of problem

- Level of expressivity
- Volume of data
- Scale of platform
  - Reliability, failure, etc.
- Gauge the size of the problem you are asking to solve
- QoS guarantees
- Ability to practice on smaller datasets



Types of data + nature of the operators

- Select, e.g. on a spatial region; temporal operators (a composition sketch follows this list)
- Data scrubbing: data transposition, transforms
- Data normalization
- Statistical analysis operators
- Look at LINQ
- Aggregation
  - Combine
- Smart segregation to fit on the hardware
- Need to deal with distributed data
  - e.g., column-oriented stores can help with that
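
A hedged sketch of how such operators might compose: a spatial-region select feeding an aggregation, written as plain Python generators. The record layout, bounding box, and field names are made up for illustration; a real system would push these operators down to distributed, possibly column-oriented storage.

    # Composable select and aggregate operators over toy records.
    from statistics import mean

    records = [
        {"lat": 47.6, "lon": -122.3, "temp": 11.2},
        {"lat": 34.0, "lon": -118.2, "temp": 19.5},
        {"lat": 47.7, "lon": -122.4, "temp": 10.8},
    ]

    def select_region(rows, lat_min, lat_max, lon_min, lon_max):
        # Selection operator: keep rows inside a bounding box.
        for r in rows:
            if lat_min <= r["lat"] <= lat_max and lon_min <= r["lon"] <= lon_max:
                yield r

    def aggregate_mean(rows, field):
        # Aggregation operator: mean of one field over the selected rows.
        return mean(r[field] for r in rows)

    selected = select_region(records, 47.0, 48.0, -123.0, -122.0)
    print(aggregate_mean(selected, "temp"))   # mean temperature in the region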






Moving forward

- Ethnographic studies (e.g., Borgman)
- Ask for people's top 20 questions/scenarios
  - Astronomers
  - Environmental science
  - Chemistry ...
- E.g., see how SciDB is reaching out to communities

DIC hardware architecture

- Different compute-I/O balance
  - 0.1 B/flop for a supercomputer ("all memory to disk in 5 minutes" is an unrealizable goal)
  - Assume that it should be greater: Amdahl (a worked example follows this list)
  - See Alex Szalay's paper
- GPU-like systems, but with more memory per core
- Future streaming rates: what are they?
- Innovative networking, data routing
- Heterogeneous systems, perhaps (e.g., M vs Ws)
- Reliability: where is it implemented?
- What about software failures?
- A special OS?
- New ways of combining hardware and software?
  - Within a system, and/or between systems
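
A rough worked example of how the balance numbers above translate into I/O requirements. The machine parameters (a 1 Pflop/s system with 1 PB of memory) are illustrative assumptions, not figures from the notes, and the 0.1 B/flop figure is read here as an I/O-bandwidth balance.

    # Illustrative balance arithmetic; all parameters are assumptions.
    peak_flops   = 1e15          # assumed peak: 1 Pflop/s
    memory_bytes = 1e15          # assumed memory: 1 PB

    # I/O bandwidth implied by a 0.1 byte-per-flop balance:
    implied_bw = 0.1 * peak_flops                  # 1e14 B/s
    print(f"0.1 B/flop at 1 Pflop/s -> {implied_bw / 1e12:.0f} TB/s of I/O")

    # Aggregate bandwidth needed to dump all memory to disk in 5 minutes:
    needed_bw = memory_bytes / (5 * 60)            # ~3.3e12 B/s
    print(f"1 PB to disk in 5 min   -> {needed_bw / 1e12:.1f} TB/s of I/O")

    # Even the 5-minute dump needs several TB/s of sustained file-system
    # bandwidth, far beyond what parallel file systems typically deliver,
    # which is the sense in which the notes call the goal unrealizable and
    # argue (with Amdahl and Szalay) for a more I/O-rich balance in DIC machines.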

Modeling

- "Query estimation" and status monitoring for DIC applications

1000-core PCs

- Increases the data management problem
- Enables a wider range of users to do DIC
- More complex memory hierarchy
  - 200 mems
- We'll have amazing games with realistic physics

Infinite bandwidth

- Do everything in the cloud

MapReduce-related thoughts

- MR is library-based, which makes optimization more difficult. Type checking. Annotations. (See the sketch after this list.)
- Are there opportunities for optimization if we incorporate these ideas into extensible languages?
- Ways to enforce/leverage/enable domain-specific semantics
- Interoperability/portability?
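
A minimal sketch of the "library-based" point: in the toy runner below the framework sees only opaque user callables, so it has nothing to type-check and no visibility for cross-phase optimization. This is an illustrative toy, not any real MapReduce implementation.

    # Toy library-based MapReduce runner over in-memory records.
    from collections import defaultdict

    def run_mapreduce(records, map_fn, reduce_fn):
        groups = defaultdict(list)
        for rec in records:
            for key, value in map_fn(rec):        # opaque user code
                groups[key].append(value)
        return {k: reduce_fn(k, vs) for k, vs in groups.items()}  # opaque user code

    # User-supplied functions for a word count:
    def map_words(line):
        for word in line.split():
            yield word.lower(), 1

    def count(_word, values):
        return sum(values)

    lines = ["the quick brown fox", "the lazy dog", "The fox"]
    print(run_mapreduce(lines, map_words, count))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}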


Most important ideas

- Current HPC practice fails for DIC, and it fails badly. Make it easier for the domain scientist; enable new types of science.
- Gap analysis: articulate what we can do with MPI and MR, what we can't do with either, and why.
- Propagating information between layers.