here - Systems Research Group - University of Illinois at Urbana ...

Oct 2, 2013


Roy H. Campbell
rhc@illinois.edu

Reza Farivar, Abhishek Verma, Cristina Abad
{farivar2, verma7, cabad}@illinois.edu



Motivation and Goals


Research teams and practitioners are embracing
cloud computing technologies for compute-intensive
tasks


E.g. Genetic Algorithms, Financial Algorithms,
Bioinformatics, Astronomy, Machine Learning, Web
Analytics, etc.


Many economic advantages


Not clear if such tasks perform optimally using
MapReduce on COTS clusters (especially GPU clusters)



Research Goals: Investigate bottlenecks of

COTS + MapReduce + Compute Intensive Tasks


Summary



Financial Computations


Genetic Algorithms for Optimization


Astronomy


Gene Alignment


Partitioned Iterative Algorithms: Best Effort


Clouds, Machine Learning and Reliability


Storage Workload Characterization


Workload Modeling


Financial Computations


Black-Scholes future options pricing on a
MapReduce cluster

Using MITHRA, our modified “MapReduce on
GPU clusters” middleware

MITHRA runs map() on GPUs as CUDA kernels

reduce() runs on the cluster CPUs

Better use of GPU hardware: increased exploitation of locality
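The map()/reduce() split can be illustrated with a minimal sketch. The slide does not say how the price is computed, so this assumes a Monte Carlo estimate of a European call. Plain Python stands in for the CUDA kernels, and the function names and parameter values are hypothetical, not MITHRA's API.

```python
import math
import random

def bs_call_closed_form(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price of a European call (reference value)."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)

def map_batch(seed, n_paths, s0, k, r, sigma, t):
    """map(): simulate one independent batch, emit (payoff_sum, path_count).
    Assumption: this is the part that would run as a GPU kernel."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        terminal = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(terminal - k, 0.0)
    return total, n_paths

def reduce_batches(partials, r, t):
    """reduce(): combine partial sums on the CPU and discount the mean payoff."""
    total = sum(p for p, _ in partials)
    count = sum(n for _, n in partials)
    return math.exp(-r * t) * total / count

# Hypothetical contract parameters, chosen only for illustration.
params = dict(s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0)
partials = [map_batch(seed, 25_000, **params) for seed in range(8)]
mc_price = reduce_batches(partials, params["r"], params["t"])
exact_price = bs_call_closed_form(**params)
print(f"Monte Carlo: {mc_price:.3f}  closed form: {exact_price:.3f}")
```

Each batch depends only on its seed, so the map() calls need no communication; reduce() only has to add a handful of partial sums.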


Genetic Algorithms for Optimization

1. Initialize population with random individuals.

2. Evaluate fitness value of individuals.

3. Repeat steps 4-5, followed by step 2, until some convergence
criteria are met.

4. Select good solutions by using tournament
selection without replacement.

5. Create new individuals by recombining the
selected population using uniform crossover.
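The five steps can be sketched sequentially as follows. The OneMax fitness function (count of 1 bits), the population size, and the tournament size are assumptions for illustration; the slide does not name the optimization problem.

```python
import random

def fitness(individual):
    """Hypothetical objective (OneMax): number of 1 bits."""
    return sum(individual)

def tournament_without_replacement(pop, fits, s, rng):
    """Step 4: shuffle, split into groups of size s, keep each group's best.
    Repeat until as many winners as individuals have been selected."""
    assert 1 <= s <= len(pop)
    winners = []
    while len(winners) < len(pop):
        order = list(range(len(pop)))
        rng.shuffle(order)
        for start in range(0, len(order) - s + 1, s):
            group = order[start:start + s]
            winners.append(pop[max(group, key=lambda j: fits[j])])
    return winners[:len(pop)]

def uniform_crossover(a, b, rng):
    """Step 5: swap each gene between the two parents with probability 0.5."""
    c1, c2 = a[:], b[:]
    for i in range(len(a)):
        if rng.random() < 0.5:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

def run_ga(n_bits=32, pop_size=40, s=2, max_gens=100, seed=0):
    """Returns the best fitness in the final population (pop_size must be even)."""
    rng = random.Random(seed)
    # Step 1: random initial population.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(max_gens):
        fits = [fitness(ind) for ind in pop]      # step 2
        if max(fits) == n_bits:                   # step 3: convergence check
            break
        parents = tournament_without_replacement(pop, fits, s, rng)
        rng.shuffle(parents)
        pop = []
        for i in range(0, pop_size, 2):
            pop.extend(uniform_crossover(parents[i], parents[i + 1], rng))
    return max(fitness(ind) for ind in pop)

print(run_ga())
```

Fitness evaluation is independent per individual, which makes step 2 the naturally parallel part; selection and crossover need to see groups of individuals together.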

[Slide annotations on the algorithm steps: Map, Reduce]


Astronomy


Use Hadoop Streaming to run multiple, parallel
instances of an astronomy source extraction program:
SExtractor


Use MapReduce intermediate key grouping / sorting to
help merge catalog records
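A minimal sketch of merging by key grouping, assuming a hypothetical catalog line format of `image_id x y flux` and a hypothetical grid cell size for turning positions into keys. An in-memory sort stands in for the MapReduce shuffle/sort phase.

```python
from itertools import groupby

def map_record(line, cell=1.0):
    """Mapper: emit ((x_cell, y_cell), (image_id, flux)) for one catalog line.
    The line format and the cell size are assumptions for illustration."""
    image_id, x, y, flux = line.split()
    key = (round(float(x) / cell), round(float(y) / cell))
    return key, (image_id, float(flux))

def merge_catalogs(lines, cell=1.0):
    """Reducer side: records that share an (X, Y) key arrive together."""
    pairs = sorted(map_record(line, cell) for line in lines)  # shuffle/sort stand-in
    merged = {}
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [value for _, value in group]
        merged[key] = {
            "images": sorted({image for image, _ in values}),
            "mean_flux": sum(flux for _, flux in values) / len(values),
        }
    return merged

catalog_lines = [
    "a 10.1 20.2 5.0",
    "b 10.0 19.9 7.0",
    "a 50.0 50.0 1.0",
]
for key, record in merge_catalogs(catalog_lines).items():
    print(key, record)
```

Detections of the same source in overlapping images land on the same key, so the framework's grouping does the matching work.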

[Figure: four-phase pipeline diagram. Labels: File 1 to File 4; Pre-processing / Metadata generation; Unique IDs; File Fetch MapReduce Job; HDFS; SExtractor MapReduce Job; Individual Catalogs; Merging (uses X,Y as key); HDFS; Post-Processing; Merged catalog; Phase 1, Phase 2, Phase 3, Phase 4]


Gene Alignment: Distributed Filtering

Reference:
TGCCTTCATTTCGTTATGTACCCAGTAGTCATAAAAGCACTAGCTTGCCAAGTT

6-mers from a sliding window:
TGCCTT  GCCTTC  CCTTCA  CTTCAT  TTCATT

Masks: 1 1 0, 1 0 1, 0 1 1 (each bit covers two bases)

Sorted Masked Arrays

Mask 1 0 1: TG00TT  GC00TC  CC00CA  CT00AT  TT00TT
Mask 1 1 0: TGCC00  GCCT00  CCTT00  CTTC00  TTCA00
Mask 0 1 1: 00CCTT  00CTTC  00TTCA  00TCAT  00CATT

Distributed pigeonhole filter


Masked Read Matching

A Short Read: CCATCA

Masks: 1 1 0, 1 0 1, 0 1 1

Masked reads: CCAT00  CC00CA  00ATCA

Sorted Masked Arrays

Mask 1 0 1: TG00TT  GC00TC  CT00AT  CC00CA  TT00TT
Mask 1 1 0: TGCC00  GCCT00  CCTT00  CTTC00  TTCA00
Mask 0 1 1: 00CCTT  00CTTC  00TTCA  00TCAT  00CATT

Match: CC00CA (the read under mask 1 0 1)
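The masking and lookup can be sketched as below, using the reference, masks, and read from the slides. One substitution is made for brevity: hash-map lookups replace the sorted arrays (which would be searched by binary search), and everything runs on one machine rather than distributed. Function names are hypothetical.

```python
from collections import defaultdict

MASKS = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]
BLOCK = 2  # each mask bit covers two bases of a 6-mer

def apply_mask(kmer, mask):
    """Keep the blocks whose mask bit is 1; zero out the others."""
    return "".join(
        kmer[i * BLOCK:(i + 1) * BLOCK] if keep else "0" * BLOCK
        for i, keep in enumerate(mask)
    )

def build_index(reference, k=6):
    """For every mask, map each masked k-mer to its positions in the reference."""
    index = {mask: defaultdict(list) for mask in MASKS}
    for pos in range(len(reference) - k + 1):
        kmer = reference[pos:pos + k]
        for mask in MASKS:
            index[mask][apply_mask(kmer, mask)].append(pos)
    return index

def find_candidates(index, read):
    """Pigeonhole filter: if all mismatches fall inside one block, the read
    still matches the reference exactly under the mask that hides that block."""
    hits = set()
    for mask in MASKS:
        hits.update(index[mask].get(apply_mask(read, mask), []))
    return hits

REFERENCE = "TGCCTTCATTTCGTTATGTACCCAGTAGTCATAAAAGCACTAGCTTGCCAAGTT"
index = build_index(REFERENCE)
print(find_candidates(index, "CCATCA"))
```

Candidates returned by the filter still need full verification against the reference; the filter only narrows where to look.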


Iterative Computations

PageRank

Clustering

BFS

YouTube Video Suggestion

Pattern Recognition

Partitioned Iterative Convergence: Best Effort

[Figure: best-effort iteration diagram. Labels: Input Partitioner; Cluster node 1, Cluster node 2, Cluster node 3; Local iteration; Model Update; Current Model(s); New sub-model; Model effect applicator; Global Model Merge; New Model; Convergence test; Convergence Criteria; Shared model management]
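One possible reading of the diagram, sketched for 1-D k-means: partition the input, iterate locally on each partition to get sub-models, merge the sub-models into a global model, and repeat until a convergence test passes. The choice of k-means, the count-weighted merge rule, and all names are assumptions; the slide shows only the component labels.

```python
def local_iterations(points, centers, iters=5):
    """Per-node work: a few k-means iterations on one partition only.
    Returns the sub-model (centers) and per-center point counts."""
    centers = list(centers)
    counts = [0] * len(centers)
    for _ in range(iters):
        sums = [0.0] * len(centers)
        counts = [0] * len(centers)
        for p in points:
            j = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            sums[j] += p
            counts[j] += 1
        centers = [sums[j] / counts[j] if counts[j] else centers[j]
                   for j in range(len(centers))]
    return centers, counts

def merge_models(sub_models, previous):
    """Global model merge: count-weighted average of the sub-model centers."""
    merged = []
    for j in range(len(previous)):
        weight = sum(counts[j] for _, counts in sub_models)
        total = sum(centers[j] * counts[j] for centers, counts in sub_models)
        merged.append(total / weight if weight else previous[j])
    return merged

def best_effort_kmeans(points, centers, n_parts=3, tol=1e-9, max_rounds=50):
    partitions = [points[i::n_parts] for i in range(n_parts)]  # input partitioner
    for _ in range(max_rounds):
        sub_models = [local_iterations(part, centers) for part in partitions]
        new_centers = merge_models(sub_models, centers)
        converged = max(abs(a - b) for a, b in zip(new_centers, centers)) < tol
        centers = new_centers
        if converged:  # convergence test on the global model
            break
    return centers

data = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5]
print(best_effort_kmeans(data, [1.0, 8.0]))
```

The local iterations never exchange data, so global synchronization happens once per round instead of once per iteration.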

Clouds, Machine Learning and Reliability


Trend: Clouds will expand into diverse roles


Big Data


Data mining and machine learning


Real time data


Streaming clouds (e.g. Storm)


Economic pressure: massive cloud adoption


Results fed into cyber-physical systems


Result: The reliability and security of (1) clouds
and (2) ML algorithms on clouds will impact real-world
phenomena


The current cloud solutions are orders of
magnitude less dependable than the minimum
requirements for cyber-physical systems



Cloud Storage Workload Characterization


Studied how MapReduce interacts with the storage
layer


Findings relevant to storage system design and tuning:

o Workloads are dominated by high file churn

o 80%-90% of files accessed 1-10 times in 6 months

o Small % of very popular files

o Young files: high % of accesses, small % of bytes stored

o Requests are bursty

o Files are very short-lived: 90% of deletions target files < 1.25 hours old


Big Data Storage Workloads: Modeling and Synthetic Generation


One potential storage bottleneck:

o Metadata server: must handle a large number of bursty requests


New schemes have been proposed, but evaluation has been insufficient

o No adequate traces or models


Mimesis: synthetic workload generator

o Suitable for Big Data workloads

o Reproduces desired statistical properties of the original trace

o Accurate: low RMSE (root mean squared error) when used in
place of original traces


Used to evaluate an LRU metadata cache for HDFS
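A minimal sketch of how an LRU metadata cache can be scored against an access trace. The class, the trace, and the capacity are hypothetical; this is neither the Mimesis generator nor the HDFS cache that was evaluated.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert the key and evict if full."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        self.entries[key] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used
        return False

def count_hits(trace, capacity):
    cache = LRUCache(capacity)
    return sum(cache.access(key) for key in trace)

# Hypothetical trace of file names whose metadata is requested, in order.
trace = ["a", "b", "a", "c", "b", "d", "a"]
hits = count_hits(trace, capacity=2)
print(f"{hits} hits out of {len(trace)} requests")
```

Replaying a synthetic trace and an original trace through the same cache, then comparing the hit rates, is one way to judge how faithful the synthetic workload is.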

Performance Modeling of MapReduce Environments


Performance modeling techniques for MapReduce environments


Analytical models, simulation, experimental measurements


Service level objectives:


Automatic Resource Inference and Allocation of resources
for MapReduce workloads


Optimization of makespans of sets of jobs and DAGs


Comparison of hardware alternatives


Comparison of Hardware Alternatives


Designed a synthetic MapReduce application
based on the CPU, memory, disk and network used


Goal: Find a minimum set (basis) of these synthetic
applications onto which any MapReduce workload
can be projected


Using performance of the basis on old and new
hardware, estimated performance of any workload
on new hardware within 10% error.
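The projection idea can be sketched under a strong simplifying assumption: a workload's resource profile is a linear combination of the basis profiles, and its runtime on new hardware is the same combination of the basis runtimes measured there. The linear model, the profiles, and the runtimes below are hypothetical; the slide does not describe the actual estimation method.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def project(basis, workload):
    """Least-squares coefficients of the workload on the basis profiles,
    from the normal equations (B^T B) c = B^T w."""
    k = len(basis)
    gram = [[sum(x * y for x, y in zip(basis[i], basis[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(x * y for x, y in zip(basis[i], workload)) for i in range(k)]
    return solve(gram, rhs)

def predict_runtime(coeffs, basis_runtimes):
    """Assumed linear model: combine the basis runtimes with the same weights."""
    return sum(c * t for c, t in zip(coeffs, basis_runtimes))

# Hypothetical (CPU, memory, disk, network) profiles of two basis applications.
basis = [[3.0, 1.0, 0.0, 0.0],   # CPU-heavy synthetic application
         [0.0, 1.0, 3.0, 1.0]]   # I/O-heavy synthetic application
workload = [6.0, 3.0, 3.0, 1.0]  # profile measured on the old hardware
coeffs = project(basis, workload)
estimate = predict_runtime(coeffs, basis_runtimes=[30.0, 50.0])
print(coeffs, estimate)
```

Only the basis applications have to be run on the new hardware; every other workload is estimated from its coefficients.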
