Data Grid Technologies

Nov 12, 2013

Sathish Vadhiyar

Sources/Credits: Technical papers
listed in references

Replica Strategies

Problem Motivation

Replication to deal with faults and provide
scheduling flexibility.

Given a file that is partitioned into blocks
that are replicated throughout a wide-area
file system, how can a client retrieve the
file with the best performance?

Various algorithms

Basic Downloading Algorithm

The client opens a thread to each server
containing the file

A block size is chosen

Each thread selects a different block to
download and all threads start downloading

When a thread finishes a block, it chooses a
new block that has not been downloaded and is
not currently being downloaded by any other
thread

Adaptive: servers with higher bandwidth to
the client end up serving more blocks

Selection of block size is tricky
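
The basic algorithm can be sketched as a small event-driven simulation instead of real network code. This is a minimal sketch under stated assumptions: the function name simulate_basic_download, the server names, and the bandwidth figures are hypothetical, and each server's bandwidth is assumed to be constant and unaffected by the other threads.

```python
import heapq


def simulate_basic_download(num_blocks, block_size, bandwidths):
    """Simulate one download thread per server.

    bandwidths maps a (hypothetical) server name to the bandwidth the
    client sees from it, in block_size units per second.
    Returns (finish_time, blocks_served_per_server).
    """
    next_block = 0                           # next block nobody has started yet
    served = {server: 0 for server in bandwidths}
    events = []                              # (time the thread becomes free, server)
    for server in sorted(bandwidths):
        heapq.heappush(events, (0.0, server))

    finish_time = 0.0
    while events:
        now, server = heapq.heappop(events)
        finish_time = max(finish_time, now)
        if next_block >= num_blocks:
            continue                         # nothing left, this thread retires
        next_block += 1                      # claim a block no other thread has
        served[server] += 1
        transfer_time = block_size / bandwidths[server]
        heapq.heappush(events, (now + transfer_time, server))
    return finish_time, served
```

With 10 blocks of size 1 and bandwidths of 4 and 1, the faster server ends up serving 8 blocks and the download finishes at time 2.0. Splitting the same data into 2 blocks of size 5 finishes at time 5.0, because the slow server holds the last block, which is one reason the block size is hard to choose.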

Aggressive Redundancy

To provide fault tolerance and to improve
download time

A redundancy factor, R

The client downloads a block
simultaneously from R servers

Only one copy is used: whichever returns first
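
Aggressive redundancy could look roughly like the following sketch. The network read is replaced by a sleep, and the names fetch_block and download_redundant, the server list, and the delays are all assumptions made for illustration. It also assumes R is at least 1 and the server list is not empty.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def fetch_block(server, delay, block_id):
    """Stand-in for a real network read: sleep, then return fake bytes."""
    time.sleep(delay)
    return server, f"block-{block_id}".encode()


def download_redundant(block_id, servers, R):
    """Request block_id from the first R servers and keep the first reply.

    servers is a list of (name, simulated_delay_seconds) pairs.
    """
    chosen = servers[:R]
    pool = ThreadPoolExecutor(max_workers=len(chosen))
    futures = [pool.submit(fetch_block, name, delay, block_id)
               for name, delay in chosen]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)        # do not block on the slower copies
    return next(iter(done)).result()


if __name__ == "__main__":
    replicas = [("slow", 0.3), ("fast", 0.01), ("medium", 0.15)]
    print(download_redundant(7, replicas, R=3))
```

The slower transfers still run to completion in the background here, which mirrors the cost of this approach: up to R times the bandwidth is spent on every block.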

Progress-Driven Redundancy

Retry a download only when it is
progressing slowly

Progress number P, redundancy factor R

Each block assigned a download number
initialized to 0

When a thread attempts to download a
block, it increments the block’s download
number


Progress-Driven Redundancy (Continued)

For selecting a new block to download:

If there is a block B whose download number
< R, and if there are P blocks after B whose
downloads have completed, then select B

Else select the next block whose download
number is zero
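
The selection rule above can be sketched as a single function. This is one possible reading, not the authors' implementation: the name select_block and the list-based bookkeeping are hypothetical, and the sketch assumes that a block which has already completed is never selected again.

```python
def select_block(download_number, completed, P, R):
    """Return the index of the block to download next, or None if nothing is left.

    download_number[i] counts how many threads have attempted block i.
    completed[i] is True once block i has been fully downloaded.
    """
    n = len(download_number)
    # Rule 1: retry a lagging block B that has been attempted fewer than
    # R times and that has at least P completed blocks after it.
    for b in range(n):
        if completed[b] or download_number[b] >= R:
            continue
        if sum(completed[b + 1:]) >= P:
            return b
    # Rule 2: otherwise take the next block that nobody has attempted yet.
    for b in range(n):
        if download_number[b] == 0 and not completed[b]:
            return b
    return None


if __name__ == "__main__":
    attempts = [1, 1, 1, 1, 0, 0]
    done = [False, True, True, True, False, False]
    block = select_block(attempts, done, P=2, R=2)
    attempts[block] += 1             # the thread increments the download number
    print(block, attempts)
```

In the example, block 0 is still outstanding while three later blocks have completed, so with P=2 and R=2 it is retried before any untouched block is started.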

Fastest 1

Another approach

For downloading a block, choose the server
that has the minimum value of time*(l+1)

time: predicted time to download a block
when there is no contention. Obtained from
NWS (Network Weather Service) numbers before
the download is initiated.

l: number of threads currently downloading
from the server
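
The time*(l+1) rule is short enough to show directly. In this sketch the function name pick_server and the two dictionaries are hypothetical; the predicted times stand in for forecasts that would come from NWS.

```python
def pick_server(predicted_time, active_threads):
    """Return the server with the minimum value of time * (l + 1).

    predicted_time[s] is the forecast time to fetch one block from s
    with no contention; active_threads[s] is l, the number of threads
    currently downloading from s (0 if the server is not listed).
    """
    return min(
        predicted_time,
        key=lambda s: predicted_time[s] * (active_threads.get(s, 0) + 1),
    )


if __name__ == "__main__":
    forecast = {"A": 2.0, "B": 3.0}
    print(pick_server(forecast, {"A": 0, "B": 0}))   # A costs 2.0, B costs 3.0
    print(pick_server(forecast, {"A": 1, "B": 0}))   # A costs 4.0, B costs 3.0
```

Server A is preferred while it is idle, but once one thread is already downloading from it, its cost doubles and the nominally slower server B becomes the better choice.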

Multiple clients

Arises when the data for a parallel computation
running on multiple clients has to be fetched
from the available replica server locations

More challenges: a download decision by one
client can impact download performance on other
clients. Need to predict this impact.

Periodic network monitoring has to be
augmented by measurements corresponding to
current downloads


Collective Download algorithm

Each client connects to a server only once,
even if some of the data belongs to other clients
(download phase)

The clients then redistribute data among
themselves (redistribution phase)

Widely followed in parallel I/O

Especially useful when clients and servers are
on opposite sides of a WAN: multiple latencies
can be avoided at the cost of a less expensive
redistribution phase
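
The two phases can be sketched as a planning function. This is only an illustration under assumptions that are not in the slides: the name collective_download is hypothetical, servers are handed to clients round-robin, and every server is contacted by exactly one client.

```python
from collections import defaultdict


def collective_download(needs, location, clients):
    """Plan the two phases; return (download_plan, redistribution_plan).

    needs:    client -> set of blocks that client must end up with
    location: block  -> server holding that block
    clients:  list of client names that can open connections
    """
    # Group every requested block by the server that holds it.
    by_server = defaultdict(set)
    for blocks in needs.values():
        for block in blocks:
            by_server[location[block]].add(block)

    # Download phase: each server is contacted once, by a single client,
    # which also fetches the blocks that belong to other clients.
    download_plan = {}
    for i, server in enumerate(sorted(by_server)):
        fetcher = clients[i % len(clients)]
        download_plan[server] = (fetcher, sorted(by_server[server]))

    # Redistribution phase: forward blocks to the clients that need them.
    redistribution_plan = []
    for server, (fetcher, blocks) in download_plan.items():
        for block in blocks:
            for owner in sorted(needs):
                if block in needs[owner] and owner != fetcher:
                    redistribution_plan.append((fetcher, owner, block))
    return download_plan, redistribution_plan


if __name__ == "__main__":
    needs = {"c0": {0, 1}, "c1": {2, 3}}
    location = {0: "s0", 1: "s1", 2: "s0", 3: "s1"}
    print(collective_download(needs, location, ["c0", "c1"]))
```

In the example, downloading independently would take four wide-area connections (each client to each server). The collective plan uses two, followed by two block transfers between the clients.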

Replica Placement Strategies

Replica placement questions


When should replicas be created?


Which files should be replicated?


Where should replicas be placed?

The model assumes that data is produced
in tier-1 (root) and there are storage
spaces at various tiers (levels of hierarchy)

Clients that request data form the leaves
of the hierarchy


Placement strategies

1. Best client

Each storage node maintains a history of the
number of requests for the files it contains

If the number of requests for a file exceeds
the threshold, the node creates a replica of
the file at the client node that has generated
the most requests for that file (the best client)

The request details for the file are cleared.
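
The best-client rule can be sketched with a per-file request counter. The class name StorageNode, its attributes, and the threshold value are hypothetical, and the actual file transfer is reduced to recording which client would receive the replica.

```python
from collections import Counter, defaultdict


class StorageNode:
    def __init__(self, threshold):
        self.threshold = threshold
        self.requests = defaultdict(Counter)   # file -> requests per client
        self.replicas_created = []             # (file, client) pairs, for inspection

    def record_request(self, file, client):
        """Log one request; return the best client if a replica is created."""
        history = self.requests[file]
        history[client] += 1
        if sum(history.values()) > self.threshold:
            best_client, _count = history.most_common(1)[0]
            self.replicas_created.append((file, best_client))
            history.clear()                    # request details are cleared
            return best_client
        return None


if __name__ == "__main__":
    node = StorageNode(threshold=3)
    for client in ["a", "b", "a", "a"]:
        print(client, node.record_request("f", client))
```

The fourth request pushes the count for file f past the threshold of 3, so a replica is placed at client a, which generated three of the four requests, and the history for f starts again from zero.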

Strategies …

2. Cascading replication

Analogy to a 3-tiered fountain

Once a threshold for a file is exceeded at
the root, a replica is created at the next level
on the path to the best client and so on…

Geographical locality is exploited
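
One step of the cascade can be sketched as follows. The function name cascade_step and the node names are hypothetical, and the sketch assumes it is called only after the threshold has been exceeded at the lowest node on the path that currently holds the file.

```python
def cascade_step(path_to_best_client, holders):
    """Return the node that receives the next replica, or None.

    path_to_best_client lists the nodes from the root down to the best client.
    holders is the set of nodes on that path that already store the file.
    """
    pairs = zip(path_to_best_client, path_to_best_client[1:])
    for upper, lower in pairs:
        if upper in holders and lower not in holders:
            return lower             # one level further down the path
    return None                      # the file has reached the end of the path


if __name__ == "__main__":
    path = ["root", "tier2", "tier3", "client"]
    holders = {"root"}
    target = cascade_step(path, holders)
    while target is not None:
        print("replicate to", target)
        holders.add(target)
        target = cascade_step(path, holders)
```

Each call moves the file down one level only, so a file reaches the lower tiers gradually, as its popularity keeps exceeding the threshold at each level in turn.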

3. Plain caching: done at the client

4. Caching plus Cascading Replication

Strategies…

5. Fast Spread

A replica of the file is stored at each node
along its path to the client

Replica selection: closest replica

Replica replacement: least popular file
with oldest age is replaced. Popularity
logs are cleared periodically
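
Fast spread together with the replacement rule can be sketched as below. The names fast_spread and evict_candidate, the fixed per-node capacity, and the popularity and age fields are assumptions for illustration. A larger age is taken to mean an older file, and age is used only to break ties in popularity.

```python
def evict_candidate(cache):
    """Pick the least popular file; among equally unpopular files, the oldest."""
    return min(
        cache,
        key=lambda name: (cache[name]["popularity"], -cache[name]["age"]),
    )


def fast_spread(path_to_client, file, caches, capacity):
    """Store a replica of file at every node on the path to the client.

    caches maps node -> {file name -> {"popularity": int, "age": int}}.
    Returns the list of (node, evicted file) pairs.
    """
    evicted = []
    for node in path_to_client:
        cache = caches.setdefault(node, {})
        if file in cache:
            cache[file]["popularity"] += 1     # already replicated here
            continue
        if len(cache) >= capacity:             # node is full, make room
            victim = evict_candidate(cache)
            del cache[victim]
            evicted.append((node, victim))
        cache[file] = {"popularity": 1, "age": 0}
    return evicted


if __name__ == "__main__":
    caches = {"tier2": {
        "x": {"popularity": 5, "age": 10},
        "y": {"popularity": 1, "age": 3},
        "z": {"popularity": 1, "age": 8},
    }}
    print(fast_spread(["tier2", "tier3"], "new", caches, capacity=3))
    print(sorted(caches["tier2"]), sorted(caches["tier3"]))
```

Files y and z are equally unpopular, so the older one, z, is replaced at tier2. Every node on the path ends up holding a copy, which is where the high storage requirement of fast spread comes from.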



Findings

Best-client performs worst for random access
patterns and shows improvement for access
patterns with some geographical locality

Fast spread works much better than cascading
for random data access

Bandwidth savings are greater with fast spread
than with cascading

Fast spread has high storage requirements

Sources / References / Credits

Algorithms for High Performance, Wide-Area
Distributed File Downloads. J.S. Plank,
S. Atchley, Y. Ding and M. Beck. Parallel
Processing Letters, vol. 13, no. 2,
pp. 207-224, June 2003.

Downloading Replicated Wide-Area Files: a
Framework and Empirical Evaluation. R.L.
Collins and J.S. Plank. NCA 2004.

Identifying Dynamic Replication Strategies
for a High-Performance Data Grid. K.
Ranganathan and I. Foster. Grid 2002.

Grid-Based Galaxy Morphology Analysis for the
National Virtual Observatory. Ewa Deelman, Raymond
Plante, Carl Kesselman, Gurmeet Singh, Mei-Hui Su,
Gretchen Greene, Robert Hanisch, Niall Gaffney,
Antonio Volpicelli, James Annis, Vijay Sekhri,
Tamas Budavari, Maria Nieto-Santisteban, William
O'Mullane, David Bohlender, Tom McGlynn, Arnold Rots,
Olga Pevunova. Supercomputing 2003.

Applying Chimera Virtual Data Concepts to Cluster
Finding in the Sloan Sky Survey. James Annis, Yong
Zhao, Jens Voeckler, Michael Wilde, Steve Kent, Ian
Foster. SC 2002.

Decoupling Computation and Data
Scheduling in Distributed Data Intensive
Applications. Kavitha Ranganathan and Ian
Foster. Proceedings of the 11th
International Symposium for High
Performance Distributed Computing
(HPDC-11), Edinburgh, July 2002.