Data Grid Technologies

12 Nov 2013

Sathish Vadhiyar

Sources/Credits: Technical papers
listed in references

Replica Strategies

Problem Motivation

Replication to deal with faults and provide
scheduling flexibility.

Given a file that is partitioned into blocks
that are replicated throughout a wide-area
file system, how can a client retrieve the
file with the best performance?

Various algorithms

Basic Downloading Algorithm

The client opens a thread to each server
containing the file

A block size is chosen

Each thread selects a different block to
download and all threads start downloading

When a thread finishes a block, it chooses a new block that is
not currently being downloaded by any other thread

Servers with higher bandwidth to the
client therefore serve more blocks

Selection of block size
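
The basic algorithm above can be sketched as a small event-driven simulation. This is only an illustration under stated assumptions: the server names, the bandwidth figures and the function name `simulate_basic_download` are hypothetical, each server delivers at a fixed rate, and no real network transfer takes place.

```python
import heapq


def simulate_basic_download(num_blocks, block_size, bandwidths):
    """Simulate one thread per server, each claiming the next unclaimed block.

    bandwidths maps a (hypothetical) server name to bytes per second.
    Returns (finish_time, blocks_per_server).
    """
    next_block = 0
    blocks_per_server = {server: 0 for server in bandwidths}
    events = []  # heap of (completion_time, server)

    # Every thread starts on a different block.
    for server, bw in bandwidths.items():
        if next_block < num_blocks:
            heapq.heappush(events, (block_size / bw, server))
            next_block += 1

    finish = 0.0
    while events:
        now, server = heapq.heappop(events)
        blocks_per_server[server] += 1
        finish = now
        # The thread that just finished claims the next unclaimed block.
        if next_block < num_blocks:
            bw = bandwidths[server]
            heapq.heappush(events, (now + block_size / bw, server))
            next_block += 1
    return finish, blocks_per_server
```

With two servers where one is twice as fast as the other, the faster server ends up serving about two thirds of the blocks, which is the behaviour the slide describes.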


Aggressive Redundancy

To provide fault tolerance and to improve
download time

A redundancy factor, R

The client downloads a block
simultaneously from R servers

Only one copy is kept: whichever returns first
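
A minimal sketch of aggressive redundancy using Python threads follows. The helper `make_fake_fetch` stands in for a real network fetch and is purely hypothetical, as are the server names and delays; the sketch also assumes that a failed attempt is simply ignored as long as another copy arrives.

```python
import concurrent.futures as cf
import time


def make_fake_fetch(delays, failing=()):
    """Build a stand-in for a real network fetch (hypothetical helper)."""
    def fetch(server, block_id):
        time.sleep(delays[server])  # pretend transfer time
        if server in failing:
            raise ConnectionError(f"{server} failed")
        return f"block-{block_id}".encode()
    return fetch


def download_redundant(block_id, servers, r, fetch):
    """Request one block from the first r servers at once; keep the first good copy."""
    pool = cf.ThreadPoolExecutor(max_workers=r)
    futures = {pool.submit(fetch, s, block_id): s for s in servers[:r]}
    try:
        # as_completed yields attempts in the order in which they finish.
        for done in cf.as_completed(futures):
            if done.exception() is None:
                return futures[done], done.result()
        raise RuntimeError("all redundant attempts failed")
    finally:
        pool.shutdown(wait=False)  # do not wait for the slower copies
```

The price of this scheme is bandwidth: up to R - 1 of the transfers for every block are discarded.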

Driven Redundancy

Retry a download only when it is
progressing slowly

Progress number, P, and redundancy factor, R

Each block is assigned a download number,
initialized to 0

When a thread attempts to download a
block, it increments the block’s download
number

Driven Redundancy

For selecting a new block to download

If there is a block B whose download number
< R, and if there are P blocks after B whose
downloads have completed, then select B

Else select next block whose download
number is zero
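
The selection rule can be written down directly. The sketch below is one reading of the slide, with hypothetical function names; it assumes that blocks whose download has already completed are never selected again, which the slide does not state explicitly.

```python
def select_block(download_number, completed, r, p):
    """Pick the next block for a free thread under driven redundancy.

    download_number[i]: how many times block i has been attempted so far.
    completed[i]: True once block i has been fully downloaded.
    r: redundancy factor, p: progress number.
    """
    n = len(download_number)
    # Retry a lagging block B if P blocks after it have already completed.
    for b in range(n):
        if completed[b] or download_number[b] >= r:
            continue
        finished_after = sum(1 for later in range(b + 1, n) if completed[later])
        if finished_after >= p:
            return b
    # Otherwise take the next block nobody has attempted yet.
    for b in range(n):
        if download_number[b] == 0 and not completed[b]:
            return b
    return None  # nothing left to start


def start_download(download_number, completed, r, p):
    """Select a block and increment its download number, as the thread would."""
    b = select_block(download_number, completed, r, p)
    if b is not None:
        download_number[b] += 1
    return b
```

For example, with blocks 1 and 2 finished while block 0 is still in flight, P = 2 and R = 2 cause block 0 to be retried, whereas P = 3 lets the thread move on to block 3.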


Another approach

For downloading a block, choose the server
that has the minimum value of time*(l+1)

time: predicted time to download a block
when there is no contention. Obtained from
NWS (Network Weather Service) numbers before download is initiated.

l: number of threads currently downloading
from the server
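
As a small illustration, the rule is a one-line minimisation. The function name, server names and prediction values are made up for the example; in practice the predictions would come from NWS measurements.

```python
def choose_server(predicted_time, active_threads):
    """Return the server with the smallest time * (l + 1).

    predicted_time: server -> predicted no-contention download time for a block.
    active_threads: server -> number of threads currently downloading from it (l).
    """
    return min(
        predicted_time,
        key=lambda s: predicted_time[s] * (active_threads.get(s, 0) + 1),
    )
```

A nominally faster server that is already serving one thread (1.0 * 2 = 2.0) loses to a slower idle one (1.5 * 1 = 1.5), so load is spread across servers.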

Multiple clients

This situation arises when parallel clients must each
select the data for their part of a parallel computation
from the available replica server locations

More challenges

A download decision by one client
can impact download performance on other
clients. Need to predict this impact.

Periodic network monitoring has to be
augmented by measurements corresponding to
current downloads

Collective Download algorithm

Each client connects to a server only once,
even if some of the data belongs to other clients:
the download phase

The clients then redistribute data among
themselves: the redistribution phase

Widely followed in parallel I/O

Especially useful when clients and servers are
on either side of a WAN

Multiple latencies can
be avoided at the cost of a less expensive
redistribution phase
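
The two phases can be sketched as follows. This is a simplified model, not the algorithm from the cited papers: it assumes each block lives on exactly one server, it assigns servers to clients round-robin so that each server is contacted by a single client, and all names are hypothetical.

```python
def collective_download(needs, location):
    """Two-phase collective download (simplified sketch).

    needs: client -> list of block ids that client finally requires.
    location: block id -> server holding that block.
    Returns (owner, received, transfers).
    """
    clients = sorted(needs)
    servers = sorted({location[b] for blocks in needs.values() for b in blocks})

    # Download phase: each server is contacted by exactly one client,
    # which also fetches the blocks that belong to other clients.
    owner = {srv: clients[i % len(clients)] for i, srv in enumerate(servers)}
    fetched = {c: [] for c in clients}
    for c in clients:
        for b in needs[c]:
            fetched[owner[location[b]]].append((b, c))  # (block, destination)

    # Redistribution phase: clients forward blocks to their final destination.
    received = {c: [] for c in clients}
    transfers = 0
    for holder, items in fetched.items():
        for b, dest in items:
            received[dest].append(b)
            if dest != holder:
                transfers += 1  # one client-to-client transfer
    return owner, received, transfers
```

The trade-off is visible in the return values: the number of wide-area connections drops to one per server, and the extra work shows up as `transfers` between clients, which is assumed to be cheap because the clients sit on the same side of the WAN.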

Replica Placement Strategies

Replica placement questions

When should replicas be created?

Which files should be replicated?

Where should replicas be placed?

The model assumes that data is produced
in tier 1 (root) and that there are storage
spaces at various tiers (levels of the hierarchy)

Clients that request data form the leaves
of the hierarchy

Placement strategies

Best client

Each storage node maintains a history
of the number of requests for the files it
stores

If the number of requests for a file exceeds
the threshold, the node creates a replica of
the file at the client node that has generated
the most requests for that file (the best
client)

The request details for the file are then cleared.
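
A minimal sketch of the best-client bookkeeping at one storage node is given below. The class name, method name and threshold value are hypothetical, and actually copying the file to the chosen client is left out.

```python
from collections import Counter, defaultdict


class StorageNode:
    """Tracks per-file request history and picks the best client."""

    def __init__(self, threshold):
        self.threshold = threshold
        # file name -> Counter mapping client -> number of requests
        self.history = defaultdict(Counter)

    def record_request(self, file_name, client):
        """Log one request; return the client to replicate to, or None."""
        counts = self.history[file_name]
        counts[client] += 1
        if sum(counts.values()) > self.threshold:
            best_client, _ = counts.most_common(1)[0]
            del self.history[file_name]  # request details are cleared
            return best_client
        return None
```

With a threshold of 3, the fourth request for a file triggers replication to whichever client asked for it most often, and counting then starts again from zero.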

Strategies …

Cascading replication

Analogy to a 3-tiered fountain

Once a threshold for a file is exceeded at
the root, a replica is created at the next level
on the path to the best client and so on…

Geographical locality is exploited
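
One way to picture cascading is as a replica moving one tier down the path towards the best client each time a threshold is exceeded. The sketch below only computes where the next replica goes; the function name and node names are hypothetical, and it assumes the root always holds the file.

```python
def next_replica_site(path, holders):
    """Return the next node to receive a replica under cascading replication.

    path: nodes from the root down to the best client, in order.
    holders: set of nodes that already hold the file (root is assumed to).
    Returns None when the replica has already reached the end of the path.
    """
    deepest = max(i for i, node in enumerate(path) if node in holders)
    if deepest + 1 < len(path):
        return path[deepest + 1]
    return None
```

Called repeatedly as thresholds are exceeded, this yields tier 2, then tier 3, and so on, so a very popular file ends up close to the clients that want it.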

Plain caching

done at the client

Caching plus Cascading Replication


Fast Spread

A replica of the file is stored at each node
along its path to the client
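
By contrast, fast spread copies the file to every node on the way down in a single step. The sketch assumes the file is served from the deepest node on the path that already holds it and that the requesting client counts as a node on the path; names are hypothetical.

```python
def fast_spread_fetch(path, storage, file_name):
    """Serve one request and replicate the file along the path.

    path: nodes from the root down to the requesting client, in order.
    storage: node -> set of file names stored there (updated in place).
    Returns (source_node, nodes_that_received_a_new_replica).
    """
    source = max(i for i, node in enumerate(path) if file_name in storage[node])
    new_sites = path[source + 1:]
    for node in new_sites:
        storage[node].add(file_name)
    return path[source], new_sites
```

A single request therefore creates several replicas at once, which is consistent with fast spread needing far more storage than cascading.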

Replica selection

closest replica

Replica replacement

least popular file
with oldest age is replaced. Popularity
logs are cleared periodically
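
The replacement rule can be read as "least popular first, oldest first among equally unpopular files". That tie-breaking order is an interpretation of the slide rather than something it states, and the function name and data layout are hypothetical.

```python
def choose_victim(files):
    """Pick the file to evict when a node runs out of space.

    files: file name -> (popularity, created_at), where popularity is the
    request count since the logs were last cleared and created_at is a
    timestamp (smaller means older).
    """
    return min(files, key=lambda name: (files[name][0], files[name][1]))
```

Clearing the popularity counts periodically keeps a file that was popular long ago from being protected indefinitely.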


Best client performs worst for random access
patterns and shows improvement for access
patterns with a small degree of geographical locality

Fast spread works much better than cascading
for random data access

Bandwidth savings are greater in fast spread than
in cascading

Fast spread has high storage requirements

Sources / References / Credits

Algorithms for High Performance, Wide-Area
Distributed File Downloads.
J.S. Plank, S. Atchley, Y. Ding and M. Beck, Parallel
Processing Letters, vol. 13, no. 2, pp 207
June 2003.

Downloading Replicated Wide-Area Files:
Framework and Empirical Evaluation.
Collins and J.S. Plank. NCA 2004.

Identifying Dynamic Replication Strategies
for a High-Performance Data Grid.
K. Ranganathan and I. Foster. Grid 2002.

Grid-Based Galaxy Morphology Analysis for the
National Virtual Observatory.
Ewa Deelman, Raymond Plante, Carl Kesselman, Gurmeet Singh,
Mei-Hui Su, Gretchen Greene, Robert Hanisch, Niall Gaffney,
Antonio Volpicelli, James Annis, Vijay Sekhri,
Tamas Budavari, Maria Nieto-Santisteban, William
O'Mullane, David Bohlender, Tom McGlynn, Arnold Rots,
Olga Pevunova. Supercomputing 2003.

Applying Chimera virtual data concepts to cluster
finding in the Sloan Sky Survey.
James Annis, Yong Zhao, Jens Voeckler, Michael Wilde,
Steve Kent, Ian Foster. SC 2002.

Kavitha Ranganathan and Ian Foster,
Decoupling Computation and Data
Scheduling in Distributed Data Intensive
Applications, Proceedings of the 11th
International Symposium for High
Performance Distributed Computing
(HPDC-11), Edinburgh, July 2002.