Towards Simulation of Parallel File System Scheduling Algorithms with PFSsim



Yonggang Liu, Renato Figueiredo
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL
{yonggang, renato}@acis.ufl.edu

Dulcardo Clavijo, Yiqi Xu, Ming Zhao
School of Computing and Information Sciences, Florida International University, Miami, FL
{darte003, yxu006, ming}@cis.fiu.edu


Abstract

Many high-end computing (HEC) centers and commercial data centers adopt parallel file systems (PFSs) as their storage solutions. As the concurrent applications in PFSs grow in both quantity and variety, scheduling algorithms for data access are expected to play an increasingly important role in PFS service quality. However, it is costly and disruptive to thoroughly research different scheduling mechanisms in peta- or exascale systems, and the complexity of scheduling policy implementation and experimental data gathering makes such tests even harder. While a few parallel file system simulation frameworks have been proposed (e.g., [15, 16]), their goal has not been the evaluation of scheduling algorithms. In this paper, we propose PFSsim, a simulator designed for evaluating I/O scheduling algorithms in PFSs. PFSsim is a trace-driven simulator based on the network simulation framework OMNeT++ and the disk device simulator DiskSim. A proxy-based scheduler module is implemented for scheduling algorithm deployment, and the system parameters are highly configurable. We have simulated PVFS2 in PFSsim, and the experimental results show that PFSsim is capable of simulating the system characteristics and capturing the effects of the scheduling algorithms.


1. Introduction


In recent years, parallel file systems (PFSs) such as Lustre [1], PVFS2 [2], Ceph [3], and PanFS [4] have become increasingly popular in high-end computing (HEC) centers and commercial data centers; for instance, as of April 2009, half of the world's top 30 supercomputers used Lustre [5] as their storage solution. PFSs outperform traditional distributed file systems such as NFS [6] in many application domains. An important reason is that they adopt the object-based storage model [7] and stripe large data accesses into smaller storage objects distributed across the storage system for high-throughput parallel access and load balancing.

In HEC and data center systems, there are often large numbers of applications that access data from a distributed storage system with a variety of Quality-of-Service requirements [8]. As such systems are predicted to continue to grow in terms of the amount of resources and concurrent applications, I/O scheduling strategies that allow performance isolation are expected to become increasingly important. Unfortunately, most PFSs are not able to manage I/O flows on a per-data-flow basis. The scheduling modules that come with the PFSs are typically configured to fulfill an overall performance goal, instead of the quality of service for each application.

There is a considerable amount of existing work [9, 10, 11, 12, 13, 24] on the problem of achieving differentiated service at a centralized management point. Nevertheless, the applicability of these algorithms in the context of parallel file systems has not been thoroughly studied. Challenges in HEC environments include the facts that applications issue data flows from a potentially large number of clients and that parallel checkpointing of the applications becomes increasingly important to achieve desired levels of reliability; in such environments, centralized scheduling algorithms can be limiting from the scalability and availability standpoints. To the best of our knowledge, there are few existing decentralized I/O scheduling algorithms for distributed storage systems, and the proposed decentralized algorithms (e.g., [14]) may still need verification in terms of their suitability for PFSs, which are a subset of distributed storage systems.

While PFSs are widely adopted in the HEC field, research on the corresponding scheduling algorithms is not easy. The two key factors that prevent testing on real systems are: 1) scheduler testing on a peta- or exascale file system requires costly and complex deployment and experimental data gathering; 2) experiments with the storage resources used in HEC systems can be very disruptive, as deployed production systems are typically expected to have high utilization. In this context, a simulator that allows developers to test and evaluate different scheduler designs for HEC systems is very valuable. It frees developers from complicated deployment headaches in real systems and cuts the cost of algorithm development. Even though simulation results are bound to have discrepancies compared to real performance, they can offer very useful insights into performance trends and allow the pruning of the design space before implementation and evaluation on a real testbed or a deployed system.

In this paper, we propose a Parallel File System simulator, PFSsim. Our design objectives for this simulator are: 1) Ease of use: scheduling algorithms, PFS characteristics, and network topologies can be easily configured at compile time; 2) Fidelity: it can accurately model the effect of HEC workloads and scheduling algorithms; 3) Ubiquity: the simulator should be flexible enough to simulate a large variety of storage and network characteristics; 4) Scalability: it should be able to simulate up to thousands of machines in a medium-scale scheduling algorithm study.

The rest of the paper is organized as follows. In Section 2, we introduce the related work on PFS simulation. In Section 3, we describe the PFS and scheduler abstractions. In Section 4, we illustrate the implementation details of PFSsim. In Section 5, we show the validation results. In the last section, we conclude our work and discuss future work.


2. Related Work


To the best of our knowledge, two parallel file system simulators have been presented in the literature: one is the IMPIOUS simulator proposed by E. Molina-Estolano et al. [15], and the other is the simulator developed by P. Carns et al. [16].

The IMPIOUS simulator is developed for the fast evaluation of PFS designs. It simulates the parallel file system abstraction with user-provided file system specifications, which include data placement strategies, replication strategies, locking disciplines, and cache strategies. In this simulator, the client modules read the I/O traces and the PFS specifications, and then directly issue them to the Object Storage Device (OSD) modules according to the configurations. The OSD modules can be simulated with the DiskSim simulator [17] or with a simple disk model; the former provides higher accuracy and the latter provides higher efficiency. For the goal of fast and efficient simulation, IMPIOUS simplifies the PFS model by omitting the metadata server modules and the corresponding communications, and since its focus is not on scheduling strategies, it does not support the explicit deployment of scheduling policies.

The other PFS simulator is described in the paper by P. H. Carns et al. on PVFS2 server-to-server communication mechanisms [16]. This simulator is used for testing the overhead of metadata communications, specifically in PVFS2. Thus, a detailed TCP/IP-based network model is implemented. The authors employed the INET extension [18] of the OMNeT++ discrete event simulation framework [19] to simulate the network. The simulator also uses a bottom-up technique to simulate the underlying systems (PVFS2 and Linux), which achieves high fidelity but compromises on flexibility.

We take inspiration from these related systems and develop an expandable, modularized design where the emphasis is on the scheduler. Based on this goal, we use DiskSim to simulate physical disks in detail, and we use the extensible OMNeT++ framework for the network simulation and the handling of simulated events. While we currently use a simple networking model, as pointed out above, OMNeT++ supports INET extensions that can be incorporated to enable precise network simulations in our simulator, at the expense of longer simulation times.


3. System Modeling

3.1. Abstraction of Parallel File Systems


In this subsection, we first describe the similarities among parallel file systems (PFSs), and then discuss the differences that exist among different PFSs.

Considering the commonly used PFSs, we find that the majority of them share the same basic architecture:

1. There are one or more data servers, which are built on top of local file systems (e.g., Lustre, PVFS2) or block devices (e.g., Ceph). The application data are stored in the form of fixed-size PFS objects, whose IDs are unique in a global name space. A file locking feature is enabled in some PFSs (e.g., Lustre, Ceph);

2. There are one or more metadata servers, which typically manage the mappings from the PFS file name space to the PFS storage object name space, the PFS object placement, as well as the metadata operations;

3. The PFS clients run on the system users' machines; they provide the interface (e.g., POSIX) for users/user applications to access the PFS.

For a general PFS, a file access request (a read/write operation) goes through the following steps:

1. Receiving the file I/O request: by calling an API, the system user sends its request {operation, offset, file_path, size} to the PFS client running on the user's machine.

2. Object mapping: the client tries to map the tuple {offset, file_path, size} to a series of objects which contain the file data (a sketch of this step is given after this list). This information is either available locally or requires the client to query the metadata server.

3. Locating the objects: the client locates the objects on the data servers. Typically each data server stores a static set of object IDs, and this mapping information is often available on the client.

4. Data transmission: the client sends data I/O requests to the designated data servers with the information {operation, object_ID}. The data servers reply to the requests, and the data I/O starts.
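To make the object mapping step concrete, the following is a minimal sketch of how a client might map a byte range onto fixed-size objects under a simple round-robin striping layout. The function name, the object ID convention, and the layout itself are assumptions made for illustration; they are not the mapping of any particular PFS.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one storage object touched by a request.
struct ObjectExtent {
    std::string objectID;   // e.g. "<file_path>.<object index>" (assumed convention)
    uint64_t    offset;     // offset inside the object
    uint64_t    length;     // bytes to read/write in this object
};

// Map {offset, file_path, size} to a series of objects, assuming the file is
// striped over fixed-size objects in index order.
std::vector<ObjectExtent> mapToObjects(const std::string &filePath,
                                       uint64_t offset, uint64_t size,
                                       uint64_t objectSize) {
    std::vector<ObjectExtent> extents;
    while (size > 0) {
        uint64_t index    = offset / objectSize;                 // which object
        uint64_t inObjOff = offset % objectSize;                 // offset inside it
        uint64_t length   = std::min(objectSize - inObjOff, size);
        extents.push_back({filePath + "." + std::to_string(index), inObjOff, length});
        offset += length;
        size   -= length;
    }
    return extents;
}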

Note that we have omitted the access permission checks (often conducted on the metadata server) and the locking schemes (conducted on either the metadata server or the data servers).

Although different PFSs follow the same architecture, they differ from each other in many ways, such as the data distribution methodology, the metadata storage pattern, the user API, etc. Nevertheless, there are four aspects that we consider to have significant effects on the I/O performance: metadata management, the data placement strategy, the data replication model, and the data caching policy. Thus, to construct a scheduling policy test simulator for various PFSs, we should have the above specifications faithfully simulated.

Prior studies show that, at least in some cases, metadata operations take a big proportion of file system workloads [27], and because they also lie on the critical path, metadata management can be very important to the overall I/O performance. Different PFSs use different techniques to manage metadata to achieve different levels of metadata access speed, consistency, and reliability. For example, Ceph adopts the dynamic subtree partitioning technique [25] to distribute the metadata onto multiple metadata servers for high access locality and cache efficiency. Lustre deploys two metadata servers, one active server and one standby server for failover. In PVFS2, metadata are distributed onto the data servers to prevent a single point of failure and a performance bottleneck. By tuning the metadata server module and the network topology in our simulator, system users are able to set up the metadata storage, caching, and access patterns.

Data placement strategies are designed with the basic goal of achieving high I/O parallelism and server utilization/load balancing, but different PFSs still vary from each other significantly because of their different usage contexts. Ceph aims at large-scale storage systems that potentially have a large metadata communication overhead. Thus, Ceph uses a local hashing function and the CRUSH (Controlled Replication Under Scalable Hashing) technique [26] to map object IDs to the corresponding OSDs in a distributed manner, which avoids metadata communication during data location lookup and reduces the update frequency of the system map. In contrast, aiming to serve users with higher trust and skills, PVFS2 provides flexible data placement options to the users, and it even delegates to the users the ability to store data on user-specified data servers. In our simulation, data placement strategies are set up on the client modules; the reason is that this approach gives the clients the capability to simulate both distributed-style and centralized-style data location lookup.
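As an illustration of a distributed-style lookup configured on the client modules, the sketch below maps an object ID to a data server with a plain hash. It is only an assumption-level stand-in for CRUSH or for PVFS2's placement options, with hypothetical names.

#include <functional>
#include <string>

// Hypothetical client-side placement: every client can compute the location of
// an object without contacting a metadata server (distributed-style lookup).
int placeObject(const std::string &objectID, int numDataServers) {
    return static_cast<int>(std::hash<std::string>{}(objectID) % numDataServers);
}

// A centralized-style lookup would instead consult a table filled in by the
// metadata server and returned to the client with the striping information.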

Data replication and failover models also affect the I/O performance, because for systems with data replication enabled, data are written to multiple locations, which may prolong the write process. For example, with data replication enabled in Ceph, every write operation is committed to both the primary OSD and the replica OSDs inside a placement group. Though Ceph maintains parallelism when forwarding the data to the replica OSDs, the costs of data forwarding and synchronization are still non-negligible. Lustre and PVFS2 do not implement explicit data replication models, assuming that replication is done by the underlying hardware.

Data caching on the server side and the client side may improve the PFS I/O performance, but it is important that client-side cache coherency also be managed. PanFS data servers implement write-data caching that aggregates multiple writes for efficient transmission and data layout at the OSDs, which may increase the disk I/O rate. Ceph implements the O_LAZY flag for the open operation at the client side, which allows applications to explicitly relax the usual coherency requirements for a shared-write file; this facilitates HPC applications that often have concurrent accesses to different parts of the same files. Some PFSs do not implement client caching by default, such as PVFS. We have not implemented the data caching components in PFSsim, and we plan to implement them in the near future.


3.2. Abstraction of PFS Scheduler


Among the many proposed centralized or decentralized scheduling algorithms for distributed storage systems, a large variety of network fabrics and deployment locations are chosen. For instance, in [14], the schedulers are deployed on the Coordinators, which reside between the system clients and the storage Bricks. In [24], the scheduler is implemented on a centralized broker, which captures all the system I/O and dispatches it to the disks. In [30], the scheduling policies are deployed on the network gateways which serve as the storage system portals to the clients. And in [29], the scheduling policies are deployed on per-server proxies, which intercept I/O and virtualize the data servers to the system clients.

In our simulator, the system network is simulated with high flexibility, which means the users are able to deploy their own network fabric with basic or user-defined devices. The schedulers can also be created and positioned at any part of the network. For more advanced designs, inter-scheduler communications can also be enabled. The scheduling algorithms are to be defined by the PFSsim users, and abstract APIs are exposed to enable the schedulers to keep track of the data server status.


4. Simulator Implementation

4.1. Parallel File System Scheduling Simulator


Based on the abstractions described above, we have developed the Parallel File System scheduling simulator (PFSsim) on top of the discrete event simulation framework OMNeT++ 4.0 and the disk model simulator DiskSim 4.0.

In our simulator, the client modules, metadata server modules, scheduler modules, and local file system modules are simulated by OMNeT++. OMNeT++ also simulates the network for communications between clients and metadata servers, between clients and data servers, and between schedulers. DiskSim is employed for the detailed simulation of the disk models; one DiskSim process is deployed for each independent disk module being simulated. For details, please refer to Figure 1.

The simulation input is provided in the form of trace files which contain the I/O requests of the system users. Upon reading one I/O request from the input file, the client creates a QUERY object and sends it to the metadata server through the simulated network. On the metadata server, the corresponding data server IDs and the striping information are added to the QUERY object, which is sent back to the client. The client stripes the I/O request according to the striping information, and for each stripe the client issues a JOB object to the designated data server through the simulated network.

The JOB object is received by the scheduler module on the data server (the detailed design of the scheduler is covered in the following subsection). When a JOB object is dispatched by the scheduler, the local file system module maps the logical address of the job to a physical block number. Finally, the job information is sent to the DiskSim simulator via inter-process communication over a network connection (currently, TCP).

When a job is finished in DiskSim, its ID and finish time are sent back to the data server module in OMNeT++. Then, the corresponding JOB object is found in the local record. After the "finish time" is written into the JOB object, it is sent back to the client. Finally, the client writes the job information into the output file, and the JOB object is destroyed.
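For illustration, the sketch below captures the request flow described above as two records. All field names and types are assumptions made for this sketch and do not reproduce PFSsim's actual message definitions.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shape of the metadata query sent from a client to the metadata
// server; the reply fills in the server IDs and the striping information.
struct Query {
    int              operation;      // read or write
    std::string      filePath;
    uint64_t         offset;
    uint64_t         size;
    std::vector<int> dataServerIDs;  // filled in by the metadata server
    uint64_t         stripeSize;     // filled in by the metadata server
};

// Hypothetical per-stripe job sent to one data server and eventually logged.
struct Job {
    int         jobID;
    int         operation;
    std::string objectID;
    uint64_t    offset;        // logical offset within the object
    uint64_t    size;
    double      arrivalTime;   // set by the client
    double      finishTime;    // set when DiskSim reports completion
};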


[Figure 1. The architecture of PFSsim. The two large dashed-line boxes indicate the entities simulated by the OMNeT++ platform and by the DiskSim platform, respectively.]

4.2. Scheduler Implementation


To provide an interface for users to implement their own scheduling algorithms, we provide a base class for all algorithms, so that a scheduling algorithm can be implemented by inheriting from this class. The base class mainly contains the following functions:

void jobArrival(JOB * job);
void jobFinish(int ID);
void getSchInfo(Message * msg);

JOB refers to the JOB object described in the previous subsection. Message is defined by the schedulers for exchanging scheduling information. The jobArrival function is called by the data server when a new job arrives at the scheduler. The jobFinish function is called when a job has just finished its service phase. The getSchInfo function is called when the data server receives a scheduler-to-scheduler message. Simulator users can override these functions to specify the corresponding behaviors of their algorithms.

The data server module also exports interfaces to the schedulers. Two important functions exported by the DataServer class are:

bool dispatchJob(JOB * job);
bool sendInfo(int ID, Message * msg);

The dispatchJob function is called for dispatching jobs to the resources. The sendInfo function is called for sending scheduling information to other schedulers.
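As a minimal sketch, a trivial FIFO policy could be written against this interface as shown below. The base class name (SchedulerBase), the assumption that the three functions are virtual, and the meaning of dispatchJob's return value are all assumptions made for illustration; only jobArrival, jobFinish, getSchInfo, dispatchJob, and sendInfo come from the interfaces listed in this subsection.

#include <queue>

// Hypothetical FIFO scheduler built on the PFSsim scheduler base class.
class FIFOScheduler : public SchedulerBase {       // base class name assumed
    std::queue<JOB *> pending;                     // jobs waiting to be dispatched
    DataServer *server;                            // exports dispatchJob() and sendInfo()
public:
    explicit FIFOScheduler(DataServer *ds) : server(ds) {}

    void jobArrival(JOB *job) {                    // a new job arrived at this scheduler
        pending.push(job);
        drain();
    }

    void jobFinish(int ID) {                       // a job finished its service phase
        drain();                                   // freed capacity may admit the next job
    }

    void getSchInfo(Message *msg) {
        // FIFO ignores scheduler-to-scheduler information.
    }

private:
    void drain() {
        // dispatchJob is assumed to return false when the resource cannot
        // accept another job at the moment.
        while (!pending.empty() && server->dispatchJob(pending.front()))
            pending.pop();
    }
};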


4.3. TCP Connection between OMNeT++ and DiskSim


One challenge of building a system with multiple integrated simulators is virtual time synchronization. Each simulator instance runs its own virtual time, and each one has a large number of events emerging every second. The synchronization, if performed inefficiently, can easily become a bottleneck for the simulation speed.

Since DiskSim provides the functionality of reporting the time stamp of its next event, OMNeT++ can always proactively synchronize with every DiskSim instance at the provided time stamp.
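A toy sketch of this proactive synchronization is given below. The DiskStub type and its nextEventTime()/runUntil() methods are stand-ins for the messages actually exchanged with a DiskSim instance over the TCP connection; they do not reproduce DiskSim's real interface or PFSsim's wire protocol.

#include <limits>

// Stand-in for one DiskSim instance reached over TCP (interface assumed).
struct DiskStub {
    double nextEvent = std::numeric_limits<double>::infinity();
    double nextEventTime() const { return nextEvent; }  // time stamp of DiskSim's next event
    void   runUntil(double t)    { /* process events up to t, report job completions */ }
};

// Called whenever the OMNeT++ side is about to advance its virtual clock to 'now':
// every DiskSim instance whose next event is not later than 'now' is driven forward
// first, so completion messages arrive with consistent time stamps.
void syncDisks(DiskStub *disks, int n, double now) {
    for (int i = 0; i < n; ++i)
        if (disks[i].nextEventTime() <= now)
            disks[i].runUntil(now);
}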

Currently we have implemented TCP connections between the OMNeT++ simulator and the DiskSim instances. Even though optimizations have been made to improve the synchronization efficiency, we found that the TCP connection cost is still a bottleneck for the simulation speed. In future work, we plan to introduce more efficient synchronization mechanisms, such as shared memory.


4.4. Local File System Simulation


In our simulator, the local file system is a simple file system that provides a static mapping between files and disk blocks. This model is inaccurate, but the reason why we did not implement a fully simulated local file system is that local file system block allocation heavily depends on the context of storage usage (e.g., EXT4 [22]), which is out of our control. This simple file system model guarantees that data that are adjacent in the logical space are also adjacent on the physical storage. In the future, we plan to implement a general file system model with cache functionality enabled.
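A minimal sketch of such a static mapping is given below, assuming a fixed block size and a contiguous run of physical blocks per file; the class and member names are hypothetical.

#include <cstdint>
#include <map>
#include <string>

// Hypothetical static file-to-block mapping: each file gets a contiguous run of
// physical blocks, so logically adjacent data stay physically adjacent.
class SimpleLocalFS {
    std::map<std::string, uint64_t> firstBlock;  // file -> first physical block
    uint64_t nextFree = 0;                       // next unallocated physical block
    const uint64_t blockSize;
public:
    explicit SimpleLocalFS(uint64_t bs) : blockSize(bs) {}

    // Map a logical byte offset inside 'file' to a physical block number.
    uint64_t toPhysicalBlock(const std::string &file, uint64_t logicalOffset,
                             uint64_t fileSize) {
        auto it = firstBlock.find(file);
        if (it == firstBlock.end()) {            // first access: allocate statically
            it = firstBlock.emplace(file, nextFree).first;
            nextFree += (fileSize + blockSize - 1) / blockSize;
        }
        return it->second + logicalOffset / blockSize;
    }
};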



5. Validation and Evaluation


In this section, we validate the PFSsim simulation results against the real system performance. In the real system, we deployed a set of virtual machines; each virtual machine is configured with a 2.4 GHz AMD CPU and 1 GB of memory. EXT3 is used as the local file system. We deployed the PVFS2 parallel file system, containing 1 metadata server and 4 data servers, on both the real system and the simulator. On each data server node, we also deploy a proxy-based scheduler that intercepts all the I/O requests going through the local machine.

In the experiments, we use the Interleaved or Random (IOR) benchmark [23] developed by the Scalable I/O Project (SIOP) at Lawrence Livermore National Laboratory. This benchmark facilitates our tests by allowing the specification of the trace file patterns.


5.1. Simulator Validation


To validate the simulator fidelity at different scales, we have performed five independent experiments with 4, 8, 16, 32 and 64 clients. In this subsection, we do not implement scheduling algorithms on the proxies. We conduct two sets of experiments and measure the performance of the system. In Set 1, each client continuously issues 400 requests to the data servers. Each of these requests is a 1 MB sequential read from one file, and the file content is striped and evenly distributed onto 4 data servers. Set 2 has the same configuration as Set 1, except that the clients issue sequential write requests instead of sequential read requests.

We measure the system performance by collecting the average response time over all I/O requests. The term "response time" is the elapsed time between when an I/O request issues its first I/O packet and when it receives the response for its last packet. As shown in Figure 2, the average response time of the real system increases super-linearly as the number of clients increases. The same behavior is also observed on the PFSsim simulator. For the read requests from smaller numbers of clients, the simulated results do not match the real results very well. This is probably because the real local file system EXT3 provides read prefetching, which accelerates the reading speed when recently accessed data are adjacent in locality. This is also supported by the fact that, as the number of clients accessing the file system increases, the difference between the real results and the simulated results gets smaller, which means the read prefetching becomes less effective. For both read and write requests, the simulated results follow the same trend as the real system results as the number of clients grows, which confirms that while the absolute simulated values may not predict the real system with accuracy, relative values can still be used to infer performance trade-offs.


5.2. Scheduler Validation


In this subsection, we validate the ability of PFSsim to implement request scheduling algorithms. We deploy 32 clients in both the real system and the simulator. The clients are separated into two groups, each with 16 clients, for the purpose of algorithm validation. Each client continuously issues 400 I/O requests. Each of these requests is a 1 MB sequential write to a single file, and the file content is striped and evenly distributed onto 4 data servers. The Start-time Fair Queuing algorithm with depth D = 4 (SFQ(4)) [10] is deployed on each data server for resource proportional-sharing.
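For readers unfamiliar with SFQ(D), the following is a schematic sketch of the tag bookkeeping that drives proportional sharing: requests are dispatched in increasing start-tag order with at most D requests outstanding. It is an illustration of the algorithm in [10] under assumed names, not the exact code deployed on the proxies.

#include <algorithm>
#include <map>
#include <utility>

// Schematic SFQ(D) tag bookkeeping for one data server.
struct SFQ {
    int    depth;                       // D: maximum number of outstanding requests
    int    outstanding = 0;
    double vtime = 0.0;                 // virtual time = start tag of the last dispatched request
    std::map<int, double> lastFinish;   // per-flow finish tag of the previous request

    explicit SFQ(int D) : depth(D) {}

    // Assign tags to a request of 'cost' bytes from 'flow' with weight w:
    //   start  = max(vtime, previous finish tag of the flow)
    //   finish = start + cost / w
    // Higher-weight flows accumulate tags more slowly, so they receive a
    // proportionally larger share of the throughput.
    std::pair<double, double> tag(int flow, double cost, double w) {
        double start  = std::max(vtime, lastFinish[flow]);
        double finish = start + cost / w;
        lastFinish[flow] = finish;
        return {start, finish};
    }

    bool canDispatch() const { return outstanding < depth; }
    void onDispatch(double startTag) { vtime = startTag; ++outstanding; }
    void onComplete() { --outstanding; }
};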

We conducted three sets of tests, and in each set we assign a static weight ratio to the two groups to enforce the proportional sharing. Every set is run on both the real system and the simulator. Sets a, b, and c have weight ratios (Group1:Group2) of 1:1, 1:2 and 1:4, respectively. We measure and analyze the throughput ratio each group achieves during the first 100 seconds of system runtime.

As shown in Figure 3, the average throughput ratios of the simulated system follow the same trend as the results from the real system, which reflects the characteristics of the proportional-sharing algorithm. This validation shows that the PFSsim simulator is able to simulate the performance effect of proportional-sharing algorithms under various configurations of the real system.

From the real system results, we can also see that the throughput ratio oscillates more as the difference between the two groups' sharing ratios grows. This behavior is also slightly present in the results from the simulator, but we observe that the simulator results show less oscillation. This is because the real system environment is more complex, where many factors, such as TCP timeouts, can contribute to variations in the results; in our simulator, we do not model the variations caused by these factors. In order to capture dynamic variations with higher accuracy, future work will investigate which system modeling aspects need to be accounted for.


6. Conclusion and Future Work


The design objective of PFSsim is to simulate the important effects of workloads and scheduling algorithms on parallel file systems. We implement a proxy-based scheduler model for flexible scheduler deployment. The validations of system performance and scheduling effectiveness show that the system is capable of simulating system performance trends given specified workloads and scheduling algorithms. For scalability, as far as we have tested, the system scales to simulations of up to 512 clients and 32 data servers.

In the future, we will work on optimizing the simulator to improve simulation speed and fidelity. For this goal, we plan to implement a more efficient synchronization scheme between OMNeT++ and DiskSim, such as shared memory. We will also refine the system by implementing some important factors that contribute to the performance, such as TCP timeouts and the characteristics of the local file system.


7. References


[1] Sun Microsystems, Inc., "Lustre File System: High-Performance Storage Architecture and Scalable Cluster File System", White Paper, October 2008.

[2] P. Carns, W. Ligon, R. Ross and R. Thakur, "PVFS: A Parallel File System For Linux Clusters", Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, October 2000, pp. 317-327.


[Figure 2. Average response time with different numbers of clients.]


[3] S. Weil, S. Brandt, E. Miller, D. Long and C. Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System", Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI '06), November 2006.

[4] D. Nagle, D. Serenyi and A. Matthews, "The Panasas ActiveScale storage cluster: delivering scalable high bandwidth storage", Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), Pittsburgh, PA, 06-12 November 2004.

[5] F. Wang, S. Oral, G. Shipman, O. Drokin, T. Wang, and I. Huang, "Understanding Lustre filesystem internals", Technical Report ORNL/TM-2009/117, Oak Ridge National Laboratory, National Center for Computational Sciences, 2009.

[6] R. Sandberg, "The Sun Network Filesystem: Design, Implementation, and Experience", in Proceedings of the Summer 1986 USENIX Technical Conference and Exhibition, 1986.

[7] M. Mesnier, G. Ganger, and E. Riedel, "Object-based storage", IEEE Communications Magazine, 41(8):84-90, August 2003.

[8] Z. Dimitrijevic and R. Rangaswami, "Quality of service support for real-time storage systems", in Proceedings of the International IPSI-2003 Conference, October 2003.

[9] C. Lumb, A. Merchant, and G. Alvarez, "Façade: Virtual Storage Devices with Performance Guarantees", in Proceedings of the 2nd USENIX Conference on File and Storage Technologies, 2003.

[10] W. Jin, J. Chase, and J. Kaur, "Interposed proportional sharing for a storage service utility", in Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2004.

[11] P. Goyal, H. M. Vin, and H. Cheng, "Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks", IEEE/ACM Trans. Networking, vol. 5, no. 5, pp. 690-704, 1997.

[12] J. Zhang, A. Sivasubramaniam, A. Riska, Q. Wang, and E. Riedel, "An interposed 2-level I/O scheduling framework for performance virtualization", in Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2005.

[13] W. Jin, J. S. Chase, and J. Kaur, "Interposed Proportional Sharing For A Storage Service Utility", in SIGMETRICS, E. G. C. Jr., Z. Liu, and A. Merchant, Eds. ACM, 2004, pp. 37-48.

[14] Y. Wang and A. Merchant, "Proportional-share scheduling for distributed storage systems", in Proceedings of the 5th USENIX Conference on File and Storage Technologies, San Jose, CA, pp. 47-60.

[15] E. Molina-Estolano, C. Maltzahn, J. Bent and S. Brandt, "Building a parallel file system simulator", Journal of Physics: Conference Series 180, 012050, 2009.

[16] P. Carns, B. Settlemyer and W. Ligon, "Using Server-to-Server Communication in Parallel File Systems to Simplify Consistency and Improve Performance", Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, pp. 317-327.

[Figure 3. The throughput percentage each group obtains over the first 100 seconds of runtime in both the real system and the simulated system. Panels (a1), (b1), (c1) show the real system and (a2), (b2), (c2) the simulated system, with Group1:Group2 weight ratios of 1:1, 1:2, and 1:4 in SFQ(4). The average throughput ratios for Group 2 are: (a1) 50.00%; (a2) 50.23%; (b1) 63.41%; (b2) 66.19%; (c1) 68.58%; (c2) 77.67%.]

[17] J. Bucy, J. Schindler, S. Schlosser, G. Ganger and contributors, "The DiskSim simulation environment version 4.0 reference manual", Technical Report CMU-PDL-08-101, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA.

[18] INET framework, URL: http://inet.omnetpp.org/

[19] A. Varga, "The OMNeT++ discrete event simulation system", in European Simulation Multiconference (ESM'2001), Prague, Czech Republic, June 2001.

[20] J. C. Wu and S. A. Brandt, "The design and implementation of AQuA: an adaptive quality of service aware object-based storage device", in Proceedings of the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 209-218, College Park, MD, May 2006.

[21] Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence, "FAB: Building distributed enterprise disk arrays from commodity components", in Proceedings of ASPLOS, ACM, October 2004.

[22] A. Mathur, M. Cao, and S. Bhattacharya, "The new ext4 filesystem: current status and future plans", in Proceedings of the 2007 Ottawa Linux Symposium, pages 21-34, June 2007.

[23] IOR: I/O Performance benchmark, URL: https://asc.llnl.gov/sequoia/benchmarks/IOR_summary_v1.0.pdf

[24] A. Gulati and P. Varman, "Lexicographic QoS Scheduling for Parallel I/O", in Proceedings of the 17th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '05), Las Vegas, NV, June 2005.

[25] S. A. Weil, K. T. Pollack, S. A. Brandt, and E. L. Miller, "Dynamic metadata management for petabyte-scale file systems", in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), ACM, Nov. 2004.

[26] S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn, "CRUSH: Controlled, scalable, decentralized placement of replicated data", in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, Nov. 2006, ACM.

[27] D. Roselli, J. Lorch, and T. Anderson, "A comparison of file system workloads", in Proceedings of the 2000 USENIX Annual Technical Conference, pages 41-54, San Diego, CA, June 2000, USENIX Association.

[28] R. B. Ross and W. B. Ligon III, "Server-side scheduling in cluster parallel I/O systems", Calculateurs Parallèles Journal, 2001.

[29] Y. Xu, L. Wang, D. Arteaga, M. Zhao, Y. Liu and R. Figueiredo, "Virtualization-based Bandwidth Management for Parallel Storage Systems", in 5th Petascale Data Storage Workshop (PDSW '10), pages 1-5, New Orleans, LA, Nov. 2010.

[30] D. Chambliss, G. Alvarez, P. Pandey and D. Jadav, "Performance virtualization for large-scale storage systems", in Symposium on Reliable Distributed Systems, pages 109-118, IEEE, 2003.