Author manuscript, published in "High Performance Computing (2013) 20"

Approximation Algorithms for Energy Minimization in Cloud Service Allocation under Reliability Constraints

Olivier Beaumont, Inria, Bordeaux, France. Email: Olivier.Beaumont@inria.fr

Philippe Duchon, University of Bordeaux, Bordeaux, France. Email: Philippe.Duchon@labri.fr

Paul Renaud-Goud, Inria, Bordeaux, France. Email: paul.renaud-goud@inria.fr

hal-00788964, version 3 - 10 Oct 2013

Abstract: We consider allocation problems that arise in the context of service allocation in Clouds. More specifically, we assume on the one hand that each computing resource is associated with a capacity, which can be chosen using the Dynamic Voltage and Frequency Scaling (DVFS) method, and with a probability of failure. On the other hand, we assume that the services run as a set of independent instances of identical Virtual Machines (VMs). Moreover, there exists a Service Level Agreement (SLA) between the Cloud provider and the client that can be expressed as follows: the client comes with a minimal number of service instances that must be alive at any time, and the Cloud provider offers a list of pairs (price, compensation), the compensation having to be paid by the Cloud provider if it fails to keep alive the required number of services. On the Cloud provider side, each pair actually corresponds to a guaranteed reliability of fulfilling the constraint on the minimal number of instances.

In this context, given a minimal number of instances and a probability of success, the question for the Cloud provider is to find the number of necessary resources, their clock frequency and an allocation of the instances (possibly using replication) onto machines. This solution should satisfy all types of constraints (both capacity and reliability constraints). Moreover, it should remain valid during a time period (with a given reliability in presence of failures) while minimizing the energy consumption of used resources. We assume in this paper that this time period, which typically takes place between two redistributions, is fixed and known in advance. We prove deterministic approximation ratios on the consumed energy for algorithms that provide guaranteed reliability and we provide an extensive set of simulations that show that homogeneous solutions are close to optimal.

Keywords: Cloud, reliability, approximation, energy savings

I. INTRODUCTION

A. Reliability and Energy Savings in Cloud Computing

This paper considers energy savings and reliability issues that arise when allocating instances of an application consisting of a set of independent services running as Virtual Machines (VMs) onto Physical Machines (PMs) in a Cloud Computing platform. Cloud Computing [1]–[4] has emerged as a well-suited paradigm for service providing over the Internet. Using virtualization, it is possible to run several Virtual Machines on top of a given Physical Machine. Since each VM hosts its complete software stack (Operating System, Middleware, Application), it is moreover possible to migrate VMs from one PM to another in order to dynamically balance the load.

In the static case, mapping VMs with heterogeneous computing demands onto PMs with (possibly heterogeneous) capacities can be modeled as a multi-dimensional bin-packing problem. Indeed, in this context, each physical machine is characterized by its computing capacity (i.e. the number of flops it can process during one time-unit), its memory capacity (i.e. the number of different VMs that it can handle simultaneously, given that each VM comes with its complete software stack) and its failure rate (i.e. the probability that the machine will fail during the next time period), and each service comes with its requirements, in terms of CPU and memory demands, and reliability constraints.

In order to deal with capacity constraints in resource allocation problems, several sophisticated techniques have been developed in order to optimally allocate VMs onto PMs, either to achieve good load balancing [5]–[7] or to minimize energy consumption [8], [9]. Most of the works in this domain have therefore concentrated on designing offline [10] and online [11], [12] solutions of Bin Packing variants.

Reliability constraints have received much less attention in the context of Cloud computing, as underlined by Walfredo Cirne in [13]. Nevertheless, related questions have been addressed in the context of more distributed and less reliable systems such as Peer-to-Peer networks. In such systems, efficient data sharing is complicated by erratic node failure, unreliable network connectivity and limited bandwidth. Thus, data replication can be used to improve both availability and response time and the question is to determine where to replicate data in order to meet performance and availability requirements in large-scale systems [14]–[18]. Reliability issues have also been addressed by the High Performance Computing community. Indeed, a lot of effort has recently been devoted to building systems capable of reaching Exaflop performance [19], [20], and such exascale systems are expected to gather billions of processing units, thus increasing the importance of fault tolerance issues [21]. Solutions for fault tolerance in Exascale systems are based on replication strategies [22] and rollback recovery relying on checkpointing protocols [23], [24].

This work is a follow-up of [25], where the question of how to evaluate the reliability of a general allocation has been addressed and a set of deterministic and randomized heuristics have been proposed. In this paper, we concentrate on energy savings issues and we propose proved approximation algorithms. In order to minimize energy consumption, we assume that sophisticated mechanisms exist in order to fix the clock frequency of the PMs, such as DVFS (see [26]–[30]). In this context, the capacity of the PM can be expressed as a function of the clock frequency. In general, the probability of failure may itself depend on the clock frequency (see for instance [31]); nevertheless, we did not find in the literature a widely admitted model stating how clock frequency and failures relate and we leave this issue for future work.

To assess precisely the specific complexity of energy minimization introduced by reliability constraints in the context of service allocation in Clouds, we concentrate on a simple context that nevertheless captures the main difficulties. First, we consider that the applications running on the Cloud platform can be seen as a set of independent services, and that the services themselves consist of a number of identical (in terms of requirements) and independent instances. Therefore, we do not consider the problems introduced by heterogeneity, which have already been considered (see for instance [6], [7]). Indeed, as soon as heterogeneity is considered, basic allocation problems reduce to Bin Packing problems and are therefore intrinsically difficult. Then, we consider static allocation problems only, in the sense that our goal is to find the allocation that optimizes the reliability during a time period. This time period corresponds to the time between two phases of migration and reconfiguration of the allocation of VMs onto PMs. During this time period, the goal for the provider is to ensure that a minimal number of instances of each service is running whatever the machine failures. In order to enforce reliability constraints, the provider will over-provision resources by allocating and running more instances than actually required by the services in order to cope with failures. Combining these static and dynamic phases is out of the scope of this paper. Therefore, our work makes it possible to assess precisely the complexity introduced by machine failures and service reliability demands on energy minimization.

Throughout this paper, we assume that the characteristics of the applications and their requirements (in terms of reliability in particular) have been negotiated between a client and the provider through a Service Level Agreement (SLA). In the SLA, each service is characterized by its demand in terms of processing capability (i.e. the minimal number of instances of VMs that must be running simultaneously) and in terms of reliability (i.e. the maximal probability that the service does not benefit from this number of instances at some point during the next time period). Equivalently, the reliability requirement may be negotiated through the payment of a fine by the Cloud Provider if it fails to provide the required amount of resources. In the case where it may be difficult for the user to decide the level of reliability a priori, we discuss in Section V how reliability can be proposed by the cloud provider as a list of (price, compensation) pairs. In all cases, the goal, from the provider point of view, is therefore to determine the cost of reliability, since a higher reliability will induce more replication and therefore more energy consumption. Our goal in this paper is to find allocations that minimize energy consumption while enforcing reliability constraints, and therefore to determine the price of reliability.

B. Notations

In this section, we introduce the notations that will be used throughout the paper. Our target Cloud platform is made of $m$ physical machines $M_1, M_2, \ldots, M_m$. As already noted, we assume that machine $M_j$ is able to handle the execution of $\mathrm{CAPA}_j$ instances of services. We also assume that we can rely on a Dynamic Voltage and Frequency Scaling (DVFS) mechanism in order to adapt $\mathrm{CAPA}_j$. The energy consumed by machine $M_j$ when running at capacity (speed proportional to) $\mathrm{CAPA}_j$ is given by $E = E_{\mathrm{stat}}(j) + E_{\mathrm{dyn}}(j)$, where $E_{\mathrm{dyn}}(j) = e_j \, \mathrm{CAPA}_j^{\alpha}$. This means that the energy consumed by machine $M_j$ can be seen as the sum of a leakage term (paid as soon as the machine is switched on) and of a term that depends on its running speed (most works consider that $2 \le \alpha \le 3$). We assume in addition continuous speeds, which means that any $\mathrm{CAPA}_j$ can be achieved by machine $M_j$ (as advocated in [32]–[34]), so that we can obtain readable and interesting results.

On this Cloud platform, our goal is to run (all through a given time period, as defined in the SLA) $n$ services $S_1, S_2, \ldots, S_n$. $\mathrm{DEM}_i$ identical and independent instances of service $S_i$ are required, and the instances of the different services run as Virtual Machines. Several instances of the same service can therefore run concurrently and independently on the same physical machine, even if it lowers the service reliability. We will denote by $A_{i,j}$ the number of instances of $S_i$ running on $M_j$. Therefore, $\sum_i A_{i,j}$ represents the overall number of instances running on $M_j$ and it cannot exceed $\mathrm{CAPA}_j$. Respectively, $\sum_j A_{i,j}$ represents the overall number of running instances of $S_i$. In general, $\sum_j A_{i,j}$ is larger than $\mathrm{DEM}_i$ since replication, i.e. over-provisioning of services, is used in order to enforce reliability constraints.

More precisely, each machine $M_j$ comes with a failure rate $\mathrm{FAIL}_j$, which represents the probability of failure of $M_j$ during the time period. During the time period, we will not reallocate instances of services to physical machines but rather provision extra instances for the services (replicas) that will actually be used if some machines fail. As said previously, we will assume for the results proved in this paper that $\mathrm{FAIL}_j$ does not depend on $\mathrm{CAPA}_j$.

We will denote by ALIVE the set of running machines. In our model, at the end of the time period, the machines are either up or completely down, so that the number of instances of service $S_i$ running on $M_j$ is $A_{i,j}$ if $M_j \in \mathrm{ALIVE}$, and 0 otherwise. Therefore, $\mathrm{ALIVEINST}_i = \sum_{j \,:\, M_j \in \mathrm{ALIVE}} A_{i,j}$ denotes the overall number of running instances of $S_i$ at the end of the time period. In addition, $S_i$ is running properly at the end of the time period if and only if $\sum_{j \,:\, M_j \in \mathrm{ALIVE}} A_{i,j} \ge \mathrm{DEM}_i$.

Of course, our goal is not that all services should run properly at the end of the time period with certainty. Indeed, such a reliability cannot be achieved in practice since the probability that all machines fail is clearly larger than 0 in our model. In general, as noted in a recent paper of the NY Times [35], Data Centers usually over-provision resources (at the price of high energy consumption) in order to (quasi-)avoid failures. We assume a more sustainable model, where the SLA defines the reliability requirement $\mathrm{REL}_i$ for service $S_i$ (together with the penalty paid by the Cloud Provider if $S_i$ does not run with at least $\mathrm{DEM}_i$ instances at the end of the period). Therefore, the Cloud provider faces the following optimization problem:

BestEnergy($m, n, \mathrm{DEM}, \mathrm{REL}$): Find the set ON of machines that are on, the clock frequency assigned to each machine $M_j$, represented by $\mathrm{CAPA}_j$, and an allocation $A$ of instances of services $S_1, S_2, \ldots, S_n$ to machines $M_1, M_2, \ldots, M_m$ such that

(i) $\forall j \in \mathrm{ON}, \ \sum_{i=1}^{n} A_{i,j} \le \mathrm{CAPA}_j$;

(ii) $\forall i, \ P(\mathrm{ALIVEINST}_i \ge \mathrm{DEM}_i) \ge 1 - \mathrm{REL}_i$, i.e. the probability that at least $\mathrm{DEM}_i$ instances of $S_i$ are running on alive machines after the time period is at least the reliability requirement $1 - \mathrm{REL}_i$;

(iii) the overall energy consumption $\sum_{j \in \mathrm{ON}} \left( E_{\mathrm{stat}}(j) + e_j \, \mathrm{CAPA}_j^{\alpha} \right)$ is minimized.

C. Methodology

Throughout the paper, we will rely on the same general approach. From Section II to Section IV, in order to prove the claimed approximation ratios, we rely on the following techniques.

For the lower bounds, we prove that for a service, given the reliability constraints of this service and given the failure probabilities of the machines, at least a given number of instances, or at least a given level of energy, is needed. These results are obtained through careful applications of Hoeffding bounds [36].

For the upper bounds, we concentrate on a special allocation scheme, namely Homogeneous. In a Homogeneous solution, for each service, we assign to every machine the same number of instances, i.e. $\forall i, \forall j \in \mathrm{ON}, \ A_{i,j} = A_i$. Using this allocation scheme, we are able to derive theoretical bounds relying on Chernoff bounds [37]. Moreover, the comparison with the lower bound shows that the quality of the obtained solutions is reasonably high, especially in the case of energy minimization, and even asymptotically optimal when the size of the platform or the overall volume of service instances to be handled becomes arbitrarily large.

D. Motivating example

In order to illustrate the objective functions that we consider throughout this paper and the notations, let us consider a service with a demand $\mathrm{DEM} = 20$ and a reliability request of $\mathrm{REL} = 4.5 \times 10^{-6}$, that has to be mapped onto a Cloud composed of $m = 10$ physical machines, whose failure probability is $\mathrm{FAIL} = 10^{-1}$. Figure 1 depicts the kind of solutions that we consider in this paper. In terms of minimizing the number of instances, the best solution consists in allocating 10 instances of the service to the first 2 machines and 5 instances to the 8 remaining machines. Therefore, the optimal solution allocates a total of 60 instances, whereas only 20 instances are required at the end of the time period, in order to satisfy reliability constraints.

Figure 1. Motivating example

The shape of the optimal solution reflects the complexity of the problem. Indeed, it has been proved in [25] that even in the case of a single service and even if the allocation is given, estimating its reliability is a #P-complete problem. The #P complexity class has been introduced by Valiant [38] in order to classify the problems where the goal is not to determine whether there exists a solution (captured by the NP-completeness notion) but rather to determine the number of solutions. In our context, the reliability of an allocation is related to the number (weighted by their probability) of ALIVE sets that lead to an allocation where all service demands are satisfied. In this example, in order to check that the reliability is at least (in fact essentially equal to) the requirement $1 - \mathrm{REL}$, we can observe that all configurations where at least 4 machines are alive are acceptable (since at least 20 instances are alive as soon as 4 machines are up), together with all configurations with 3 alive machines as soon as a machine loaded with 10 instances is involved, and the configuration with only the first two machines alive. Counting the number of such valid configurations (weighted by their probability) leads to the reliability of the allocation.

Generally speaking, the question of determining the optimal solution remains open and all the references to the optimal in the paper rely either on comparisons to a lower bound or on exhaustive enumeration of the solutions (for instance, the optimality statement for the example of this section has been obtained through exhaustive search). Nevertheless, we will concentrate on Homogeneous solutions, i.e. those where all PMs are given the same number of instances. We provide in Section II algorithms to compute the BestHomogeneous solution.

We can notice that the optimal solution involves 60 instances against around 67 for the best fractional homogeneous solution. Indeed, the best fractional solution allocates $20/3$ instances to each machine, so that all configurations with 3 alive machines are enough, thus leading to a better reliability (at a higher cost). Note that this case has been determined using exhaustive search among all possible allocations with 10 machines and where the number of instances given to each PM is an integer, so that this example can be seen as a worst case.

As far as energy minimization is concerned, we can notice that if we assume $\alpha = 2$, despite the bad load balancing among the machines in the optimal solution for the number of instances, this solution remains optimal. Indeed, the dynamic energy of the unbalanced solution is given by $2 \times 10^2 + 8 \times 5^2 = 400$ and the energy of the homogeneous one is given by $10 \times (20/3)^2 \approx 444.4$. On the other hand, if $\alpha = 3$ for instance, then the homogeneous solution consumes less energy ($10 \times (20/3)^3 \approx 2963$) than the unbalanced solution ($2 \times 10^3 + 8 \times 5^3 = 3000$). Thus, we can observe on this example that minimizing the dynamic energy (rather than minimizing the number of instances) favors homogeneous solutions.

Therefore, in the rest of this paper, we will use fractional homogeneous solutions in order both to derive approximation algorithms and upper bounds on the number of required resources. Indeed, we prove in Section II that homogeneous allocations are asymptotically optimal for dynamic energy minimization when the number of involved PMs becomes large. In Section IV, we provide an extensive set of simulations showing that homogeneous solutions are in general close to optimal for general energy minimization in a large number of situations.

E. Outline of the Paper

As we have noticed through the motivating example, BestEnergy is in general difficult since verifying that a given allocation satisfies a given reliability constraint is already #P-complete. Nevertheless, we prove in this paper that even when the allocation is to be determined, it is possible to provide low-complexity deterministic approximation algorithms, that are even asymptotically optimal when the sum of the demands becomes arbitrarily large. Another original result that we prove in this paper is that minimizing the energy (relying on DVFS) induced by replication is easier than minimizing the number of replicas, whereas in many contexts (see [39]) the non-linearity of energy consumption makes the optimization problems harder. In our context, approximation ratios are smaller for energy minimization than for classical replication minimization (that would correspond to makespan or load balancing in other contexts).

To prove this result, we progressively come to the most general problem through the study of simpler objective functions, considering several models for energy minimization in turn. First, we address in Section II the case where dynamic energy only is concerned, i.e. without taking explicitly the leakage term into account. Then, we introduce the static energy part in Section III and the more general MIN-ENERGY problem. For MIN-ENERGY, the setting is the same except that the number of participating machines is to be determined and DVFS can be used to determine the capacity of each machine. Finally, in Section IV, we perform some simulations in order to show that homogeneous solutions are in fact very close to optimal.

II. DYNAMIC ENERGY MINIMIZATION USING DVFS

In this section, we concentrate on the dynamic energy minimization problem. Therefore, we assume that the number of resources that are switched on is fixed in advance. Then, since no reallocation or VM migration will take place during the considered period, our goal is to actually run more instances than what is actually required by the demand of the service, so as to cope automatically with machine failures during the period. Indeed, since we are considering services typically serving requests and where the demand is given as a minimal number of requests per time unit, it is both sufficient and necessary to enforce that the remaining serving capacity given failures is large enough with the reliability expressed in the SLA.

A. Lower bound

Let us consider the case of a single service to be mapped onto a fixed number of machines when the objective is to minimize the amount of resources necessary to enforce the conditions defined in the SLA in terms of quantity (of alive instances at the end of the time period) and reliability. The problem comes in two flavours depending on the resources we want to optimize. Recall that $A_j$ is the number of instances of the service initially allocated to machine $M_j$. In its physical machines version, the optimization problem consists in minimizing the number of instances allocated to the different machines, i.e. minimizing $\sum_j A_j$. In its energy minimization version, we rely on the DVFS mechanism in order to adapt the voltage of a machine to the need of the instances allocated to it. In general, energy consumption models assume that the energy dissipated by a processor running at speed $s$ is proportional to $s^{\alpha}$. Therefore, the energy dissipated by a processor running $A_j$ instances will be proportional to $A_j^{\alpha}$ and the overall objective is to minimize the overall dissipated energy, i.e. $\sum_j A_j^{\alpha}$.

In order to find the lower bound, let us consider any allocation (where $A_j$ is the number of service instances initially allocated to machine $M_j$) and let us prove that if the amount of resources is too small, then reliability constraints cannot be met. Let $\mathrm{ALIVEINST}_j$ denote the number of instances of the service that are alive on machine $M_j$ at the end of the time period, so that $\mathrm{ALIVEINST} = \sum_{j=1}^{m} \mathrm{ALIVEINST}_j$. $\mathrm{ALIVEINST}_j$ is thus a random variable equal to $A_j$ with probability $1 - \mathrm{FAIL}$ and to 0 with probability $\mathrm{FAIL}$.

Hence, the expected number of alive instances is given by $E(\mathrm{ALIVEINST}) = \sum_{j=1}^{m} E(\mathrm{ALIVEINST}_j) = (1 - \mathrm{FAIL}) \sum_{j=1}^{m} A_j$. Hoeffding's inequality (see [36]) says how much the number of alive resources may differ from its expected value. In particular, for the lower bound, we will use it in the following form, that bounds the chance of being lucky, i.e. of having a correct allocation with few instances. More precisely, it states that for all $t > 0$:

$$P\big(\mathrm{ALIVEINST} \ge E(\mathrm{ALIVEINST}) + t\big) \le \exp\left( - \frac{2 t^2}{\sum_{j=1}^{m} A_j^2} \right).$$

Let us choose $t = \sqrt{ - \ln(1 - \mathrm{REL}) \sum_{j=1}^{m} A_j^2 / 2 }$, so that $\exp\left( - 2 t^2 / \sum_{j=1}^{m} A_j^2 \right) = 1 - \mathrm{REL}$. Noting $K' = - \frac{\ln(1 - \mathrm{REL})}{2}$, and since $E(\mathrm{ALIVEINST}) = (1 - \mathrm{FAIL}) \sum_{j=1}^{m} A_j$, the previous equation becomes

$$P\left( \mathrm{ALIVEINST} \ge (1 - \mathrm{FAIL}) \sum_{j=1}^{m} A_j + \sqrt{ K' \sum_{j=1}^{m} A_j^2 } \right) \le 1 - \mathrm{REL}.$$

Now, if a given allocation succeeds, then, by definition, $P(\mathrm{ALIVEINST} \ge \mathrm{DEM}) \ge 1 - \mathrm{REL}$.
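As a numerical aside, the motivating example of Section I-D is small enough to be checked exactly, and the same computation gives a sanity check of the Hoeffding step used above. The sketch below is only an illustration and not one of the paper's algorithms: the function names (`alive_distribution`, `failure_probability`, `upper_tail`, `hoeffding_bound`, `dynamic_energy`) are hypothetical, machines are assumed to fail independently with the same probability FAIL as in the model, and the proportionality constant of the dynamic energy is taken equal to 1.

```python
from fractions import Fraction
from math import exp


def alive_distribution(alloc, fail):
    """Exact law of ALIVEINST: {number of alive instances: probability}."""
    dist = {0: 1.0}
    for a in alloc:  # add machines one at a time (independent failures)
        nxt = {}
        for total, p in dist.items():
            nxt[total] = nxt.get(total, 0.0) + p * fail                # machine is down
            nxt[total + a] = nxt.get(total + a, 0.0) + p * (1 - fail)  # machine is up
        dist = nxt
    return dist


def failure_probability(alloc, fail, dem):
    """P(ALIVEINST < DEM) for the given allocation."""
    return sum(p for total, p in alive_distribution(alloc, fail).items() if total < dem)


def upper_tail(alloc, fail, t):
    """Exact P(ALIVEINST >= E(ALIVEINST) + t)."""
    expected = (1 - fail) * sum(alloc)
    return sum(p for total, p in alive_distribution(alloc, fail).items()
               if total >= expected + t)


def hoeffding_bound(alloc, t):
    """Hoeffding upper bound exp(-2 t^2 / sum_j A_j^2) on the same tail."""
    return exp(-2 * t * t / sum(a * a for a in alloc))


def dynamic_energy(alloc, alpha):
    """Dynamic energy sum_j A_j^alpha (proportionality constant taken as 1)."""
    return sum(a ** alpha for a in alloc)


# Parameters of the motivating example.
FAIL, DEM, REL = 0.1, 20, 4.5e-6
unbalanced = [10, 10] + [5] * 8        # 60 instances in total
homogeneous = [Fraction(20, 3)] * 10   # about 67 instances, kept exact with Fraction

for name, alloc in (("unbalanced", unbalanced), ("homogeneous", homogeneous)):
    print(name,
          "P_fail =", failure_probability(alloc, FAIL, DEM),
          "energy(alpha=2) =", float(dynamic_energy(alloc, 2)),
          "energy(alpha=3) =", float(dynamic_energy(alloc, 3)))
print("exact tail:", upper_tail(unbalanced, FAIL, 5),
      "Hoeffding bound:", hoeffding_bound(unbalanced, 5))
```

Under these assumptions, the unbalanced allocation fails with probability about $4.45 \times 10^{-6}$, just below REL, while the fractional homogeneous allocation fails with probability about $3.7 \times 10^{-7}$. The distribution is built by adding one machine at a time, which stays cheap here because few distinct totals are reachable; this does not contradict the #P-completeness of the general problem.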

Thus we obtain that a necessary condition on the $A_j$'s so that the reliability constraint is enforced is given by

$$(1 - \mathrm{FAIL}) \sum_{j=1}^{m} A_j + \sqrt{ K' \sum_{j=1}^{m} A_j^2 } \ge \mathrm{DEM}.$$

As stated in the introduction of this section, we are interested either in minimizing $\sum_j A_j$ for resource use minimization, or $\sum_j A_j^{\alpha}$ for energy minimization. To obtain lower bounds on these quantities under both the quantitative constraint (number of alive instances) and the qualitative constraint (reliability), we rely on Hölder's inequality, which states that if $1/p + 1/q = 1$, then

$$\forall a_j, b_j \ge 0, \quad \sum_j a_j b_j \le \left( \sum_j a_j^p \right)^{1/p} \left( \sum_j b_j^q \right)^{1/q}.$$

Since the $A_j$'s are non-negative, we have $\sum_j A_j^2 \le \left( \sum_j A_j \right)^2$, so that

$$(1 - \mathrm{FAIL}) \sum_{j=1}^{m} A_j + \sqrt{ K' \sum_{j=1}^{m} A_j^2 } \le \left( 1 - \mathrm{FAIL} + \sqrt{K'} \right) \sum_{j=1}^{m} A_j.$$

Hence a necessary condition in order to satisfy the constraints is given by

$$\sum_{j=1}^{m} A_j \ge \frac{\mathrm{DEM}}{1 - \mathrm{FAIL} + \sqrt{K'}} = \mathrm{MINREP}.$$

Therefore, any solution that satisfies quantitative and qualitative constraints must allocate at least MINREP instances, whatever the distribution of instances onto machines is.

With $p = \alpha$, $1/q = 1 - 1/\alpha$, $a_j = A_j$ and $b_j = 1$, we obtain $\sum_j A_j \le \left( \sum_j A_j^{\alpha} \right)^{1/\alpha} m^{1 - 1/\alpha}$.

Similarly, assuming that $\alpha > 2$ hence $\alpha/2 > 1$, with $p = \alpha/2$, $1/q = 1 - 2/\alpha$, $a_j = A_j^2$ and $b_j = 1$, we obtain

$$\sum_{j=1}^{m} A_j^2 \le \left( \sum_{j=1}^{m} A_j^{\alpha} \right)^{2/\alpha} m^{1 - 2/\alpha},$$

so that

$$(1 - \mathrm{FAIL}) \sum_{j=1}^{m} A_j + \sqrt{ K' \sum_{j=1}^{m} A_j^2 } \le \left( (1 - \mathrm{FAIL}) \, m^{1 - 1/\alpha} + \sqrt{K'} \, m^{1/2 - 1/\alpha} \right) \left( \sum_{j=1}^{m} A_j^{\alpha} \right)^{1/\alpha}.$$

Also, we can derive another necessary condition defined as

$$\sum_{j=1}^{m} A_j^{\alpha} \ge \left( \frac{\mathrm{DEM}}{(1 - \mathrm{FAIL}) \, m^{1 - 1/\alpha} + \sqrt{K'} \, m^{1/2 - 1/\alpha}} \right)^{\alpha} = \mathrm{MINENERGY},$$

which also holds true for $\alpha = 2$.

Therefore, any solution that satisfies quantitative and qualitative constraints must consume at least MINENERGY, whatever the distribution of instances onto machines is.

B. Upper bound – Homogeneous

1) MIN-REPLICATION: As explained above, in order to obtain upper bounds on the amount of necessary resources (either in terms of number of instances or energy), it is enough to exhibit a valid solution (that satisfies the constraints defined in the SLA). To achieve this, we will concentrate in this part on homogeneous (fractional) solutions, with an equally-balanced allocation among all machines (i.e. $\forall j, A_j = A$).

An assignment is considered as failed when there are not enough instances of the service that are running at the end of the time period, hence $P_{\mathrm{fail}} = P(\mathrm{ALIVEINST} < \mathrm{DEM})$. From the homogeneous characteristics of the allocations, we derive that $\mathrm{ALIVEINST} = A \times |\mathrm{ALIVE}|$, and then $P_{\mathrm{fail}} = P\left( |\mathrm{ALIVE}| < \frac{\mathrm{DEM}}{A} \right)$. $|\mathrm{ALIVE}|$ can be described as the sum $\sum_{j=1}^{m} X_j$ of independent random variables $X_j$, where, for all $j \in \{1, \ldots, m\}$, $X_j$ depicts the fact that machine $M_j$ is alive at the end of the time period ($X_j$ is equal to 1 with probability $1 - \mathrm{FAIL}$, and to 0 with probability $\mathrm{FAIL}$).

Hence, the expected value of $|\mathrm{ALIVE}|$ is given by $E(|\mathrm{ALIVE}|) = (1 - \mathrm{FAIL}) \, m$. The Chernoff bound (see [37]) says how much the number of alive machines may differ from its expected value. We use in this part Chernoff bounds rather than Hoeffding bounds because the random variables take their value in $\{0, 1\}$ instead of $\{0, \ldots, A\}$ and Chernoff bounds are more accurate in this case. In particular, for the upper bound, we will use it in the following form, that bounds the chance of being unlucky, i.e. of failing to have a correct allocation while allocating a large number of instances. More specifically, the Chernoff bound gives that for all $\varepsilon > 0$, $P\big( |\mathrm{ALIVE}| \le (1 - \mathrm{FAIL} - \varepsilon) \, m \big) \le e^{-2 \varepsilon^2 m}$. As we want to ensure that $P_{\mathrm{fail}} \le \mathrm{REL}$, we choose $\varepsilon$ such that $e^{-2 \varepsilon^2 m} = \mathrm{REL}$, i.e. $\varepsilon = \sqrt{K/m}$ by noting $K = - \frac{\ln(\mathrm{REL})}{2}$. This allows to rewrite the previous equation into $P\left( |\mathrm{ALIVE}| \le \left( 1 - \mathrm{FAIL} - \sqrt{K/m} \right) m \right) \le \mathrm{REL}$. Finally, we obtain a sufficient condition, so that the reliability constraint is fulfilled for the service:

$$A \, m \ge \frac{\mathrm{DEM}}{1 - \mathrm{FAIL} - \sqrt{K/m}} = \mathrm{MAXREP},$$

since then

$$P_{\mathrm{fail}} = P(\mathrm{ALIVEINST} < \mathrm{DEM}) = P\big( |\mathrm{ALIVE}| \, A < \mathrm{DEM} \big) \le P\left( |\mathrm{ALIVE}| \le \left( 1 - \mathrm{FAIL} - \sqrt{K/m} \right) m \right),$$

and therefore $P_{\mathrm{fail}} \le \mathrm{REL}$.

Therefore, it is possible to satisfy the SLA with at most MAXREP instances of the service. Similarly, we can derive an upper bound of the energy needed to enforce the SLA. Indeed,

hal-00788964, version 3 - 10 Oct 2013with the same value ofA, we obtain Moreover, we have seen that a necessary condition (see

Section II-A) for allocationA to be valid is given by

j

DEM

v

A m p

u

1 1= 1=2 1= m m

(1 FAIL)m Km X X

u

2

0

t

(1 FAIL) A + K A DEM;

j

= MAXENERGY: j

j=1 j=1

C. Comparison

1=

MINENERGY

what induces (1 FAIL) V +

m

p

1=

When minimizing the number of necessary instances to

MINENERGY 0

p 0

K m V DEM and ﬁnally

0

m

1 FAIL+ K

MAXREP

p

enforce the SLA, we obtain = : For

0 MINENERGY DEM

K

MINREP p

V < or equivalently

1 FAIL

m m

(1 FAIL)m Km

realistic values of the parameters, above approximation ratio

q

p

0 DEM DEM

ln(1 REL) p p

m V :

0

K 0

is good (close to one), since both K =

1 FAIL+ K

1 FAIL

2

m

q q

ln(REL)

K

and = are small as soon as m is large.

m 2m

III. OVERALL ENERGY MINIMIZATION

Nevertheless, the ratio is not asymptotically optimal when m

In above section, we have considered the case where the

becomes large.

number of used machines is ﬁxed in advance. In this context,

On the other hand, for energy minimization, we have

the leakage term is paid for all machines, and is a constant.

p

1 1= 1=2 1= In general, in the context of a Cloud platform, both the set of

0

MAXENERGY

(1 FAIL)m + K m

p

=

1 1= 1=2 1= used resources and the voltage associated to them have to be

(1 FAIL)m Km

MINENERGY

! determined. In this case, given that k2f1;:::;mg, the goal

q

0

K

(1 FAIL)+ is to minimize

m

p

= ;

!

K

(1 FAIL)

m

DEM

(low)

so that this ratio tends to 1 when $m$ becomes arbitrarily large. This shows that, for energy minimization, homogeneous fractional solutions provide very good results when $m$ is large enough. In the following section, we prove that an allocation with a large dispersion (in a sense described precisely below) of the number of instances allocated to the machines cannot achieve the SLA constraints with optimal energy.

D. Can optimal solutions be strongly heterogeneous?

The above results state that, for both the minimization of the number of instances and the minimization of the energy, homogeneous allocations provide good solutions. Nevertheless, we know from the example depicted in Figure 1 that optimal solutions, for both objectives, are not always homogeneous. In the case of energy minimization, the dispersion of an allocation cannot be too large, as stated more formally in the following theorem.

Theorem 1: Let us consider a valid allocation $A$ whose energy is not larger than MAXENERGY, the upper bound on the energy consumed by a homogeneous allocation. Then, if

\[ V = \left( \frac{\sum_j (A_j^2)^{\alpha/2}}{m} \right)^{2/\alpha} - \left( \frac{\sum_j A_j}{m} \right)^{2} \]

is used as the measure of dispersion of the $A_j$'s (related to the $\alpha/2$-th moment of their square values), then

[explicit upper bound on $V$ in terms of DEM, FAIL, $K$, $K'$ and $m$: garbled in the source text]

Proof: Let us first introduce

\[ V' = \frac{\sum_j A_j^2}{m} - \left( \frac{\sum_j A_j}{m} \right)^{2}. \]

Then $V \ge V'$. Indeed,

\[ V - V' = \left( \frac{\sum_j A_j^{\alpha}}{m} \right)^{2/\alpha} - \frac{\sum_j A_j^2}{m}, \]

which has the same sign as

\[ \left( \frac{\sum_j A_j^{\alpha}}{m} \right)^{1/\alpha} - \left( \frac{\sum_j A_j^2}{m} \right)^{1/2}, \]

which is non-negative by application of Hölder's inequality.

\[ E^{(low)}(k) = k\,E_{stat} + k \left( \frac{DEM}{(1-FAIL)\,k + \sqrt{K' k}} \right)^{\alpha}. \]

In the above problem, there is intuitively an interesting trade-off to be found. Since $\alpha \ge 2$, the machines are more efficient in terms of requests per watt when running at a low frequency. On the other hand, running the machines at a lower frequency requires a larger number of machines and therefore induces a higher leakage term.

A. Lower bound

Let $g$ be the function defined on $]0,+\infty[$ by $g(x) = g_t(x)/g_d(x)$. Let us prove that if $g_d$ is non-decreasing, concave and positive, and $g_t$ is non-increasing, convex and positive, then $g$ is convex. On the one hand, if $g_d$ fulfills its constraints, then $1/g_d$ is non-increasing, convex and positive, and on the other hand, the product of two non-increasing, convex and positive functions is a convex function (this can be easily seen on the second derivative: $(fh)'' = f''h + 2f'h' + fh'' \ge 0$ when $f$ and $h$ are both non-increasing, convex and positive).

Let us apply the above lemma with $g_t(x) = x/x^{\alpha/2}$ (which is convex since $\alpha \ge 2$) and $g_d(x) = (1-FAIL)\sqrt{x} + \sqrt{K'}$, and deduce easily that $E^{(low)}$ is convex. Indeed, $E^{(low)}(x) = x\,E_{stat} + DEM^{\alpha}\, g_t(x)/g_d(x)^{\alpha}$, and $1/g_d^{\alpha}$ is non-increasing, convex and positive, like $1/g_d$.

Therefore, $E^{(low)}$ admits a unique minimum on $[1,m]$. Since $E^{(low)}(x) \to +\infty$ when $x \to 0$ and $E^{(low)}(x) \to +\infty$ when $x \to +\infty$, the derivative $E^{(low)\prime}$ is null at some point in $[0,+\infty[$, and let us define $x^{(low)}_{min}$ such that $E^{(low)\prime}(x^{(low)}_{min}) = 0$, i.e. as the solution of

\[ E_{stat} - \left( \frac{DEM}{(1-FAIL)\,x^{(low)}_{min} + \sqrt{K' x^{(low)}_{min}}} \right)^{\alpha} \frac{(\alpha-1)(1-FAIL) + \left(\frac{\alpha}{2}-1\right)\sqrt{K'/x^{(low)}_{min}}}{(1-FAIL) + \sqrt{K'/x^{(low)}_{min}}} = 0. \qquad (1) \]

The minimum of the function $E^{(low)}$ on $[1,m]$ is reached for $\min(\max(x^{(low)}_{min},1),m)$.
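The clamping rule min(max(x_min, 1), m) for the lower bound can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it takes the lower-bound energy to be E_low(x) = x E_stat + x (DEM / ((1 - FAIL) x + sqrt(K' x)))^alpha, it uses an arbitrary placeholder value for K' (the constant is defined earlier in the paper and is not recomputed here), and it locates the minimum by bisection on a finite-difference slope instead of solving Equation 1 in closed form. The function names are hypothetical.

```python
import math


def lower_bound_energy(x, dem, fail, k_prime, alpha, e_stat):
    """Assumed form of E_low(x): static part plus dynamic part on x machines."""
    denom = (1.0 - fail) * x + math.sqrt(k_prime * x)
    return x * e_stat + x * (dem / denom) ** alpha


def argmin_convex(energy, m, tol=1e-6, h=1e-4):
    """Minimizer of a convex function on [1, m], by bisection on the slope sign.

    Returning an endpoint when the slope does not change sign is exactly the
    clamping min(max(x_min, 1), m).
    """
    def slope_sign(x):
        # Central finite difference; only its sign is used.
        return energy(x + h) - energy(x - h)

    lo, hi = 1.0, float(m)
    if slope_sign(lo) >= 0.0:   # already increasing at 1: minimum is at 1
        return lo
    if slope_sign(hi) <= 0.0:   # still decreasing at m: minimum is at m
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope_sign(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def example_energy(x):
    # Settings of Figure 2; k_prime=5.0 is an arbitrary illustrative value.
    return lower_bound_energy(x, dem=1e4, fail=1e-2, k_prime=5.0,
                              alpha=3.0, e_stat=5e4)


x_min = argmin_convex(example_energy, m=1000)
print(f"x_min ~ {x_min:.1f}, E_low(x_min) ~ {example_energy(x_min):.3e}")
```

With a smaller machine budget (for instance `m=100`), the slope is still negative at `m`, so the function returns `m` itself, which is the clamped case of the formula above.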

[Figure 2: two plots of the energy (y-axis) against the number of machines $m$ (x-axis) for the algorithms lower.bound, theo.homo and best.homo. (a) Dynamic energy consumption. (b) Energy consumption.]

Figure 2. Simulation results for FAIL $= 10^{-2}$, DEM $= 10^{4}$, REL $= 10^{-5}$, $\alpha = 3$, $E_{stat} = 5 \cdot 10^{4}$.

B. Upper bound – Homogeneous

The energy consumption of a Homogeneous solution on $k$ machines is given by

\[ E^{(up)}(k) = k\,E_{stat} + \frac{1}{k^{\alpha-1}} \left( \frac{DEM}{(1-FAIL) - \sqrt{K/k}} \right)^{\alpha}. \]

Let us apply again the above lemma with $g_t(x) = DEM^{\alpha}/x^{\alpha-1}$ and $g_d(x) = 1 - FAIL - \sqrt{K/x}$ to prove that $E^{(up)}$ is convex and consequently admits a unique minimum on $[1,m]$. Moreover, $E^{(up)}(x) \to +\infty$ when $x \to \infty$ and $E^{(up)}(x) \to +\infty$ when $x \to 0$, so that we can uniquely define $x^{(up)}_{min}$ by $E^{(up)\prime}(x^{(up)}_{min}) = 0$, i.e.

\[ E_{stat} = \left( \frac{DEM}{(1-FAIL)\,x^{(up)}_{min} - \sqrt{K x^{(up)}_{min}}} \right)^{\alpha} \frac{(\alpha-1)(1-FAIL) + \left(1-\frac{\alpha}{2}\right)\sqrt{K/x^{(up)}_{min}}}{(1-FAIL) - \sqrt{K/x^{(up)}_{min}}}. \qquad (2) \]

IV. SIMULATIONS

The application of Chernoff bounds makes it possible to find valid solutions (satisfying the reliability constraints) and to obtain theoretical upper bounds, but Chernoff bounds are in general too pessimistic, especially when the number of machines is small. Hence, we derive in this section a heuristic that returns a homogeneous allocation with lower energy than the one obtained in Section II-B.

A. Algorithms for MIN-ENERGY-NO-SHUTDOWN Problem

In this section, we concentrate on the dynamic energy part only, and we assume that the overall number of running PMs is fixed, so that the leakage term has to be paid for all PMs.

1) lower.bound: In order to evaluate the performance of the heuristics, we rely on the lower bound proved in Section II-A. This is a lower bound on the energy consumption that is required in order to fulfill the reliability constraint.

2) theo.homo: This algorithm builds a valid solution following the Homogeneous policy. We have exhibited such a solution in Section II-B. In order to determine the frequency at which each PM should be run, we rely on Chernoff bounds to estimate the reliability of the allocation. Therefore, due to the application of conservative Chernoff bounds, this solution is in general pessimistic, in the sense that the induced energy may not be optimal.

3) best.homo: In order to cope with the limitations of the theo.homo algorithm, best.homo finds the best solution (i.e. the one that minimizes the energy consumption) following the Homogeneous policy. To do this, we need to estimate precisely the reliability of an allocation, instead of relying on a lower bound as in theo.homo. best.homo can be decomposed into an off-line and an on-line phase; the former is executed once and for all, while the latter is to be run for each reliability constraint.

In the off-line phase, we rely on a double-entry table, where a row is associated with a number of machines $m$ and a column corresponds to a reliability requirement REL. The value of a cell indicates the maximum number $m'$ such that the probability of having at least $m' \le m$ alive machines among the $m$ initial machines at the end of the day is not less than $1 - REL$. Those values can be obtained thanks to a cumulative binomial distribution.

In the on-line phase, we perform a binary search on the machine capacity, so that we end up with a valid solution minimizing the energy. Obviously, this solution is the one that minimizes the common clock frequency of the machines, and if the reliability constraint is fulfilled for a given capacity, it is a fortiori fulfilled for higher frequencies. At each step, for a given frequency, we just have to check, using the table, whether the number of alive instances is large enough.

B. Algorithms for MIN-ENERGY problem

Let us now consider the case when both static (leakage) and dynamic energy have to be taken into account, and when both the number of PMs and their frequency have to be determined. When adding a non-zero static energy, all heuristics and bounds are such that the overall dissipated energy tends to $+\infty$ if the number of machines tends to 0 (because of the dynamic energy) or to $+\infty$ (because of the static energy). It remains to find, for each of them, the optimal number of machines.

We have proved the convexity of the energy function returned by lower.bound. Thus, solving Equation 1 using binary search is enough to obtain the optimal $m$. We operate in the same way for theo.homo, solving Equation 2 thanks to a binary search. Since the energy consumption of the best homogeneous allocation is also convex (as a function of the number of machines), we also rely on the same technique for best.homo on the MIN-ENERGY problem. More specifically, we perform a binary search in order to obtain the number of used machines that leads to minimum energy consumption.

C. Results for MIN-ENERGY problem

1) For a single configuration: In Figure 2, we compare the performance of all three heuristics under the following settings: FAIL $= 10^{-2}$, DEM $= 10^{4}$, REL $= 10^{-5}$, $\alpha = 3$, $E_{stat} = 5 \cdot 10^{4}$ and $m$ varies between 1 and 250. lower.bound is depicted in red, best.homo in blue, theo.homo in green. As expected, the dynamic energy decreases with the number of machines and, as we proved in Section II-C, the lower and the upper bound converge when the number of machines becomes large. When both leakage and dynamic energy terms are taken into account, the plots obtained for lower.bound, best.homo and theo.homo are convex, as proved in Section III. Using binary search for each plot, we are able to determine, for each heuristic, the point that minimizes the overall energy (respectively the red point for lower.bound, the blue point for best.homo and the green point for theo.homo).

In this example, the energy consumed by lower.bound is $2.58 \cdot 10^{7}$, while best.homo consumes $2.67 \cdot 10^{7}$ and theo.homo $2.94 \cdot 10^{7}$, showing that theo.homo is 14% larger than the lower bound and that best.homo is only 4% larger than the lower bound.

[Figure 3: four plots of the total energy (y-axis) against $E_{stat}$ (x-axis, from 25000 to 100000) for the algorithms lower.bound, theo.homo and best.homo. (a) FAIL $= 10^{-1}$, REL $= 5 \cdot 10^{-7}$. (b) FAIL $= 10^{-3}$, REL $= 5 \cdot 10^{-7}$. (c) FAIL $= 10^{-2}$, REL $= 10^{-5}$. (d) FAIL $= 10^{-2}$, REL $= 10^{-9}$. (e) Legend.]

Figure 3. Simulation results for DEM $= 10^{4}$ and $\alpha = 3$.
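The two phases of best.homo described in Section IV-A can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that machines fail independently with probability FAIL, that the demand is split evenly (and fractionally) over the machines guaranteed to be alive, and that a machine running at capacity c dissipates E_stat + c^alpha. It also replaces the binary searches of the text by a direct formula for the capacity and a plain scan over the number of machines, which stays correct even where the integer steps of m' make the energy curve slightly irregular. All function names are hypothetical.

```python
import math


def max_guaranteed_alive(m, fail, rel):
    """Off-line table entry: largest m' with P(at least m' alive) >= 1 - rel.

    Walks the binomial distribution of the number of failed machines
    (requires 0 <= fail < 1) until its cumulative mass reaches 1 - rel.
    """
    pmf = (1.0 - fail) ** m          # P(0 failures)
    cdf = pmf
    failures = 0
    while cdf < 1.0 - rel and failures < m:
        # Binomial recurrence: P(f + 1) from P(f).
        pmf *= (m - failures) / (failures + 1) * fail / (1.0 - fail)
        failures += 1
        cdf += pmf
    return m - failures


def homogeneous_energy(m, dem, fail, rel, alpha, e_stat):
    """Energy of the homogeneous allocation on m machines (assumed model)."""
    alive = max_guaranteed_alive(m, fail, rel)
    if alive == 0:
        return math.inf              # no machine is guaranteed to survive
    capacity = dem / alive           # load each surviving machine must carry
    return m * (e_stat + capacity ** alpha)


def best_machine_count(m_max, dem, fail, rel, alpha, e_stat):
    """Scan 1..m_max and return (m, energy) with the smallest energy."""
    best_m = min(range(1, m_max + 1),
                 key=lambda m: homogeneous_energy(m, dem, fail, rel,
                                                  alpha, e_stat))
    return best_m, homogeneous_energy(best_m, dem, fail, rel, alpha, e_stat)


# Settings of the single configuration of Section IV-C1.
m_best, e_best = best_machine_count(1000, dem=1e4, fail=1e-2, rel=1e-5,
                                    alpha=3.0, e_stat=5e4)
print(f"best m = {m_best}, energy ~ {e_best:.3e}")
```

Under these assumptions the energy found for the Figure 2 settings is of the same order as the value reported above for best.homo, which gives a quick sanity check of the model rather than a reproduction of the experiment.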

2) Simulation Results: In order to study the influence of the different parameters, we performed a large set of simulations, whose results are depicted in Figure 3. Each point in Figure 3 corresponds to the result of an experiment for a single configuration, as described in Section IV-C1. For instance, the results of the configuration depicted in the previous section can be read on Figure 3(c) when $E_{stat} = 5 \cdot 10^{4}$.

In general, we can observe that the simulation results show the efficiency of homogeneous distributions with respect to energy minimization. Indeed, the red plots correspond to a lower bound that holds true for any (possibly heterogeneous) solution. In all cases, the ratio between the upper bound theo.homo and the lower bound lower.bound is always smaller than 1.2, and the ratio between the upper bound best.homo and the lower bound lower.bound is always smaller than 1.08.

Therefore, our simulation results show both that best.homo is always very close to the lower bound and that the approximation ratio provided by theo.homo is in general not too pessimistic.

V. PRICING ISSUES

In practice, it may be difficult for the cloud user to evaluate the reliability requirements for the service they are running. On the other hand, our work enables the cloud provider to price the reliability constraint, since it is possible to estimate the overall price of the energy PRICE(E(REL)) that is required to enforce reliability REL for a given service. From this information, it is possible for the Cloud provider to turn its offer into a list of pairs (price, compensation), so that

\[ (1 - REL) \times price - REL \times compensation = PRICE(E(REL)). \]

In this case, the expectation of the price received by the provider is equal to its actual energy cost.

VI. CONCLUSION AND OPEN PROBLEMS

In this paper, we have proposed approximation algorithms for minimizing both the number of used resources and the dissipated energy in the context of static service allocation under reliability constraints in Clouds. For both optimization problems, we have given lower bounds and we have exhibited algorithms that achieve the claimed reliability. In the case of energy minimization, we have even been able to prove that the proposed algorithm is asymptotically optimal when the overall demand or the number of machines becomes arbitrarily large. Such a result is important since it makes it possible, from the point of view of the Cloud provider, to associate a price to reliability (or equivalently to fix penalties in case of SLA violation).

This work opens many perspectives. First, relying on different techniques, better approximation ratios are needed in the case of a low number of resources. Then, the extension to several services is trivial in the case of resource usage minimization, but not trivial in the case of energy minimization. It would also be interesting to explicitly take into account the memory footprint of the services, so as to limit the number of different services that a machine can handle. This would lead to different results, by forcing a limit on the number of physical machines participating in the deployment of each individual service. Finally, we concentrate in this work on the static phase, and we assume that migrations and redistributions take place at regular time steps. It would be very interesting to mix both migrations and static allocations in order to minimize the overall required energy, since more frequent redistributions induce less energy consumed by replication but more energy wasted by migration phases.
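The pricing rule of Section V, (1 - REL) × price - REL × compensation = PRICE(E(REL)), can be turned into a list of (price, compensation) pairs by solving it for the price at each compensation level. The sketch below only illustrates that arithmetic: the energy cost and the compensation levels are made-up numbers, and the function name is hypothetical.

```python
def price_for(energy_cost, rel, compensation):
    """Price such that (1 - rel) * price - rel * compensation = energy_cost."""
    return (energy_cost + rel * compensation) / (1.0 - rel)


# Hypothetical figures: energy cost of enforcing REL = 1e-5, and three
# compensation levels the provider is willing to offer.
energy_cost = 120.0
rel = 1e-5
offers = [(price_for(energy_cost, rel, c), c) for c in (0.0, 1e3, 1e6)]
for price, compensation in offers:
    print(f"price = {price:.4f}, compensation = {compensation:.0f}")
```

Each pair leaves the provider's expected income equal to the energy cost, so a higher compensation is matched by a higher price.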