A Hybrid OS Cluster Solution






Architect of an Open World™






















A Hybrid OS Cluster Solution

Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon

Published: June 2009

Dr. Patrice Calegari, HPC Application Specialist, BULL S.A.S.
Thomas Varlet, HPC Technology Solution Professional, Microsoft








The proof of concept presented in this document is neither a product nor a service offered by Microsoft or BULL S.A.S.

The information contained in this document represents the current view of Microsoft Corporation and BULL S.A.S. on the issues discussed as of the date of publication. Because Microsoft and BULL S.A.S. must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft or BULL S.A.S., and Microsoft and BULL S.A.S. cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT and BULL S.A.S. MAKE NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation and BULL S.A.S.

Microsoft and BULL S.A.S. may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft or BULL S.A.S., as applicable, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2008, 2009 Microsoft Corporation and BULL S.A.S. All rights reserved.

NovaScale is a registered trademark of Bull S.A.S.

Microsoft, Hyper-V, Windows, Windows Server, and the Windows logo are trademarks of the Microsoft group of companies.

PBS GridWorks®, GridWorks™, PBS Professional®, PBS™ and Portable Batch System® are trademarks of Altair Engineering, Inc.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.






Initial publication: release 1.2, 52 pages, published in June 2008
Minor updates: release 1.5, 56 pages, published in Nov. 2008
This paper with meta-scheduler implementation: release 2.0, 76 pages, published in June 2009








Abstract

The choice of an operating system (OS) for a high performance computing (HPC) cluster is a critical decision for IT departments. The goal of this paper is to show that simple techniques are available today to optimize the return on investment by making that choice unnecessary, and to keep the HPC infrastructure versatile and flexible. This paper introduces Hybrid Operating System Clusters (HOSC). An HOSC is an HPC cluster that can run several OS's simultaneously. This paper addresses the situation where two OS's are running simultaneously: Linux Bull Advanced Server for Xeon and Microsoft® Windows® HPC Server 2008. However, most of the information presented in this paper can apply to 3 or more simultaneous OS's, possibly from other OS distributions, with slight adaptations. This document gives general concepts as well as detailed setup information. First, the technologies necessary to design an HOSC are defined (dual-boot, virtualization, PXE, resource manager and job scheduler). Second, different approaches to HOSC architectures are analyzed and technical recommendations are given, with a focus on computing performance and management flexibility. The recommendations are then implemented to determine the best technical choices for designing an HOSC prototype. The installation setup of the prototype and the configuration steps are explained. A meta-scheduler based on Altair PBS Professional is implemented. Finally, basic HOSC administrator operations are listed and ideas for future work are proposed.










This paper can be downloaded from the following web sites:

http://www.bull.com/techtrends
http://www.microsoft.com/downloads
http://technet.microsoft.com/en-us/library/cc700329(WS.10).aspx









ABSTRACT ........ 3
1  INTRODUCTION ........ 7
2  CONCEPTS AND PRODUCTS ........ 9
   2.1  Master Boot Record (MBR) ........ 9
   2.2  Dual-boot ........ 9
   2.3  Virtualization ........ 10
   2.4  PXE ........ 12
   2.5  Job schedulers and resource managers in an HPC cluster ........ 13
   2.6  Meta-scheduler ........ 13
   2.7  Bull Advanced Server for Xeon ........ 14
        2.7.1  Description ........ 14
        2.7.2  Cluster installation mechanisms ........ 14
   2.8  Windows HPC Server 2008 ........ 16
        2.8.1  Description ........ 16
        2.8.2  Cluster installation mechanisms ........ 16
   2.9  PBS Professional ........ 18
3  APPROACHES AND RECOMMENDATIONS ........ 19
   3.1  A single operating system at a time ........ 19
   3.2  Two simultaneous operating systems ........ 21
   3.3  Specialized nodes ........ 23
        3.3.1  Management node ........ 23
        3.3.2  Compute nodes ........ 23
        3.3.3  I/O nodes ........ 24
        3.3.4  Login nodes ........ 24
   3.4  Management services ........ 25
   3.5  Performance impact of virtualization ........ 25
   3.6  Meta-scheduler for HOSC ........ 26
        3.6.1  Goals ........ 26
        3.6.2  OS switch techniques ........ 26
        3.6.3  Provisioning and distribution policies ........ 26
4  TECHNICAL CHOICES FOR DESIGNING AN HOSC PROTOTYPE ........ 27
   4.1  Cluster approach ........ 27
   4.2  Management node ........ 27
   4.3  Compute nodes ........ 27
   4.4  Management services ........ 28
   4.5  HOSC prototype architecture ........ 32
   4.6  Meta-scheduler architecture ........ 33
5  SETUP OF THE HOSC PROTOTYPE ........ 34
   5.1  Installation of the management nodes ........ 34
        5.1.1  Installation of the RHEL5.1 host OS with Xen ........ 34
        5.1.2  Creation of 2 virtual machines ........ 34
        5.1.3  Installation of XBAS management node on a VM ........ 36
        5.1.4  Installation of InfiniBand driver on domain 0 ........ 36
        5.1.5  Installation of HPCS head node on a VM ........ 36
        5.1.6  Preparation for XBAS deployment on compute nodes ........ 37
        5.1.7  Preparation for HPCS deployment on compute nodes ........ 37
        5.1.8  Configuration of services on HPCS head node ........ 38
   5.2  Deployment of the operating systems on the compute nodes ........ 39
        5.2.1  Deployment of XBAS on compute nodes ........ 39
        5.2.2  Deployment of HPCS on compute nodes ........ 40
   5.3  Linux-Windows interoperability environment ........ 43
        5.3.1  Installation of the Subsystem for Unix-based Applications (SUA) ........ 43
        5.3.2  Installation of the Utilities and SDK for Unix-based Applications ........ 43
        5.3.3  Installation of add-on tools ........ 43
   5.4  User accounts ........ 43
   5.5  Configuration of ssh ........ 44
        5.5.1  RSA key generation ........ 44
        5.5.2  RSA key ........ 44
        5.5.3  Installation of freeSSHd on HPCS compute nodes ........ 45
        5.5.4  Configuration of freeSSHd on HPCS compute nodes ........ 45
   5.6  Installation of PBS Professional ........ 45
        5.6.1  PBS Professional Server setup ........ 46
        5.6.2  PBS Professional setup on XBAS compute nodes ........ 46
        5.6.3  PBS Professional setup on HPCS nodes ........ 46
   5.7  Meta-scheduler queues setup ........ 46
        5.7.1  Just in time provisioning setup ........ 48
        5.7.2  Calendar provisioning setup ........ 48
6  ADMINISTRATION OF THE HOSC PROTOTYPE ........ 49
   6.1  HOSC setup checking ........ 49
   6.2  Remote reboot command ........ 49
   6.3  Switch a compute node OS type from XBAS to HPCS ........ 49
   6.4  Switch a compute node OS type from HPCS to XBAS ........ 50
        6.4.1  Without sshd on the HPCS compute nodes ........ 50
        6.4.2  With sshd on the HPCS compute nodes ........ 50
   6.5  Re-deploy an OS ........ 50
   6.6  Submit a job with the meta-scheduler ........ 51
   6.7  Check node status with the meta-scheduler ........ 52
7  CONCLUSION AND PERSPECTIVES ........ 54
APPENDIX A: ACRONYMS ........ 55
APPENDIX B: BIBLIOGRAPHY AND RELATED LINKS ........ 57
APPENDIX C: MASTER BOOT RECORD DETAILS ........ 59
   C.1  MBR structure ........ 59
   C.2  Save and restore MBR ........ 59
APPENDIX D: FILES USED IN EXAMPLES ........ 60
   D.1  Windows HPC Server 2008 files ........ 60
        D.1.1  Files used for compute node deployment ........ 60
        D.1.2  Script for IPoIB setup ........ 61
        D.1.3  Scripts used for OS switch ........ 62
   D.2  XBAS files ........ 63
        D.2.1  Kickstart and PXE files ........ 63
        D.2.2  DHCP configuration ........ 64
        D.2.3  Scripts used for OS switch ........ 65
        D.2.4  Network interface bridge configuration ........ 67
        D.2.5  Network hosts ........ 67
        D.2.6  IB network interface configuration ........ 68
        D.2.7  ssh host configuration ........ 68
   D.3  Meta-scheduler setup files ........ 68
        D.3.1  PBS Professional configuration files on XBAS ........ 68
        D.3.2  PBS Professional configuration files on HPCS ........ 69
        D.3.3  OS load balancing files ........ 69
APPENDIX E: HARDWARE AND SOFTWARE USED FOR THE EXAMPLES ........ 72
   E.1  Hardware ........ 72
   E.2  Software ........ 72
APPENDIX F: ABOUT ALTAIR AND PBS GRIDWORKS ........ 73
   F.1  About Altair ........ 73
   F.2  About PBS GridWorks ........ 73
APPENDIX G: ABOUT MICROSOFT AND WINDOWS HPC SERVER 2008 ........ 74
   G.1  About Microsoft ........ 74
   G.2  About Windows HPC Server 2008 ........ 74
APPENDIX H: ABOUT BULL S.A.S. ........ 75








1  Introduction

The choice of the right operating system (OS) for a high performance computing (HPC) cluster can be a very difficult decision for IT departments, and this choice will usually have a big impact on the Total Cost of Ownership (TCO) of the cluster. Parameters like multiple user needs, application environment requirements and security policies add to the complex human factors involved in training, maintenance and support planning, all leading to associated risks on the final return on investment (ROI) of the whole HPC infrastructure. The goal of this paper is to show that simple techniques are available today to make that choice unnecessary, and to keep your HPC infrastructure versatile and flexible.

In this white paper we study how to provide the best flexibility for running several OS's on an HPC cluster. There are two main types of approaches to providing this service, depending on whether a single operating system is selected each time the whole cluster is booted, or whether several operating systems run simultaneously on the cluster. The most common approach of the first type is the dual-boot cluster (described in [1] and [2]). For the second type of approach, we introduce the concept of a Hybrid Operating System Cluster (HOSC): a cluster with some compute nodes running one OS type while the remaining nodes run another OS type.

Several approaches to both types are studied in this document in order to determine their properties (requirements, limits, feasibility, and usefulness), with a clear focus on computing performance and management flexibility.


The study is limited to 2 operating systems: Linux Bull Advanced Server for Xeon 5v1.1 and Microsoft Windows HPC Server 2008 (noted XBAS and HPCS respectively in this paper). For optimizing the interoperability between the two OS worlds, we use the Subsystem for Unix-based Applications (SUA) for Windows. The description of the methodologies is as general as possible in order to apply to other OS distributions, but examples are given exclusively in the XBAS/HPCS context. The concepts developed in this document could apply to 3 or more simultaneous OS's with slight adaptations. However, this is out of the scope of this paper.


We introduce a meta-scheduler that provides a single submission point for both Linux and Windows. It selects the cluster nodes with the OS type required by submitted jobs. The OS type of compute nodes can be switched automatically and safely without administrator intervention. This optimizes computational workloads by adapting the distribution of OS types among the compute nodes.

A technical proof of concept is given by designing, installing and running an HOSC prototype. This prototype can provide computing power under both XBAS and HPCS simultaneously. It has two virtual management nodes (aka head nodes) on a single server, and the choice of the OS distribution among the compute nodes can be made dynamically. We have chosen Altair PBS Professional software to demonstrate a meta-scheduler implementation. This project is the result of the collaborative work of Microsoft and Bull.

Chapter 2 defines the main technologies used in an HOSC: the Master Boot Record (MBR), the dual-boot method, virtualization, the Pre-boot eXecution Environment (PXE), and the resource manager and job scheduler tools. If you are already familiar with these concepts, you may want to skip this chapter and go directly to Chapter 3, which analyzes different approaches to HOSC architectures and gives technical recommendations for their design. The recommendations are implemented in Chapter 4 in order to determine the best technical choices for building an HOSC prototype. The installation setup of the prototype and the configuration steps are explained in Chapter 5. Appendix D shows the files that were used during this step. Finally, basic HOSC administrator operations are listed in Chapter 6, and ideas for future work are proposed in Chapter 7, which concludes this paper.

This document is intended for computer scientists who are familiar with HPC cluster administration.

All acronyms used in this paper are listed in Appendix A. Complementary information can be found in the documents and web pages listed in Appendix B.









2  Concepts and products

We assume that readers may not be familiar with every concept discussed in the remaining chapters, in both the Linux and Windows environments. Therefore, this chapter introduces the technologies (Master Boot Record, dual-boot, virtualization and Pre-boot eXecution Environment) and products (Linux Bull Advanced Server, Windows HPC Server 2008 and PBS Professional) mentioned in this document.

If you are already familiar with these concepts or are more interested in general Hybrid OS Cluster (HOSC) considerations, you may want to skip this chapter and go directly to Chapter

2.1  Master Boot Record (MBR)

The 512-byte boot sector is called the Master Boot Record (MBR). It is the first sector of a partitioned data storage device such as a hard disk. The MBR is usually overwritten by operating system (OS) installation procedures; the MBR previously written on the device is then lost.

The MBR includes the partition table of the 4 primary partitions and a bootstrap code that can start the OS or load and run the boot loader code (see the complete MBR structure in Table 3 of Appendix C.1). A partition is encoded as a 16-byte structure with size, location and characteristic fields. The first 1-byte field of the partition structure is called the boot flag.

Windows MBR starts the OS installed on the active partition. The active partition is the first primary partition that has its boot flag enabled. You can select an OS by activating the partition where it is installed. The diskpart.exe and fdisk tools can be used to change partition activation. Appendix D.1.3 and Appendix D.2.3 give examples of commands that enable/disable the boot flag.

Linux MBR can run a boot loader code (e.g., GRUB or LILO). You can then select an OS interactively from its user interface at the console. If no choice is given at the console, the OS selection is taken from the boot loader configuration file that you can edit in advance before a reboot (e.g., grub.conf for the GRUB boot loader). If necessary, the Linux boot loader configuration file (which is written in a Linux partition) can be replaced from a Windows command line with the dd.exe tool.

Appendix C.2 explains how to save and restore the MBR of a device. It is very important to understand how the MBR works in order to properly configure dual-boot systems.
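As an illustration of these MBR operations, here is a minimal sketch of how the boot sector can be inspected, saved and restored from Linux, and how the boot flag can be toggled from Windows with diskpart.exe. This is not the exact procedure of Appendix C.2 or D.1.3; the device name /dev/sda and the disk/partition numbers are assumptions.

    # Linux: show the partition table; the active partition is flagged with '*'
    fdisk -l /dev/sda
    # save the whole 512-byte MBR (bootstrap code + partition table)
    dd if=/dev/sda of=/root/sda.mbr bs=512 count=1
    # restore only the 446-byte bootstrap code, keeping the current partition table
    dd if=/root/sda.mbr of=/dev/sda bs=446 count=1

    rem Windows: inside an interactive diskpart.exe session, activate a partition
    select disk 0
    select partition 1
    active

Restoring the full 512 bytes would also overwrite the partition table, so the 446-byte variant is usually the safer choice when the partition layout has changed since the backup was taken.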

2.2  Dual-boot

Dual-booting is an easy way to have several operating systems (OS) on a node. When an OS is run, it has no interaction with the other installed OS, so the native performance of the node is not affected by the use of the dual-boot feature. The only limitation is that these OS's cannot be run simultaneously.

When designing a dual-boot node, the following points should be analyzed:

- The choice of the MBR (and the choice of the boot loader, if applicable)
- The disk partition restrictions (for example, Windows must have a system partition on at least one primary partition of the first device)
- The compatibility with Logical Volume Managers (LVM). For example, RHEL5.1 LVM creates a logical volume with the entire first device by default, and this makes it impossible to install a second OS on this device.

When booting a computer, the dual-boot feature gives the ability to choose which OS to start from the multiple OS's installed on that computer. At boot time, the way you can select the OS of a node depends on the installed MBR. A dual-boot method that relies on the Linux MBR and GRUB is described in [1]. Another dual-boot method that exploits the properties of active partitions is described in [2] and [3].
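With a Linux MBR and GRUB, OS selection is driven by the GRUB configuration file mentioned in Section 2.1. The grub.conf below is a minimal illustrative sketch, not the prototype's actual file; the partition numbers, kernel version and entry titles are assumptions. It defines a Linux entry and a Windows entry that chain-loads the boot sector of the Windows partition; editing the default value before a reboot selects the other OS.

    default=0          # 0 = first title (Linux), 1 = second title (Windows)
    timeout=5
    title XBAS (Linux)
            root (hd0,1)
            kernel /vmlinuz-2.6.18-53.el5 ro root=/dev/sda3
            initrd /initrd-2.6.18-53.el5.img
    title Windows HPC Server 2008
            rootnoverify (hd0,0)
            chainloader +1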

2.3  Virtualization

The virtualization technique is used to hide the physical characteristics of computers and only present a logical abstraction of these characteristics. Virtual Machines (VM) can be created by the virtualization software: each VM has virtual resources (CPUs, memory, devices, network interfaces, etc.) whose characteristics (quantity, size, etc.) are independent from those available on the physical server. The OS installed in a VM is called a guest OS: the guest OS can only access the virtual resources available in its VM. Several VMs can be created and run on one physical node. These VMs appear like physical machines to the applications, the users and the other nodes (physical or virtual).

Virtualization is interesting in the context of our study for two reasons:

1. It makes possible the installation of several management nodes (MN) on a single physical server. This is an important point for installing several OS's on a cluster without increasing its cost with the installation of an additional physical MN server.
2. It provides a fast and rather easy way to switch from one OS to another: by starting a VM that runs one OS while suspending another VM that runs another OS.

A hypervisor is a software layer that runs at a higher privilege level on the hardware. The virtualization software runs in a partition (domain 0, or dom0), from where it controls how the hypervisor allocates resources to the virtual machines. The other domains, where the VMs run, are called unprivileged domains and are noted domU. A hypervisor normally enforces scheduling policies and memory boundaries. In some Linux implementations it also provides access to hardware devices via its own drivers. On Windows, it does not.


The virtualization software can be:

- Host-based (like VMware): the virtualization software is installed on a physical server with a classical OS called the host OS.
- Hypervisor-based (like Windows Server® 2008 Hyper-V and Xen): in this case, the hypervisor runs at a lower level than the OS. The "host OS" becomes just another VM that is automatically started at boot time. Such a virtualization architecture is shown in Figure 1.

Figure 1 - Overview of hypervisor-based virtualization architecture

"Full virtualization" is an approach which requires no modification of the hosted operating system, providing the illusion of a complete system of real hardware devices. Such Hardware Virtual Machines (HVM) require hardware support, provided for example by Intel® Virtualization Technology (VT) and AMD-V technology. Recent Intel® Xeon® processors support full virtualization thanks to Intel® VT. Windows is only supported on fully-virtualized VMs and not on para-virtualized VMs.

"Para-virtualization" is an approach which requires modifications to the operating system in order to run in a VM.

The market provides many virtualization software packages, among which:



- Xen [6]: freeware for Linux, included in the RHEL5 distribution, which allows a maximum of 8 virtual CPUs per virtual machine (VM). Oracle VM and Sun xVM VirtualBox are commercial implementations.
- VMware [7]: commercial software for Linux and Windows which allows a maximum of 4 virtual CPUs per VM.
- Hyper-V [8]: a solution provided by Microsoft which only works on Windows Server 2008 and allows only 1 virtual CPU per VM for non-Windows VMs.
- PowerVM [9] (formerly Advanced POWER Virtualization): an IBM solution for UNIX and Linux on most processor architectures that does not support Windows as a guest OS.
- Virtuozzo [10]: a Parallels, Inc. solution designed to deliver near native physical performance. It only supports VMs that run the same OS as the host OS (i.e., Linux VMs on Linux hosts and Windows VMs on Windows hosts).
- OpenVZ [11]: an operating system-level virtualization technology licensed under GPL version 2. It is the basis of Virtuozzo [10]. It requires both the host and guest OS to be Linux, possibly of different distributions. It has a low performance penalty compared to a standalone server.
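Since the prototype described later uses Xen on RHEL5, here is a minimal illustrative sketch of what a Xen guest definition for a fully-virtualized (HVM) management node VM could look like. The file name, VM name, disk path and sizes are assumptions, and the exact set of options depends on the Xen release; this is not the configuration used in Chapter 5.

    # /etc/xen/hpcs-head  (hypothetical guest configuration file)
    name    = "hpcs-head"
    builder = "hvm"        # full virtualization (requires Intel VT or AMD-V)
    memory  = 4096         # MB of RAM given to the VM
    vcpus   = 4
    disk    = [ "file:/var/lib/xen/images/hpcs-head.img,hda,w" ]
    vif     = [ "bridge=xenbr0" ]
    boot    = "dc"         # try CD-ROM first, then hard disk, for the guest OS installation
    vnc     = 1            # graphical console reachable through VNC

Such a VM is typically started from domain 0 with the xm tool (e.g., xm create hpcs-head) and listed with xm list.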

2.4  PXE

The Pre-boot eXecution Environment (PXE) is an environment to boot computers using a network interface, independently of available data storage devices or installed OS. The end goal is to allow a client to network boot and receive a network boot program (NBP) from a network boot server.

In a network boot operation, the client computer will:

1. Obtain an IP address to gain network connectivity: when a PXE-enabled boot is initiated, the PXE-based ROM requests an IP address from a Dynamic Host Configuration Protocol (DHCP) server using the normal DHCP discovery process (see the detailed process in Figure 2). It will receive from the DHCP server an IP address lease, information about the correct boot server and information about the correct boot file.
2. Discover a network boot server: with the information from the DHCP server, the client establishes a connection to the PXE servers (TFTP, WDS, NFS, CIFS, etc.).
3. Download the NBP file from the network boot server and execute it: the client uses the Trivial File Transfer Protocol (TFTP) to download the NBP. Examples of NBPs are pxelinux.0 for Linux and WdsNbp.com for Windows Server.

When booting a compute node with PXE, the goal can be to install or run it with an image deployed through the network, or just to run it with an OS installed on its local disk. In the latter case, the PXE infrastructure just answers the compute node requests by indicating that it must boot on the next boot device listed in its BIOS.


Figure 2 - DHCP discovery process: the node (boot protocol client, port 68) broadcasts a DHCPDISCOVER containing its MAC address; the DHCP server (boot protocol server, port 67) answers with a DHCPOFFER proposing a new IP address; the node sends a DHCPREQUEST for that address; the server replies with a DHCPACK containing the IP address and the boot information.
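To make steps 1 and 2 concrete, the DHCP answer is driven by a per-host entry in the server configuration. The sketch below is assembled from the /etc/dhcpd.conf excerpts shown later in Figure 3 (the MAC and IP addresses are those of the example cluster; the surrounding host block syntax is the standard ISC dhcpd one). It tells the node which boot server to contact (next-server) and which NBP to download (filename):

    host xbas1 {
            hardware ethernet 00:30:19:D6:77:8A;
            fixed-address 192.168.0.2;       # IP address leased to the node
            option host-name "xbas1";
            next-server 192.168.0.1;         # TFTP/boot server (the management node)
            filename "pxelinux.0";           # network boot program (NBP)
    }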

2.5  Job schedulers and resource managers in an HPC cluster

In an HPC cluster, a resource manager (aka Distributed Resource Management System (DRMS) or Distributed Resource Manager (DRM)) gathers information about all cluster resources that can be used by application jobs. Its main goal is to give accurate resource information about the cluster usage to a job scheduler.

A job scheduler (aka batch scheduler or batch system) is in charge of unattended background executions. It provides a user interface for submitting, monitoring and terminating jobs. It is usually responsible for the optimization of job placement on the cluster nodes. For that purpose it deals with resource information, administrator rules and user rules: job priority, job dependencies, resource and time limits, reservations, specific resource requirements, parallel job management, process binding, etc.

With time, job schedulers and resource managers have evolved in such a way that they are now usually integrated under a unique product name. Here are some noteworthy products:

- PBS Professional [12]: supported by Altair for Linux/Unix and Windows
- Torque [13]: an open source job scheduler based on the original PBS project. It can be used as a resource manager by other schedulers (e.g., the Moab workload manager).
- SLURM (Simple Linux Utility for Resource Management) [14]: freeware and open source
- LSF (Load Sharing Facility) [15]: supported by Platform for Linux/Unix and Windows
- SGE (Sun Grid Engine) [16]: supported by Sun Microsystems
- OAR [17]: freeware and open source for Linux, AIX and SunOS/Solaris
- Microsoft Windows HPC Server 2008 job scheduler: included in the Microsoft HPC Pack [5]

2.6  Meta-scheduler

According to Wikipedia [18], "Meta-scheduling or Super scheduling is a computer software technique of optimizing computational workloads by combining an organization's multiple Distributed Resource Managers into a single aggregated view, allowing batch jobs to be directed to the best location for execution". In this paper, we consider that the meta-scheduler is able to submit jobs on cluster nodes with heterogeneous OS types and that it can automatically switch the OS type of these nodes when necessary (for optimizing computational workloads). Here is a partial list of meta-schedulers currently available:

- Moab Grid Suite and Maui Cluster Scheduler [19]: supported by Cluster Resources, Inc.
- GridWay [20]: a Grid meta-scheduler by the Globus Alliance
- CSF (Community Scheduler Framework) [21]: an open source framework (an add-on to the Globus Toolkit v.3) for implementing a grid meta-scheduler, developed by Platform Computing








Recent job schedulers can sometimes be adapted and configured to behave as "simple" meta-schedulers.

2.7  Bull Advanced Server for Xeon

2.7.1  Description

Bull Advanced Server for Xeon (XBAS) is a robust and efficient Linux solution that delivers total cluster management. It addresses each step of the cluster lifecycle with a centralized administration interface: installation, fast and reliable software deployments, topology-aware monitoring and fault handling (to dramatically lower time-to-repair), cluster optimization and expansion. Integrated, tested and supported by Bull [4], XBAS federates the very best of Open Source components, complemented by leading software packages from well-known Independent Software Vendors, and gives them a consistent view of the whole HPC cluster through a common cluster database: the clusterdb. XBAS is fully compatible with standard RedHat Enterprise Linux (RHEL). The latest Bull Advanced Server for Xeon 5 release (v3.1) is based on RHEL5.3¹.

2.7.2  Cluster installation mechanisms

The installation of an XBAS cluster starts with the setup of the management node (see the installation & configuration guide [22]). The compute nodes are then deployed by automated tools.

BIOS settings must be set so that XBAS compute nodes boot on the network with PXE by default. The PXE files stored on the management node indicate if a given compute node should be installed (i.e., its DEFAULT label is ks) or if it is ready to be run (i.e., its DEFAULT label is local_primary).

In the first case, a new OS image should be deployed². During the PXE boot process, the operations to be executed on the compute node are written in the kickstart file. Tools based on PXE are provided by XBAS to simplify the installation of compute nodes. The preparenfs tool writes the configuration files with the information given by the administrator and with that found in the clusterdb. The generated configuration files are: the PXE files (e.g., /tftpboot/C0A80002), the DHCP configuration file (/etc/dhcpd.conf), the kickstart file (e.g., /release/ks/kickstart) and the NFS export file (/etc/exportfs). No user interface access (remote or local) to the compute node is required during its installation phase with the preparenfs tool.

Figure 3 shows the sequence of interactions between a new XBAS compute node being installed and the servers run on the management node (DHCP, TFTP and NFS).

On small clusters, the preparenfs tool can be used to install every CN. On large clusters, the ksis tool can be used to optimize the total deployment time of the cluster by cloning the first CN installed with the preparenfs tool.

In the second case, the CN is already installed and the compute node just needs to boot locally on its local disk. Figure 4 shows the XBAS compute node normal boot scheme.

¹ The Bull Advanced Server for Xeon 5 release that was used to illustrate the examples in this paper is v1.1, based on RHEL5.1, because this was the latest release when we built the first prototypes in May 2008.

² In this document, we define the "deployment of an OS" as the installation of a given OS on several nodes from a management node. A more restrictive definition, which only applies to the duplication of OS images on the nodes, is often used in technical literature.

Figure 3 - XBAS compute node PXE installation scheme: the node boots on the network, gets its IP address (192.168.0.2), the next-server address (192.168.0.1) and the NBP name (pxelinux.0) from the DHCP server, downloads pxelinux.0 through TFTP, and reads its PXE file /tftpboot/C0A80002 whose DEFAULT label is ks; it then downloads the RHEL5.1 vmlinuz kernel and initrd.img and installs the OS through NFS by executing the instructions of the kickstart file /release/ks/kickstart.


Figure 4 - XBAS compute node PXE boot scheme: the node boots on the network, gets its IP address and pxelinux.0 from the DHCP and TFTP servers, and reads its PXE file /tftpboot/C0A80002 whose DEFAULT label is local_primary; it then chain-loads chain.c32 (APPEND hd0) and boots the Linux kernel of the XBAS5 environment installed on its local disk.
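The switch between the two behaviors is entirely encoded in the compute node's PXE file on the management node. Reassembled from the excerpts shown in Figures 3 and 4, the two variants of /tftpboot/C0A80002 (the hexadecimal form of 192.168.0.2) look roughly as follows; exact paths depend on the XBAS release.

    # installation case (Figure 3): boot the RHEL5.1 installer with a kickstart file
    DEFAULT ks
    LABEL ks
        KERNEL RHEL5.1/vmlinuz
        APPEND ksdevice=eth0 ip=dhcp ks=nfs:192.168.0.1:/release/ks/kickstart initrd=RHEL5.1/initrd.img

    # normal boot case (Figure 4): chain-load the OS installed on the first local disk
    DEFAULT local_primary
    LABEL local_primary
        KERNEL chain.c32
        APPEND hd0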








2.8  Windows HPC Server 2008

2.8.1  Description

Microsoft Windows HPC Server 2008 (HPCS), the successor to Windows Compute Cluster Server (WCCS) 2003, is based on the Windows Server 2008 operating system and is designed to increase productivity, scalability and manageability. This new name reflects Microsoft HPC's readiness to tackle the most challenging HPC workloads [5]. HPCS includes key features such as new high-speed networking, highly efficient and scalable cluster management tools, advanced failover capabilities, a service oriented architecture (SOA) job scheduler, and support for partners' clustered file systems. HPCS gives access to an HPC platform that is easy to deploy, operate, and integrate with existing enterprise infrastructures.

2.8.2  Cluster installation mechanisms

The installation of a Windows HPC cluster starts with the setup of the head node (HN). For the deployment of a compute node (CN), HPCS uses the Windows Deployment Services (WDS), which fully installs and configures HPCS and adds the new node to the set of Windows HPC compute nodes. WDS is a deployment tool provided by Microsoft; it is the successor of Remote Installation Services (RIS), it handles the whole compute node installation process and it acts as a TFTP server.

During the first installation step, the Windows Preinstallation Environment (WinPE) is the boot operating system. It is a lightweight version of Windows Server 2008 that is used for the deployment of servers. It is intended as a 32-bit or 64-bit replacement for MS-DOS during the installation phase of Windows, and can be booted via PXE, CD-ROM, USB flash drive or hard disk.

BIOS settings should be set so that HPCS compute nodes boot on the network with PXE (we assume that a private network exists and that CNs send their PXE requests there first). From the head node point of view, a compute node must be deployed if it doesn't have any entry in the Active Directory (AD), or if the cluster administrator has explicitly specified that it must be re-imaged. When a compute node with no OS boots, it first sends a DHCP request in order to get an IP address, a valid network boot server and the name of a network boot program (NBP). When the DHCP server has answered, the CN downloads the NBP called WdsNbp.com from the WDS server. Its purpose is to detect the architecture and to wait for further downloads from the WDS server.

Then, on the HPCS administration console of the head node, the new compute node appears as "pending approval". The installation starts once the administrator assigns a deployment template to it. A WinPE image is sent to and booted on the compute node; files are transferred in order to prepare the Windows Server 2008 installation, and an unattended installation of Windows Server 2008 is played. Finally, the compute node is joined to the domain and the cluster.

Figure 5 shows the details of the PXE boot operations executed during the installation procedure.

If the CN has already been installed, the AD already contains the corresponding computer object, so the WDS server sends it an NBP called abortpxe.com, which boots the server by using the next boot item in the BIOS without waiting for a timeout. Figure 6 shows the PXE boot operations executed in this case.








Figure 5 - HPCS compute node PXE installation scheme: the node boots on the network, obtains its IP address and the NBP Boot/x64/WdsNbp.com from the DHCP/WDS server, and waits for head node approval while an AD account is created; once the administrator approves the CN and assigns a deployment template, the node boots a WinPE kernel (pxeboot.com, BOOT.WIM), partitions the disk (diskpart.txt), installs Windows Server 2008 unattended (unattend.xml, WIM images transferred over CIFS), runs the Microsoft HPC Pack 2008 setup, and finally joins the domain and the cluster.

Figure 6 - HPCS compute node PXE boot scheme: the node boots on the network and downloads WdsNbp.com; since its AD account already exists, the WDS server answers with abortpxe.com and the node boots Windows Server 2008 from its local disk.








2.9  PBS Professional

This section presents PBS Professional, the job scheduler that we used as a meta-scheduler for building the HOSC prototype described in Chapter 5.

PBS Professional is part of the PBS GridWorks software suite. It is the professional version of the Portable Batch System (PBS), a flexible workload management system originally developed to manage aerospace computing resources at NASA. PBS Professional has since become the leader in supercomputer workload management and the de facto standard on Linux clusters.

A few of the more important features of PBS Professional 10 are listed below:

- Enterprise-wide Resource Sharing provides transparent job scheduling on any PBS system by any authorized user. Jobs can be submitted from any client system, both local and remote.
- Multiple User Interfaces provides a traditional command line and a graphical user interface for submitting batch and interactive jobs; querying job, queue, and system status; and monitoring jobs.
- Job Accounting offers detailed logs of system activities for charge-back or usage analysis per user, per group, per project, and per compute host.
- Parallel Job Support works with parallel programming libraries such as MPI. Applications can be scheduled to run within a single multi-processor computer or across multiple systems.
- Job Interdependency enables the user to define a wide range of interdependencies between jobs.
- Computational Grid Support provides an enabling technology for metacomputing and computational grids.
- Comprehensive API includes a complete Application Programming Interface (API).
- Automatic Load-Leveling provides numerous ways to distribute the workload across a cluster of machines, based on hardware configuration, resource availability, keyboard activity, and local scheduling policy.
- Common User Environment offers users a common view of the job submission, job querying, system status, and job tracking over all systems.
- Cross-System Scheduling ensures that jobs do not have to be targeted to a specific computer system. Users may submit their job and have it run on the first available system that meets their resource requirements.
- Job Priority allows users the ability to specify the priority of their jobs.
- Username Mapping provides support for mapping user account names on one system to the appropriate name on remote server systems. This allows PBS Professional to fully function in environments where users do not have a consistent username across all hosts.
- Broad Platform Availability is achieved through support of Windows and every major version of UNIX and Linux, from workstations and servers to supercomputers.
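For readers unfamiliar with PBS, here is a minimal illustrative sketch of its command-line interface; the resource request and script name are generic examples, not the meta-scheduler queues configured in Chapter 5. A batch job is submitted with qsub, jobs are monitored with qstat, and node states can be inspected with pbsnodes.

    qsub -l select=2:ncpus=4 -N mytest myjob.sh   # submit myjob.sh on 2 nodes with 4 CPUs each
    qstat -a                                      # list jobs and their states
    pbsnodes -a                                   # show the state of every execution node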








3  Approaches and recommendations

In this chapter, we explain the different approaches to offering several OS's on a cluster. The approaches discussed in Sections 3.1 and 3.2 are summarized in Table 1 on the next page.

3.1  A single operating system at a time

Let us examine the case where all nodes run the same OS. The OS of the cluster is selected at boot time. Switching from one OS to another can be done by:

- Re-installing the selected OS on the cluster if necessary. Since this process can be long, it is not realistic for frequent changes. This is noted as approach 1 in Table 1.
- Deploying a new OS image on the whole cluster depending on the OS choice. The deployment can be done on local disks or in memory with diskless compute nodes. It is difficult to deal with the OS change on the management node in such an environment: either the management node is dual-booted (this is approach 7 in Table 1), or an additional server is required to distribute the OS image of the MN. This can be interesting in some specific cases: on HPC clusters with diskless CNs when OS switches are rare, for example. Otherwise, this approach is not very convenient. The deployment technique can be used in a more appropriate manner for clusters with 2 simultaneous OS's (i.e., 2 MNs); this will be shown in the next section with approaches 3 and 11.
- Dual-booting the selected OS from dual-boot disks. Dual-booting the whole cluster (management and compute nodes) is a good and very practical solution that was introduced in [1] and [2]. This approach, noted 6 in Table 1, is the easiest way to install and manage a cluster with several OS's, but it only applies to small clusters with few users when no flexibility is required.

If only the MNs are on a dual-boot server while the CNs are installed with a single OS (half of the CNs having one OS while the others have the other), the solution makes no sense because only half of the cluster can be used at a time (this is approach 5). If the MNs are on a dual-boot server while the CNs are installed in VMs (2 VMs being installed on each compute server), the solution makes no real sense either, because the added value of using VMs (quick OS switching, for instance) is cancelled by the need to reboot the MN server (this is approach 8).

Whatever the OS switch method, a complete cluster reboot is needed at each change. This implies cluster unavailability during reboots, a need for OS usage schedules and potential conflicts between user needs, hence a real lack of flexibility.

In Table 1, approaches 1, 5, 6, 7 and 8 define clusters that can run 2 OS's but not simultaneously. Even if such clusters do not stick to the Hybrid Operating System Cluster (HOSC) definition given in Chapter 1, they can be considered as a simplified approach to its concept.













Table 1 crosses the way the 2 compute node (CN) OS's are provided (columns: 1 OS per server on 2 servers; dual-boot on 1 server; OS image deployment on 1 server; virtualization with 2 CNs running simultaneously on 1 server) with the way the 2 management nodes (MN) with 2 different OS's are provided (rows: 1 OS per server on 2 servers; dual-boot on 1 server; virtualization with 2 MNs running simultaneously on 1 server).

MN: 1 OS per server (2 servers)
  (1) CN: 1 OS per server - Starting point: 2 half-size independent clusters with 2 OS's, or 1 full-size single-OS cluster re-installed with a different OS when needed; the expensive solution without flexibility
  (2) CN: dual-boot - Good HOSC solution for large clusters with an OS flexibility requirement
  (3) CN: OS image deployment - An HOSC solution that can be interesting for large clusters with diskless CNs, or when the OS type of CNs is rarely switched
  (4) CN: virtualization - An HOSC solution with potential performance issues on compute nodes and extra cost for the additional management node

MN: dual-boot (1 server)
  (5) CN: 1 OS per server - This "single OS at a time" solution makes absolutely no sense since only half of the CNs can be used at a time
  (6) CN: dual-boot - Good classical dual-boot cluster solution
  (7) CN: OS image deployment - A "single OS at a time" solution that can only be interesting for diskless CNs
  (8) CN: virtualization - Having virtual CNs has no real sense since the MN must be rebooted to switch the OS

MN: virtualization (2 MNs simultaneously on 1 server)
  (9) CN: 1 OS per server - 2 half-size independent clusters with a single MN server: a bad HOSC solution with no flexibility and very little cost saving
  (10) CN: dual-boot - Good HOSC solution for medium-sized clusters with an OS flexibility requirement (without additional hardware cost)
  (11) CN: OS image deployment - An HOSC solution that can be interesting for small clusters with diskless CNs
  (12) CN: virtualization - Every node is virtual: the most flexible HOSC solution but with too many performance uncertainties at the moment

Table 1 - Possible approaches to HPC clusters with 2 operating systems








3.2  Two simultaneous operating systems

The idea is to provide, with a single cluster, the capability to have several OS's running simultaneously on an HPC cluster. This is what we defined as a Hybrid Operating System Cluster (HOSC) in Chapter 1. Each compute node (CN) does not need to run every OS simultaneously: a single OS can run on a given CN while another OS runs on other CNs at the same time. The CNs can be dual-boot servers, diskless servers, or virtual machines (VM). The cluster is managed from separate management nodes (MN) with different OS's. The MNs can be installed on several physical servers or on several VMs running on a single server.

In Table 1, approaches 2, 3, 4, 9, 10, 11 and 12 are HOSCs.

HPC users may consider HPC clusters with two simultaneous OS's rather than a single OS at a time for four main reasons:

1. To improve resource utilization and adapt the workload dynamically by easily changing the ratio of OS's (e.g., Windows vs. Linux compute nodes) in a cluster for different kinds of usage.
2. To be able to migrate smoothly from one OS to the other, giving time to port applications and train users.
3. Simply to be able to try a new OS without stopping the already installed one (i.e., install an HPCS cluster at low cost on an existing Bull Linux cluster, or install a Bull Linux cluster at low cost on an existing HPCS cluster).
4. To integrate specific OS environments (e.g., with legacy OS's and applications) in a global IT infrastructure.

The simplest approach for running 2 OS's on a cluster is to install each OS on half (or at least a part) of the cluster when it is built. This approach is equivalent to building 2 single-OS clusters! Therefore it cannot be classified as a cluster with 2 simultaneous OS's. Moreover, this solution is expensive with its 2 physical MN servers, and it is absolutely not flexible since the OS distribution (i.e., the OS allocation to nodes) is fixed in advance. This approach is similar to approach 1, already discussed in the previous section.

An alternative to this first approach is to use a single physical server with 2 virtual machines for installing the 2 MNs. In this case there is no additional hardware cost, but there is still no flexibility for the choice of the OS distribution on the CNs since this distribution is decided when the cluster is built. This approach is noted 9.

On clusters with dual-boot CNs, the OS distribution can be dynamically adapted to the user and application needs. The OS of a CN can be changed just by rebooting the CN, aided by a few simple dual-boot operations (this will be demonstrated in Sections 6.3 and 6.4). With such dual-boot CNs, the 2 MNs can be on a single server with 2 VMs: this approach, noted 10, is very flexible and requires no additional hardware cost. It is a good HOSC solution, especially for medium-sized clusters.









With dual-boot CNs, the 2 MNs can also be installed on 2 physical servers instead of 2 VMs: this approach, noted 2, can only be justified on large clusters because of the extra cost due to a second physical MN.

A new OS image can be (re-)deployed on a CN on request. This technique allows changing the OS distribution on the CNs of a cluster quite easily. However, this is mainly interesting for clusters with diskless CNs, because re-deploying an OS image for each OS switch is slower and consume