Distributed Operating Systems


CECS 526

Distributed Operating Systems

Architecture of a Distributed System

Motivations


A distributed system refers to a system that consists of several computers that do not share a memory or a clock and that communicate with each other by exchanging messages over a communication network. A distributed operating system runs on multiple independent computers but appears to its users as a single machine.



Motivations


Improved price/performance -- with the exception of certain special computation-intensive applications, equivalent computing power may be obtained with a network of workstations at a much lower cost than with a traditional time-sharing system.


Resource sharing -- requests for services may be satisfied using hardware/software resources on other computers on the communication network.


Enhanced performance -- concurrent execution of tasks and load distribution can lead to improved response time.


Improved reliability and availability -- fault tolerance can be achieved through the replication of data and services.


Modular expandability -- new hardware and software resources can be added without replacing the existing resources.



Issues in distributed operating systems


Global knowledge -- the global state of the system is hard to acquire due to the unavailability of a global memory and a global clock and the unpredictability of message delays.



Naming -- a directory of all the named objects in the system (services, files, users, printers, etc.) must be maintained to allow proper access. Both replicated directories and partitioned directories have their strengths and weaknesses.


Scalability -- any mechanisms or approaches adopted in a system must not result in badly degraded performance when the system grows.


Compatibility -- the interoperability among the resources in a system must be an integral part of the design of a distributed system. Three levels of compatibility exist: binary level, execution level & protocol level.


Process synchronization -- especially difficult in distributed systems due to the lack of shared memory and a global clock.


Issues in distributed operating systems


Resource management -- refers to schemes and methods devised to make local and remote resources available to users in an effective and transparent manner. Three general schemes exist: data migration, computation migration, & distributed scheduling.


Security -- two issues are relevant: authentication (verifying claims of identity) & authorization (deciding on and granting the proper amount of privileges).


Structuring -- defines how the various parts of the operating system are organized. Three general organizations exist:


Monolithic kernel -- a traditional method in which the kernel consists of all the services to be provided in the system, which may be wasteful in a distributed environment.


Collective kernel structure -- all services are implemented as independent processes. Those that are required to run on all computers are grouped into the microkernel (i.e., for task, processor & memory management), and the others are included on each computer on the network on an as-needed basis.


Object-oriented operating system (also called the process model) -- services are implemented as a collection of objects; each object encapsulates a data structure and defines a set of operations on that data structure.


Inherent Limitations of a Distributed System


The lack of common memory and a system-wide common clock is an inherent problem in distributed systems.


In the absence of global time, it becomes difficult to talk about the temporal order of events.


Without a shared memory, up-to-date information about the state of the system is not available to every process via a simple memory lookup. The state information must therefore be collected through communication.


The combination of unpredictable communication delays and the lack of global time in a distributed system makes it difficult to know how up-to-date collected state information really is. The example below illustrates this problem and underscores the need for a coherent global state.


Difficulty in State Information Collection

Happened Before Relation


In light of the need for ordering events, Lamport proposed a scheme using logical clocks. While it remains true in general that the order of two events occurring at two different computers cannot be determined based on the local times at which they occur, events that have certain causal dependencies can be ordered. Lamport’s logical clocks are designed to capture those dependencies, where two events are causally related by a “happened before” relation as defined below.


Happened Before Relation (→). The relation “→” is defined as follows:



a



b
, if
a

and
b

are events in the same process and
a

occurred
before
b
.


a → b, if a is the event of sending a message m in a process and b is the event of receiving the same message m by another process.


If a → b and b → c, then a → c, i.e., the “→” relation is transitive.

Happened Before Relation Example


The happened before relation captures the causal dependencies between events, i.e., whether two events are causally related or not. Event a causally affects event b if a → b. Two events a and b are said to be concurrent (denoted as a || b) if not (a → b or b → a). In other words, concurrent events do not causally affect each other.



For any two events a and b in a system, either a → b, b → a, or a || b.



Lamport’s Logical Clocks


A clock C_i is associated with each process P_i in the system; it can be thought of as a function that assigns a number C_i(a), called the timestamp of event a, to any event a at P_i. The happened before relation “→” can now be realized using the logical clocks if the following conditions are met:


[C1] For any two events a and b in a process P_i, if a occurs before b, then C_i(a) < C_i(b).


[C2] If a is the event of sending a message m in process P_i and b is the event of receiving the same message m at process P_j, then C_i(a) < C_j(b).


Lamport’s Logical Clocks


These two conditions are guaranteed with the following implementation rules:



[IR1] Clock C_i is incremented between any two successive events in process P_i as follows: C_i := C_i + d, where d > 0.


[IR2] If event a is the sending of message m in process P_i, then message m is assigned a timestamp t_m = C_i(a). On receiving the same message m, process P_j first sets C_j using [IR1], then sets it to a value greater than or equal to the new C_j and greater than t_m, i.e., C_j := max(C_j, t_m + d), where d > 0.


With these two implementation rules, two causally related events a and b such that a → b will have C(a) < C(b), and two successive events a and b in process P_i will yield C_i(b) = C_i(a) + d.
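These two rules translate directly into code. The following is a minimal sketch in Python; the class and method names and the choice d = 1 are illustrative (the rules only require d > 0):

```python
# A minimal sketch of Lamport's logical clock rules [IR1] and [IR2].

class LamportClock:
    def __init__(self, d=1):
        self.time = 0
        self.d = d              # increment; any d > 0 satisfies [IR1]

    def tick(self):
        # [IR1] increment the clock between any two successive events
        self.time += self.d
        return self.time

    def send(self):
        # [IR2] a send is itself an event: apply [IR1], then use the
        # resulting value as the message timestamp t_m
        return self.tick()

    def receive(self, t_m):
        # [IR2] apply [IR1] first, then advance past the message's
        # timestamp: C_j := max(C_j, t_m + d)
        self.tick()
        self.time = max(self.time, t_m + self.d)
        return self.time

# A send at P1 and the matching receive at P2 satisfy C1(a) < C2(b).
p1, p2 = LamportClock(), LamportClock()
t_m = p1.send()                   # event a at P1: C1(a) = 1
assert p2.receive(t_m) > t_m      # event b at P2: C2(b) = 2
```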


Example:



Limitation of Lamport’s Logical Clocks


Lamport’s system of logical clocks implements an approximation to global/physical time that is referred to as virtual time. The virtual time advances along with the progression of events and is therefore discrete. The virtual time is defined based on an irreflexive partial order “→”, and can be used to totally order events in a distributed system (hence producing a total order relation “⇒”) as follows:


If a is any event at process P_i and b is any event at process P_j, then a ⇒ b if and only if either

C_i(a) < C_j(b), or

C_i(a) = C_j(b) and P_i ≺ P_j,

where ≺ is any arbitrary relation that totally orders the processes to break ties (e.g., process id i < j implies P_i ≺ P_j).



Note that a ⇒ b does not necessarily imply a → b, and this is known to be a major limitation of Lamport’s logical clocks: if a → b then C(a) < C(b), but the converse is not necessarily true.



Illustration of this limitation:
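A concrete instance of the limitation (an illustrative example using the LamportClock sketch above, not the slides’ figure): let P1 execute a single event a while P2 independently executes two local events, the second being b. Then C1(a) = 1 < C2(b) = 2 even though the processes never communicated, i.e., a || b.

```python
# C(a) < C(b) does not imply a -> b: here a || b, yet C1(a) < C2(b).
p1, p2 = LamportClock(), LamportClock()
a = p1.tick()        # C1(a) = 1
p2.tick()
b = p2.tick()        # C2(b) = 2; no messages were exchanged, so a || b
assert a < b
```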


Vector Clocks


Each process P_i in a distributed system with n processes is equipped with a vector clock C_i. The clock C_i consists of an integer vector of length n, and can be viewed as a function that assigns a vector C_i(a) to any event a at P_i as the event’s timestamp. C_i[i], the i-th entry of C_i, corresponds to P_i’s own logical time. C_i[j], j ≠ i, indicates the time of occurrence of the last event at P_j that “happened before” the current point in time at P_i. It therefore represents P_i’s best guess of the logical time at P_j, and must satisfy the assertion C_i[j] ≤ C_j[j].


The vector clocks can be implemented with the following implementation rules:


[IR1] Clock C_i is incremented between any two successive events in process P_i as follows: C_i[i] := C_i[i] + d, where d > 0.


[IR2] If event a is the sending of message m in process P_i, then message m is assigned a timestamp t_m = C_i(a). On receiving the same message m by process P_j, C_j[j] is first incremented as in [IR1], then C_j is updated as follows: ∀k, C_j[k] := max(C_j[k], t_m[k]).
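A minimal Python sketch of these two rules for a process P_i in an n-process system follows; the names and the choice d = 1 are illustrative:

```python
# A minimal sketch of vector clock rules [IR1] and [IR2].

class VectorClock:
    def __init__(self, i, n, d=1):
        self.i = i              # index of this process, P_i
        self.d = d
        self.v = [0] * n        # C_i, initially all zeros

    def tick(self):
        # [IR1] increment own entry between any two successive events
        self.v[self.i] += self.d

    def send(self):
        # [IR2] a send is an event; the message carries t_m = C_i(a)
        self.tick()
        return list(self.v)

    def receive(self, t_m):
        # [IR2] increment own entry, then take the componentwise max:
        # for all k, C_j[k] := max(C_j[k], t_m[k])
        self.tick()
        self.v = [max(c, t) for c, t in zip(self.v, t_m)]
```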


Vector Clocks Example


Example of how vector clocks advance as events occur:


With vector clocks, a → b iff t_a < t_b, where t_a and t_b denote the vector timestamps of events a and b, respectively. In other words, vector clocks allow us to order events in a distributed system and decide whether two events are causally related based simply on the timestamps of the events. The next section shows an application of vector clocks in the causal ordering of messages.
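Here t_a < t_b is the usual strict vector order: every entry of t_a is less than or equal to the corresponding entry of t_b, and at least one is strictly less. A small sketch of the resulting tests (the function names are mine):

```python
# Causality tests on vector timestamps (lists of equal length).

def vec_leq(ta, tb):
    return all(x <= y for x, y in zip(ta, tb))

def happened_before(ta, tb):     # a -> b  iff  ta < tb
    return vec_leq(ta, tb) and ta != tb

def concurrent(ta, tb):          # a || b
    return not happened_before(ta, tb) and not happened_before(tb, ta)
```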








Causal Ordering of Messages


Causal ordering of messages refers to the preservation of the causal relationship that holds among “message send” events in the corresponding “message receive” events, i.e., Send(M1) → Send(M2) implies Receive(M1) → Receive(M2), where Send(M) and Receive(M) represent the events of sending and receiving message M, respectively.


Causal ordering of messages is important in some applications, e.g., replicated database systems, where every process in charge of updating a replica must receive the updates in the same order to maintain the consistency of the database.


Causal ordering of messages is not automatically guaranteed in distributed systems, and hence must be implemented where necessary.



Example of violation of causal ordering of messages:
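A typical scenario of the kind depicted here (a reconstruction, not the slides’ figure): P1 broadcasts M1 to P2 and P3; P2, after receiving M1, sends M2 to P3, so Send(M1) → Send(M2). If M1 is delayed in the network, P3 may receive M2 before M1, violating causal ordering.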









Birman-Schiper-Stephenson Protocol


This protocol for causal ordering of messages assumes that all processes communicate through broadcast messages.


The vector clocks in this protocol are maintained in such a way that entry i of P_i’s vector clock at any given moment indicates the number of messages that P_i has sent so far. In the following algorithm, we assume that the distributed system consists of n processes.


Before broadcasting a message m, a process P_i increments the vector time VT_Pi[i] and timestamps m.


A process P_j (j ≠ i), upon receiving message m timestamped VT_m from P_i, delays its delivery until both of the following conditions are satisfied. Delayed messages are queued at each process in increasing order of the messages’ vector time. Concurrent messages are ordered by the time of their arrival.

VT_Pj[i] = VT_m[i] - 1 (This ensures that all messages sent from P_i prior to m have been received.)

VT_Pj[k] ≥ VT_m[k] for all k ≠ i (This ensures that all messages sent from any process prior to P_i’s sending of m have been received.)


When a message is delivered at a process P_j, VT_Pj is updated according to the vector clock rule [IR2].
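A minimal sketch of the delivery test and the delivery-time update; the function names and the plain-list representation of vector clocks are mine:

```python
# Delivery test at P_j for a broadcast m from P_i timestamped vt_m.

def can_deliver(vt_pj, vt_m, i):
    # Condition 1: m is the very next broadcast P_j expects from P_i
    if vt_m[i] != vt_pj[i] + 1:
        return False
    # Condition 2: P_j has already received every message that any
    # other process had sent before P_i sent m
    return all(vt_m[k] <= vt_pj[k] for k in range(len(vt_m)) if k != i)

def on_deliver(vt_pj, vt_m):
    # Merge timestamps componentwise (the max step of [IR2]; entries
    # here count broadcasts sent, so P_j's own entry is not bumped)
    return [max(a, b) for a, b in zip(vt_pj, vt_m)]
```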

Schiper-Eggli-Sandoz Protocol


In this protocol for causal ordering of messages, messages are sent point-to-point.


Each process P maintains a vector VP of size (n - 1), where n is the number of processes in the system. An element of VP is an ordered pair (P', t), where P' is the id of the destination process of a message and t is a vector timestamp. Processes in the system maintain vector clocks, with t_Pi referring to the current logical time at process P_i, and t_M to the logical time at the sending of message M. The communication channels need not be FIFO channels. Each event triggers a vector clock update as described in the vector clock implementation rules [IR1] and [IR2].


1. Process P_i sends a message M to P_j as follows:


Send M with timestamp t_M, along with a vector VM = VP_i, to P_j.


Insert the ordered pair (P_j, t_M) into VP_i. If VP_i already contains a pair (P_j, t), it simply gets overwritten by the new pair (P_j, t_M).


Schiper-Eggli-Sandoz Protocol

2. A message M (carrying t_M and VM) arriving at process P_j will be handled as follows:

If a pair (P_j, t) exists in VM and t is not < t_Pj,

then M must be buffered for later delivery

else M can be delivered, followed by the following actions:

a. Merge VM with VP_j in the following manner:

If VM contains an entry (P, t) for some process P but VP_j contains no entry for P, simply insert that entry into VP_j.


For every process P ≠ P_j for which entries are found in both VM and VP_j, say (P, t) ∈ VM and (P, t') ∈ VP_j, replace the entry (P, t') in VP_j by (P, t_sup), where t_sup is the vector time such that t_sup[k] = max(t[k], t'[k]) for all k.

b. Update P_j’s vector clock.

c. Check for buffered messages that can now be delivered as a result of changes to the local clock.

3. A pair can be deleted from the vector maintained at a site after ensuring that the pair has become obsolete.
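A minimal sketch of the step-2 delivery test and the step-2a merge, representing VM and VP_j as dicts from destination process id to vector timestamp and reading the vector comparison as componentwise ≤; both are modeling choices of mine:

```python
# S-E-S delivery test and merge at process P_j.

def deliverable(VM, j, t_Pj):
    # M may be delivered unless VM carries a pair (P_j, t) whose
    # timestamp t is not yet dominated by P_j's current clock t_Pj
    t = VM.get(j)
    return t is None or all(x <= y for x, y in zip(t, t_Pj))

def merge(VM, VPj, j):
    # Step 2a: unknown entries are inserted; common entries (P, t) and
    # (P, t') are replaced by (P, t_sup), t_sup[k] = max(t[k], t'[k])
    for p, t in VM.items():
        if p == j:
            continue
        if p not in VPj:
            VPj[p] = list(t)
        else:
            VPj[p] = [max(a, b) for a, b in zip(VPj[p], t)]
    return VPj
```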


Global State


Due to the lack of shared memory and a global clock, and because communication delays are unpredictable, one cannot count on simultaneous observations of states in all the individual computers in an asynchronous distributed system. Consequently, one cannot determine a global state in such a system in a conventional manner.


The following 2-site system illustrates the difficulty. Site S1 holds account A (initially $500) and site S2 holds account B (initially $200); C1 is the communication channel from S1 to S2, and C2 the channel from S2 to S1.






Suppose a transfer of $50 from A to B is taking place when a request for the global state is issued. Many global views are possible, including the following four, of which states 2 and 3 are clearly incorrect:


1. { S1: A=500, S2: B=200, C1: empty, C2: empty }

2. { S1: A=500, S2: B=200, C1: 50, C2: empty }

3. { S1: A=500, S2: B=250, C1: empty, C2: empty }

4. { S1: A=450, S2: B=200, C1: 50, C2: empty }


Global State and Cut


A global state of a distributed system is a collection of local states in all the sites in
the system.


From the illustration on the previous slide, we can deduce that the correctness of
a recorded global state depends on the timing at which the local states are
recorded.


The timing of recordings of the local states that make up a global state can be
graphically represented by a cut.


Specifically, a cut of a distributed computation is a set C = {c_1, c_2, …, c_n}, where c_i is the cut event at site S_i in the history of the distributed computation. Graphically, a cut is a zig-zag line that connects the corresponding cut events in the time-space diagram.


If each cut event c_i at site S_i records S_i’s local state at that instant, then clearly a cut denotes a global state of the system.


A cut C = {c_1, c_2, …, c_n} is a consistent cut iff for every pair of sites S_i and S_j there are no events e_i at S_i and e_j at S_j such that (e_i → e_j) ∧ (e_j → c_j) ∧ ¬(e_i → c_i), where c_i, c_j ∈ C. That is, a cut is a consistent cut if every message that was received before a cut event was sent before the cut event at the sender site in the cut.


However, messages sent but not yet received may still exist in a consistent cut, and
are considered in transit.


With the use of FIFO channels and a marker, the Chandy-Lamport global state recording algorithm given below stipulates the timing of the recordings of the local states, captures the messages that have been sent but not received in the cut, and attributes them as messages in transit.


Global State and Cut


Global states in a distributed computation:





A cut:

Chandy-Lamport’s Global State Recording Algorithm

Marker Sending Rule for a process P


P records its state.


For each outgoing channel C on which a marker has not been already sent,
P sends a marker along C before P sends further messages along C.


Marker Receiving Rule for a process Q


On receipt of a marker along a channel C,


if Q has not recorded its state


then



Record the state of C as an empty sequence.



Follow the “Marker Sending Rule.”


else


Record the state of C as the sequence of messages received along C
after the state of Q was recorded and before Q received the marker along
C.
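A minimal event-driven sketch of the two marker rules; the channel objects with a send() method and the snapshot hook are modeling choices of mine, not part of the algorithm’s statement:

```python
# Sketch of the Chandy-Lamport marker rules at one process.

MARKER = "MARKER"

class SnapshotProcess:
    def __init__(self, in_channels, out_channels, snapshot_fn):
        self.in_channels = in_channels     # ids of incoming channels
        self.out_channels = out_channels   # outgoing channel objects
        self.snapshot_fn = snapshot_fn     # returns this process's state
        self.state = None                  # recorded local state
        self.chan_state = {}               # channel id -> recorded msgs
        self.recording = set()             # channels still being recorded

    def record(self):
        # Marker Sending Rule: record own state, then send a marker on
        # every outgoing channel before any further messages on it
        self.state = self.snapshot_fn()
        for c in self.out_channels:
            c.send(MARKER)
        self.recording = set(self.in_channels)

    def on_marker(self, c):
        # Marker Receiving Rule, marker arriving along channel c
        if self.state is None:
            self.record()
            self.chan_state[c] = []        # state of c := empty sequence
        else:
            self.chan_state.setdefault(c, [])
        self.recording.discard(c)          # recording on c is complete

    def on_message(self, c, msg):
        # Messages received on c after recording began and before c's
        # marker make up the recorded state of channel c
        if self.state is not None and c in self.recording:
            self.chan_state.setdefault(c, []).append(msg)
```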


Huang’s Termination Detection Algorithm

{Let B(w) denote a computation message with weight w, and C(w) a control message sent to the controlling agent with weight w.}


Rule 1.

The controlling agent or an active process having weight W may send a
computation message to a process P by doing the following:


Split W into W1 and W2 such that W1 > 0, W2 > 0, and W = W1 + W2

Send B(W2) to P

Set W to W1



Rule 2.

On receipt of B(w), a process P having weight W does:


W := W + w


If P is idle, then P becomes active



Rule 3.

An active process having weight W may become idle at any time by doing:


Send C(W) to the controlling agent


W := 0


The process becomes idle


Rule 4.

On receiving C(w), the controlling agent having weight W takes the following
actions:


W := W + w


If W = 1, conclude that the computation has terminated
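A minimal sketch of the four rules; halving the weight in Rule 1 is an illustrative split (any W1, W2 > 0 with W1 + W2 = W is allowed), and it keeps the floating-point arithmetic exact:

```python
# Sketch of Huang's weight-throwing termination detection.

class ControllingAgent:
    def __init__(self):
        self.W = 1.0

    def send_weight(self):           # Rule 1 (agent side): split W
        w2 = self.W / 2
        self.W -= w2
        return w2                    # carried by computation message B(w2)

    def on_control(self, w):         # Rule 4: receive C(w)
        self.W += w
        return self.W == 1.0         # True => computation has terminated

class Process:
    def __init__(self):
        self.W, self.idle = 0.0, True

    def send_weight(self):           # Rule 1 (an active process)
        w2 = self.W / 2
        self.W -= w2
        return w2

    def on_computation(self, w):     # Rule 2: receive B(w)
        self.W += w
        self.idle = False            # an idle process becomes active

    def go_idle(self, agent):        # Rule 3: return all weight via C(W)
        agent.on_control(self.W)
        self.W, self.idle = 0.0, True

# The agent starts P with B(0.5); when P goes idle it returns C(0.5),
# restoring the agent's weight to 1.0 and signalling termination.
agent, p = ControllingAgent(), Process()
p.on_computation(agent.send_weight())
p.go_idle(agent)
```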