ZooKeeper

Dec 13, 2013

A highly available, scalable, distributed configuration, consensus, group membership, leader election, naming, and coordination service


Flavio Junqueira, Mahadev Konar, Andrew Kornev, Benjamin Reed

https://me.alipay.com/jlxue



Today's architecture for source code revision control...

[Figure labels: Clients, Primary, Backup, Asynchronous Replication]

A cloud-based architecture...

[Figure labels: EC2 (x2), S3 (x3)]

Two simultaneous commits…

[Figure labels: EC2 (x2), S3 (x3), Rev. 31337 (x3)]

Followed by an update…

Leads to data loss!

Coordination

[Figure labels: EC2 (x2), S3 (x3), Commit Process, ZooKeeper]

HBase architecture



Storage Stamp Architecture

[Figure labels: Incoming Write Request, Ack, Front End Layer (FE), Partition Layer, Partition Master, Partition Server, Lock Service, Stream Layer, Extent Nodes (EN), M, Paxos]

Data Model

Path = /plants/legumes/pea

Data = “this is pea”

The data is not general-purpose application data, but abstract coordination information.


Data Model

1) Hierarchical data model, much like a standard distributed file system
2) Nodes are known as znodes, and are identified by a path
3) A znode can have data associated with it, as well as children. Data is in the KB range
4) znodes are versioned
5) Data is read and written in its entirety
6) znodes can be ephemeral: an ephemeral node exists only as long as the session that created it is active
7) Watches can be set on znodes
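The properties listed above can be illustrated with a small in-memory model. This is only a sketch: `ZNodeTree` and its methods are hypothetical names invented for illustration, not the ZooKeeper client API, and the model covers only paths, whole-value reads and writes, and version counters.

```python
class ZNodeTree:
    """Toy in-memory model of a znode hierarchy (hypothetical, not the ZooKeeper API)."""

    def __init__(self):
        # Map each absolute path to a (data, version) pair; the root always exists.
        self.nodes = {"/": (b"", 0)}

    def create(self, path, data):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("parent does not exist: " + parent)
        if path in self.nodes:
            raise KeyError("node already exists: " + path)
        self.nodes[path] = (data, 0)

    def get_data(self, path):
        # Data is always returned in its entirety, together with its version.
        return self.nodes[path]

    def set_data(self, path, data):
        # Data is replaced in its entirety and the version is incremented.
        _, version = self.nodes[path]
        self.nodes[path] = (data, version + 1)

    def get_children(self, path):
        # Direct children only: names with no further "/" below the prefix.
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):] for p in self.nodes
            if p.startswith(prefix) and p != "/" and "/" not in p[len(prefix):]
        )


tree = ZNodeTree()
tree.create("/plants", b"")
tree.create("/plants/legumes", b"")
tree.create("/plants/legumes/pea", b"this is pea")
tree.set_data("/plants/legumes/pea", b"this is still pea")
```

A flat dictionary keyed by full path is enough here because a znode is identified by its path alone; a real server also tracks ACLs, timestamps, and child versions, which this sketch leaves out.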

Data Flags

1) Ephemeral: the znode is deleted when the session that created it times out or when it is explicitly deleted
2) Sequence: a monotonically increasing counter, relative to the parent, is appended to the path name
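The two flags can be sketched with a toy model. `FlagDemo`, its methods, and the `/locks/lock-` path are hypothetical names chosen for illustration; the 10-digit zero padding mirrors how ZooKeeper formats the sequence counter, and sessions are represented by plain integers.

```python
class FlagDemo:
    """Toy model of the EPHEMERAL and SEQUENCE flags (hypothetical names)."""

    def __init__(self):
        self.nodes = {}      # path -> id of the owning session, or None if persistent
        self.counters = {}   # parent path -> next sequence number

    def create(self, path, session=None, ephemeral=False, sequence=False):
        if sequence:
            parent = path.rsplit("/", 1)[0] or "/"
            n = self.counters.get(parent, 0)
            self.counters[parent] = n + 1
            # The counter is per parent and is appended to the requested name.
            path = "%s%010d" % (path, n)
        self.nodes[path] = session if ephemeral else None
        return path

    def expire_session(self, session):
        # Every ephemeral znode created by the session disappears with it.
        for path in [p for p, owner in self.nodes.items() if owner == session]:
            del self.nodes[path]


demo = FlagDemo()
a = demo.create("/locks/lock-", session=1, ephemeral=True, sequence=True)
b = demo.create("/locks/lock-", session=2, ephemeral=True, sequence=True)
demo.expire_session(1)   # removes a, leaves b in place
```

Combining both flags, as in the usage lines, is a common basis for locks and queues: the counter orders the contenders and the ephemeral flag cleans up after a crashed client.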

Watches

1) Tell me when something changes, e.g. configuration data
2) One-time trigger: a watch has to be reset by the client if it is interested in future notifications
3) Not a full-fledged notification system. It's like clients asking for recommendations: the client should verify the state after receiving the watch event
4) Ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event
5) Default watcher: notified of state changes in the client (connection loss, session expiry, …)
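The one-time-trigger behaviour is easy to get wrong, so here is a minimal sketch of it. `WatchedNode` is a hypothetical in-memory stand-in for a single znode, and the watcher is a plain Python callable rather than a real ZooKeeper watcher object.

```python
class WatchedNode:
    """Toy model of a single znode with one-time watches (hypothetical names)."""

    def __init__(self, data):
        self.data = data
        self.watchers = []

    def get_data(self, watcher=None):
        # Reading with a watcher registers it for the next change only.
        if watcher is not None:
            self.watchers.append(watcher)
        return self.data

    def set_data(self, data):
        self.data = data
        # Fire each watch once, then forget it: watches are one-time triggers.
        fired, self.watchers = self.watchers, []
        for watcher in fired:
            watcher()


events = []
node = WatchedNode(b"v1")
node.get_data(watcher=lambda: events.append("changed"))
node.set_data(b"v2")   # the watch fires
node.set_data(b"v3")   # no watch is registered any more, so nothing fires
```

The event carries no data in this sketch, which matches the advice in item 3: after a notification the client reads the node again, and passes a watcher again if it wants to hear about later changes.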

ZooKeeper Session

1) The ZK client establishes a connection to the ZK service using a language binding (Java, C, Perl, Python, REST)
2) A list of servers is provided; the client retries the connection until it is (re)established
3) When a client gets a handle to the ZK service, ZK creates a ZK session, represented as a 64-bit number
4) If the client reconnects to a different server within the session timeout, the session remains the same
5) The session is kept alive by periodic PING requests from the client library
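Items 4 and 5 can be sketched as a liveness check against a timeout. Everything here is an assumption made for illustration: the `Session` class, the use of plain numbers as timestamps, and the rule that a ping arriving after the timeout is rejected.

```python
class Session:
    """Toy model of session liveness (hypothetical; times are plain numbers)."""

    def __init__(self, session_id, timeout, now):
        self.session_id = session_id   # stands in for the 64-bit session number
        self.timeout = timeout
        self.last_heard = now

    def expired(self, now):
        return now - self.last_heard > self.timeout

    def ping(self, now):
        # Any server may receive the ping; the session id does not change.
        if self.expired(now):
            raise RuntimeError("session expired")
        self.last_heard = now


s = Session(0x1234, timeout=10, now=0)
s.ping(now=4)
s.ping(now=12)                   # 8 time units after the last ping: still alive
alive_at_20 = not s.expired(20)
expired_at_30 = s.expired(30)    # 18 time units of silence exceeds the timeout
```

Because the session is identified by its number rather than by a connection, a client that reconnects to another server before the timeout simply keeps pinging the same session.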

ZooKeeper API

1) String create(path, data, acl, flags)
2) void delete(path, expectedVersion)
3) Stat setData(path, data, expectedVersion)
4) byte[] getData(path, watch)
5) Stat exists(path, watch)
6) String[] getChildren(path, watch)
7) void sync(path)
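The `expectedVersion` argument of `setData` and `delete` makes conditional updates possible, which is what prevents the lost update described in the revision-control slides. The sketch below models that check in memory; `VersionedNode` and `BadVersion` are hypothetical names, and the starting value `rev 31336` is made up for the example.

```python
class BadVersion(Exception):
    """Raised when the caller's expected version is stale (hypothetical name)."""


class VersionedNode:
    """Toy model of a versioned znode (hypothetical, not the ZooKeeper API)."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def get_data(self):
        return self.data, self.version

    def set_data(self, data, expected_version):
        # The write is applied only if nobody else has written in between.
        if expected_version != self.version:
            raise BadVersion("expected %d, found %d" % (expected_version, self.version))
        self.data = data
        self.version += 1
        return self.version


head = VersionedNode(b"rev 31336")
_, seen_by_a = head.get_data()
_, seen_by_b = head.get_data()
head.set_data(b"rev 31337 from A", seen_by_a)       # A commits first
try:
    head.set_data(b"rev 31337 from B", seen_by_b)   # B's stale version is rejected
    b_committed = True
except BadVersion:
    b_committed = False
```

Client B has to read the node again and retry, so two simultaneous commits can no longer both claim the same revision.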

ZooKeeper Service

1) Client-server model, multiple servers
2) One leader, multiple followers
   - All servers store a copy of the data (in memory)
   - All copies of the data are the same
   - A leader is elected at startup
   - Followers service clients; all write requests are forwarded to the leader
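The request routing described above can be sketched as follows. This is a deliberately simplified model with hypothetical names (`Server`, `build_ensemble`, the `zk1` to `zk3` host names): replication is instantaneous, the first server is simply declared leader, and nothing here reflects how the real broadcast protocol orders or acknowledges writes.

```python
class Server:
    """Toy ensemble member (hypothetical); replication here is instantaneous."""

    def __init__(self, name):
        self.name = name
        self.data = {}        # each server holds a full in-memory copy
        self.ensemble = []    # all servers, filled in by build_ensemble
        self.leader = None

    def read(self, path):
        # Reads are answered from the local copy.
        return self.data.get(path)

    def write(self, path, value):
        if self.leader is not self:
            # Followers forward every write to the leader.
            return self.leader.write(path, value)
        for server in self.ensemble:
            server.data[path] = value
        return self.name  # name of the server that ordered the write


def build_ensemble(names):
    servers = [Server(n) for n in names]
    for s in servers:
        s.ensemble = servers
        s.leader = servers[0]  # stand-in for the real election at startup
    return servers


leader, f1, f2 = build_ensemble(["zk1", "zk2", "zk3"])
ordered_by = f2.write("/config", b"v1")   # sent to a follower, ordered by zk1
```

Routing every write through one server gives all updates a single order, while reads scale with the number of servers because each one answers from its own copy.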


Guarantees

1) Sequential Consistency: updates from a client will be applied in the order that they were sent
2) Atomicity: updates either succeed or fail. No partial results
3) Single System Image: a single client will see the same view of the service regardless of the server that it connects to
4) Reliability: once an update has been applied, it will persist from that time forward until a client overwrites the update
5) Timeliness: the client's view of the system is guaranteed to be up-to-date within a certain time bound

Use Cases

1) Use cases inside Yahoo!
   - Leader Election
   - Group Membership
   - Work Queues
   - Configuration Management
   - Cluster Management
   - Load Balancing
   - Sharding
2) Use cases in HBase
   - Leader Election
   - Configuration Management: store bootstrap information
   - Group membership: discover tablet servers and finalize tablet server death
   - To be done: store schema information and ACLs



Example - Leader Election

[Figure labels: /, services, myservice, leader; the leader znode contains the leader name/details]

Leader election algorithm, for when exactly one of N service providers has to be available:

1. getData("/services/myservice/leader", true)
2. If successful, follow the leader described in the data and exit
3. create("/services/myservice/leader", hostname, EPHEMERAL)
4. If successful, lead and exit
5. Go to step 1

Note: If you want M processes out of a set of N processes to be active, the algorithm can be modified to do so
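The five steps can be traced with a small single-process simulation. `LeaderNode` is a hypothetical stand-in for the znode `/services/myservice/leader`, the host names are made up, and `KeyError` plays the role of both the "no node" and "node exists" failures; a real client would use the ZooKeeper API and the watch set in step 1.

```python
class LeaderNode:
    """Toy stand-in for the znode /services/myservice/leader (hypothetical)."""

    def __init__(self):
        self.owner = None  # hostname stored in the ephemeral znode, if it exists

    def get_data(self):
        if self.owner is None:
            raise KeyError("no node")
        return self.owner

    def create_ephemeral(self, hostname):
        if self.owner is not None:
            raise KeyError("node exists")
        self.owner = hostname

    def session_expired(self, hostname):
        # The ephemeral znode vanishes when its creator's session ends.
        if self.owner == hostname:
            self.owner = None


def elect(node, hostname):
    """Run steps 1 to 5; return ("follow", leader) or ("lead", hostname)."""
    while True:
        try:
            return "follow", node.get_data()      # steps 1 and 2
        except KeyError:
            pass
        try:
            node.create_ephemeral(hostname)       # step 3
            return "lead", hostname               # step 4
        except KeyError:
            continue                              # step 5: go back to step 1


node = LeaderNode()
first = elect(node, "host-a")     # nobody leads yet, so host-a takes over
second = elect(node, "host-b")    # host-b finds host-a and follows it
node.session_expired("host-a")    # the leader's session ends
third = elect(node, "host-b")     # host-b reruns the steps and now leads
```

In a real deployment the followers would not poll: the watch requested in step 1 notifies them when the leader znode goes away, and that notification is their cue to run the steps again.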

Example - Group Management

Example - Configuration Management


ZooKeeper Components

1) Request Processor
   - Forms the write transaction (zxid, epoch)
   - The zxid is a 64-bit number: a 32-bit epoch and a 32-bit counter
2) Atomic Broadcast (Zab)
3) Replicated Database
   - In memory, after the write-ahead log
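The zxid layout from item 1 can be written out as bit arithmetic. The helper names `make_zxid` and `split_zxid` are hypothetical, and the sketch assumes the epoch occupies the high 32 bits and the counter the low 32 bits.

```python
def make_zxid(epoch, counter):
    """Pack a 32-bit epoch (high bits) and a 32-bit counter (low bits)."""
    assert 0 <= epoch < 2**32 and 0 <= counter < 2**32
    return (epoch << 32) | counter


def split_zxid(zxid):
    """Return the (epoch, counter) pair encoded in a 64-bit zxid."""
    return zxid >> 32, zxid & 0xFFFFFFFF


old_epoch_last = make_zxid(0, 2**32 - 1)   # last possible zxid of epoch 0
new_epoch_first = make_zxid(1, 0)          # first zxid of epoch 1
```

With the epoch in the high bits, comparing two zxids as plain integers orders every transaction of a newer epoch after every transaction of an older one, whatever the counter values are.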