Enterprise Job Scheduling for Clustered Environments


Stratos Paulakis, Vassileios Tsetsos, and Stathes Hadjiefthymiades


Pervasive Computing Research Group

Communication Networks Laboratory

Department of Informatics and Telecommunications

University of Athens


Greece




Santorini ‘07 @ Greece

Outline


Introduction


System Design


System Implementation


Performance Evaluation & Comparison with Quartz Scheduler


Conclusions



Introduction


System-level scheduling vs. application-level scheduling


PoLoS platform for LBS


IST FP5 Project


Scheduler was a core architectural component (time-triggered SMS/WAP services)


Clustering: modern solution for scalability and fault tolerance in enterprise systems


Objective


Design and implementation of a cluster-aware version of the original Scheduler





Functional Requirements


Time accuracy and low delay


Jobs should commence execution as close as possible to their registered time


Delay tolerance depends on the application


Efficiency and scalability


High throughput is mandatory in large-scale applications


Robustness through job persistence


System crashes should not result in data loss


Persistence imposes a performance overhead


High availability and fault tolerance


Near-zero downtime


No missed job execution


Logging



Billing, administration, SLAs



Technical Requirements


Asynchronous decoupling


The scheduling process should run independently from the job executions


Parallel job execution


Multi-threaded job execution


Load balancing


Maximizing utilization of available resources


Client- or server-side


Clustering


Deals with most requirements


Challenge: global timer is a singleton object
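
The asynchronous decoupling and multi-threaded execution requirements above can be sketched as follows. This is only an illustration, not the PoLoS implementation: it uses a plain `java.util.concurrent` queue and thread pool instead of the application server's messaging facilities, and the class name `DecoupledScheduler` and its methods are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the timer side only enqueues jobs; workers execute them.
public class DecoupledScheduler {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers;

    public DecoupledScheduler(int workerCount) {
        workers = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            workers.execute(() -> {
                try {
                    while (true) {
                        Runnable job = queue.take();   // blocks until a job is triggered
                        try {
                            job.run();
                        } catch (RuntimeException e) {
                            // A failing job must not kill the worker thread.
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // shutdown requested
                }
            });
        }
    }

    // Called by the scheduling side: hands the job over and returns immediately.
    public void trigger(Runnable job) {
        queue.add(job);
    }

    public void shutdown() {
        workers.shutdownNow();
    }

    // Triggers `jobs` trivial jobs and returns how many were executed.
    public static int runDemo(int jobs, int workerCount) {
        DecoupledScheduler scheduler = new DecoupledScheduler(workerCount);
        AtomicInteger executed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(jobs);
        for (int i = 0; i < jobs; i++) {
            scheduler.trigger(() -> {
                executed.incrementAndGet();
                done.countDown();
            });
        }
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        scheduler.shutdown();
        return executed.get();
    }
}
```

Because `trigger` only enqueues, a slow job delays other jobs in the queue but never the component that fires the triggers.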


Related Systems

Feature | Kronova | Quartz | Flux
Custom Java jobs | | |
Simple time scheduling (start/stop time, period) | | |
Event-driven scheduling | | |
API | | |
JMX-compatible | X | | X
Logging | | |
JTA-enabled | X | | X
Tracking data | X | | X
Clustering | | |
Load balancing | | |
Fault tolerance (fail-over) | Poor | Moderate (the only point of failure is the DB) | Poor

Architecture

[Figure: three cluster nodes (Node A, Node B, Node C - Master), each with Scheduling, Management, Caching and Execution components, and a Queue]

Caching Subsystem


Distributed in-memory cache


Synchronous and asynchronous replication


Optimistic and pessimistic locking


A DB is asynchronously updated


Implementation: JBossCache


Aspect-oriented programming techniques for cache updates
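
The idea of serving job data from memory while the DB is updated asynchronously can be illustrated with a small write-behind sketch. It does not use the JBossCache API, and it covers neither replication nor locking; the class `WriteBehindCache` is hypothetical and a plain `Map` stands in for the database.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical write-behind cache: reads and writes hit memory,
// persistence happens later on a background thread.
public class WriteBehindCache {
    private final Map<String, String> memory = new ConcurrentHashMap<>();
    private final Map<String, String> database;   // stand-in for a real DB
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    public WriteBehindCache(Map<String, String> database) {
        this.database = database;
    }

    public void put(String jobId, String jobData) {
        memory.put(jobId, jobData);                          // visible immediately
        writer.execute(() -> database.put(jobId, jobData));  // persisted asynchronously
    }

    public String get(String jobId) {
        return memory.get(jobId);
    }

    // Stops accepting writes and waits for pending ones; true if all were flushed in time.
    public boolean close(long timeoutMillis) {
        writer.shutdown();
        try {
            return writer.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The trade-off is the one the slides point to: callers do not wait for the database, but writes still queued at the moment of a crash are lost unless another node holds a replica.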

Scheduling Subsystem


JMX Timer


JBoss implementation


Java 5.0 JMX Timer class has limitations in multi-threading (high timer delays)



Singleton design pattern


Only one instance may be instantiated across the cluster


Recovers job trigger times after master node crash
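
One way a new master could recompute the next trigger time of a periodic job after a crash is shown below. This is a sketch under stated assumptions: jobs are described by a start time, a stop time and a period (as in simple time scheduling), missed firings are skipped rather than replayed, and the class `TriggerRecovery` is hypothetical, not part of the original Scheduler.

```java
// Hypothetical helper for re-arming periodic jobs after a master fail-over.
public final class TriggerRecovery {
    private TriggerRecovery() {}

    /**
     * Returns the first firing time that is not earlier than nowMillis for a job
     * firing at startMillis, startMillis + periodMillis, and so on.
     * Returns -1 if that firing would fall after stopMillis.
     */
    public static long nextTrigger(long startMillis, long periodMillis,
                                   long stopMillis, long nowMillis) {
        if (periodMillis <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        long next;
        if (nowMillis <= startMillis) {
            next = startMillis;                      // job has not started yet
        } else {
            long elapsed = nowMillis - startMillis;
            // Ceiling division: number of whole periods needed to reach or pass "now".
            long periods = (elapsed + periodMillis - 1) / periodMillis;
            next = startMillis + periods * periodMillis;
        }
        return next <= stopMillis ? next : -1L;
    }
}
```

For example, a job started at time 0 with a 60-second period and recovered at 125 seconds would next fire at 180 seconds.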

[Figure labels: Job trigger; Cache job data; Put job into queue]

Performance Evaluation Setup


Setup


Cluster with 2 nodes (AMD 64-bit, 3.4 GHz, 1 GB RAM)


JBoss 4.0.4, MySQL 5.0


JMeter for workload generation, JProfiler for performance profiling


Metrics


Maximum throughput: scalability measure


Delays


Average total delay


T1: timer delay, T2: persistence delay, T3: queuing delay
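
The delay metrics can be made concrete with a short calculation. The sketch assumes that the total delay of a job is the sum T1 + T2 + T3 and that averages are taken over all executed jobs; the slides do not spell this out, and the class `DelayStats` is hypothetical.

```java
// Hypothetical helper: each row of `delays` holds {T1, T2, T3} in milliseconds for one job.
public final class DelayStats {
    private DelayStats() {}

    // Average of T1 + T2 + T3 over all jobs (0.0 for an empty sample).
    public static double averageTotalDelay(long[][] delays) {
        if (delays.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (long[] d : delays) {
            sum += d[0] + d[1] + d[2];
        }
        return (double) sum / delays.length;
    }

    // Fraction of the summed delay contributed by one component (0 = T1, 1 = T2, 2 = T3).
    public static double share(long[][] delays, int component) {
        long total = 0;
        long part = 0;
        for (long[] d : delays) {
            total += d[0] + d[1] + d[2];
            part += d[component];
        }
        return total == 0 ? 0.0 : (double) part / total;
    }
}
```

The `share` method gives the kind of per-component breakdown that a delay distribution chart would display.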

Performance Evaluation Setup II


PoLoS - Synchronous replication


PoLoS - Asynchronous replication


Clustered Quartz


JMeter parameters:


Discrete user requests: 1000, 2000, …, 4000


Ramp up period: 120 seconds


Repetitions: 10


Period: variable (60, 120, 180, …, 300 seconds)


Job logic: a logging command



Maximum Throughput


Original non-clustered scheduler: ~3000 jobs/min


PoLoS sync: 2998 jobs/min - 6000 user requests


PoLoS async: 2810 jobs/min - 6000 user requests


Quartz: 240 jobs/min - 4000 user requests


Transaction isolation errors during persistence, high latency


Server crashed for more than 4000 user requests



Job period = 120 sec

Delays

[Figure: T1 delay (msec, axis 0 to 100000) vs. thousands of jobs (1 to 4) for sync PoLoS, async PoLoS and Quartz]

[Figure: Total delay (msec, axis 0 to 400000) vs. thousands of jobs (1 to 4) for sync PoLoS, async PoLoS and Quartz]
Job period = 60 sec

T2 for 1000 jobs:

PoLoS sync: 3.2 ms

PoLoS async: 0.75 ms

Quartz: 91905 ms

Delay Distribution

[Figure: delay distribution charts for Quartz (T1, T2), PoLoS async (T1, T2, T3) and PoLoS sync (T1, T2, T3)]
System | T1 | T2
PoLoS sync | 41549 ms | 7.1 ms
PoLoS async | 490 ms | 1 ms
Quartz | 120 ms | 198550 ms

2000 jobs with period 60 sec

Conclusions


JBoss JMX timer resulted in lower timer delays (T1) but the MDBs could not operate at that high rate


Asynchronous replication is much more efficient than synchronous


When the maximum throughput is reached, delays increase dramatically



Thank You!


Questions???



http://p-comp.di.uoa.gr