IPDPS 2013


Behavior of Synchronization
Methods in Commonly Used
Languages and Systems

Yiannis Nikolakopoulos

ioaniko@chalmers.se


Joint work with:

D. Cederman, B. Chatterjee, N. Nguyen,

M. Papatriantafilou, P. Tsigas

Distributed Computing and Systems

Chalmers University of Technology

Gothenburg, Sweden

Developing a multithreaded
application…


The boss wants .NET

The client wants speed… (C++?)

Java is nice

Multicores everywhere



The worker threads need to access data.

Then we need Synchronization:

Concurrent Data Structures

Implementing Concurrent Data Structures

Implementation:

Coarse Grain Locking

Fine Grain Locking

Test And Set (a minimal TAS/TTAS sketch follows below)

Array Locks

And more!
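For readers unfamiliar with the primitives above, here is a minimal C++ sketch of a Test-And-Set (TAS) spin lock and its Test-and-Test-And-Set (TTAS) variant. This is a generic textbook illustration, not the exact code benchmarked in the study.

#include <atomic>

// Test-And-Set lock: every acquisition attempt performs an atomic write,
// which generates cache-coherence traffic under contention.
class TASLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            /* spin */
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Test-and-Test-And-Set lock: spin on a plain read and only attempt the
// atomic exchange when the lock appears free, reducing coherence traffic.
class TTASLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            while (locked.load(std::memory_order_relaxed)) {
                /* spin locally on the cached value */
            }
            if (!locked.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

Coarse grain locking wraps the whole data structure in one such lock; fine grain locking protects smaller pieces (for example individual nodes or buckets) with separate locks.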

Implementing Concurrent Data Structures

Locking (coarse grain, fine grain, Test And Set, array locks, and more) can become a Performance Bottleneck.

Lock Free (a sketch of a lock-free enqueue follows below)
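To make "Lock Free" concrete, here is a sketch of an enqueue in the spirit of the Michael-Scott lock-free queue, built on compare-and-swap. Memory reclamation is deliberately omitted; that is the gap the lock-free memory management mentioned later has to fill. Illustrative only, not the study's implementation.

#include <atomic>

// Simplified Michael-Scott style lock-free enqueue (no memory reclamation).
struct LFNode {
    int value;
    std::atomic<LFNode*> next{nullptr};
    explicit LFNode(int v) : value(v) {}
};

struct LockFreeQueue {
    std::atomic<LFNode*> head;
    std::atomic<LFNode*> tail;

    LockFreeQueue() {
        LFNode* dummy = new LFNode(0);   // sentinel node
        head.store(dummy);
        tail.store(dummy);
    }

    void enqueue(int v) {
        LFNode* node = new LFNode(v);
        for (;;) {
            LFNode* last = tail.load();
            LFNode* next = last->next.load();
            if (last != tail.load()) continue;            // tail moved, retry
            if (next == nullptr) {
                // Try to link the new node after the current last node.
                if (last->next.compare_exchange_weak(next, node)) {
                    // Swing the tail; failure is harmless (another thread helped).
                    tail.compare_exchange_strong(last, node);
                    return;
                }
            } else {
                // Tail is lagging behind: help advance it, then retry.
                tail.compare_exchange_strong(last, next);
            }
        }
    }
};

No thread ever blocks while holding a lock here; a stalled thread cannot prevent others from completing their operations, which is where the progress and fairness behavior differs from the lock-based variants.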

Implementing concurrent data structures

Hardware platform

Which is the fastest/most scalable?

Problem Statement

How does the interplay of the above parameters and the different synchronization methods affect the performance and the behavior of concurrent data structures?


Outline


Introduction

Experiment Setup

Highlights of Study
and Results

Conclusion


Which data structures to study?

Represent different levels of contention:

Queue: 1 or 2 contention points (a two-lock queue sketch follows below)

Hash table: multiple contention points
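To make the contention points concrete, here is a minimal sketch of a two-lock linked-list queue (in the spirit of Michael and Scott's two-lock algorithm): enqueuers contend only on the tail lock and dequeuers only on the head lock, so the whole structure has at most two hot spots, while a hash table spreads operations over many buckets. This is an assumed illustration, not necessarily the exact variant used in the study.

#include <mutex>
#include <optional>

// Two-lock queue: one lock per contention point
// (head lock for dequeue, tail lock for enqueue).
struct TwoLockQueue {
    struct Node {
        int value = 0;
        Node* next = nullptr;
    };

    Node* head;   // always points to a sentinel node
    Node* tail;
    std::mutex headLock, tailLock;

    TwoLockQueue() { head = tail = new Node{}; }

    void enqueue(int v) {
        Node* node = new Node{v};
        std::lock_guard<std::mutex> g(tailLock);   // contention point #1
        tail->next = node;
        tail = node;
    }

    std::optional<int> dequeue() {
        std::lock_guard<std::mutex> g(headLock);   // contention point #2
        Node* first = head->next;
        if (first == nullptr) return std::nullopt; // queue is empty
        int v = first->value;
        delete head;                               // retire the old sentinel
        head = first;                              // first becomes the new sentinel
        return v;
    }
};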





How do we choose an implementation?

Possible criteria:


Framework dependencies


Programmability


“Good” performance


Interpreting “good”


Throughput: the more operations completed per time unit, the better.

Is this enough?


Non-fairness



What to measure?

Throughput: data structure operations completed per time unit.

Fairness:

\[
\mathrm{fairness}(\Delta t) \;=\; \min\!\left( \frac{\min_i(n_i)}{\frac{1}{N}\sum_{i=1}^{N} n_i},\ \frac{\frac{1}{N}\sum_{i=1}^{N} n_i}{\max_i(n_i)} \right)
\]

where n_i is the number of operations completed by thread i during Δt, and (1/N) Σ n_i is the average number of operations per thread.
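A small sketch of how these two metrics could be computed from per-thread operation counters collected during an interval Δt (names and types here are illustrative, not taken from the paper's code):

#include <algorithm>
#include <numeric>
#include <vector>

// ops[i] = number of operations completed by thread i during the interval.
double throughput(const std::vector<long>& ops, double interval_ms) {
    long total = std::accumulate(ops.begin(), ops.end(), 0L);
    return total / interval_ms;               // operations per millisecond
}

// fairness = min( min_i(n_i) / avg , avg / max_i(n_i) ), in [0, 1];
// 1 means every thread completed the same number of operations.
double fairness(const std::vector<long>& ops) {
    double avg = std::accumulate(ops.begin(), ops.end(), 0.0) / ops.size();
    auto mm = std::minmax_element(ops.begin(), ops.end());
    return std::min(*mm.first / avg, avg / *mm.second);
}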


Implementation Parameters

Programming Environments: C++, Java, C# (.NET, Mono)

Synchronization Methods: TAS, TTAS, Lock-free, Array lock; PMutex, Lock-free memory management; Reentrant lock, synchronized, lock construct, Mutex

NUMA Architectures: Intel Nehalem, 2 x 6 core (24 HW threads); AMD Bulldozer, 4 x 12 core (48 HW threads)

Do they influence fairness?

Experiment Parameters


Different levels of
contention


Number of threads


Measured time intervals


Outline

Introduction

Experiment Setup

Highlights of Study and Results:
Queue: Fairness; Intel vs AMD; Throughput vs Fairness
Hash Table: Intel vs AMD; Scalability

Conclusion

Observations: Queue

Fairness can change along different time intervals.
24 Threads, High contention.

Observations: Queue (Fairness)

Significantly different fairness behavior in different architectures.
24 Threads, High contention.
Lock-free is less affected in this case.

Queue: Throughput vs Fairness

[Charts: fairness (0.6 s intervals, Intel) and throughput (operations per ms, thousands) versus number of threads (2 to 48) for the C++ queue implementations TTAS, Lock-free, and PMutex.]

Observations: Hash table

Operations are distributed across different buckets.

Things get interesting when #threads > #buckets (a per-bucket locking sketch follows below).

Tradeoff between throughput and fairness; different winners and losers.

Contention is lowered in the linked-list components.
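To see why #threads > #buckets matters, consider a hash table with one lock per bucket: while threads are fewer than buckets they mostly hit distinct locks, but once they outnumber the buckets some lock is necessarily shared. A minimal sketch of such per-bucket (fine grain) locking, assumed for illustration rather than taken from the study's code:

#include <list>
#include <mutex>
#include <vector>

// Hash set with fine-grained, per-bucket locking.
// Each bucket (lock + linked-list chain) is an independent contention point.
class StripedHashSet {
    struct Bucket {
        std::mutex lock;
        std::list<int> chain;
    };
    std::vector<Bucket> buckets;

    Bucket& bucketFor(int key) {
        return buckets[static_cast<std::size_t>(key) % buckets.size()];
    }

public:
    explicit StripedHashSet(std::size_t nbuckets) : buckets(nbuckets) {}

    bool insert(int key) {
        Bucket& b = bucketFor(key);
        std::lock_guard<std::mutex> g(b.lock);   // only this bucket is blocked
        for (int k : b.chain) {
            if (k == key) return false;          // key already present
        }
        b.chain.push_back(key);
        return true;
    }

    bool contains(int key) {
        Bucket& b = bucketFor(key);
        std::lock_guard<std::mutex> g(b.lock);
        for (int k : b.chain) {
            if (k == key) return true;
        }
        return false;
    }
};

With 8 buckets and 24 threads, for example, at most 8 threads can be inside a bucket's critical section at any instant, and the remaining threads queue up on shared locks; that is where the throughput and fairness of the different lock types start to diverge.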



Observations: Hash table

Fairness differences in Hash table across architectures.
24 Threads, High contention.
Lock-free is again not affected.

Observations: Hash table

In C++, custom memory management and lock-free implementations excel in scalability and performance.


Conclusion

Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots.

Scalability via more complex, inherently parallel designs and implementations.

Tradeoff between throughput and fairness:
LF Hash table
Reentrant lock vs Array Lock vs LF Queue

Fairness can be heavily influenced by HW:
Interesting exceptions


Which is the fastest/most scalable?

Is fairness influenced by NUMA?