PhD Thesis Proposal - Ferad Zyulkyarov


Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Department of Computer Architecture
Universitat Politècnica de Catalunya, BarcelonaTech
Barcelona Supercomputing Center

01 July 2010

Ferad Zyulkyarov

PhD Thesis Proposal

Publications


Ferad Zyulkyarov, Srdjan Stipic, Tim Harris, Osman Unsal, Adrian Cristal, Ibrahim Hur, Mateo Valero, Discovering and Understanding Performance Bottlenecks in Transactional Applications, PACT'10

Ferad Zyulkyarov, Tim Harris, Osman Unsal, Adrian Cristal, Mateo Valero, Debugging Programs that use Atomic Blocks and Transactional Memory, PPoPP'10

Vladimir Gajinov, Ferad Zyulkyarov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory, ICS'09

Ferad Zyulkyarov, Vladimir Gajinov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server, PPoPP'09

Ferad Zyulkyarov, Sanja Cvijic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, WormBench - A Configurable Workload for Evaluating Transactional Memory Systems, MEDEA '09

Ferad Zyulkyarov, Milos Milovanovic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Memory Management for Transaction Processing Core in Heterogeneous Chip-Multiprocessors, OSHMA '09

Milos Milovanovic, Osman Unsal, Adrian Cristal, Ferad Zyulkyarov, Mateo Valero, Compiler Support for Using Transactional Memory in C/C++ Applications, INTERACT'07

Work Plan

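[Gantt chart. Time axis: 07/08/2006, 23/02/2007, 11/09/2007, 29/03/2008, 15/10/2008, 03/05/2009, 19/11/2009, 07/06/2010; marked date: 01/10/2010. Tasks, in extraction order: Thesis Writing, TM-Optimization, TM-Profiling, TM-Debugging, WormBench, AtomicQuake, TxPC, StateOfTheArt. Legend: Start Date, Completed, Remaining. Bar labels, in extraction order: 12m, 11m, 21m, 10m, 15m, 9.5m, 7m, 2m.]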

Transactional Memory

atomic {
    statement1;
    statement2;
    statement3;
    statement4;
    ...
}

The Big Questions


Is programming with TM easy?


Is TM competitive with locks?


Are existing development tools sufficient?


Atomic Quake


Parallel Quake game server

All locks are replaced with atomic blocks

27,400 LOC of C code in 56 files

Rich transactional application

63 atomic blocks

Rich uses of atomic blocks: library calls, I/O, error handling, memory allocation, failure atomicity

Various transactional characteristics

A workload to drive research in TM

Is programming with TM easy?


Yes, in large applications that have many shared objects and need efficient fine-grained synchronization.

Example: region-based locking in tree data structures and graphs.

Where Do Transactions Fit?

Guarding different types of objects with separate locks:

switch (object->type) {    /* Lock phase */
    case KEY:    lock(key_mutex);    break;
    case LIFE:   lock(life_mutex);   break;
    case WEAPON: lock(weapon_mutex); break;
    case ARMOR:  lock(armor_mutex);  break;
}

pick_up_object(object);

switch (object->type) {    /* Unlock phase */
    case KEY:    unlock(key_mutex);    break;
    case LIFE:   unlock(life_mutex);   break;
    case WEAPON: unlock(weapon_mutex); break;
    case ARMOR:  unlock(armor_mutex);  break;
}

With an atomic block, the lock phase and the unlock phase are no longer needed:

atomic {
    pick_up_object(object);
}

Is TM Competitive with Locks?

No. The single-threaded version is 4-5x slower.

But TM promises to become competitive because it scales well.

[Figure: Scalability*, STM vs. LOCK, for 1, 2, 4, and 8 threads; vertical axis from 0.00 to 8.00.]

Scales OK up to 4 threads.

Threads  Transactions  Aborts (Num)  Aborts (%)  Irrevocable
1        36 667        0             0.00%       17
2        75 824        241           0.42%       31
4        166 000       2 612         1.58%       85
8        477 519       76 771        25.50%      237

Sudden increase in aborts at 8 threads.

Are Existing Tools Sufficient?


No


We need:


Richer language level primitives and integration.


Mechanisms to handle I/O.


Dynamic error handling.


Debuggers.


Profilers.


Unstructured Use of Locks

Locks

for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
    <statements1>
    LOCK(cl_msg_lock[c - svs.clients]);
    <statements2>
    if (!c->send_message) {
        <statements3>
        UNLOCK(cl_msg_lock[c - svs.clients]);
        <statements4>
        continue;
    }
    <statements5>
    if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
        <statements6>
        UNLOCK(cl_msg_lock[c - svs.clients]);
        <statements7>
        continue;
    }
    <statements8>
    if (c->state == cs_spawned) {
        if (frame_threads_num > 1) LOCK(par_runcmd_lock);
        <statements9>
        if (frame_threads_num > 1) UNLOCK(par_runcmd_lock);
    }
    UNLOCK(cl_msg_lock[c - svs.clients]);
    <statements10>
}

Atomic Block

bool first_if = false;
bool second_if = false;
for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
    <statements1>
    atomic {
        <statements2>
        if (!c->send_message) {
            <statements3>
            first_if = true;
        } else {
            <statements5>
            if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
                <statements6>
                second_if = true;
            } else {
                <statements8>
                if (c->state == cs_spawned) {
                    if (frame_threads_num > 1) {
                        atomic {
                            <statements9>
                        }
                    } else {
                        <statements9>
                    }
                }
            }
        }
    }
    if (first_if) {
        <statements4>
        first_if = false;
        continue;
    }
    if (second_if) {
        <statements7>
        second_if = false;
        continue;
    }
    <statements10>
}

Problems: extra variables and code; complicated conditional logic.

Solution: explicit "commit".

Various Transactional Characteristics

ID  TX#     | Dynamic Length (CPU Cycles)               | Read Set (Bytes)                   | Write Set (Bytes)
            | Total        Min      Max        Avg      | Total       Min    Max     Avg     | Total      Min    Max     Avg
56  26,962  | 172,872,572  288      112,832    6,412    | 1,328,536   20     104     49      | 0          0      0       0
60  5,931   | 5,810,152    224      41,552     980      | 76,212      12     640     13      | 928        0      116     0
61  1,095   | 20,573,540   4,560    49,984     19,208   | 723,474     88     776     661     | 90         84     84      84
59  1,042   | 3,117,844    1,520    39,344     2,999    | 29,176      5      28      28      | 16,672     16     16      16
57  1,038   | 401,502,152  288,704  522,528    387,552  | 10,963,719  7,614  15,490  10,562  | 2,592,367  1,680  3,656   2,497
58  1,002   | 134,949,344  87,056   1,341,504  134,949  | 5,054,282   3,028  53,566  5,044   | 931,445    548    11,161  930
15  3       | 67,660       720      48,176     1,735    | 96          32     32      32      | 18         6      6       6
5   2       | 99,988       592      36,384     1,923    | 64          32     32      32      | 10         5      5       5
22  2       | 43,632       12,176   35,504     21,816   | 72          36     36      36      | 128        64     64      64
36  2       | 40,476       6,800    44,880     20,238   | 249         108    141     125     | 55         22     33      28
38  2       | 71,368       2,144    31,504     4,461    | 90          44     46      45      | 26         12     14      13

Per-atomic block runtime statistics from Atomic Quake.

Very small transactions

Very large transactions

Different execution frequency -> phased behavior.

Control flow does not reach all atomic blocks.

Most frequent atomic block is read-only.

Debugging Transactional Applications


Existing debuggers are not aware of atomic blocks and transactional memory

New principles and approaches:

Debugging atomic blocks atomically

Debugging at the level of transactions

Managing transactions at debug-time

Extension for WinDbg to debug programs with atomic blocks

Atomicity in Debugging


Step over atomic blocks as if they were a single instruction.

This abstracts whether atomic blocks are implemented with TM or with lock inference.

Good for debugging synchronization errors at the granularity of atomic blocks rather than individual statements inside the atomic blocks.

Non-TM-aware debugger vs. TM-aware debugger (same code in both cases):

<statement 1>
<statement 2>
atomic {
    <statement 3>
    <statement 4>
    <statement 5>
    <statement 6>
}
<statement 7>
<statement 8>

Debugging becomes frustrating when a transaction aborts.

Isolation in Debugging


What if we want to debug incorrect code inside an atomic block?

Put a breakpoint inside the atomic block.

Validate the transaction.

Step within the transaction.

The user does not observe intermediate results of concurrently running transactions.

Switch the transaction to irrevocable mode after validation.

atomic {
    <statement 1>
    <statement 2>
    <statement 3>
    <statement 4>
}

Debugging at the Level of Transactions


Assumes that atomic blocks are implemented with transactional memory.

Examine the internal state of the TM

Read/write set, re-executions, status

TM specific watch points

Break when conflict happens

Filters

Concurrent work with Herlihy and Lev [PACT'09].

TM Specific Watchpoints

atomic {
    <statement 1>
    <statement 2>
    <statement 3>
    <statement 4>
}

Break when conflict happens

Conflict Information

Conflicting Threads: T1, T2
Address: 0x84D2F0
Symbol: reservation@04
Readers: T1
Writers: T2

Filter: Break if Address = reservation@04 AND Thread = T2
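A conflict filter of this kind can be pictured as a predicate over conflict records. The sketch below is hypothetical: the `Conflict` record, its field names, and `make_filter` are illustrative and are not the interface of the WinDbg extension. It only shows how a filter such as "Address = reservation@04 AND Thread = T2" could select the conflicts to break on.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Conflict:
    """One conflict event, mirroring the fields of the conflict information box."""
    address: int
    symbol: str
    readers: FrozenSet[str]
    writers: FrozenSet[str]

    @property
    def threads(self) -> FrozenSet[str]:
        # Every thread involved in the conflict, reader or writer.
        return self.readers | self.writers


def make_filter(symbol: Optional[str] = None,
                thread: Optional[str] = None) -> Callable[[Conflict], bool]:
    """Build a predicate; every condition that is supplied must hold (AND)."""
    def should_break(conflict: Conflict) -> bool:
        if symbol is not None and conflict.symbol != symbol:
            return False
        if thread is not None and thread not in conflict.threads:
            return False
        return True
    return should_break


# Values taken from the slide; the record layout itself is an assumption.
conflict = Conflict(address=0x84D2F0, symbol="reservation@04",
                    readers=frozenset({"T1"}), writers=frozenset({"T2"}))
break_on = make_filter(symbol="reservation@04", thread="T2")
other = make_filter(symbol="reservation@04", thread="T3")
```

Here `break_on(conflict)` accepts the conflict from the slide, while `other(conflict)` rejects it because thread T3 is not involved.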

Managing Transactions at Debug-Time

At the level of atomic blocks

Debug-time atomic blocks

Splitting atomic blocks

At the level of transactions

Changing the state of the TM system (i.e. adding and removing entries from the read/write set, changing the status, aborting)

Analogous to the functionality of existing debuggers to change the CPU state

Example Debug Time Atomic Blocks

<statement 1>
<statement 2>
<statement 3>
<statement 4>
<statement 5>
<statement 6>
<statement 7>
<statement 8>
<statement 9>
<statement 10>
<statement 11>
<statement 12>
<statement 13>
<statement 14>

The user marks the start and the end of the transaction:

<statement 1>
<statement 2>
<statement 3>
StartDebugAtomic
<statement 4>
<statement 5>
<statement 6>
<statement 7>
<statement 8>
<statement 9>
EndDebugAtomic
<statement 10>
<statement 11>
<statement 12>
<statement 13>
<statement 14>

Issues of Profiling TM Programs


TM applications have unanticipated overheads

Problem raised by Pankratius [talk at ICSE'09] and Rossbach et al. [PPoPP'10]

Difficult to profile TM applications without profiling tools and without knowing the implementation of the TM system

Experience of optimizing QuakeTM, Gajinov et al. [ICS'2009]

Profiling TM Programs


Design principles

Report results at source language constructs

Abstract the underlying TM system

Low probe effect and overhead

Profiling techniques

Conflict point discovery

Identifying conflicting data structures

Visualizing transactions

Conflict Point Discovery


Identifies the statements involved in conflicts

Provides contextual information

Finds the critical path

File:Line        #Conf.  Method    Line
Hashtable.cs:51  152     Add       if (_container[hashCode]…
Hashtable.cs:48  62      Add       uint hashCode = HashSdbm(…
Hashtable.cs:53  5       Add       _container[hashCode] = n …
Hashtable.cs:83  5       Add       while (entry != null)
ArrayList.cs:79  3       Contains  for (int i = 0; i < count; i++ )
ArrayList.cs:52  1       Add       if (count == capacity - 1) …
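Conflict point discovery amounts to attributing each detected conflict to a source location and ranking the locations. A minimal sketch, assuming conflicts arrive as (file, line) pairs; the event log and the `conflict_points` helper are illustrative, with counts chosen to match four of the rows reported on this slide.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def conflict_points(events: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Count conflicts per source location, most conflicting location first."""
    counts = Counter(f"{file}:{line}" for file, line in events)
    return counts.most_common()


# Synthetic event log (an assumption): one entry per conflict.
events = (
    [("Hashtable.cs", 51)] * 152
    + [("Hashtable.cs", 48)] * 62
    + [("Hashtable.cs", 53)] * 5
    + [("ArrayList.cs", 79)] * 3
)
ranking = conflict_points(events)
```

The first entry of `ranking` is the statement a programmer would inspect first; a real profiler would also have to record the call context, which this sketch omits.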

Call Context

increment() {
    counter++;
}

probability80() {
    probability = random() % 100;
    if (probability < 80) {
        atomic {
            increment();
        }
    }
}

probability20() {
    probability = random() % 100;
    if (probability >= 80) {
        atomic {
            increment();
        }
    }
}

Thread 1 and Thread 2 each run:

for (int i = 0; i < 100; i++) {
    probability80();
    probability20();
}

Bottom-up view

+ increment (100%)
  |---- probability80 (80%)
  |---- probability20 (20%)

Top-down view

+ main (100%)
  |---- probability80 (80%)
  |       |---- increment (80%)
  |---- probability20 (20%)
          |---- increment (20%)
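Both views can be derived from the same data: the call stack recorded at each conflict. The sketch below assumes such stacks are available as tuples with the outermost caller first; the sample counts and the helper names `bottom_up` and `top_down` are illustrative, not taken from the profiler.

```python
from collections import defaultdict
from typing import Dict, Tuple

Stack = Tuple[str, ...]  # outermost caller first, conflicting function last


def bottom_up(samples: Dict[Stack, int]) -> Dict[str, Dict[str, float]]:
    """For each conflicting function, the share of conflicts per direct caller.

    Assumes every stack has at least two frames.
    """
    total = sum(samples.values())
    view: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for stack, count in samples.items():
        leaf, caller = stack[-1], stack[-2]
        view[leaf][caller] += 100.0 * count / total
    return {leaf: dict(callers) for leaf, callers in view.items()}


def top_down(samples: Dict[Stack, int]) -> Dict[Stack, float]:
    """Share of conflicts attributed to every call-path prefix from the root."""
    total = sum(samples.values())
    view: Dict[Stack, float] = defaultdict(float)
    for stack, count in samples.items():
        for depth in range(1, len(stack) + 1):
            view[stack[:depth]] += 100.0 * count / total
    return dict(view)


# Hypothetical conflict samples consistent with the 80/20 split on the slide.
samples = {
    ("main", "probability80", "increment"): 80,
    ("main", "probability20", "increment"): 20,
}
```

The bottom-up view answers "who calls the conflicting code?", while the top-down view shows how conflicts accumulate along each path from `main`.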

Aborts Graph (Bayes)

[Figure: aborts graph with nodes AB1, AB2, and AB3. Edge labels: "Conf: 73%, Wasted: 63%" and "Conf: 20%, Wasted: 29%". Callout: 72% of wasted work.]

There are 15 atomic blocks and only one of them accounts for most aborts.

Which atomic blocks cause AB3 to abort?

Identifying Conflicting Objects

Per-Object View

+ List.cs:1 "list" (42%)
  |--- ChangeNode (20%)
       +---- Replace (12%)
       +---- Add (8%)

1: List list = new List();
2: list.Add(1);
3: list.Add(2);
4: list.Add(3);
...
atomic {
    list.Replace(2, 33);
}

[Figure: a list holding the elements 1, 2, 3, with addresses 0x08, 0x10, 0x18, 0x20. Components shown: GC, Memory Allocator, DbgEng. Lookup chain shown: Object Addr 0x20, GC Root 0x08, Instr Addr 0x446290, List.cs:1.]
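The lookup chain in the figure (object address, GC root, allocating instruction, source line) can be sketched as three table lookups. This is only an illustration under assumptions: the `owner` links between list nodes, the table layouts, and the helper names are hypothetical, and the slide does not say how the actual tool obtains them.

```python
from typing import Dict, Optional


def gc_root(obj: int, owner: Dict[int, int]) -> int:
    """Follow owner links until reaching an object that nothing else owns."""
    seen = set()
    while obj in owner and obj not in seen:  # 'seen' guards against cycles
        seen.add(obj)
        obj = owner[obj]
    return obj


def conflicting_object(addr: int,
                       owner: Dict[int, int],
                       alloc_instr: Dict[int, int],
                       source_line: Dict[int, str]) -> Optional[str]:
    """Map a conflicting address to the source line that allocated its root."""
    root = gc_root(addr, owner)
    instr = alloc_instr.get(root)
    if instr is None:
        return None
    return source_line.get(instr)


# Addresses from the figure; the owner chain between nodes is an assumption.
owner = {0x20: 0x18, 0x18: 0x10, 0x10: 0x08}
alloc_instr = {0x08: 0x446290}          # root object -> allocating instruction
source_line = {0x446290: "List.cs:1"}   # instruction -> source location
```

With these tables, a conflict on the node at 0x20 is reported against the `list` object created at List.cs:1 rather than against an anonymous address.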

Transaction Visualizer (Genome)

Aborts occur at the first and last atomic blocks in program order.

Garbage Collection

Wait on barrier

Overhead and Probe Effect

Normalized Execution Time

Thrd#  Bayes+  Bayes-  Gen+  Gen-  Intrd+  Intrd-  Labr+  Labr-  Vac+  Vac-  WB+   WB-
1      1.59    1.00    1.27  1.00  1.29    1.00    1.07   1.00   1.26  1.00  0.71  1.00
2      1.00    0.56    0.97  0.67  0.97    0.58    0.64   0.61   0.83  0.59  0.60  0.55
4      0.23    0.23    0.73  0.52  0.91    0.36    0.45   0.46   0.58  0.40  0.41  0.33
8      0.21    0.20    0.73  0.55  1.57    0.38    0.72   0.56   0.53  0.34  0.33  0.22

Abort Rate in %

Thrd#  Bayes+  Bayes-  Gen+  Gen-  Intrd+  Intrd-  Labr+  Labr-  Vac+  Vac-  WB+   WB-
1      0.00    0.00    0.00  0.00  0.00    0.00    0.00   0.00   0.00  0.00  0.00  0.00
2      4.39    4.69    0.07  0.07  3.69    3.51    0.19   0.15   0.80  0.80  0.00  0.00
4      16.29   27.31   0.26  0.36  14.90   13.65   0.35   0.36   2.30  2.45  0.00  0.00
8      53.74   66.08   0.53  0.80  39.64   37.41   0.40   0.47   4.91  5.30  0.02  0.03

+ Profiling Enabled
- Profiling Disabled

Standard deviation for the difference: 27%

Standard deviation for the difference: 3.88%

Process data offline or during GC.
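One way to read the execution-time figures is as a per-benchmark ratio between the profiled (+) and unprofiled (-) runs. A minimal sketch, assuming that ratio is the quantity of interest; the dictionary layout and the `slowdown` and `worst` helpers are illustrative, and only the 1-thread and 8-thread rows are copied from the slide.

```python
from typing import Dict, Tuple

# (profiling enabled, profiling disabled) normalized execution times.
EXEC_TIME: Dict[int, Dict[str, Tuple[float, float]]] = {
    1: {"Bayes": (1.59, 1.00), "Gen": (1.27, 1.00), "Intrd": (1.29, 1.00),
        "Labr": (1.07, 1.00), "Vac": (1.26, 1.00), "WB": (0.71, 1.00)},
    8: {"Bayes": (0.21, 0.20), "Gen": (0.73, 0.55), "Intrd": (1.57, 0.38),
        "Labr": (0.72, 0.56), "Vac": (0.53, 0.34), "WB": (0.33, 0.22)},
}


def slowdown(threads: int) -> Dict[str, float]:
    """Ratio of profiled to unprofiled execution time for each benchmark."""
    return {name: on / off for name, (on, off) in EXEC_TIME[threads].items()}


def worst(threads: int) -> Tuple[str, float]:
    """Benchmark with the largest profiling slowdown at a given thread count."""
    ratios = slowdown(threads)
    name = max(ratios, key=ratios.get)
    return name, ratios[name]
```

Comparing `worst(1)` with `worst(8)` shows that the benchmark most affected by profiling changes with the thread count, which is one symptom of the probe effect.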

Optimization Techniques


Moving statements


Atomic block scheduling


Checkpoints and nested atomic blocks


Pessimistic reads


Early release


Moving Statements

Will this code execute the same?

atomic {
    counter++;
    <statement1>
    <statement2>
    <statement3>
}

atomic {
    <statement1>
    <statement2>
    <statement3>
    counter++;
}

No!

[Figure: Intruder, hoist optimization. Execution time in seconds (axis from 5 to 35) against threads (1, 2, 4, 8, 16), for the Top and Bottom placements of the statement.]

Checkpoints

atomic {
    <statement1>
    <statement2>
    <statement3>

    <statement4>

    <statement5>
    <statement6>

    <checkpoint>

    <statement7>
}

Conflicts at successive points in the block: 2%, 15%, 4%, 79%.

Insert a checkpoint (the <checkpoint> line above) before the point where most conflicts occur.

This reduced the wasted work for the atomic block by 40%.
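The benefit of a checkpoint can be estimated with a simple cost model. The sketch below is a toy under stated assumptions: every statement costs one unit, a conflict at statement i re-executes everything since the start of the block or since the checkpoint, and the mapping of the four conflict shares to statements 3, 4, 6, and 7 is a guess from the slide layout. It does not try to reproduce the 40% figure measured in the study.

```python
from typing import Dict


def expected_waste(conflict_share: Dict[int, float], checkpoint_after: int = 0) -> float:
    """Expected number of statements re-executed per conflict.

    conflict_share maps a 1-based statement index to the fraction of
    conflicts detected there. checkpoint_after=0 means no checkpoint.
    """
    waste = 0.0
    for stmt, share in conflict_share.items():
        # Work before the checkpoint survives only if the conflict is after it.
        restart = checkpoint_after if stmt > checkpoint_after else 0
        waste += share * (stmt - restart)
    return waste


# Assumed placement of the conflict shares shown on the slide.
shares = {3: 0.02, 4: 0.15, 6: 0.04, 7: 0.79}
without_checkpoint = expected_waste(shares)
with_checkpoint = expected_waste(shares, checkpoint_after=6)
```

In this model the checkpoint pays off because most conflicts fall after it, so most aborts roll back a single statement instead of the whole block.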

Conclusion


Study the programmability aspects of TM

New debugging principles and approaches for TM applications

New profiling techniques for TM applications

Profile-guided optimization approaches for TM applications

End