Getting the most out of Parallel Extensions for .NET


Dr. Mike Liddell
Senior Developer, Microsoft
mikelid@microsoft.com


Agenda

Why parallelism, why now?
Parallelism with today's technologies
Parallel Extensions to the .NET Framework
  PLINQ
  Task Parallel Library
  Coordination Data Structures
Demos

Hardware Paradigm Shift

"… we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to … multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations."

Pat Gelsinger, Chief Technology Officer, Senior Vice President, Intel Corporation
Intel Developer Forum, Spring 2004

[Chart: power density (W/cm²) of Intel processors from the 4004 through the 8008, 8080, 8085, 8086, 286, 386, 486, and Pentium® processors, roughly 1970-2010, on a log scale from 1 to 10,000; the trend passes the heat of a hot plate and heads toward a nuclear reactor, a rocket nozzle, and the Sun's surface.]

Today's architecture: heat is becoming an unmanageable problem!

To grow, to keep up, we must embrace parallel computing.

[Chart: projected GOPS growing from 16 in 2004 to 32,768 by 2015, labeled "Parallelism Opportunity: 80X".]

It's An Industry Thing

OpenMP
Intel TBB
Java libraries
OpenCL
CUDA
MPI
Erlang
Cilk
(many others)


What's the Problem?

Multithreaded programming is "hard" today
  Robust solutions are built only by specialists
  Parallel patterns are not prevalent, well known, nor easy to implement
  Many potential correctness & performance issues
    Races, deadlocks, livelocks, lock convoys, cache-coherency overheads, missed notifications, non-serializable updates, priority inversion, false sharing, sub-linear scaling, and so on…
  Features that are hard to get right are often skimped on
    The last delta of performance, ensuring no missed exceptions, composable cancellation, dynamic partitioning, efficient and custom scheduling

Businesses have little desire to "go deep"
  Developers should focus on business value, not concurrency hassles and common concerns

Example: Matrix Multiplication

void MultiplyMatrices(int size,
                      double[,] m1, double[,] m2, double[,] result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i, j] = 0;
            for (int k = 0; k < size; k++) {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    }
}


Manual Parallel Solution

int N = size;
int P = 2 * Environment.ProcessorCount;
int Chunk = N / P;
ManualResetEvent signal = new ManualResetEvent(false);
int counter = P;
for (int c = 0; c < P; c++) {
    ThreadPool.QueueUserWorkItem(o => {
        int lc = (int)o;
        for (int i = lc * Chunk;
             i < (lc + 1 == P ? N : (lc + 1) * Chunk);
             i++) {
            // original loop body
            for (int j = 0; j < size; j++) {
                result[i, j] = 0;
                for (int k = 0; k < size; k++) {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
        if (Interlocked.Decrement(ref counter) == 0) {
            signal.Set();
        }
    }, c);
}
signal.WaitOne();

Slide callouts: error-prone static work distribution; error-prone manual locking and synchronization; a potential scalability bottleneck.

Parallel Solution

void MultiplyMatrices(int size,
                      double[,] m1, double[,] m2, double[,] result)
{
    Parallel.For(0, size, i => {
        for (int j = 0; j < size; j++) {
            result[i, j] = 0;
            for (int k = 0; k < size; k++) {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    });
}

Demo!

Parallel Extensions to the .NET Framework

What is it?
  Additional APIs shipping in the .NET BCL (mscorlib, System, System.Core)
  With corresponding enhancements to the CLR & ThreadPool
  Provides primitives, task parallelism, and data parallelism
    Coordination/synchronization constructs (Coordination Data Structures)
    Imperative data and task parallelism (Task Parallel Library)
    Declarative data parallelism (PLINQ)
  Common exception-handling model
  Common and rich cancellation model

Why do we need it?
  Supports parallelism in any .NET language
  Delivers reduced concept count and complexity, and better time to solution
  Begins to move parallelism capabilities from concurrency experts to domain experts

Parallel Extensions Architecture

PLINQ Execution Engine
  Data partitioning (chunk, range, stripe, custom)
  Operators (map, filter, sort, search, reduction)
  Merging (pipelined, synchronous, order-preserving)

Task Parallel Library
  Structured task parallelism

Coordination Data Structures
  Thread-safe collections
  Coordination types
  Cancellation types

Pre-existing primitives
  ThreadPool
  Monitor, events, threads

User code and applications sit on top of these layers.

System.Threading.Tasks

Task
  Parent-child relationships
  Structured waiting and cancellation
  Continuations on success, failure, cancellation
  Implements IAsyncResult to compose with the Asynchronous Programming Model (APM)

Task<T>
  A task that has a value on completion
  Asynchronous execution, with blocking on task.Value
  Combines the ideas of futures and promises

TaskScheduler
  We ship a scheduler that makes full use of the (vastly) improved ThreadPool
  Custom task schedulers can be written for specific needs

Parallel
  Convenience APIs: Parallel.For(), Parallel.ForEach()
  Automatic, scalable & dynamic partitioning
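
A minimal sketch of these types in use (shown with the released .NET 4 names, where the CTP's task.Value became Task<T>.Result):

using System;
using System.Threading.Tasks;

class TaskDemo
{
    static void Main()
    {
        // Task<T>: a task that produces a value on completion.
        Task<int> sum = Task.Factory.StartNew(() =>
        {
            int total = 0;
            for (int i = 1; i <= 100; i++) total += i;
            return total;
        });

        // A continuation runs when its antecedent completes.
        Task print = sum.ContinueWith(t => Console.WriteLine("Sum = " + t.Result));

        print.Wait();  // structured waiting on the whole chain
    }
}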


Task Parallel Library: 1st-class debugger support!

Task Parallel Library: Loops

Loops are a common source of work:

  for (int i = 0; i < n; i++) work(i);
  foreach (T e in data) work(e);

They can be parallelized when iterations are independent, i.e. the body doesn't depend on mutable state (e.g. static vars, or writing to local vars to be used in subsequent iterations):

  Parallel.For(0, n, i => work(i));
  Parallel.ForEach(data, e => work(e));

Task Parallel Library (cont.)

Supports early exit via a Break API in Parallel.For and Parallel.ForEach loops (a sketch follows below).

Parallel.Invoke for easy creation of simple tasks:

  Parallel.Invoke(
      () => StatementA(),
      () => StatementB(),
      () => StatementC() );

Synchronous (blocking) APIs, but with cancellation support:

  Parallel.For(…, cancellationToken);
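
A hedged sketch of early exit with Break (ParallelLoopState is how the shipped .NET 4 API exposes it; IsMatch and Process are hypothetical placeholders):

using System;
using System.Threading.Tasks;

class BreakDemo
{
    static void Main()
    {
        Parallel.For(0, 1000, (i, loopState) =>
        {
            if (IsMatch(i))
            {
                loopState.Break();  // lets iterations below i finish; starts no higher ones
                return;
            }
            Process(i);
        });
    }

    static bool IsMatch(int i) { return i == 500; }
    static void Process(int i) { Console.WriteLine(i); }
}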

Parallel LINQ (PLINQ)

Enables LINQ developers to leverage parallel hardware
Supports all of the .NET standard query operators, plus a few other extension methods specific to PLINQ
Abstracts away parallelism details
  Partitions and merges data intelligently ("classic" data parallelism)
Works for any IEnumerable<T>

  e.g. data.AsParallel().Select(..).Where(..);
  e.g. array.AsParallel().WithCancellation(ct)…
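
A runnable sketch of the "works for any IEnumerable<T>" point (the query itself is illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class PlinqDemo
{
    static IEnumerable<int> Numbers()      // a plain, lazy IEnumerable<int>
    {
        for (int i = 0; i < 1000; i++) yield return i;
    }

    static void Main()
    {
        // AsParallel() lifts any IEnumerable<T> into a parallel query.
        var evens = Numbers().AsParallel()
                             .Where(x => x % 2 == 0)
                             .Select(x => x * x);

        Console.WriteLine(evens.Count());  // 500
    }
}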

Different Ways to Write PLINQ Queries

Comprehensions (syntax extensions to C# and Visual Basic):

  var q = from x in Y.AsParallel() where p(x) orderby x.f1 select x.f2;

Normal APIs, flavour 1: extension methods on IParallelEnumerable<T>:

  var q = Y.AsParallel()
           .Where(x => p(x))
           .OrderBy(x => x.f1)
           .Select(x => x.f2);

Normal APIs, flavour 2: direct use of ParallelEnumerable:

  var q = ParallelEnumerable.Select(
              ParallelEnumerable.OrderBy(
                  ParallelEnumerable.Where(Y.AsParallel(), x => p(x)),
                  x => x.f1),
              x => x.f2);

PLINQ Partitioning and Merging

Input to a single operator is partitioned into p disjoint subsets
Operators are replicated across the partitions
A merge marshals data back to the consumer thread

  foreach (int i in D.AsParallel()
                     .Where(x => p(x))
                     .Select(x => x * x * x)
                     .OrderBy(x => -x))
  { … }

Each partition executes in (almost) complete isolation.

[Diagram: the source D fans out to Task 1 … Task n; each task applies "where p(x)", "select x³", and a LocalSort(); a Merge step then feeds the foreach on the consumer thread.]

Coordination Data Structures

Thread-safe collections
  ConcurrentStack<T>
  ConcurrentQueue<T> …

Work exchange
  BlockingCollection<T> …

Phased operation
  CountdownEvent …

Locks and signaling
  ManualResetEventSlim
  SemaphoreSlim
  SpinLock …

Initialization
  LazyInit<T> …

Cancellation
  CancellationTokenSource
  CancellationToken
  OperationCanceledException

Used throughout PLINQ and TPL
Assist with key concurrency patterns

Common Cancellation

A CancellationTokenSource is a source of cancellation requests.
A CancellationToken is a notifier of a cancellation request.

Linking tokens allows cancellation requesters to be combined.
Slow code should poll roughly every 1 ms.
Blocking calls should observe a token.

The work coordinator:
1. Creates a CTS
2. Starts work
3. Cancels the CTS if required

Workers:
1. Get, share, and copy tokens
2. Routinely poll a token, which observes the CTS
3. May attach callbacks to a token

[Diagram: a CTS hands a CT to each worker; tokens CT1 and CT2 can be linked into a combined source CTS12, whose CTs observe both originals.]
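
A short sketch of linking (CancellationTokenSource.CreateLinkedTokenSource is the released .NET 4 API for this):

using System;
using System.Threading;

class LinkedCancellation
{
    static void Main()
    {
        var cts1 = new CancellationTokenSource();
        var cts2 = new CancellationTokenSource();

        // The linked source's token observes both: cancelling either one cancels it.
        using (var linked = CancellationTokenSource.CreateLinkedTokenSource(cts1.Token, cts2.Token))
        {
            linked.Token.Register(() => Console.WriteLine("cancelled"));  // callback on the token

            cts1.Cancel();  // fires the callback through the linked source
            Console.WriteLine(linked.Token.IsCancellationRequested);      // True
        }
    }
}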

Common Cancellation (cont.)

All blocking calls allow a CancellationToken to be supplied:

  var results = data
      .AsParallel()
      .WithCancellation(token)
      .Select(x => f(x))
      .ToArray();

User code can observe the cancellation token and cooperatively enact cancellation:

  var results = data
      .AsParallel()
      .WithCancellation(token)
      .Select(x => {
          if (token.IsCancellationRequested)
              throw new OperationCanceledException(token);
          return f(x);
      })
      .ToArray();

Extension Points in TPL & PLINQ

Partitioning strategies for Parallel & PLINQ
  Extend via Partitioner<T>, OrderablePartitioner<T>
  e.g. partitioners for heterogeneous data (see the sketch below)

Task scheduling
  Extend via TaskScheduler
  e.g. a GUI-thread scheduler, a throttled scheduler

BlockingCollection
  Extend via IProducerConsumerCollection<T>
  e.g. a blocking priority queue
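
A minimal sketch of plugging a partitioning strategy into a parallel loop (Partitioner.Create here is the built-in range partitioner from the released .NET 4 API, standing in for a custom Partitioner<T>):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class RangePartitioning
{
    static void Main()
    {
        var data = new double[1000000];

        // Hand each worker a contiguous index range instead of single elements,
        // amortizing scheduling overhead when the loop body is cheap.
        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                data[i] = Math.Sqrt(i);
        });

        Console.WriteLine(data[999999]);
    }
}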

Debugging Parallel Apps in VS2010

Two new debugger tool windows:
"Parallel Tasks"
"Parallel Stacks"

Parallel Tasks

[Screenshot: the Parallel Tasks window, with callouts for Identifier, Task Entry Point, Parent ID, Thread Assignment, Current Task, Flagging, Status, and Location (whose tooltip shows info on waiting/deadlocked status); the item and column context menus; and an indicator that a task's thread is frozen.]

Parallel Stacks

[Screenshot: the Parallel Stacks window, with callouts for the active frame of the current thread, the active frames of other threads, the current frame, method and header tooltips, the zoom control, the context menu, and a bird's-eye view; blue highlights the path of the current thread.]


Summary

The ManyCore Shift is happening

Parallelism in your code is inevitable

Invest in a platform that enables parallelism

…like the Parallel Extensions for .NET



Further Info and News

MSDN Concurrency Developer Center
  http://msdn.microsoft.com/concurrency

Parallel Extensions Team Blog
  http://blogs.msdn.com/pfxteam

Getting the bits!
  June 2008 CTP: http://msdn.microsoft.com/concurrency
  Microsoft Visual Studio 2010 (beta coming soon):
  http://www.microsoft.com/visualstudio/en-us/products/2010/default.mspx

Blogs
  Parallel Extensions Team: http://blogs.msdn.com/pfxteam
  Joe Duffy: http://www.bluebytesoftware.com
  Daniel Moth: http://www.danielmoth.com/Blog/

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


Extra Slides …

Parallel Technologies from Microsoft

Local computing
  CDS
  TPL
  PLINQ
  Concurrency Runtime in Robotics Studio
  PPL (native)
  OpenMP (native)

Distributed computing
  WCF
  MPI, MPI.NET

Types

Key common types:
  AggregateException, OperationCanceledException, TaskCanceledException
  CancellationTokenSource, CancellationToken
  Partitioner<T>

Key TPL types:
  Task, Task<T>
  TaskFactory, TaskFactory<T>
  TaskScheduler

Key PLINQ types:
  Extension methods IEnumerable.AsParallel(), IEnumerable<T>.AsParallel()
  ParallelQuery, ParallelQuery<T>, OrderableParallelQuery<T>

Key CDS types:
  Lazy<T>, LazyVariable<T>, LazyInitializer
  CountdownEvent, ManualResetEventSlim, SemaphoreSlim
  BlockingCollection, ConcurrentDictionary, ConcurrentQueue
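
As one small sketch from the CDS list, CountdownEvent for phased operation (the three-worker setup is illustrative):

using System;
using System.Threading;

class CountdownDemo
{
    static void Main()
    {
        using (var done = new CountdownEvent(3))      // expect three signals
        {
            for (int i = 0; i < 3; i++)
            {
                int id = i;
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    Console.WriteLine("worker " + id + " finished");
                    done.Signal();                    // decrement the remaining count
                });
            }
            done.Wait();                              // blocks until the count reaches zero
        }
    }
}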

Performance Tips

This is an early community technology preview; keep in mind that performance will improve significantly.

Target compute-intensive work and/or large data sets
  Work done should be at least 1,000s of cycles
  Measure, and combine/optimize as necessary

Do not be gratuitous in task creation
  Tasks are lightweight, but still require object allocation, etc.

Parallelize only outer loops where possible
  Unless N is insufficiently large to offer enough parallelism
  Consider parallelizing only the inner loop, or both, at that point

Prefer isolation and immutability over synchronization
  Synchronization == !Scalable
  Try to avoid shared data

Have realistic expectations
  Amdahl's Law: speedup is fundamentally limited by the amount of sequential computation (a worked instance follows)
  Gustafson's Law: but what if you add more data, thus increasing the parallelizable percentage of the application?
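
A quick worked instance of Amdahl's Law (the formula is standard, not from the slides): if a fraction P of the work parallelizes across n cores, speedup(n) = 1 / ((1 - P) + P/n). For P = 0.9 on 8 cores that gives 1 / (0.1 + 0.9/8) ≈ 4.7x, and even with unlimited cores the ceiling is 1 / 0.1 = 10x.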





Parallelism Blockers

Ordering not guaranteed:

  int[] values = new int[] { 0, 1, 2 };
  var q = from x in values.AsParallel() select x * 2;
  int[] scaled = q.ToArray(); // == { 0, 2, 4 } ??

Exceptions surface as an AggregateException:

  object[] data = new object[] { "foo", null, null };
  var q = from x in data.AsParallel() select x.ToString();

Thread affinity:

  controls.AsParallel().ForAll(c => c.Size = ...); // Problem

Operations with sub-linear speedup, or even speedup < 1.0:

  IEnumerable<int> input = …;
  var doubled = from x in input.AsParallel() select x * 2;

Side effects and mutability are serious issues. Most queries do not use side effects, but… there is a race condition if elements are non-unique:

  var q = from x in data.AsParallel() select x.f++;

PLINQ Partitioning, cont.

Types of partitioning:

Chunk
  Works with any IEnumerable<T>
  Single enumerator shared; chunks handed out on-demand

Range
  Works only with IList<T>
  Input divided into contiguous regions, one per partition

Stride
  Works only with IList<T>
  Elements handed out round-robin to each partition

Hash
  Works with any IEnumerable<T>
  Elements assigned to partitions based on hash code

Repartitioning is sometimes necessary.


PLINQ Merging

Pipelined: separate consumer thread
  Default for GetEnumerator(), and hence foreach loops
  Access to data as it's available
  But more synchronization overhead

Stop-and-go: consumer helps
  Sorts, ToArray, ToList, GetEnumerator(false), etc.
  Minimizes context switches
  But higher latency and more memory

Inverted: no merging needed
  ForAll extension method (a sketch follows below)
  Most efficient by far
  But not always applicable

[Diagram: worker threads (Thread 1 … Thread 4) hand results to the consumer thread under each merge strategy.]
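
A minimal sketch of the inverted merge via ForAll (results are consumed on the worker threads, so no merge back to the consumer thread is needed):

using System;
using System.Linq;

class ForAllDemo
{
    static void Main()
    {
        int[] data = Enumerable.Range(0, 100).ToArray();

        // The action runs on each partition's thread, skipping the merge step;
        // output order is therefore not guaranteed.
        data.AsParallel()
            .Where(x => x % 2 == 0)
            .ForAll(x => Console.WriteLine(x));
    }
}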

Example: "Baby Names"

IEnumerable<BabyInfo> babyRecords = GetBabyRecords();
var results = new List<BabyInfo>();
foreach (var babyRecord in babyRecords)
{
    if (babyRecord.Name == queryName &&
        babyRecord.State == queryState &&
        babyRecord.Year >= yearStart &&
        babyRecord.Year <= yearEnd)
    {
        results.Add(babyRecord);
    }
}
results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));

Manual Parallel Solution

IEnumerable<BabyInfo> babies = …;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount * 2;
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try {
    using (ManualResetEvent done = new ManualResetEvent(false)) {
        for (int i = 0; i < partitionsCount; i++) {
            ThreadPool.QueueUserWorkItem(delegate {
                var partialResults = new List<BabyInfo>();
                while (true) {
                    BabyInfo baby;
                    lock (enumerator) {
                        if (!enumerator.MoveNext()) break;
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd) {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
    }
}
finally { if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose(); }

Slide callouts: synchronization knowledge required; inefficient locking; manual aggregation; loss of foreach simplicity; tricks; heavy synchronization; lack of thread reuse; non-parallel sort.

LINQ Solution

var results = from baby in babyRecords.AsParallel()
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;

(or, in a different syntax…)

var results = babyRecords.AsParallel()
    .Where(b => b.Name == queryName &&
                b.State == queryState &&
                b.Year >= yearStart &&
                b.Year <= yearEnd)
    .OrderBy(b => b.Year)
    .Select(b => b);

On the slide, .AsParallel() is highlighted as the only addition needed to parallelize each query.

ThreadPool Task Queues and Task (Work) Stealing

[Diagram: the program thread pushes Task 1, Task 2, Task 3 … onto the global ThreadPool queue; each worker thread (Worker Thread 1 … Worker Thread p) also keeps a local queue of tasks (Task 4, Task 5, Task 6 …), and idle workers steal tasks from other workers' local queues.]
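
A hedged illustration of why this matters (the local-queue behavior described in the comments reflects the .NET 4 scheduler's documented design; the program itself is a made-up example):

using System;
using System.Threading.Tasks;

class WorkStealingDemo
{
    static void Main()
    {
        Task.Factory.StartNew(() =>
        {
            // Child tasks created inside a task go to this worker's local queue;
            // idle workers can steal them, keeping all cores busy.
            var children = new Task[4];
            for (int i = 0; i < 4; i++)
            {
                int id = i;
                children[id] = Task.Factory.StartNew(() => Console.WriteLine("child " + id));
            }
            Task.WaitAll(children);
        }).Wait();
    }
}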