Massive Parallelism in AI


© 2010 Autodesk


Throughput versus Realtime

Pierre Pontevia

10th March 2010


Agenda

- Where are we today?
- The pathfinding challenge: from throughput to realtime
- MASAI: the premises of a massively parallel AI solution



WHERE ARE WE TODAY?


Where are we today?

- Parallel programming has become a reality for game developers since the arrival of "next gen" consoles (2005-2006)
- Since then, many new languages and programming models have been suggested to better tackle parallelism
- New hardware is being announced, shaping the future of consoles
- So this is a good moment to see how parallelism could be revisited for the games of tomorrow… with a special focus on pathfinding


As a start, the 13 dwarves should help us find the right parallel pattern

- The 13 dwarves are an initiative from the University of California, Berkeley, to help achieve high parallelism
- A dwarf is an algorithmic method that captures a pattern of computation and communication
- The first exercise is to identify which dwarves match the problems involved in pathfinding


As a start, the 13 dwarves should help us find the right parallel pattern (cont’d)

Dwarf: Description

1. Dense Linear Algebra: Data are dense matrices or vectors.
2. Sparse Linear Algebra: Data sets include many zero values. Data is usually stored in compressed matrices to reduce the storage and bandwidth requirements to access all of the nonzero values.
3. Spectral Methods: Data are in the frequency domain, as opposed to time or spatial domains.
4. N-Body Methods: Depends on interactions between many discrete points. Variations include particle-particle methods.
5. Structured Grids: Represented by a regular grid; points on the grid are conceptually updated together. It has high spatial locality.
6. Unstructured Grids: An irregular grid where data locations are selected, usually by underlying characteristics of the application.
7. Monte Carlo: Calculations depend on statistical results of repeated random trials.


As a start, the 13 dwarves should help us find the right parallel pattern (cont’d)

Dwarf: Description

8. Combinational Logic: Functions that are implemented with logical functions and stored state.
9. Graph Traversal: Visits many nodes in a graph by following successive edges. These applications typically involve many levels of indirection, and a relatively small amount of computation.
10. Dynamic Programming: Computes a solution by solving simpler overlapping subproblems. Particularly useful in optimization problems with a large set of feasible solutions.
11. Backtrack and Branch and Bound: Finds an optimal solution by recursively dividing the feasible region into subdomains, and then pruning subproblems that are suboptimal.
12. Construct Graphical Models: Constructs graphs that represent random variables as nodes and conditional dependencies as edges. Examples include Bayesian networks and Hidden Markov Models.
13. Finite State Machine: A system whose behavior is defined by states, transitions defined by inputs and the current state, and events associated with transitions or states.


Recent languages and programming models provide guidance for parallel implementation

- Data parallelism for homogeneous architectures
  - OpenMP
  - TBB
  - Ct
- Data parallelism for heterogeneous architectures
  - CUDA
  - OpenCL
  - DirectCompute
  - SPURS
  - RapidMind
- PC clusters
  - MPI
  - MapReduce
- Concurrent programming
  - PPL, Asynchronous Agents
  - Grand Central Dispatch


However, there are specific constraints in video games that affect parallel design…

- Memory resource constraints
  - How much scratch memory does the solver require?
- Concurrent memory access
  - Computations are done on data that can change significantly from frame to frame
- Data lifetime / persistence
  - Things are volatile by nature
- Reactivity / time delay / frequency constraints
  - When do you really need the result of your computation?
- Interruptibility
  - The system can change its mind
  - 80% of the path goals are never reached


…and even more constraints when you develop middleware

- Multiple cohabiting models
  - Several middleware packages with several threading models
  - Not blocking is not enough -> fine-tuning issues
  - SPURS everywhere?
- Multiple hardware targets
  - A PC is different from an Xbox 360 console, which is different from a PlayStation® 3 (PS3) console
- Multiple exclusive programming languages


A gap analysis of existing solutions shows that no single solution fits the video game context perfectly

- No model really treats memory as a limiting resource in the design of parallel solutions
- No model takes time into account as a dimension of the problem
- All the approaches are very throughput-oriented


THE PATHFINDING CHALLENGE: FROM THROUGHPUT TO REALTIME


Pathfinding in a nutshell

Path Planning: LOW FREQUENCY (0.1 Hz)
- Input: topology, current position, destination
- Output: valid path

Path Smoothing: MEDIUM FREQUENCY (2 Hz)
- Input: current position, destination
- Output: target point

DA (*) & Steering: HIGH FREQUENCY (10 Hz)
- Input: current position, target point
- Output: new target point

(*) DA: Dynamic Avoidance

[Diagram: a path from point A to point B]


Pathfinding is made of different solvers with different characteristics

3 categories of solvers:
- A*, graph traversal: low frequency / large input and work memory
- Trajectory smoothing: medium frequency / optional
- DA / steering: high frequency / critical

[Chart: A* and graph traversals, smoothing, DA and steering plotted by frequency (axis ticks 10, 3, 0.2) against work memory requirements (axis ticks > 500 K, < 5 K)]


There are 2 kinds of data parallelism in pathfinding

- Number of characters: all solver jobs increase linearly with the number of characters
- Size of graph: graph-traversal-related solvers can use a Dwarf 9 (graph traversal) solving approach
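
The first kind of data parallelism (one independent job per character) can be sketched as a parallel map. This is only an illustration under assumptions: the `steer` job, the dictionary layout of a character and the worker count are hypothetical, and CPython threads show the structure rather than a real speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def steer(character):
    """Hypothetical per-character job: unit direction from position to target."""
    (px, py), (tx, ty) = character["pos"], character["target"]
    dx, dy = tx - px, ty - py
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)  # already on the target, nothing to steer toward
    return (dx / dist, dy / dist)


def steer_all(characters, workers=4):
    """Run one independent job per character; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(steer, characters))
```

Because the jobs share no state, the amount of work grows linearly with the number of characters and the same shape maps onto most of the programming models listed earlier.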



A first approach could be a single-frame batch paradigm (throughput), compatible with most programming models

[Diagram: a Pathfinder entity (Entity 1) feeding four request queues (Path Request Queue, Target Request Queue, DA Request Queue, Steering Request Queue) in the middleware queue framework; four task types (Search Path Task, Select Target Task, Compute DA Task, Compute Steering Task); a PPM (Parallel Programming Model) queue dispatching to many compute kernels]


Each task request has a context composed of character data, global data, and potentially customized objects

- Searching Path: Start & Destination, Movement Model, Constraint, LPF (*) Shortcut, Pathdata, potentially all PathObjects, Path
- Selecting Target: Current Pos, Current Target, Path, Movement Model, Constraint, LPF (*) Shortcut, PathObjects of the path, Pathdata, Target Pos
- Computing DA Target: Current Pos, Current Target, Movement Model, Cluster of entities, Pathdata, DA Target Pos
- Steering: Current Pos, Current DA Target, Movement Model, Current PathObject, LPF Shortcut, Wanted Speed & Yaw

Item categories in the diagram: Character Context, Global Data, Customizable, Output

(*) LPF: Obstacle Avoidance


However, as the number of solvers can be limited by memory

[Diagram: Compute Path, Compute Target Point, Compute DA Target Point and Compute Steering jobs laid out on Thread 1 and Thread 2]


…a throughput maximization approach to parallelization can be capped by Amdahl’s law

[Diagram: three timelines compared: Serial, no memory limitation (Thread 1); Parallel, no memory limitation (Thread 1, Thread 2); Parallel, memory-constrained environment (Thread 1, Thread 2)]
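
For reference, Amdahl's law bounds the speedup by the fraction of work that stays serial. The sketch below simply evaluates the standard formula; reading "memory-limited solvers" as the serial fraction is one interpretation of the slide, not something it states numerically.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only `parallel_fraction` of the work
    can be spread over `workers` threads (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

For example, if memory forces half of a frame's solver work to run one job at a time, two threads give at most a 1.33x speedup, and no number of threads can reach 2x.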


To avoid that, the pathfinding solution needs to find more task parallelism along the time dimension

Moving from
- “How to solve all the work within a frame”

To
- “How to distribute work across several frames”
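
One possible way to spread a search over several frames is to write it as a resumable routine and give it a fixed budget per frame. The sketch below is an assumption-laden illustration: it uses breadth-first search instead of A* for brevity, and the names `search_steps`, `run_over_frames` and `expansions_per_frame` are hypothetical.

```python
from collections import deque


def search_steps(graph, start, goal):
    """Breadth-first search as a generator: yields None after each node
    expansion, then yields the path (or [] if the goal is unreachable)."""
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            yield path[::-1]
            return
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                frontier.append(nxt)
        yield None  # one unit of work done; the caller may suspend here
    yield []


def run_over_frames(search, expansions_per_frame):
    """Advance the search by a fixed budget each frame.
    Returns (path, number of frames used)."""
    frame = 0
    while True:
        frame += 1
        for _ in range(expansions_per_frame):
            result = next(search)
            if result is not None:
                return result, frame
```

A suspended search can also simply be dropped when the character changes its mind, which fits the interruptibility constraint mentioned earlier.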



A good illustration is to describe pathfinding as a statechart with 4 orthogonal states

Top-level states: Stopped, Path Not Found, Has Arrived, Active

The Active state contains 4 orthogonal regions:
- Path Planning: No Path, Searching Path, Path Found (events: New Destination, Path Found, Has arrived)
- Target Selection: No Target, Selecting Target, Target Found (events: Path Updated, Target Found, Has arrived)
- DA Target: No DA Target, Computing DA Target, DA Target Computed (events: Target Updated, DA Target Found, Has arrived)
- Steering: No Steering, Computing Steering, Steering Computed (events: DA Target Updated, Steering Computed, Has arrived)

Other events shown between regions: New Destination, New Pos, Path Updated, Target Updated, DA Target Updated, Pos updated
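
The orthogonal regions can be sketched as small independent state machines that react to the same event stream. The transition rules below are an assumption made for illustration (each region goes idle -> computing -> done, and a finished region emits the event that triggers the next one); the exact transitions of the original diagram are not recoverable here, and the "New Pos" events are ignored.

```python
class Region:
    """One orthogonal region with three states: idle, computing, done."""

    def __init__(self, name, trigger, completed, emits):
        self.name = name
        self.trigger = trigger      # event that starts a computation
        self.completed = completed  # event that reports the result
        self.emits = emits          # event sent to the other regions, or None
        self.state = "idle"

    def handle(self, event):
        """Update the state; return the emitted event, or None."""
        if event == "Has arrived":
            self.state = "idle"
        elif event == self.trigger:
            self.state = "computing"
        elif event == self.completed and self.state == "computing":
            self.state = "done"
            return self.emits
        return None


def make_regions():
    return [
        Region("Path Planning", "New Destination", "Path Found", "Path Updated"),
        Region("Target Selection", "Path Updated", "Target Found", "Target Updated"),
        Region("DA Target", "Target Updated", "DA Target Found", "DA Target Updated"),
        Region("Steering", "DA Target Updated", "Steering Computed", None),
    ]


def dispatch(regions, event):
    """Send an event to every region, cascading any events they emit."""
    pending = [event]
    while pending:
        current = pending.pop(0)
        for region in regions:
            emitted = region.handle(current)
            if emitted is not None:
                pending.append(emitted)
```

Each region advances at its own pace, which is what lets the four solvers run at different frequencies.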


It is still compatible with the previous approach, but multiframe (no longer capped by Amdahl’s law)

[Diagram: the same middleware queue framework (Path, Target, DA and Steering request queues feeding Search Path, Select Target, Compute DA and Compute Steering tasks), now driven by one pathfinding statechart per entity; three statechart instances are shown]


But now we have 3 new problems

- Problem 1: How to guarantee that high-frequency steering solvers return their values on time?
- Problem 2: How to deal with the multiframe volatility and dynamicity of data?
- Problem 3: What computation triggering logic do we want?


Problem 1 is a scheduling problem for realtime systems

Problem 1 can be reworded as follows:

“How to guarantee a deadline for each pathfinding solver request, compatible with the frequency of the solver”

This is very close to the definition of realtime software found on Wikipedia:

“In computer science, real-time computing (RTC), or "reactive computing", is the study of hardware and software systems that are subject to a "real-time constraint", i.e., operational deadlines from event to system response”

The good news is that there is a substantial literature on realtime scheduling!


To answer problem 1, we restate pathfinding solvers in a realtime formalism…

- Realtime formalism: a task X is defined by 4 parameters
  - X.s: starting time
  - X.d: deadline
  - X.e: execution requirement
  - X.p: execution period
- Adapting to pathfinding solvers:
  - Need to assume all tasks are periodic:
    - Easy for smoothing, steering or DA solvers
    - Trickier for A* and other graph traversal solvers
  - Need to have an estimate of each core solver job duration:
    - Again quite simple for smoothing, steering or DA solvers
    - Much less easy for A* and other graph traversal solvers -> need to decompose graph traversal tasks into subtasks of constant duration
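
The four parameters can be captured in a small record, together with the usual utilization check for periodic tasks. This is a sketch under assumptions: time is counted in abstract slots, the deadline is taken relative to each job's release, and the task values in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class PeriodicTask:
    name: str
    s: int  # starting time of the first job, in time slots
    d: int  # deadline of each job, relative to its release
    e: int  # execution requirement per job, in time slots
    p: int  # period between successive jobs, in time slots

    @property
    def weight(self):
        """Share of one processor the task needs on average (e / p)."""
        return Fraction(self.e, self.p)


def total_utilization(tasks):
    return sum((t.weight for t in tasks), Fraction(0))


def fits(tasks, processors):
    """Necessary condition for schedulability: every job fits inside its
    deadline and the summed weights do not exceed the processor count."""
    return all(t.e <= t.d for t in tasks) and total_utilization(tasks) <= processors
```

With a steering task (e=1, p=2), a smoothing task (e=1, p=3) and a path-search subtask (e=1, p=6), the total utilization is exactly 1, so one processor is just enough on paper; adding any further task would require a second one.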


…and select a scheduling algorithm

- P-fairness scheduling scheme (S.K. Baruah, N.K. Cohen, C.G. Plaxton, D.A. Varvel):
  - Defines a notion of proportionate progress called P-fairness
  - Uses it to define an efficient algorithm solving the periodic scheduling problem
- Cache-aware P-fair based scheduling scheme (J.H. Anderson, J.M. Calandrino, U.M. Devi)
  - Extends the P-fairness approach to avoid co-scheduling threads that would worsen the performance of shared caches
- Task-grouping P-fair based scheduling scheme (J.H. Anderson, J.M. Calandrino)
  - Extends the P-fairness approach to encourage grouping of tasks that share a common working set
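
To give a feel for proportionate progress, the sketch below splits each task of weight e/p into unit subtasks, each with a release slot floor((i-1)/w) and a pseudo-deadline ceil(i/w), and schedules the earliest pseudo-deadline first. This is a deliberate simplification: it omits the tie-breaking rules of the published P-fair algorithms, so it should not be read as the authors' algorithm, and the task names and values are hypothetical.

```python
def window(e, p, i):
    """Release slot and pseudo-deadline of the i-th unit subtask (i >= 1)
    of a task with weight w = e / p, using integer arithmetic."""
    release = ((i - 1) * p) // e   # floor((i - 1) / w)
    deadline = -((-i * p) // e)    # ceil(i / w)
    return release, deadline


def epdf_schedule(tasks, processors, horizon):
    """tasks maps a name to (e, p). Returns (schedule, missed), where
    schedule[t] lists the tasks run in slot t and missed lists
    (name, subtask index) pairs found past their pseudo-deadline."""
    done = {name: 0 for name in tasks}
    schedule, missed = [], []
    for t in range(horizon):
        ready = []
        for name, (e, p) in tasks.items():
            release, deadline = window(e, p, done[name] + 1)
            if deadline <= t:
                missed.append((name, done[name] + 1))
            if release <= t:
                ready.append((deadline, name))
        ready.sort()  # earliest pseudo-deadline first, ties broken by name
        chosen = [name for _, name in ready[:processors]]
        for name in chosen:
            done[name] += 1
        schedule.append(chosen)
    return schedule, missed
```

Over one hyperperiod, each task should receive exactly e slots per period without ever running far ahead of or behind its ideal share, which is the property the cache-aware and task-grouping variants then build on.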



Answering problem 2 (volatile data) requires a better description of memory models

- Programming models differ in the way they manage memory space
  - Homogeneous models: unified memory
  - Heterogeneous models: host / device space
- Today only homogeneous models offer transparent memory management
- For heterogeneous models, the developer still has to do a lot of work


Programming models differ in the way they manage memory space

[Diagram: Framework, Request Queue, Task, OpenCL Queue and Compute Kernels, divided between Host Memory Space and Device Memory Space]


There is a need for a locking mechanism between the framework and the kernel

Data state | Framework Request | Task Request | Kernel Request | Kernel Execution | Task Update | Framework Update
Inserting Data | OK | OK | LOCK (if Kernel uses data) | OK | OK | OK
Data Ready | OK | OK | OK | OK | OK | OK
Data Locked | OK | OK | OK | LOCK (if Kernel accesses host memory) | OK | OK
Removing Data | OK | OK | LOCK (if Kernel uses data) | OK | OK | OK
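
The locking rules can be expressed as a small lookup. This sketch assumes the cells of the table read row by row under the six stage headings; the enum and function names are hypothetical.

```python
from enum import Enum


class DataState(Enum):
    INSERTING = "Inserting Data"
    READY = "Data Ready"
    LOCKED = "Data Locked"
    REMOVING = "Removing Data"


class Stage(Enum):
    FRAMEWORK_REQUEST = "Framework Request"
    TASK_REQUEST = "Task Request"
    KERNEL_REQUEST = "Kernel Request"
    KERNEL_EXECUTION = "Kernel Execution"
    TASK_UPDATE = "Task Update"
    FRAMEWORK_UPDATE = "Framework Update"


# Only the conditional LOCK cells are listed; every other cell is "OK".
LOCK_RULES = {
    (DataState.INSERTING, Stage.KERNEL_REQUEST): "kernel_uses_data",
    (DataState.LOCKED, Stage.KERNEL_EXECUTION): "kernel_accesses_host_memory",
    (DataState.REMOVING, Stage.KERNEL_REQUEST): "kernel_uses_data",
}


def must_lock(state, stage, kernel_uses_data=False, kernel_accesses_host_memory=False):
    """True when the (state, stage) cell is a LOCK and its condition holds."""
    condition = LOCK_RULES.get((state, stage))
    if condition is None:
        return False
    flags = {
        "kernel_uses_data": kernel_uses_data,
        "kernel_accesses_host_memory": kernel_accesses_host_memory,
    }
    return flags[condition]
```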


It also requires a better description of user data

There are 3 types of user data:
- Read-only memory (e.g. navmesh in a static world)
  - Need to be aware of when user data is available and when it is garbage
- Read/write memory (e.g. navmesh in a dynamic world)
  - Same as the read-only approach, extended to secure data modification stages
- Work memory (e.g. open and closed sets for an A* solver)
  - Located where the solver is actually called



Data lifecycle states are introduced to handle R/O and R/W data volatility and dynamicity

[Statechart: Data Lifecycle States. States: Data Ready, Notifying Data To Be Inserted, Data in Insertion, Data in Removal, Notifying Data Removed, Data Locked. Transitions: On Dependency Insertion / Removal, Dependency Inserted / Removed]

CRITICAL when data are not owned by middleware


Problem 3 (triggering logic) requires choosing between a pull and a push triggering mechanism

- To limit computations over time, it is important to decide whether we want a pull or push triggering model
  - In a push model, the system polls all the characters to get a new steering policy
  - In a pull model, the system gets update requirements from the game engine and only performs computations on the related characters
- The pull model controls the amount of computation better
  - Not really compatible with a realtime approach
- The push model offers opportunities for optimizing from a cache and task-grouping point of view
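
The two triggering models, as defined on this slide, differ only in which characters get visited on an update. A minimal sketch, with hypothetical function names and a caller-supplied `compute` job:

```python
def update_push(characters, compute):
    """Push model as described above: visit every character on each update."""
    return {name: compute(state) for name, state in characters.items()}


def update_pull(characters, requested, compute):
    """Pull model: only characters the game engine asked about are recomputed."""
    return {name: compute(characters[name]) for name in requested if name in characters}
```

The push variant always touches the whole population in a predictable order, which is what makes cache-aware batching and task grouping possible; the pull variant does less work but its load depends on what the game engine requests.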


MASAI: THE PREMISES OF A MASSIVELY PARALLEL AI SOLUTION


Guidelines for a new parallel programming model for realtime AI

- Extends to the full AI the rationale described in the previous slides
- Data / message flow based system
- Realtime P-fair scheduling algorithm
- Compatible with heterogeneous programming models
- Push triggering mechanism


Introducing the concept of Working Unit

- A WU receives requests to process
- A WU communicates with another WU ONLY through strongly typed requests
- Requests are explicitly exposed in the WU interface
- A request can be synchronous or asynchronous (2 different implementations of the request)
- A WU is responsible for the Host<->Device serialization of its context

[Diagram: a Working Unit split into Host Code and Device Code, with Owner / Children, Event Handler, Incoming Requests Queues, Context, Context Serializer, Requests Interface and Context Accessors]
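
A host-side sketch of a Working Unit might look as follows. It only illustrates the typed-request and synchronous/asynchronous points; device code and context serialization are left out, and every name here (`WorkingUnit`, `CanGoRequest`, `post`, `call`, `process_pending`) is hypothetical rather than taken from an actual API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CanGoRequest:
    """Hypothetical strongly typed request: can an entity move to `goal`?"""
    start: tuple
    goal: tuple


class WorkingUnit:
    """Accepts exactly one request type, synchronously or via a queue."""

    request_type = object  # subclasses narrow this to their request class

    def __init__(self):
        self._incoming = deque()

    def _check(self, request):
        if not isinstance(request, self.request_type):
            raise TypeError(f"expected {self.request_type.__name__}")

    def post(self, request):
        """Asynchronous form: queue the request for process_pending()."""
        self._check(request)
        self._incoming.append(request)

    def call(self, request):
        """Synchronous form: handle the request now and return the result."""
        self._check(request)
        return self.handle(request)

    def process_pending(self, budget):
        """Handle at most `budget` queued requests; return results in order."""
        results = []
        while self._incoming and len(results) < budget:
            results.append(self.handle(self._incoming.popleft()))
        return results

    def handle(self, request):
        raise NotImplementedError


class CanGoWU(WorkingUnit):
    request_type = CanGoRequest

    def __init__(self, blocked):
        super().__init__()
        self.blocked = set(blocked)

    def handle(self, request):
        return request.goal not in self.blocked
```

The per-call budget in `process_pending` is one way a scheduler could bound how much work a unit does in a given time slot.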


The system works on a mixture of events and requests

[Diagram: the Game Engine with its Entities (Entity 1, Entity 2, …), Brains (Brain 1, Brain 2, …), pathfinders (PF 1, PF 2, …) and Worlds (World 1, …); Working Units, each with its own queue: Entity Update WU, Brain Update WU, Pathfinding WU, CanGo WU, IsVisible WU and World Update WU; two managers: Pathdata Mgr and Geometry Mgr. Arrows are labeled either Request or Event]


The underlying architecture would rely on an event broadcaster and communicating components

[Diagram: a Global Events Broadcaster above Local Events Broadcasters, each serving four communicating components (CC): SearchPath, SelectTarget, ComputeDA and Steering]

Communicating Component = Working Unit for parallelism
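
A two-level broadcaster can be sketched with a plain publish/subscribe class. How local and global broadcasters actually relate is not specified here, so the parent link and the `escalate` flag below are assumptions for illustration.

```python
class EventBroadcaster:
    """Delivers named events to subscribers; may forward them to a parent."""

    def __init__(self, parent=None):
        self._subscribers = {}
        self._parent = parent

    def subscribe(self, event_name, callback):
        self._subscribers.setdefault(event_name, []).append(callback)

    def publish(self, event_name, payload=None, escalate=False):
        """Call local subscribers, then optionally pass the event upward."""
        for callback in self._subscribers.get(event_name, []):
            callback(payload)
        if escalate and self._parent is not None:
            self._parent.publish(event_name, payload, escalate=True)
```

In this reading, a local broadcaster keeps one entity's SearchPath, SelectTarget, ComputeDA and Steering components in sync, while escalated events reach listeners shared by all entities.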


Open challenges

- Customized objects vs. data / services model
- Interruptibility
- Multi-platform
- Scheduling algorithm performance
- And many more



Multiplatform

- Too many programming languages!
  - C++
  - C for OpenCL
  - C for CUDA
  - C99 for SPURS
  - HLSL 5 for DirectX
- Which standards will emerge?
- Which standards will be chosen in future consoles?


GAME DEVELOPER ZONE

www.the-area.com/gamedev
