CODES+ISSS 2011 - Compiler Microarchitecture Lab - Arizona State University
Vector Class on Limited Local Memory (LLM) Multi-core Processors

Ke Bai, Di Lu and Aviral Shrivastava

Compiler Microarchitecture Lab
Arizona State University, USA
http://www.aviral.lab.asu.edu

Summary

- Cannot improve performance without improving power-efficiency
  - Cores are becoming simpler in multicore architectures
  - Caches are not scalable (in both power and performance)
- Limited Local Memory multicore architectures
  - Each core has a scratch pad (e.g., the Cell processor)
  - Need explicit DMAs to communicate with global memory
- Objective:
  - How to enable the vector data structure (dynamic arrays) on the LLM cores?
- Challenges:
  1. Use the local store as a temporary buffer (e.g., a software cache) for vector data
  2. Dynamic global memory management, and arbitration of core requests
  3. How to use pointers when the data pointed to may have moved?
- Experiments:
  - Vectors of any size are supported
  - All SPUs may use the vector library simultaneously, and the library is scalable

From multi- to many-core processors

[Figure: IBM XCell 8i, GeForce 9800 GT, Tilera TILE64]

- Simpler design and verification
  - Reuse the cores
- Can improve performance without much increase in power
  - Each core can run at a lower frequency
- Tackle thermal and reliability problems at core granularity

Memory Scaling Challenge

[Figure: power breakdown of the StrongARM 1100 (D-Cache 19%, I-Cache 25%, D-MMU 5%, I-MMU 4%, ARM9 core 25%, PATag RAM 1%, CP15 2%, BIU 8%, SysCtl 3%, Clocks 4%, Other 4%); Intel 48-core chip]

- In Chip Multi-Processors (CMPs), caches guarantee data coherency
  - Bring required data from wherever it is into the cache
  - Make sure that the application gets the latest copy of the data
- Caches consume too much power
  - 44% of power, and more than 34% of area
- Cache coherency protocols do not scale well
  - The Intel 48-core Single-chip Cloud Computer has non-coherent caches



Limited Local Memory Architecture

[Figure: IBM Cell block diagram: a PPE and SPEs 0-7, each SPE containing an SPU and a Local Store (LS), connected by the Element Interconnect Bus (EIB) to off-chip global memory. PPE: Power Processor Element; SPE: Synergistic Processor Element; LS: Local Store]

- Cores have small local memories (scratch pads)
- A core can only access its own local memory
- Accesses to global memory go through explicit DMAs in the program
- e.g., the IBM Cell architecture, which is in the Sony PS3

LLM Programming

- Task-based programming, MPI-like communication

Main Core:

    #include <libspe2.h>
    extern spe_program_handle_t hello_spu;

    int main(void)
    {
        int speid, status;
        speid = spe_create_thread(&hello_spu);
    }

Local Core (the same program is replicated on each SPE):

    #include <spu_mfcio.h>

    int main(speid, argp)
    {
        printf("Hello world!\n");
    }

- Extremely power-efficient computation
  - if all code and data fit into the local memory of the cores
- Otherwise, efficient data management is required!

Managing data

Original Code:

    int global;

    f1(){
        int a, b;
        global = a + b;
        f2();
    }

Local Memory Aware Code:

    int global;

    f1(){
        int a, b;
        DMA.fetch(global)
        global = a + b;
        DMA.writeback(global)

        DMA.fetch(f2)
        f2();
    }
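The fetch/compute/writeback pattern above can be simulated off-target; this is a minimal sketch in which std::memcpy stands in for the DMA transfers (on a real SPE these would be mfc_get/mfc_put calls), and the initial values of a and b are chosen arbitrarily:

```cpp
#include <cstring>

// Simulated sketch of the fetch / compute / writeback pattern above.
// std::memcpy stands in for the DMA transfers; on a real SPE these
// would be mfc_get / mfc_put calls on the memory flow controller.
int global_mem = 0;   // stands in for off-chip global memory

void f1() {
    int a = 3, b = 4;
    int local;                                       // local-store copy
    std::memcpy(&local, &global_mem, sizeof(int));   // DMA.fetch(global)
    local = a + b;                                   // compute on the copy
    std::memcpy(&global_mem, &local, sizeof(int));   // DMA.writeback(global)
}
```

The point of the transformation is that the core never dereferences global memory directly: every access is a transfer into the local store, a computation on the local copy, and an explicit transfer back.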


Vector Class Introduction

The Vector Class is a widely used library for programming!

- One of the classes in the Standard Template Library (STL) for C++
- Implemented as a dynamic array; a sequential container
  - Elements are stored in contiguous storage locations
  - Can be accessed by using iterators or offsets on regular pointers to elements
- Compared to arrays:
  - Vectors have the ability to be easily resized
  - Capacity increases and decreases are handled automatically
  - They usually consume more memory than arrays when their capacity is handled automatically
  - This is in order to accommodate extra storage space for future growth
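The automatic-growth behavior can be sketched with a toy dynamic array (a hypothetical MiniVec, not the STL implementation): when size reaches capacity, allocate a larger contiguous block, copy the elements over, and free the old block.

```cpp
#include <cstddef>
#include <cstring>

// Toy dynamic array (hypothetical MiniVec, NOT the STL implementation)
// illustrating the automatic growth described above: when size reaches
// capacity, allocate a larger contiguous block, copy, free the old one.
struct MiniVec {
    int*        data = nullptr;
    std::size_t size = 0;
    std::size_t cap  = 0;

    void push_back(int v) {
        if (size == cap) {                        // no unused space left
            std::size_t ncap = cap ? cap * 2 : 1; // geometric growth
            int* ndata = new int[ncap];
            if (data) {
                std::memcpy(ndata, data, size * sizeof(int));
                delete[] data;                    // old space is freed
            }
            data = ndata;
            cap  = ncap;
        }
        data[size++] = v;                         // storage stays contiguous
    }
    ~MiniVec() { delete[] data; }
};
```

Geometric growth keeps push_back amortized O(1), at the cost of up to 2x over-allocation; this is exactly why vectors usually consume more memory than plain arrays.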


Vector Class Management

SPE code:

    main() {
        vector<int> vec;
        for(int i = 0; i < N; i++)
            vec.push_back(i);
    }

Max N is 8192. But 8192 INTs is only 32KB, far less than the 256KB of local memory. Why does it crash so early?

- All code and data need to be managed
  - This paper focuses on vector data management
- Vector management is difficult
  - Vector size is dynamic and can be unbounded
  - The Cell programming manual suggests "Use dynamic data at your own risk".
  - Restricting the usage of dynamic data is restrictive for programmers.

Outline of the Talk

- Motivation
- Related Works on Vector Data Management
- Our Approach of Vector Data Management
- Experiments

Related Works

[Figure: LLM architecture: multiple SPEs, each with its own local memory, connected by DMA to the global memory]

- Different threads can access a vector concurrently, whether it is in one address space or in different spaces.
- Prior libraries provide efficient parallel implementations, abstract platform details, give programmers an interface to express the parallelism of the problem, and automatically translate from one space to another
  - Shared memory: MPTL [Baertschiger2006], MCSTL [Singler2007] and Intel TBB [Intel2006]
  - Distributed memory: POOMA [Reynders1996], AVTL [Sheffler1995], STAPL [Buss2010] and PSTL [Johnson1998]
- They ensure data coherency across different spaces. But what if the size of the local memory is small?

Space Allocation and Reallocation

- push_back & insert
  - Add elements
  - The vector needs to be re-allocated into a larger space when there is no unused space left

[Figure: (a) the vector uses up its allocated space (0x010100-0x010200); (b) a larger space is allocated (0x010500-0x010700) and all data is moved]

An unlimited vector requires evicting older vector data to global memory and reallocating more global memory!

Space Allocation and Reallocation

- Static buffer?
  - Small vector -> low utilization; large vector -> overflow
- An SPU thread can't use malloc() and free() on global memory
- Hybrid: DMA + mailbox
  - (1) The SPE thread transfers the request parameters to the PPE by DMA:

        struct msgStruct {
            int vector_id;
            int request_size;
            int data_size;
            int new_gAddr;
        };

  - (2) The SPE signals the operation type through the mailbox
  - (3) The PPE thread services the request on global memory (allocating space and moving vector data)
  - (4) The PPE sends the restart signal back to the SPE
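A host-side simulation of this request/response round trip can be sketched as follows. This is illustrative only: serve_request is a hypothetical stand-in for the PPE thread (which really listens on the SPE's mailbox and receives the record by DMA), and new_gAddr is widened to uintptr_t here so a host pointer fits.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical host-side simulation of the DMA + mailbox protocol:
// the SPE fills a request record and signals the PPE; the PPE-side
// handler allocates the requested global space and writes the new
// address back, which is what lets the SPE restart.
struct msgStruct {
    int            vector_id;
    int            request_size;   // bytes of global memory requested
    int            data_size;      // bytes of live vector data to move
    std::uintptr_t new_gAddr;      // filled in by the PPE-side handler
};

void serve_request(msgStruct* msg) {
    // PPE side: allocate the requested global space and report it back
    msg->new_gAddr =
        reinterpret_cast<std::uintptr_t>(std::malloc(msg->request_size));
}
```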


Element Retrieving

Example with a block size of 16: for the 133rd element, block index = 133 / 16 * 16 = 128 (integer division)

[Figure: elements grouped into 16-element blocks: Block 0 holds the 0th through 15th elements, ..., one block holds the 128th through 143rd elements]

- Block index: the index of the 1st element in the block
- Each block contains its block index, besides the data; blocks are kept in a linked list.
- Global address:
- Based on the global address, we can know whether the block is in the local memory or not. If not, fetch it.
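The block-index computation above can be sketched directly: with a block size of 16, integer division rounds an element index down to the index of the first element in its block.

```cpp
// Sketch of the block-index computation from the slide: integer
// division by the block size, then multiplication by it, rounds an
// element index down to the index of the block's first element.
constexpr int kBlockSize = 16;   // 16 elements per block, as in the example

int block_index(int element_index) {
    return element_index / kBlockSize * kBlockSize;   // e.g. 133 -> 128
}
```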


Vector Function Implementation

- In order to keep semantics, we implemented all functions. But only the insert function is shown here.
- The original insertion can take advantage of pointers:

    for (……)
        (*b++) = (*a++);

- But element shifting is now a challenging task under the LLM architecture
  - Because we cannot use pointers in the local memory to access global memory, and DMA requires alignment

[Figure: inserting a new element: in shared memory the trailing elements are shifted with pointers; under LLM the shift spans both local memory and global memory]

Pointer Problem

- In order to support limitless vector data, global memory must be leveraged.
- Two address spaces co-exist; no matter what scheme is implemented, the pointer issue exists.

    struct S {
        ……
        int* ptr;
    }

[Figure: (a) a pointer in local memory points to a vector element; (b) the vector element is moved to global memory, and the pointer no longer refers to it]

The pointer problem needs to be solved!

Pointer Resolution

(a) Original Program:

    main()
    {
        vector<int> vec;
        int* a = &vec.at(index);
        int sum = 1 + *a;
        int* b = a;
    }

(b) Transformed Program:

    main()
    {
        vector<int> vec;
        int* a = ppu_addr(vec, index);
        a = ptrChecker(a);
        int sum = 1 + *a;
        a = s2p(a);
        int* b = a;
    }

- ppu_addr: returns the global address ptr pointing to the vector element.
- ptrChecker:
  - checks whether ptr is pointing to vector data;
  - guarantees the data pointed to is in the local memory;
  - returns the local address.
- s2p: transforms the local address back to a global address
  - A local address should not be used to identify the data.
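The ptrChecker/s2p pair can be sketched with a lookup table. This is a hypothetical AddrTable whose names mirror the slide; the map-based lookup (and returning nullptr instead of DMA-fetching on a miss) is illustrative, not the paper's implementation.

```cpp
#include <unordered_map>

// Hypothetical sketch of the ptrChecker / s2p pair: a table records
// which local-store slot currently holds the copy of each global
// (PPU-side) address.
struct AddrTable {
    std::unordered_map<long, int*> g2l;   // global address -> local copy

    // global -> local; returns nullptr if the data is not resident
    // (the real ptrChecker would DMA-fetch the block in that case)
    int* ptrChecker(long gaddr) {
        auto it = g2l.find(gaddr);
        return it == g2l.end() ? nullptr : it->second;
    }

    // local -> global: the reverse lookup, so only the stable global
    // address is ever stored back into a pointer variable
    long s2p(int* laddr) {
        for (const auto& kv : g2l)
            if (kv.second == laddr) return kv.first;
        return 0;
    }
};
```

Translating back through s2p before a pointer is stored is what makes the scheme safe: the local slot may be reused for other data, but the global address stays valid even after the element moves.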


Experimental Setup

- Hardware
  - PlayStation 3 with the IBM Cell BE
- Software
  - Operating System: Linux Fedora 9 and IBM SDK 3.1
  - Benchmarks: some possible applications using vector data.

Unlimited Vector Data

[Figure: runtime (s, log scale, 0.001 to 100) vs. total number of integers (90 to 9,000,000) for our improved vector class and the original vector class; the original crashes at N0 = 8192]

Why? The allocated space doubles on every reallocation: 4 B, then 8 B, 16 B, ..., 2^(n+2) B after the nth reallocation.

Impact of Block Size

[Figure: runtime (s, log scale) vs. block size (4 to 256 elements per block) for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]

Impact of Buffer Space

[Figure: runtime (s) vs. buffer size (512 to 4096 elements per buffer) for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]

buffer_size = number_of_blocks × block_size

Impact of Associativity

[Figure: runtime (s) per benchmark (heap sort, radix sort, FFT, invfft, dijkstra, SOR, sparse matrix) for direct-mapped, 2-way, 4-way and 8-way associative lookup]

Higher associativity -> more computation spent on looking up the data structure, but a lower miss ratio

Scalability

[Figure: runtime (s) vs. number of cores (1 to 6) for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]

Summary

- Cannot improve performance without improving power-efficiency
  - Cores are becoming simpler in multicore architectures
  - Caches are not scalable (in both power and performance)
- Limited Local Memory multicore architectures
  - Each core has a scratch pad (e.g., the Cell processor)
  - Need explicit DMAs to communicate with global memory
- Objective:
  - How to enable the vector data structure (dynamic arrays) on the LLM cores?
- Challenges:
  1. Use the local store as a temporary buffer (e.g., a software cache) for vector data
  2. Dynamic global memory management, and arbitration of core requests
  3. How to use pointers when the data pointed to may have moved?
- Experiments:
  - Vectors of any size are supported
  - All SPUs may use the vector library simultaneously, and the library is scalable