Copyright 2013, Toshiba Corporation.

DAC2013 Designer/User Track

Scalability Achievement by Low-Overhead, Transparent Threads
on an Embedded Many-Core Processor

Takeshi Kodaka, Akira Takeda, Shunsuke Sasaki, Akira Yokosawa,
Toshiki Kizu, Takahiro Tokuyoshi, Hui Xu, Toru Sano, Hiroyuki Usui,
Jun Tanabe, Takashi Miyamori and Nobu Matsumoto


Center for Semiconductor Research and Development

Toshiba Corporation


Background


Requirements for embedded processors

- Various types of processing
  - Video codecs (HEVC, H.264, MPEG-2, WMV, ...)
  - Face detection/recognition, audio/video playback, mobile TV
- Wide range of required processing performance
  - Should cover various types of products, from mobile phones to tablets or more
  - Example: video decoding from QVGA 15 fps to 1080p 60 fps or more
- Low-cost, short-time development that meets market requirements
  - Reuse existing software to reduce development cost


Challenges


- What kind of hardware architecture to employ?
  - The number of cores should be easily increased/decreased
  -> Multiple-core architecture [xu2012low]
- How can we realize the scalable performance?
  - Parallelized application program that utilizes multiple cores efficiently
- How can we realize the transparency?
  - Hiding the number of cores from the application program
  -> Our proposed scheduler

[xu2012low] H. Xu, et al., "A low power many-core SoC with two 32-core clusters connected by tree based NoC for multimedia applications," VLSI Symposium 2012.


Our approach

A simple multiple-core architecture
+ An application program independent of # of cores
+ An efficient parallel processing scheme
-> Achieving scalable performance


Strategy to realize our approach


Strategy

- Developing an application independent of # of cores
  -> transparency
- Running the developed application on a multiple-core processor and achieving scalable performance proportional to # of cores
  -> scalable performance

Scheme

- Designed an efficient thread scheduler
  - Efficient management of threads may achieve scalable performance
  - The number of cores may be hidden if a thread scheduler abstracts the cores

Challenges

- Minimizing overheads for execution
- Hiding the number of cores from the application program


How to minimize overheads


- Defined unique properties for threads
  - A thread never suspends to wait for data
    - This eliminates the overhead of thread switching
  - A thread becomes ready to run when all necessary data are available
- Managed thread status using simple counters
  - Simplify the dependency into "the number of dependencies"
    - This can be realized by simple operations
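The counter-based status tracking above can be sketched in a few lines. This is a minimal illustration under assumptions, not the scheduler's actual code: the names `CountedThread`, `pending`, and `notify_data_available` are hypothetical, and a Python `deque` stands in for the queue of ready threads.

```python
from collections import deque


class CountedThread:
    """A run-to-completion task gated by a simple dependency counter."""

    def __init__(self, name, num_dependencies):
        self.name = name
        self.pending = num_dependencies  # number of inputs still missing


def notify_data_available(thread, ready_queue):
    """Record that one input of `thread` has arrived.

    The whole status update is one decrement and one compare with zero.
    """
    thread.pending -= 1
    if thread.pending == 0:
        ready_queue.append(thread)  # all inputs present: ready to run


ready_queue = deque()
decode = CountedThread("decode_block", num_dependencies=2)

notify_data_available(decode, ready_queue)
first_check = [t.name for t in ready_queue]   # one input is still missing
notify_data_available(decode, ready_queue)
second_check = [t.name for t in ready_queue]  # now the thread is ready
```

Because a thread enters the queue only once its counter reaches zero, it can run to completion without ever waiting for data, which is what removes the need for thread switching.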



How to hide the number of cores


- Designed a distributed scheduler with a shared queue
  - ONLY ready threads are placed in a shared queue
  - A thread dispatcher runs on each core
    - The dispatcher fetches a thread from the shared queue and executes it
- To reduce access conflicts on the shared queue
  - We use the CAS (Compare And Swap) instruction
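A rough sketch of how dispatchers might claim work from a shared queue with compare-and-swap follows. Python has no CAS instruction, so the hypothetical `AtomicInt` class emulates one with a lock; `fetch`, `dispatcher`, and the index-based queue layout are likewise assumptions made for illustration, not the layout used on the chip.

```python
import threading


class AtomicInt:
    """Software stand-in for a word updated by a hardware CAS instruction."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # emulates the atomicity of one CAS

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the value still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def fetch(ready, head):
    """Claim the next ready thread, or return None when none is left."""
    while True:
        index = head.load()
        if index >= len(ready):
            return None
        if head.compare_and_swap(index, index + 1):
            return ready[index]  # this dispatcher won slot `index`
        # Another dispatcher claimed the slot first: retry with the new head.


def dispatcher(ready, head, executed):
    """Per-core loop: fetch a ready thread and run it to completion."""
    while (work := fetch(ready, head)) is not None:
        executed.append(work())  # list.append is thread-safe in CPython


ready = [lambda i=i: i * i for i in range(100)]  # 100 ready threads
head = AtomicInt()
executed = []
cores = [threading.Thread(target=dispatcher, args=(ready, head, executed))
         for _ in range(4)]
for core in cores:
    core.start()
for core in cores:
    core.join()
```

A dispatcher that loses the race simply retries, so no core ever blocks while holding the queue, and each ready thread is executed exactly once.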

[Figure: three cores, each running a Thread Dispatcher that searches the shared queue of threads, then fetches and executes one.]


Implemented thread scheduler


- Our thread scheduler consists of three components
  - Dependency Controller, Thread Pool, and Thread Dispatcher
- Our thread scheduler ...
  - is low overhead, for scalable performance
  - hides the number of cores from the application, for transparency

[Figure: Thread Scheduler block diagram. Labels: Appl., register, Dependency Controller, Thread Pool (threads with "available" and "necessary" counts), ready, fetch & execute, and a Thread Dispatcher on each core.]
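To make the division of labor among the three components concrete, here is a toy model, assuming a design that the slides do not spell out. The class `ThreadScheduler`, its `register` and `run` methods, and the example application are all hypothetical; a lock and `queue.Queue` replace the CAS-based queue for brevity.

```python
import queue
import threading


class ThreadScheduler:
    """Toy model: dependency controller + thread pool + dispatchers."""

    def __init__(self, num_cores):
        self.num_cores = num_cores   # the only place a core count appears
        self.pool = queue.Queue()    # thread pool: holds READY threads only
        self.pending = {}            # dependency controller: name -> counter
        self.successors = {}         # name -> threads waiting on its output
        self.bodies = {}
        self.lock = threading.Lock()
        self.remaining = 0

    def register(self, name, body, depends_on=()):
        """Called by the application, which never mentions cores."""
        self.bodies[name] = body
        self.pending[name] = len(depends_on)
        self.successors.setdefault(name, [])
        for dep in depends_on:
            self.successors.setdefault(dep, []).append(name)

    def _dispatcher(self):
        """One per core: fetch a ready thread, run it, release successors."""
        while True:
            name = self.pool.get()
            if name is None:         # shutdown marker
                return
            self.bodies[name]()      # runs to completion, never suspends
            with self.lock:
                self.remaining -= 1
                done = self.remaining == 0
                for nxt in self.successors[name]:
                    self.pending[nxt] -= 1
                    if self.pending[nxt] == 0:
                        self.pool.put(nxt)
            if done:                 # last thread finished: stop every core
                for _ in range(self.num_cores):
                    self.pool.put(None)

    def run(self):
        self.remaining = len(self.bodies)
        if self.remaining == 0:
            return
        for name, count in self.pending.items():
            if count == 0:
                self.pool.put(name)
        cores = [threading.Thread(target=self._dispatcher)
                 for _ in range(self.num_cores)]
        for core in cores:
            core.start()
        for core in cores:
            core.join()


def build_and_run(num_cores):
    """The same application code, whatever the core count."""
    results = {}
    sched = ThreadScheduler(num_cores)
    sched.register("a", lambda: results.update(a=1))
    sched.register("b", lambda: results.update(b=2))
    sched.register("sum",
                   lambda: results.update(sum=results["a"] + results["b"]),
                   depends_on=("a", "b"))
    sched.run()
    return results


outputs = [build_and_run(n) for n in (1, 2, 8)]
```

The application only calls `register`; the core count is a constructor argument of the scheduler, which is one way to read the transparency claim.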

Designing a many-core processor

- Design goals for a many-core processor
  - Achieve scalable performance
  - Reuse existing software for a multi-core processor
    - A many-core processor has to execute existing software efficiently
    - Knowledge of the software is absolutely necessary
- Software engineers and hardware engineers collaborated closely to design a many-core processor
- Design cycles
  - Use a "Plan -> Evaluate -> Analyze -> Improve" cycle
  - Existing software is used throughout evaluation
  - At the 1st cycle: detect issues of the existing architecture
  - At the 2nd cycle: improve and optimize
- Main design features from our development cycle
  - CAS instruction, multi-bank L2 cache, tree-based network on chip

[Figure: design cycle: Plan -> Evaluate using simulation -> Analyze -> Improve]

Evaluation results

- Used the SAME application binary even if the number of cores is changed
- These results confirm that the proposed thread scheduler achieves scalable performance with transparency!

[Figure: two plots, "H.264 Decoding 1080p" and "Super resolution (full HD to 4K2K)", each annotated "Scalable Performance". An annotation marks "Lack of READY threads": # of ready threads < # of MPEs.]
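The "lack of READY threads" effect can be reproduced with a small deterministic model. This sketch does not use the measured H.264 or super-resolution workloads: the function `makespan`, the unit-time threads, and the four-stage dependency graph are all assumptions chosen only to show the shape of the curve.

```python
def makespan(dependencies, num_cores):
    """Time steps needed when every thread takes one step on one core."""
    pending = {name: len(deps) for name, deps in dependencies.items()}
    successors = {name: [] for name in dependencies}
    for name, deps in dependencies.items():
        for dep in deps:
            successors[dep].append(name)
    ready = [name for name, count in pending.items() if count == 0]
    steps = 0
    while ready:
        # At most `num_cores` ready threads run in this step.
        running, ready = ready[:num_cores], ready[num_cores:]
        for name in running:
            for nxt in successors[name]:
                pending[nxt] -= 1
                if pending[nxt] == 0:
                    ready.append(nxt)  # becomes runnable from the next step
        steps += 1
    return steps


# Hypothetical workload: 4 stages of 8 threads; each thread needs the
# output of every thread in the previous stage, so at most 8 are ready.
dependencies = {}
for stage in range(4):
    for i in range(8):
        previous = [(stage - 1, j) for j in range(8)] if stage > 0 else []
        dependencies[(stage, i)] = previous

baseline = makespan(dependencies, 1)
speedup = {n: baseline / makespan(dependencies, n)
           for n in (1, 2, 4, 8, 16, 32)}
```

In this model the speedup grows linearly up to 8 cores and then stays at 8, because beyond that point the number of ready threads is smaller than the number of cores.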

Conclusions


- Proposed a low-overhead thread scheduler
  - It achieves scalable performance and transparency
- Reduces thread execution overheads
  - Defined unique properties for a thread
    - A thread never suspends
    - A thread becomes ready when all necessary data are available
  - Managed thread status by the number of dependencies
- Hides the number of cores
  - Designed a distributed scheduler with a shared queue
- Confirmed performance scalability and transparency
  - Evaluated on a real 32-core many-core processor
  - Scalable performance is achieved without modification of the application program



Our scheduler contributes to the reduction of the software development cost