SoC Memory Management

Dec 14, 2013

1

Memory Design for Multi-Core System on Chip


2

Introduction

The DSP processor is optimized for extremely high performance on a specific class of arithmetic-intensive algorithms.

Data path optimization: operations like multiply-accumulate should take only one clock cycle.

Memory architecture optimization: large amounts of data must be moved to and from memory.


3

A FIR filter is a typical DSP application


4

Example: FIR Filter

If a multiply-accumulate can be done in a single clock cycle, a new sample of a k-tap FIR filter could be computed in k cycles, if there were no delay due to memory access.

However, several memory accesses are necessary:

1. Fetch the multiply-accumulate instruction
2. Read the delayed data value (x_i)
3. Read the coefficient value (c_i)
4. Write the data value into the next delay location in memory (x_i -> x_{i-1})
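The per-sample computation above can be sketched in C as follows. The function name `fir_step` and the array names are illustrative; on a real DSP, `x` and `c` would live in separate memory banks and the loop body would map to a single-cycle MAC. The sketch assumes k >= 1.

```c
#include <stddef.h>

/* One output sample of a k-tap FIR filter.
 * x[] is the delay line (x[0] is the newest sample),
 * c[] holds the k coefficients. */
double fir_step(double x[], const double c[], size_t k, double new_sample)
{
    /* shift the delay line: x_i -> x_{i-1} (step 4 on the slide) */
    for (size_t i = k - 1; i > 0; --i)
        x[i] = x[i - 1];
    x[0] = new_sample;

    double acc = 0.0;
    for (size_t i = 0; i < k; ++i)
        acc += c[i] * x[i];   /* multiply-accumulate (steps 1-3) */
    return acc;
}
```

Each loop iteration needs the instruction fetch plus two data reads, which is exactly why the memory architectures on the following slides matter.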

5

Memory Structure

• The common memory structure used by general-purpose processors is the Von Neumann architecture
• The processor can make one memory access per instruction cycle


6

Original Harvard Architecture

The processor is connected to two memories (one for instructions, one for data) via independent buses.


7

Modified Harvard Architecture

The processor is connected to two memories (both holding instructions and data) via independent buses.


8

Comparison

• An implementation of the FIR filter needs, per sample:
o 4 instruction cycles (Von Neumann)
o 3 instruction cycles (Original Harvard)
o 2 instruction cycles (Modified Harvard)
• Thus there is of course the possibility to have more than two independent memory banks, which is also used in some DSPs


9

Multiple Memory Buses

• Multiple memory buses outside the chip are costly.
• DSP processors generally provide only two off-chip buses (address and data bus)
• Processors with multiple memory banks usually provide a small amount of on-chip memory


10

Multiple Memory Access: Fast Memories

• Multiple memory accesses can be achieved by using faster memories that support multiple memory accesses per instruction cycle
• Fast memories can be combined with a Harvard architecture to achieve higher memory bandwidth


11

Multiple Memory Access: Multi-Port Memories

Multi-port memories have multiple independent sets of address and data connections.


12

• DSP processors often include a program cache to avoid accessing the main memory for certain instructions
• In general, DSP processor caches are much smaller and simpler than caches in general-purpose processors

DSP processor caches

13

DSP Processor Caches

• Single-Instruction Repeat Buffer
An instruction is loaded into the repeat buffer (initiated by the programmer). If the repeat instruction is used, the processor can make an extra memory access within a single cycle.
• Extended Repeat Buffer
A whole block of instructions is loaded into the repeat buffer.


14

DSP Processor Caches

• Single-Sector Instruction Cache
o The cache stores a number of the most recent instructions, which lie in a single contiguous region of program memory.
o The cache is loaded automatically with these instructions during program execution

• Multiple-Sector Instruction Cache
o Two or more independent sectors of memory can be stored in the cache
o If an instruction belongs to a sector other than those stored in the cache (sector miss), one of the sectors in the cache is replaced by the sector of that instruction


15

DSP Processor Caches

• Some DSP processors provide special instructions that allow the programmer to lock the contents of the cache or to disable the cache
o This may lead to better performance if the programmer knows the behavior of the program


16

DSP Processor Caches

• DSP processor caches are in general used only for program instructions and not for data
o Caches that accommodate data as well as instructions must include a mechanism to write data back to the external memory
o If this is not the case, the deletion of the cache contents means that updates of data in the cache are lost
• In order to use caches in an efficient way, algorithms should exploit data locality
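A minimal illustration of data locality, assuming a row-major C matrix (the names `sum_rowwise` and `sum_colwise` are invented for this sketch): both functions compute the same sum, but only the first walks memory in unit stride, so each fetched cache line is fully used before it is evicted.

```c
#include <stddef.h>

#define N 64

/* Cache-friendly: inner loop has unit stride in row-major storage. */
double sum_rowwise(double m[N][N])
{
    double s = 0.0;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            s += m[i][j];          /* sequential accesses: good locality */
    return s;
}

/* Cache-hostile: inner loop strides by N elements, so almost every
 * access touches a different cache line. */
double sum_colwise(double m[N][N])
{
    double s = 0.0;
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 0; i < N; ++i)
            s += m[i][j];          /* stride-N accesses: poor locality */
    return s;
}
```

The results are identical; only the access pattern, and hence cache behavior, differs.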


17

External Memory Interfaces

• Most DSPs provide a single external memory port consisting of an address bus, a data bus and a set of control signals


18

External Memory Interfaces

• The lack of external memory ports means that multiple external memory accesses cannot be done within a single clock cycle
• There are DSP processors that provide multiple off-chip memory ports


19

Multiprocessor Support in DSP External Interfaces

• DSPs intended for multiprocessor systems often provide special features to simplify the design of multi-processor DSP systems
• Examples:
o Two external memory ports
o A sophisticated shared-bus interface that allows several DSPs to be connected together with no special hardware or software

• P. Lapsley, J. Bier, A. Shoham, E. A. Lee, DSP Processor Fundamentals, IEEE Press, 1997



20

FAST COMPUTATIONS ON A LOW-COST DSP-BASED SHARED-MEMORY MULTIPROCESSOR SYSTEM

ICECS 2002

Charalambos S. Christou



21

Introduction

Processor performance has increased dramatically over the past few years, while memory latency and bandwidth have progressed at a much slower pace.

Large latencies have considerably reduced the number of processors that can be effectively supported in shared-memory parallel computers.

=> New cost-effective parallel system:

1. Reduces memory latency
2. Effectively supports a greater number of processing elements for faster DSP computations




22

TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (1)

The proposed multiprocessor is a high-speed, low-cost DSP-based twin-prefetching shared-memory MIMD parallel system.

Figure 1. Twin-prefetching multiprocessor system diagram

23

Advantages of Shared Memory

Ease of programming when communication patterns are complex or vary dynamically during execution

Lower communication overhead; good utilization of communication bandwidth for small data items; no expensive I/O operations

Hardware-controlled caching to reduce remote communication when remote data is cached

The other flavor: message passing (e.g. MPI)


24

SMP Interconnect

Processors connect to memory AND to I/O

Bus based: all memory locations have equal access time, so SMP = Symmetric MP

Limited bandwidth is shared with the other processors and I/O


25

TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (2)

Data Memory

- Comprised of two controllers (twin TTCs) and two fast memories (twin-prefetching caches)
- The two TTC/cache pairs are Twin1 and Twin2:
=> one Twin is accessible to the processor, providing data operands
=> the other Twin is transferring data from/to the shared memory (i.e., as soon as a block of data is moved into the cache)
- Loading (input image segments) and unloading (results) from/to the Twins occur simultaneously with data processing.
- The back-and-forth switching of Twin1 and Twin2 allows maximum utilization of resources, and thus optimum system performance


26

TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (3)

Host Processor

- The host can directly read or write asynchronously the internal memory of any ADSP-21060 via the host bus
- The host processor is responsible for booting all nodes and downloading all necessary code and some data to the internal memory of every processor.
- The data downloaded to the internal memories include the addresses of the image segments in the global memory which every node is assigned to process.


27

Results (1)





28

Results (2)





29

An Efficient Dynamic Memory Manager for Embedded Systems

Most embedded systems rely on statically allocated memory to avoid problems with garbage collection.

Solution to the garbage-collection problem:

Manual memory management system: the systems that would benefit from using dynamic memory management are applications where the clients involved do not all need instant access to the maximum memory that they could claim.

30

An Efficient Dynamic Memory Manager for Embedded Systems

The DMMS works as an address translator for the clients.

The DMMS contains an Arbiter for granting access to the different clients.

Three types of requests: allocation, deallocation, R/W.

31

An Efficient Dynamic Memory Manager for Embedded Systems

The interface to the clients consists of four parts: allocation, deallocation, R/W, and maintenance.

In order to achieve as high a degree of memory utilisation as possible, it is important to perform a thorough analysis of the optimal block size.

32

An Efficient Dynamic Memory Manager for Embedded Systems

The key issue in the DMMS is to make the clients see a contiguous memory.

A naive implementation of the DMM would be to have a separate Address Translation Table (ATT) for each client.
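The naive per-client ATT can be sketched as a simple lookup table. The block size, table size, and field names below are invented for illustration, not the paper's actual design: the client addresses a contiguous logical space, and the ATT maps each logical block to whatever physical block the manager allocated.

```c
#include <stdint.h>

#define BLOCK_SIZE 256u   /* bytes per block (power of two, illustrative) */
#define MAX_BLOCKS 16u    /* logical blocks per client (illustrative)     */

/* One client's Address Translation Table:
 * logical block index -> physical block index. */
typedef struct {
    uint32_t phys_block[MAX_BLOCKS];
} att_t;

/* Translate a client's logical address into a physical address.
 * The client never sees that its blocks are scattered. */
uint32_t att_translate(const att_t *att, uint32_t logical_addr)
{
    uint32_t block  = logical_addr / BLOCK_SIZE;
    uint32_t offset = logical_addr % BLOCK_SIZE;
    return att->phys_block[block] * BLOCK_SIZE + offset;
}
```

With power-of-two block sizes the divide and modulo reduce to a shift and a mask, which is why such a translator is cheap to build in hardware.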

33

An Efficient Dynamic Memory Manager for Embedded Systems

The table below shows the worst and average cases for different numbers of clients.

The main advantage of the DMMS is that it has predictable, and well-behaved, worst- and average-case behavior.

The system is intended to ease the job of the hardware designer/programmer as well as to create better and smaller hardware.

34

Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management

The aggressive evolution of the semiconductor industry has provided design engineers the means to create complex, high-performance SoC designs.

A typical SoC consists of multiple processing elements, configurable logic, large memory, analog components and digital interfaces.


35

Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management

The SoC Dynamic Memory Management Unit (SoCDMMU) is a hardware unit which handles the allocation/de-allocation of the global on-chip memory between the PEs.

There are three types of commands that the SoCDMMU can execute: the G_Allocate commands, the G_Deallocate command and the Move command.
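A toy software model of the three command types named above. The encoding, struct fields, and the bare free-block counter are all invented for illustration; the real SoCDMMU is a hardware unit and nothing here reflects its actual interface.

```c
/* The three SoCDMMU command types (illustrative encoding). */
typedef enum { G_ALLOCATE, G_DEALLOCATE, MOVE } socdmmu_cmd_t;

typedef struct {
    socdmmu_cmd_t cmd;
    unsigned pe_id;       /* requesting processing element        */
    unsigned num_blocks;  /* blocks to allocate or deallocate     */
} socdmmu_request_t;

/* Toy accounting of global on-chip memory: blocks still free. */
static unsigned free_blocks = 32;

/* Returns 0 on success, -1 if the request cannot be served. */
int socdmmu_execute(const socdmmu_request_t *req)
{
    switch (req->cmd) {
    case G_ALLOCATE:
        if (req->num_blocks > free_blocks)
            return -1;                    /* not enough global memory */
        free_blocks -= req->num_blocks;
        return 0;
    case G_DEALLOCATE:
        free_blocks += req->num_blocks;
        return 0;
    case MOVE:
        /* re-assign blocks between PEs; free count is unchanged */
        return 0;
    }
    return -1;
}
```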


36

Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management

An RTOS usually divides the memory into fixed-sized allocation units, and any task can allocate only one unit at a time.

As an RTOS, Atalanta manages memory in a deterministic way; tasks can dynamically allocate fixed-size blocks by using memory partitions.
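The fixed-size-unit scheme can be sketched as a classic O(1) block pool. Sizes and names are illustrative; this is the generic memory-partition pattern, not Atalanta's actual code. Both allocation and release are constant-time list operations, which is what makes the behavior deterministic.

```c
#include <stddef.h>

#define UNIT_SIZE  64   /* bytes per allocation unit (illustrative) */
#define UNIT_COUNT 8    /* units in the partition (illustrative)    */

/* Each free unit stores, in its first bytes, a pointer to the next
 * free unit, so no side metadata is needed. */
static _Alignas(void *) unsigned char pool[UNIT_COUNT][UNIT_SIZE];
static void *free_list = NULL;

void pool_init(void)
{
    free_list = NULL;
    for (int i = 0; i < UNIT_COUNT; ++i) {
        *(void **)pool[i] = free_list;  /* link unit onto the free list */
        free_list = pool[i];
    }
}

void *pool_alloc(void)                  /* one unit per call, O(1) */
{
    void *unit = free_list;
    if (unit)
        free_list = *(void **)unit;     /* pop the head */
    return unit;                        /* NULL if the pool is empty */
}

void pool_free(void *unit)              /* O(1) */
{
    *(void **)unit = free_list;         /* push back onto the head */
    free_list = unit;
}
```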


37

Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management

The Atalanta RTOS memory management is extended to support the SoCDMMU and to allow the Atalanta RTOS to work in a multiprocessor SoC environment.







38

Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management

Four-PE SoC with SoCDMMU

The hardware SoCDMMU provides a dynamic, fast way to allocate/deallocate the global on-chip memory.



39

Unifying Memory and Processor Wrapper Architecture in Multiprocessor SoC Design

ISSS 2002

Férid Gharsalli, Damien Lyonnard, Samy Meftali, Frédéric Rousseau, Ahmed A. Jerraya

TIMA Laboratory, Grenoble, France


40

Introduction

Multiprocessor SoCs (MP SoC):

Increasing performance requirements of application domains
Complex communication protocols
IP or application-specific memory components
Require heterogeneous processors

=> This architecture generation demands significant design effort


41

Introduction

To reduce the productivity gap, designers reuse components (IP cores).

An IP core needs to adapt the specific physical accesses and protocols of those components to the communication network, which may have other physical connections and other protocols.

To facilitate design space exploration and to allow the designer to try different components or communication protocols:

=> Need to generate these wrappers automatically, based on parameters given by the architecture (processor types, protocols, etc.)


42

MP SoC Architecture



Figure 1: A typical Multiprocessor SoC

43

MP SoC Architecture



Figure 2: Architectural models


44

Unified Wrapper Model (1): Generic Wrapper Architecture

Figure 3: Wrapper architecture

Key idea: allow automatic wrapper generation based on a common library.

Module Adapter (MA)
- Implements services requested by the module.

Channel Adapter (CA)
- Implements the communication protocol (FIFO, DMA controller, etc.)
- Controls the communication between the module and the network.



45

Unified Wrapper Model (2): Processor Wrapper Architecture

Figure 3: Wrapper architecture

Processor Adapter (PA)
- Performs channel access selection by address decoding and interrupt management
- The PA is a master, whereas the CAs are slaves
- Enable signals are set/reset by the PA, which selects one CA and enables it to read/write data to/from the data signal.
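The PA's channel-access selection by address decoding can be sketched as follows. The address map, window size, and function name are invented for illustration; in the real wrapper this decoding is combinational logic driving the CA enable signals.

```c
#include <stdint.h>

#define NUM_CA    4        /* channel adapters behind the PA (illustrative) */
#define CA_WINDOW 0x1000u  /* bytes of address space per CA (illustrative)  */
#define CA_BASE   0x8000u  /* base address of the CA region (illustrative)  */

/* Decode the processor address and pick which CA to enable.
 * Returns the CA index, or -1 if the address is outside the
 * channel-adapter region (so no enable signal is asserted). */
int pa_select_ca(uint32_t addr)
{
    if (addr < CA_BASE || addr >= CA_BASE + NUM_CA * CA_WINDOW)
        return -1;
    return (int)((addr - CA_BASE) / CA_WINDOW);
}
```

Because exactly one window matches any in-range address, at most one CA enable is ever asserted, matching the master/slave relation described above.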


46

Unified Wrapper Model (3): Memory Wrapper Architecture

Figure 3: Wrapper architecture

Memory Port Adapter (MPA)
- Includes a memory controller and several memory-specific functions
- Performs data type conversion and data transfer between the internal communication bus and the memory bus.


47

Wrapper Generation

Figure 4: Wrapper generation flow

In order to facilitate wrapper generation, a library of basic components should be built.

This library includes several macro-models of channel adapters and module adapters.

The wrapper generation flow is composed of a processor and memory library, and an MA and CA library.

48

Memory Wrapper Generation in an Image Processing Application

Validation

- To check the correctness of the memory wrapper, we performed a low-level image processing task for a digital camera application
- The algorithm uses two processors (ARM7) and a global shared memory.

Experiment

- Two CAs, each composed of two FIFOs (32 words x 32 bits) with one controller and one buffer (1 word of 32 bits)
- Two specific SRAM port adapters, each composed of one address decoder and one SRAM controller that provides the following services: SRAM control, burst access and a test operation used during co-simulation
- Two parallel internal buses of 32 bits.


49

50

Results

The automatic generation of these wrappers allows a fast design space exploration of various types of memories.

The generated wrappers have been validated with a cycle-accurate co-simulation approach based on SystemC. Two ISSs of the ARM7 core (40 MHz) are used.

We note that there is a small difference in the code size of the memory wrapper between the two RTL architecture models. The CAs are not changed; only the MPA is changed (10% of the wrapper code).

Write latency: 3 CPU cycles (without memory latency)
Read latency: 7 CPU cycles (send/receive)

The simulation, which corresponds to the processing of an image of 387x322 pixels, takes 2.05 × 10^6 CPU cycles.