Memory Design for Multi-Core System on Chip
Introduction
• The DSP processor is optimized for extremely high performance on a specific class of arithmetic-intensive algorithms.
• Data path optimization: operations like multiply-accumulate should take only one clock cycle.
• Memory architecture optimization: large amounts of data must be moved to and from memory.
A FIR filter is a typical DSP application
Example: FIR Filter
• If a multiply-accumulate can be done in a single clock cycle, a new sample of a k-tap FIR filter could be computed in k cycles, if there were no delay due to memory access.
• However, several memory accesses are necessary:
1. Fetch the multiply-accumulate instruction
2. Read the delayed data value (xi)
3. Read the coefficient value (ci)
4. Write the data value into the next delay location in memory (xi -> xi-1)
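The per-tap accesses listed above can be made concrete with a small sketch. This is a plain C model of one direct-form FIR step; the function name `fir_step` and the explicit delay-line shift are illustrative, not from the slides. A real DSP would typically avoid the shift with a circular buffer, and would keep x and c in separate memory banks so that one multiply-accumulate per cycle becomes possible.

```c
#include <assert.h>

/* One output sample of a k-tap FIR filter:
 *   y[n] = sum_{i=0}^{k-1} c[i] * x[n-i]
 * x[] is the delay line with x[0] holding the newest sample. */
int fir_step(int *x, const int *c, int k, int new_sample)
{
    /* Step 4 of the slide: age the delay line (xi -> xi-1). */
    for (int i = k - 1; i > 0; i--)
        x[i] = x[i - 1];
    x[0] = new_sample;

    /* Steps 1-3 repeat per tap: fetch MAC, read xi, read ci. */
    int acc = 0;
    for (int i = 0; i < k; i++)
        acc += c[i] * x[i];   /* ideally one MAC per clock cycle */
    return acc;
}
```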
Memory Structure
The common memory structure used by general-purpose processors is the Von Neumann architecture.
The processor can make one memory access per instruction cycle.
Original Harvard Architecture
• The processor is connected to two memories (one for instructions, one for data) via independent buses.
Modified Harvard Architecture
• The processor is connected to two memories (each holding both instructions and data) via independent buses.
Comparison
An implementation of the FIR filter needs, per sample:
o 4 instruction cycles (Von Neumann)
o 3 instruction cycles (Original Harvard)
o 2 instruction cycles (Modified Harvard)
There is thus also the possibility of having more than two independent memory banks, which is used in some DSPs.
Multiple Memory Buses
Multiple memory buses outside the chip are costly.
DSP processors generally provide only two off-chip buses (an address bus and a data bus).
Processors with multiple memory banks usually provide a small amount of on-chip memory.
Multiple Memory Access: Fast Memories
Multiple memory accesses can be achieved by using faster memories that support several accesses per instruction cycle.
Fast memories can be combined with a Harvard architecture to achieve even higher memory bandwidth.
Multiple Memory Access: Multi-Port Memories
• Multi-port memories have multiple independent sets of address and data connections.
DSP Processor Caches
DSP processors often include a program cache to eliminate accesses to main memory for certain instructions.
In general, DSP processor caches are much smaller and simpler than the caches in general-purpose processors.
DSP Processor Caches
Single-Instruction Repeat Buffer
An instruction is loaded into the repeat buffer (initiated by the programmer). If the repeat instruction is used, the processor can make an extra memory access within a single cycle.
Extended Repeat Buffer
A whole block of instructions is loaded into the repeat buffer.
DSP Processor Caches
Single-Sector Instruction Cache
o The cache stores a number of the most recent instructions, which lie in a single contiguous region of program memory.
o The cache is loaded automatically with these instructions during program execution.
Multiple-Sector Instruction Cache
o Two or more independent sectors of memory can be stored in the cache.
o If an instruction belongs to a sector other than those stored in the cache (a sector miss), one of the cached sectors is replaced by the sector of that instruction.
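The sector-miss behavior described above might be modeled roughly as follows. The two-sector capacity, the round-robin replacement policy, and all names are assumptions for illustration, not details from the slides.

```c
#include <assert.h>
#include <string.h>

#define SECTOR_WORDS 8   /* hypothetical sector size */
#define NUM_SECTORS  2   /* a "multiple sector" cache with two sectors */

/* Minimal model of a multiple-sector instruction cache: each sector
 * holds one contiguous block of program memory. On a sector miss,
 * one resident sector is replaced (here: round-robin). */
struct sector_cache {
    int base[NUM_SECTORS];   /* program address of first word, -1 = empty */
    int victim;              /* next sector to replace */
    int misses;
};

void cache_init(struct sector_cache *c)
{
    memset(c, 0, sizeof *c);
    c->base[0] = c->base[1] = -1;
}

/* Returns 1 on a sector hit, 0 on a sector miss (loading the sector). */
int cache_access(struct sector_cache *c, int addr)
{
    int sector_base = addr - (addr % SECTOR_WORDS);
    for (int s = 0; s < NUM_SECTORS; s++)
        if (c->base[s] == sector_base)
            return 1;                        /* instruction's sector resident */
    c->base[c->victim] = sector_base;        /* replace one cached sector */
    c->victim = (c->victim + 1) % NUM_SECTORS;
    c->misses++;
    return 0;
}
```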
DSP Processor Caches
Some DSP processors provide special instructions that allow locking the contents of the cache or disabling the cache.
o This may lead to better performance if the programmer knows the behavior of the program.
DSP Processor Caches
DSP processor caches are in general used only for program instructions, not for data.
o Caches that accommodate data as well as instructions must include a mechanism to write data back to the external memory.
o If this is not the case, deleting the cache contents means that updates of data in the cache are lost.
To use caches efficiently, algorithms should exploit data locality.
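A classic concrete instance of data locality is matrix traversal order: row-major traversal of a C array walks memory sequentially and uses each fetched cache line fully, while column-major traversal strides through memory and wastes most of each line. The example below is illustrative; both functions compute the same sum, differing only in locality.

```c
#define ROWS 64
#define COLS 64

/* Cache-friendly: visits memory in storage order (row-major in C). */
long sum_row_major(int m[ROWS][COLS])
{
    long s = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            s += m[r][c];
    return s;
}

/* Same result, poor locality: strides COLS words between accesses. */
long sum_col_major(int m[ROWS][COLS])
{
    long s = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            s += m[r][c];
    return s;
}
```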
External Memory Interfaces
Most DSPs provide a single external memory port consisting of an address bus, a data bus, and a set of control signals.
External Memory Interfaces
The lack of additional external memory ports means that multiple external memory accesses cannot be done within a single clock cycle.
There are, however, DSP processors that provide multiple off-chip memory ports.
Multiprocessor Support in DSP External Interfaces
DSPs intended for multiprocessor systems often provide special features to simplify the design of multi-processor DSP systems.
Examples:
o Two external memory ports
o A sophisticated shared-bus interface that allows several DSPs to be connected together with no special hardware or software

P. Lapsley, J. Bier, A. Shoham, E. A. Lee, DSP Processor Fundamentals, IEEE Press, 1997
FAST COMPUTATIONS ON A LOW-COST DSP-BASED SHARED-MEMORY MULTIPROCESSOR SYSTEM
ICECS '2002
Charalambos S. Christou
Introduction
Processor performance has increased dramatically over the past few years, while memory latency and bandwidth have progressed at a much slower pace.
Large latencies have considerably reduced the number of processors that can be effectively supported in shared-memory parallel computers.
=> New cost-effective parallel system:
1. Reduces memory latency
2. Effectively supports a greater number of processing elements for faster DSP computations
TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (1)
The proposed multiprocessor is a high-speed, low-cost, DSP-based, twin-prefetching, shared-memory MIMD parallel system.
Figure 1: Twin-prefetching multiprocessor system diagram
Advantages of Shared Memory
• Ease of programming when communication patterns are complex or vary dynamically during execution
• Lower communication overhead, good utilization of communication bandwidth for small data items, no expensive I/O operations
• Hardware-controlled caching to reduce remote communication when remote data is cached
• The alternative flavor: message passing (e.g. MPI)
SMP Interconnect
• Connects processors to memory AND to I/O
• Bus-based: all memory locations have equal access time, hence SMP = "Symmetric MP"
• Limited bandwidth is shared among processors and I/O
TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (2)
Data Memory
- Comprised of two controllers (twin TTCs) and two fast memories (twin-prefetching caches)
- The two TTC/cache pairs are Twin1 and Twin2:
  => one Twin is accessible to the processor, providing data operands
  => the other Twin is transferring data from/to the shared memory; the roles switch as soon as a block of data has been moved into the cache
- Loading (input image segments) and unloading (results) from/to the Twins occur simultaneously with data processing.
- The back-and-forth switching of Twin1 and Twin2 allows maximum utilization of resources and thus optimum system performance.
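The Twin1/Twin2 switching described above is a double-buffering scheme. The sketch below is sequential C, so the overlap of prefetching and processing is only notional; in the real system the idle Twin's TTC moves data while the processor works on the other Twin. Segment size and all function names are invented for illustration.

```c
#include <assert.h>

#define SEG 4   /* hypothetical size of one data segment */

/* Stands in for the TTC filling the idle cache from shared memory. */
void prefetch(int *twin, const int *shared, int base)
{
    for (int i = 0; i < SEG; i++)
        twin[i] = shared[base + i];
}

/* Stands in for the processor consuming the active Twin's operands. */
int process(const int *twin)
{
    int acc = 0;
    for (int i = 0; i < SEG; i++)
        acc += twin[i];
    return acc;
}

/* Ping-pong over nseg segments: while one Twin is processed,
 * the next segment is (notionally) prefetched into the other. */
int run_twins(const int *shared, int nseg)
{
    int twin[2][SEG];
    int active = 0, total = 0;
    prefetch(twin[active], shared, 0);               /* fill Twin1 first */
    for (int seg = 0; seg < nseg; seg++) {
        if (seg + 1 < nseg)                          /* idle Twin prefetches */
            prefetch(twin[1 - active], shared, (seg + 1) * SEG);
        total += process(twin[active]);              /* processor works */
        active = 1 - active;                         /* switch the Twins */
    }
    return total;
}
```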
TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (3)
Host Processor
- The host can directly read or write asynchronously the internal memory of any ADSP-21060 via the host bus.
- The host processor is responsible for booting all nodes and downloading all necessary code, and some data, to the internal memory of every processor.
- The data downloaded to the internal memories includes the addresses of the image segments in the global memory that every node is assigned to process.
RESULT (1)
RESULT (2)
An Efficient Dynamic Memory Manager for Embedded Systems
• Most embedded systems rely on statically allocated memory to avoid problems with garbage collection.
• Solution to the garbage collection problem:
– Manual memory management system: the systems that would benefit from using dynamic memory management are applications where the clients involved do not all need instant access to the maximum memory that they could claim.
An Efficient Dynamic Memory Manager for Embedded Systems
• The DMMS works as an address translator for the clients.
• The DMMS contains an arbiter for granting access to the different clients.
– Three types of requests: allocation, deallocation, read/write.
An Efficient Dynamic Memory Manager for Embedded Systems
• The interface to the clients consists of four parts: allocation, deallocation, read/write, and maintenance.
• To achieve as high a degree of memory utilisation as possible, it is important to perform a thorough analysis of the optimal block size.
An Efficient Dynamic Memory Manager for Embedded Systems
• The key issue in the DMMS is to make the clients see a contiguous memory.
• A naive implementation of the DMM would be to have a separate Address Translation Table (ATT) for each client.
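A naive per-client ATT might look like the following sketch: the client addresses a contiguous logical space, and each logical block is mapped to some physical block of the shared memory. Block size, table layout and all names are assumptions for illustration, not details from the paper.

```c
#include <assert.h>

#define BLOCK_SIZE 16   /* hypothetical block size in words */
#define MAX_BLOCKS 8    /* table entries per client */

/* One client's Address Translation Table: logical block -> physical block. */
struct att {
    int phys_block[MAX_BLOCKS];
    int nblocks;        /* blocks currently allocated to this client */
};

/* Translate a client's logical address to a physical address,
 * or return -1 if the address is outside the client's allocation. */
int att_translate(const struct att *t, int logical)
{
    int lblock = logical / BLOCK_SIZE;
    int offset = logical % BLOCK_SIZE;
    if (lblock < 0 || lblock >= t->nblocks)
        return -1;
    return t->phys_block[lblock] * BLOCK_SIZE + offset;
}
```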
An Efficient Dynamic Memory Manager for Embedded Systems
• The table below shows the worst and average cases for different numbers of clients.
• The main advantage of the DMMS is that it has predictable, and benign, worst- and average-case behavior.
• The system is intended to ease the job of the hardware designer/programmer as well as to create better and smaller hardware.
Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
• The aggressive evolution of the semiconductor industry has provided design engineers with the means to create complex, high-performance SoC designs.
• A typical SoC consists of multiple processing elements, configurable logic, large memory, analog components and digital interfaces.
Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
• The SoC Dynamic Memory Management Unit (SoCDMMU) is a hardware unit which handles the global on-chip memory allocation/deallocation between the PEs.
• There are three types of commands that the SoCDMMU can execute: the G_Allocate command, the G_Deallocate command and the Move command.
Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
• An RTOS usually divides the memory into fixed-size allocation units, and any task can allocate only one unit at a time.
• As an RTOS, Atalanta manages memory in a deterministic way; tasks can dynamically allocate fixed-size blocks by using memory partitions.
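A fixed-size memory partition of this kind is typically built as a free-list allocator, giving O(1), and therefore deterministic, allocate and free. The sketch below is a generic illustration of that idea, not Atalanta's actual implementation; the pool size and names are invented.

```c
#include <assert.h>

#define POOL_BLOCKS 4   /* hypothetical number of fixed-size blocks */

/* A memory partition: blocks are chained on a singly linked free list,
 * so both allocation and deallocation take constant time. */
struct pool {
    int next[POOL_BLOCKS];   /* free-list links; -1 terminates the list */
    int free_head;
};

void pool_init(struct pool *p)
{
    for (int i = 0; i < POOL_BLOCKS - 1; i++)
        p->next[i] = i + 1;
    p->next[POOL_BLOCKS - 1] = -1;
    p->free_head = 0;
}

/* Allocate one block; returns its id, or -1 if the pool is exhausted. */
int pool_alloc(struct pool *p)
{
    int b = p->free_head;
    if (b != -1)
        p->free_head = p->next[b];
    return b;
}

/* Return a block to the head of the free list. */
void pool_free(struct pool *p, int b)
{
    p->next[b] = p->free_head;
    p->free_head = b;
}
```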
Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
• Atalanta RTOS memory management
– adapted to support the SoCDMMU and to allow the Atalanta RTOS to work in a multiprocessor SoC environment.
Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
• Four-PE SoC with SoCDMMU
• The hardware SoCDMMU provides a dynamic, fast way to allocate/deallocate the global on-chip memory.
Unifying Memory and Processor Wrapper Architecture in Multiprocessor SoC Design
ISSS '2002
Férid Gharsalli, Damien Lyonnard, Samy Meftali, Frédéric Rousseau, Ahmed A. Jerraya
TIMA Laboratory, Grenoble, France
Introduction
Multiprocessor SoCs (MP SoCs):
• Increasing performance requirements of application domains
• Complex communication protocols
• IP or application-specific memory components
• Require heterogeneous processors
=> This architecture generation demands significant design effort.
Introduction
To reduce the productivity gap, designers reuse components (IP cores).
An IP core needs wrappers to adapt the specific physical accesses and protocols of those components to the communication network, which may have other physical connections and other protocols.
To facilitate design space exploration and to allow the designer to try different components or communication protocols:
=> Need a tool that generates these wrappers automatically, based on parameters given by the architecture (processor types, protocols, etc.)
MP SoC Architecture
Figure 1: A typical Multiprocessor SoC

MP SoC Architecture
Figure 2: Architectural models
Unified Wrapper Model (1) – Generic Wrapper Architecture
Figure 3: Wrapper architecture
Key idea: to allow automatic wrapper generation based on a common library.
Module Adapter (MA)
- Implements services requested by the module.
Channel Adapter (CA)
- Implements the communication protocol (FIFO, DMA controller, etc.)
- Controls the communication between the module and the network.
Unified Wrapper Model (2) – Processor Wrapper Architecture
Figure 3: Wrapper architecture
Processor Adapter (PA)
- Performs channel access selection by address decoding, and interrupt management
- The PA is a master, whereas the CAs are slaves
- Enable signals are set/reset by the PA; they select one CA and enable it to read/write data to/from the data signal.
Unified Wrapper Model (3) – Memory Wrapper Architecture
Figure 3: Wrapper architecture
Memory Port Adapter (MPA)
- Includes a memory controller and several memory-specific functions
- Performs data type conversion and data transfer between the internal communication bus and the memory bus.
Wrapper Generation
Figure 4: Wrapper generation flow
To facilitate wrapper generation, a library of basic components should be built.
This library includes several macro-models of channel adapters and module adapters.
The wrapper generation flow is composed of a processor and memory library, and an MA and CA library.
Memory Wrapper Generation in an Image Processing Application
Validation
- To check the correctness of the memory wrapper, we performed low-level image processing for a digital camera application.
- The algorithm uses two processors (ARM7) and a global shared memory.
Experiment
- Two CAs, each composed of two FIFOs (32 words x 32 bits) with one controller and one buffer (1 word of 32 bits),
- Two specific SRAM port adapters, each composed of one address decoder and one SRAM controller that provides the following services: SRAM control, burst access, and a test operation used during co-simulation,
- Two parallel internal buses of 32 bits.
Results
The automatic generation of these wrappers allows fast design space exploration of various types of memories.
The generated wrappers have been validated with a cycle-accurate co-simulation approach based on SystemC. Two ISSs of the ARM7 core (40 MHz) are used.
There is a small difference in the code size of the memory wrapper in the two RTL architecture models: the CAs are unchanged; only the MPA changes (10% of the wrapper code).
Write latency: 3 CPU cycles (without memory latency)
Read latency: 7 CPU cycles (send/receive)
The simulation of the processing of one 387x322-pixel image corresponds to 2.05 × 10^6 CPU cycles.