18 Multicore Computers


31 Oct 2013


William Stallings

Computer Organization and Architecture

8th Edition

Chapter 18

Multicore Computers

Hardware Performance Issues


Microprocessors have seen an exponential
increase in performance


Improved organization


Increased clock frequency


Increase in Parallelism


Pipelining


Superscalar (multi-issue)


Simultaneous multithreading (SMT)


Diminishing returns


More complexity requires more logic


Increasing chip area for coordinating and
signal transfer logic


Harder to design, make and debug

Alternative Chip Organizations

http://www.cadalyst.com/files/cadalyst/nodes/2008/6351/i4.jpg

Intel Hardware Trends

Exponential speedup trend

ILP has come and gone

http://smoothspan.files.wordpress.com/2007/09/clockspeeds.jpg

http://www.ixbt.com/cpu/semiconductor/intel-65nm/power_density.jpg

Increased Complexity


Power requirements grow exponentially with chip
density and clock frequency


Can use more chip area for cache


Smaller


Order of magnitude lower power requirements


By 2015


100 billion transistors on 300mm² die


Cache of 100MB


1 billion transistors for logic

http://www.tomshardware.com/reviews/core-duo-notebooks-trade-battery-life-quicker-response,1206-4.html

http://techreport.com/r.x/core-i7/die-callout.jpg

Power and Memory Considerations

Less action

More action

We passed 50%!!!

Is this a RAM or a processor?

Increased Complexity


Pollack’s rule:


Performance is roughly proportional to square root of
increase in complexity


Double complexity gives 40% more performance


Multicore has the potential for near-linear
improvement (needs some programming effort
and won’t work for all problems)


Unlikely that one core can use all of a huge
cache effectively, so add PEs to make an MPSoC
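
The Pollack's rule bullets above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming performance is exactly the square root of the complexity ratio and that the multicore case scales perfectly (the best case the slide calls "near-linear"). The function names are made up for this illustration.

```python
import math


def pollack_performance(complexity_ratio: float) -> float:
    """Relative performance of one core whose complexity grew by complexity_ratio."""
    return math.sqrt(complexity_ratio)


def ideal_multicore_performance(cores: int) -> float:
    """Best-case relative performance of `cores` simple cores (perfect scaling)."""
    return float(cores)


if __name__ == "__main__":
    for budget in (1, 2, 4, 8):
        single = pollack_performance(budget)
        multi = ideal_multicore_performance(budget)
        print(f"{budget}x transistors: one big core {single:.2f}x, "
              f"{budget} small cores at most {multi:.2f}x")
```

Doubling the complexity gives a factor of about 1.41, which is the 40% figure on the slide. Real multicore gains fall below the ideal line once serial code and overheads are counted.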

Chip Utilization of Transistors

Cache

CPU

Software Performance Issues


Performance benefits dependent on
effective exploitation of parallel resources
(obviously)


Even small amounts of serial code impact
performance (not so obvious)


10% inherently serial code on an 8-processor system
gives only about 4.7 times the performance


Many overheads of MPSoC:


Communication


Distribution of work


Cache coherence


Some applications effectively exploit
multicore processors
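
The 4.7 figure for 10% serial code on 8 processors is the standard Amdahl's law calculation. The sketch below reproduces it. It ignores the communication, work-distribution and cache-coherence overheads listed above, so real speedups would be lower still.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup over one processor when serial_fraction of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    for serial in (0.0, 0.05, 0.10, 0.25):
        print(f"{serial:.0%} serial, 8 processors: {amdahl_speedup(serial, 8):.2f}x")
```

With `serial_fraction=0.10` and `processors=8` the result is 1 / (0.1 + 0.9/8) = 1 / 0.2125, roughly 4.7.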

Effective Applications for Multicore Processors


Database (e.g. Select *)


Servers handling independent transactions


Multi-threaded native applications


Lotus Domino, Siebel CRM


Multi-process applications


Oracle, SAP, PeopleSoft


Java applications


Java VM is multi-threaded with scheduling and memory
management (not so good at SSE)


Sun’s Java Application Server, BEA’s WebLogic,
IBM WebSphere, Tomcat


Multi-instance applications


One application running multiple times

Multicore Organization


Main design variables:


Number of core processors on chip (dual, quad ... )


Number of levels of cache on chip (L1, L2, L3, ...)


Amount of shared cache vs. not shared (1MB, 4MB, ...)


The following slide has examples of each organization:

a) ARM11 MPCore

b) AMD Opteron

c) Intel Core Duo

d) Intel Core i7


Multicore Organization Alternatives

ARM11 MPCore

AMD Opteron

Intel Core Duo

Intel Core i7

No shared

Shared

Advantages of shared L2 Cache


Constructive interference reduces overall miss
rate (A wants X, then B wants X: good!)


Data shared by multiple cores not replicated at
cache level (one copy of X for both A and B)


With proper frame replacement algorithms, the
amount of shared cache dedicated to each core is
dynamic


Threads with less locality can have more cache


Easy inter-process communication through
shared memory


Cache coherency confined to small L1


Dedicated L2 cache gives each core more rapid
access


Good for threads with strong locality


Shared L3 cache may also improve performance
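
The "A wants X then B wants X" case above can be illustrated with a toy simulation. This is a sketch under loud assumptions: two cores, a fully associative LRU cache, and the same total capacity either shared or split in half. None of these parameters come from the slides, and the class and function names are invented for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, address: int) -> bool:
        """Return True on a hit, False on a miss (the line is then loaded)."""
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            return True
        self.misses += 1
        self.lines[address] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        return False


def count_misses(trace, shared: bool, total_lines: int = 8) -> int:
    """trace is a list of (core, address) pairs, with core being 0 or 1."""
    if shared:
        cache = LRUCache(total_lines)
        for _core, address in trace:
            cache.access(address)
        return cache.misses
    private = {0: LRUCache(total_lines // 2), 1: LRUCache(total_lines // 2)}
    for core, address in trace:
        private[core].access(address)
    return sum(cache.misses for cache in private.values())


# Core 0 touches four lines, then core 1 touches the same four lines.
TRACE = [(0, a) for a in range(4)] + [(1, a) for a in range(4)]

if __name__ == "__main__":
    print("shared L2 misses: ", count_misses(TRACE, shared=True))
    print("private L2 misses:", count_misses(TRACE, shared=False))
```

With the shared cache, core 1 hits on every line that core 0 already loaded. With private caches, each core misses on its own copy. The sketch does not model access latency, which is where a dedicated L2 has the advantage.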

Core i7 and Duo


Let us review these two Intel
architectures…

Individual Core Architecture


Intel Core Duo uses superscalar cores


Intel Core i7 uses simultaneous multi-threading (SMT)


Scales up number of threads supported


4 SMT cores, each supporting 4 threads, appear as
16 cores (my Core i7 has 2 threads per core)

Core i7

Core 2 duo

Intel x86 Multicore Organization - Core Duo (1)


2006


Two x86 superscalar cores, shared L2 cache


Dedicated L1 cache per core


32KB instruction and 32KB data


Thermal control unit per core


Manages chip heat dissipation using sensors; clock speed
is throttled when needed


Maximize performance within thermal constraints


Improved ergonomics (quiet fan)


Advanced Programmable Interrupt
Controller (APIC)


Inter-processor interrupts between cores


Routes interrupts to appropriate core


Includes timer so OS can self-interrupt a core

Intel x86 Multicore Organization - Core Duo (2)


Power Management Logic


Monitors thermal conditions and CPU activity


Adjusts voltage (and thus power consumption)


Can switch on/off individual logic subsystems
to save power


Split-bus transactions can sleep on one end


2MB shared L2 cache


Dynamic allocation


MESI support for L1 caches


Extended to support multiple Core Duo in SMP
(not SMT)


L2 data shared between local cores (fast) or external


Bus interface is FSB
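
The MESI support mentioned above can be sketched as a small state tracker for a single memory line. This is a simplified textbook model, not Intel's implementation. It ignores write-back traffic and bus timing, and the class and method names are invented for the example.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class CoherentLine:
    """One memory line as seen by the L1 cache of each core (simplified MESI)."""

    def __init__(self, cores: int) -> None:
        self.states = [State.INVALID] * cores

    def read(self, core: int) -> None:
        if self.states[core] is not State.INVALID:
            return  # read hit: M, E and S are unchanged
        other_copy_exists = False
        for other, state in enumerate(self.states):
            if other != core and state is not State.INVALID:
                # A snooped read moves M (after write-back), E and S to S.
                self.states[other] = State.SHARED
                other_copy_exists = True
        self.states[core] = State.SHARED if other_copy_exists else State.EXCLUSIVE

    def write(self, core: int) -> None:
        # A write invalidates every other copy; the writer holds the line modified.
        for other in range(len(self.states)):
            if other != core:
                self.states[other] = State.INVALID
        self.states[core] = State.MODIFIED

    def snapshot(self) -> str:
        """States of all cores as a string, e.g. 'EI' for two cores."""
        return "".join(state.value for state in self.states)


if __name__ == "__main__":
    line = CoherentLine(cores=2)
    for action, core in (("read", 0), ("read", 1), ("write", 1), ("read", 0)):
        getattr(line, action)(core)
        print(f"core {core} {action:5s} -> {line.snapshot()}")
```

The first read gets the line exclusively, the second read makes both copies shared, the write invalidates the other copy, and the final read brings both back to shared.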

Intel Core Duo Block Diagram

Intel x86 Multicore Organization - Core i7


November 2008


Four x86 SMT processors


Dedicated L2, shared L3 cache


Speculative pre-fetch for caches


On chip DDR3 memory controller


Three 8 byte channels (192 bits) giving 32GB/s


No front side bus (just like labs 1 & 2 with the SDRAM
controller)


QuickPath Interconnect (QPI video if time allows)


Cache coherent point-to-point link


High speed communications between processor chips


6.4G transfers per second, 16 bits per transfer


Dedicated bi-directional pairs


Total bandwidth 25.6GB/s
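
The bandwidth figures on this slide follow from transfers per second times bytes per transfer. The sketch below redoes the arithmetic. The QPI numbers are the ones on the slide. The DDR3 transfer rate of 1.333 G transfers per second is an assumption, chosen because it lands near the slide's 32GB/s, and is not stated in the slides.

```python
def peak_bandwidth_gb_per_s(transfers_per_s: float,
                            bytes_per_transfer: float,
                            links: int = 1) -> float:
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) across `links` identical links."""
    return transfers_per_s * bytes_per_transfer * links / 1e9


# QPI: 6.4G transfers/s, 16 bits (2 bytes) per transfer, one link per direction.
QPI_ONE_WAY = peak_bandwidth_gb_per_s(6.4e9, 2)
QPI_TOTAL = peak_bandwidth_gb_per_s(6.4e9, 2, links=2)

# DDR3: three 8-byte channels; the 1.333e9 transfer rate is an assumed value.
DDR3_TOTAL = peak_bandwidth_gb_per_s(1.333e9, 8, links=3)

if __name__ == "__main__":
    print(f"QPI per direction:   {QPI_ONE_WAY:.1f} GB/s")
    print(f"QPI both directions: {QPI_TOTAL:.1f} GB/s")
    print(f"DDR3, 3 channels:    {DDR3_TOTAL:.1f} GB/s")
```

Each direction of the link carries 12.8GB/s, and the dedicated pair in both directions gives the 25.6GB/s total.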

Intel Core i7 Block Diagram

ARM11 MPCore


“ARM vs. x86 and Microsoft

Intel started this fight by challenging ARM with its Atom processor, which is
moving downmarket and towards smartphones. Apparently, the major ARM vendors
are feeling the threat, are now moving upmarket and are beginning to make
their run at low-end PCs and storage appliances to put the pressure back on
Intel.”


http://www.tgdaily.com/trendwatch-features/41561-the-coming-arm-vs-intel-pc-battle

ARM11 MPCore


Up to 4 processors each with own L1 instruction and data
cache


Distributed Interrupt Controller (DIC)


Recall the APIC from Intel’s core architecture


Timer per CPU


Watchdog (feed or it barks!)


Warning alerts for software failures


Counts down from predetermined values


Issues warning at zero


CPU interface


Interrupt acknowledgement, masking and completion
acknowledgement


CPU


Single ARM11 called MP11


Vector floating-point unit (VFP)


FP co-processor


L1 cache


Snoop control unit


L1 cache coherency
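
The watchdog bullets above (count down from a predetermined value, warn at zero, "feed or it barks") can be sketched in software. This is only a behavioural model with invented names, not the MPCore register interface.

```python
class Watchdog:
    """Countdown timer that issues a warning if it is not fed in time."""

    def __init__(self, reload_value: int) -> None:
        if reload_value <= 0:
            raise ValueError("reload_value must be positive")
        self.reload_value = reload_value
        self.counter = reload_value
        self.warned = False

    def feed(self) -> None:
        """Healthy software calls this periodically to restart the countdown."""
        self.counter = self.reload_value

    def tick(self) -> bool:
        """Advance one timer tick; return True if the warning fires on this tick."""
        if self.counter > 0:
            self.counter -= 1
            if self.counter == 0:
                self.warned = True
                return True
        return False


if __name__ == "__main__":
    dog = Watchdog(reload_value=3)
    dog.tick()
    dog.tick()
    dog.feed()  # fed in time, so the countdown restarts at 3
    print([dog.tick() for _ in range(3)])  # the warning fires on the third tick
```

If the software hangs and stops calling `feed`, the counter reaches zero and the warning is raised.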

http://barfblog.foodsafety.ksu.edu/DogObedienceTraining.jpg

ARM11 MPCore Block Diagram

ARM11 MPCore Interrupt Handling


Distributed Interrupt Controller (DIC) collates
interrupts from many sources (ironically it is a
centralized controller)


It provides


Masking (who can ignore an interrupt)


Prioritization (CPU A is more important than CPU B)


Distribution to target MP11 CPUs


Status tracking (of interrupts)


Software interrupt generation


Number of interrupts independent of MP11 CPU
design


Memory mapped DIC control registers


Accessed by CPUs via private interface through
SCU


DIC can:


Route interrupts to single or multiple CPUs


Provide inter-processor communication


Thread on one CPU can cause activity by thread on another CPU

DIC Routing


Direct to specific CPU


To defined group of CPUs


To all CPUs


OS can generate interrupt to:


All but self


Self


Other specific CPU


Typically combined with shared memory
for inter-processor communication


16 interrupt IDs available for inter-processor
communication (per CPU)
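
The three targets for OS-generated interrupts listed above (all but self, self, other specific CPU) can be sketched as a target-selection function. The mode strings and the function name are hypothetical, and the default of four CPUs is the MPCore maximum given earlier in the slides.

```python
def software_interrupt_targets(mode: str, sender: int,
                               cpu_count: int = 4, target=None) -> list:
    """Return the CPUs that receive a software-generated interrupt."""
    cpus = range(cpu_count)
    if sender not in cpus:
        raise ValueError("sender is not a valid CPU number")
    if mode == "all_but_self":
        return [cpu for cpu in cpus if cpu != sender]
    if mode == "self":
        return [sender]
    if mode == "specific":
        if target not in cpus:
            raise ValueError("target is not a valid CPU number")
        return [target]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    print(software_interrupt_targets("all_but_self", sender=1))
    print(software_interrupt_targets("self", sender=1))
    print(software_interrupt_targets("specific", sender=1, target=3))
```

In practice the sender would first place a message in shared memory and then raise the interrupt so the target CPUs know to look at it.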

Interrupt States


Inactive


Non-asserted


Completed by that CPU but pending or active
in others


E.g. allgather


Pending


Asserted


Processing not started on that CPU


Active


Started on that CPU but not complete


Can be pre-empted by higher priority interrupt
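
The three interrupt states above are tracked per CPU, which is why an interrupt can be inactive on one CPU while still pending or active on another. A minimal sketch of that bookkeeping follows. The event names ("assert", "start", "complete") are hypothetical labels, and pre-emption by a higher priority interrupt is not modelled.

```python
from enum import Enum


class IrqState(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    ACTIVE = "active"


# Allowed transitions; the event names are invented for this sketch.
TRANSITIONS = {
    (IrqState.INACTIVE, "assert"): IrqState.PENDING,
    (IrqState.PENDING, "start"): IrqState.ACTIVE,
    (IrqState.ACTIVE, "complete"): IrqState.INACTIVE,
}


def next_state(state: IrqState, event: str) -> IrqState:
    """Return the state that follows `state` when `event` occurs."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.value}") from None


def deliver(per_cpu: dict, cpu: int, event: str) -> None:
    """Update the state of one interrupt as seen by one CPU."""
    per_cpu[cpu] = next_state(per_cpu[cpu], event)


if __name__ == "__main__":
    per_cpu = {0: IrqState.INACTIVE, 1: IrqState.INACTIVE}
    for cpu in per_cpu:
        deliver(per_cpu, cpu, "assert")
    deliver(per_cpu, 0, "start")
    deliver(per_cpu, 0, "complete")
    print({cpu: state.value for cpu, state in per_cpu.items()})
```

After CPU 0 completes, the interrupt is inactive there but still pending on CPU 1, matching the "completed by that CPU but pending or active in others" case.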


Interrupt Sources


Inter-processor interrupts (IPI)


Private to CPU


ID0-ID15 (16 IPIs per CPU as mentioned earlier)


Software triggered


Priority depends on receiving CPU not source


Private timer and/or watchdog interrupt


ID29 and ID30


Legacy FIQ line


Legacy FIQ pin, per CPU, bypasses interrupt distributor


Directly drives interrupts to CPU


Hardware


Triggered by programmable events on associated
interrupt lines


Up to 224 lines


Start at ID32
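
The ID ranges listed above can be collected into one lookup. This sketch uses only the numbers from the slide: ID0-ID15 for IPIs, ID29 and ID30 for the private timer and watchdog, and up to 224 hardware lines starting at ID32 (so ID32 to ID255). The slide gives no ID for the legacy FIQ line, which bypasses the distributor, so it is left out. The category strings are invented.

```python
FIRST_HARDWARE_ID = 32
MAX_HARDWARE_LINES = 224
LAST_HARDWARE_ID = FIRST_HARDWARE_ID + MAX_HARDWARE_LINES - 1  # 255


def classify_interrupt(interrupt_id: int) -> str:
    """Map an interrupt ID to the source category described in the slides."""
    if 0 <= interrupt_id <= 15:
        return "ipi"  # software triggered, private to the CPU
    if interrupt_id in (29, 30):
        return "timer_or_watchdog"  # private timer and/or watchdog
    if FIRST_HARDWARE_ID <= interrupt_id <= LAST_HARDWARE_ID:
        return "hardware"  # programmable events on interrupt lines
    return "unspecified"  # not covered by the slide


if __name__ == "__main__":
    for interrupt_id in (0, 15, 16, 29, 30, 32, 255, 256):
        print(interrupt_id, classify_interrupt(interrupt_id))
```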

ARM11 MPCore Interrupt Distributor

Cache Coherency


Snoop Control Unit (SCU) resolves most shared
data bottleneck issues


Note: L1 cache coherency based on MESI similar to
Intel’s core architecture


3 types of SCU shared data resolution:

1. Direct data intervention


Copying clean entries between L1 caches without accessing
external memory or L2


Can resolve local L1 miss from remote L1 rather than L2


Reduces read after write from L1 to L2

2. Duplicated tag RAMs


Cache tags are implemented as a separate block of RAM; a copy is held
in the SCU, so the SCU knows when 2 CPUs hold the same cache
lines.


Tag RAM has same length as number of lines in cache


TAG duplicates used by SCU to check data availability before
sending coherency commands


Only send to CPUs that must update coherent data cache


Less bus locking due to less communication during coherency step

3. Migratory lines


Allows moving dirty data between CPUs without writing to L2 and
reading back from external memory (see Stallings Ch. 18.5, p. 703)
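
The duplicated tag RAM idea (item 2 above) can be sketched as a lookup that decides which CPUs need a coherency command. This is a simplified model: Python sets stand in for the per-CPU tag RAM copies, and the class and method names are invented.

```python
class SnoopControlUnit:
    """Keeps a duplicate of each CPU's L1 tags to filter coherency traffic."""

    def __init__(self, cpu_count: int) -> None:
        # One duplicate tag store per CPU (a set stands in for the tag RAM).
        self.duplicate_tags = [set() for _ in range(cpu_count)]

    def record_fill(self, cpu: int, tag: int) -> None:
        """Note that `cpu` loaded the line identified by `tag` into its L1."""
        self.duplicate_tags[cpu].add(tag)

    def record_evict(self, cpu: int, tag: int) -> None:
        """Note that `cpu` no longer holds the line identified by `tag`."""
        self.duplicate_tags[cpu].discard(tag)

    def coherency_targets(self, writer: int, tag: int) -> list:
        """CPUs other than the writer that hold the line and must be notified."""
        return [cpu for cpu, tags in enumerate(self.duplicate_tags)
                if cpu != writer and tag in tags]


if __name__ == "__main__":
    scu = SnoopControlUnit(cpu_count=4)
    scu.record_fill(0, 0x40)
    scu.record_fill(2, 0x40)
    scu.record_fill(3, 0x80)
    print(scu.coherency_targets(writer=0, tag=0x40))  # only CPU 2 holds the line
```

Because the SCU checks its own copy of the tags first, CPUs that do not hold the line are never interrupted, which is the reduction in coherency communication the slide describes.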

Performance Effect of Multiple Cores

Recommended Reading


Multicore Association web site


Stallings chapter 18


ARM web site


(if we have time)
http://www.intel.com/technology/quickpath/index.htm


http://www.arm.com/products/CPUs/ARM11MPCoreMultiprocessor.html


http://www.eetimes.com/news/design/features/showArticle.jhtml?articleID=23901143