Energy and Power

Uploaded by jazzydoe, category: Software & software engineering

30 Oct 2013

Lecture notes S. Yalamanchili and S. Mukhopadhyay

Some Useful Reading

http://en.wikipedia.org/wiki/CPU_power_dissipation
http://en.wikipedia.org/wiki/CMOS#Power:_switching_and_leakage
http://www.xbitlabs.com/articles/cpu/display/core-i5-2500t-2390t-i3-2100t-pentium-g620t.html
http://www.cpu-world.com/info/charts.html


Historical Scaling

Technology Scaling

30% scaling down in dimensions → doubles transistor density
Power per transistor
  $V_{dd}$ scaling → lower power
Transistor delay = $C_{gate} V_{dd} / I_{SAT}$
  $C_{gate}$, $V_{dd}$ scaling → lower delay

[Figure: MOSFET cross-sections labeled GATE, SOURCE, DRAIN and BODY, with oxide thickness $t_{ox}$ and channel length L]
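
The density claim can be checked with a few lines of arithmetic. This is a minimal sketch, assuming an ideal shrink in which every linear dimension is multiplied by 0.7; the function name `density_gain` is illustrative and not from the lecture.

```python
def density_gain(linear_scale: float) -> float:
    """Transistor density gain when every linear dimension is multiplied by linear_scale."""
    area_scale = linear_scale ** 2   # area shrinks with the square of the linear dimension
    return 1.0 / area_scale          # transistors per unit area grow by the inverse

gain = density_gain(0.7)
print(f"area per transistor: {0.7 ** 2:.2f}x, density: {gain:.2f}x")
```

A 0.7x shrink leaves about half the area per transistor, so roughly twice as many transistors fit in the same area.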

Fundamental Trends

High Volume Manufacturing     2004  2006  2008  2010  2012  2014  2016  2018
Technology Node (nm)            90    65    45    32    22    16    11     8
Integration Capacity (BT)        2     4     8    16    32    64   128   256
Delay = CV/I scaling           0.7  ~0.7  >0.7  Delay scaling will slow down
Energy/Logic Op scaling      >0.35  >0.5  >0.5  Energy scaling will slow down
Bulk Planar CMOS              High Probability               Low Probability
Alternate, 3G etc             Low Probability               High Probability
Variability                   Medium          High                 Very High
ILD (K)                         ~3    <3  Reduce slowly towards 2-2.5
RC Delay                         1     1     1     1     1     1     1     1
Metal Layers                   6-7   7-8   8-9  0.5 to 1 layer per generation

Source: Shekhar Borkar, Intel Corp.

ITRS Roadmap for Logic Devices

From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et al., 2008

Where Does the Power Go in CMOS?

Dynamic Power Consumption
  Charging and discharging capacitance
Short Circuit Power
  Short circuit path between supply rails during switching
  Nominally 10%-20% of dynamic power and can be ignored for a first order analysis
Leakage
  Leaky transistors

Dynamic Power

$P_{DYNAMIC} = C_L \times V_{DD} \times V_{DD} \times \text{Frequency}$

[Figure: input to a CMOS inverter switching between 0 and VDD over period T; current $i_{DD}$ charges and then discharges the output capacitor $C_L$]

Dynamic power is used in charging and discharging the capacitances in the CMOS circuit.
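
The dynamic power relation is easy to evaluate directly. The sketch below is only an illustration: the capacitance, voltage and frequency values are assumed, not taken from the lecture.

```python
def dynamic_power(c_load: float, vdd: float, freq: float) -> float:
    """P_DYNAMIC = C_L * VDD * VDD * Frequency, in watts."""
    return c_load * vdd * vdd * freq

# Assumed, illustrative values: 1 nF of switched capacitance, 1.0 V supply, 2 GHz clock.
base = dynamic_power(1e-9, 1.0, 2e9)
low_v = dynamic_power(1e-9, 0.8, 2e9)   # same circuit with the supply lowered to 0.8 V
print(base, low_v, low_v / base)
```

Because the voltage term is squared, a 20% lower supply removes about 36% of the dynamic power at the same frequency.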


Static Power

Technology scaling has caused transistors to become smaller and smaller. As a result, static power has become a substantial portion of the total power.

$P_{STATIC} = V_{DD} \times I_{STATIC}$

[Figure: CMOS inverter with Input = 0 and Output = VDD, showing Gate Leakage, Junction Leakage and Sub-threshold Leakage]

Energy-Delay Interaction

Delay decreases with supply voltage but energy/power increases

[Figure: energy and delay plotted against $V_{DD}$, and the energy-delay product (EDP) plotted against $V_{DD}$]
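
The opposing trends can be sketched numerically. This is a rough model in normalized units, assuming switching energy proportional to C times VDD squared and delay given by C * VDD / I_SAT, with I_SAT following k * (VDD - Vth)^alpha. The current model and the values of `vth`, `alpha` and `k` are assumptions made for illustration, not parameters from the lecture.

```python
def energy(c: float, vdd: float) -> float:
    """Switching energy per operation, proportional to C * VDD^2 (normalized units)."""
    return c * vdd ** 2

def delay(c: float, vdd: float, vth: float = 0.3, alpha: float = 1.3, k: float = 1.0) -> float:
    """Delay = C * VDD / I_SAT, with I_SAT assumed to be k * (VDD - Vth)^alpha."""
    i_sat = k * (vdd - vth) ** alpha
    return c * vdd / i_sat

for vdd in (0.6, 0.8, 1.0, 1.2):
    e, d = energy(1.0, vdd), delay(1.0, vdd)
    print(f"VDD={vdd:.1f}  energy={e:.2f}  delay={d:.2f}  EDP={e * d:.2f}")
```

Under these assumptions energy rises and delay falls as the supply voltage increases, so the energy-delay product is one way to weigh the two against each other.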

Static Energy-Delay Interaction

Static energy increases exponentially with decrease in threshold voltage
Delay increases with threshold voltage

[Figure: leakage and delay plotted against $V_{th}$; MOSFET cross-section labeled GATE, SOURCE and DRAIN, with oxide thickness $t_{ox}$ and channel length L]
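
The exponential dependence on threshold voltage can be illustrated with the usual sub-threshold current form, I = i0 * exp(-Vth / (n * vT)). This is a sketch only: `i0`, the slope factor `n` and the thermal voltage `v_thermal` are assumed values, and gate and junction leakage are ignored.

```python
import math

def subthreshold_leakage(vth: float, i0: float = 1.0, n: float = 1.5,
                         v_thermal: float = 0.026) -> float:
    """Relative leakage current modeled as i0 * exp(-Vth / (n * vT)); parameters are assumed."""
    return i0 * math.exp(-vth / (n * v_thermal))

for vth in (0.2, 0.3, 0.4):
    print(f"Vth={vth:.1f} V  relative leakage={subthreshold_leakage(vth):.2e}")
```

With these assumed parameters, each 100 mV reduction in threshold voltage raises leakage by roughly an order of magnitude, which is why lowering the threshold to reduce delay is costly in static energy.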

Power Vs. Energy

Power is a rate of expenditure of energy
  One joule/sec = one watt
Both profiles use the same amount of energy at different rates or power

[Figure: two plots of Power (watts) against Time, one with levels P0, P1 and P2 and one at a constant P0; same energy = area under the curve]
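
Energy as the area under the power curve can be sketched as a simple sum. The two power profiles below are hypothetical and chosen only so that their areas match.

```python
def energy_joules(power_watts: list[float], dt: float) -> float:
    """Energy as the area under a piecewise-constant power profile sampled every dt seconds."""
    return sum(p * dt for p in power_watts)

# Two hypothetical profiles, one sample per second.
steady = [2.0, 2.0, 2.0, 2.0]   # constant 2 W for 4 s
bursty = [4.0, 3.0, 1.0, 0.0]   # higher peak power, same area
print(energy_joules(steady, 1.0), energy_joules(bursty, 1.0), max(steady), max(bursty))
```

Both profiles spend 8 J, but the second peaks at twice the power, which matters for a thermal envelope even though a battery would see no difference.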

Optimizing Power vs. Energy

Thermal envelopes → minimize peak power
Maximize battery life → minimize energy

The Problem

Historically performance scaling was accompanied by power scaling
This is no longer true → power densities are increasing

The End of Dennard Scaling

[Figure: MOSFET cross-section labeled GATE, SOURCE and DRAIN, with oxide thickness $t_{ox}$ and channel length L]

Voltage is no longer scaling at the same rate
Slower scaling in power per transistor → increasing power densities

From R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
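
The link between voltage scaling and power density can be made explicit with the dynamic power relation $P = C V_{dd}^{2} f$. The following is a simplified sketch, assuming a per-generation scaling factor $k > 1$, with capacitance scaling as $1/k$, frequency as $k$, and area per transistor $A$ as $1/k^{2}$. These are the classical constant-field scaling assumptions rather than figures taken from the slide, and leakage is ignored.

```latex
\begin{align*}
\text{Ideal scaling } (V_{dd} \to V_{dd}/k):\quad
  P' &= \frac{C}{k}\left(\frac{V_{dd}}{k}\right)^{2}(k f) = \frac{P}{k^{2}},
  &\frac{P'}{A'} &= \frac{P/k^{2}}{A/k^{2}} = \frac{P}{A} \\
\text{Fixed voltage } (V_{dd} \to V_{dd}):\quad
  P' &= \frac{C}{k}\,V_{dd}^{2}\,(k f) = P,
  &\frac{P'}{A'} &= \frac{P}{A/k^{2}} = k^{2}\,\frac{P}{A}
\end{align*}
```

When voltage scales with the dimensions, power density stays constant. When voltage stops scaling, power per transistor no longer falls while the area does, so power density grows with each generation.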

Chip Power Densities

From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et al., 2008

What is the Problem?

Based on scaling using Pentium-class cores
While Moore’s Law continues, scaling phenomena have changed
  Power densities are increasing with each generation

Mukhopadhyay and Yalamanchili (2009)

The Power Wall

Power per transistor scales with frequency but also scales with $V_{dd}$
  Lower $V_{dd}$ can be compensated for with increased pipelining to keep throughput constant
Power per transistor is not the same as power per area → power density is the problem!
  Multiple units can be run at lower frequencies to keep throughput constant, while saving power


The Advent of Dark Silicon?

[Figure: 64-core asymmetric chip multiprocessor layout and failure probability distribution, with in-order and out-of-order cores]

Cannot afford to turn on all devices at once
How do we manage the power and thermals?

What are my Options?

1. Better technology
  Manufacturing
  New Devices → non-CMOS?
2. Be more efficient → activity management
  Clock gating
  Power gating
  Power management
3. Improved architecture
  Simpler pipelines
4. Parallelism

Activity Management

Clock Gating
  Turn off clock to a block of logic
  Eliminate unnecessary transitions/activity
  Clock distribution power
Power Gating
  Turn off power to a block of logic, e.g., core
  No leakage

[Figure: clock gating circuit (clk, cond, input, combinational logic) and power gating circuit ($V_{dd}$, power gate transistor, Core 0, Core 1)]

Power Management

Software controlled power management
  Optimize power and/or energy
  Orchestrated by the operating system or application libraries
  Industry standard interfaces for power management
    o Advanced Configuration and Power Interface (ACPI)
      https://www.acpica.org/
      http://www.acpi.info/
Hardware power management
  Optimized power/energy
  Failsafe operation, e.g., protect against thermal emergencies

Processor Power States

Performance States: P-states
  Operate at different voltage/frequencies
    o Recall delay-voltage relationship
  Lower voltage → lower leakage
  Lower frequency → lower power (not the same as energy!)
  Lower frequency → longer execution time
Idle States: C-states
  Sleep states
  Differ in how much state is saved
SW or HW managed transitions between states!
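
The difference between lowering power and lowering energy can be seen in a small numerical sketch. It assumes a simple model, P = c_eff * VDD^2 * f + VDD * i_leak, for a fixed amount of work measured in cycles; `c_eff`, `i_leak` and the three operating points are illustrative values, not data for any real processor.

```python
def run(cycles: float, vdd: float, freq: float, c_eff: float = 1e-9, i_leak: float = 0.5):
    """Return (power W, time s, energy J) for a fixed amount of work.

    Assumed model: P = c_eff * vdd^2 * freq + vdd * i_leak; all parameter values are illustrative.
    """
    power = c_eff * vdd ** 2 * freq + vdd * i_leak
    seconds = cycles / freq
    return power, seconds, power * seconds

work = 2e9  # cycles of work to complete
for label, vdd, freq in [("nominal", 1.0, 2e9), ("freq only", 1.0, 1e9), ("DVFS", 0.8, 1e9)]:
    p, t, e = run(work, vdd, freq)
    print(f"{label:10s} power={p:.2f} W  time={t:.1f} s  energy={e:.2f} J")
```

In this model, halving only the frequency lowers power but raises energy, because leakage is paid for twice as long. Lowering the voltage along with the frequency reduces both power and energy, at the cost of longer execution time.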

Multiple Voltage Frequency Domains

Cores and ring in one DVFS domain
Graphics unit in another DVFS domain
Cores and portion of cache can be gated off

[Figure: Intel Sandy Bridge Processor]

From E. Rotem et al., HotChips 2011

Power States

From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html

Power Gating

[Figure: Intel Sandy Bridge Processor]

Turn off components that are not being used
  Lose all state information
Costs of powering down
Costs of powering up
Smart shutdown
  Models to guide decisions
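
One simple model for guiding a shutdown decision is a break-even test: gate a block only when the leakage energy saved while idle exceeds the energy spent powering down and back up. The sketch below is an assumption about what such a model could look like, not the policy used in Sandy Bridge; all numbers are hypothetical, and wake-up latency and state restoration are ignored.

```python
def should_power_gate(idle_s: float, p_leak_w: float, e_down_j: float, e_up_j: float) -> bool:
    """Gate only if leakage energy saved over the idle period exceeds the transition cost."""
    saved_j = p_leak_w * idle_s
    return saved_j > e_down_j + e_up_j

# Hypothetical numbers: 0.5 W of leakage, 1 mJ to power down, 2 mJ to power up.
break_even_s = (1e-3 + 2e-3) / 0.5
print(break_even_s,
      should_power_gate(0.001, 0.5, 1e-3, 2e-3),
      should_power_gate(0.010, 0.5, 1e-3, 2e-3))
```

With these numbers the break-even idle time is 6 ms, so a 1 ms idle period is not worth gating while a 10 ms one is.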

Simplify Core Design

[Figures: AMD Bulldozer Core; ARM A7 Core (arm.com)]

Support for out of order execution, schedulers, branch prediction, etc. consumes more energy per instruction
Can fit many more simpler cores on a die

Parallelism and Power

[Figures: IBM Power5 (Source: IBM); AMD Trinity (Source: forwardthinking.pcmag.com)]

How much of the chip area is devoted to compute?
Run many cores slower. Why does this reduce power?

Parallelism

Concurrency + lower frequency → greater energy efficiency

Example
  4X #cores
  0.75x voltage
  0.5x Frequency
  → 1X power
  → 2X in performance

[Figure: one Core and Cache block alongside four Core and Cache blocks]
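
The example can be checked with the dynamic power relation. This sketch assumes dynamic power proportional to cores * VDD^2 * frequency, ignores leakage, and assumes the work parallelizes perfectly; the function names are illustrative.

```python
def relative_power(cores: float, vdd: float, freq: float) -> float:
    """Dynamic power relative to baseline, using P proportional to cores * VDD^2 * frequency."""
    return cores * vdd ** 2 * freq

def relative_throughput(cores: float, freq: float) -> float:
    """Throughput relative to baseline, assuming perfectly parallel work."""
    return cores * freq

print(relative_power(4, 0.75, 0.5), relative_throughput(4, 0.5))  # prints 1.125 2.0
```

Four cores at 0.75x voltage and 0.5x frequency draw about 1.1x the baseline dynamic power, which the example rounds to 1X, while delivering 2X the throughput.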

Microarchitectural Level Models

How can we study power consumption without building circuits?
  Models
Models are available at multiple levels of abstraction.
  We are interested in microarchitectural models

Processor Microarchitecture

[Figure: processor block diagram. Blocks: Instruction Cache, Instruction TLB, Branch Prediction, Fetch Queue, Instruction Decoder, Instruction Queue, Register Files, ALU, MUL, FPU, LD, ST, L1 Data Cache, Data TLB, L2 Data Cache, NoC Router, On-Chip Network. Stages: Fetch, Decode, Execute/Writeback, Memory, Network.]

Energy/Power Calculation

How do we calculate energy or power dissipation for a given microarchitecture?
Energy/Power varies between:
  Different ISA; ARM vs Intel x86
  Different microarchitecture; in-order vs out-of-order
  Different applications; memory vs compute-bound
  Different technologies; 90nm vs 22nm technology
  Different operation conditions; frequency, temperature

Architecture Activity (1)

[Figure: processor microarchitecture block diagram, repeated from the Processor Microarchitecture slide]

Activity 1: Instruction Fetch
  icache.read++;
  fbuffer.write++;

Collect activity counts of each architecture component (through simulation or measurement).
List of components differs between microarchitectures.
Activity counts at each component differ between applications.

Architecture Activity (2)

[Figure: processor microarchitecture block diagram, repeated from the Processor Microarchitecture slide]

Activity 2: Instruction Decode
  fbuffer.read++;
  idecoder.logic++;

Read/write accesses to caches, buffers, etc.
Logical accesses to logic blocks such as decoder, ALUs, etc.
Tradeoff of differentiating more access types (accuracy) vs simulation speed (complexity).
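
The counter updates for the two activities can be sketched as follows. This is a toy illustration of activity counting, not code from any particular simulator; the counter names mirror the ones on the slides and everything else is assumed.

```python
from collections import Counter

counters = Counter()

def fetch() -> None:
    """Activity 1: instruction fetch reads the instruction cache and writes the fetch buffer."""
    counters["icache.read"] += 1
    counters["fbuffer.write"] += 1

def decode() -> None:
    """Activity 2: instruction decode reads the fetch buffer and exercises the decoder logic."""
    counters["fbuffer.read"] += 1
    counters["idecoder.logic"] += 1

for _ in range(3):   # pretend three instructions flow through fetch and decode
    fetch()
    decode()
print(dict(counters))
```

Tracking more access types per component gives a finer energy estimate, at the cost of more bookkeeping per simulated event.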

Power and Architecture Activity

For example, at the n-th clock cycle, collected counters are:
  Data cache:
    o read = 20, write = 12;
    o per-read energy = 0.5nJ; per-write energy = 0.6nJ;
    o Read energy = read * per-read energy = 10nJ
    o Write energy = write * per-write energy = 7.2nJ
    o Total activity energy = read + write energies = 17.2nJ
    o If n = 50th clock cycle and clock frequency = 2GHz,
      Total activity power = energy * clock_freq / n = 688mW

*Note: n / clock_freq = n clock periods in sec
  power = time average of energy
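
The data cache example can be reproduced in a few lines. The function name and dictionary layout are illustrative; the counts, per-access energies, cycle count and clock frequency are the ones given in the example.

```python
def activity_power(counts: dict, per_access_nj: dict, n_cycles: int, clock_hz: float):
    """Return (energy in nJ, average power in W) over n_cycles clock cycles."""
    energy_nj = sum(counts[k] * per_access_nj[k] for k in counts)
    elapsed_s = n_cycles / clock_hz          # n clock periods in seconds
    return energy_nj, energy_nj * 1e-9 / elapsed_s

energy_nj, power_w = activity_power({"read": 20, "write": 12},
                                    {"read": 0.5, "write": 0.6}, 50, 2e9)
print(f"{energy_nj:.1f} nJ, {power_w * 1e3:.0f} mW")
```

Fifty cycles at 2 GHz last 25 ns, and 17.2 nJ spent over 25 ns averages to 688 mW.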

Things to consider (1)

1. How do we calculate per-read/write energies?
  Per-access energies can be estimated from circuit-level designs and analyses.
  There are various open-source tools for this.

[Figure: Architecture Specification and Technology Parameters feed a Circuit-level Estimation Tool, which produces Estimation Results: Area, Energy, Timing, etc.]

Things to consider (2)

2. Is per-access energy always the same?
  Per-access energy in fact depends on:
    how many bits are switching
    how they are switching (0→1 or 1→0)
  It is reasonable to assume constant per-access energy in long-term observation (e.g., n = 1M clock cycles); the number of switching bits is averaged (e.g., 50% of bits are switching).
  Most architecture simulators do not capture bit-level details due to simulation complexity.
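
To make the bit-level dependence concrete, the sketch below counts how many bits switch between two consecutive values and in which direction. It illustrates the kind of detail a bit-accurate model would track; the function name and the 4-bit example are assumptions.

```python
def transitions(prev: int, curr: int, width: int = 32) -> tuple[int, int]:
    """Count (0->1, 1->0) bit transitions between two values of the given bit width."""
    mask = (1 << width) - 1
    rising = bin(~prev & curr & mask).count("1")    # bits that were 0 and are now 1
    falling = bin(prev & ~curr & mask).count("1")   # bits that were 1 and are now 0
    return rising, falling

print(transitions(0b1100, 0b1010, width=4))  # prints (1, 1)
```

Averaged over a long run, such counts are often replaced by a fixed fraction of switching bits, which is what a constant per-access energy implies.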

Things to consider (3)

3. If a register file didn’t have read/write accesses but held data, what is the energy dissipation?
  Energy (or power) is largely comprised of dynamic and static dissipations.
    Dynamic (or switching) energy refers to energy dissipation due to switching activities.
    Static (or leakage) energy is dissipation to keep the electronic system turned on.
  In this case, the register file has no dynamic energy dissipation but consumes static energy.
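
The idle register file case can be sketched by adding a static term to the activity-based energy. The leakage power and observation window below are hypothetical values chosen only to show the calculation.

```python
def total_energy_nj(accesses: int, per_access_nj: float,
                    p_static_mw: float, elapsed_us: float) -> float:
    """Dynamic energy from accesses plus static energy from leakage over the elapsed time."""
    dynamic_nj = accesses * per_access_nj
    static_nj = p_static_mw * elapsed_us     # mW * us = nJ
    return dynamic_nj + static_nj

# Hypothetical register file: no accesses, 2 mW of leakage, observed for 10 us.
print(total_energy_nj(accesses=0, per_access_nj=0.1, p_static_mw=2.0, elapsed_us=10.0))
```

With zero accesses the dynamic term vanishes, yet the block still dissipates 20 nJ of static energy over the window.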

Thermal Issues

Heat can cause damage to the chip
  Need failsafe operation
Thermal fields change the physical characteristics
  Leakage current and therefore power increases
  Delay increases
  Device degradation becomes worse
Cooling solution determines the permitted power dissipation

Thermal Design Power (TDP)

This is the maximum power at which the part is designed to operate
  Dictates the design of the cooling system
    o Max temperature → $T_{jmax}$
  Typically fixed by worst case workload
Parts are typically operating below the TDP
  Opportunities for turbo mode?

[Figure: AMD Trinity APU]

http://ecs.vancouver.wsu.edu/thermofluids-research

Trinity TDP

Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2

Exploiting the Physics

Most of the time the part is operating well below its thermal limit
  Leaving performance on the table
Can temporarily boost frequency (and therefore power dissipation) for short periods of time, e.g., seconds
  Temperature changes slowly


Boosting

Exploit package physics
  Temperature changes on the order of milliseconds
  Use the thermal headroom

[Figure: Intel Sandy Bridge power over time, with levels Max Power, TDP Power and Low power; time at low power builds up thermal credits, followed by a turbo boost region lasting 10s of seconds]
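
How long a boost can last may be estimated with a lumped thermal model. The sketch below assumes a single thermal resistance and capacitance, so that temperature approaches its steady-state value exponentially. The model and every parameter value (`t_amb`, `r_th`, `c_th`, the power levels) are assumptions for illustration, not Sandy Bridge data or its actual boost algorithm.

```python
import math

def boost_duration(p_boost: float, t_start: float, t_max: float,
                   t_amb: float = 40.0, r_th: float = 0.5, c_th: float = 40.0) -> float:
    """Seconds until a lumped RC thermal model reaches t_max while dissipating p_boost watts.

    Assumed model: T(t) = T_ss + (t_start - T_ss) * exp(-t / (r_th * c_th)),
    with steady state T_ss = t_amb + p_boost * r_th.
    """
    t_ss = t_amb + p_boost * r_th
    if t_ss <= t_max:
        return math.inf                      # the limit is never reached; boost is sustainable
    return r_th * c_th * math.log((t_ss - t_start) / (t_ss - t_max))

print(boost_duration(p_boost=130.0, t_start=60.0, t_max=100.0))
print(boost_duration(p_boost=130.0, t_start=50.0, t_max=100.0))  # cooler start, longer boost
```

In this model a cooler starting temperature, that is, more accumulated thermal headroom, allows a longer boost before the temperature limit is reached.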

Conclusions

Power/energy is the leading driver of modern architecture design
Power and energy management is key to scalability
Need integrated power/energy, performance, thermal management in fielded systems
What about energy/power efficient algorithms?

Study Guide

Explain the difference between energy dissipation and power dissipation
Distinguish between static power dissipation and dynamic power dissipation
Be able to apply the simplified McPAT power model to a simple datapath and instruction sequence
Explain dynamic voltage frequency scaling
  What are power states?
  Why is this an advantage?
  What is the impact of DVFS on i) energy, ii) execution time, and iii) power?

Study Guide (cont.)

How is thermal design power (TDP) calculated?
When using boost algorithms, what determines the duration of the high frequency operation?
How does a power virus work?
Describe how throttling works
Know the power dissipation in some modern processor-memory systems drawn from the embedded, server, and high performance computing segments