New processors - KTU - Kompiuterių katedra

Dec 2, 2013

COMPUTER ARCHITECTURE (P175B125)

Assoc. Prof. Stasys Maciulevičius
Computer Dept.
stasys.maciulevicius@ktu.lt

©S.Maciulevičius 2012

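Intel’s strategy

Intel introduces a new microprocessor architecture every 2 years as part of its Tick-Tock strategy: a "tick" shrinks the manufacturing process, and the following "tock" introduces a new microarchitecture on that process.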
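As a rough illustration of the cadence, the sketch below tabulates the generations covered in these slides and labels each step as a tick (smaller process) or a tock (new microarchitecture on the same process). The process nodes are the publicly known ones; the `Generation` tuple and the `classify` and `label_roadmap` helpers are names made up for this example.

```python
from typing import NamedTuple


class Generation(NamedTuple):
    name: str
    process_nm: int


# Generations in release order (Westmere is the 32 nm shrink of Nehalem).
ROADMAP = [
    Generation("Nehalem", 45),
    Generation("Westmere", 32),
    Generation("Sandy Bridge", 32),
    Generation("Ivy Bridge", 22),
    Generation("Haswell", 22),
]


def classify(prev: Generation, cur: Generation) -> str:
    """A 'tick' moves to a smaller process; a 'tock' keeps the process."""
    return "tick" if cur.process_nm < prev.process_nm else "tock"


def label_roadmap(roadmap):
    """Label every generation except the first relative to its predecessor."""
    return [(cur.name, classify(prev, cur)) for prev, cur in zip(roadmap, roadmap[1:])]


for name, kind in label_roadmap(ROADMAP):
    print(f"{name}: {kind}")
```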


Intel’s Sandy Bridge

Sandy Bridge is the codename for a microarchitecture developed by Intel beginning in 2005 for CPUs in computers to replace the Nehalem microarchitecture

It was designed for the full range of applications from mobile devices, laptop and desktop computers, to large enterprise servers

Intel demonstrated a Sandy Bridge processor in 2009, and released the first products based on the architecture in January 2011


Intel’s Sandy Bridge

Sandy Bridge main features:

32 nm fabrication process

CPU clock rate 1.4-3.4 GHz, graphics clock rate 350-850 MHz (for different models)

Turbo Boost 2.0 technology can raise these clock rates to as much as 3.8 GHz and 1350 MHz respectively

32 kB data + 32 kB instruction L1 cache (3 clocks) and 256 kB L2 cache (8 clocks) per core

Shared L3 cache: 3-8 MB (25 clocks)
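To put the cache latencies in perspective, a cycle count can be converted to time at a given clock rate. The sketch below does this for the figures listed on the slide at the 3.4 GHz top base clock; the helper name `cycles_to_ns` is an assumption made for illustration.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """One cycle at f GHz lasts 1/f nanoseconds."""
    return cycles / clock_ghz


# Cache latencies in clocks, as listed on the slide.
LATENCY_CLOCKS = {"L1": 3, "L2": 8, "L3": 25}

for level, clocks in LATENCY_CLOCKS.items():
    ns = cycles_to_ns(clocks, 3.4)
    print(f"{level}: {clocks} clocks = {ns:.2f} ns at 3.4 GHz")
```

The same cycle count costs more wall-clock time on the 1.4 GHz parts, which is why latencies are usually quoted in clocks rather than nanoseconds.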


Intel’s Sandy Bridge

Sandy Bridge has an integrated graphics controller and a specialized accelerator that significantly speeds up multimedia content processing

Sandy Bridge supports DirectX 10.1 and OpenCL 1.1; its performance far exceeds that of the first-generation Core processors

Advanced Vector Extensions (AVX): a 256-bit instruction set with wider vectors, new extensible syntax and rich functionality
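In practice, "wider vectors" means more elements processed per instruction. A small sketch of that arithmetic, comparing 128-bit SSE registers with 256-bit AVX registers; the function name `lanes` is made up for this example.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one vector register."""
    return register_bits // element_bits


for isa, width in (("SSE", 128), ("AVX", 256)):
    print(
        f"{isa}: {lanes(width, 32)} single-precision or "
        f"{lanes(width, 64)} double-precision elements per register"
    )
```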


Intel’s Sandy Bridge

Decoded micro-operation cache and enlarged, optimized branch predictor

256-bit/cycle ring bus interconnect between cores, graphics, cache and System Agent Domain

Intel Quick Sync Video, hardware support for video encoding and decoding

Up to 8 physical cores or 16 logical cores through Hyper-threading

TDP of desktop CPUs is 35-95 W, for mobile CPUs 17-55 W

[Figure: Intel’s Sandy Bridge]

[Figure: Sandy Bridge microarchitecture]

[Figure: Sandy Bridge: L0 cache]

Sandy Bridge: ring bus

Each core, each slice of L3 (LLC) cache, the on-die GPU, media engine and the system agent all have a stop on the ring bus

The bus is made up of four independent rings: a data ring, request ring, acknowledge ring and snoop ring. Each stop for each ring can accept 32 bytes of data per clock
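The 32 bytes per clock figure translates directly into peak bandwidth per ring stop. The sketch below assumes the ring runs at the core clock (an assumption not stated on the slide), and the function name is hypothetical.

```python
BYTES_PER_CLOCK = 32  # per ring stop, from the slide (256 bits per cycle)


def stop_bandwidth_gb_s(clock_ghz: float) -> float:
    """Peak bytes accepted by one ring stop per second, in GB/s (10^9 bytes).

    Assumes the ring is clocked at clock_ghz, i.e. at the core clock.
    """
    return BYTES_PER_CLOCK * clock_ghz


print(f"{stop_bandwidth_gb_s(3.4):.1f} GB/s per stop at 3.4 GHz")
```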




Intel’s Ivy Bridge

Ivy Bridge is the first chip to use Intel's 22 nm tri-gate transistors, which will help scale frequency and reduce power consumption

At a high level Ivy Bridge looks a lot like Sandy Bridge

Ivy Bridge is considered a tick from the CPU perspective but a tock from the GPU perspective


Intel’s Ivy Bridge

Ivy Bridge introduces configurable TDP (cTDP) that allows the platform to increase the CPU's TDP if given additional cooling, or decrease the TDP to fit into a smaller form factor

Ivy Bridge Configurable TDP

                  cTDP Up   Nominal   cTDP Down
Ivy Bridge XE     65W       55W       45W
Ivy Bridge ULV    33W       17W       13W
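One way to read the configurable TDP figures is as a lookup that a platform consults once it knows how much heat the chassis can dissipate. The sketch below is a hypothetical selection policy, not Intel's mechanism; the `CTDP` table holds the values from the slide and `pick_tdp` is a name invented for this example.

```python
# Configurable TDP levels in watts, as given on the slide.
CTDP = {
    "Ivy Bridge XE": {"down": 45, "nominal": 55, "up": 65},
    "Ivy Bridge ULV": {"down": 13, "nominal": 17, "up": 33},
}


def pick_tdp(part: str, cooling_watts: float) -> int:
    """Return the highest cTDP level the available cooling can dissipate.

    Hypothetical policy: raises ValueError if even cTDP Down does not fit.
    """
    fitting = [watts for watts in CTDP[part].values() if watts <= cooling_watts]
    if not fitting:
        raise ValueError(f"{part} needs at least {CTDP[part]['down']} W of cooling")
    return max(fitting)


print(pick_tdp("Ivy Bridge ULV", 20))  # thin chassis that can dissipate 20 W
print(pick_tdp("Ivy Bridge ULV", 35))  # same part with additional cooling
```

With these inputs the ULV part runs at its nominal level in the thin chassis and at cTDP Up when extra cooling is available.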


Intel’s Ivy Bridge

Sandy Bridge brought a completely redesigned GPU core onto the processor die itself

With Ivy Bridge the GPU remains on die but it grows more than the CPU does this generation

Ivy Bridge GPU adds support for OpenCL 1.1, DirectX 11 and OpenGL 3.1

©S.Maciulevičius 2013

Intel’s Haswell

Haswell is the codename for a processor microarchitecture developed as the successor to the Ivy Bridge architecture

Using the 22 nm process, Intel is expected to release CPUs based on this microarchitecture around June 2, 2013

With Haswell, Intel will introduce a new low-power processor designed for convertible or 'hybrid' Ultrabooks

[Figure: From Nehalem to Haswell]


Intel’s Haswell

The Haswell architecture is specifically designed to optimize power savings and performance

Haswell is expected to launch in three major forms:

Desktop version (LGA1150 socket): Haswell-DT

Mobile/Laptop version (PGA socket): Haswell-MB

BGA version:

    47W and 57W TDP classes: Haswell-H (for "All-in-one" systems, Mini-ITX form factor motherboards, and other small footprint formats)

    13.5W and 15W TDP classes (SoC): Haswell-ULT (for Intel's UltraBook platform)

    10W TDP class (SoC): Haswell-ULX (for tablets and certain UltraBook-class implementations)


Intel’s Haswell: Performance

Compared to Ivy Bridge (expected):

Twice the vector processing performance

At least 10% sequential CPU performance increase (8 execution ports per core versus 6)

Up to double the performance of the integrated GPU
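The "twice the vector processing performance" figure can be related to fused multiply-add (FMA). The sketch below assumes the commonly cited port layout, which is not stated on the slide: one 256-bit multiply port plus one 256-bit add port per core on Ivy Bridge, and two 256-bit FMA ports per core on Haswell. The function name is hypothetical.

```python
LANES_SP = 256 // 32  # single-precision elements in a 256-bit register


def peak_flops_per_cycle(units: int, flops_per_lane: int) -> int:
    """Peak single-precision floating-point operations per core per cycle."""
    return units * LANES_SP * flops_per_lane


# Assumed: one multiply port + one add port, each doing 1 flop per lane.
ivy_bridge = peak_flops_per_cycle(units=2, flops_per_lane=1)
# Assumed: two FMA ports; a*b + c counts as 2 flops per lane.
haswell = peak_flops_per_cycle(units=2, flops_per_lane=2)

print(ivy_bridge, haswell, haswell / ivy_bridge)
```

Under these assumptions the peak doubles only for code that can actually be expressed as multiply-add pairs; other workloads see less.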

[Figure: CPU Idle Power]

AVX2, FMA

Some models

CPU             Freq.     Turbo Boost   Cache   Cores / Threads   TDP
Core i7-4770K   3.5 GHz   3.9 GHz       8 MB    4 / 8             84 W
Core i7-4770    3.4 GHz   3.9 GHz       8 MB    4 / 8             84 W
Core i7-4770S   3.1 GHz   3.9 GHz       8 MB    4 / 8             65 W
Core i7-4770T   2.5 GHz   3.7 GHz       8 MB    4 / 8             45 W
Core i7-4765T   2.0 GHz   3.0 GHz       8 MB    4 / 8             35 W


AMD’s APU

An accelerated processing unit (APU) is a processing system that includes additional processing capability designed to accelerate one or more types of computations outside of a CPU

This may include a graphics processing unit (GPU) used for general-purpose computing (GPGPU), a field-programmable gate array (FPGA), or similar specialized processing system


AMD’s APU

At the most basic level, AMD’s new Accelerated Processing Units combine general-purpose x86 CPU cores with programmable vector processing engines on a single silicon die

AMD’s APUs also include a variety of critical system elements, including memory controllers, I/O controllers, specialized video decoders, display outputs, and bus interfaces

[Figure: AMD view on APUs]

AMD Fusion

AMD Fusion is the marketing name for a series of APUs by AMD, aimed at providing good performance with low power consumption, and integrating a CPU and a GPU based on a mobile stand-alone GPU


AMD Fusion

The first demonstration of AMD Fusion was at Computex 2010 (Taipei, Taiwan, June 2, 2010)




AMD Bulldozer

Bulldozer is the codename AMD has given to one of the CPU cores based on the AMD family 15h microarchitecture

Bulldozer is designed from scratch, not a development of earlier processors

AMD has introduced a new microarchitecture building block called a module

In terms of hardware complexity and functionality, a module is midway between a dual-core processor (in which each core is fully independent) and a single processor core that has two SMT threads (in which each thread shares most of the hardware resources with the other thread)


AMD Bulldozer

A module consists of two tightly coupled, "conventional" x86 out-of-order processing engines

The two processing engines share the early pipeline stages (e.g. instruction fetch, decode), the FPUs, and the L2 cache


AMD Bulldozer

Two dedicated integer cores:

    each consists of two ALUs and two AGUs, which are capable of a total of 4 independent arithmetic and memory operations per clock per core

    duplicating the integer schedulers and execution pipelines gives each of the two threads dedicated hardware, which significantly increases performance in multithreaded integer applications

    the second integer core increases the Bulldozer module die area by around 12%, which at chip level adds about 5% of total die space


AMD Bulldozer

Two symmetrical 128-bit FMAC (fused multiply-add capability) floating-point pipelines per module that can be unified into one large 256-bit-wide unit if one of the integer cores dispatches an AVX instruction, and two symmetrical x87/MMX/SSE capable FPPs for backward compatibility with SSE2 non-optimized software

Multiple modules share an L3 cache as well as an Advanced Dual-Channel Memory Sub-System (IMC, Integrated Memory Controller)

A dual-core Bulldozer processor has a single module, a quad-core processor has two modules and an octo-core processor has four modules
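The module organisation described on these slides can be summarised as a small data structure. This is only a bookkeeping sketch of the numbers quoted above (two integer cores per module, two ALUs and two AGUs per core, two 128-bit FMAC pipelines per module); the class and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BulldozerModule:
    integer_cores: int = 2       # one dedicated integer core per thread
    alus_per_core: int = 2
    agus_per_core: int = 2
    fmac_pipes: int = 2          # shared by both integer cores
    fmac_width_bits: int = 128

    def int_ops_per_clock(self) -> int:
        """Arithmetic plus memory operations per clock for the whole module."""
        return self.integer_cores * (self.alus_per_core + self.agus_per_core)

    def fused_fp_width_bits(self) -> int:
        """Width when both FMAC pipelines are unified for an AVX instruction."""
        return self.fmac_pipes * self.fmac_width_bits


def cores(modules: int) -> int:
    """Marketed core count for a processor with the given number of modules."""
    return modules * BulldozerModule().integer_cores


for modules in (1, 2, 4):
    print(f"{modules} module(s) -> {cores(modules)} cores")
```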





AMD Bulldozer

The first shipments of Bulldozer-based Opteron processors began in September 2011

On 12 October 2011, AMD released the first four FX-series processors of the Bulldozer line (FX-8150, FX-8120, FX-6100, FX-4100)

AMD stated on its blog that “there are some in our community who feel the product performance did not meet their expectations”

AMD said that the remaining FX series processors would be released at the end of the first quarter of 2012

[Figures: AMD Piledriver]

Improvements in Piledriver

Improved branch prediction accuracy through a hybrid predictor augmented with a second-level predictor;

128- and 256-bit FMA3 instruction extensions (fused multiply-add) and F16C SSE5 instruction extensions (half-precision floating-point conversion);

Optimized schedulers;

Faster division through a modified execution unit;

Larger L1 TLB;

Improved L1 and L2 prefetchers that can work with variable-length patterns, including those crossing page boundaries;

Improved L2 cache efficiency through more aggressive eviction of unused data that the prefetcher algorithms loaded into the cache by mistake.
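Two of the items above, fused multiply-add and half-precision conversion, are easy to illustrate numerically. The sketch below emulates their effect in software rather than calling the instructions: `to_half_and_back` uses the standard `struct` format code `"e"` for IEEE 754 binary16, and `fma_reference` uses exact rational arithmetic so that only one rounding happens. Both function names are made up for this example.

```python
import struct
from fractions import Fraction


def to_half_and_back(x: float) -> float:
    """Round x to IEEE 754 half precision (binary16) and widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fma_reference(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once (what a fused unit does)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Half precision keeps about 3 decimal digits: 0.1 is not preserved.
print(to_half_and_back(1.5), to_half_and_back(0.1))

# Separate multiply and add round twice and can lose the whole result.
a, b, c = 1 + 2.0**-30, 1 - 2.0**-30, -1.0
print(a * b + c, fma_reference(a, b, c))
```

In the last line the unfused expression rounds the product to exactly 1.0 before the addition, while the single-rounding version keeps the small negative remainder.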


New x86 microarchitecture: Steamroller

Steamroller will be the third modular x86 architecture from AMD

It promises 15% to 20% higher performance per cycle per watt than the Piledriver microarchitecture released in Trinity

It comes with a new integrated DDR3-2133 memory controller and PCI Express (PCIe) 3.0

Kaveri, based on Steamroller, has up to 2 modules (4 integer cores, “ALUs”) and 2 third-generation Flex FP floating-point units


New IGP based on the Graphics Core Next architecture

Kaveri introduces a new IGP based on the Graphics Core Next architecture, which was released late last year with the Radeon HD 7970 GPU

The Kaveri IGP will consist of eight Compute Units (512 ALUs, or “shader processors”), with graphics performance even higher than the current Radeon HD 7750 GPU


New IGP based on the Graphics Core Next architecture

Like the IGP of the desktop version of Trinity (codenamed Devastator), this IGP will have dedicated GDDR5 graphics memory. The memory will sit in the same package, connected through a silicon interposer with TSVs, and should help increase IGP graphics performance, much like the SidePort technology introduced with the AMD 790GX chipset. The IGP also supports 4 display outputs (“Eyefinity4”)

