Intel Pentium 4 Processor










Intel Pentium 4 Processor




Presented by


Michele Co



(much slide content courtesy of Zhijian Lu and Steve Kelley)









Outline


Introduction (Zhijian)


Willamette (11/2000)


Instruction Set Architecture (Zhijian)


Instruction Stream (Steve)


Data Stream (Zhijian)


What went wrong (Steve)


Pentium 4 revisions


Northwood (1/2002)


Xeon (Prestonia, ~2002)


Prescott (2/2004)


Dual Core


Smithfield











Introduction


Intel Pentium 4 processor


Latest IA-32 processor equipped with a full set of IA-32 SIMD operations


First implementation of a new micro-architecture called “NetBurst” by Intel (11/2000)









IA-32


Intel architecture 32-bit (IA-32)


80386 instruction set (1985)


CISC, 32-bit addresses


“Flat” memory model


Registers


Eight 32-bit registers


Eight FP stack registers


6 segment registers









IA-32 (cont’d)


Addressing modes


Register indirect (mem[reg])


Base + displacement (mem[reg + const])


Base + scaled index (mem[reg + (2^scale x index)])


Base + scaled index + displacement (mem[reg + (2^scale x index) + displacement])
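
The four addressing modes above all reduce to one effective-address formula. The following is a minimal Python sketch, not anything from the slides: `effective_address` is a hypothetical helper, registers are modelled as plain integers, and `scale` is taken to be the exponent (0 to 3), so the index is multiplied by 1, 2, 4, or 8.

```python
def effective_address(base, index=0, scale=0, displacement=0):
    """Compute base + (2**scale * index) + displacement, wrapped to 32 bits.

    `scale` is the exponent (0..3), so the index multiplier is 1, 2, 4, or 8.
    """
    if scale not in (0, 1, 2, 3):
        raise ValueError("scale exponent must be 0..3")
    return (base + (index << scale) + displacement) & 0xFFFFFFFF


# Register indirect: mem[reg]
print(hex(effective_address(0x1000)))                                    # 0x1000
# Base + displacement: mem[reg + const]
print(hex(effective_address(0x1000, displacement=8)))                    # 0x1008
# Base + scaled index + displacement: 4-byte element 3, 16 bytes past the base
print(hex(effective_address(0x1000, index=3, scale=2, displacement=16))) # 0x101c
```

The simpler modes are just the general form with the unused terms left at zero.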


SIMD instruction sets


MMX (Pentium MMX / Pentium II)

» Eight 64-bit MMX registers, integer ops only


SSE (Streaming SIMD Extensions, Pentium III)

» Eight 128-bit registers









Pentium III vs. Pentium 4 Pipeline

Comparison Between Pentium3 and Pentium4









Execution on MPEG4 Benchmarks @ 1 GHz









Instruction Set Architecture


Pentium4 ISA =


Pentium3 ISA +


SSE2 (Streaming SIMD Extensions 2)



SSE2 is an architectural enhancement to the IA-32 architecture










SSE2


Extends MMX and the SSE extensions with 144 new instructions:


128-bit SIMD integer arithmetic operations


128-bit SIMD double-precision floating-point operations


Enhanced cache and memory management operations









Comparison Between SSE and SSE2


Both support operations on 128-bit XMM registers


SSE only supports 4 packed single-precision floating-point values


SSE2 supports more:


2 packed double-precision floating-point values


16 packed byte integers


8 packed word integers


4 packed doubleword integers


2 packed quadword integers


Double quadword









Packing


128 bits (word = 2 bytes)

2 quad words: 64 bits each

4 double words: 32 bits each
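
To illustrate the packing above, the same 16 bytes can be viewed as any of the SSE2 packed types. This sketch uses Python's standard `struct` module to reinterpret a 128-bit value. It models only the data layout (little-endian, as on IA-32), not the instructions, and `views_of_128_bits` is a name made up for this example.

```python
import struct


def views_of_128_bits(raw):
    """Reinterpret 16 raw bytes as each packed data type an XMM register can hold."""
    if len(raw) != 16:
        raise ValueError("an XMM register holds exactly 16 bytes")
    return {
        "bytes": struct.unpack("<16B", raw),       # 16 packed byte integers
        "words": struct.unpack("<8H", raw),        # 8 packed word integers
        "doublewords": struct.unpack("<4I", raw),  # 4 packed doubleword integers
        "quadwords": struct.unpack("<2Q", raw),    # 2 packed quadword integers
        "singles": struct.unpack("<4f", raw),      # 4 packed single-precision floats
        "doubles": struct.unpack("<2d", raw),      # 2 packed double-precision floats
    }


# Pack two doubles, then count the lanes in every other view of the same bits.
views = views_of_128_bits(struct.pack("<2d", 1.5, -2.0))
for name, lanes in views.items():
    print(name, len(lanes))
```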









Hardware Support for SSE2


Adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in Pentium3


Increased bandwidth in load/store for floating-point values


Loads and stores are 128 bits wide


One load plus one store can be completed between an XMM register and the L1 cache in one clock cycle









SSE2 Instructions (1)


Data movements


Move data between XMM registers and between
XMM registers and memory


Double-precision floating-point operations


Arithmetic instructions on both scalar and packed values


Logical instructions


Perform logical operations on packed double-precision floating-point values











SSE2 Instructions (2)


Compare instructions


Compare packed and scalar double-precision floating-point values


Shuffle and unpack instructions


Shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands


Conversion instructions


Conversion between doubleword and double-precision floating-point values, or between single-precision and double-precision floating-point values









SSE2 Instructions (3)


Packed single-precision floating-point instructions


Convert between single-precision floating-point and doubleword integer operands


128-bit SIMD integer instructions


Operations on integers contained in XMM
registers


Cacheability Control and Instruction Ordering


More operations for caching of data when storing
from XMM registers to memory and additional
control of instruction ordering on store operations









Conclusion


Pentium4 is equipped with the full set of IA-32 SIMD technology. All existing software can run correctly on it.


AMD has decided to embrace and
implement SSE and SSE2 in future CPUs

Instruction Stream









Instruction Stream


What’s new?


Added Trace Cache


Improved branch predictor


Terminology



µop


Micro-op, already decoded RISC-like instructions


Front end


Instruction fetch and issue










Front End


Prefetches instructions that are likely to be
executed


Fetches instructions that haven’t been
prefetched


Decodes instructions into µops


Generates µops for complex instructions or special purpose code


Predicts branches









Prefetch



Three methods of prefetching:


Instructions only: hardware


Data only: software


Code or data: hardware









Decoder


Single decoder that can operate at a
maximum of 1 instruction per cycle


Receives instructions from L2 cache 64 bits
at a time


Some complex instructions must enlist the
help of the microcode ROM









Trace Cache


Primary instruction cache in NetBurst
architecture


Stores decoded µops


~12K µop capacity


On a Trace Cache miss, instructions are
fetched and decoded from the L2 cache









What is a Trace Cache?

Program:

I1 …
I2 br r2, L1
I3 …
I4 …
I5 …
L1: I6
I7 …


Traditional instruction cache line: I1 I2 I3 I4


Trace cache line: I1 I2 I6 I7
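
The difference between the two cache organizations can be sketched in a few lines of Python. This is a toy model, not Intel's design. Instructions are labelled strings, a "line" is assumed to hold four entries, and the branch at I2 is assumed to be predicted taken to L1 (which is I6).

```python
# Toy program from the slide: I2 is a branch to label L1, which is I6.
PROGRAM = ["I1", "I2", "I3", "I4", "I5", "I6", "I7"]
BRANCH_TARGETS = {"I2": "I6"}   # I2: br r2, L1
LINE_SIZE = 4                   # assumed line capacity, for illustration only


def traditional_line(start):
    """A conventional I-cache line: consecutive instructions in address order."""
    i = PROGRAM.index(start)
    return PROGRAM[i:i + LINE_SIZE]


def trace_line(start, predicted_taken):
    """A trace: instructions in predicted execution order, following taken branches."""
    line, i = [], PROGRAM.index(start)
    while len(line) < LINE_SIZE and i < len(PROGRAM):
        instr = PROGRAM[i]
        line.append(instr)
        if instr in BRANCH_TARGETS and instr in predicted_taken:
            i = PROGRAM.index(BRANCH_TARGETS[instr])  # continue at the branch target
        else:
            i += 1
    return line


print(traditional_line("I1"))      # ['I1', 'I2', 'I3', 'I4']
print(trace_line("I1", {"I2"}))    # ['I1', 'I2', 'I6', 'I7']
```

In the traditional line, I3 and I4 occupy space even though the taken branch skips them. The trace holds only instructions on the predicted path.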









Pentium 4 Trace Cache


Has its own branch predictor that directs
where instruction fetching needs to go next
in the Trace Cache


Removes


Decoding costs on frequently decoded
instructions


Extra latency to decode instructions upon
branch mispredictions









Microcode ROM


Used for complex IA-32 instructions (> 4 µops), such as string move, and for fault and interrupt handling


When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the µops


After the microcode ROM finishes, the front end of the machine resumes fetching µops from the Trace Cache










Branch Prediction


Predicts ALL near branches


Includes conditional branches, unconditional
calls and returns, and indirect branches



Does not predict far transfers


Includes far calls, irets, and software interrupts









Branch Prediction


Dynamically predict the direction and target
of branches based on PC using BTB


If no dynamic prediction is available,
statically predict


Taken for backwards looping branches


Not taken for forward branches


Traces are built across predicted branches to
avoid branch penalties
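
The fallback rule above is simple enough to write down directly. In this sketch the BTB is modelled as a plain dictionary from branch address to a taken/not-taken prediction. That is an assumption for illustration, and both function names are hypothetical.

```python
def static_predict_taken(branch_pc, target_pc):
    """Static rule: backward branches (loops) are taken, forward branches are not."""
    return target_pc < branch_pc


def predict_taken(branch_pc, target_pc, btb):
    """Use the dynamic prediction when the BTB has one, else fall back to the static rule."""
    if branch_pc in btb:
        return btb[branch_pc]
    return static_predict_taken(branch_pc, target_pc)


# A loop-closing branch at 0x200 jumping back to 0x100, with no BTB entry yet.
print(predict_taken(0x200, 0x100, btb={}))             # True
# A forward branch with no history is predicted not taken.
print(predict_taken(0x100, 0x200, btb={}))             # False
# Once the BTB has an entry, it overrides the static rule.
print(predict_taken(0x100, 0x200, btb={0x100: True}))  # True
```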










Branch Target Buffer


Uses a branch history table and a branch
target buffer to predict



Updating occurs when branch is retired









Return Address Stack


16 entries


Predicts return addresses for procedure calls


Allows branches and their targets to coexist
in a single cache line


Increases parallelism since decode bandwidth is
not wasted
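
A return address stack can be sketched as a small fixed-depth stack: calls push the address of the next instruction, and returns pop the prediction. The 16-entry depth comes from the slide. The overflow behaviour (silently dropping the oldest entry) and the class name are assumptions made for this illustration.

```python
from collections import deque


class ReturnAddressStack:
    """Fixed-depth stack of predicted return addresses."""

    def __init__(self, entries=16):
        # deque(maxlen=...) discards the oldest entry on overflow (assumed policy).
        self._stack = deque(maxlen=entries)

    def on_call(self, return_address):
        """A call pushes the address execution should resume at."""
        self._stack.append(return_address)

    def predict_return(self):
        """A return pops the most recent entry, or None if nothing is recorded."""
        return self._stack.pop() if self._stack else None


ras = ReturnAddressStack()
ras.on_call(0x1004)           # outer call
ras.on_call(0x2008)           # nested call
print(hex(ras.predict_return()))   # 0x2008
print(hex(ras.predict_return()))   # 0x1004
```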









Branch Hints


P4 permits software to provide hints to the
branch prediction and trace formation
hardware to enhance performance


Take the form of prefixes to conditional branch instructions


Used only at trace build time and have no
effect on already built traces









Out-of-Order Execution


Designed to optimize performance by handling the most common operations in the most common context as fast as possible


126 µops can be in flight at once


Up to 48 loads / 24 stores









Issue


Instructions are fetched and decoded by the translation engine


Translation engine builds instructions into sequences of µops


Stores µops to the trace cache


Trace cache can issue 3 µops per cycle









Execution


Can dispatch up to 6 µops per cycle


Exceeds trace cache and retirement µop bandwidth


Allows for greater flexibility in issuing µops to different execution units










Execution Units









Double-pumped ALUs


ALU executes an operation on both rising and falling edges of the clock cycle









Retirement


Can retire 3 µops per cycle


Precise exceptions


Reorder buffer to organize completed µops


Also keeps track of branches and sends
updated branch information to the BTB

Execution Pipeline









Execution Pipeline

Data Stream of Pentium 4 Processor

Register Renaming









Register Renaming (2)


8-entry architectural register file


128-entry physical register file


2 RATs (register alias tables)


Frontend RAT and Retirement RAT


Data does not need to be copied between
register files when the instruction retires
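
The two-RAT scheme can be sketched as two small tables of pointers into one physical register file. The register counts (8 architectural, 128 physical) come from the slide. Everything else is a simplification assumed for illustration: the class and method names are made up, there is no misprediction recovery, and free registers are handed out in simple FIFO order.

```python
class Renamer:
    """Toy register renamer with a frontend RAT and a retirement RAT."""

    def __init__(self, arch_regs=8, phys_regs=128):
        # Both RATs start with architectural register i mapped to physical register i.
        self.frontend_rat = list(range(arch_regs))
        self.retirement_rat = list(range(arch_regs))
        self.free_list = list(range(arch_regs, phys_regs))

    def rename(self, dest, sources):
        """Rename one µop: look up the sources, then give dest a fresh physical register."""
        phys_sources = [self.frontend_rat[s] for s in sources]
        new_phys = self.free_list.pop(0)
        self.frontend_rat[dest] = new_phys
        return new_phys, phys_sources

    def retire(self, dest, phys):
        """Retire a µop: only the retirement RAT pointer changes, no data is copied."""
        self.free_list.append(self.retirement_rat[dest])  # old mapping is now reusable
        self.retirement_rat[dest] = phys


r = Renamer()
p1, _ = r.rename(dest=0, sources=[1, 2])   # r0 = r1 op r2
_, srcs = r.rename(dest=1, sources=[0])    # r1 = op r0, reads the renamed r0
print(p1, srcs)                            # 8 [8]
r.retire(dest=0, phys=p1)
print(r.retirement_rat[0])                 # 8
```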









On-chip Caches


L1 instruction cache (Trace Cache)


L1 data cache


L2 unified cache


Parameters:







The caches are not inclusive, and a pseudo-LRU replacement algorithm is used
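
The slides do not say which pseudo-LRU scheme is used. As an illustration only, the sketch below shows tree pseudo-LRU for a single 4-way set, a common variant that tracks recency with 3 bits rather than a full ordering. The class name and the 4-way size are assumptions.

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set, using 3 bits of state."""

    def __init__(self):
        # bits[0]: half to evict from next (0 = ways 0-1, 1 = ways 2-3)
        # bits[1]: which of ways 0-1 to evict next
        # bits[2]: which of ways 2-3 to evict next
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access by pointing every bit on the path away from `way`."""
        if way < 2:
            self.bits[0] = 1              # next victim comes from the other half
            self.bits[1] = 1 - way        # within this half, the other way is older
        else:
            self.bits[0] = 0
            self.bits[2] = 1 - (way - 2)

    def victim(self):
        """Follow the bits to the way that should be replaced next."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]


plru = TreePLRU4()
for way in (0, 2, 1, 3):
    plru.touch(way)
print(plru.victim())   # 0
```

For this access order the result matches true LRU. In general tree pseudo-LRU only approximates it, which is the trade-off for the smaller state.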












L1 Instruction Cache


Execution Trace Cache stores decoded
instructions


Remove decoder latency from main
execution loops


Integrate path of program execution flow
into a single line










L1 Data Cache


Nonblocking


Support up to 4 outstanding load misses


Load latency


2-clock for integer


6-clock for floating-point


1 load and 1 store per clock


Speculative load


Assume the access will hit the cache


“Replay” the dependent instructions when a miss happens









L2 Cache


Load latency


Net load access latency of 7 cycles


Nonblocking


Bandwidth


One load and one store in one cycle


A new cache operation can begin every 2 cycles


256-bit wide bus between L1 and L2


48 Gbytes per second @ 1.5 GHz
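
The 48 Gbytes/s figure follows from the bus width and the clock rate, assuming one 256-bit transfer per core clock at 1.5 GHz. The per-clock assumption is not stated on the slide; it is used here because it reproduces the slide's number. A quick check in Python:

```python
BUS_WIDTH_BITS = 256
CLOCK_HZ = 1_500_000_000   # 1.5 GHz core clock

bytes_per_transfer = BUS_WIDTH_BITS // 8               # 32 bytes
peak_bytes_per_second = bytes_per_transfer * CLOCK_HZ  # one transfer per clock (assumed)

print(peak_bytes_per_second / 1e9)   # 48.0
```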










Data Prefetcher in L2 Cache


Hardware prefetcher monitors the reference patterns


Brings in cache lines automatically


Attempts to stay 256 bytes ahead of the current data access location


Prefetches for up to 8 simultaneous independent streams











Store and Load


Out of order store and load operations


Stores are always in program order


48 loads and 24 stores can be in flight


Store buffers and load buffers are allocated
at the allocation stage


Total 24 store buffers and 48 load buffers














Store


Store operations are divided into two parts:


Store data


Store address


Store data is dispatched to the fast ALU,
which operates twice per cycle


Store address is dispatched to the store AGU, once per cycle









Store-to-Load Forwarding


Forward data from pending store buffer to
dependent load


Load stalls still happen when the bytes of
the load operation are not exactly the same
as the bytes in the pending store buffer
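
The forwarding condition described above can be modelled as a comparison of byte ranges. This sketch follows the slide's wording literally (forward only on an exact match, stall on any other overlap). Real hardware rules are more detailed, and the function name and the three outcome strings are made up for illustration.

```python
def load_outcome(store_addr, store_size, load_addr, load_size):
    """Classify a load against one pending store, using the slide's exact-match rule."""
    store_bytes = range(store_addr, store_addr + store_size)
    load_bytes = range(load_addr, load_addr + load_size)
    if load_bytes == store_bytes:
        return "forward"          # same bytes: data comes straight from the store buffer
    if load_bytes.start < store_bytes.stop and store_bytes.start < load_bytes.stop:
        return "stall"            # partial overlap: load must wait for the store
    return "no dependence"        # disjoint bytes: load proceeds normally


print(load_outcome(0x100, 4, 0x100, 4))   # forward
print(load_outcome(0x100, 4, 0x102, 4))   # stall
print(load_outcome(0x100, 4, 0x104, 4))   # no dependence
```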









System Bus

Delivers data at 3.2 Gbytes/s


64-bit wide bus


Four data phases per clock cycle (quad pumped)


100 MHz clocked system bus
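
The 3.2 Gbytes/s figure is the product of the three numbers on this slide. A short Python check, where `peak_bus_bandwidth` is a hypothetical helper name:

```python
def peak_bus_bandwidth(width_bits, clock_hz, transfers_per_clock):
    """Peak bytes per second for a bus moving `width_bits` per data phase."""
    return (width_bits // 8) * clock_hz * transfers_per_clock


# 64-bit bus, 100 MHz clock, quad pumped (4 data phases per clock)
p4_bus = peak_bus_bandwidth(64, 100_000_000, 4)
print(p4_bus / 1e9)   # 3.2
```

Without quad pumping, the same bus would peak at 0.8 Gbytes/s.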


Conclusion

Reduced Cache Size

VS

Increased Bandwidth and Lower Latency

What Went Wrong









No L3 cache


Original plans called for a 1M cache


Intel’s idea was to strap a separate memory
chip, perhaps an SDRAM, on the back of
the processor to act as the L3


But that added another 100 pads to the
processor, and would have also forced Intel
to devise an expensive cartridge package to
contain the processor and cache memory










Small L1 Cache


Only 8k!


Doubled size of L2 cache to compensate


Compare with


AMD Athlon: 128k


Alpha 21264: 64k


PIII: 32k


Itanium: 16k










Loses consistently to AMD


In terms of performance, the Pentium 4 is as
slow or slower than existing Pentium III and
AMD Athlon processors


In terms of price, an entry level Pentium 4
sells for about double the cost of a similar
Pentium III or AMD Athlon based system


1.5GHz clock rate is more hype than
substance









Northwood










Northwood


1/2002


Differences from Willamette


Socket 478


21 stage pipeline


512 KB L2 cache


2.0 GHz, 2.2 GHz clock frequency


0.13 µm fabrication process (130 nm)

» 55 million transistors









Prescott










Prescott


2/2004


Differences


31 stage pipeline!


1MB L2 cache


3.8 GHz clock frequency


0.09 µm fabrication process (90 nm)


SSE3