Overview of MacSim - CompArch


Dec 14, 2013 (3 years and 5 months ago)


HPArch Research Group

| Part 2. Overview of MacSim
    Introduction
    For black box approach users

| Part 3: Details of MacSim
    For computer architecture researchers

| Part 4.
    MacSim-SST case studies
    Ocelot-MacSim case studies
    Research using Ocelot
    Research using MacSim

MacSim Tutorial (In ISCA-39, 2012)

| Heterogeneous architecture simulator (x86 + PTX)

| Developed at Georgia Tech

| Trace-driven simulator
    Internal RISC-style micro-op generation module
    x86 traces using Pin, PTX traces using GPUOcelot

| Cycle-level simulator
    Cores, caches, and memory systems are modeled

| Supports various simulations: single/multi-threaded application, multi-program, heterogeneous (CPU+GPU)



| Flexible design to support various platforms

| Integration with a parallel simulator (SST) to support high-performance computing systems

| From mobile to Exascale computing systems



[Figure: MacSim tool flow. Inputs: x86 binaries, CUDA code (.cu), OpenGL code. Tools: PIN trace generator; NVCC (compiler) producing PTX code for the GPUOcelot trace generator (Prof. Yalamanchili, Georgia Tech); PIN (API generator) and Attila (OpenGL emulator). The trace generators pass instruction and thread information to the heterogeneous architecture timing & power simulator. Part of the flow is labeled "Ongoing Work".]


| Getting MacSim
    Stable version: Google Code project
        http://macsim.googlecode.com/files/macsim-1.0.tar.gz
    Latest code from the SVN repository

| Directions are explained in
    http://code.google.com/p/macsim/wiki/GettingMacsim

| How to build
    http://code.google.com/p/macsim/wiki/BuildingMacsim
    Chapter 2 of the manual provides build instructions
    README file in the simulator directory


| MacSim package
    IRIS (NoC simulator from Prof. Yalamanchili's group) is included
    CPU trace generator
        Download PIN separately. The trace generator tool is in the MacSim package.
    GPU trace generator
        Download Ocelot separately. The trace generator is in the Ocelot package.

| MacSim-SST
    SST needs to be downloaded separately

| Energy Introspector (from Prof. Yalamanchili's group)
    EI is a power model based on McPAT and HotSpot.
    Because of a McPAT license issue, EI cannot currently be distributed, but we will resolve this issue soon.


| Once the build process succeeds, the binary is created in
    macsim-top/trunk/bin/macsim

| [Screenshot of a simulation]

| Now, how do we configure simulation models?


| Knob variables need to be set up (3 ways)
    Default value in the source code
    params.in
    Command line

[Figure: a chip built from cores of three types (Core type 1, Core type 2, Core type 3) and Memory]


num_sim_cores 4 // 4 cores

num_sim_small_cores 0

num_sim_medium_cores 0

num_sim_large_cores 4

max_threads_per_large_core 2

large_core_type x86

repeat_trace 1

| Configuration
    4 cores
    2-way SMT

The same knob set in each of the three ways:
    .def:         param<NUM_SIM_CORES, num_sim_cores, int, 4>
    params.in:    the listing above (num_sim_cores 4)
    commandline:  ./macsim num_sim_cores=4
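
The three ways of setting a knob can be pictured as layers. The sketch below is a minimal illustration in Python, not MacSim code. It assumes that a command-line value overrides params.in, which in turn overrides the compiled-in default; the slides list the three sources but do not state the precedence. It also assumes the `name value // comment` and `name=value` syntaxes shown above. All function names are hypothetical.

```python
def parse_params_text(text):
    """Parse 'name value // comment' lines, as in the params.in listing."""
    knobs = {}
    for line in text.splitlines():
        line = line.split("//", 1)[0].strip()  # drop a trailing comment
        if not line or line.startswith("#"):
            continue
        name, value = line.split(None, 1)
        knobs[name] = value.strip()
    return knobs


def parse_command_line(args):
    """Parse 'name=value' arguments."""
    return dict(arg.split("=", 1) for arg in args if "=" in arg)


def resolve_knobs(defaults, params_text, args):
    """Apply the layers in order; later layers win (assumed precedence)."""
    knobs = dict(defaults)
    knobs.update(parse_params_text(params_text))
    knobs.update(parse_command_line(args))
    return knobs


# Default taken from the .def example; the other values are illustrative.
defaults = {"num_sim_cores": "4", "repeat_trace": "0"}
params_in = "num_sim_cores 4 // 4 cores\nlarge_core_type x86\nrepeat_trace 1\n"
resolved = resolve_knobs(defaults, params_in, ["num_sim_cores=8"])
print(resolved)
```

Values are kept as strings here; a real knob table would also carry the type given in the `param<...>` declaration.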


| To configure a CPU+GPU architecture
    Set up the number of cores and their types accordingly

    num_sim_cores 8          // 4 CPUs + 4 GPUs
    num_sim_small_cores 4    // 4 GPUs
    num_sim_medium_cores 0
    num_sim_large_cores 4    // 4 CPUs
    core_type ptx            // specify small cores
    large_core_type x86
    cpu_frequency 3
    gpu_frequency 1.5
    repeat_trace 1

| Usually, we use small cores for GPUs and large cores for CPUs

| A GPU core internally has multiple processing elements (N-wide SIMD)
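
The comments in the listing imply that `num_sim_cores` is the sum of the small, medium, and large core counts. A small sanity check along those lines might look like the following Python sketch. It is not part of MacSim; the function name is hypothetical, and `medium_core_type` is assumed by analogy with `large_core_type` because it does not appear in the slides.

```python
def cores_by_isa(knobs):
    """Check the core counts add up and return how many cores run each ISA."""
    sizes = ("small", "medium", "large")
    counts = {s: int(knobs[f"num_sim_{s}_cores"]) for s in sizes}
    total = int(knobs["num_sim_cores"])
    if sum(counts.values()) != total:
        raise ValueError(f"core counts {counts} do not add up to {total}")

    # Knob that names the ISA of each core size (medium one is an assumption).
    isa_knob = {
        "small": "core_type",
        "medium": "medium_core_type",
        "large": "large_core_type",
    }
    result = {}
    for size in sizes:
        if counts[size] == 0:
            continue  # unused core sizes need no ISA knob
        isa = knobs[isa_knob[size]]
        result[isa] = result.get(isa, 0) + counts[size]
    return result


hetero = {
    "num_sim_cores": "8",
    "num_sim_small_cores": "4",
    "num_sim_medium_cores": "0",
    "num_sim_large_cores": "4",
    "core_type": "ptx",
    "large_core_type": "x86",
}
print(cores_by_isa(hetero))
```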



| Multiple applications
    Set up from trace_file_list

[Figure: four applications running together: MCF, GCC, MM (thread 1 and thread 2), Blackscholes]

    4                                  <-- number of applications
    /sample/mcf/trace.txt              <- appl 1
    /sample/gcc/trace.txt              <- appl 2
    /sample/mm/trace.txt               <- appl 3
    /sample/blackscholes/trace.txt     <- appl 4
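
The layout of `trace_file_list` shown above (an application count followed by one trace path per application) is simple enough to read with a few lines of Python. This is a minimal sketch with a hypothetical function name. It assumes the `<-` annotations are slide callouts rather than file content, and that blank lines can be ignored.

```python
def parse_trace_file_list(text):
    """Return the trace paths listed after the leading application count."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("trace_file_list is empty")
    count = int(lines[0])
    paths = lines[1:]
    if len(paths) != count:
        raise ValueError(f"expected {count} traces, found {len(paths)}")
    return paths


example = """4
/sample/mcf/trace.txt
/sample/gcc/trace.txt
/sample/mm/trace.txt
/sample/blackscholes/trace.txt
"""
traces = parse_trace_file_list(example)
print(traces)
```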


| Execution time differs for each application.

| An option is provided to repeat short traces until the longest trace ends.

| Is this the right way to simulate?

[Figure: timeline of three programs. Program 1 runs mcf once; Program 2 repeats gcc four times; Program 3 repeats bfs five times.]
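
To make the repeat behavior concrete, the Python sketch below estimates how many times each trace is started when short traces repeat until the longest one ends. It is an idealization: it assumes every run of a trace takes the same number of cycles, which real contention between programs would break. The cycle counts are made up so that the result matches the counts in the figure.

```python
def repeat_counts(run_cycles):
    """Number of times each trace starts before the longest trace finishes."""
    longest = max(run_cycles.values())
    # Ceiling division: a partially completed final repetition still counts.
    return {name: -(-longest // cycles) for name, cycles in run_cycles.items()}


# Hypothetical single-run lengths in cycles.
counts = repeat_counts({"mcf": 1000, "gcc": 250, "bfs": 200})
print(counts)
```

The last repetition of a short trace is generally cut off when the longest trace ends, which is one reason to ask whether this is the right way to simulate.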


| Sample configuration files in
    macsim-top/trunk/params

    File name              Contents
    params_8800gt          GeForce 8800 GT (G80)
    params_gtx280          GeForce GTX 280 (GT200)
    params_gtx465          GeForce GTX 465 (Fermi)
    params_x86             Intel's Sandy Bridge (CPU part only)
    params_hetero_4c_4g    Intel's Sandy Bridge (CPU + GPU)



| Thread spawn is modeled.

| Locks are not modeled.

[Figure: a main thread on one core spawns threads onto other cores, which later meet at a barrier; likewise, a host thread invokes a GPU kernel that runs on other cores.]


| This will be covered in Part III.

| The trace generator generates thread execution information automatically.

| Users do not need to worry about this.


| MacSim has 5 different clock domains
    CPU
    GPU
    Last-level cache
    Interconnection network
    DRAM

# Clock

clock_cpu 3

clock_gpu 1.5

clock_l3 1

clock_noc 1

clock_mc 1.6
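
One generic way to drive several clock domains from a single simulation loop is to pick a master clock that every domain frequency divides evenly, then tick each domain every N master cycles. The Python sketch below applies that idea to the values listed above. It is only an illustration of the technique: the slides do not say how MacSim implements its clock domains, and the function name is hypothetical. The result does not depend on the frequency unit, as long as all five knobs share it.

```python
from fractions import Fraction
from math import gcd, lcm  # multi-argument gcd/lcm need Python 3.9+


def master_clock_dividers(clocks):
    """Return (master frequency, master cycles per tick of each domain)."""
    freqs = {name: Fraction(str(value)) for name, value in clocks.items()}
    # LCM of reduced fractions: lcm of numerators over gcd of denominators.
    numerator = lcm(*(f.numerator for f in freqs.values()))
    denominator = gcd(*(f.denominator for f in freqs.values()))
    master = Fraction(numerator, denominator)
    dividers = {name: int(master / f) for name, f in freqs.items()}
    return master, dividers


clocks = {
    "clock_cpu": 3,
    "clock_gpu": 1.5,
    "clock_l3": 1,
    "clock_noc": 1,
    "clock_mc": 1.6,
}
master, dividers = master_clock_dividers(clocks)
print(master, dividers)
```

With these values the master clock runs at 24 units, so the CPU ticks every 8 master cycles, the GPU every 16, the L3 and NoC every 24, and the memory controller every 15.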

| x86 instructions are mapped to uops

| PTX instructions are mapped to uops (almost 1-1 mapping)

| Pipeline stages

[Figure: Pin/XED supplies macro instructions with decoded information from Pin's XED; the MacSim trace decoder turns them into uops for the timing/power simulator.]


[Figure: pipeline stages Front-end, Decode, Rename, Schedule, Execution, Retire, with Memory]

| Front-end, DEC/Rename: just a simple FIFO queue.
    fetch_latency 5      // front-end depth
    alloc_latency 5      // decode/allocation depth
    width                // pipeline width (same width for all the pipeline)
    bp_dir_mech gshare
    bp_hist_length 14    // branch history length

| Rename: create RAW dependency (map structure)
    rob_size 96          // ROB size

| Scheduler: in-order scheduler, ooo scheduler
    schedule io, ooo     // instruction scheduling policy
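
For readers unfamiliar with the `gshare` setting, the Python sketch below shows the textbook form of that predictor with a 14-bit history, matching `bp_hist_length 14`. It is a generic illustration, not MacSim's implementation: the 2-bit counters, their initial value, and the exact index hash are assumptions.

```python
class Gshare:
    """Textbook gshare: index a counter table with PC XOR global history."""

    def __init__(self, hist_length=14):
        self.mask = (1 << hist_length) - 1
        self.history = 0
        # 2-bit saturating counters, initialized to weakly not-taken.
        self.counters = [1] * (1 << hist_length)

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Predict taken when the counter is in one of its two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter used for this branch, then shift the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = Gshare(hist_length=14)
first_guess = predictor.predict(0x400123)
for _ in range(40):  # a loop branch that is always taken
    predictor.update(0x400123, taken=True)
trained_guess = predictor.predict(0x400123)
print(first_guess, trained_guess)
```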




| Execution latency
    Fixed uop latency (macsim-top/def/uop_latency_[x86,ptx].def)
    Variable latency: cache/memory latency

| Instruction scheduling rates
    isched_rate 4    // # of integer inst. that can be executed per cycle
    msched_rate 2    // # of memory inst. that can be executed per cycle
    fsched_rate 2    // # of FP inst. that can be executed per cycle
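
The scheduling rates act as per-cycle caps for each instruction class. The Python sketch below shows that idea in isolation, assuming ready uops are considered oldest first. It ignores pipeline width, operand dependencies, and the io/ooo policy, and its names are hypothetical rather than MacSim's.

```python
RATES = {"int": 4, "mem": 2, "fp": 2}  # isched_rate, msched_rate, fsched_rate


def select_for_cycle(ready_uops, rates=RATES):
    """Pick uops to execute this cycle without exceeding any class cap.

    ready_uops is a list of (uop_id, kind) pairs in age order.
    """
    remaining = dict(rates)
    issued = []
    for uop_id, kind in ready_uops:
        if remaining.get(kind, 0) > 0:
            remaining[kind] -= 1
            issued.append(uop_id)
    return issued


ready = [
    (0, "mem"), (1, "mem"), (2, "mem"),
    (3, "fp"),
    (4, "int"), (5, "int"), (6, "int"), (7, "int"), (8, "int"),
]
issued = select_for_cycle(ready)
print(issued)
```

Here the third memory uop and the fifth integer uop wait for a later cycle because their class caps are already used up.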



| Cache configuration
    # of sets, associativity, line size, # of banks, etc. (see manual)

| Cache size = # of sets x assoc x line_size x # of tiles

| DRAM configuration
    Frequency, bus width, column/activate/precharge latency
    # of memory controllers, # of banks, # of channels, row buffer size, DRAM scheduling policy
    Simple but fast DRAM model that models key features

| MacSim is connected with DRAM-SIM2
    Users can use DRAM-SIM2 for a detailed DRAM timing simulation


L3 only
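
As a worked instance of the cache size formula above, the Python sketch below multiplies out one hypothetical configuration. The parameter values are invented for illustration and do not come from any of the sample params files.

```python
def cache_size_bytes(num_sets, assoc, line_size, num_tiles=1):
    """Cache size = # of sets x assoc x line_size x # of tiles."""
    return num_sets * assoc * line_size * num_tiles


# Hypothetical cache: 1024 sets, 8-way, 64-byte lines, 4 tiles.
size = cache_size_bytes(num_sets=1024, assoc=8, line_size=64, num_tiles=4)
print(size // 1024, "KB")
```

Each tile here holds 1024 x 8 x 64 = 512 KB, so four tiles give 2 MB in total.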


| Statistics
    Simulation outputs: *.stat.out
    The macsim/trunk/def file has the stat definitions (more details in Part III)

| Important stats
    IPC = INST_COUNT_TOT / CYC_COUNT_TOT
    CPI = CYC_COUNT_TOT / INST_COUNT_TOT

| Per-core stats
    IPC for core 0: INST_COUNT_CORE_0 / CYC_COUNT_CORE_0

| Multiple-application stats
    *.stat.out.<application_id>, e.g., memory.stat.out.0, bp.stat.out.1
    Each stat file contains stats only for the first run (repeated simulations are ignored)
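
The IPC and CPI definitions above reduce to a division over two counters. The Python sketch below computes them from a dictionary of counter values. The layout of the *.stat.out files is not shown in these slides, so reading the counters from disk is left out, and the numbers are made up.

```python
def ipc(stats, core=None):
    """Instructions per cycle, for the whole run or for one core."""
    suffix = "TOT" if core is None else f"CORE_{core}"
    return stats[f"INST_COUNT_{suffix}"] / stats[f"CYC_COUNT_{suffix}"]


def cpi(stats, core=None):
    """Cycles per instruction, for the whole run or for one core."""
    suffix = "TOT" if core is None else f"CORE_{core}"
    return stats[f"CYC_COUNT_{suffix}"] / stats[f"INST_COUNT_{suffix}"]


# Hypothetical counter values.
stats = {
    "INST_COUNT_TOT": 3000,
    "CYC_COUNT_TOT": 2000,
    "INST_COUNT_CORE_0": 1000,
    "CYC_COUNT_CORE_0": 2000,
}
print(ipc(stats), cpi(stats), ipc(stats, core=0))
```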





| Memory systems
    L[1-3]_HIT_CPU / L[1-3]_HIT_GPU
    L[1-3]_MISS_CPU / L[1-3]_MISS_GPU

| Front-end
    BP_ON_PATH_[CORRECT/MISPREDICT/MISFETCH]

| Instruction profiling
    Based on instruction category: inst.stat.out

| More details regarding statistics are in the documentation

| We will provide a simple script file to fetch stat data


| Multi-threading support is already there.

| Different ISAs: using micro-ops

| Warp?
    One warp is treated as one thread. Each thread generates its own trace file. Active bit information is included.
    Trace format will be explained in Part III.

| Thread and block scheduling
    Block-level barrier, block-level scheduling/retirement
    More details will be explained in Part III.

| Different memory structures
    Memory systems













| Include the memory access by each thread of a warp as a separate instruction in the trace

| In the trace, mark these accesses as coming from the same warp

[Figure: a SIMD load instruction with per-thread addresses Addr 0 to Addr 7. Coalesced case: a memory instruction with 128B size appears in the trace file as a single TraceInst. Uncoalesced case: a 64B request and two 32B requests appear in the trace file as TraceMem1, TraceMem2, and TraceMem3 between TraceInst_begin (start of memory instruction marker) and TraceInst_end (end of memory instruction marker).]
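
The bracketing scheme described above can be sketched as a small function that turns one warp-level memory instruction into a run of trace records. This Python sketch is illustrative only: the tuple layout is hypothetical, the real trace format is covered in Part III, and the sketch emits one record per active thread as the bullets describe, without modeling coalescing.

```python
def warp_memory_records(addresses, active_mask):
    """Emit begin marker, one record per active thread, then end marker.

    addresses[i] is the address accessed by thread i of the warp;
    bit i of active_mask says whether thread i is active.
    """
    records = [("TraceInst_begin",)]
    for lane, addr in enumerate(addresses):
        if (active_mask >> lane) & 1:
            records.append(("TraceMem", lane, addr))
    records.append(("TraceInst_end",))
    return records


# Eight threads with a stride of 256 bytes; only the lower four are active.
addrs = [0x1000 + 256 * lane for lane in range(8)]
records = warp_memory_records(addrs, active_mask=0b00001111)
for record in records:
    print(record)
```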


| During simulation, form a “parent” uop that holds all the individual memory accesses as its child uops

| The parent uop flows through the pipeline; only in the memory stage are the individual child uops issued to the memory
    The parent uop is ready for retirement when all children have completed

[Figure: the trace file entries TraceInst_begin, TraceMem1, TraceMem2, TraceMem3, ..., TraceMemN, TraceInst_end (bracketed by the start and end of memory instruction markers) become one MacSim parent uop (Mem_type: ld, #children: 8) whose child uops hold addr0, addr1, ..., addrN.]
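
The parent/child relationship can be modeled with two small classes. The Python sketch below follows the rule stated above, that the parent is ready for retirement once every child has completed. Class and method names are hypothetical and the memory system is reduced to explicit `complete` calls, so this is a sketch of the bookkeeping rather than MacSim's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class ChildUop:
    addr: int
    done: bool = False


@dataclass
class ParentUop:
    mem_type: str
    children: list = field(default_factory=list)

    def issue_to_memory(self):
        """Memory stage: return the address of every child access."""
        return [child.addr for child in self.children]

    def complete(self, addr):
        """Mark one outstanding child with this address as finished."""
        for child in self.children:
            if child.addr == addr and not child.done:
                child.done = True
                return
        raise KeyError(f"no outstanding child for address {addr:#x}")

    def ready_to_retire(self):
        return all(child.done for child in self.children)


def build_parent(mem_type, addrs):
    """Group the per-thread accesses of one instruction under a parent."""
    return ParentUop(mem_type, [ChildUop(addr) for addr in addrs])


parent = build_parent("ld", [0x1000 + 256 * lane for lane in range(8)])
pending = parent.issue_to_memory()
for addr in pending[:-1]:
    parent.complete(addr)
ready_before_last = parent.ready_to_retire()
parent.complete(pending[-1])
ready_after_last = parent.ready_to_retire()
print(len(parent.children), ready_before_last, ready_after_last)
```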


| IRIS (from Prof. Yalamanchili's group)
    Flit-level interconnection network simulator
    Virtual channel, credit-based flow control, deadlock avoidance, ...
    Part IV will cover more.

| MacSim-SST
    Parallel simulation

[Figure: nodes connected through routers in a topology (Ring, Mesh, Torus, ...)]
