Powerpoint - Irisa


Nov 4, 2013 (3 years and 9 months ago)


CAPS team

Compilation et Architecture pour les Processeurs Superscalaires et Spécialisés


Compiler and Architecture for superscalar and embedded processors


CAPS project

2

CAPS members


2 INRIA researchers: A. Seznec, P. Michaud

2 professors: F. Bodin, J. Lenfant

11 Ph.D. students: R. Amicel, R. Dolbeau, A. Monsifrot, L. Bertaux, K. Heydemann, L. Morin, G. Pokam, A. Djabelkhir, A. Fraboulet, O. Rochecouste, E. Toullec

3 engineers: S. Bihan, P. Villalon, J. Simonnet



CAPS themes



Two interacting activities



High performance microprocessor architecture



Performance oriented compilation



CAPS Grail


Performance at the best cost



Progress in computer science and applications is driven by performance



CAPS path to the Grail


Defining the tradeoffs between:


what should be done through hardware


what can be done by the compiler


for maximum performance


or for minimum cost


or for minimum size, power, …



Need for high-performance processors


Current applications


general purpose: scientific, multimedia, databases, …

embedded systems: cell phones, automotive, set-top boxes, …


Future applications


don’t worry: users have a lot of imagination !



New software engineering techniques are CPU hungry:


reusability, generality


portability, extensibility (indirections, virtual machines)


safety (run-time verifications)


encryption/decryption



CAPS (ancient) background


“Ancient” background in hardware and software management of ILP


decoupled pipeline architectures


OPAC, a hardware matrix floating-point coprocessor


software pipelining for LIW



“Supercomputing” background


interleaved memories


Fortran-S



CAPS background in architecture


Solid knowledge in microprocessor architecture


technological watch on microprocessors


A. Seznec worked with the Alpha Development Group in 1999-2000



Research on cache architecture

Research on branch prediction mechanisms




CAPS background in compilers


Software optimizations for cache memories


Numerical algorithms on dense structures


Optimizing data layout



Many prototype environments for parallel compilers:


CT++ (with CEA): image processing C++ library for a SIMD architecture

Menhir: a parallel compiler for MatLab

IPF (with Thomson-LER): Fortran compiler for image processing on Maspar

Sage (with Indiana): infrastructure for source-level transformation



We build on



SALTO: System for Assembly-Language Transformations and Optimizations

retargetable assembly source-to-source preprocessor

Erven Rohou’s Ph.D.





TSF: scripting language for program transformation on top of ForeSys (Simulog)

Yann Mevel’s Ph.D.



Salto overview


Assembly source to source preprocessor


Fine grain machine description


Independent from compilers

[Figure: SALTO block diagram. Labels: transformation tool (C++), machine description, SALTO, assembly language (input), assembly language (output).]



Compiler activities


Code optimizations for embedded applications


infrastructures rather than compilers

optimizing compiler strategies rather than new code optimizations

Global constraints

performance / code size / low power (starting)

Focus on interactive tools rather than automatic



code tuning


case based reasoning


assembly code optimizations





Computer aided hand tuning


Automatic optimization has many shortcomings


rather provide the user with a testbed to hand-tune applications


Target applications


Fortran codes and embedded C applications


Our approach


case based reasoning


static code analysis and pattern matching


profiling


learning techniques


the user bears the final responsibility



CAHT

Prototype built on:

Foresys: Fortran interactive front-end (from Simulog)

TSF: scripting language for program transformation

Sage++: infrastructure for source-level transformation



Analysis and Tuning tool for Low Level Assembly and Source code (with Thomson Multimedia)

ATLLAS objectives:

Has the compiler done a good job?

Try to match source and optimized assembly at fine grain


Development/analysis environment:


Models for both source and assembly


Global and local analysis (WCET, …) at both levels


Interactive environment for code visualization and manual/automatic analysis and optimization

Built using Salto and Sage++:

Retargetable across compilers and architectures




ATLLAS - Analysis and Tuning tool for Low Level Assembly and Source code: tuning method

[Figure: tuning-loop flowchart. Labels: Source Code, compilation, Assembly Code, profiling, "Good?" (Yes: End), half-automatic or manual source optimisations, half-automatic or manual assembly optimisations, Atllas, post-processing, processing, support.]

Code matching analysis and evaluations

Graphic display of assembly and source code



Assembly Level Infrastructure for Software Enhancement (with STMicroelectronics)


ALISE


enhanced SALTO for code optimization:


better integration with code generation


interface with front-end

interface for profiling data

targets global optimization

based on component software optimization engines


Answer to a real need from industry:


A retargetable infrastructure



ALISE


Environment for:


global assembly code optimization


providing optimization alternatives



Support for new embedded processors


ISAs with ILP support (VLIW, EPIC)


Predicated instructions


Functional unit clusters, ..




ALISE

[Figure: ALISE architecture. Labels: architecture description, D to M, architecture model, text input, P to IR, intermediate representation, optimizations (Opt 1, Opt 2, ..., Opt n), IR to Ass (emit), optimized program, high-level API, interfaces, external infrastructure, user interface (G.U.I.), intermediate code.]



Preprocessor for media processors (MEDEA+ Mesa project)

Multimedia instructions are available on embedded and general-purpose processors, but:

no consensus on MMD instructions among manufacturers:

saturated arithmetic or not, different instructions, …

Multimedia instructions are not well handled by compilers:

yet performance depends heavily on them



Preprocessor for media processors: our approach

C source-to-source preprocessor

user-oriented idiom recognition:


easy to retarget


target dedicated recognition



exploiting loop parallelism


vectorization techniques


multiprocessor systems


available soon



Collaboration with STMicroelectronics



Iterative compilation


Embedded systems:


Compile time is not critical


Performance/code size/power are critical


One can often rely on profiling



Classical compiler: local optimizations


but constraints are GLOBAL



Proof of concept for code size (Rohou’s Ph.D.)

new Ph.D. beginning in September 2000
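
The search loop behind iterative compilation can be sketched as follows. This is a minimal illustration, not the CAPS implementation: the loop names, the profile numbers, the unroll factors, and the toy cost model in `evaluate` are all assumptions made for the example. In a real setting, `evaluate` would compile the code with the chosen options and profile the result.

```python
from itertools import product

# Hypothetical profile data per loop: (iterations, body_cycles, body_size_bytes).
LOOPS = {
    "fir": (10_000, 12, 40),
    "dct": (2_000, 30, 96),
}
UNROLL_CHOICES = (1, 2, 4)
LOOP_OVERHEAD_CYCLES = 3  # assumed cost of the loop branch and counter update


def evaluate(config):
    """Return (cycles, size) for one unroll configuration under the toy model."""
    cycles = size = 0
    for name, unroll in config.items():
        iters, body_cycles, body_size = LOOPS[name]
        # Unrolling by u pays the loop overhead once every u original iterations.
        cycles += iters * body_cycles + (iters // unroll) * LOOP_OVERHEAD_CYCLES
        size += body_size * unroll  # the loop body is replicated u times
    return cycles, size


def iterative_search(size_budget):
    """Try every configuration; keep the fastest one that fits the GLOBAL budget."""
    best = None
    names = list(LOOPS)
    for factors in product(UNROLL_CHOICES, repeat=len(names)):
        config = dict(zip(names, factors))
        cycles, size = evaluate(config)
        if size > size_budget:
            continue  # violates the global code-size constraint
        if best is None or cycles < best[0]:
            best = (cycles, size, config)
    return best
```

Calling `iterative_search(300)` explores all nine configurations and returns the fastest one whose total size stays within 300 bytes. The point of the sketch is that the size budget applies to the whole program, so the best unroll factor for one loop depends on what is spent on the others, which a purely local optimization cannot see.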



High performance instruction set simulation

Embedded processors:

parallel development of silicon, ISA, compiler and applications


Need for flexible instruction set simulation:


high performance


simulation of large codes


debugging


retargetable, to experiment with:



new ISA



various microarchitecture options


First results: up to 50x faster than an ad hoc simulator





ABSCISS: Assembly Based System for Compiled Instruction Set Simulation

[Figure: ABSCISS tool flow. Labels: C source, tmcc, TriMedia assembly, tmas, TriMedia binary, tmsim, ABSCISS, architecture description, C/C++ source, gcc, compiled simulator.]
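
The general idea of compiled instruction set simulation, as opposed to interpretive simulation, can be sketched on a toy example. Everything here is an assumption made for illustration: the four-instruction assembly, the register names, and the use of Python as the "host" language are hypothetical and unrelated to TriMedia or to how ABSCISS is actually built.

```python
# Toy three-address assembly (hypothetical, not TriMedia): (opcode, dest, a, b).
PROGRAM = [
    ("li", "r0", 6, None),      # r0 = 6
    ("li", "r1", 7, None),      # r1 = 7
    ("mul", "r2", "r0", "r1"),  # r2 = r0 * r1
    ("add", "r3", "r2", "r0"),  # r3 = r2 + r0
]
REGISTERS = ("r0", "r1", "r2", "r3")
BINARY_OPS = {"add": "+", "mul": "*"}


def interpret(program):
    """Interpretive simulation: decode every instruction each time it runs."""
    regs = dict.fromkeys(REGISTERS, 0)
    for op, dst, a, b in program:
        if op == "li":
            regs[dst] = a
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


def compile_simulator(program):
    """Compiled simulation: translate the program once into host code."""
    lines = ["def run():"]
    lines += [f"    {reg} = 0" for reg in REGISTERS]
    for op, dst, a, b in program:
        if op == "li":
            lines.append(f"    {dst} = {a}")
        elif op in BINARY_OPS:
            lines.append(f"    {dst} = {a} {BINARY_OPS[op]} {b}")
        else:
            raise ValueError(f"unknown opcode: {op}")
    lines.append("    return {" + ", ".join(f"'{r}': {r}" for r in REGISTERS) + "}")
    namespace = {}
    exec("\n".join(lines), namespace)  # decoding cost is paid here, only once
    return namespace["run"]
```

Both `interpret(PROGRAM)` and `compile_simulator(PROGRAM)()` produce the same register contents. The difference is where the decoding work happens: the interpreter dispatches on the opcode at every executed instruction, while the compiled simulator pays that cost once during translation and then runs plain host code.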



Enabling superscalar processor simulation

Complete O-O-O microprocessor simulation:

10000-100000x slower than real hardware

cannot simulate realistic applications, only slices

even fast mode emulation is slow (50-100x):

simulation generally limited to slices at the beginning of the application

representativeness?


Calvin2 + DICE:


combines direct execution with simulation


really fast mode: 1-2x slowdown

enables simulating slices distributed over the whole application
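
The slowdown figures on this slide can be combined into a rough estimate of why a 1-2x fast mode matters. The sketch below is a simple weighted average. The fraction of instructions simulated in detail (0.1% in the usage note) is an illustrative assumption, not a number from the slides, and the function name is hypothetical.

```python
def overall_slowdown(detailed_fraction, detailed_slowdown, fast_slowdown):
    """Average slowdown when a fraction of instructions is simulated in detail
    and the remainder is skipped over in a fast mode."""
    if not 0.0 <= detailed_fraction <= 1.0:
        raise ValueError("detailed_fraction must be between 0 and 1")
    return (detailed_fraction * detailed_slowdown
            + (1.0 - detailed_fraction) * fast_slowdown)
```

With 0.1% of the instructions simulated at 10000x, `overall_slowdown(0.001, 10_000, 100)` gives about 110x when the rest runs under 100x emulation, whereas `overall_slowdown(0.001, 10_000, 2)` gives about 12x when the rest runs under a 2x direct-execution mode. Under the first regime the fast-forwarding dominates the cost, which is what pushes simulation towards slices at the beginning of the application.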



Calvin2 + DICE

[Figure: Calvin2 + DICE organization. Labels: original code, SPARC V9 assembly code, calvin2 (static code annotation tool), checkpoints, switching events, emulation mode, DICE (host ISA emulator), user analysis routines.]



Moving tools to IA64


New 64-bit ISA from Intel/HP:


Explicitly Parallel Instruction Computing


Predicated Execution


Advanced loads (i.e. speculative)


A very interesting platform for research !!




Porting SALTO and the Calvin2+DICE approach to IA64

Exploring new trade-offs enabled by instruction sets:

predicting the predicates?

advanced loads against predicting dependencies

ultimate out-of-order execution against compiler




Low power, compilation, architecture, …

(just beginning :=)



Power consumption becomes a major issue:


Embedded and general purpose



Compilation (setting up a collaboration with STMicroelectronics/Stanford/Milan):


Is it different from performance optimization ?


Global constraint optimization


Instruction Set Architecture support ?



Architecture:


High order bits are generally null, …


registers and memory


ALUs



Caches and branch predictors



International CAPS visibility in architecture =


skewed associative cache


+ decoupled sectored cache


+ multiple block ahead branch prediction


+ skewed branch predictor



Continue recurrent work on these topics:


multiple block ahead + tradeoffs complexity/accuracy





Simultaneous Multithreading


Sharing functional units among several processes


Among the first groups working on this topic


S. Hily’s Ph. D.


SMT behavior well understood for independent threads


now, focus on parallel threads from a single application



Current research directions:


speculative multithreading


ultimate performance with a single thread through predicting threads


performance/complexity tradeoffs: SMT/CMP/hybrid




“Enlarging” the instruction window (supported by Intel)

In an O-O-O processor, fireable instructions are chosen in a window of a few tens of RISC-like instructions.


Limitations are:


size of the window


number of physical registers


Prescheduling:


separate data flow scheduling from resource arbitration.


coarser units of work ?


Reducing the number of physical registers:


how to detect when a physical register is dead ?


Per group validation ? revisiting CISC/RISC war ?



Unwritten rule on superscalar processor designs

For general purpose registers:

Any physical register can be the source or the result of any instruction executed on any functional unit



4-cluster WSRS architecture (supported by Intel)

[Figure: 4-cluster WSRS organization. Labels: C0, C1, C2, C3 and S0, S1, S2, S3.]

Half the read ports, one fourth the write ports


Register file:



Silicon area x 1/8



Power x 1/2



Access time x 0.6


Gains on:


bypass network


selection logic





Multiprocessor on a chip




Not just replicating board level solutions !



A way to manage a large on-chip cache capacity:

how can a sequential application efficiently use a distributed cache?

architectural support for distributing a sequential application over several processors?

how should instructions and data be distributed?




HIPSOR

HIgh Performance SOftware Random number generation


Need for unpredictable random number generation:


sequences that cannot be reproduced



State of the art:


< 100 bit/s using the operating system


75Kbit/s using hardware generator on Pentium III



The internal state of a superscalar processor cannot be reproduced

use this state to generate unpredictable random numbers




HIPSOR (2)


1000’s of unmonitorable states modified by OS interrupts



Hardware clock counter to indirectly probe these states



Combined with in-line pseudo-random number generation



100 Mbit/s unpredictable random numbers


ARC INRIA with CODES
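
The principle described on the two HIPSOR slides (probe hard-to-reproduce machine state through a clock counter, then combine it with pseudo-random generation) can be illustrated with a small sketch. This is not the HIPSOR algorithm: the use of Python's `time.perf_counter_ns` in place of a hardware cycle counter, the SHA-256 mixing step, the small workload, and all function names are assumptions chosen for the example, and the output has not been evaluated for cryptographic use.

```python
import hashlib
import time


def timing_samples(count):
    """Time a small workload repeatedly; durations vary with machine state."""
    samples = []
    for _ in range(count):
        start = time.perf_counter_ns()
        sum(range(50))  # tiny workload whose duration depends on caches, interrupts, ...
        samples.append(time.perf_counter_ns() - start)
    return samples


def unpredictable_bytes(n_bytes, samples_per_block=64):
    """Fold timing samples into a running hash state and emit the digests."""
    out = bytearray()
    state = b""
    while len(out) < n_bytes:
        raw = b"".join(s.to_bytes(8, "little") for s in timing_samples(samples_per_block))
        # Mixing step: chain the previous state with the fresh timing data.
        state = hashlib.sha256(state + raw).digest()
        out += state
    return bytes(out[:n_bytes])
```

A call such as `unpredictable_bytes(32)` returns 32 bytes that differ from run to run, because the measured durations depend on state that the program does not control. The throughput of this sketch is unrelated to the figures quoted on the slides.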