1

CGO 2006:

The Fourth International Symposium on

Code Generation and Optimization


New York, March 26-29, 2006



Conference Review


Presented by: Ivan Matosevic



2

Outline




Conference overview


Brief summaries of sessions


Keynote speeches


Best paper

3

Conference Overview


Primary focus: back-end compilation techniques


Static analysis and optimization


Profiling


Run-time techniques


8 sessions, 29 papers


Dominating topics: multicores, dynamic
compilation

4

Overview of Sessions

1.
Dynamic Optimization

2.
Object-Oriented Code Generation and Optimization

3.
Phase Detection and Profiling

4.
Tiled and Multicore Compilation

5.
Static Code Generation and Optimization Issues

6.
SIMD Compilation

7.
Optimization Space Exploration

8.
Security and Reliability



5

Session 1: Dynamic Optimization


Kim Hazelwood (University of Virginia), Robert Cohn (Intel),
A Cross-Architectural Interface for Code Cache Manipulation



Pin: a dynamic instrumentation system with a code cache


The paper describes an API for various operations with the code
cache (callbacks, lookups, statistics, etc.)
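
As a rough illustration of what "code cache manipulation" operations can look like, here is a hypothetical C++ interface sketch. It is not Pin's actual API and not the interface defined in the paper; it only indicates the kinds of callbacks, lookups, and statistics the summary above mentions.

    #include <cstddef>
    #include <cstdint>
    #include <functional>

    // Hypothetical sketch only -- NOT the API described in the paper or shipped with Pin.
    // It merely illustrates the kinds of code-cache operations mentioned above.
    struct CacheStats {
        std::size_t blocks;      // number of cached translations
        std::size_t bytes_used;  // total cache footprint
        std::size_t evictions;   // translations flushed so far
    };

    class CodeCache {
    public:
        using CacheCallback = std::function<void(uint64_t original_pc, void* cached_code)>;
        virtual ~CodeCache() = default;

        // Register callbacks fired when translations are inserted or evicted.
        virtual void on_insert(CacheCallback cb) = 0;
        virtual void on_evict(CacheCallback cb) = 0;

        // Map an original program counter to its cached translation (nullptr if absent).
        virtual void* lookup(uint64_t original_pc) const = 0;

        // Invalidate one translation, or flush the entire cache.
        virtual void invalidate(uint64_t original_pc) = 0;
        virtual void flush_all() = 0;

        // Query occupancy statistics.
        virtual CacheStats stats() const = 0;
    };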



Derek Bruening, Vladimir Kiriansky, Tim Garnett, Sanjeev
Banerji (Determina Corporation),
Thread-Shared Software Code Caches


Problem: sharing a code cache across multiple threads


Authors propose a fine-grained locking scheme


Evaluation using DynamoRIO
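
To make "fine-grained locking" concrete, here is a minimal C++ sketch of the general idea applied to a shared code-cache lookup table, with one mutex per hash bucket. It is a generic illustration under my own assumptions, not the scheme the authors implemented in DynamoRIO.

    #include <cstdint>
    #include <mutex>
    #include <unordered_map>
    #include <vector>

    // Generic sketch of fine-grained locking for a thread-shared lookup table:
    // one mutex per bucket, so threads working on unrelated code rarely contend.
    class SharedCodeCacheIndex {
    public:
        explicit SharedCodeCacheIndex(std::size_t buckets = 256)
            : buckets_(buckets), locks_(buckets) {}

        void insert(uint64_t original_pc, void* cached_code) {
            std::size_t b = bucket_of(original_pc);
            std::lock_guard<std::mutex> guard(locks_[b]);
            buckets_[b][original_pc] = cached_code;
        }

        void* lookup(uint64_t original_pc) {
            std::size_t b = bucket_of(original_pc);
            std::lock_guard<std::mutex> guard(locks_[b]);
            auto it = buckets_[b].find(original_pc);
            return it == buckets_[b].end() ? nullptr : it->second;
        }

    private:
        std::size_t bucket_of(uint64_t pc) const { return pc % locks_.size(); }

        std::vector<std::unordered_map<uint64_t, void*>> buckets_;
        std::vector<std::mutex> locks_;
    };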


6

Session 1: Dynamic Optimization


Keith Cooper, Anshuman Dasgupta (Rice Univ.),
Tailoring Graph-coloring Register Allocation For Runtime Compilation


Problem: register allocation in JIT compilers


Authors propose a novel lightweight graph-colouring technique



Weifeng Zhang, Brad Calder, Dean Tullsen (UC San Diego),
A Self Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework


Extension of the Trident event-driven dynamic optimization
framework (previously proposed by the same authors)


Dynamic insertion of prefetching instructions based on run-time analysis


7

Session 2: Object-Oriented Code Generation and Optimization


Suresh Srinivas, Yun Wang, Miaobo Chen, Qi Zhang, Eric Lin,
Valery Ushakov, Yoav Zach, Shalom Goldenberg (Intel
Corporation),
Java JNI Bridge: An MRTE Framework for Mixed
Native ISA Execution


Use a dynamic translator for the execution of native calls to one
ISA on a different ISA’s Java platform



Kris Venstermans, Lieven Eeckhout, Koen De Bosschere (Ghent University),
Space-Efficient 64-bit Java Objects through Selective Typed Virtual Addressing


Use address bits on a 64-bit architecture to encode object type in
order to save memory


Objects of the same type allocated in a contiguous (virtual) region
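
The general idea can be illustrated with a small sketch (my own simplification, not the paper's exact layout): if the allocator reserves a separate virtual region per type, the region index derived from an object's address identifies its type, so no per-object type word is needed. The region size, base address, and type table below are arbitrary assumptions.

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // Illustrative sketch only (not the paper's exact scheme): each type gets its own
    // virtual region, so high-order address bits alone identify the type.
    constexpr unsigned kRegionShift = 32;              // assumption: 4 GiB of virtual space per type
    constexpr uint64_t kRegionBase  = 0x400000000000;  // assumption: base of the typed heap

    struct TypeInfo { std::string name; std::size_t size; };

    // One entry per region; index = (address - base) >> kRegionShift.
    const std::vector<TypeInfo> kRegionTypes = {
        {"java/lang/String", 24},
        {"java/util/HashMap", 48},
    };

    const TypeInfo& type_of(uint64_t object_address) {
        uint64_t region = (object_address - kRegionBase) >> kRegionShift;
        return kRegionTypes.at(region);   // the address recovers the type, no header word needed
    }

    int main() {
        // A hypothetical String object allocated somewhere inside region 0.
        uint64_t addr = kRegionBase + 0x1234;
        std::cout << "object at " << std::hex << addr
                  << " has type " << type_of(addr).name << '\n';
    }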


8

Session 2: Object-Oriented Code Generation and Optimization


Daryl Maier, Pramod Ramarao, Mark Stoodley, Vijay Sundaresan (IBM Canada),
Experiences with Multi-threading and Dynamic Class Loading
in a Java Just-In-Time Compiler


The IBM Testarossa JIT compiler


This paper focuses on code patching and profiling in a multi-threaded
environment with a lot of class loading/unloading



Lixin Su, Mikko H. Lipasti (University of Wisconsin-Madison),
Dynamic Class Hierarchy Mutation


Run-time reassignment of objects from one derived class to
another, changing their virtual tables


Offers opportunity for optimizations based on specialization

9

Session 3: Phase Detection and Profiling


Priya Nagpurkar (UCSB), Michael Hind (IBM), Chandra Krintz (UCSB),
Peter Sweeney, V.T. Rajan (IBM),
Online Phase Detection Algorithms



Detecting phase behaviour in virtual machines


Track dynamic program parameters (methods invoked, branch
directions…) over time and apply a similarity model
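
One simple way to apply such a similarity model (a generic sketch, not one of the specific algorithms the paper evaluates) is to summarize each profiling interval as a feature vector and flag a phase change whenever consecutive intervals are too far apart:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Minimal sketch of similarity-based phase detection (illustrative only):
    // each interval is summarized as a feature vector, e.g. method invocation counts,
    // and a phase change is flagged when consecutive intervals differ by more than
    // a threshold. Vectors are assumed to have the same length.
    double manhattan_distance(const std::vector<double>& a, const std::vector<double>& b) {
        double d = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
        return d;
    }

    std::vector<std::size_t> detect_phase_changes(
            const std::vector<std::vector<double>>& intervals, double threshold) {
        std::vector<std::size_t> changes;
        for (std::size_t i = 1; i < intervals.size(); ++i) {
            if (manhattan_distance(intervals[i - 1], intervals[i]) > threshold)
                changes.push_back(i);   // interval i starts a new phase
        }
        return changes;
    }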



Jeremy Lau, Erez Perelman, Brad Calder (UC San Diego),
Selecting Software Phase Markers with Code Structure Analysis


Portions of code whose execution correlates with phase changes


Procedure calls and returns, loop boundaries


Profile-based hierarchical loop-call graph



10

Session 3: Phase Detection and Profiling


Shashidhar Mysore, Banit Agrawal, Timothy Sherwood,
Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara),
Profiling over Adaptive Ranges


Voted best paper


details later



Hyesoon Kim, Muhammad Aater Suleman, Onur Mutlu, Yale N. Patt (UT-Austin),
2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set


Predicts whether the prediction accuracy of each branch will vary
across input sets


Heuristic approach used to derive representative profiling results
from a single input set


11

Session 4: Tiled and Multicore Compilation


David Wentzlaff, Anant Agarwal (MIT),
Constructing Virtual
Architectures on a Tiled Processor


Map components of a superscalar architecture (Pentium III) onto a
parallel tiled architecture (Raw) using dynamic translation


In a way, uses Raw as a coarse-grain FPGA



Aaron Smith (UT-Austin), J. Burrill (UMass at Amherst), J. Gibson,
B. Maher, N. Nethercote, B. Yoder, D. Burger, K. S. McKinley (UT-Austin),
Compiling for EDGE Architectures


TRIPS EDGE (Explicit Data Graph Execution) architecture


This paper focuses on compilation of standard C and FORTRAN
benchmarks

12

Session 4: Tiled and Multicore Compilation


Shih-wei Liao, Zhaohui Du, Gansha Wu, Guei-Yuan Lueh (Intel),
Data and Computation Transformations for Brook Streaming
Applications on Multiprocessors


Parallel compiler for the Brook streaming language


An extension of C that enables specifying data parallelism



Michael L. Chu, Scott A. Mahlke (University of Michigan),
Compiler-directed Object Partitioning for Multicluster Processors


Partitioning of data in clustered architectures such as Raw


I didn’t really understand what programming model the authors have in mind


13

Session 5: Static Code Generation and

Optimization Issues


Two papers about the HP-UX Itanium compiler:



Dhruva R. Chakrabarti, Shin-Ming Liu (Hewlett-Packard),
Inline Analysis: Beyond Selection Heuristics


Cross-module techniques for selection of inlined call sites and
the choice of specialized function versions


Robert Hundt, Dhruva R. Chakrabarti, Sandya S. Mannarswamy (Hewlett-Packard),
Practical Structure Layout Optimization and Advice


Data layout and placement on the heap to improve locality


Structure splitting, structure peeling, dead field removal, and
field reordering
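
As a reminder of what the first of these transformations does, here is a generic hot/cold structure-splitting example in C++ (an illustration of the concept with made-up fields, not output of the HP compiler): rarely used fields are moved into a separately allocated cold part so that hot traversals touch fewer cache lines.

    #include <cstdint>

    // Original layout: hot and cold fields interleaved in one struct.
    struct CustomerOriginal {
        uint64_t id;            // hot: read in every lookup
        char     name[64];      // cold: read only when printing reports
        uint64_t last_access;   // hot
        char     address[128];  // cold
    };

    // After splitting: the hot part keeps only hot fields plus a pointer to the cold part,
    // so scans over many Customer objects load far fewer cache lines.
    struct CustomerCold {
        char name[64];
        char address[128];
    };

    struct CustomerHot {
        uint64_t      id;
        uint64_t      last_access;
        CustomerCold* cold;      // dead-field removal would drop fields that are never read
    };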



14

Session 5: Static Code Generation and

Optimization Issues


Chris Lupo, Kent Wilken (University of California, Davis),
Post Register Allocation Spill Code Optimization


Authors propose a profile-based algorithm for placement of save/restore
instructions handling spilled variables in function calls


Implemented as a part of GCC



Seung Woo Son, Guangyu Chen, Mahmut Kandemir (Pennsylvania State University),
A Compiler-Guided Approach for Reducing Disk Power Consumption
by Exploiting Disk Access Locality


Goal: restructure code so that disk idle periods are lengthened


The approach targets array-based programs: disk layout of array
data exposed to the compiler

15

Session 6: SIMD Compilation


Jianhui Li, Qi Zhang, Shu Xu, Bo Huang (Intel China Software
Center),
Optimizing Dynamic Binary Translation for SIMD
Instructions



Algorithms for dynamic binary translation of SIMD instructions in
general-purpose architectures (such as MMX in x86)


Evaluation using IA-32 binaries on Itanium 2



Dorit Nuzman (IBM), Richard Henderson (Red Hat),
Multi-Platform Auto-Vectorization


Implementation of an automatic vectorizer for GCC 4.0



16

Session 7: Optimization Space Exploration


Felix Agakov, Edwin Bonilla, John Cavazos, Bjoern Franke,
Grigori Fursin, Michael O'Boyle, Marc Toussaint, John
Thomson, Chris Williams (U. of Edinburgh),
Using Machine
Learning to Focus Iterative Optimization


Predictive modelling used to search the optimization space


Targets embedded platforms


AMD Au1500 and Texas
Instruments TI C6713


Prasad Kulkarni, David Whalley, Gary Tyson (Florida State
University), Jack Davidson (University of Virginia),
Exhaustive
Optimization Phase Order Space Exploration


Exhaustive search of the phase order space (15 phases) using
aggressive pruning; takes time on the order of minutes to hours


Targets StrongARM SA-100



17

Session 7: Optimization Space Exploration


Zhelong Pan, Rudolf Eigenmann (Purdue University),
Fast and
Effective Orchestration of Compiler Optimizations for Automatic
Performance Tuning


Problem: find the optimal combination of 38 GCC O3 options,
targeting Pentium IV and Sparc II


The proposed heuristic algorithm provides a quality solution in time
on the order of several hours
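
The flavour of such orchestration can be sketched as a simple iterative elimination loop (a simplification for illustration, not necessarily the algorithm the paper proposes). Here `measure` is an assumed callback that compiles the program with the given options and returns its run time.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Illustrative orchestration sketch: start with every option enabled, then repeatedly
    // disable the single option whose removal gives the largest speedup, until nothing helps.
    using MeasureFn = std::function<double(const std::vector<std::string>& enabled)>;

    std::vector<std::string> orchestrate(std::vector<std::string> options, const MeasureFn& measure) {
        double best_time = measure(options);
        bool improved = true;
        while (improved && !options.empty()) {
            improved = false;
            std::size_t best_drop = options.size();
            for (std::size_t i = 0; i < options.size(); ++i) {
                std::vector<std::string> trial = options;
                trial.erase(trial.begin() + i);      // try running without option i
                double t = measure(trial);
                if (t < best_time) { best_time = t; best_drop = i; improved = true; }
            }
            if (improved) options.erase(options.begin() + best_drop);
        }
        return options;   // options worth keeping for this program and machine
    }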

18

Session 8: Security and Reliability


Edson Borin (UNICAMP), Cheng Wang, Youfeng Wu (Intel), Guido Araujo (UNICAMP),
Software-Based Transparent and Comprehensive Control-Flow Error Detection


Addresses the problem of soft (transient) errors that cause
branches to incorrect instructions


Implemented in SW as a part of a dynamic binary translator



Tao Zhang, Xiaotong Zhuang, Santosh Pande (Georgia Tech),
Compiler Optimizations to Reduce Security Overheads


Optimizations that specifically target techniques that implement
software protection with minimal HW support



19

Session 8: Security and Reliability


Susanta Nanda, Wei Li, Tzi-cker Chiueh (State University of NY at Stony Brook),
BIRD: Binary Interpretation using Runtime Disassembly


Goal: framework for automatic detection of vulnerabilities such as
buffer overflows when the source code is not available


Static and dynamic disassembly and instrumentation


Targets Windows x86 applications


20

Keynote Speeches


Wei Li, Principal Engineer, Intel:
"Parallel
Programming 2.0"




Kevin Stoodley, Fellow and CTO of Compilation
Technology, IBM:
"Productivity and Performance:
Future Directions in Compilers"



21

Wei Li:

Parallel Programming 2.0


Major technological change:


Moore’s Law continues to increase transistor counts


However: power, memory latency, limits to ILP are setting an
effective performance ceiling


General trend towards thread-level on-chip parallelism


SMT


Chip multiprocessors



22

Wei Li:

Parallel Programming 2.0


“Parallel Programming 2.0” refers to the advent of multicores


A very optimistic vision of the future was presented

23

Wei Li:

Parallel Programming 2.0


Key issue: where will the parallelism come from?


Parallel programming needs to become more
mainstream


Consumer vs. HPC/server/database


Inclusion in education at a more elementary level


New tools for greater ease of programming


Intel’s parallel programming tools


http://www.intel.com/software


24

K. Stoodley:
"Productivity and Performance:


Future Directions in Compilers"






Limits to traditional static compilation


Overview of IBM compiler technology


Testarossa JIT compiler, Toronto Portable Optimizer, Tobey
backend


Challenges at present and near future


Software abstraction complexity forces the scope of
compilation to higher levels


Maintaining high-performance backwards compatibility is
increasingly difficult







25

K. Stoodley:
"Productivity and Performance:


Future Directions in Compilers"




Future: convergence/combination of dynamic and
static compilation technologies

[Diagram: IBM compiler infrastructure. The xlc/xlC/xlf front ends emit W-Code;
the Toronto Portable Optimizer (TPO) with Profile-Directed Feedback (PDF) and the
TOBEY backend produce static machine code; class/jar files run on the J9 Execution
Engine, where the Testarossa JIT produces dynamic machine code; CPO and binary
translation connect the static and dynamic paths.]

26

Best Paper


Shashidhar Mysore, Banit Agrawal, Timothy
Sherwood, Nisheeth Shrivastava, Subhash Suri (UC
Santa Barbara):
Profiling over Adaptive Ranges



27

Profiling over Adaptive Ranges


Problem: how to count specific events
efficiently and accurately?


Code segments executed


Memory regions accessed


IP addresses of routed packets


In all cases, impossible to maintain separate
counters for the entire range of values


Each basic block, memory address, IP address…





28

Trade-off: Precision vs. Efficiency

[Figure: unlimited counters vs. uniform ranges]


Profiling with uniform ranges fails to distinguish hot code

29

Higher Precision for Hot Regions


Good trade-off with limited resources:


High precision for hot regions


Low precision for colder ones, but
this affects the accuracy less


Challenge: how to determine what exactly to count with
what precision?

30

Solution: Adaptive Profiling


Start with one counter; split counters as they become
hot:
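
A small software sketch of the splitting idea follows (the paper itself proposes and evaluates a hardware design; the hot-counter threshold below is an arbitrary assumption):

    #include <cstdint>
    #include <memory>

    // Software sketch of counter splitting: each node counts events anywhere in its
    // range; once a node becomes hot it is split into two children, so the hot region
    // is subsequently counted at finer granularity. The parent keeps the counts it
    // accumulated before the split.
    struct RangeNode {
        uint64_t lo, hi;                 // half-open range [lo, hi)
        uint64_t count = 0;
        std::unique_ptr<RangeNode> left, right;
        RangeNode(uint64_t l, uint64_t h) : lo(l), hi(h) {}
    };

    constexpr uint64_t kHotThreshold = 1024;   // assumed threshold for splitting

    void record(RangeNode& node, uint64_t value) {   // value assumed to lie in [node.lo, node.hi)
        if (node.left) {                             // already split: descend toward the value
            record(value < node.left->hi ? *node.left : *node.right, value);
            return;
        }
        if (++node.count >= kHotThreshold && node.hi - node.lo > 1) {
            uint64_t mid = node.lo + (node.hi - node.lo) / 2;
            node.left  = std::make_unique<RangeNode>(node.lo, mid);
            node.right = std::make_unique<RangeNode>(mid, node.hi);
        }
    }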

33

Counter Merging


Problem: what if program behaviour changes after
the initialization phase?

35

Counter Merging


Solution: perform counter merging along with splitting


36

Counter Merging


Counters of merged child nodes added to the parent

38

Counter Merging


Problem: how to identify nodes for merging?


By definition, these are the nodes that are not updated frequently


Solution: periodic batched merge operations


Tree depth grows at a logarithmic rate, so merging can be
done at exponentially increasing intervals
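
Continuing the RangeNode sketch from the splitting slide above, a periodic batched merge pass could fold cold sibling leaves back into their parent. Again, this is an illustrative simplification with an arbitrary threshold, not the paper's hardware mechanism.

    // Reuses the RangeNode type from the splitting sketch above.
    // A batched merge pass folds sibling leaves whose combined activity stayed low
    // back into their parent, reclaiming counters for regions that have gone cold.
    constexpr uint64_t kColdThreshold = 64;   // assumed cold threshold

    void merge_cold(RangeNode& node) {
        if (!node.left) return;
        merge_cold(*node.left);
        merge_cold(*node.right);
        // Merge only leaf children whose combined count is small.
        if (!node.left->left && !node.right->left &&
            node.left->count + node.right->count < kColdThreshold) {
            node.count += node.left->count + node.right->count;  // child counts added to the parent
            node.left.reset();
            node.right.reset();
        }
    }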

39

Additional Contributions


Heuristics for splitting and merging


Theoretical analysis of accuracy guarantees


Proposal for hardware implementation


Experimental evaluation


Memory requirements


Average and worst-case errors on benchmarks


Performance of HW implementation


Accuracies on the order of 98.0-99.8% with only 8-64K of memory

40

Conclusions


Highly interesting program


My short presentation certainly doesn’t do justice to most of
the mentioned works!


Readings to perhaps consider for future CARG:


D. Wentzlaff, A. Agarwal,
Constructing Virtual Architectures
on a Tiled Processor


A. Smith et al.,
Compiling for EDGE Architectures


F. Agakov et al., Using Machine Learning to Focus Iterative
Optimization



(Highly subjective!)