Advanced Compiler Design and Implementation

Mooly Sagiv


Course notes

Case Studies of Compilers and Future Trends


(Chapter 21)


By Ronny Morad & Shachar Rubinstein


24/6/2001

Case Studies

This chapter deals with real-life examples of compilers. For each compiler this scribe will discuss three subjects:



- A brief history of the compiler.

- The structure of the compiler, with emphasis on the back end.

- Optimizations performed on two programs. It must be noted that this test cannot be used to measure and compare the performance of the compilers.


The compilers examined are the following:

- Sun compilers for SPARC Versions 8 and 9.

- IBM XL compilers for the POWER and PowerPC architectures. POWER and PowerPC are classes of architectures; the processors are sold in different configurations.

- Digital's compiler for Alpha. (The Alpha processor technology was bought by Intel.)

- Intel's reference compiler for the 386 family.

Historically, compilers were built for specific processors. Today this is no longer a given: companies use other developers' compilers. For example, Intel uses IBM's compiler for the Pentium processor.


The compilers will compile two programs: a C program and a Fortran 77 program.


The C program:

int length, width, radius;
enum figure {RECTANGLE, CIRCLE};

main()
{
    int area = 0, volume = 0, height;
    enum figure kind = RECTANGLE;

    for (height = 0; height < 10; height++)
    {
        if (kind == RECTANGLE) {
            area += length * width;
            volume += length * width * height;
        }
        else if (kind == CIRCLE) {
            area += 3.14 * radius * radius;
            volume += 3.14 * radius * height;
        }
    }

    process(area, volume);
}



Possible optimizations (a combined sketch appears after the note below):

1. The value of 'kind' is constant and equal to 'RECTANGLE'. Therefore the second branch, the 'else' part, is dead code, and the first 'if' is also redundant.

2. 'length * width' is loop-invariant and can be calculated before the loop.

3. Because 'length * width' is loop-invariant, 'area' can be calculated with a single multiplication: 10 * length * width.

4. The calculation of 'volume' in the loop can be done using addition instead of multiplication.

5. The call to 'process()' is a tail call. This fact can be used to avoid the need to create a stack frame.

6. Compilers will probably use loop unrolling to increase pipeline utilization.

Note: Without the call to 'process()', all the code is dead, because 'area' and 'volume' aren't used.



The Fortran 77 program:

      integer a(500, 500), k, l
      do 20 k = 1, 500
        do 20 l = 1, 500
          a(k, l) = k + l
20    continue
      call s1(a, 500)
      end

      subroutine s1(a, n)
      integer a(500, 500), n
      do 100 i = 1, n
        do 100 j = i + 1, n
          do 100 k = 1, n
            l = a(k, i)
            m = a(k, j)
            a(k, j) = l + m
100   continue
      end


Possible optimizations:

1. a(k, j) is calculated twice. This can be prevented by using common-subexpression elimination.

2. The call to 's1' is a tail call. Because the compiler has the source of 's1', it can be inlined into the main program; this can be used to further optimize the resulting code. Most compilers will leave the original copy of 's1' intact.

3. If the procedure is not inlined, interprocedural constant propagation can be used to find out that 'n' is a constant equal to 500.

4. The access to 'a' is calculated using multiplication. This can be averted using addition: the compiler "knows" how the array is laid out in memory (in Fortran, arrays are stored by columns), so it can add the correct number of bytes to the address every time instead of recalculating it.

5. After 4, the loop counters aren't needed, and the loop conditions can be replaced by tests on the address. That is done using linear-function test replacement.

6. Again, loop unrolling will be used according to the architecture.

A sketch of optimizations 1, 4 and 5 follows.
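As a hedged illustration of items 1, 4 and 5, here is a C-level sketch (ours, not compiler output) of the innermost loop of 's1'. It assumes the Fortran array maps to a C declaration 'int a[500][500]' in which Fortran's column i is C's row i-1, so consecutive values of k are adjacent in memory:

void s1_inner(int a[500][500], int n, int i, int j)
{
    int *pi = &a[i - 1][0];   /* address of a(1, i), computed once       */
    int *pj = &a[j - 1][0];   /* address of a(1, j): (1) one address
                                 serves both the load and the store      */
    int *end = pj + n;        /* (5) the loop bound becomes an address test */
    while (pj < end) {
        *pj = *pi + *pj;
        pi++; pj++;           /* (4) one addition per element replaces the
                                 multiply in the subscript computation   */
    }
}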

Sun SPARC

The SPARC architecture

SPARC has two major versions of the architecture, Version 8 and Version 9.

SPARC Version 8 has the following features:

- A 32-bit superscalar, pipelined RISC system.

- Integer and floating-point units.

- The integer unit has a set of 32-bit general registers and executes load, store, arithmetic, logical, shift, branch, call and system-control instructions. It also computes addresses (register + register or register + displacement).

- The floating-point unit has 32 32-bit floating-point data registers and implements the ANSI/IEEE floating-point standard.

- Eight of the integer unit's registers are general-purpose globals; the first has a constant value of zero (r0 = 0).

- Three-address instructions of the following form: instruction src1, src2, result.

- Several 24-register windows (spilled by the OS). These are used to make procedure calls cheap. When there aren't enough register windows, the processor traps and the OS handles saving the registers to memory and refilling them with the necessary values.

SPARC Version 9 is a 64-bit version, fully upward-compatible with Version 8.

The assembly language guide is on pages 748-749 of the course book, tables A.1, A.2, A.3.

The SPARC compilers

General

Sun's SPARC compilers originated from the Berkeley 4.2 BSD UNIX software distribution and have been developed at Sun since 1982. The original back end was for the Motorola 68010; it was migrated successively to later members of the M68000 family and then to SPARC. Work on global optimization began in 1984, and on interprocedural optimization and parallelization in 1989. The optimizer is organized as a mixed model. Today Sun provides front ends, and thus compilers, for C, C++, Fortran 77 and Pascal.


The structure

[Figure: the front end emits Sun IR. From there the code either goes to yabe, which emits unoptimized relocatable code, or through the automatic inliner, the aliaser, and iropt (global optimization) to the code generator, which emits optimized relocatable code.]

The four compilers, for C, C++, Fortran 77 and Pascal, share the same back end. All four front ends emit Sun IR, an intermediate representation discussed later. The back end consists of two parts:

- yabe ("Yet Another Back End"), which creates relocatable code without optimization.

- An optimizer.


The optimizer is divided into the following components:

- The automatic inliner. This part works only at optimization level O4 (discussed later). It replaces some calls to routines within the same compilation unit with inline copies of the routines' bodies. Next, tail-recursion elimination is performed, and other tail calls are marked for the code generator to optimize.

- The aliaser. The aliaser uses information provided by the language-specific front end to determine which sets of variables may, at some point in the procedure, map to the same memory location. The aliaser's aggressiveness depends on the optimization level. Aliasing information is attached to each triple that requires it, for use by the global optimizer.

- iropt, the global optimizer.

- The code generator.


The Sun IR

The Sun IR represents a program as a linked list of triples representing executable operations, plus several tables representing declarative information. For example:

ENTRY "s1_" {IS_EXT_ENTRY, ENTRY_IS_GLOBAL}
    GOTO LAB_32
LAB_32:
    LTEMP.1 = (.n { ACCESS V41});
    i = 1;
    CBRANCH(i <= LTEMP.1, 1: LAB_36, 0: LAB_35);
LAB_36:
    LTEMP.2 = (.n { ACCESS V41});
    j = i + 1;
    CBRANCH(j <= LTEMP.2, 1: LAB_41, 0: LAB_40);
LAB_41:
    LTEMP.3 = (.n { ACCESS V41});
    k = 1;
    CBRANCH(k <= LTEMP.3, 1: LAB_46, 0: LAB_45);
LAB_46:
    l = (.a[k, i] { ACCESS V20});
    m = (.a[k, j] { ACCESS V20});
    *(a[k, j] = l + m { ACCESS V20, INT});
LAB_34:
    k = k + 1;
    CBRANCH(k > LTEMP.3, 1: LAB_45, 0: LAB_46);
LAB_45:
    j = j + 1;
    CBRANCH(j > LTEMP.2, 1: LAB_40, 0: LAB_41);
LAB_40:
    i = i + 1;
    CBRANCH(i > LTEMP.1, 1: LAB_35, 0: LAB_36);
LAB_35:

The CBRANCH is a generic conditional branch, not tied to the architecture. It provides two targets: the first is taken when the expression is true, the second when it is false.

This IR sits somewhere between LIR and MIR. It isn't LIR because there are no registers; it isn't MIR because memory is accessed through the compiler's memory organization, namely the LTEMPs.


Optimization levels

There are four optimization levels:

O1
Limited optimizations. This level invokes only certain optimization components of the code generator.

O2
This and higher levels invoke both the global optimizer and the optimizer components of the code generator. At this level, expressions that involve global or equivalenced variables, aliased local variables, or volatile variables are not candidates for optimization. Automatic inlining, software pipelining, loop unrolling, and the early phase of instruction scheduling are not done.

O3
This level optimizes expressions that involve global variables but makes worst-case assumptions about potential aliases caused by pointers, and it omits early instruction scheduling and automatic inlining. In practice this level gives the best results.

O4
This level aggressively tracks what pointers may point to, making worst-case assumptions only where necessary. It depends on the language-specific front ends to identify potentially aliased variables, pointer variables, and a worst-case set of potential aliases. It also does automatic inlining and early instruction scheduling. This level turned out to be very problematic because of bugs in the front ends.

The global optimizer

The optimizer's input is Sun IR, and so is its output.

The global optimizer performs the following on that input:

- Control-flow analysis is done by identifying dominators and back edges, except that the parallelizer does structural analysis for its own purposes.

- The parallelizer searches for commands the processor can execute in parallel. In practice it hardly improves execution time (the Alpha processor is where it has an effect, if any); most of the time it merely avoids interrupting the processor's parallelism.

- The global optimizer processes each procedure separately, using basic blocks. It first computes additional control-flow information. In particular, loops are identified at this point, including both explicit loops (for example, 'do' loops in Fortran 77) and implicit ones constructed from 'if's and 'goto's.

- Then a series of data-flow analyses and transformations is applied to the procedure. All data-flow analysis is done iteratively. Each transformation phase first computes (or recomputes) data-flow information if needed. The transformations are performed in this order:

1. Scalar replacement of aggregates and expansion of Fortran arithmetic on complex numbers into sequences of real-arithmetic operations.

2. Dependence-based analysis and transformations (levels O3 and O4 only, as described below).

3. Linearization of array addresses.

4. Algebraic simplification and reassociation of address expressions.

5. Loop-invariant code motion.

6. Strength reduction and induction-variable removal.

7. Global common-subexpression elimination.

8. Global copy and constant propagation.

9. Dead-code elimination.



- The dependence-based analysis and transformation phase is designed to support parallelization and data-cache optimization and may be done (under control of a separate option) when the optimization level selected is O3 or O4. The steps comprising it (in order) are as follows:

1. Constant propagation.

2. Dead-code elimination.

3. Structural control-flow analysis.

4. Loop discovery (including determining the index variables, lower and upper bounds, and increments).

5. Segregation of loops that have calls and early exits in their bodies.

6. Dependence analysis using the GCD and Banerjee-Wolfe tests, producing direction vectors and loop-carried scalar du- and ud-chains (a worked GCD example follows this list).

7. Loop distribution.

8. Loop interchange.

9. Loop fusion.

10. Scalar replacement of array elements.

11. Recognition of reductions.

12. Data-cache tiling.

13. Profitability analysis for parallel code generation.
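As a hedged illustration of the GCD test named in step 6 (the loop here is ours, not the book's): consider a loop whose body contains the write a(2*i) and the read a(2*i - 1). A dependence between them requires integer iterations i1, i2 with 2*i1 = 2*i2 - 1, i.e. 2*i1 - 2*i2 = -1. The GCD test checks whether the gcd of the coefficients divides the constant: gcd(2, 2) = 2 does not divide 1, so the equation has no integer solution and the two references can never touch the same element. The Banerjee-Wolfe test refines this by also using the loop bounds to bound the left-hand side.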

The code generator

After global optimization has been completed, the code generator first translates the Sun IR input into a representation called 'asm+', which consists of assembly-language instructions plus structures that represent control-flow and data-dependence information. An example is available on page 712. The code generator then performs a series of phases, in the following order:

1. Instruction selection.

2. Inlining of assembly-language templates whose computational impact is understood (O2 and above).

3. Local optimizations, including dead-code elimination, straightening, branch chaining, moving 'sethi's out of loops, replacement of branching code sequences by branchless machine idioms, and commoning of condition-code setting (O2 and above).

4. Macro expansion, phase 1 (expansion of switches and a few other constructs).

5. Data-flow analysis of live variables (O2 and above).

6. Software pipelining and loop unrolling (O3 and above).

7. Early instruction scheduling (O4 only).

8. Register allocation by graph coloring (O2 and above).

9. Stack frame layout.

10. Macro expansion, phase 2 (expansion of memory-to-memory moves, max, min, comparison of value, entry, exit, etc.). Entry expansion includes accommodating leaf routines and generation of position-independent code.

11. Delay-slot filling.

12. Late instruction scheduling.

13. Inlining of assembly-language templates whose computational impact is not understood (O2 and above).

14. Macro expansion, phase 3 (to simplify code emission).

15. Emission of relocatable object code.

The Sun compiling system provides for both static and dynamic linking; the selection is made by a link-time option.

Compilation results

The assembly code for the C program appears in the book on page 714; it was compiled at optimization level O4.

The assembly code for the Fortran 77 program appears in the book on pages 715-716; it was also compiled at level O4.

The numbers in parentheses below refer to the numbering of the possible optimizations listed for each program.

Optimizations performed on the C program

- (1) The unreachable code in the 'else' was removed, except for π, which is still loaded from .L_const_seg_900000101 and stored at %fp-8.

- (2) The loop invariant 'length * width' has been removed from the loop ('smul %o0,%o1,%o0').

- (4) Strength reduction of 'height': instead of multiplying by 'height', the previous value is added.

- (6) Loop unrolling by a factor of four ('cmp %l0,3').

- Local variables are kept in registers.

- All computations are done in registers.

- (5) The tail call is identified and optimized by eliminating the stack frame.

Missed optimizations on the C program

- Removal of the π computation (the dead load and store noted above).

- (3) Computing 'area' with a single multiplication.

- Completely unrolling the loop; only the first 8 iterations were unrolled.

Optimizations performed on the Fortran 77 program

- (2) Procedure integration of s1. The compiler can make use of the fact that n = 500 to unroll the loop, which it did.

- (1) Common-subexpression elimination of a(k,j).

- (6) Loop unrolling, from label .L900000112 to .L900000113.

- Local variables are kept in registers.

- Software pipelining. Note, for example, the load just above the starting label of the loop.


An example of software pipelining:

Consider running the following commands, assuming each depends on the previous one:

Load
Add
Store

The 'add' can't start until the 'load' has finished, and the 'store' can't start until the 'add' has finished. The compiler can improve this code by emitting the following:

Load
*Load
Add
*Store
Store

The compiler interleaves commands (marked *) that are needed later. This way, by the time the 'add' starts executing, the result of the first load is already available, and similarly for the 'store' and the 'add'.
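A minimal C-level sketch of the same idea (ours, not from the book; compilers do this on machine instructions): the load for iteration i+1 is issued while iteration i is still being combined and stored.

void add_one(int *a, int *b, int n)   /* assumes n >= 1 */
{
    int t = a[0];                /* prologue: first load issued early     */
    for (int i = 0; i < n - 1; i++) {
        int next = a[i + 1];     /* load for iteration i+1 ...            */
        b[i] = t + 1;            /* ... overlaps the add/store of iteration i */
        t = next;
    }
    b[n - 1] = t + 1;            /* epilogue: drain the pipeline          */
}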


Missed optimizations on the Fortran 77 program

- Eliminating s1. The compiler produced code for 's1()' even though the main routine is its only caller.

- Eliminating an addition in the loop via linear-function test replacement. This would have eliminated one of the additions in the resulting code.
POWER/PowerPC

The POWER/PowerPC architecture

The POWER architecture is an enhanced 32-bit RISC machine with the following features:

- It consists of branch, fixed-point, floating-point and storage-control processors.

- Individual implementations may have multiple processors of each sort, except that the registers are shared among them and there may be only one branch processor in a system. That is, a system is configurable and may be purchased with different numbers of processors.

- The branch processor includes the condition, link and count registers and executes conditional and unconditional branches and calls, system calls, and condition-register move and logical operations.

- The fixed-point processor contains 32 32-bit integer general-purpose registers, with register gr0 delivering the value zero when used as an operand in an address computation (gr0 = 0). It implements load, store, arithmetic, logical, compare, shift, rotate and trap instructions, as well as system-control instructions. There are two addressing modes, register + register and register + displacement, plus the capability to update the base register with the computed address.

- The floating-point processor contains 32 64-bit data registers and implements the ANSI/IEEE floating-point standard for double-precision values only.

- The storage-control processor provides for segmented main storage, interfaces with the caches and the translation look-aside buffer, and does virtual-address translation.

- Instructions typically have three operands, two sources and one result. The order is the opposite of SPARC's, the result first and then the operands: instruction result, src1, src2.

The PowerPC architecture is a nearly upward-compatible extension of POWER that allows for 32- and 64-bit implementations. It isn't 100% compatible because, for example, some instructions that were troublesome corner cases have been made invalid.

The assembly language guide is on page 750 of the course book, table A.4.


The IBM XL compilers

General

The compilers for these architectures are known as the XL family. The XL family originated in 1983 as a project to provide compilers for an IBM RISC architecture that was an intermediate stage between the IBM 801 and POWER but was never released as a product; it remained an academic project. The first compilers created were an optimizing Fortran compiler for the PC RT that was released to a selected few customers, and a C compiler for the PC RT used only for internal IBM development. The compilers were created with interchangeable back ends, so today they generate code for POWER, Intel 386, SPARC and PowerPC. The compilers were written in PL.8.

The compilers don't perform interprocedural optimizations. Almost all optimizations are performed on a proprietary low-level IR called XIL. Some optimizations that require a higher-level IR, for example optimizations on arrays, are performed on YIL, a higher-level representation created from XIL.

The structure

[Figure: XIL flows from the translator through the optimizer, the instruction scheduler, the register allocator (with a second pass of instruction scheduling), and instruction selection to final assembly, which emits the relocatable; the root-services module interacts with all phases.]

Each compiler consists of a front end called a translator, a global optimizer, an instruction scheduler, a register allocator, an instruction selector, and a phase called final assembly that produces the relocatable image and assembly-language listings. The root-services module interacts with all phases and serves to make the compilers compatible with multiple operating systems by, for example, holding information about how to produce listings and error messages.


The translator and XIL

A translator converts the source language to XIL using calls to XIL library routines. The XIL generation routines do not merely generate instructions; they may perform a few optimizations, for example generating a constant in place of an instruction that would compute the constant. A translator may consist of a front end that translates the source language to a different intermediate language, followed by a translator from that intermediate form to XIL.


An XIL compilation unit uses the data structures illustrated on page 720. The illustration shows the relationships among the structures. This organization may save memory space during compilation, but it makes debugging the compiler more difficult. The data structures are:

- A procedure descriptor table, which holds information about each procedure, such as the size of its stack frame and information about the global variables it affects, and a pointer to the representation of its code.

- A procedure list. The code representation of each procedure consists of a procedure list that comprises pointers to the XIL structures that represent its instructions. The instructions are quite low-level and source-language independent.

- A computation table. Each instruction is represented as an entry in this table, which is an array of variable-length records that represent preorder traversals of the intermediate code for the instructions.

- A symbolic register table. Variables and intermediate results are represented by symbolic registers, each of which is an entry in this table. Each entry points to the computation-table entry that defines it.

An example of XIL is on page 721.
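A hypothetical C rendering of these four structures (field and type names are ours, not IBM's; the real computation table uses variable-length records rather than a fixed struct):

typedef struct {             /* computation-table entry: one instruction,   */
    int opcode;              /* stored as a preorder traversal of its tree  */
    int operands[3];         /* operands name symbolic registers, etc.      */
} CTEntry;

typedef struct {             /* symbolic-register table entry               */
    CTEntry *def;            /* the computation-table entry that defines it */
} SymReg;

typedef struct {             /* procedure list: the code of one procedure   */
    CTEntry **instrs;        /* pointers to the entries, in program order   */
    int ninstrs;
} ProcList;

typedef struct {             /* procedure descriptor table entry            */
    int frame_size;          /* stack-frame size                            */
    /* ... information about the globals the procedure affects ...          */
    ProcList *code;          /* pointer to the code representation          */
} ProcDesc;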


TOBEY

The compiler back end (all the phases except the source-to-XIL translator) is named TOBEY, an acronym for TOronto Back End with Yorktown, indicating the heritage of the two groups that created it.

The TOBEY optimizer

The optimizer does the following:

- YIL is used for storage-related optimization.

o YIL is created by TOBEY from XIL and includes, in addition to the structures in XIL, representations for looping constructs, assignment statements, subscripting operations, and conditional control flow at the level of 'if' statements.

o It also represents the code in SSA form.

o The goal is to produce code that is appropriate for dependence analysis and loop transformations.

o After the analysis and transformations, the YIL is translated back to XIL.

- Alias information is provided by the translator to the optimizer through calls from the optimizer to front-end routines.

- Control-flow analysis uses basic blocks. TOBEY builds the flow graph within a procedure, uses DFS to construct a search tree, and divides the graph into intervals.

- Data-flow analysis is done by interval analysis, an older method than the dominator-based method for finding loops. The iterative form is used for irreducible intervals.

- Optimization is performed on each procedure separately.


The register allocator

TOBEY includes two register allocators:

- A "quick and dirty" local one, used when optimization is not requested.

- A global graph-coloring one based on Chaitin's, but with spilling done in the style of Briggs's work.

The instruction scheduler

- Performs basic-block and branch scheduling.

- Performs global scheduling.

- Runs again after register allocation if any spill code has been generated.

The final assembly

The final assembly phase makes two passes over the XIL:

- Peephole optimizations, such as removing compares.

- Generation of the relocatable image and listings.

Compilation results

The assembly code for the C program appears in the book on page 724.

The assembly code for the Fortran 77 program appears in the book on pages 724-725.

The numbers in parentheses refer to the numbering of the possible optimizations listed for each program.

Optimizations performed on the C program

- (1) The constant value of 'kind' has been propagated into the conditional and the dead code eliminated.

- (2) The loop invariant 'length * width' has been removed from the loop.

- (6) The loop has been unrolled by a factor of two.

- The local variables have been allocated to registers.

- Instruction scheduling has been performed.

Missed optimizations on the C program

- (5) The tail call to process() was not optimized.

- (3) The accumulation of 'area' has not been turned into a single multiplication.

Optimizations performed on the Fortran 77 program

- (3) Interprocedural constant propagation found that n = 500.

- (1) Common-subexpression elimination of a(k,j).

- (6) The inner loop has been unrolled by a factor of two.

- The local variables have been allocated to registers.

- Instruction scheduling has been performed.

Missed optimizations on the Fortran 77 program

- (2) The routine s1() has not been inlined.

- (5) Eliminating an addition in the loop via linear-function test replacement.
Intel 386

The Intel 386 architecture

The Intel 386 architecture includes the Intel 386 and its successors: the 486, Pentium, Pentium Pro and so on. The architecture is a thoroughly CISC design, though some implementations use RISC principles such as pipelining and superscalar execution.

It has the following characteristics:

- There are eight 32-bit integer registers.

- It supports 16- and 8-bit registers.

- There are six segment registers used in computing addresses.

- Some registers have dedicated purposes (e.g., pointing to the top of the current stack frame).

- There are many addressing modes.

- There are eight 80-bit floating-point registers.

The assembly language guide is on pages 752-753 of the course book, tables A.7 and A.8.

The Intel compilers

Intel provides compilers for C, C++, Fortran 77 and Fortran 90 for the 386 architecture family.

The structure of the compilers, which use the mixed model of optimizer organization, is as follows:

[Figure: the front end emits IL-1; the interprocedural optimizer, the memory optimizer, and the global optimizer each operate on IL-1 + IL-2; the code generator (code selector, register allocator, instruction scheduler) emits the relocatable.]

The front end is derived from work done at Multiflow and the Edison Design Group. The front ends produce a medium-level intermediate code called IL-1.

The interprocedural optimizer operates across modules. It performs a series of optimizations that include inlining, procedure cloning, parameter substitution, and interprocedural constant propagation.

The output of the interprocedural optimizer is a lowered version of IL-1, called IL-2, along with IL-1's program-structure information; this intermediate form is used by the remaining major components of the compiler, down through the input to the code generator.

The memory optimizer improves the use of memory and caches, mainly by performing loop transformations. It first does SSA-based sparse conditional constant propagation and then data-dependence analysis.

The global optimizer performs the following optimizations:



- constant propagation

- dead-code elimination

- local common-subexpression elimination

- copy propagation

- partial-redundancy elimination

- a second pass of copy propagation

- a second pass of dead-code elimination
Compilation results

The assembly code for the C program appears in the book on page 741.

The assembly code for the Fortran 77 program appears in the book on pages 742-743.

The numbers in parentheses refer to the numbering of the possible optimizations listed for each program.

Optimizations performed on the C program

- (1) The constant value of 'kind' has been propagated into the conditional and the dead code eliminated.

- (2) The loop invariant 'length * width' has been removed from the loop.

- (4) Strength reduction of 'height'.

- The local variables have been allocated to registers.

- Instruction scheduling has been performed.

Missed optimizations on the C program

- (6) Loop unrolling.

- (5) Tail-call optimization.

- (3) Turning the accumulation of 'area' into a single multiplication.

Optimizations performed on the Fortran 77 program

- (2) s1() has been inlined, and it is therefore found that n = 500.

- (1) Common-subexpression elimination of a(k,j).

- (5) Linear-function test replacement.

- Local variables have been allocated to registers.

Missed optimizations on the Fortran 77 program

- (6) Loop unrolling.

Compilers comparison

The performance of each of the compilers on the C example is summarized in the following table:

optimization                       Sun SPARC    IBM XL    Intel 386 family
constant propagation of 'kind'     yes          yes       yes
dead-code elimination              almost all   yes       yes
loop-invariant code motion         yes          yes       yes
strength reduction of 'height'     yes          yes       yes
reduction of 'area' computation    no           no        no
loop unrolling factor              4            2         none
rolled loop                        yes          yes       yes
register allocation                yes          yes       yes
instruction scheduling             yes          yes       yes
stack frame eliminated             yes          no        no
tail call optimized                yes          no        no


The performance of each of the compilers on the Fortran example is summarized in the following table:

optimization                               Sun SPARC   IBM XL   Intel 386 family
address of a(k,j) a common subexpression   yes         yes      yes
procedure integration of s1()              yes         no       yes
loop unrolling factor                      4           2        none
rolled loop                                yes         yes      yes
instructions in innermost loop             21          9        4
linear-function test replacement           no          no       yes
software pipelining                        yes         no       no
register allocation                        yes         yes      yes
instruction scheduling                     yes         yes      yes
elimination of s1() subroutine             no          no       no


Future trends

There are several clear trends developing for the near future of advanced compiler design and implementation:

- SSA is being used more and more (see the small example after this list):

o it allows methods designed for basic blocks and extended basic blocks to be applied to whole procedures;

o it improves performance.

- More use of partial-redundancy elimination.

- Partial-redundancy elimination and SSA are being combined.

- Scalar-oriented optimizations are being integrated with parallelization and vectorization.

- Advances in data-dependence testing, data-cache optimization and software pipelining.

- The most active research in scalar compilation will continue to be optimization.
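A minimal illustration of SSA form (the snippet is ours): every variable is renamed so that each name has exactly one definition, with a phi function merging values at join points. That single-definition property is what lets analyses designed for basic blocks extend to whole procedures.

Original code:            In SSA form (phi is conceptual):
  x = 1;                    x1 = 1;
  if (c)                    if (c)
      x = x + 1;                x2 = x1 + 1;
  y = x * 2;                x3 = phi(x2, x1);
                            y1 = x3 * 2;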

Other trends

- More and more work will be shifted from hardware to compilers.

- More advanced hardware will be available.

- Higher-level programming languages will be used:

o memory management will be simpler;

o modularity facilities will be available;

o assembly programming will hardly be used.

- Dynamic (runtime) compilation will become more significant.

Theoretical Techniques in Compilers

technique                 compiler field
data structures           all
automata algorithms       front end, instruction selection
graph algorithms          control-flow analysis, data-flow analysis, register allocation
linear programming        instruction selection (complex machines)
Diophantine equations     parallelization
randomized algorithms     not used yet