Managed Compiler Infrastructure
Design and Internals of a Modern Virtual Machine
Alex Rønne Petersen
This book is for sale at http://leanpub.com/mci
This version was published on 2013-01-22
©2012 - 2013 The Lycus Foundation
Contents
Acknowledgements
Preface
1 Introduction
2 Motivation
3 Assembly Language
4 Type System
4.1 Primitive Types
4.2 Structure Types
4.3 Type Specifications
5 Instruction Set
6 Intrinsics
7 Verifiable Code
8 Concurrency
9 Garbage Collection
10 Optimization Passes
11 JIT Compilation
12 AOT Compilation
13 Tool Chain
14 Glossary
Acknowledgements
The author of this book and the developers of MCI would like to thank the following projects
and organizations for helping in the creation and shaping of MCI:

• DMD (http://github.com/D-Programming-Language/dmd)
• GCC (http://gcc.gnu.org)
• GDC (http://gdcproject.org)
• Joeq (http://joeq.sourceforge.net)
• LDC (http://github.com/ldc-developers/ldc)
• LLVM (http://www.llvm.org)
• Mono (http://www.mono-project.com)
• Parrot (http://www.parrot.org)
• SDC (http://github.com/bhelyer/SDC)
• SL# (http://github.com/IgniteInteractiveStudio/SLSharp)

Many people in these projects and organizations have provided advice and thoughts about
design decisions during the development of MCI. From some projects we have learned by
studying their design choices and implementations. We would like to thank everyone in these
communities for their support, whether direct or indirect.
This book’s cover image (http://www.flickr.com/photos/nomadic_lass/5784379119) is
copyright (c) 2011 Nomadic Lass, made available under a CC BY-SA 2.0 License
(http://creativecommons.org/licenses/by-sa/2.0/legalcode).
Preface
This book introduces the Managed Compiler Infrastructure, a compiler back end and virtual
machine for so-called ‘managed’ languages. It explains how this virtual machine works, why it
was created, and what considerations drove its design. This book is not intended to be a
guide on how to use MCI to build compilers, but rather a guide to its internals, mostly targeted
at people interested in hacking on MCI, researching with it, or extending it in some way.
I wrote this book because, if you’re going to be hacking on MCI, you’re not likely to find the
documentation entirely adequate - that’s because the documentation is very user-oriented; it’s
not an appropriate place for noting historical considerations and thoughts about the virtual
machine’s design. This book, on the other hand, caters to the curious compiler hacker and
contains all the nitty-gritty details.
Hopefully, the contents of this book will come together into one big picture illuminating the
rationale behind the design of MCI as a whole.
I hope you enjoy reading this book - even if you’re not a compiler engineer or language
designer, you’re likely to find many interesting thoughts in this book that’ll change your view
on programming languages and their implementations.
1 Introduction
The MCI is a modern and intuitive compiler infrastructure (also known as a back end) written
in the D 2.0 programming language. The API is designed to be easy to use for modern
programming language compilers, and to be as future-proof as possible.
Compiler front ends will primarily use the MCI through the IAL ISA. IAL, Intermediate Assembly
Language, is a typed, four-address instruction set. It’s a simple IR consisting of types (containing
fields) and functions, which contain typed registers and basic blocks. A basic block is a simple
linear sequence of instructions ending with a terminator instruction (which can branch to
another basic block, return from the function, throw an exception, etc.).
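To make the shape of this IR concrete, here is a minimal sketch of functions, typed registers, and basic blocks as plain data. It is only an illustration under stated assumptions: the class names, the opcode names (`load`, `add`, `jump`, `return`, `throw`), and the type name `int32` are all hypothetical and are not taken from the MCI API or the IAL specification.

```python
from dataclasses import dataclass, field

# Hypothetical opcode names; the real IAL opcodes are not listed in this chapter.
TERMINATORS = {"jump", "return", "throw"}


@dataclass
class Instruction:
    opcode: str
    operands: tuple = ()


@dataclass
class BasicBlock:
    name: str
    instructions: list = field(default_factory=list)

    def is_well_formed(self):
        """True if the block is non-empty and only its last instruction is a terminator."""
        if not self.instructions:
            return False
        *body, last = self.instructions
        return (last.opcode in TERMINATORS
                and all(i.opcode not in TERMINATORS for i in body))


@dataclass
class Function:
    name: str
    registers: dict = field(default_factory=dict)  # register name -> type name
    blocks: list = field(default_factory=list)


# A block that ends with a terminator, and one that does not.
entry = BasicBlock("entry", [
    Instruction("load", ("r0", 1)),
    Instruction("add", ("r1", "r0", "r0")),
    Instruction("return", ("r1",)),
])
broken = BasicBlock("broken", [Instruction("add", ("r1", "r0", "r0"))])

main = Function("main", {"r0": "int32", "r1": "int32"}, [entry])
```

Because every block ends in exactly one terminator, control can only enter a block at the top and leave at the bottom, which is what makes the blocks convenient units for analysis.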
The MCI is a full-featured development tool chain and, as such, provides a wide variety of tools:

• Framework libraries
  These are the libraries most commonly used by compilers (front ends) and other tools.
  They contain the core functionality of the IR and type system, analysis and verification,
  optimization, memory management, and native compilation.

• IAL assembler
  This is a simple tool which assembles text-form IAL code to CIAL modules. It is primarily
  useful when debugging the MCI itself or when experimenting with the infrastructure.

• Code verifier
  This tool verifies that IAL code is well-formed. There is a set of rules for what code can and
  cannot do, and they must be enforced before code reaches the optimization and execution
  stages (the linking stage allows invalid code).

• Module linker
  This is a simple tool that will link a set of CIAL modules into one module. Its primary
  purpose is simply to bundle applications as one unit.

• IR optimizer
  This is the primary optimization pipeline. It takes a CIAL module and runs a variety of
  optimizations on the code depending on the user-specified aggressiveness.

• Interpreter
  The interpreter serves primarily as a means of executing CIAL modules in an environment
  which doesn’t yet support full JIT compilation. It generally follows the IAL ISA
  specification.

• JIT compiler
  This is the tool most people will be running code with. It loads a CIAL module (and any
  associated modules), compiles the entry point, and starts executing the code. Normally,
  functions are compiled on demand (i.e. as they get called), but the JIT compiler can also
  compile all functions immediately on startup (the Clang-based back end does this).

• AOT compiler
  The AOT compiler takes a CIAL module and compiles all IAL code to native code
  immediately, writing it to a file on disk in the process. This code will then be loaded by
  the JIT compiler instead of doing full compilation. This is useful if a program is invoked
  multiple times but only runs for a short time on each run.

• Debugger
  The debugger works much like any other: it supports pausing/resuming, breakpoints,
  catchpoints, variable and thread inspection, disassembly, and so on. It works through
  a well-defined socket protocol that is utilized by both the JIT and interpreter; in other
  words, it is a so-called cooperative debugger, as it works with the runtime.
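The difference between compiling on demand and compiling everything at startup, as described for the JIT compiler above, can be sketched with a small cache. This is a toy model, not the MCI's implementation: the `LazyJit` class is hypothetical, a Python callable stands in for a function's IAL code, and the "compilation" step simply records that it happened.

```python
class LazyJit:
    """Toy stand-in for a JIT: 'compiles' a function the first time it is called."""

    def __init__(self, module, eager=False):
        self.module = module      # function name -> callable standing in for IAL code
        self.compiled = {}        # function name -> "native" callable
        self.compile_count = 0
        if eager:
            # Compile every function up front instead of waiting for calls.
            for name in module:
                self._compile(name)

    def _compile(self, name):
        self.compile_count += 1
        # Real code generation would happen here; the sketch reuses the callable.
        self.compiled[name] = self.module[name]
        return self.compiled[name]

    def call(self, name, *args):
        fn = self.compiled.get(name)
        if fn is None:
            fn = self._compile(name)  # first call: compile on demand
        return fn(*args)


module = {"double": lambda x: 2 * x, "unused": lambda: None}

jit = LazyJit(module)
result = jit.call("double", 21)  # compiles "double" once
jit.call("double", 1)            # reuses the cached code
```

In the on-demand mode, a function that is never called (`unused` here) is never compiled, which keeps startup cheap at the cost of a pause on each first call.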
2 Motivation
It is natural to question why we would create a system like the MCI. After all, a number
of very good VMs exist already, with excellent support for high-quality code generation and GC.
Below are some comparisons between the MCI and some of these VMs:

• HotSpot: http://openjdk.java.net/groups/hotspot
  Originally made as a VM for the Java programming language. The HotSpot VM has
  excellent code generation and a variety of GC algorithms (concurrent mark-sweep,
  parallel scavenge, compacting mark-sweep). However, being designed for Java, it has
  no support for user-defined value types, low-level pointer operations, stack allocation, etc.
  Additionally, it has no support for statically unrolled vector operations. These factors all
  limit the HotSpot VM’s usefulness in computationally expensive work, as well as the number
  of languages that can realistically target it. That being said, the VM was originally never
  meant to run anything but Java, so it is forgivable that a more general design wasn’t
  thought up.

• LLVM: http://www.llvm.org
  Originally intended as a universal IR, LLVM evolved into a system more suited for C-like
  languages. LLVM’s IR is extremely flexible, and there is virtually no operation that
  cannot be expressed by both LLVM and the MCI. LLVM has very good code generation
  and optimization, including full support for vectorization. However, LLVM itself is more
  of a compilation system than an actual VM and, as such, has no built-in GC. It also was
  not designed for JIT compilation, but rather static compilation (e.g. for C/C++), and can
  therefore be relatively slow compared to other JIT compilers.
  It is important to note that LLVM’s lack of a runtime system is actually a feature. It was
  designed to be independent of any specific GC implementation, memory layout, execution
  engine, etc. This is a different philosophy from that of the MCI.

• Mono: http://www.mono-project.com
  Mono is a standards-compliant implementation of the Ecma 335 CLI ISA and VM. It aims
  to provide full compatibility with the Microsoft implementation (the .NET Framework), as
  far as is possible, on POSIX platforms. Mono provides a built-in JIT compiler called Mini,
  but also allows building with an LLVM back end. It provides support for the conservative
  Boehm GC, as well as the home-grown SGen, which is a precise, generational GC using
  copying and mark-sweep algorithms. Mono provides support for vectorization through
  the intrinsic Mono.Simd assembly. Being an implementation of Ecma 335, it also has full
  support for user-defined value types, pointers, stack allocation, etc.
  The fundamental difference between Mono and the MCI is that Mono’s assembly format
  is based on Ecma 335’s CIL format. This format is very specialized for object-oriented
  systems and imposes certain limitations on how objects must be laid out in memory (and
  in particular, prevents multiple inheritance).
What we want to do with MCI is create a universal managed virtual machine. We want
as many compilers as possible to be able to target our ISA and runtime system, without imposing
specific paradigms (object-oriented, functional, or similar) on the format of the code the MCI
reads, compiles, and executes. On top of the MCI, high-level abstraction layers can then be
implemented for specific paradigms (object-oriented, functional, logical, and so on).
Another problem with all of the aforementioned VMs is that they are implemented in C or C++.
These languages are very low-level and error-prone, and exposing a simple and clean API is hard.
The idea is that compilers should be able to call the MCI API directly in order to emit code, rather
than forcing compiler engineers to implement MCI-compatible reading and writing code for the
binary format (CIAL). Not only that, but implementing the MCI in a language without automatic
garbage collection would be a daunting task, and would require countless hours of debugging
memory errors.
3 Assembly Language
4 Type System
As the MCI is intended for high-level languages, it has a full-blown type system which is
statically enforced. A program that is well-typed is easier both to optimize and to generate code
for, as knowledge implied by types can be exploited. For instance, when a structure is declared,
we can exploit knowledge of the target system to lay out the fields of the structure in a manner
that results in efficient addressing. On a 32-bit system, it is favorable for all fields to be on an
address that is a multiple of 4, while on a 64-bit system, we’d prefer a multiple of 8, and so on.
With vector types, we can exploit the static length to statically unroll vector operations to SIMD
instructions.
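The layout rule described above (every field starts at a multiple of the target's word size) can be sketched as follows. This is an illustration of that simple rule only, not the MCI's actual layout algorithm; the function name `layout_fields`, the example field names, and their sizes are hypothetical.

```python
def layout_fields(fields, word_size):
    """Return ({name: offset}, total_size), placing each field on a word_size boundary.

    `fields` is a list of (name, size_in_bytes) pairs in declaration order.
    """
    offsets = {}
    offset = 0
    for name, size in fields:
        # Round the running offset up to the next multiple of word_size.
        offset = (offset + word_size - 1) // word_size * word_size
        offsets[name] = offset
        offset += size
    # Pad the structure itself so that consecutive instances stay aligned.
    total = (offset + word_size - 1) // word_size * word_size
    return offsets, total


fields = [("flag", 1), ("count", 4), ("id", 2)]
offsets32, size32 = layout_fields(fields, 4)  # 32-bit target: multiples of 4
offsets64, size64 = layout_fields(fields, 8)  # 64-bit target: multiples of 8
```

With these example fields, the 32-bit layout places `flag`, `count`, and `id` at offsets 0, 4, and 8 (12 bytes in total), while the 64-bit layout places them at 0, 8, and 16 (24 bytes in total). The trade-off is visible in the totals: aligned access is bought with padding.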
Having a type system also helps ensure correctness. It is easier to spot errors in a compiler front
end’s output if all values have an associated type.
4.1 Primitive Types
Primitive types are the building blocks of all other types. They represent integers and floating-
point values which can (usually) fit into a machine’s native registers (the exception is 64-bit
integers, which are split across two 32-bit registers on 32-bit targets).
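As an illustration of the 64-bit exception mentioned above, the sketch below adds two 64-bit integers using only 32-bit halves and an explicit carry, which is roughly what a code generator has to arrange on a 32-bit target. The helper names are hypothetical, and the sketch models unsigned wrap-around arithmetic; it does not describe the instruction sequence the MCI itself emits.

```python
MASK32 = 0xFFFFFFFF


def split64(value):
    """Split a 64-bit unsigned value into (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def add64_on_32bit(a, b):
    """Add two 64-bit unsigned values using only 32-bit halves and a carry."""
    a_lo, a_hi = split64(a)
    b_lo, b_hi = split64(b)
    lo = a_lo + b_lo
    carry = lo >> 32                      # 1 if the low halves overflowed
    hi = (a_hi + b_hi + carry) & MASK32   # high halves plus carry, wrapped to 32 bits
    return (hi << 32) | (lo & MASK32)


total = add64_on_32bit(0xFFFFFFFF, 1)  # carry propagates into the high half
```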
4.1.1 Integral Types
4.1.2 Floating Point Types
4.2 Structure Types
4.3 Type Specifications
4.3.1 Reference Types
4.3.2 Pointer Types
4.3.3 Array Types
4.3.4 Vector Types
4.3.5 Function Pointer Types
5 Instruction Set
6 Intrinsics
7 Verifiable Code
8 Concurrency
9 Garbage Collection
10 Optimization Passes
11 JIT Compilation
12 AOT Compilation
13 Tool Chain
14 Glossary
ALU   Arithmetic/Logic Unit
AOT   Ahead-Of-Time
API   Application Programming Interface
AST   Abstract Syntax Tree
BB    Basic Block
CIAL  Compiled Intermediate Assembly Language
CIL   Common Intermediate Language
CLI   Common Language Infrastructure
CSE   Common Sub-expression Elimination
DCE   Dead Code Elimination
EH    Exception Handling
FFI   Foreign Function Interface
GC    Garbage Collection, Garbage Collector
IAL   Intermediate Assembly Language
IPA   Inter-procedural Analysis
IPO   Inter-procedural Optimization
IR    Intermediate Representation
ISA   Instruction Set Architecture
JIT   Just-In-Time
JVM   Java Virtual Machine
LTO   Link-Time Optimization
MCI   Managed Compiler Infrastructure
PRE   Partial Redundancy Elimination
RTO   Runtime Object
RTV   Runtime Value
SCCP  Sparse Conditional Constant Propagation
SSA   Static Single Assignment
VM    Virtual Machine