Introduction & Overview


24 Oct 2013


Introduction

Compiler

References


Textbook: Compilers: Principles, Techniques, and Tools, Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Second Edition, Addison-Wesley, 2007


References: Programming Language Processors in Java: Compilers and Interpreters, D. A. Watt and D. F. Brown, Pearson Education Ltd.



Assessment:


25% Coursework


75% Final Exam


Objectives



To introduce principles, techniques, and tools
for compiler construction


To understand what a compiler does and how to build one.

Course Outline


1. Introduction, Structure of a Compiler

2. Lexical Analysis: Tokens, Regular Expressions

3. Parsing: Context-free grammars, predictive parsing

4. Abstract Syntax: Semantic actions, abstract parse trees

5. Semantic Analysis: Symbol tables, bindings, type-checking

6. Stack Frames: Representation and Abstraction

7. Intermediate Code: Representation trees, translation

Why do we need to know about compilers?

All software is written in a programming language.

Learning about compilers will teach you a lot about the programming languages you already know.

Seeing the development of a compiler gives you a feeling for how programs work.

A great example of interplay between theory and practice.

Many algorithms and models you will use in compilers are fundamental, and will be useful to you elsewhere:

• automata, regular expressions (lexing)

• context-free grammars, trees (parsing)

• hash tables (symbol table)

• dynamic programming, graph coloring (code generation)

Why study compilers?



Compilers improve programming productivity.

To enhance understanding of programming languages.

To gain in-depth knowledge of low-level machine executables.

To write compilers and interpreters for various programming languages and domain-specific languages. Examples: Java, JavaScript, C, C++, C#, Modula-3, Scheme, ML, Tcl/Tk, database query languages, Mathematica, Matlab, shell command languages, Awk, Perl, your .mailrc file, HTML, TeX, PostScript, Kermit scripts, ...

To learn interesting compiler theory and algorithms.

To learn the beauty of programming in a modern programming language.

To learn how to use compilers well.

To learn how to write them.

To illuminate programming language design.

As an example of a large software system.

To motivate interest in formal language theory.


Computer Organization

Hardware

Operating System

Compiler

Application

History, Programming Languages

Machine coding (binary programming, punch holes) (first generation)

The computer’s ‘native language’, binary digits (0s, 1s)


0100 0001 0110 1110 0100 0001


0001 0010 1100 0100 0000 1101


Programming in machine code is

• very slow,

• error prone,

• requires a detailed knowledge of the relevant computer architecture,

• difficult to understand other people’s code,

• code becomes obsolete if the machine is changed.


Assembly Language

(second generation)

One-to-one correspondence to machine language

MOV AX, 5h

MOV DX, 3h

ADD A

Assembler: translates assembly language programs into machine language

History, Programming Languages (High-Level Languages)

Procedural Languages

(third generation)


Instructions translate into machine language instructions


Uses common words rather than abbreviated mnemonics

C, C++, Java, Fortran, QuickBasic



A = 3

B = A * 2 - 1

D = A / B + A^5



Compiler: translates the entire program at once

Interpreter: translates and executes one source program statement at a time


History, Programming Languages (High-Level Languages)


Nonprocedural Languages

(fourth generation)

Allows the user to specify the desired result without
having to specify the detailed procedures needed for
achieving the result.



Structured Query Language (SQL)



Natural Language Programming Languages (fifth generation (intelligent) languages)

Translate natural languages into a structured, machine-readable form

High-Level Languages

Expressions: built from operators such as +, -, *, /

Data Types: simple types (e.g. Boolean, int, float) as well as composite structures (records) and arrays, which can be defined by the programmer

Control Structures: allow programming of selective computation as well as iterative computation

Declarations: introduce identifiers to denote constant values, variables, procedures, etc.

Abstraction: separation of concerns, i.e. break a problem up and deal with subsets of it

Encapsulation (data abstraction): grouping related items and selectively hiding specific information (e.g. classes)

Why high-level languages?

• Understandability (readability)

• Naturalness (languages for different applications)

• Portability (machine-independent)

• Efficient to use (development time)

Language Processors

Editors: (to enter text) they can process text based on the logical structure of the text.

Translator: translates text from one language to another.

Compiler: translates from a high-level language to a low-level language.

Interpreter: takes a text (in a particular language) and runs it immediately.

Assembler: translates from an assembly language into the corresponding machine code. Assembly language is easier to produce as output and is easier to debug.

Language Processors

Simulator, Emulator: machine code → machine code (the machine code is interpreted), e.g. simulating a processor on an existing processor.

Preprocessor: extended high-level language → high-level language. Preprocessors are sometimes called before the actual compilation process, e.g. to remove comments, include the text of other files, and perform macro substitutions (replace shorthand notation with a longer piece of text).

Natural language translators: e.g. Chinese → English.

Assembler

The Assembler is responsible for translating the target code (usually assembly code) into executable machine code.

The assembly code is a mnemonic version of machine code in which:

1. Names are used instead of binary codes for operations (Code Table).

2. Names are used for operands instead of memory locations (Symbol Tables).

Assembly level programming:

- improves productivity,

- is less error prone,

- is somewhat easier to understand,

- code is as efficient as the machine code.

but

- it requires detailed knowledge of a computer architecture,

- code is machine dependent,

- code is obsolete when a machine is changed.

It soon became apparent that programming should be done in a machine-independent high-level language (HLL).

Compilers & Interpreters

Interpreters are another class of translators.

Compiler: translates a program once and for all into the target language, e.g. “C++”.

Interpreter: effectively translates a source program every time it is run, e.g. “Basic”.

Hybrid: compilers and interpreters are used together, e.g. “Java”.

Java is compiled into Java byte code,

byte code is interpreted by a Java Virtual Machine (JVM).

What is a Compiler?


A compiler is a program that reads a program
written in one language (source language) and
translates it into an equivalent program in
another language (target language).

[Figure: Source Program → Compiler → Target Program, with Error messages reported by the compiler]

Compiler

Source programs: many possible source languages, from traditional to application-specific languages.

• Programming languages (high-level)

• Modeling languages

• Document description languages

• Database query languages

Target programs: another programming language, often the machine language of a particular computer system.

• High-level programming language

• Low-level programming language (assembler or machine code)

• Application-specific target language

Error messages: essential for program development

Do we need Compilers?


Machines understand only 1’s and 0’s. High-level languages make it easier for the user to program, but not for the machine to understand.

Once the programmer has written and edited the program (in an editor), it needs to be translated into machine language (1’s and 0’s) before it can be executed.

Compilers are used to do this conversion.

Where are compilers used?


Implementation of programming languages: C, C++, Java, Lisp, Prolog, SML, Haskell, Ada, Fortran.

Document processing: DVI → PostScript, Word documents → PDF

Natural language processing: NL → database query language → database commands

Hardware design: silicon compilers, CAD data → machine operations, equipment lists

Report generation: CAD data → list of parts

All kinds of input/output translations: various UNIX text filters, ...

Interpreter


Given the program source code and the run-time input, interpret the source code directly, i.e. parse and simulate it, statement by statement (syntax-directed interpretation).

• UNIX shells (command line interpreters)

• Early interpreters for BASIC, LISP, APL

Good for debugging.

Very slow, but OK for small scripts.


Compiler / Translator and Interpreter

A translator is used to produce an “equivalent” program in another language (e.g. from C to Pascal).

A compiler is a translator that generally takes in a higher-level language (e.g. C) and transforms it into a low-level language (usually object or machine code).

A compiler/translator produces the entire output code before executing.

An interpreter translates and executes a statement at a time before moving on to the next statement.

Compiler / Translator and Interpreter

The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs.

An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.

Interpreters versus Compilers

What are the tradeoffs between compilation and interpretation?

Compilers typically offer more advantages when


programs are deployed in a production setting


programs are “repetitive”


the instructions of the programming language are complex

Interpreters typically are a better choice when


we are in a development/testing/debugging stage


programs are run once and then discarded


the instructions of the language are simple


the execution speed is overshadowed by other factors,

e.g. on a web server where communication costs are much higher than execution costs

Hybrid compiler / interpreter

How does Java work?

[Figure: Java Source Code → Compiler (javac) → Java Byte Code → Interpreter (java) running on the Java Virtual Machine]

A benefit of this arrangement in Java is that bytecodes compiled on one machine can be interpreted on another machine, perhaps across a network.

Program execution

Three phases of execution:

“Compile time”

1. Source program → object program (compiling)

2. Linking, loading → absolute program

Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine.

The linker resolves external memory addresses, where the code in one file may refer to a location in another file.

The loader then puts together all of the executable object files into memory for execution.

“Run-time”

3. Input → output

Loader and Linker



The machine code generated by the Assembler can be executed only if it is allocated in Main Memory starting from address “0”.

Since this is not possible, the Loader will alter the relocatable addresses of the code to place both instructions and data in the right place in Main Memory.

The starting free address, L, in Main Memory at which the program is allocated is called the Relocation Factor.

The Loader must:

1. Add to each relocatable address the relocation factor L;

2. Leave unaltered each absolute address, e.g., address of I/O devices.

The Linker links together the different files/modules of a single program and, possibly, adds library files.
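The two loader rules can be illustrated with a minimal Python sketch. The `Word` record, its `relocatable` flag, and the `load` function are hypothetical names chosen for this illustration; a real object-file format records relocation information differently.

```python
from dataclasses import dataclass


@dataclass
class Word:
    value: int         # an address stored in the assembled code
    relocatable: bool  # True if the address assumes the program starts at 0


def load(words, relocation_factor):
    """Apply the loader rules: shift relocatable addresses by L, keep absolute ones."""
    return [
        w.value + relocation_factor if w.relocatable else w.value
        for w in words
    ]


program = [
    Word(4, True),         # jump target inside the program
    Word(0xFF00, False),   # absolute address, e.g. an I/O device
    Word(12, True),        # address of a data item
]

print(load(program, relocation_factor=1000))  # [1004, 65280, 1012]
```

With L = 1000 the two relocatable addresses move to 1004 and 1012, while the absolute device address is left untouched.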

The phases of a compiler

[Figure: Lexical analyser → Syntax analyser → Semantic analyser → Intermediate code generator → Code optimizer → Code generator; the Symbol table manager and the Error handler interact with every phase]

Analysis-Synthesis Model of Compilation

There are two parts of compilation:

Part 1, Analysis: breaks up the source program into constituent pieces and creates an intermediate representation of the source program.

Part 2, Synthesis: constructs the desired target program from the intermediate representation. It requires the most specialized techniques.

Part 1, Analysis of the Source Program

Analysis consists of three phases:

Lexical Analysis (Linear or Scanning): the character stream is read from left to right and grouped into tokens, which are sequences of characters having a collective meaning.

Syntax Analysis (Hierarchical or Parsing): characters or tokens are grouped hierarchically into nested collections with collective meaning.

Semantic Analysis: certain checks are performed to ensure that the components of a program fit together meaningfully.

Lexical Analysis (Linear Analysis / Scanning)



Input: Sequence of characters


Output: Tokens (basic symbols, groups of successive characters which belong together
logically).


Translate the input program, entered as a sequence of characters, into a sequence of
words or symbols (tokens). For example, the keyword for should be treated as a single
entity, not as a 3 character string.



position := initial + rate * 60


The assignment statement would be grouped into the following tokens

1. The identifier position

2. The assignment symbol :=

3. The identifier initial

4. The plus sign +

5. The identifier rate

6. The multiplication sign *

7. The number 60


Note: the blanks separating the characters of these tokens would normally be eliminated during lexical analysis.
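The grouping of characters into tokens can be sketched with regular expressions. This is a minimal illustration, not a production scanner: the token names (`ID`, `ASSIGN`, and so on) and the `tokenize` function are assumptions made for this example, and only the operators of the sample statement are supported.

```python
import re

# One named pattern per token class; SKIP matches the blanks that are discarded.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token class, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token in tokenize("position := initial + rate * 60"):
    print(token)
```

Running it on the sample statement yields the seven tokens listed above, with the blanks removed.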

[Figure: Lexical Analysis. The statement final := initial + rate * 60 is converted to the token stream id1 := id2 + id3 * 60. By analogy, the character sequence S o m e o n e b r e a k s t h e i c e is grouped into the words Someone breaks the ice.]

Syntax Analysis (Hierarchical Analysis or Parsing)

Input: Sequence of tokens

Output: Parse tree, error messages

It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. Usually, the grammatical phrases of the source program are represented by a parse tree.

Determine the structure of the program, for example, identify the components of each statement and expression and check for syntax errors.
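As a hedged illustration of parsing, the following Python sketch builds a tree for an assignment statement by recursive descent. The grammar (an identifier, `:=`, then an expression over `+` and `*`), the token names, and the nested-tuple tree representation are all assumptions made for this example.

```python
def parse(tokens):
    """Parse `ID := expr`, where * binds tighter than +, into a nested tuple."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        text = tokens[pos][1]
        pos += 1
        return text

    def factor():
        # factor -> ID | NUMBER
        if peek() == "ID":
            return expect("ID")
        return int(expect("NUMBER"))

    def term():
        # term -> factor ('*' factor)*
        node = factor()
        while peek() == "TIMES":
            expect("TIMES")
            node = ("*", node, factor())
        return node

    def expr():
        # expr -> term ('+' term)*
        node = term()
        while peek() == "PLUS":
            expect("PLUS")
            node = ("+", node, term())
        return node

    target = expect("ID")
    expect("ASSIGN")
    tree = (":=", target, expr())
    if peek() != "EOF":
        raise SyntaxError(f"unexpected {peek()} after expression")
    return tree


tokens = [("ID", "position"), ("ASSIGN", ":="), ("ID", "initial"),
          ("PLUS", "+"), ("ID", "rate"), ("TIMES", "*"), ("NUMBER", "60")]
print(parse(tokens))
# (':=', 'position', ('+', 'initial', ('*', 'rate', 60)))
```

The multiplication ends up deeper in the tree than the addition, which is how the tree records operator precedence. A token sequence that does not fit the grammar raises a `SyntaxError`.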

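[Figure: Syntax Analysis. The token stream id1 := id2 + id3 * 60 is parsed into a tree with := at the root, id1 as its left child and + as its right child; + has children id2 and *, and * has children id3 and 60. By analogy, the sentence Someone breaks the ice is parsed into subject (Someone), verb (breaks), and object (the ice).]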

Semantic Analysis

Input: Parse tree + symbol table

Output: annotated tree (abstract tree with attributes) and a symbol table holding information on variables and their types

Checks the source program for semantic errors and gathers type information for the subsequent code generation phase.

It uses the hierarchical structure determined by the syntax-analysis phase.

Check that the program is reasonable, for example, that it does not include references to undefined variables.

An important component of semantic analysis is type checking.
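A minimal sketch of type checking, assuming a tree of nested tuples, a symbol table given as a plain dictionary from names to types, and only the two types "int" and "real". The function name `check` and the conversion node `("i2r", ...)` are illustrative choices; the sketch reports undefined variables and inserts an integer-to-real conversion where the operand types differ.

```python
def check(node, symbols):
    """Return (annotated node, type); raise on a semantic error."""
    if isinstance(node, int):
        return node, "int"
    if isinstance(node, str):
        if node not in symbols:
            raise NameError(f"undefined variable {node!r}")
        return node, symbols[node]

    op, left, right = node
    left, left_type = check(left, symbols)
    right, right_type = check(right, symbols)

    if op == ":=" and left_type == "int" and right_type == "real":
        raise TypeError("cannot assign a real value to an int variable")

    # Widen the int operand so that both sides have the same type.
    if left_type == "int" and right_type == "real":
        left = ("i2r", left)
    elif left_type == "real" and right_type == "int":
        right = ("i2r", right)

    result_type = "real" if "real" in (left_type, right_type) else "int"
    return (op, left, right), result_type


symbols = {"position": "real", "initial": "real", "rate": "real"}
tree = (":=", "position", ("+", "initial", ("*", "rate", 60)))
annotated, _ = check(tree, symbols)
print(annotated)
# (':=', 'position', ('+', 'initial', ('*', 'rate', ('i2r', 60))))
```

Because rate is real and 60 is an integer, the checker wraps the constant in an i2r node, which is the annotation the later phases work from.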

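[Figure: Semantic Analysis. The syntax tree for id1 := id2 + id3 * 60 is annotated by inserting an integer-to-real conversion, i2r, above the constant 60. By analogy, Someone plays the piano is meaningful, while The piano plays someone is syntactically valid but meaningless.]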

Part 2, Synthesis

Intermediate Code Generation (internal form): the source program is represented as a program for an abstract machine. It should be easy to produce and easy to translate into the target program.

Code Optimization (internal form, hopefully improved): attempts to improve the intermediate code; inefficiencies in the program can be removed during this phase.

Machine code / assembly code Generation: memory locations are selected for each of the variables used by the program. Intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers.

Intermediate Code Generation

The annotated tree for id1 := id2 + id3 * i2r(60) is translated into three-address code:

temp1 := i2r(60)

temp2 := id3 * temp1

temp3 := id2 + temp2

id1 := temp3
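The translation from an annotated tree to three-address code can be sketched as a post-order walk that creates a fresh temporary for every operator. The tuple representation of the tree and the function name `to_three_address` are assumptions made for this illustration.

```python
def to_three_address(tree):
    """Translate (':=', target, expr) into a list of three-address instructions."""
    code = []
    counter = 0

    def new_temp():
        nonlocal counter
        counter += 1
        return f"temp{counter}"

    def gen(node):
        # Leaves (identifiers and constants) need no instruction.
        if not isinstance(node, tuple):
            return str(node)
        if node[0] == "i2r":
            arg = gen(node[1])
            temp = new_temp()
            code.append(f"{temp} := i2r({arg})")
            return temp
        op, left, right = node
        left_name = gen(left)
        right_name = gen(right)
        temp = new_temp()
        code.append(f"{temp} := {left_name} {op} {right_name}")
        return temp

    _, target, expr = tree
    code.append(f"{target} := {gen(expr)}")
    return code


tree = (":=", "id1", ("+", "id2", ("*", "id3", ("i2r", 60))))
for line in to_three_address(tree):
    print(line)
```

For the sample tree this prints the four instructions temp1 := i2r(60), temp2 := id3 * temp1, temp3 := id2 + temp2, and id1 := temp3.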

Code Optimization

Before:

temp1 := i2r(60)

temp2 := id3 * temp1

temp3 := id2 + temp2

id1 := temp3

After:

temp1 := id3 * 60.0

id1 := id2 + temp1
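Two simple improvements explain this result: the conversion of the constant 60 is done once at compile time, and the final copy is merged into the instruction that computes its value. The sketch below illustrates both on instructions written as (destination, operator, argument 1, argument 2) tuples. The tuple format and the `optimize` function are assumptions for this example, and the copy elimination assumes that a temporary is never used again after being copied, which holds for the sample code.

```python
def optimize(code):
    """Fold i2r of integer constants and merge a trailing copy into its producer."""
    constants = {}  # temporaries known to hold a compile-time constant
    out = []
    for dest, op, a, b in code:
        # Replace temporaries that are known constants by their value.
        a = constants.get(a, a)
        b = constants.get(b, b)
        if op == "i2r" and isinstance(a, int):
            constants[dest] = float(a)  # conversion performed at compile time
            continue
        if op == "copy" and out and out[-1][0] == a:
            # The previous instruction computed the copied temporary:
            # let it write to the final destination directly.
            _, prev_op, prev_a, prev_b = out.pop()
            out.append((dest, prev_op, prev_a, prev_b))
            continue
        out.append((dest, op, a, b))
    return out


code = [
    ("temp1", "i2r", 60, None),
    ("temp2", "*", "id3", "temp1"),
    ("temp3", "+", "id2", "temp2"),
    ("id1", "copy", "temp3", None),
]
for instruction in optimize(code):
    print(instruction)
# ('temp2', '*', 'id3', 60.0)
# ('id1', '+', 'id2', 'temp2')
```

The sketch does not renumber temporaries, so it keeps the name temp2 where the slide shows temp1; the two instructions are otherwise the same.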

Code Generation

Input (optimized intermediate code, where id1, id2, id3 stand for position, initial, rate):

temp1 := id3 * 60.0

id1 := id2 + temp1

Output (assembly code):

MOVF rate, R2

MULF #60, R2

MOVF initial, R1

ADDF R2, R1

MOVF R1, position

Symbol Table



Supports the other phases during compilation


A symbol table is a data structure containing
a record for each identifier, with fields for the
attributes of the identifier. The data structure
allows us to find the record for each identifier
quickly and to store or retrieve data from that
record quickly.
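A hash table gives the fast store and retrieve operations described here. The following is a minimal sketch built on a Python dictionary; the class name `SymbolTable`, its method names, and the attribute names in the usage example are assumptions for illustration.

```python
class SymbolTable:
    """One record per identifier, each record holding the identifier's attributes."""

    def __init__(self):
        self._records = {}  # dictionary (hash table): name -> attribute record

    def declare(self, name, **attributes):
        if name in self._records:
            raise KeyError(f"double declaration of {name!r}")
        self._records[name] = dict(attributes)

    def lookup(self, name):
        if name not in self._records:
            raise KeyError(f"undeclared identifier {name!r}")
        return self._records[name]


table = SymbolTable()
table.declare("rate", type="real", offset=8)
table.lookup("rate")["offset"] = 16   # later phases can update a record
print(table.lookup("rate"))           # {'type': 'real', 'offset': 16}
```

Declaring the same name twice, or looking up a name that was never declared, raises an error, which corresponds to the double-declaration and undefined-variable checks mentioned elsewhere in these notes.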

Error Handler





Discover an error.



Write an error message.


Correct the error (or guess, very difficult!)


Restart from the error (try to continue)


Each phase can encounter errors. After detecting an error, a phase must somehow deal with that error so that compilation can proceed, allowing further errors in the source program to be detected.


Examples of error messages



Lexical analysis: faulty sequence of characters which does not result in a token, e.g. Ö, 5EL, %K, ’string

Syntax analysis: syntax error (e.g. missing semicolon, or unbalanced parentheses as in (4 * (y + 5) - 12)))

Semantic analysis: type conflict, e.g. ’HEJ’+5

Code optimization: uninitialized variables, anomaly detection.

Code generation: too large integers, run out of memory.

Table management: double declaration, table overflow.


A good compiler finds an error at the earliest occasion.


Usually, some errors are left to run time: array index out of bounds

Inside the Compiler

[Figure: sequence of characters → scanner (Lexical Analysis) → sequence of tokens → parser (Syntactic Analysis / Parsing) → Abstract Syntax Tree (AST) → checker (Contextual Analysis / checking Static Semantics) → verified/annotated AST → optimizer and code generator (Optimization and Code Generation)]

Language Processing System

skeletal source program → Preprocessor → source program → Compiler → target assembly program → Assembler → relocatable machine code → Loader/Linker

Preprocessor: performs macro-processing, file inclusion, “rational” preprocessing, language extension.

Compiler: split into 6 phases. Produces assembly code. Some compilers include the assembler too.

Assembler: converts mnemonics (assembly code) into object code. A two-pass assembler:

1. denotes storage locations for identifiers in the symbol table

2. translates code into machine code, translates locations into addresses

Loader: reads the file, placing relocatable addresses into proper locations in memory.

Linker: links other object and library files with the object code.

The Phases of a Compiler

Phase: Programmer (source code producer)
Output: Source string
Sample: A=B+C;

Phase: Scanner (performs lexical analysis)
Output: Token string, and symbol table with names
Sample: ‘A’, ‘=’, ‘B’, ‘+’, ‘C’, ‘;’

Phase: Parser (performs syntax analysis based on the grammar of the programming language)
Output: Parse tree or abstract syntax tree
Sample:
    ;
    |
    =
   / \
  A   +
     / \
    B   C

Phase: Semantic analyzer (type checking, etc.)
Output: Annotated parse tree or abstract syntax tree

Phase: Intermediate code generator
Output: Three-address code, quads, or RTL
Sample:
int2fp B t1
+ t1 C t2
:= t2 A

Phase: Optimizer
Output: Three-address code, quads, or RTL
Sample:
int2fp B t1
+ t1 #2.3 A

Phase: Code generator
Output: Assembly code
Sample:
MOVF #2.3,r1
ADDF2 r1,r2
MOVF r2,A

Phase: Peephole optimizer
Output: Assembly code
Sample:
ADDF2 #2.3,r2
MOVF r2,A

Compiler Pass

A compiler often finds it convenient to process the entire source program several times before generating code.

Each of these repetitions is called a pass.

A collection of phases is done only once (single pass) or multiple times (multi pass).

Single pass: usually requires everything to be defined before being used in the source program.

Multi pass: the compiler may have to keep the entire program representation in memory.

A multi pass compiler makes several passes over the program. The output of a preceding phase is stored in a data structure and used by subsequent phases.

Single Pass Compiler

A single pass compiler makes a single pass over the source text, parsing, analyzing and generating code all at once.

[Figure: Dependency diagram of a typical Single Pass Compiler. The Compiler Driver calls the Syntactic Analyzer, which calls the Contextual Analyzer and the Code Generator.]

Compiler passes

[Figure: Dependency diagram of a typical Multi Pass Compiler. The Compiler Driver calls the Syntactic Analyzer (input: Source Text, output: AST), the Contextual Analyzer (input: AST, output: Decorated AST), and the Code Generator (input: Decorated AST, output: Object Code).]

[Figure: The phases of a compiler, annotated]

Lexical analyser: decomposes a statement into tokens.

Syntax analyser: parsing, checks the order of tokens against the grammar, creates the Abstract Syntax Tree.

Semantic analyser: type checking, identifies operators and operands.

Intermediate code generator: first translation, creates temporary sub-result variables.

Code optimizer: improves speed and efficiency.

Code generator: generates the final assembly code.

Symbol table manager: stores a record for each identifier and its attributes.

Error handler: detects errors, reports errors.