1

Incorporating Domain-Specific Information into the Compilation Process

Samuel Z. Guyer



Supervisor: Calvin Lin

April 14, 2003

2

Motivation

Two different views of software:



Compiler’s view


Abstractions:

numbers, pointers, loops


Operators:

+, -, *, ->, []



Programmer’s view


Abstractions:

files, matrices, locks, graphics


Operators:

read, factor, lock, draw



This discrepancy is a problem...

3

Find the error: part 1

Example:

    switch (var_83) {
      case 0: func_24();
              break;
      case 1: func_29();
              break;
    }
    case 2: func_78();

Error: case outside of switch statement

Part of the language definition

Error reported at compile time

Compiler indicates the location and nature of error

4

Find the error: part 2

Example:

    struct __sue_23 * var_72;
    char var_81[100];
    var_72 = libfunc_84(__str_14, __str_65);
    libfunc_44(var_72);
    libfunc_38(var_81, 100, 1, var_72);

Improper call to libfunc_38

Syntax is correct: no compiler message

Fails at run-time

Problem: what does libfunc_38 do?

This is how compilers view reusables

5

Find the error: part 3

Example:

    FILE * my_file;
    char buffer[100];
    my_file = fopen("my_data", "r");
    fclose(my_file);
    fread(buffer, 100, 1, my_file);

Improper call to fread() after fclose()

The names reveal the mistake

No traditional compiler reports this error

Run-time system: how does the code fail?

Code review: rarely this easy to spot
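For contrast, the intended ordering can be sketched as follows. This is only an illustration, not code from the talk: `read_prefix` is a hypothetical helper, and the choice to return -1 on failure is an assumption. The point is that every access happens while the stream is open and the close comes last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the slides): reads up to `size` bytes from
 * the start of the file at `path` into `buffer`.
 * Returns the number of bytes read, or -1 if the file cannot be opened. */
long read_prefix(const char *path, char *buffer, size_t size)
{
    assert(path != NULL && buffer != NULL);

    FILE *my_file = fopen(path, "r");
    if (my_file == NULL)
        return -1;

    /* Access the stream only while it is open. */
    size_t n = fread(buffer, 1, size, my_file);

    /* Close after the last access, never before it. */
    fclose(my_file);
    return (long)n;
}
```

A caller would write something like `long n = read_prefix("my_data", buffer, sizeof buffer);` and check for -1 before using the buffer.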

6

Problem


Compilers are unaware of library semantics


Library calls have no special meaning


The compiler cannot provide any assistance



Burden is on the programmer:


Use library routines correctly

Use library routines efficiently and effectively

These are difficult manual tasks

Tedious and error-prone


Can require considerable expertise

7

Solution


A library-level compiler

Compiler support for software libraries

Treat library routines more like built-in operators

Compile at the library interface level

Check programs for library-level errors

Improve performance with library-level optimizations

Key: Libraries represent domains

Capture domain-specific semantics and expertise


Encode in a form that the compiler can use

8

The Broadway Compiler






Broadway: source-to-source C compiler

Domain-independent compiler mechanisms

Annotations: lightweight specification language

Domain-specific analyses and transformations

Many libraries, one compiler

[Diagram: the application (source code) and the library (annotations, header files, source code) feed into Broadway (analyzer, optimizer), which produces error reports (library-specific messages) and integrated application+library source code.]

9

Benefits


Improves capabilities of the compiler


Adds many new error checks and optimizations


Qualitatively different



Works with existing systems


Domain-specific compilation without recoding


For us: more thorough and convincing validation



Improve productivity


Less time spent on manual tasks


All users benefit from one set of annotations

10

Outline


Motivation


The Broadway Compiler



Recent work on scalable program analysis


Problem: Error checking demands powerful analysis

Solution: Client-driven analysis algorithm

Example: Detecting security vulnerabilities



Contributions


Related work


Conclusions and future work

11

Security vulnerabilities


How does remote hacking work?


Most are not direct attacks (e.g., cracking passwords)


Idea: trick a program into unintended behavior



Automated vulnerability detection:


How do we define “intended”?


Difficult to formalize and check application logic



Libraries control all critical system services


Communication, file access, process control


Analyze routines to approximate vulnerability

12

Remote access vulnerability


Example:

    int sock;
    char buffer[100];
    sock = socket(AF_INET, SOCK_STREAM, 0);
    read(sock, buffer, 100);
    execl(buffer);

Vulnerability: executes any remote command

What if this program runs as root?

Clearly domain-specific: sockets, processes, etc.

Requirement: Data from an Internet socket should not specify a program to execute

Why is detecting this vulnerability hard?

13

Challenge 1: Pointers


Example:

    int sock;
    char buffer[100];
    char * ref = buffer;
    sock = socket(AF_INET, SOCK_STREAM, 0);
    read(sock, buffer, 100);
    execl(ref);

Still contains a vulnerability

Only one buffer

Variables buffer and ref are aliases

We need an accurate model of memory
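The aliasing can be made concrete with a small, self-contained sketch. It is an illustration only: `alias_demo` is a hypothetical name, and `strcpy` stands in for the `read` on the socket so the example runs without a network.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if data written through `buffer` is visible through `ref`. */
int alias_demo(void)
{
    char buffer[100];
    char *ref = buffer;          /* ref and buffer name the same storage */

    /* Stand-in for read(sock, buffer, 100): fills the one buffer. */
    strcpy(buffer, "/bin/sh");

    assert(ref == &buffer[0]);   /* same address, so same contents */
    return strcmp(ref, "/bin/sh") == 0;
}
```

An analysis that tracks `buffer` and `ref` as unrelated variables would see nothing remote reaching `execl(ref)`, which is why the memory model has to know they are one object.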

14

Challenge 2: Scope


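Call graph:

[Figure: call graph rooted at main, with nodes socket, read, and execl; the calls sock = socket(AF_INET, SOCK_STREAM, 0), read(sock, buffer, 100), and execl(ref) are attached to the corresponding nodes.]

Objects flow throughout program

No scoping constraints

Objects referenced through pointers

We need whole-program analysis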

15

Challenge 3: Precision


Static analysis is always an approximation



Precision: level of detail or sensitivity

Multiple calls to a procedure

Context-sensitive: analyze each call separately

Context-insensitive: merge information from all calls

Multiple assignments to a variable

Flow-sensitive: record each value separately

Flow-insensitive: merge values from all assignments

Lower precision reduces the cost of analysis

Exponential -> polynomial -> ~linear
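The flow-sensitive versus flow-insensitive distinction can be sketched on a toy program with one variable. Everything here is an assumption made for illustration: the `Stmt` encoding, the two function names, and the use of bit sets so that merging is a bitwise OR. It is not the representation Broadway uses.

```c
#include <assert.h>
#include <stddef.h>

/* Abstract values as a bit set, so that "merge" is bitwise OR. */
enum { TRUSTED = 1, REMOTE = 2 };

/* A toy straight-line program over a single variable x. */
typedef enum { ASSIGN_TRUSTED, ASSIGN_REMOTE, USE_AS_COMMAND } Stmt;

/* Flow-sensitive: track the value of x at each program point. */
int reports_flow_sensitive(const Stmt *prog, size_t n)
{
    int x = 0, reports = 0;
    assert(prog != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (prog[i] == ASSIGN_TRUSTED)      x = TRUSTED;
        else if (prog[i] == ASSIGN_REMOTE)  x = REMOTE;
        else if (x & REMOTE)                reports++;
    }
    return reports;
}

/* Flow-insensitive: one merged value for x, applied at every use. */
int reports_flow_insensitive(const Stmt *prog, size_t n)
{
    int x = 0, reports = 0;
    assert(prog != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (prog[i] == ASSIGN_TRUSTED)      x |= TRUSTED;
        else if (prog[i] == ASSIGN_REMOTE)  x |= REMOTE;
    }
    for (size_t i = 0; i < n; i++)
        if (prog[i] == USE_AS_COMMAND && (x & REMOTE))
            reports++;
    return reports;
}
```

For the sequence trusted assignment, use, remote assignment, use, the flow-sensitive version flags only the second use, while the flow-insensitive version flags both, at the price of a single pass with no per-point state.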

16

Insufficient precision


Example: Context-insensitivity

[Figure: call graph in which main passes both a socket and stdin to read; each of the two execl calls is marked as a possible error (?).]

Information merged at call

Analyzer reports 2 possible errors

Only 1 real error

Imprecision leads to false positives
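The merge at the call can be sketched with a toy model of a pass-through procedure such as `read`. The names `count_reports`, `LOCAL`, and `REMOTE` and the bit-set encoding are assumptions for illustration, not Broadway's internals.

```c
#include <assert.h>
#include <stddef.h>

enum { LOCAL = 1, REMOTE = 2 };   /* bit set: merging is bitwise OR */

/* args[i] is the abstract Kind of the handle passed at call site i of a
 * pass-through procedure; the result of each call then flows to an execl.
 * Returns the number of call sites whose result could be REMOTE. */
int count_reports(const int *args, size_t sites, int context_sensitive)
{
    int merged = 0, reports = 0;
    assert(args != NULL);

    /* Context-insensitive summary: one merged input for all callers. */
    for (size_t i = 0; i < sites; i++)
        merged |= args[i];

    for (size_t i = 0; i < sites; i++) {
        int result = context_sensitive ? args[i] : merged;
        if (result & REMOTE)
            reports++;            /* execl on possibly remote data */
    }
    return reports;
}
```

With one call on a remote socket and one on local stdin, the context-insensitive setting yields two reports and the context-sensitive setting yields one, matching the 2 possible errors versus 1 real error on the slide.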

17

Cost versus precision


Problem: A tradeoff

Precise analysis: prohibitively expensive

Cheap analysis: too many false positives

Idea: Mixed precision analysis

Focus effort on the parts of the program that matter

Don’t waste time over-analyzing the rest

Key: Let error detection problem drive precision

Client-driven program analysis

18

Client-Driven Algorithm

Client: Error detection analysis problem

Algorithm:

Start with fast cheap analysis, monitor imprecision

Determine extra precision, reanalyze

[Diagram: components labeled Pointer Analyzer, Client Analysis, Memory Model, Error Reports, Dependence Graph, Monitor, Information Loss, Adaptor, and Precision Policy.]

19

Algorithm components


Monitor


Runs alongside main analysis


Records imprecision




Adaptor


Start at the locations of reported errors


Trace back to the cause and diagnose
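The interplay of monitor and adaptor can be sketched as a loop over a toy model. This is a heavily simplified illustration under stated assumptions: the `Proc` structure, the bit-set values, and the rule "make a procedure context-sensitive when its merge changed a reported value" are all hypothetical stand-ins, and the tracing back through a dependence graph is collapsed into a per-procedure flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { LOCAL = 1, REMOTE = 2 };   /* bit set: merging is bitwise OR */
#define MAX_SITES 4
#define MAX_PROCS 16

typedef struct {
    int    args[MAX_SITES];       /* abstract value passed at each call site */
    size_t sites;
    bool   context_sensitive;     /* precision policy for this procedure */
} Proc;

/* One analysis pass. Counts possibly remote results and, acting as the
 * monitor, flags procedures whose merge polluted a reported value. */
static int analyze(const Proc *procs, size_t n, bool *polluted)
{
    int reports = 0;
    for (size_t p = 0; p < n; p++) {
        int merged = 0;
        polluted[p] = false;
        for (size_t i = 0; i < procs[p].sites; i++)
            merged |= procs[p].args[i];
        for (size_t i = 0; i < procs[p].sites; i++) {
            int result = procs[p].context_sensitive ? procs[p].args[i] : merged;
            if (result & REMOTE) {
                reports++;
                if (result != procs[p].args[i])
                    polluted[p] = true;   /* information was lost here */
            }
        }
    }
    return reports;
}

/* Start cheap, then add precision only where a report depends on a merge. */
int client_driven(Proc *procs, size_t n)
{
    bool polluted[MAX_PROCS];
    assert(n <= MAX_PROCS);
    for (;;) {
        int reports = analyze(procs, n, polluted);
        bool changed = false;
        for (size_t p = 0; p < n; p++) {
            if (polluted[p] && !procs[p].context_sensitive) {
                procs[p].context_sensitive = true;   /* adaptor's "fix" */
                changed = true;
            }
        }
        if (!changed)
            return reports;       /* no further precision would help */
    }
}
```

The loop terminates because each extra pass makes at least one more procedure context-sensitive. Procedures that never contribute to a report, or whose call sites all agree, stay at the cheap setting.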


20

Sources of imprecision

Polluting assignments

Multiple assignments: several assignments x = ... reach one use of x

Multiple procedure calls: several calls foo( ) share one analysis of foo( )

Conditions: if(cond) selects between assignments x = ...

Pointer dereference: (*ptr), with a polluted pointer ptr or a polluted target

21

In action...


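Monitor analysis

Polluting assignments

Diagnose and apply “fix”

In this case: one procedure context-sensitive

Reanalyze

[Figure: the same call graph (main, socket, stdin, read, execl, execl), first with both execl calls marked as possible errors (?), then with read analyzed separately for each call and a single error (!) remaining.]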

22

Methodology


Compare with commonly-used fixed precision:

    CS-FS   Full:    context-sensitive, flow-sensitive
    CS-FI   Slow:    context-sensitive, flow-insensitive
    CI-FS   Medium:  context-insensitive, flow-sensitive
    CI-FI   Fast:    context-insensitive, flow-insensitive

Metrics

Accuracy: number of errors reported

Includes false positives, so fewer is better

Performance: only when accuracy is the same

23

Programs


18 real C programs


Unmodified source: all the issues of production code

Many are system tools: run in privileged mode

Representative examples (? marks a character lost in extraction):

    Name       Description    Priv  Lines of code  Procedures  CFG nodes
    muh        IRC proxy            5K (25K)       84          5,?91
    blackhole  E-mail filter        12K (2?4K)     71          21,3?0
    wu-ftpd    FTP daemon           22K (6?K)      205         23,1?7
    named      DNS server           26K (8?K)      210         25,4?2
    nn         News reader          36K (116K)     494         46,3?6

24

Error detection problems

1. File access: Files must be open when accessed

2. Remote access vulnerability: Data from an Internet socket should not specify a program to execute

3. Format string vulnerability (FSV): Format string may not contain untrusted data

4. Remote FSV: Check if FSV is remotely exploitable

5. FTP behavior: Can this program be tricked into reading and transmitting arbitrary files?

25

Results: Remote access vulnerability

[Chart: normalized performance (axis ticks 1, 10X, 100X, 1000X) for each program, ordered by increasing number of CFG nodes, comparing Slow (CS-FI), Medium (CI-FS), Fast (CI-FI), Full (CS-FS), and Client-Driven; bars are labeled with the number of errors reported.]

26

Overall results


90 test cases: 18 programs, 5 problems


Test case: 1 program and 1 error detection problem


Compare algorithms: client-driven vs. fixed precision

As accurate as any other algorithm: 87 out of 90

Runs faster than best fixed algorithm: 64 out of 87

Performance not an issue: 19 of 23

Both most accurate and fastest: 29 out of 64

27

Why does it work?










Validates our hypothesis


Different errors have different precision requirements


Amount of extra precision is small


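Number of procedures made context-sensitive, by error detection problem. Columns are RA, File, FSV, RFSV, FTP; blank cells were not preserved, so only the nonblank entries are listed, in column order.

    Name       Total procedures   # procedures context-sensitive
    muh        84                 6
    apache     313                8, 2, 2, 10
    blackhole  71                 2, 5
    wu-ftpd    205                4, 4, 17
    named      210                1, 2, 1, 4
    cfengine   421                4, 1, 3, 31
    nn         494                2, 1, 1, 30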

28

Outline


Motivation


The Broadway Compiler



Recent work on scalable program analysis


Problem: Error checking demands powerful analysis

Solution: Client-driven analysis algorithm

Example: Detecting security vulnerabilities



Contributions


Related work


Conclusions and future work

29

Central contribution


Library-level compilation

Opportunity: library interfaces make domains explicit in existing programming practice

Key: a separate language for codifying domain-specific knowledge

Result: our compiler can automate previously manual error checks and performance improvements

                               Old way                    Broadway
    Knowledge representation   Informal                   Codified
    Applying knowledge         Manual                     Compiler
    Results                    Difficult, unpredictable   Easy, automatic, reliable

30

Specific contributions


Broadway compiler implementation

Working system (43K C-Breeze, 23K pointers, 30K Broadway)

Client-driven pointer analysis algorithm [SAS’03]

Precise and scalable whole-program analysis

Library-level error checking experiments [CSTR’01]

No false positives for format string vulnerability

Library-level optimization experiments [LCPC’00]

Solid improvements for PLAPACK programs

Annotation language [DSL’99]

Balance expressive power and ease of use

31

Related work


Configurable compilers

Power versus usability: who is the user?

Active libraries

Previous work focusing on specific domains

Few complete, working systems

Error detection

Partial program verification: paucity of results

Scalable pointer analysis

Many studies of cost/precision tradeoff

Few mixed-precision approaches

32

Future work


Language


More analysis capabilities



Optimization


We have only scratched the surface



Error checking


Resource leaks


Path sensitivity: conditional transfer functions

Scalable analysis

Start with cheaper analysis: unification-based

Refine to more expensive analysis: shape analysis

33

Thank You

34

Annotations (I)


Dependence and pointer information


Describe pointer structures


Indicate which objects are accessed and modified

    procedure fopen(pathname, mode)
    {
      on_entry { pathname --> path_string
                 mode --> mode_string }

      access { path_string, mode_string }

      on_exit { return --> new file_stream }
    }

35

Annotations (II)


Library-specific properties

Dataflow lattices

    property State : { Open, Closed } initially Open

    property Kind : { File,
                      Socket { Local, Remote } }

[Figure: the two lattices. State has elements Open and Closed; Kind has elements File and Socket, with Local and Remote below Socket.]
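To show how a property such as State supports the file access check, here is a minimal sketch over a single stream. It is an assumption-laden toy: the `Op` encoding and the function `first_bad_access` are hypothetical, the state is a bit set so that merged values could be represented, and the starting value simply follows the `initially Open` declaration above.

```c
#include <assert.h>
#include <stddef.h>

enum { OPEN = 1, CLOSED = 2 };    /* State as a bit set */

/* Library calls on one stream, in program order. */
typedef enum { OP_FOPEN, OP_FCLOSE, OP_FREAD } Op;

/* Returns the index of the first access made while the stream could be
 * Closed, or -1 if every access happens in state Open. */
int first_bad_access(const Op *ops, size_t n)
{
    int state = OPEN;             /* mirrors "initially Open" */
    assert(ops != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        switch (ops[i]) {
        case OP_FOPEN:  state = OPEN;   break;   /* transfer function */
        case OP_FCLOSE: state = CLOSED; break;   /* transfer function */
        case OP_FREAD:
            if (state & CLOSED)
                return (int)i;    /* "files must be open when accessed" */
            break;
        }
    }
    return -1;
}
```

Run on the sequence open, close, read, the check reports index 2, the `fread` after `fclose`; with the close moved to the end it reports nothing.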

36

Annotations (III)


Library routine effects


Dataflow transfer functions

    procedure socket(domain, type, protocol)
    {
      analyze Kind {
        if (domain == AF_UNIX) IOHandle <- Local
        if (domain == AF_INET) IOHandle <- Remote
      }

      analyze State { IOHandle <- Open }

      on_exit { return --> new IOHandle }
    }

37

Annotations (IV)


Reports and transformations

    procedure execl(path, args)
    {
      on_entry { path --> path_string }

      report if (Kind : path_string could-be Remote)
        "Error at " ++ $callsite ++ ": remote access";
    }

    procedure slow_routine(first, second)
    {
      when (condition)
        replace-with %{
          quick_check($first);
          fast_routine($first, $second);
        }%
    }

38

Why does it work?










Validates our hypothesis


Different clients have different precision requirements


Amount of extra precision is small


Columns in each group are RA, File, FSV, RFSV, FTP; blank cells were not preserved, so only the nonblank entries are listed, in column order.

    Name       # procedures context-sensitive   % variables flow-sensitive
    muh        6                                0.1, 0.07, 0.31
    apache     8, 2, 2, 10                      0.89, 0.18, 0.91, 1.07, 0.83
    blackhole  2, 5                             0.24, 0.04, 0.32
    wu-ftpd    4, 4, 17                         0.63, 0.09, 0.51, 0.53, 0.23
    named      1, 2, 1, 4                       0.14, 0.01, 0.23, 0.20, 0.42
    cfengine   4, 1, 3, 31                      0.43, 0.04, 0.46, 0.48, 0.03
    nn         2, 1, 1, 30                      1.82, 0.17, 1.99, 2.03, 0.97

39

Time

40

Validation


Optimization experiments


Cons: One library, three applications

Pros: Complex library, consistent results



Error checking experiments


Cons: Quibble about different errors


Pros: We set the standard for experiments



Overall


Same system designed for optimizations is among the
best for detecting errors and security vulnerabilities

41

Type Theory


Equivalent to dataflow analysis (heresy?)

    Flow values          <->  Types
    Transfer functions   <->  Inference rules

Remember Phil Wadler’s talk?

Different in practice

Dataflow: flow-sensitive problems, iterative analysis

Types: flow-insensitive problems, constraint solver

Commonality

No magic bullet: same cost for the same precision

Extracting the store model is a primary concern

42

Generators


Direct support for domain-specific programming


Language extensions or new language


Generate implementation from specification



Our ideas are complementary


Provides a way to analyze component compositions


Unifies common algorithms:


Redundancy elimination


Dependence-based optimizations


43

Is it correct?


Three separate questions:



Are Sam Guyer’s experiments correct?


Yes, to the best of our knowledge


Checked PLAPACK results


Checked detected errors against known errors



Is our compiler implemented correctly?


Flip answer: whose is?


Better answer: testing suites



How do we validate a set of annotations?

44

Annotation correctness

Not addressed in my dissertation, but...


Theoretical approach


Does the library implement the domain?


Formally verify annotations against implementation



Practical approach


Annotation debugger: interactive


Automated assistance in early stages of development



Middle approach


Basic consistency checks

45

Error Checking vs Optimization


Error checking

Optimistic

False positives allowed

It can even be unsound

Tend to be “may” analyses

Correctness is absolute

“Black and white”

Certify programs bug-free

Cost tolerant

Explore costly analysis

Optimization

Pessimistic

Must preserve semantics

Soundness mandatory

Tend to be “must” analyses

Performance is relative

Spectrum of results

No guarantees

Cost sensitive

Compile-time is a factor


46

Complexity


Pointer analysis


Address taken: linear


Steensgaard: almost linear (log log n factor)


Andersen: polynomial (cubic)


Shape analysis: double exponential



Dataflow analysis


Intraprocedural: polynomial (height of lattice)


Context-sensitivity: exponential (call graph)

Rarely see worst-case

47

Optimization


Overall strategy


Exploit layers and modularity


Customize lower-level layers in context

Compiler strategy: Top-down layer processing

Preserve high-level semantics as long as possible

Systematically dissolve layer boundaries

Annotation strategy

General-purpose specialization


Idiomatic code substitutions

48

PLAPACK Optimizations


PLAPACK matrices are distributed





Optimizations exploit special cases


Example: Matrix multiply

[Figure: matrices distributed over a processor grid; depending on the distribution of its operands, a PLA_Gemm( , , ) call is replaced by PLA_Local_gemm or by PLA_Rankk.]

49

Results

50

Find the error: part 3

State-of-the-art compiler

    struct __sue_23 * var_72;
    struct __sue_25 * new_f = (struct __sue_25 *) malloc(sizeof (struct __sue_25));
    _IO_no_init(& new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0));
    (& new_f->fp)->vtable = & _IO_file_jumps;
    _IO_file_init(& new_f->fp);
    if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32) != ((void *) 0)) {
      var_72 = & new_f->fp.file;
      if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) {
        if (var_72->_mode <= 0) ((struct __sue_23 *) var_72)->vtable = & _IO_file_jumps_maybe_mmap;
        else ((struct __sue_23 *) var_72)->vtable = & _IO_wfile_jumps_maybe_mmap;
        var_72->_wide_data->_wide_vtable = & _IO_wfile_jumps_maybe_mmap;
      }
    }
    if (var_72->_flags & 8192U) _IO_un_link((struct __sue_23 *) var_72);
    if (var_72->_flags & 8192U) status = _IO_file_close_it(var_72);
    else status = var_72->_flags & 32U ? -1 : 0;
    ((* (struct _IO_jump_t * *) ((void *) (& ((struct __sue_23 *) (var_72))->vtable) +
      (var_72)->_vtable_offset))->__finish)(var_72, 0);
    if (var_72->_mode <= 0)
      if (((var_72)->_IO_save_base != ((void *) 0))) _IO_free_backup_area(var_72);
    if (var_72 != ((struct __sue_23 *) (& _IO_2_1_stdin_)) &&
        var_72 != ((struct __sue_23 *) (& _IO_2_1_stdout_)) &&
        var_72 != ((struct __sue_23 *) (& _IO_2_1_stderr_))) { var_72->_flags = 0;
      free(var_72); }
    bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);

51

End backup slides