
Memory Management for High-Performance Applications

Emery Berger
University of Massachusetts, Amherst


High-Performance Applications

- Web servers, search engines, scientific codes
- C or C++ (still…)
- Run on one or a cluster of server boxes
- Need support at every level of the stack: software, compiler, runtime system, operating system, hardware



New Applications, Old Memory Managers

- Applications and hardware have changed
  - Multiprocessors now commonplace
  - Object-oriented, multithreaded
  - Increased pressure on the memory manager (malloc, free)
- But memory managers have not kept up
  - Inadequate support for modern applications





Current Memory Managers Limit Scalability

- As we add processors, the program slows down
- Caused by heap contention

[Figure: Larson server benchmark on a 14-processor Sun]


The Problem

- Current memory managers are inadequate for high-performance applications on modern architectures
- They limit scalability, application performance, and robustness


This Talk

- Building memory managers
  - Heap Layers framework [PLDI 2001]
- Problems with current memory managers
  - Contention, false sharing, space
- Solution: provably scalable memory manager
  - Hoard [ASPLOS-IX]
- Extended memory manager for servers
  - Reap [OOPSLA 2002]


Implementing Memory Managers

- Memory managers must be
  - Space efficient
  - Very fast
- Result: heavily optimized code
  - Hand-unrolled loops
  - Macros
  - Monolithic functions
- Hard to write, reuse, or extend


Building Modular Memory Managers

- Classes
  - Overhead
  - Rigid hierarchy
- Mixins (classes whose superclass is a template parameter)
  - No overhead
  - Flexible hierarchy


A Heap Layer

    template <class SuperHeap>
    class GreenHeapLayer :
      public SuperHeap {…};

- A mixin with malloc & free methods
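
For concreteness, here is a minimal compilable sketch in this style. The CountingHeap layer and the MallocHeap base below are illustrative inventions, not classes from the published Heap Layers framework:

    #include <cstddef>
    #include <cstdlib>

    // Base layer: obtains memory from the system allocator.
    class MallocHeap {
    public:
      void * malloc (size_t sz) { return std::malloc (sz); }
      void free (void * p)      { std::free (p); }
    };

    // A heap layer (mixin): adds behavior, then delegates to its superheap.
    template <class SuperHeap>
    class CountingHeap : public SuperHeap {
    public:
      void * malloc (size_t sz) {
        void * p = SuperHeap::malloc (sz);
        if (p) ++_live;                  // count successful allocations
        return p;
      }
      void free (void * p) {
        if (p) --_live;
        SuperHeap::free (p);
      }
      long live () const { return _live; }
    private:
      long _live = 0;
    };

    // Compose by instantiation; calls bind statically, so layering
    // adds no virtual-dispatch overhead.
    typedef CountingHeap<MallocHeap> MyHeap;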


Example: Thread-Safe Heap Layer

- LockedHeap: protects its superheap with a lock
- Composing LockedHeap over a malloc heap yields LockedMallocHeap (see the sketch below)
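
A minimal sketch of such a layer, assuming C++11 and std::mutex (the lock class in the actual framework may differ), reusing the MallocHeap base from the previous sketch:

    #include <mutex>

    // Serialize every malloc and free with one lock.
    template <class SuperHeap>
    class LockedHeap : public SuperHeap {
    public:
      void * malloc (size_t sz) {
        std::lock_guard<std::mutex> guard (_lock);
        return SuperHeap::malloc (sz);
      }
      void free (void * p) {
        std::lock_guard<std::mutex> guard (_lock);
        SuperHeap::free (p);
      }
    private:
      std::mutex _lock;
    };

    // A thread-safe system allocator, by composition:
    typedef LockedHeap<MallocHeap> LockedMallocHeap;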


Empirical Results

- Heap Layers vs. originals:
  - KingsleyHeap vs. BSD allocator
  - LeaHeap vs. DLmalloc 2.7
- Competitive runtime and memory efficiency


Overview

- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap


Problems with General-Purpose Memory Managers

- Previous work for multiprocessors:
  - Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]
    - Impractical
  - Multiple heaps [Larson 98, Gloger 99]
    - Reduce contention but, as we show, cause other problems:
      - P-fold or even unbounded increase in space
      - Allocator-induced false sharing


Multiple Heap Allocator: Pure Private Heaps

- One heap per processor:
  - malloc gets memory from its local heap
  - free puts memory on its local heap
- Used by STL, Cilk, and ad hoc allocators

[Figure: interleaved malloc/free trace (x1 … x4) on processors 0 and 1. Key: in use by processor 0; free, on heap 1.]


Problem: Unbounded Memory Consumption

- Producer-consumer:
  - Processor 0 allocates
  - Processor 1 frees
- Freed memory piles up on processor 1's heap, where processor 0 never looks: unbounded memory consumption
- Crash!

[Figure: processor 0 repeatedly allocates (x1 = malloc(1), x2 = malloc(1), x3 = malloc(1), …) while processor 1 frees (free(x1), free(x2), free(x3), …)]
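
The pattern is easy to write down. The sketch below (mine, not the talk's) runs it with the system allocator, which handles it fine; under a pure-private-heaps allocator, every freed block would land on the consumer's heap while the producer keeps drawing fresh memory:

    #include <cstdlib>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<void *> q;   // hands objects from producer to consumer
    std::mutex m;

    int main () {
      const int N = 1000000;
      std::thread producer ([&] {
        for (int i = 0; i < N; ++i) {
          void * p = std::malloc (64);      // allocated on producer's heap
          std::lock_guard<std::mutex> g (m);
          q.push (p);
        }
      });
      std::thread consumer ([&] {
        for (int done = 0; done < N; ) {
          std::lock_guard<std::mutex> g (m);
          if (!q.empty ()) {
            std::free (q.front ());         // freed on consumer's heap
            q.pop ();
            ++done;
          }
        }
      });
      producer.join ();
      consumer.join ();
      return 0;
    }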


Multiple Heap Allocator: Private Heaps with Ownership

- free returns memory to its original heap
- Bounded memory consumption
- No crash!
- Used by “Ptmalloc” (Linux) and LKmalloc

[Figure: processor 0 runs x1 = malloc(1) then free(x1); processor 1 runs x2 = malloc(1) then free(x2); each free returns memory to the allocating heap]


Problem: P-fold Memory Blowup

- Occurs in practice
- Round-robin producer-consumer:
  - Processor i mod P allocates
  - Processor (i+1) mod P frees
- Footprint = 1 (2GB), but space = 3 (6GB): each of the three heaps ends up holding a full footprint's worth of memory
- Exceeds 32-bit address space: Crash!

[Figure: processors 0, 1, and 2 each allocate one object (x1, x2, x3) and free an object allocated by a neighboring processor]


Problem: Allocator-Induced False Sharing

- False sharing: non-shared objects on the same cache line
  - Bane of parallel applications
  - Extensively studied
- All these allocators cause false sharing!

[Figure: x1 = malloc(1) on processor 0 and x2 = malloc(1) on processor 1 land on the same cache line, which then thrashes between the two caches]
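
A self-contained sketch (mine, not from the talk) of the access pattern in the figure: two threads write distinct bytes that share a cache line, and each write invalidates the line in the other processor's cache:

    #include <thread>

    int main () {
      // Two one-byte objects that an allocator with no cache-line
      // awareness could easily place side by side on one line.
      static char buf[2];
      auto writer = [] (volatile char * p) {
        for (long i = 0; i < 100000000L; ++i)
          *p = (char) i;        // ping-pongs the cache line
      };
      std::thread t0 (writer, (volatile char *) &buf[0]);
      std::thread t1 (writer, (volatile char *) &buf[1]);
      t0.join ();
      t1.join ();
      return 0;
    }

An allocator avoids inducing this by never handing out pieces of the same cache line to different processors, which is exactly what Hoard's per-processor heap blocks (later slides) arrange.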


So What Do We Do Now?

- Where do we put free memory?
  - On a central heap: heap contention
  - On our own heap (pure private heaps): unbounded memory consumption
  - On the original heap (private heaps with ownership): P-fold blowup
- And how do we avoid false sharing?


Overview

- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap


Hoard: Key Insights

- Bound local memory consumption
  - Explicitly track utilization
  - Move free memory to a global heap
  - Provably bounds memory consumption
- Manage memory in large chunks
  - Avoids false sharing
  - Reduces heap contention



Overview of Hoard

- Manage memory in page-sized heap blocks
  - Avoids false sharing
- Allocate from the local heap block
  - Avoids heap contention
- On low utilization, move a heap block to the global heap
  - Avoids space blowup

[Figure: per-processor heaps (processor 0 … processor P-1) exchanging heap blocks with a global heap]
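
A toy model of this policy; the types, names, and the 1/4 threshold below are illustrative simplifications, not Hoard's actual code (real Hoard also manages size classes and per-block free lists):

    #include <cstddef>
    #include <mutex>
    #include <vector>

    // A page-sized chunk from which one processor allocates.
    struct HeapBlock {
      static const size_t kObjects = 64;  // capacity, in objects
      size_t inUse = 0;                   // objects currently allocated
    };

    // Holds heap blocks that no processor currently owns.
    struct GlobalHeap {
      std::mutex lock;
      std::vector<HeapBlock *> blocks;
      void put (HeapBlock * b) {
        std::lock_guard<std::mutex> g (lock);
        blocks.push_back (b);
      }
    };

    // One per processor; tracks its own utilization.
    struct LocalHeap {
      GlobalHeap * global = nullptr;
      std::vector<HeapBlock *> blocks;
      size_t inUse = 0;

      void noteFree (HeapBlock * b) {
        --b->inUse;
        --inUse;
        // The key invariant: when utilization drops below a fixed
        // fraction, release an empty block to the global heap so this
        // processor cannot hoard memory it is not using.
        size_t capacity = blocks.size () * HeapBlock::kObjects;
        if (capacity > 0 && inUse * 4 < capacity) {   // utilization < 1/4
          for (size_t i = 0; i < blocks.size (); ++i) {
            if (blocks[i]->inUse == 0) {
              global->put (blocks[i]);
              blocks.erase (blocks.begin () + i);
              break;
            }
          }
        }
      }
    };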


Summary of Analytical Results

- Space consumption: near-optimal worst case
  - Hoard: O(n log(M/m) + P), where P « n
  - Optimal: O(n log(M/m)) [Robson 70] (≈ bin-packing)
  - Private heaps with ownership: O(P n log(M/m))
  - Hoard pays only an additive O(P) term over optimal; ownership-based private heaps multiply the entire bound by P
- Provably low synchronization

(n = memory required, M = biggest object size, m = smallest object size, P = processors)


Empirical Results

- Measure runtime on a 14-processor Sun
- Allocators:
  - Solaris (system allocator)
  - Ptmalloc (GNU libc)
  - mtmalloc (Sun’s “MT-hot” allocator)
- Micro-benchmarks:
  - Threadtest: no sharing
  - Larson: sharing (server-style)
  - Cache-scratch: mostly reads & writes (tests for false sharing)
- Real application experience is similar


Runtime Performance: threadtest

speedup(x, P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)

- Many threads, no sharing
- Hoard achieves linear speedup


Runtime Performance: Larson

- Many threads, sharing (server-style)
- Hoard achieves linear speedup


Runtime Performance: false sharing

- Many threads, mostly reads & writes of heap data
- Hoard achieves linear speedup


Hoard in the “Real World”

- Open source code
  - www.hoard.org
  - 13,000 downloads
  - Solaris, Linux, Windows, IRIX, …
- Widely used in industry
  - AOL, British Telecom, Novell, Philips
  - Reports: 2x-10x, “impressive” improvements in performance
  - Search server, telecom billing systems, scene rendering, real-time messaging middleware, text-to-speech engine, telephony, JVM
- A scalable general-purpose memory manager




Overview

- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap


Custom Memory Allocation

- Programmers often replace malloc/free
  - Attempt to increase performance
  - Provide extra functionality (e.g., for servers)
  - Reduce space (rarely)
- Empirical study of custom allocators [OOPSLA 2002]
  - Lea allocator often as fast or faster
  - Custom allocation ineffective, except for regions


Overview of Regions

Separate areas, deletion only en masse:

    regioncreate(r)
    regionmalloc(r, sz)
    regiondelete(r)

+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion, too much space
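
A minimal sketch of a region allocator behind this API (illustrative only: no alignment handling, and regioncreate here returns the region rather than taking it as an argument as the slide's notation suggests):

    #include <cstddef>
    #include <cstdlib>

    struct Chunk { Chunk * next; };       // payload follows the header

    struct Region {
      Chunk * chunks = nullptr;           // every chunk ever allocated
      char * bump = nullptr;              // next free byte
      char * end = nullptr;               // end of the current chunk
    };

    static const size_t kChunkSize = 64 * 1024;

    Region * regioncreate () { return new Region; }

    void * regionmalloc (Region * r, size_t sz) {
      if ((size_t) (r->end - r->bump) < sz) {   // current chunk full
        size_t payload = sz > kChunkSize ? sz : kChunkSize;
        Chunk * c = (Chunk *) std::malloc (sizeof (Chunk) + payload);
        c->next = r->chunks;
        r->chunks = c;
        r->bump = (char *) (c + 1);
        r->end = r->bump + payload;
      }
      void * p = r->bump;                 // pointer-bumping allocation
      r->bump += sz;
      return p;
    }

    void regiondelete (Region * r) {      // one call frees all memory
      for (Chunk * c = r->chunks; c != nullptr; ) {
        Chunk * next = c->next;
        std::free (c);
        c = next;
      }
      delete r;
    }

Note what is missing: there is no regionfree, so nothing inside a region can be reclaimed before regiondelete. That is the drawback the later slides dwell on.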


Why Regions?

- Apparently faster, more space-efficient
- Servers need memory management support:
  - Avoid resource leaks
  - Tear down memory associated with terminated connections or transactions
- Current approach (e.g., Apache): regions



Drawbacks of Regions

- Can’t reclaim memory within regions
  - A problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs
  - Unbounded memory consumption
- Current situation for Apache:
  - Vulnerable to denial-of-service
  - Limits runtime of connections
  - Limits module programming



Reap Hybrid Allocator

- Reap = region + heap
  - Adds individual object deletion & a heap
- API: reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)
- Can reduce memory consumption
- Fast
- Adapts to use (region or heap style)
- Cheap deletion
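
A usage sketch against this API (the prototypes below are modeled on the slide's notation and are assumptions; the actual Reap headers may differ). Unlike a region, individual frees make memory reusable immediately, while reapdelete still tears everything down at once:

    #include <cstddef>

    struct Reap;                              // opaque handle (assumed)
    Reap * reapcreate ();
    void * reapmalloc (Reap * r, size_t sz);
    void   reapfree   (Reap * r, void * p);
    void   reapdelete (Reap * r);

    void handleTransaction () {
      Reap * r = reapcreate ();
      char * header = (char *) reapmalloc (r, 128);
      char * body   = (char *) reapmalloc (r, 4096);
      // ... parse, discover the body buffer is too small ...
      reapfree (r, body);                     // reusable within the reap
      body = (char *) reapmalloc (r, 8192);
      // ... finish the transaction using header and body ...
      reapdelete (r);                         // frees everything, as a region does
    }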






Using Reap as Regions

- Reap performance nearly matches regions


Reap: Best of Both Worlds

- Combining new/delete with regions is usually impossible:
  - Incompatible APIs
  - Hard to rewrite code
- Using Reap: incorporated new/delete code into Apache
  - “mod_bc” (arbitrary-precision calculator)
  - Changed 20 lines (out of 8000)
  - Benchmark: compute the 1000th prime
    - With Reap: 240K
    - Without Reap: 7.4MB


Open Questions

- Grand unified memory manager?
  - Hoard + Reap
  - Integration with garbage collection
- Effective custom allocators?
  - Exploit sizes, lifetimes, locality, and sharing
- Challenges of newer architectures
  - NUMA, SMT/CMP, 64-bit, predication


Current Work: Robust Performance

- Currently: no VM-GC communication
  - Bad interactions under memory pressure
- Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management

[Figure: the garbage collector/allocator and the virtual memory manager cooperate: the VMM supplies memory pressure and LRU-queue information, the collector returns empty pages, for reduced paging impact]


Current Work: Predictable VMM

- Recent work on scheduling for QoS
  - E.g., proportional-share
- Under memory pressure, the VMM is the scheduler
  - Paged-out processes may never recover
  - Intermittent processes may wait a long time
- Scheduler-faithful virtual memory (with Scott Kaplan, Prashant Shenoy)
  - Based on page value rather than order


Conclusion

Memory management for high-performance applications:

- Heap Layers framework [PLDI 2001]
  - Reusable components, no runtime cost
- Hoard scalable memory manager [ASPLOS-IX]
  - High-performance, provably scalable & space-efficient
- Reap hybrid memory manager [OOPSLA 2002]
  - Provides speed & robustness for server applications
- Current work: robust memory management for multiprogramming


The Obligatory URL Slide


http://www.cs.umass.edu/~emery




If You Can Read This, I Went Too Far


Hoard: Under the Hood

[Figure: Hoard's structure: select a heap based on size; malloc from the local heap and free to the object's heap block; get memory from, or return it to, the global heap]


Custom Memory Allocation

“Use custom allocators”

- Very common practice
  - Apache, gcc, lcc, STL, database servers…
- Language-level support in C++
  - Replace new/delete, bypassing the general-purpose allocator
- Reduce runtime: often
- Expand functionality: sometimes
- Reduce space: rarely


Drawbacks of Custom Allocators

- Avoiding the memory manager means:
  - More code to maintain & debug
  - Can’t use memory debuggers
  - Not modular or robust:
    - Mixing memory from custom and general-purpose allocators → crash!
- Increased burden on programmers



Overview

- Introduction
  - Perceived benefits and drawbacks
- Three main kinds of custom allocators
- Comparison with general-purpose allocators
- Advantages and drawbacks of regions
- Reaps: a generalization of regions & heaps


(I) Per-Class Allocators

- Recycle freed objects of a class from a free list:

    a = new Class1;
    b = new Class1;
    c = new Class1;
    delete a;
    delete b;
    delete c;
    a = new Class1;
    b = new Class1;
    c = new Class1;

+ Fast: linked-list operations
+ Simple
+ Identical semantics
+ C++ language support
- Possibly space-inefficient
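
A compilable sketch of the technique (my illustration; production versions add thread safety and alignment care): class-level operator new and operator delete recycle instances through a per-class free list.

    #include <cstddef>
    #include <cstdlib>

    class Class1 {
    public:
      static void * operator new (size_t sz) {
        if (freeList != nullptr) {            // recycle a freed instance
          FreeNode * n = freeList;
          freeList = n->next;
          return n;
        }
        return std::malloc (sz);              // free list empty: fresh memory
      }
      static void operator delete (void * p) {
        FreeNode * n = (FreeNode *) p;        // push onto the free list;
        n->next = freeList;                   // never returned to malloc,
        freeList = n;                         // hence "possibly space-inefficient"
      }
    private:
      struct FreeNode { FreeNode * next; };
      static FreeNode * freeList;
      char payload[32];                       // stand-in fields; an object must
    };                                        //  be at least sizeof(FreeNode)

    Class1::FreeNode * Class1::freeList = nullptr;

After the trace above, the second "a = new Class1" returns a's old storage straight off the free list, with no call into the general-purpose allocator.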


(II) Custom Patterns

- Tailor-made to fit allocation patterns
- Example: 197.parser (natural language parser), which allocates from a fixed char[MEMORY_LIMIT] array:

    a = xalloc(8);
    b = xalloc(16);
    c = xalloc(8);
    xfree(b);
    xfree(c);
    d = xalloc(8);

[Figure: a, b, c, d laid out consecutively in the array; the end_of_array pointer bumps forward on xalloc and rolls back on xfree]

+ Fast: pointer-bumping allocation
- Brittle
  - Fixed memory size
  - Requires stack-like lifetimes
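
A sketch of this pattern (a hypothetical implementation behind the xalloc/xfree names the slide uses; 197.parser's real code differs): allocation bumps an index into a fixed array, and freeing rolls it back, which is only correct for stack-like lifetimes.

    #include <cstddef>

    static const size_t MEMORY_LIMIT = 1 << 20;
    static char arena[MEMORY_LIMIT];          // the fixed char[MEMORY_LIMIT]
    static size_t end_of_array = 0;           // the bump pointer

    void * xalloc (size_t sz) {
      if (end_of_array + sz > MEMORY_LIMIT)   // fixed size: no fallback
        return nullptr;
      void * p = &arena[end_of_array];
      end_of_array += sz;                     // pointer-bumping allocation
      return p;
    }

    // Rolls the bump pointer back to p, implicitly freeing everything
    // allocated after it: correct only if lifetimes nest (LIFO).
    void xfree (void * p) {
      size_t off = (size_t) ((char *) p - arena);
      if (off < end_of_array)
        end_of_array = off;
    }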



(III) Regions

Separate areas, deletion only en masse:

    regioncreate(r)
    regionmalloc(r, sz)
    regiondelete(r)

+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion, too much space


Overview

- Introduction
  - Perceived benefits and drawbacks
- Three main kinds of custom allocators
- Comparison with general-purpose allocators
- Advantages and drawbacks of regions
- Reaps: a generalization of regions & heaps


Custom Allocators Are Faster…


Not So Fast…


The Lea Allocator (DLmalloc 2.7.0)

- Optimized for common allocation patterns
  - Per-size quicklists ≈ per-class allocation
  - Deferred coalescing (combining adjacent free objects)
  - Highly optimized fastpath
  - Space-efficient
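
For intuition, a minimal sketch of a per-size quicklist (my illustration, far simpler than dlmalloc's real binning, and ignoring alignment): small requests round up to an 8-byte size class, and frees push onto that class's list with no immediate coalescing.

    #include <cstddef>
    #include <cstdlib>

    struct Node { Node * next; };
    static const size_t kClasses = 32;        // 8-byte classes up to 256 bytes
    static Node * quicklist[kClasses];

    void * qmalloc (size_t sz) {
      size_t c = (sz + 7) / 8;
      if (c == 0) c = 1;
      if (c < kClasses && quicklist[c] != nullptr) {
        Node * n = quicklist[c];              // fastpath: pop a free block
        quicklist[c] = n->next;
        return n;                             // its header still records c
      }
      size_t bytes = (c < kClasses) ? c * 8 : sz;
      size_t * p = (size_t *) std::malloc (sizeof (size_t) + bytes);
      if (p == nullptr) return nullptr;
      *p = c;                                 // stash the size class
      return p + 1;
    }

    void qfree (void * ptr) {
      size_t * p = (size_t *) ptr - 1;
      size_t c = *p;                          // read the size class back
      if (c < kClasses) {                     // deferred coalescing: just
        Node * n = (Node *) ptr;              //  push; the node lives in the
        n->next = quicklist[c];               //  payload, so the header survives
        quicklist[c] = n;
      } else {
        std::free (p);                        // large blocks go straight back
      }
    }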



Space Consumption Results


Overview

- Introduction
  - Perceived benefits and drawbacks
- Three main kinds of custom allocators
- Comparison with general-purpose allocators
- Advantages and drawbacks of regions
- Reaps: a generalization of regions & heaps


Why Regions?

- Apparently faster, more space-efficient
- Servers need memory management support:
  - Avoid resource leaks
  - Tear down memory associated with terminated connections or transactions
- Current approach (e.g., Apache): regions



Drawbacks of Regions

- Can’t reclaim memory within regions
  - A problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs
  - Unbounded memory consumption
- Current situation for Apache:
  - Vulnerable to denial-of-service
  - Limits runtime of connections
  - Limits module programming



Reap Hybrid Allocator

- Reap = region + heap
  - Adds individual object deletion & a heap
- API: reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)
- Can reduce memory consumption

+ Fast
+ Adapts to use (region or heap style)
+ Cheap deletion





Using Reap as Regions

- Reap performance nearly matches regions


Reap: Best of Both Worlds

- Combining new/delete with regions is usually impossible:
  - Incompatible APIs
  - Hard to rewrite code
- Using Reap: incorporated new/delete code into Apache
  - “mod_bc” (arbitrary-precision calculator)
  - Changed 20 lines (out of 8000)
  - Benchmark: compute the 1000th prime
    - With Reap: 240K
    - Without Reap: 7.4MB


Conclusion

- Empirical study of custom allocators
  - Lea allocator often as fast or faster
  - Custom allocation ineffective, except for regions
- Reaps:
  - Nearly match region performance, without the other drawbacks
- Take-home message: stop using custom memory allocators!


Software

http://www.cs.umass.edu/~emery
(part of the Heap Layers distribution)