
© Richard Jones, University of Kent 2009

http://www.cs.kent.ac.uk/~rej

SCIEnce Paris Workshop 2009


Dynamic Memory Management
Challenges for today and tomorrow

Richard Jones
Computing Laboratory
University of Kent, Canterbury
http://www.cs.kent.ac.uk/~rej

January 2009, Paris
SCIEnce 2009


Overview

Part 1: Introduction; basic algorithms
Part 2: GC for high performance
Part 3: Reference counting revisited
Part 4: Integration with the environment
Discussion: GC and CA


PART 1: Introduction


Why garbage collect?



Basic algorithms


Why garbage collect?

Obvious requirement:
  Finite, limited storage.
Language requirement:
  Objects may survive their creating method.
The problem:
  It is hard or impossible to determine when something becomes garbage, and programmers find it hard to get right:
    too little collected: memory leaks;
    too much collected: broken programs.
Good software engineering:
  Explicit memory management conflicts with the software engineering principles of abstraction and modularity.


Basic algorithms


The garage metaphor

Reference counting: Maintain a note on each object in your garage, indicating the current number of references to the object. When an object’s reference count goes to zero, throw the object out (it’s dead).

Mark-Sweep: Put a note on objects you need (roots). Then recursively put a note on anything needed by a live object. Afterwards, check all objects and throw out objects without notes.

Mark-Compact: Put notes on objects you need (as above). Move anything with a note on it to the back of the garage. Burn everything at the front of the garage (it’s all dead).

Copying: Move objects you need to a new garage. Then recursively move anything needed by an object in the new garage. Afterwards, burn down the old garage (any objects in it are dead)!
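The Mark-Sweep metaphor translates almost directly into code. A minimal sketch in Python (illustrative only; `Obj`, `mark` and `sweep` are invented names, and a real collector works over raw memory rather than Python lists):

```python
class Obj:
    """A heap object holding references to other objects."""
    def __init__(self, *refs):
        self.refs = list(refs)
        self.marked = False

def mark(roots):
    # Put a note on the roots, then recursively on anything they need.
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)

def sweep(heap):
    # Throw out objects without notes; clear notes on the survivors.
    live = [o for o in heap if o.marked]
    for o in live:
        o.marked = False
    return live

# A small graph: a -> b; c is unreachable.
b = Obj(); a = Obj(b); c = Obj()
heap = [a, b, c]
mark([a])
heap = sweep(heap)
assert heap == [a, b]
```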


Generational GC

Weak generational hypothesis:
  “Most objects die young” [Ungar, 1984]
  It is common for 80-95% of objects to die before a further MB of allocation.

Strategy:
  Segregate objects by age into generations (regions of the heap).
  Collect different generations at different frequencies,
  so pointers that cross generations must be “remembered”.
  Concentrate on the nursery generation to reduce pause times:
  full-heap collection pauses are 5-50x longer than nursery collections.

[Figure: roots and a remembered set (remset) of old-generation pointers into the young generation.]
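Remembering pointers that cross generations is done with a write barrier on pointer stores. A toy sketch (all names invented; real barriers are inlined filter-then-record code, often using card tables):

```python
# Toy generational write barrier: record old->young pointer stores in a
# remembered set, so a nursery collection can treat the remset as extra
# roots instead of scanning the whole old generation.
young_gen = {"y1"}
old_gen = {"o1", "o2"}
remset = set()
fields = {}  # obj -> {field: target}

def write_ref(src, field, dst):
    """Perform src.field = dst, applying the write barrier."""
    fields.setdefault(src, {})[field] = dst
    if src in old_gen and dst in young_gen:
        remset.add(src)  # remember the old->young crossing pointer

write_ref("o1", "f", "y1")   # crosses generations: recorded
write_ref("o2", "f", "o1")   # old->old: no barrier work needed
assert remset == {"o1"}
```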


Generational GC: a summary

Highly successful for a range of applications:
  acceptable pause times for interactive applications;
  reduces the overall cost of garbage collection;
  improves paging and cache behaviour.
Requires a low survival rate, infrequent major collections, and a low overall write-barrier cost.

But generational GC is not a universal panacea: it improves the expected pause time but not the worst case.
  Objects may not die sufficiently fast.
  Applications may thrash the write barrier.
  Too many old-young pointers may increase pause times.
  Copying is expensive if the survival rate is high.


Can GC perform as well as malloc/free?

Test environment:
  Real Java benchmarks + JikesRVM + real collectors + real malloc/frees.
  Object reachability traces provide an ‘oracle’.
  Dynamic SimpleScalar simulator: count cycles, cache misses, etc.

Quantifying the Performance of Garbage Collection v. Explicit Memory Management, Hertz and Berger, OOPSLA’05.

Best GC v. best malloc.


PART 2: GC for high performance systems

Assume: multiprocessor, many user threads, very large heaps.
Require: good throughput, good response time, but not hard real-time.

Typical configuration:
  A nursery region providing fast, bump-pointer allocation.
  A mature region managed by a mark-sweep collector,
  which requires occasional compaction.

Exploit parallel hardware:
  concurrent allocation;
  concurrent marking;
  parallel marker and collector threads;
  concurrent sweeping;
  parallel & concurrent compaction.


Parallelism

Avoid bottlenecks:
  Heap contention: allocation.
  Tracing, especially contention for the mark stack.
  Compaction: address fix-up must appear atomic.
  Use lock-free data structures.

Load balancing is critical:
  work starvation vs. excessive synchronisation;
  over-partition work, then work-share:
  marking, scanning card tables / remsets, sweeping, compacting.


Terminology

[Figure: mutator and collector timelines contrasting single-threaded, incremental, parallel, and concurrent collection of the user program.]


A. Concurrent allocation

Multiple user threads:
  must avoid contention for the heap;
  avoid locks, avoid atomic instructions (CAS).

[Figure: several threads contending for a single free pointer (freep) into the heap.]


Local Allocation Blocks



A thread contends for a contiguous block (LAB): one CAS.
The thread then allocates within its LAB by bumping a pointer: no contention.
Locality properties make this effective even for GCs that rarely move objects,
but it needs variable-sized LABs.

[Figure: a LAB manager hands out LABs from a nextLAB pointer; each thread bump-allocates privately within its own LAB.]
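The two-level scheme can be sketched as follows (illustrative only; `LabManager`, `ThreadAllocator` and all sizes are invented, and a lock stands in for the CAS on the shared pointer):

```python
import threading

class LabManager:
    """Hands out contiguous blocks of heap; only this step is synchronised."""
    def __init__(self, heap_size, lab_size):
        self.next_lab = 0
        self.heap_size = heap_size
        self.lab_size = lab_size
        self.lock = threading.Lock()  # stands in for a CAS on next_lab

    def new_lab(self):
        with self.lock:
            if self.next_lab + self.lab_size > self.heap_size:
                raise MemoryError("heap exhausted")
            start = self.next_lab
            self.next_lab += self.lab_size
            return start, start + self.lab_size

class ThreadAllocator:
    """Per-thread allocator: bump a pointer inside the current LAB, no locks."""
    def __init__(self, manager):
        self.manager = manager
        self.cursor, self.limit = manager.new_lab()

    def alloc(self, size):
        if self.cursor + size > self.limit:                   # LAB full:
            self.cursor, self.limit = self.manager.new_lab()  # contend once
        addr = self.cursor
        self.cursor += size            # contention-free bump allocation
        return addr

mgr = LabManager(heap_size=1024, lab_size=128)
t = ThreadAllocator(mgr)
a, b = t.alloc(16), t.alloc(16)
assert b == a + 16  # consecutive allocations are adjacent
```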


B: Concurrent Marking

Marking concurrently with user threads introduces a coherency problem: the mutator might hide pointers from the collector. Both solutions use a write barrier to record pointers for the collector to revisit:
  Incremental update (IU): catch changes to connectivity.
  Snapshot at the beginning (SAB): prevent loss of the original path.


Write barrier properties

Performance: no barriers on local variables.

Incremental update:
  The mutator can hide live objects from the collector,
  so a final, stop-the-world marking phase is needed.
Snapshot at the beginning:
  Any object reachable at the start of the phase remains reachable,
  so no stop-the-world phase is required,
  but more floating garbage is retained than with incremental update.
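The two disciplines differ only in which pointer a store logs for the collector to revisit; a toy sketch (all names invented):

```python
# A pointer store obj[field] = new must tell the collector something;
# the two write-barrier disciplines log different values:
sab_log, iu_log = [], []

def write_sab(obj, field, new):
    old = obj[field]
    sab_log.append(old)   # SAB: log the overwritten value, so the path
    obj[field] = new      # that existed at the snapshot is never lost

def write_iu(obj, field, new):
    iu_log.append(new)    # IU: log the newly installed target, so a
    obj[field] = new      # change in connectivity is revisited

x, y = {"f": "A"}, {"f": "A"}
write_sab(x, "f", "B")
write_iu(y, "f", "B")
assert sab_log == ["A"]   # SAB revisits what was overwritten
assert iu_log == ["B"]    # IU revisits what was written
```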



C. Parallel markers

The goal is always to avoid contention (e.g. for the mark stack):
  thread-local mark stacks, work-stealing;
  grey packets.

Stephen Thomas, Insignia Solutions


Grey packets

Marker threads acquire a full packet of marking work (grey references). They mark and empty this packet, filling a fresh (empty) packet with new work. The new, filled packet is returned to the pool.

  Avoids most contention.
  Simple termination.
  Allows prefetching (unlike a traditional mark stack).
  Reduces processor weak-ordering problems (fences are needed only around packet acquisition/disposal).

[Figure: marker threads exchanging full and empty packets with a shared grey packet pool.]

Stephen Thomas, Insignia Solutions
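The acquire-mark-refill cycle can be sketched in a single-threaded toy (names and the packet size are invented; a real implementation runs many marker threads against a synchronised pool):

```python
from collections import deque

# Toy grey-packet marking: work is exchanged in fixed-size packets, so
# threads would contend only when acquiring or returning a packet.
PACKET_SIZE = 4

def mark_with_packets(roots, refs_of):
    pool = deque()                        # shared pool of full packets
    marked = set()
    for i in range(0, len(roots), PACKET_SIZE):
        pool.append(deque(roots[i:i + PACKET_SIZE]))  # seed with the roots
    while pool:
        packet = pool.popleft()           # acquire a full packet (the only
        out = deque()                     # contended step); fill a fresh one
        while packet:
            obj = packet.pop()
            if obj not in marked:
                marked.add(obj)
                for child in refs_of.get(obj, []):
                    out.append(child)
                    if len(out) == PACKET_SIZE:
                        pool.append(out)  # return a filled packet to the pool
                        out = deque()
        if out:
            pool.append(out)
    return marked

refs = {"r": ["a", "b"], "a": ["c"]}
assert mark_with_packets(["r"], refs) == {"r", "a", "b", "c"}
```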


E. Compaction

Without compaction, heaps tend to fragment over time.

Issues for compaction:
  moving objects;
  updating all references to a moved object;
  doing both in parallel, and concurrently.


Compaction strategy

An old Lisp idea recently applied to concurrent systems: divide the heap into a few large regions.
  The marker constructs remsets: the locations holding references into each region.
  A heuristic (live volume / number of references) chooses the condemned region.
  The remsets are then used to fix up references.

[Figure: heap divided into regions, each with a remset of incoming references.]


Parallel compaction

Split the heap into:
  many small blocks (e.g. 256B);
  fewer, larger target areas (e.g. 16 x processors, each 4MB+).

Each thread:
1. increments the index of the next block to compact (e.g. with a CAS);
2. claims a target area at a lower address (e.g. with a CAS);
3. moves the objects in that block en masse into its target area.

An Efficient Parallel Heap Compaction Algorithm, Abuaiadh et al, OOPSLA’04.

[Figure: small blocks moved en masse into claimed target areas.]
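The work-claiming step can be sketched in a toy (single-threaded here; all names and sizes are invented, and an atomic counter stands in for the CAS):

```python
import itertools

# Toy work claiming for parallel compaction: each thread atomically takes
# the index of the next small block, and slides live blocks down into a
# claimed, lower-addressed target area.
BLOCK = 256
blocks = [(i * BLOCK, "live" if i % 2 else "dead") for i in range(8)]
next_block = itertools.count()   # fetch-and-increment stands in for CAS

def compact(target_cursor):
    """Claim blocks one by one; return old-address -> new-address moves."""
    moves = {}
    while True:
        i = next(next_block)     # claim the next block (a CAS in a real GC)
        if i >= len(blocks):
            return moves
        addr, state = blocks[i]
        if state == "live":
            moves[addr] = target_cursor[0]   # move the block en masse
            target_cursor[0] += BLOCK

cursor = [0]
moves = compact(cursor)
# Live blocks end up packed contiguously at the bottom of the target area.
assert list(moves.values()) == [0, 256, 512, 768]
```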


Results

Reduced compaction time (moving individual objects) from 1680 ms to 470 ms for a large 3-tier application suffering fragmentation problems.

Moving blocks rather than individual objects:
  further reduces compaction time by 25%,
  at a small increase in space cost (+4%).

Compaction speed-up is linear in the number of threads.
Throughput increased slightly (2%).


Read barrier methods

Alternatively: don’t let the mutator see objects that the collector hasn’t seen.

How? Trap mutator accesses and mark, or copy & redirect:
  read barriers in software, or
  via memory protection.


Memory Protection

By protecting a grey page, any access attempt by the mutator is trapped.

[Figure: a mutator access to a protected grey page is trapped by the read barrier.]


Compressor: concurrent, incremental and parallel compaction

Fully concurrent (no stop-the-world phase), semispace sliding compaction:
a single heap pass plus a pass over a small mark-bit vector
(cf. two full heap passes for most compactors).

Phases:
1. Mark live objects, constructing a mark-bit vector.
2. From the mark-bit vector, construct an offset table mapping fromspace blocks to tospace blocks.
3. mprotect tospace. Stop threads one at a time and update their references to point to tospace (using the offset table and mark-bit vector).
4. On access violations, move N pages and update the references in the moved pages. Unprotect the pages.
   This needs each physical page mapped to 2 virtual pages.

The Compressor: concurrent, incremental and parallel compaction, Kermany & Petrank, PLDI’06.
cf. Mostly Concurrent Compaction for Mark-Sweep GC, Ossia et al, ISMM’04.
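The key trick is that the offset table lets any fromspace reference be translated from the mark-bit vector alone, without touching the heap. A toy sketch (block size and layout are invented):

```python
# Toy Compressor-style offset table: precompute where each fromspace block's
# live data starts in tospace; a reference is then forwarded by adding the
# count of live words preceding it within its block.
BITS_PER_BLOCK = 4  # each mark bit covers one word; 4 words per block

def build_offset_table(mark_bits):
    """table[b] = tospace address (in words) of block b's first live word."""
    table = []
    free = 0
    for b in range(0, len(mark_bits), BITS_PER_BLOCK):
        table.append(free)
        free += sum(mark_bits[b:b + BITS_PER_BLOCK])  # live words in block b
    return table

def forward(addr, mark_bits, table):
    """Translate a fromspace word address to its tospace address."""
    block = addr // BITS_PER_BLOCK
    # tospace start of the block + live words preceding addr within it
    return table[block] + sum(mark_bits[block * BITS_PER_BLOCK:addr])

marks = [1, 0, 1, 1,  0, 0, 1, 0]   # two blocks of 4 words
table = build_offset_table(marks)
assert table == [0, 3]
assert forward(6, marks, table) == 3  # word 6 slides down to tospace word 3
```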


Compressor results

Throughput: better than generational mark-sweep for the SPECjbb2000 server benchmark, worse for the DaCapo client benchmarks.
Pauses: a delay before achieving the full allocation rate.

[Chart omitted.]
SPECjbb2000: 2-way 2.4GHz Pentium III Xeon.
DaCapo: 2.8GHz Pentium 4 uniprocessor.


Azul Pauseless GC

Mark:
  Parallel, concurrent; work-lists.
  Threads mark their own roots; parallel threads mark blocked threads’ roots.
Relocate:
  Protect pages and concurrently relocate.
  Free physical but not virtual pages immediately (there might be stale refs).
  A side array holds forwarding pointers, also used by the read barrier (RB).
  The mutator may do the copy on behalf of the GC.
Remap:
  GC threads traverse the object graph, tripping the RB to update stale refs.
Results:
  Substantially reduced transaction times; almost all pauses < 3ms.
  Reports that other (incomparable) systems had 20+% of pauses > 3ms.

[Figure: the phases repeating: mark, relocate, remap.]


PART 3: Reference counting revisited

Problems:
  cycles;
  overheads;
  atomic RC operations.
Benefits:
  distributed overheads;
  immediacy;
  recycling of memory.

Observations:
  Practical solutions use deferred RC:
    don’t count local variable operations;
    periodically scan the stack.
  RC and tracing GC are duals:
    tracing GC traces live objects;
    RC traces dead objects.
  A Unified Theory of Garbage Collection, Bacon et al, OOPSLA’04.

Two solutions


A. Concurrent, cyclic RC

RC contention: use a producer-consumer model.
  Mutator threads only add RC operations to local buffers.
  The collector thread consumes them to modify reference counts.
Avoiding races between increments and decrements:
  buffers are periodically turned over to the collector (epochs);
  increments are performed in this epoch, decrements in the next.
Cycles: use trial deletion.
  Trace suspect subgraphs.
  Remove the RCs due to internal references.
  If all RCs = 0, the subgraph is garbage.
Results:
  Throughput close to a parallel mark-sweep collector for most applications.
  Maximum pause time: 2.6ms; smallest gap between pauses: 36ms.

Concurrent Cycle Collection in Reference Counted Systems, Bacon and Rajan, ECOOP’01.

[Figure: threads filling per-epoch buffers that are handed to the collector.]
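The epoch discipline can be sketched in a toy (all names invented; a real collector consumes per-thread buffers concurrently):

```python
from collections import Counter

# Toy epoch-buffered RC: mutators only buffer RC operations; the collector
# applies an epoch's increments before that epoch's decrements by holding
# each epoch's decrements over to the next epoch, so a decrement can never
# race ahead of the increment that balances it.
rc = Counter()
pending_decs = []    # decrements held over from the previous epoch
freed = []

def collector_epoch(incs, decs):
    """Apply this epoch's increments and the PREVIOUS epoch's decrements."""
    global pending_decs
    for obj in incs:
        rc[obj] += 1
    for obj in pending_decs:
        rc[obj] -= 1
        if rc[obj] == 0:
            freed.append(obj)    # count reached zero: reclaim
    pending_decs = decs

collector_epoch(incs=["a", "b"], decs=["a"])  # epoch 1: +a +b; hold -a
collector_epoch(incs=[], decs=["b"])          # epoch 2: apply -a: a dies
assert freed == ["a"] and rc["b"] == 1
```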


B. Sliding views

Observation: when the mutator repeatedly overwrites o1’s reference field (o2 → o3 → o4), naive RC performs
  RC(o2)--, RC(o3)++, RC(o3)--, RC(o4)++
but only the net effect is needed:
  RC(o2)--, RC(o4)++.

Sliding views:
  The mutator checks o1’s dirty bit: if clear, it sets the bit and adds the old values to a local buffer.
  Threads mark objects referenced from stacks etc. and clear dirty bits;
  local buffers are passed to the collector thread.
  The collector performs RC(o2)-- using the buffered values and RC(o4)++ using the current value.

Results vs. stop-the-world mark-sweep (Sun JVM 1.2.2):
  worst-case response times (SPECjbb):
  1 thread: 16ms v. 7433ms (464x);
  16 threads: 250ms v. 6593ms (26x).

An On-the-fly Reference Counting Garbage Collector for Java, Levanoni and Petrank, OOPSLA’01.

[Figure: o1’s field successively pointing at o2, o3, o4, with the affected reference counts.]
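The coalescing effect can be sketched as follows (a toy with invented names; real dirty bits live in object headers, and buffers are per-thread):

```python
# Toy sliding-views coalescing: only the FIRST update to a slot in an epoch
# logs the old value; at the epoch end the collector decrements that logged
# snapshot value and increments the slot's current value, skipping all the
# intermediate RC(o3)++ / RC(o3)-- pairs.
rc = {"o2": 1, "o3": 1, "o4": 0}
dirty, log = set(), {}

def write(obj_slots, slot, new):
    if slot not in dirty:              # dirty bit clear:
        dirty.add(slot)
        log[slot] = obj_slots[slot]    # buffer the OLD value, once per epoch
    obj_slots[slot] = new

def collect(obj_slots):
    for slot, old in log.items():
        rc[old] -= 1                   # decrement using the buffered value
        rc[obj_slots[slot]] += 1       # increment using the current value
    dirty.clear(); log.clear()

o1 = {"f": "o2"}
write(o1, "f", "o3")                   # intermediate: no RC traffic for o3
write(o1, "f", "o4")
collect(o1)
assert rc == {"o2": 0, "o3": 1, "o4": 1}
```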


PART 4: Integration with the environment


GC without paging

Page-level locality:
  Collecting disrupts the program’s working set: GC touches all pages.
  Paging kills performance, and it is unrealistic to overprovision memory.

Adaptive GC:
  Tune heap expansion v. GC decisions.
  Monitor paging behaviour.

Bookmarking GC:
  Low space consumption plus elimination of GC-induced paging.
  Cooperates with the OS over page eviction/replacement.
  Full-heap, compacting GC, even when portions of the heap are evicted.


GC & the Virtual Memory Manager

On eviction:
  GC and try to discard an empty but resident page; otherwise:
  select a victim page;
  scan the page and remember its outgoing pointers (‘bookmarks’), cf. remsets;
  protect the page and inform the VMM that it can be evicted.

GC:
  as normal, but don’t follow references to evicted pages;
  treat bookmarks as roots.

On return to main memory:
  trap the mutator’s access violation;
  remove the bookmarks for this page.

[Figure: the GC evacuates and selects victim pages; the VMM’s page replacement sends eviction notifications.]

Garbage Collection without Paging, Hertz et al, PLDI’05.
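The bookmark-as-root idea can be sketched in a toy (all structures invented; a real implementation summarises pointers per page and cooperates with the VMM):

```python
# Toy bookmarking: before a page is evicted, summarise its outgoing
# pointers as bookmarks; a later collection treats the bookmarks as roots
# instead of following references onto the (now evicted) page.
pages = {0: {"a": ["b"]}, 1: {"b": ["c"]}, 2: {"c": []}}
evicted, bookmarks = set(), set()

def evict(page):
    for obj, refs in pages[page].items():
        bookmarks.update(refs)   # remember outgoing pointers, cf. a remset
    evicted.add(page)

def page_of(obj):
    return next(p for p, objs in pages.items() if obj in objs)

def mark(roots):
    marked, stack = set(), list(roots) + list(bookmarks)
    while stack:
        obj = stack.pop()
        if obj not in marked and page_of(obj) not in evicted:
            marked.add(obj)      # never touch evicted pages
            stack.extend(pages[page_of(obj)][obj])
    return marked

evict(1)                          # page 1 (holding b) goes to disk
assert mark(["a"]) == {"a", "c"}  # b skipped, but its target c stays live
```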


Average pause times

[Chart: average pause time (ms) against available memory for BC, BC with resizing only, GenCopy, GenMS, CopyMS and SemiSpace; PseudoJBB with a 77MB heap while paging.]


Today…

GC:
  Throughput competitive with explicit deallocation, but at a space penalty.
  Not intrusive.
High-performance server GC:
  Heaps of several GB; multiprocessor machines; short pause times.
Real-time GC:
  Guaranteed performance, if you specify the parameters.
Reference counting:
  Renewed interest.
Integration with the environment:
  GC implementers are cache sensitive.
  Cooperation with the OS over paging.


The future

“Server” GC will move onto the desktop.
How do we deal with massively multicore systems?
Better cooperation with the operating system etc.
The virtual machine communities should talk to each other.
Power-aware garbage collection.
Exploit program knowledge:
  Object lifetimes have distinct patterns: can we exploit them?
  What if each Beltway belt represented a management policy (expected lifetime, promotion route, …)?
  Static analyses, maybe not to reclaim objects, but to give hints to the GC.


Questions?


Discussion

Your configurations:
  work loads, object demographics:
    object sizes, heap sizes, lifetimes;
    allocation rate, mutation rate.
Requirements:
  pause times v. throughput;
  concurrency, parallelism;
  fragmentation?
Bottlenecks?
  Cache, memory bandwidth.
  Working set size.
Tools:
  GCspy.