High-Volume Writes with PostgreSQL - 2ndQuadrant


Nov 22, 2012

Greg Smith - ©
2ndQuadrant US 2011
High-Volume Writes
with PostgreSQL
Major parameters to set

shared_buffers: 512MB to 8GB

checkpoint_segments: 16 to 256

effective_cache_size: typically ¾ RAM

wal_buffers: typically 16MB

Auto-tuned in 9.1
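The ranges above can be written as a postgresql.conf fragment. The specific values below are illustrative midpoints for a machine with around 16GB of RAM, not settings recommended by the talk:

```
# Illustrative starting points for a write-heavy server; tune to your RAM/workload
shared_buffers = 2GB            # 512MB to 8GB range suggested above
checkpoint_segments = 64        # 16 to 256; each segment is 16MB of WAL
effective_cache_size = 12GB     # roughly 3/4 of RAM on a 16GB machine
wal_buffers = 16MB              # auto-tuned starting in 9.1
```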
Checkpoints

Dirty data in the buffer cache must be flushed

WAL segments are 16MB

Requested checkpoint

checkpoint_segments of writes

Timed checkpoint

checkpoint_timeout (5 minute default)
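Since each WAL segment is 16MB, checkpoint_segments also bounds pg_xlog disk usage. A sketch of that estimate, using the (2 + checkpoint_completion_target) × checkpoint_segments + 1 rule of thumb from the PostgreSQL documentation of this era (an approximation, not an exact guarantee):

```python
# Rough pg_xlog disk-usage estimate for pre-9.5 PostgreSQL.
SEGMENT_MB = 16  # WAL segments are 16MB each

def wal_disk_usage_mb(checkpoint_segments, completion_target=0.5):
    # Files normally kept: (2 + completion_target) * segments + 1
    files = (2 + completion_target) * checkpoint_segments + 1
    return files * SEGMENT_MB

small = wal_disk_usage_mb(16)    # low end of the suggested range
large = wal_disk_usage_mb(256)   # high end: about 10GB of pg_xlog
```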
Checkpoint spikes

8.3 added Spread Checkpoints

Writes aim to finish at 50% of the checkpoint interval (checkpoint_completion_target)

fsync flush to disk at end of checkpoint

Optimal behavior:

OS wrote data out before fsync call

Spreading sync out didn’t work usefully

Spikes still happen
A bad checkpoint
LOG:  checkpoint complete: wrote 127961 buffers (12.2%);
      0 transaction log file(s) added, 1818 removed, 0 recycled;
      write=80.190 s, sync=359.823 s, total=520.913 s
An even worse checkpoint
LOG:  checkpoint complete: wrote 141563 buffers (13.5%);
      0 transaction log file(s) added, 1109 removed, 257 recycled;
      write=944.601 s, sync=10635.130 s, total=11613.953 s
Types of writes

Checkpoint write: most efficient

Background writer write: still good

Backend write, fsync

Fine if absorbed by background writer

Write will be cached by OS later

Backend write, BGW queue filled

backend does fsync itself

Very bad, multi-hour checkpoints possible

Improved in 9.1
bgwriter monitoring
$ psql -x -c "select * from pg_stat_bgwriter"
checkpoints_timed | 0
checkpoints_req | 4
buffers_checkpoint | 6
buffers_clean | 0
maxwritten_clean | 0
buffers_backend | 654685
buffers_backend_fsync | 84
buffers_alloc | 1225
Time analysis
$ psql -c "select now(),* from pg_stat_bgwriter"

Sample two points

Buffers are 8K each (normally)

Compute time delta, value delta

Buffers allocated: read MB/s

Sum of buffers written: write MB/s

Compute or graph

Munin has an example
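The delta math above can be sketched as follows. The sample dicts here are made-up numbers standing in for two rows you would fetch with psql or a driver; only the arithmetic is the point:

```python
# Compute read/write MB/s from two samples of pg_stat_bgwriter.
BLOCK_SIZE = 8192  # buffers are 8K each unless PostgreSQL was rebuilt

def bgwriter_rates(sample1, sample2, seconds):
    """Return (read_mb_s, write_mb_s) between two pg_stat_bgwriter samples."""
    # Sum of all buffers written out by any path
    written = sum(sample2[c] - sample1[c] for c in
                  ("buffers_checkpoint", "buffers_clean", "buffers_backend"))
    # Buffers allocated approximates data read into the cache
    allocated = sample2["buffers_alloc"] - sample1["buffers_alloc"]
    to_mb_s = lambda bufs: bufs * BLOCK_SIZE / (1024 * 1024) / seconds
    return to_mb_s(allocated), to_mb_s(written)

# Hypothetical samples taken 10 seconds apart
s1 = {"buffers_checkpoint": 0, "buffers_clean": 0,
      "buffers_backend": 0, "buffers_alloc": 0}
s2 = {"buffers_checkpoint": 6400, "buffers_clean": 1280,
      "buffers_backend": 5120, "buffers_alloc": 12800}
read_mb_s, write_mb_s = bgwriter_rates(s1, s2, seconds=10)
```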
bgwriter trends
Cache refill
Linux tuning

ext3 on old kernels flushes everything dirty on fsync

dirty_ratio lowers write cache size in %

Kernel 2.6.29 is finer grained

dirty_bytes sets exact amount of RAM

Don't lower it too far, though

OS write caching is expected

VACUUM slows a lot: 50% drop possible
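As a concrete illustration of those knobs, a /etc/sysctl.conf fragment might look like this. The values are examples for experimentation, not recommendations from the talk:

```
# Percentage knobs, available on all kernels
vm.dirty_background_ratio = 1
vm.dirty_ratio = 10
# On 2.6.29+ kernels, exact byte limits can be used instead
# (setting these overrides the ratio settings):
# vm.dirty_background_bytes = 67108864   # 64MB
# vm.dirty_bytes = 536870912             # 512MB
```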
VACUUM

Cleans up after UPDATE and DELETE

The hidden cost of MVCC

Must happen eventually

Frozen ID cleanup
Autovacuum

Cleans up after dead rows

Also updates database stats

Large tables: 20% change required

autovacuum_vacuum_scale_factor = 0.2
VACUUM Overhead

Intensive when it happens

Focus on smoothing and scheduling

Putting it off makes it worse

Dead rows add invisible overhead

Table “bloat” can be very large

Thresholds can be per-table
Index Bloating

Indexes can become less efficient after deletes

VACUUM FULL before 9.0 makes this worse

REINDEX helps, but it locks the table

CREATE INDEX can run CONCURRENTLY

Build a new index, then rename: simulates REINDEX CONCURRENTLY

All transactions must end to finish

CLUSTER does a full table rebuild

Same “fresh” performance as after dump/reload

Full table lock to do it
VACUUM Gone Wrong

Aim at a target peak performance

VACUUM isn't accounted for

Just survive peak load?

You won't survive VACUUM
VACUUM monitoring

Watch pg_stat_user_tables timestamps

Beware long-running transactions

log_autovacuum_min_duration

Sizes of tables/indexes critical too
Improving efficiency

maintenance_work_mem: up to 2GB

shared_buffers & checkpoint_segments
(again)

Hardware write caches

Tune read-ahead
VACUUM Cost Limits
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20
vacuum_cost_limit = 200
autovacuum_vacuum_cost_delay = 20ms
autovacuum Cost Basics

Every 20 ms = 50 runs/second

Each run accumulates 200 cost units

200 * 50 = 10000 cost / second
Costs and Disk I/O

20ms = 10000 cost/second

All misses @ 10 cost?

10000 / 10 = 1000 reads/second

1000*8192/(1024*1024)=7.8MB/s read

All dirty @ 20 cost?

10000 / 20 = 500 writes/second

500*8192/(1024*1024)=3.9 MB/s write

Halve the delay to 10ms?

Doubles the rate: 15.6 MB/s read / 7.8 MB/s write

Double the delay to 40ms?

Halves the rate: 3.9 MB/s read / 1.95 MB/s write
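The cost arithmetic on these two slides reduces to one small function. Parameter names mirror the postgresql.conf settings shown earlier:

```python
# Max autovacuum I/O rate when every page costs the same amount.
BLOCK_SIZE = 8192  # bytes per buffer

def autovacuum_mb_per_sec(page_cost, cost_limit=200, cost_delay_ms=20):
    """MB/s ceiling when all pages cost `page_cost` cost units."""
    runs_per_sec = 1000 / cost_delay_ms          # 20ms delay -> 50 runs/second
    cost_per_sec = cost_limit * runs_per_sec     # 200 * 50 = 10000 cost/second
    pages_per_sec = cost_per_sec / page_cost
    return pages_per_sec * BLOCK_SIZE / (1024 * 1024)

read_mb_s = autovacuum_mb_per_sec(page_cost=10)   # all misses: ~7.8 MB/s
write_mb_s = autovacuum_mb_per_sec(page_cost=20)  # all dirty:  ~3.9 MB/s
```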
Submission for 9.2

“Displaying accumulated autovacuum cost”

In November CommitFest

Easily applies to older versions

Not very risky to production

Just adds some logging

Useful for learning how to tune costs
Sample logging output
LOG:  automatic vacuum of table "pgbench.public.pgbench_accounts": index scans: 1
      pages: 0 removed, 163935 remain
      tuples: 2000000 removed, 2928356 remain
      buffer usage: 117393 hits, 123351 misses, 102684 dirtied, 2.168 MiB/s write rate
      system usage: CPU 2.54s/6.27u sec elapsed 369.99 sec
Common tricks

Manual VACUUM during slower periods

Make sure to set vacuum_cost_delay

Start with daily

Break down by table size

Alternate fast/slow configurations

Two postgresql.conf files, or edit script

Swap/change using cron or pgAgent

Aggressive freezing
Write to disk, slow way

Data page change to pg_xlog WAL

Checkpoint pushes page to disk

Hint bits update page for faster visibility

Autovacuum marks free space

Freeze old transaction IDs
Manually maintained path

Data page change to pg_xlog WAL

Checkpoint pushes page to disk

Manually freeze old transaction IDs

Tweak vacuum_freeze_min_age and/or
vacuum_freeze_table_age
Hardware for fast writes

Log checkpoints to catch spikes

Battery-backed write cache

Fast commits

Beware volatile write caches

http://wiki.postgresql.org/wiki/Reliable_Writes
Hard Drive Latency

Type                          Latency (ms)   Transactions/sec
5400 RPM                          11.1              90
7200 RPM                           8.3             120
10K RPM                            6.0             167
15K RPM                            4.0             250
Battery-backed write cache         0.2            5000
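The rotating-disk rows follow from a simple model: each commit waits roughly one full disk rotation, so latency is 60000/RPM milliseconds and the commit rate is RPM/60. A sketch that reproduces the table (real drives vary):

```python
# Commit latency and rate from spindle speed, one rotation per commit.
def commit_rate(rpm):
    latency_ms = 60_000 / rpm          # one full rotation, in milliseconds
    tps = round(1000 / latency_ms)     # commits per second
    return latency_ms, tps

rates = {rpm: commit_rate(rpm) for rpm in (5400, 7200, 10_000, 15_000)}
```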
Latency driving TPS

Partitioning

Time-series data splits most easily

Monthly partitions typical

Setup is manual and requires some code

Queries can only exclude partitions

Old partitions don't need vacuum

Once frozen, they're ignored

Indexes are smaller

Less used indexes fade from cache

Oldest data can be truncated

No deletion VACUUM cleanup!
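The "manual setup with some code" of that era typically meant generating child tables with CHECK constraints via inheritance. A hypothetical sketch of such a generator; the table and column names are invented for illustration, and a real setup also needs triggers or rules to route inserts:

```python
# Generate pre-PostgreSQL-10 style monthly partition DDL (inheritance +
# CHECK constraints, which the planner uses for constraint exclusion).
def monthly_partition_ddl(parent, ts_col, year, month):
    nxt_y, nxt_m = (year + 1, 1) if month == 12 else (year, month + 1)
    child = f"{parent}_y{year}m{month:02d}"
    return (
        f"CREATE TABLE {child} (\n"
        f"    CHECK ({ts_col} >= '{year}-{month:02d}-01'\n"
        f"       AND {ts_col} <  '{nxt_y}-{nxt_m:02d}-01')\n"
        f") INHERITS ({parent});"
    )

ddl = monthly_partition_ddl("events", "created_at", 2011, 12)
```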
Skytools

Proven to handle write scaling

Database access wrapped in functions

PL/Proxy routes to appropriate nodes

Any, all, etc.

Replication used for shared data

Rebalancing is tricky

Pause feature in pgbouncer helps

Hard to retrofit to existing system
Special thanks
Some monitoring samples provided by RunKeeper:
Track, measure, and improve your fitness
Clients for Android and iPhone
http://runkeeper.com/
PostgreSQL Books
http://www.2ndQuadrant.com/books/
Questions
Slides at 2ndQuadrant.com
Resources / Talks