PostgreSQL Database Performance Tuning


Nov 29, 2012




$Id: softtuning.sgml,v 1.12 2003/11/02 14:24:47 jpargudo Exp $

This document presents how to tune PostgreSQL databases for performance.

It is really outdated now, but some principles exposed in the article are still valuable.

Document licence

About the author

The purpose of this document


Use correct datatypes

The 20/80 rule

Indexes workaround : principles

Indices workaround : Little tables

Indices workaround : Big tables


VACUUM your database

pg_autovacuum to avoid crontab managed vacuums

Should you disable fsync?

Archiving your data

Use stored procedures



Document licence

Copyright © 2002 Jean-Paul Argudo

The document is freely distributable under the licence terms of the GNU Free Documentation License.

All comments, suggestions or help are very welcome and encouraged. Please send any comments you have to the author by mail.

About the author

I'm a PostgreSQL fan of many years now. I used to be a consultant DBA in a French Open Source company (look at my CV zone here), where I worked almost exclusively with PostgreSQL. I started to write this document there.

I really thank Nat Makarevitch for everything I learnt in his company, IDEALX.

PostgreSQL is now in version 7.3.1 as of December 2002. I hope I'll have enough time to work on this article, since it gets many hits (see the stats page). Now I really know many people learn from this article. All your messages go directly to my heart, and I thank all the people who signed my guestbook on this site.

All encouragements give me strength to continue writing this article. Thanks a lot.

The purpose of this document

This document tries to give some waypoints for the PostgreSQL DBA in tuning PostgreSQL databases. This is *not* about PostgreSQL server tuning. For tuning your PostgreSQL server configuration, read Bruce Momjian's article on the subject. There are additional useful articles on the PostgreSQL Technical Documentation site.

So what is it about, then? Let's say, to simplify roughly, that it's about software tuning. I'll give some tips & hints about database tuning: I mean your database model, the way you store the data, the way you can help PostgreSQL in finding the best ways to search for data, etc.

My objective is for you, the reader, to help me help us all. I'd like this document to become a kind of participative cookbook, so don't hesitate to send patches to this document. In return for what you may learn reading this document, please send what you think others should know too.

Hope you'll find here all you wanted to know about PostgreSQL database tuning!


In your architecture, the database has been placed on a node with what you think is a strong configuration: lots of memory, lots of disk space with RAID 5. Hopefully you have put in a standby RAID 5 card, in case the main one burns. Since you had load problems with or without swapping, you increased your optimisation parameters in postgresql.conf. Now you use much more of your 2 GB of RAM... But nothing seems to help; you still have a big load on your server. Queries seem to slow down with each passing day, too...

It seems you need some software tuning. I mean, you have already tried server configuration tuning. I hope you read Bruce Momjian's excellent articles, where you learnt that sometimes big is too big.

The PostgreSQL RDBMS, like all others, is based on simple ideas. Sometimes, simple ideas need complex code to be implemented. But that is not the problem for now, so I'll try to explain some database concepts to help you understand the few tips I give here.

So, let's dive into tuning!

Use correct datatypes

This is just common advice: try to use the smallest data type to store your data. This supposes you know the possible values your fields will have.

For character data it is recommended to use the VARCHAR type, since PostgreSQL manages it very well. On the contrary, if you use the CHAR datatype, the size you specify for the string is pre-allocated in the pages, and you allocate space which may not be used. So your database is bigger than it should be and performance is lowered, because this way you need more I/O to read the data.

But, if you are sure you'll use all the characters, it is better to allocate them directly. For example, you store fingerprints of public keys, or your applications use a user profile coded in a 20-character string, with each character representing a user level or a special privilege for your application.

With the last example, I'll take a moment to advise you to code as much of your data logic as you can into your database. But beware: maybe you'll need more time decoding the data than PostgreSQL needs to find the data in a classic, pure entity-relationship data model.

Let's use an example. Imagine you have a table "user" in which you store the user's last name, telephone, etc. Your application manages user privileges and other permissions. Since you are a solid DBA, you surely designed the database with some entity-relationship method. Then you separated table "user" from table "rights" according to the proposition "a user has 0 to n rights".

There, the interesting thing is "n rights". Are you sure you'll have "n" rights to manage? Surely not. You already know all the privileges you want to manage in your application. Let's say you have 10 privileges to manage. So, why don't you make a special field in table "user" to manage these 10 rights? For example, add a column user.privileges CHAR(16), which is quite nice because you know you have space for 6 more rights, just in case of a forgotten specification, or whatever. You can now code all your privileges in one field: the 1st character is privilege P1, the 2nd character is privilege P2, etc.

This technique is great for performance enhancement. You don't need the table "privileges" anymore, and when you have to search the privileges granted to a user (in your client application's meaning, not SQL's GRANT), you make a simple select on one table which returns a single tuple. It's then really simple to manage.

Note you can also create 16 fields of CHAR(1) in table "user", and then you have similar advantages. Perhaps this is less usable; it's for you to decide.
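To make the idea concrete, here is a small sketch. The table and column names are illustrative (I call the table "app_user" here because "user" is a reserved word), not from a real schema:

```sql
-- Hypothetical "user" table with privileges packed into one CHAR(16) field.
-- Position 1 = privilege P1, position 2 = P2, etc.; 'Y' = granted, 'N' = not.
CREATE TABLE app_user (
    user_id     int4 PRIMARY KEY,
    last_name   varchar(40),
    privileges  char(16) DEFAULT 'NNNNNNNNNNNNNNNN'
);

-- Grant privilege P3 to user 42: overwrite the 3rd character.
UPDATE app_user
SET privileges = substr(privileges, 1, 2) || 'Y' || substr(privileges, 4)
WHERE user_id = 42;

-- Check privilege P3 for user 42 with a single one-tuple select.
SELECT substr(privileges, 3, 1) = 'Y' AS has_p3
FROM app_user
WHERE user_id = 42;
```

No join with a "rights" table is needed: one select, one tuple.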

The 20/80 rule

During my studies, a well-advised economics teacher told me about the 20/80 rule. It's a kind of magic rule which Vilfredo Pareto (1848-1923), an Italian economist, found. It simply means that 20% of something often represents 80% of another.

You can apply this rule to databases. I am always surprised to find I'm correct when I tell a customer that 20% of their queries create 80% of the load. I mean, 20 out of 100 queries are the ones which client-side applications call most often.

Naturally, you'll have to gather this 20% of your queries in a short list, and concentrate your efforts on tuning the database to make them run faster. That's just a good way to reach our main goal faster: make the client-side application run faster.

You may ask yourself how to identify this 20% of queries? If this is the case, you are mostly not working with the hackers of your business application :-)... Well, just joking. This is the first way to know which queries are called most often: ask your hackers, they surely know.

You are not sure they are correct? Just put spies everywhere. It's really simple. Just define a debug mode in your sources, if you have not yet done this (gasp!). Number each query. I mean, give each query in the code a single number, or identifier. In your database, create a table with a single column: CREATE TABLE query_spy (queryid INT);. In the debug mode, each time you execute a query, make an INSERT into the table query_spy too, with the number of the query. Put your database at production level and your application in debug mode. Wait for some significant cycles of the application's life. When you think a sufficient sample has been collected, put your application back into normal mode.
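The spy machinery itself is only this (query number 17 is of course illustrative):

```sql
-- The spy table: one row per query execution, identified by its number.
CREATE TABLE query_spy (queryid INT);

-- In debug mode, just before running query #17, the application
-- also logs the call:
INSERT INTO query_spy VALUES (17);
```

The logging INSERT is cheap, and it only exists in debug mode, so production runs pay nothing.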

Now you have the table query_spy full of records, with very good information for us: each time a query has been called, the call has been clearly logged in there. Make a query on it to know which queries are called most often. For example:

select queryid, (count(*)*100/(select count(*) from query_spy))||'%'
from query_spy
group by queryid
order by count(*);

This will be very explicit and help you classify which queries are called most often. Now we can go on to the next section.

Indexes workaround : principles

Indexes surely are the best way to improve the performance of your queries. I have already seen queries gaining a 50:1 speedup ratio with good indexes.

Keep in mind that indexes are *not* costless. The time needed by PostgreSQL's backend to maintain an index has to be considered. So? Yes, sometimes you are better off dropping all indexes on a table! Sometimes, you'd better not create more indexes, because they will require costly treatments to be kept up to date.

Indices workaround : Little tables

Many customers use "reference" tables. Some call them "label" tables, others "description" tables, etc. Here is a classic example:

CREATE TABLE "label" (
    "label_id" int4 UNIQUE DEFAULT nextval('label_seq') NOT NULL,
    "label_labeltext" character varying(40)
);

In this table, consider we have 130 labels. We know these labels aren't updated, except on version changes of our client software... If you copy-paste this example into a sample database, you'll see PostgreSQL is creating a default index on label_id. What you have to do is drop the index.
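A minimal sketch of the drop. The index name below is PostgreSQL's usual auto-generated one for the UNIQUE constraint, so check it first with \d label; on versions where the index is tied to the constraint, drop the constraint instead:

```sql
-- Typical auto-generated name for the UNIQUE index (verify with \d label):
DROP INDEX "label_label_id_key";

-- Alternative, where the index backs a constraint:
-- ALTER TABLE "label" DROP CONSTRAINT "label_label_id_key";
```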

Why? You need to know a little about databases for that. When a query is sent to the server, it is first parsed. Parsing is syntax checking only. If the syntax is not okay, the process stops and the corresponding error messages are sent. If the parsing is okay, to simplify, the backend will figure out how to return the data corresponding to the query. During this process, the backend analyses, in terms of cost, which way is the best. The backend will choose the cheapest way to return data. So, with little tables, the table scan is almost always chosen.

It is more effective to make a full table scan on a little table than to use an index. So now, if you drop all indices on a little table, you gain the time spent choosing the best way: the optimiser doesn't have to compute costs or whatever; it will see there is no other way than making full table scans, which we know is the best way to go in this case.

Indices workaround : Big tables

With big tables it is not so simple. You'll have to analyse each one of the 20% of queries you listed in the previous chapters. You will have to watch the WHERE clause of your queries. You have to consider whether each field in the WHERE clause is indexed or not. There are many possibilities:

The WHERE clause is simple: you make JOINs (inner or not) between tables using their primary keys. For example, you join table A and table B using A.PKEY = B.FKEY_A. On table A it's okay, you should have an index on the primary key (unless it is a small table, see the previous chapter). On table B, you have to verify you have an index (not unique!) on B.FKEY_A. Index this field if it is not yet indexed;

The WHERE clause is the same as above, but you also have a field in table A that is used too. Index this field too. You'll have to figure out whether you get better results creating an index on this field alone or combining it with the primary key field. Depending on the circumstances, you'll have better results with one or the other.
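A sketch of the two options, with illustrative table and column names:

```sql
-- The join A.PKEY = B.FKEY_A, plus a filter on A.some_field.
-- First, make sure the foreign-key side is indexed (not unique!):
CREATE INDEX b_fkey_a_idx ON b (fkey_a);

-- Option 1: an index on the filtered field alone.
CREATE INDEX a_some_field_idx ON a (some_field);

-- Option 2: a composite index combining it with the primary key.
CREATE INDEX a_pkey_some_field_idx ON a (pkey, some_field);
```

Compare both options with EXPLAIN on your real queries, and keep the winner.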


Once you think all your indices are okay, you'll probably notice performance improvements. In case you have a doubt, you can trace which execution plan is chosen to return the data with the EXPLAIN command.

Use EXPLAIN extensively to create indices, for example. Here is one example of a tuning session I made on a production database:

First, I noticed a query which I considered suspicious:

mytestdb=# EXPLAIN SELECT t12_bskid, t12_pnb, t12_lne, t12_tck
mytestdb-# FROM t12
mytestdb-# WHERE t12_bskid >= 35
mytestdb-# ORDER BY t12_bskid, t12_pnb, t12_tck, t12_lne;

Sort (cost=1348.70..1348.70 rows=8565 width=16)
  -> Seq Scan on t12 (cost=0.00..789.20 rows=8565 width=16)

Sequential sorts are used to satisfy the ORDER BY clause. I decided to create an index on the columns used to sort the results.

mytestdb=# create index t12_bskid_pnb_tck_lne on t12 (t12_bskid, t12_pnb, t12_tck, t12_lne);


Now, you think that's okay? Look at the next EXPLAIN plan on this query:

mytestdb=# EXPLAIN SELECT t12_bskid, t12_pnb, t12_lne, t12_tck
mytestdb-# FROM t12
mytestdb-# WHERE t12_bskid >= 35
mytestdb-# ORDER BY t12_bskid, t12_pnb, t12_tck, t12_lne;

Sort (cost=1348.70..1348.70 rows=8565 width=16)
  -> Seq Scan on t12 (cost=0.00..789.20 rows=8565 width=16)

Ouch! Exactly the same result! This is all right... just because we didn't notify the optimizer of the existence of the new index. We have to compute statistics (Oracle folks may notice this ;-)):

mytestdb=# vacuum analyze t12;

And finally test the performance enhancement:

mytestdb=# EXPLAIN SELECT t12_bskid, t12_pnb, t12_lne, t12_tck
mytestdb-# FROM t12
mytestdb-# WHERE t12_bskid >= 35
mytestdb-# ORDER BY t12_bskid, t12_pnb, t12_tck, t12_lne;

Index Scan using t12_bskid_pnb_tck_lne on t12 (cost=0.00..2232.11 rows=25693 width=16)

We notice the optimizer now uses the index, which in this case was a pure benefit. But don't think indices are a solution to everything. Sometimes, you'll notice PostgreSQL doesn't use your index, even if you are sure there may be a benefit in using it (you still remark that the PG optimizer prefers sequential scans to your brand new index, ouch). How can you be sure of this kind of trap? Just modify the ENABLE_SEQSCAN variable.

mytestdb=# set enable_seqscan='false';

mytestdb=# set enable_seqscan='true';

With ENABLE_SEQSCAN set to FALSE, PG will use your index. In this mode, you are able to compute the query's cost with EXPLAIN. Then, setting it back to TRUE, do an EXPLAIN again, and see the difference...
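Put together, a typical checking session looks like this sketch (the query is the one from above; your costs will differ):

```sql
-- Force the planner away from sequential scans to see the index-scan cost...
SET enable_seqscan = 'false';
EXPLAIN SELECT t12_bskid FROM t12 WHERE t12_bskid >= 35;

-- ...then restore the default and compare the two EXPLAIN outputs.
SET enable_seqscan = 'true';
EXPLAIN SELECT t12_bskid FROM t12 WHERE t12_bskid >= 35;
```

If the forced index scan really is cheaper, the statistics may be stale: run VACUUM ANALYZE and look again.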

VACUUM your database

Here you won't learn much more than what you can find in the official documentation... Just a tip: remember that the more frequent your vacuums are, the less time they take to finish.

So? Personally, I've enhanced performance by making a simple VACUUM happen every 5 minutes, and a VACUUM ANALYZE every 15 minutes. It is quite easy to program this with a cron command. But note that this is an extreme example.
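As a sketch, such a schedule could look like this in the postgres user's crontab (the database name is illustrative, and remember this frequency is extreme):

```conf
# Plain VACUUM every 5 minutes...
*/5 * * * * vacuumdb mydb
# ...and VACUUM ANALYZE every 15 minutes.
*/15 * * * * vacuumdb --analyze mydb
```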

Let me show you the importance of vacuumdb with the pgbench program. You can find it in contrib/pgbench in the PostgreSQL sources; you may also have pgbench installed if you installed contrib, for example.

Beware: this example works only on PostgreSQL 7.2 and later, because it uses a new feature of PostgreSQL: the ability to run concurrent vacuums. This means PostgreSQL can now vacuum the database *without* locking resources.

First, we create a database for testing purposes:

$ createdb benchpg72


We initialize this test database with the -i option. Let's make a representative database with 1 million tuples using -s 10:

$ pgbench -i -s 10 benchpg72


930000 tuples done.

940000 tuples done.

950000 tuples done.

960000 tuples done.

970000 tuples done.

980000 tuples done.

990000 tuples done.

1000000 tuples done.


Be patient, this may take a while depending on your machine. This will create a roughly 160 MB database in your $PG_DATA/base/ directory tree.

Let's create a tiny script, mybench:

#!/bin/sh

DB=benchpg72
H=localhost
U=postgres
totxacts=10000

for c in 10
do
    t=`expr $totxacts / $c`
    psql -h $H -U $U -c 'vacuum analyze' $DB
    psql -h $H -U $U -c 'checkpoint' $DB
    echo "===== sync ======" 1>&2
    sync;sync;sync;sleep 10
    echo $c concurrent users... 1>&2
    pgbench -n -t $t -h $H -U $U -c $c $DB
done

Adapt the three variables (H, U, DB) depending on where your database is, the name of the owner, and the name of your database.

Run the script. Since this database is big, you may have to wait a while. Watch the load of your server.

You may now have an output that looks like this:

$ ./mybench
VACUUM
CHECKPOINT
===== sync ======
10 concurrent users...
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 10
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
tps = 66.064132 (including connections establishing)
tps = 66.104508 (excluding connections establishing)


Okay, as you can see, the server can make 66 transactions per second here. That is not bad; I made the test on my server, a bi-Pentium III 1 GHz with 512 MB RAM. If you run this same script again and again, you'll see pgbench is not perfectly accurate; I personally see differences of up to 15% either way... But pgbench is precise enough to give one an idea of what the performance of a server is.

Running the script again, I got tps = 43.337163 (excluding connections establishing)! This is okay... just because the first time you use a PostgreSQL database, all PG resources are free (shared memory, buffers, etc.).

Try re-running again: you'll see the TPS values are quite the same, but never as high as in the 1st test...

Now, create a script which may look like this:

#!/bin/sh
# cron-vacuum

DB=benchpg72
WAITFOR=30

while [ 1 ]
do
    echo -n "`date +'%Y-%m-%d %H:%M:%S'` "
    echo `vacuumdb --analyze $DB`
    sleep $WAITFOR
done


Now, in another terminal, run this script:

$ ./cron-vacuum

And in the first terminal, run the bench script again *while* the cron-vacuum runs.

Here are the results I got with WAITFOR=30 (seconds):

$ ./mybench
VACUUM
CHECKPOINT
===== sync ======
10 concurrent users...
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 10
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
tps = 59.111491 (including connections establishing)
tps = 59.144385 (excluding connections establishing)

I re-ran this method with decreasing values for the WAITFOR variable. Here are the results:

WAITFOR value | tps (excluding connections establishing)
           30 | 59.111491
           20 | 56.651573
           10 | 55.113913
            5 | 57.637565
            3 | 58.177821
            2 | 54.16857
            1 | 57.595908
            0 | 61.206833

tps: Transactions Per Second

Those results are a bit disturbing, aren't they? There's no simple explanation. The performance gained by concurrent vacuum is certain. Just remember that it is not true that the more vacuums you do, the more efficient your database is.

Conclusion: vacuumdb is really useful for PostgreSQL databases, since it cleans the database and then permits faster management of it by the PostgreSQL "kernel". Concurrent vacuums, which appeared with version 7.2, are a great enhancement, since they allow you to gain much time when running big processes, typically nightly batches.

There is some discussion about making the vacuum process a background process of PostgreSQL, so DBAs won't have to care about it anymore. However, you probably won't need to vacuum your entire database every 3 seconds.

Beware that vacuumdb is *not* costless. You'll have to think about when to do your vacuums, and remember this is useful on databases whose activity is really high. pgbench makes the activity of the database huge. Depending on your application, you may need a vacuum only once a month (data warehouses, for example), or hourly (web applications, for example)...

Since PostgreSQL v7.4, Matthew T. O'Connor has added a new contrib: pg_autovacuum. This program allows automatic VACUUM. Let's talk about it in the next chapter.

pg_autovacuum to avoid crontab managed vacuums

A new contrib is now shipped in PostgreSQL 7.4. It allows everyone to forget about vacuuming databases, since this useful new feature vacuums your database when it has to be vacuumed. My thanks go to Matthew T. O'Connor, since he gave everyone a simple and efficient way to forget about vacuum management of PostgreSQL databases.

I sincerely encourage you to read more about this feature: have a look in the contrib directory, under pg_autovacuum. If you read the entire README.pg_autovacuum, you don't have to read what follows here, since I just copy/paste/comment this README.

About the benefits of using this contrib, note that:

The primary benefit of pg_autovacuum is that the FSM and table
statistic information are updated nearly as frequently as needed.
When a table is actively changing, pg_autovacuum will perform the
VACUUMs and ANALYZEs that such a table needs, whereas if a table
remains static, no cycles will be wasted performing this work.

A secondary benefit of pg_autovacuum is that it ensures that a
database wide vacuum is performed prior to XID wraparound. This is an
important, if rare, problem, as failing to do so can result in major
data loss. (See the section in the _Administrator's Guide_ entitled
"Preventing transaction ID wraparound failures" for more details.)

About how it works, remember that pg_autovacuum performs either a VACUUM ANALYZE or just an ANALYZE, depending on the mixture of table activity (insert, update, or delete):

If the number of (inserts + updates + deletes) > AnalyzeThreshold, then
only an analyze is performed.

If the number of (deletes + updates) > VacuumThreshold, then a
vacuum analyze is performed.

The number of such events is determined by the "Query/Index Statistics Collector" in the PostgreSQL backend. So you'll have to set the following parameters in postgresql.conf:

stats_start_collector = true

stats_row_level = true

It's a libpq client program coded in C. This means that you'll have to compile it and make it run in daemon mode just after you start the PostgreSQL server. As the README suggests:

If you have a script that automatically starts up the PostgreSQL
instance, you might add in, after that, something similar to the
following:

sleep 10 # To give the database some time to start up
pg_autovacuum -D [other arguments]

The complete list of parameters is given in the README. I have not yet tested all the parameters. Since Matthew put in some default parameters, I assume those defaults are efficient enough :-)

Now you see how this new contrib can be useful... If you don't, read the previous chapter; it relates how vacuuming PostgreSQL had to be set up in previous PostgreSQL versions: crontab...

Should you disable fsync?

The fsync option in postgresql.conf is very interesting, since it can multiply your overall PostgreSQL server performance by up to 2x. But it is risky. Here is the explanation given in the "Practical PostgreSQL" O'Reilly book:

The fsync-disabling option. Using this increases performance at the risk
of data corruption in the event that the operating system or physical
hardware crashes unexpectedly. Be sure you know what you are doing
before you use it.


Wow! What you surely think is: "this man is mad. I'll never use an option which may result in crashing all my database". You'll probably be right, depending on what you use your server for.

What is fsync? If fsync is enabled (the default option), it's the PostgreSQL backend which decides when to flush data to disk, for example when a transaction is committed and the changes have to be effectively written to disk. If fsync is disabled, then you trust your system: it's the system which decides when to flush data to disk, not PostgreSQL alone. Then, there is always data from committed transactions in your system's memory. If a power outage occurs, this data is lost!
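The switch itself is a one-line change in postgresql.conf (shown here in the 7.x-era syntax; weigh the risk above before touching it):

```conf
# postgresql.conf
fsync = false    # trust the OS to flush; faster, but committed data still
                 # sitting in OS memory is lost on a power failure
```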

I participated in a project based on Jabber; I was the DBA of the team. I faced a big problem: we had performance issues. I tuned PG as far as I could, but still had problems. Let me explain the project. Jabber is an instant messaging system, just like ICQ, AIM or MSN. It's free software. We built the backend with PostgreSQL and developed a new load balancer with proxy functions for Jabber.

The database is used to store all the data about users: friend list (called "buddies" here), profile, behaviours, etc. And offline messages. Since the whole system is based on redundant devices, with arrays of redundant servers, with no possible power outage, I finally had to disable fsync to allow performance to reach the customer's requirements. But in fact, I just cut the database in two parts, with two concurrent PostgreSQL instances. One is running with fsync true (the database with persistent and important data), the other one with fsync false (the one with offline messages only).

Then, in the extreme case where a crash or power outage occurs (none since January 2001), our client only loses offline messages sent by users.

As a DBA, I can't stand to lose any data, but, given the business requirements, I had to deal with fsync. Sure, our customer knows everything about it, and he accepted this risk to meet his requirements.

That's exactly what I want you to understand: you have to balance all arguments and make your own choice, depending on your particular situation.

Here's a comparison with fsync=true and fsync=false (with the previous script "mybench", in the same conditions (scale factor=10)):

TEST Number | fsync=false (tps) | fsync=true (tps)

1 | 88.694718 | 66.326287
2 | 104.763422 | 59.251771
3 | 107.409698 | 56.862442

tps: Transactions Per Second

No need to say more. The fsync parameter really improves performance when it is disabled (set fsync=false in postgresql.conf). As I told you before, here we are already at a 2x speed improvement. But beware of the load, too... The server load increases dramatically with fsync=false; on my test server, the load simply doubles.

Now it's up to you to decide whether to disable fsync or not.

Archiving your data

This chapter is probably the one which really requires all your knowledge of your database. You have to bear in mind that a database is faster when the tables are little.

Then, you will have to search for ways to make your tables smaller. One of the ways I know which works very well is to separate data into two groups. In the first one, you'll place all your "alive" data, and in the other one you'll put your "dead" data.

For example, you may have a database with invoices or other business documents stored in it. Think whether there is a way to separate out business documents which are old; let's say you consider these to be documents finalised more than a month ago. So you'll create a table "arch_invoices", with the same structure as the table "invoices". You'll create a stored procedure whose job will be, each night for example, to search for such invoices in table "invoices" and move them into the table "arch_invoices". Beware of integrity constraints, if you use them.
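A minimal sketch of such a nightly routine, assuming a hypothetical finalised_date column on "invoices" (the column name is illustrative; the single-quoted function body is the pre-8.0 style):

```sql
-- Nightly archiving sketch: move invoices finalised more than a month ago
-- into arch_invoices (same structure as invoices). Returns the row count.
CREATE OR REPLACE FUNCTION archive_invoices() RETURNS integer AS '
DECLARE
    moved integer;
BEGIN
    INSERT INTO arch_invoices
        SELECT * FROM invoices
        WHERE finalised_date < now() - interval ''1 month'';
    GET DIAGNOSTICS moved = ROW_COUNT;
    DELETE FROM invoices
        WHERE finalised_date < now() - interval ''1 month'';
    RETURN moved;
END;
' LANGUAGE plpgsql;
```

Call it from cron each night; inside one transaction, the INSERT and DELETE either both happen or neither does.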

Then, you'll have to manage this new system in your application, permitting users to switch from the active database to the archive database. This is an old tip that you have surely used before. In my country, like surely in yours, there's a proverb that says « Divide and conquer ».

You can apply this principle all over. I know of a supermarket application where all the cash registers send the articles and tickets data. I know that every day the application creates a new set of tables! This way, the transactions are really fast. There are stored procedures to aggregate the data during the night when the store is closed, etc.

You are also able to place these archive tables physically on another disk, using symbolic links. Read Bruce Momjian's article about hardware tuning of PostgreSQL to know more about it. In this manner, you're also lightening the I/O on your database.

Use stored procedures

Your application probably always uses the same SQL queries, because there are identified processes. For example, the creation of a new invoice requires multiple INSERT and/or UPDATE statements. If you are working in a client/server environment, this invoice creation process will generate a lot of network traffic.

What I would do is use a stored procedure to create this invoice. The client-side application will then just call the function to create the invoice. You'll just have to give the arguments to the function, and it will return some reference/id you need to continue the client-side process.
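As a sketch, such a function could look like this (table, column and sequence names are illustrative; positional parameters with ALIAS are the pre-8.0 style):

```sql
-- Invoice-creation sketch: one network round trip instead of several.
CREATE OR REPLACE FUNCTION create_invoice(integer, numeric) RETURNS integer AS '
DECLARE
    cust   ALIAS FOR $1;
    amount ALIAS FOR $2;
    new_id integer;
BEGIN
    new_id := nextval(''invoice_seq'');
    INSERT INTO invoices (invoice_id, customer_id, total)
        VALUES (new_id, cust, amount);
    UPDATE customers SET last_invoice = new_id
        WHERE customer_id = cust;
    RETURN new_id;  -- the reference/id the client needs to continue
END;
' LANGUAGE plpgsql;

-- Client side, a single call:
-- SELECT create_invoice(42, 99.90);
```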

Other data treatments can benefit from stored procedures. The first ones are batch treatments. For example, each day at 6pm, when the day's activity report needs to be done, you (or cron?) start the report job. Typically it's a batch job: there are many calculations, many queries, etc. Here again you'll generate a lot of network traffic, and probably the client side will compute more slowly than the server! So here again, try to use stored procedures to save time in your client application.

Remember also that using stored procedures is a good method to centralise your code, making widespread application installation easier, with changes applied automagically for client-side applications. This is generally not a problem for internet applications that centralise code with other methods, but it is interesting for client/server applications.

Finally, stored procedures will radically improve the security of your applications, as you avoid SQL injections. I'll write more on this topic later.


I hope you'll find here at least some ideas to help you tune your database. As you may now know, tuning a database is a matter of understanding how an RDBMS works.

You'll probably notice too that database tuning is not so different from cooking: it's a matter of taking the time to search for and test different methods, like a chef will test different tastes.

Finally, if you have any ideas, tips or hints to tell us, please email me your contributions, as I'll maintain this document with your comments.

Thanks a lot for using PostgreSQL, the most Advanced Open Source Database.


I'd like to thank all the hackers at PostgreSQL for providing such a good RDBMS. Thanks to all contributors to this project. There are so many...

Thanks to all the people on the PostgreSQL mailing lists for the good support they give to us all. Thanks for their open minds and their right choices.

Special thanks to Justin Clift, who maintains the PostgreSQL Technical Documentation site that gives us so many good technical docs related to PostgreSQL.

Thanks to all of you who read this article and helped me to enhance it.