PostgreSQL and Oracle
Recently, driven by the ever more exorbitant prices of Oracle RDBMS components, many companies
started looking for an alternative platform that would help them decrease the license cost of the Oracle
RDBMS. Of course, as it became customary these days, the first place to look for an alternative is the
open source community. Open source community maintains several open source databases, the most
popular among them are MySQL and PostgreSQL. With Oracle Corp. acquisition of SUN
Microsystems, Oracle also acquired MySQL. There are other versions, like MariaDB and Drizzle, but
the general feeling is that MySQL is no longer a serious alternative. That leaves us with PostgreSQL.
This article strives to clarify PostgreSQL strengths and weaknesses for such purpose. I will refrain from
giving a verdict of the database suitability as a general Oracle replacement. It depends on the particular
project and the nature of the database that is being replaced. Overall impression is that Oracle is an 18-
wheeler, while PostgreSQL is a half ton truck. Each is useful its own area, but one cannot realistically
expect to replace one by another.
Characteristics of PostgreSQL
PostgreSQL is, in a way, very similar to Oracle. That similarity is intentional, as there are several
commercial products which use that similarity for marketing purposes. The most frequent complaints
of the Oracle developers, when trying out open source databases, are the problems with the CONNECT
BY clause, subquery factoring and analytic functions, also known as “window functions” to
PostgreSQL developers. PostreSQL supports “subquery factoring” or “Common Table Expressions”
(CTE), known to us oraclites as “WITH clause”. It also supports very decent set of window functions
like row_number, ntile, rank, dense_rank, lead, lag and some others.
Furthermore, recursive “WITH” is a very good replacement for Oracle's “CONNECT BY” query.
Incidentally, Oracle supports recursive “WITH” queries as of version 11.2. Below are both Oracle and
PostgreSQL versions of the same recursive query:
1 with e(empno,mgr,lev) as (
2 select empno,mgr,0 from emp where empno=7839
3 union all
4 select emp.empno,emp.mgr,e.lev+1
5 from emp,e
6 where emp.mgr=e.empno)
7* select * from e
EMPNO MGR LEV
7566 7839 1
7698 7839 1
7782 7839 1
7499 7698 2
7521 7698 2
7654 7698 2
7788 7566 2
7844 7698 2
7900 7698 2
7902 7566 2
7934 7782 2
7369 7902 3
7876 7788 3
14 rows selected.
Oracle requires “UNION ALL” and prohibits the use of “LEVEL” keyword, so I replaced it by “LEV”.
The result is essentially the same as if we did a self join with “CONNECT BY prior empno= mgr”.
PostgreSQL can do the same thing:
mgogala=# with recursive e(empno,mgr,level) as (
mgogala(# select empno,mgr,0 from emp where empno=7839
mgogala(# select emp.empno,emp.mgr,e.level+1
mgogala(# from emp,e
mgogala(# where emp.mgr=e.empno)
mgogala-# select * from e;
empno | mgr | level
7839 | | 0
7566 | 7839 | 1
7698 | 7839 | 1
7782 | 7839 | 1
7499 | 7698 | 2
7521 | 7698 | 2
7654 | 7698 | 2
7788 | 7566 | 2
7844 | 7698 | 2
7900 | 7698 | 2
7902 | 7566 | 2
7934 | 7782 | 2
7369 | 7902 | 3
7876 | 7788 | 3
PostgreSQL doesn't object to my using the word “level” in the query, it's not a keyword in PostgreSQL.
It also doesn't require “UNION ALL”, like Oracle. The query is essentially the same, which simplifies
porting all those proprietary “CONNECT BY” constructs to PostgreSQL. Similarity between the SQL
syntax of Oracle and PostgreSQL is remarkable and should be of great help to Oracle developers,
tasked with developing an application using PostgreSQL database.
PL/PgSQL – Procedural extension
PostgreSQL procedural extension, PL/PgSQL is very similar to Oracle's own PL/SQL, but much
smaller. Again, this similarity in both name and syntax is intentional. However, differences are very
significant and there is no hope of using Oracle PL/SQL objects in the Postgres database. A complete
rewrite is usually required. This rewrite is , however, helped to a significant degree by the similarity of
the two languages.
Here are some of the most notable differences. PL/PgSQL functions cannot contain “COMMIT”
statement and procedures or packages are not supported, only functions. Until recently, when “DO”
statement was introduced in PostgreSQL 9.0, it wasn't possible to execute an anonymous PL/PgSQL.
PostgreSQL, however, is multi-lingual. It is possible to use Perl or Python as internal languages and
write functions or triggers in those languages. There is no collection of administrative and utility
packages like DBMS_LOB, DBMS_SPACE, UTL_FILE or UTL_MAIL.
PL PgSQL is not as powerful as PL/SQL, but it can definitely be used for enforcing a complex
business logic. On the other hand, coding rules are slightly different from Oracle. Below is an actual
trigger function that was used for partitioning in a live, production database:
CREATE FUNCTION moreover_insert_trgfn() RETURNS trigger
V_INS VARCHAR(512):='insert into moreover.moreover_documents_y';
V_VALS VARCHAR(128):=' VALUES($1.*)';
SELECT EXTRACT(MONTH FROM NEW.created_at) INTO V_MONTH;
SELECT EXTRACT(YEAR FROM NEW.created_at) INTO V_YEAR;
IF (V_MONTH>=10) THEN
execute V_INS USING NEW;
This function computes the partition into which the row needs to be inserted, based on the month and
year in the CREATED_AT field in the incoming row and executes insert using the variable “NEW” as
a bind value. Oracle is much more rigorous when it comes to coding and doesn't easily allow tricks
like this. PostgreSQL doesn't compile PL/PgSQL code, it stores it as a text literal. The code is parsed
for each session that executes it and cached in the private memory of the parsing process. That means
that there is no code sharing. If one process parses a PL/PgSQL function, other processes will not
benefit from it because PostgreSQL doesn't have anything even remotely like Oracle shared pool. That
applies to all procedural extension to PostgreSQL, not just PL/PgSQL. In other words, database
functions written in Perl or Python are also stored as text literals, parsed as needed and cached on per
Partitioning in PostgreSQL leaves much to be desired. It is essentially the same technology used in
Oracle7, with partition views. In PostgreSQL, partitioning is based on the inheritance mechanism.
There is one “parent” table and partitions are separate tables that “inherit” the table structure from the
“parent”. Rows are routed into different partitions using triggers. PostgreSQL can do partition
elimination, but only during the parse phase. That means that it will not do partition pruning with an
already parsed SQL and a bind variable. Also, there is no global indexes, global constraints or any
other mechanism except of the very limited partition pruning. There is no global statistics, so the
optimizer cannot do anything on the table level, it has to descend to partition level and optimize the
SQL on per partition basis. Frequently, optimizer is not up to the task, so using partitions results in
horribly inefficient queries that take a very long time to complete. Partitioning can be useful, but it
requires a very careful design, which involves careful data model consideration and extremely careful
query design. There is no parallel query or parallel DML, not even for the partitioned tables.
PostgreSQL has a very sophisticated cost based optimizer, somewhat different from Oracle's. In
addition to the value histograms, that both databases collect, PostgreSQL also compiles the list of most
frequent column values. The number of the most frequent values and the size of the histograms is
determined by the default_statistics_target parameter and can range from 0 to 10000, much greater
range than Oracle's. In version 9, it is possible to set the histogram size for each table column
separately, using the “ALTER TABLE” statement. The same option is also present in Oracle 11g,
which allows the DBA to set the statistics preferences on the table level.
As all cost based optimizers, PostgreSQL optimizer is also sometimes counter intuitive and stubbornly
refuses to use an index. To alleviate the pain, database designers have made available a myriad of “set
session mode” options which turn on and off particular access methods. There are no hints, which is a
very big problem for porting projects. If a query starts misbehaving, there is no way to just make a
decision for the optimizer to use the particular index, one has to re-engineer sizable portion of the data
model, just to get things running. This makes porting projects prohibitively expensive as one should
expect to be forced to change the data model and essentially rewrite the application from scratch. This
is particularly big problem with Java ORM software, like Hibernate, which tends to generate
cumbersome and complex queries, which can be very tricky to optimize, even under the best of
circumstances. Also, the lack of hints makes the application much harder to maintain. Every
application contains SQL that will start causing troubles over time, because of the underlying data
model changes or simply because of the growth. Oracle allows the developers to quickly plug in such
holes with hints, without necessarily having to re-engineer the entire data model. PostgreSQL doesn't
allow that choice. Performance problems are usually addressed by fixing the underlying data model,
which can be prohibitively expensive and require prolonged downtime.
Optimizer will not use indexes in case of running a parsed query with bind variables because it doesn't
know anything about the values of the bind variables. The attitude of the PostgreSQL designers toward
hints is extremely negative and it is unlikely that this very useful tool will be provided anytime soon. It
is a curiosity, because almost all other relational databases support hints in one form or another, except
the PostgreSQL. Performance of complex queries, involving many tables, can deteriorate dramatically,
especially when ported from Oracle, where such queries are customary. The enmity of the PostgreSQL
developers to hints is almost religious in nature. Curiously, some of the most vocal opponents of hints
work for EnterpriseDB, whose flagship product is a commercial fork of PostgreSQL which supports
PostgreSQL also doesn't support index organized tables or clusters. It does support bitmap indexes, but
this is a murky subject which is usually avoided like a plague by the PostgreSQL experts. Until version
9, the general advice was that bitmap indexes are used at one's own risk. There is no star schema or star
transformation or any other tool to help with data warehouse type tasks.
PostgreSQL optimizer is geared toward OLTP type applications which are not overly complex and
which do not include large tables, tens of gigabyte in size, or even larger. Of course, there are examples
like Facebook and Skype, which use PostgreSQL as their backbone, but these are still exceptional
things, done by top class professionals and very hard to achieve by any standard. Recently, Skype has
been acquired by Microsoft and they started using SQL Server instead of PostgreSQL. The lack of
parallelism and unwillingness to support hints makes PostgreSQL rather awkward when it comes to
porting applications from Oracle.
Also, PostgreSQL optimizer is rather poorly documented, there no books like those written by Jonathan
Lewis or Tom Kyte in the PostgreSQL world. There is no method for overriding PostgreSQL optimizer
decisions. PostgreSQL may be smart, but it's still not smarter than us humans. Very few optimizers can
adequately deal with correlated data. PostgreSQL optimizer cannot deal with that. What does that
mean, precisely? Let's say that we have a table with the information about a group of people. The
information includes the date of birth and zodiac sign. There is 1/12 probability that any given person is
a Capricorn. There is also 1/12 probability that any given person was born in May. The optimizer
would reach the conclusion that the probability that a person born in May has Capricorn as his zodiac
sign is 1/12 * 1/12 = 1/144. That is still a hefty percent that might result in the full table scan. The
reality is, however, that people with Capricorn zodiac sign are born in January, not in May. Such query
would return no results and PostgreSQL cannot know anything about it. Correlated and interdependent
data columns are a very frequent occurrence, especially with financial data. This is far more frequent
situation than it appears. And the conclusion that the optimizer can reach is always incorrect. Having
no means to override optimizer decision means that we would have to remove all interdependent
columns from our data model, just as the 3
normal form would require. That is a lot of work and time,
especially if the application that is being ported is modestly complex and contains foreign keys,
interdependent columns, triggers, views and procedural code.
MVCC, Vacuum and IO architecture
To start this part, it is worth noticing that PostgreSQL tables are files. Each table is comprised of one or
more files. When the table size exceeds 1GB, a new file is started. PostgreSQL also relies on the file
system for block prefetch. PostgreSQL doesn't do asynchronous I/O or direct I/O. It also doesn't have
any built in parallelism. Now, let's discuss Postgres variant of multi-versioning concurrency control,
MVCC for short.
PostgreSQL, unlike Oracle RDBMS, leaves the old values for the modified row in the table itself. It
doesn't have anything like UNDO segments or UNDO tablespace, PostgreSQL just creates a new
version of the row in the same table. Removal of the old rows from the table is done by process named
vacuuming. Vacuuming also analyzes the table and gathers statistics. PostgreSQL can vacuum
automatically and fairly aggressively, depending on the parameter settings. Despite that, old row
versions can bloat the table significantly and slow down the I/O operations. On the plus side, there is
nothing like “ORA-01555: Snapshot too old” in the PostgreSQL world.
Generally speaking, Oracle I/O is much more sophisticated and noticeably faster than PostgreSQL.
Here is the result from the same machine and the table that was copied from Oracle to PostgreSQL
using a Perl script.
Oracle Database 11g Enterprise Edition Release 184.108.40.206.0 -Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> set timing on
SQL> select count(*) from t1;
Timing is on.
mgogala=# select count(*) from t1;
Time: 856.847 ms
The table is extremely simple and was created just for this comparison:
mgogala=# \d t1
Column | Type | Modifiers
| bigint | not null
c2 | character varying(30) |
"t1_pk" PRIMARY KEY, btree (c1)
In other words, oracle reads 3M very simple records in 620 milliseconds while PostgreSQL spends 856
milliseconds on the same table. The difference is rather significant. Both databases are running on the
same machine, so the comparison is valid. PostgreSQL performance mailing list is full of complaints
about the slowness of the full table scan but the complaints are not being addressed. The usual advice is
to tune the IO subsystem. The bigger the table gets, the bigger the difference in full table scan speed is.
There is no difference whatsoever when reading tables like EMP and DEPT. If the table is frequently
updated, PostgreSQL full table scan gets even slower. The “tune IO subsystem” advice is often
impractical and requires significant investments in hardware. Frequently, the advice includes solid
state disks, better controllers, more CPU power or something of that nature, that would make most of
the CIOs very unhappy.
Every relational database has a data dictionary. Oracle exposes its own data dictionary through the
series of views, which are rather simple and well documented. Postgres dictionary is much more
cryptic and much less documented than Oracle's. For instance, the pg_constraints table does not contain
the id of a referenced primary key or unique key constraint for foreign key constraints. It contains a
column list, which makes navigation through the dependency graph much harder.
Monitoring tools are also significantly less usable than the tools available for Oracle. First and
foremost, the fact that PostgreSQL has no shared pool means that it is not possible to see the execution
plan of a SQL statement being executed by some application. It is possible to run explain plan, but one
never knows what knobs were turned in the application itself and whether the execution environment is
the same for the person monitoring as for the executing application being monitored. A positive thing is
that it is possible to send a “cancel query” signal to a session, without killing the session.
There is no such thing as performance trace, there is no wait event interface. PostgreSQL tuning is still
a form of black magic, not an exact methodology. For Oracle, the methodology was laid out in Cary
Millsap's groundbreaking book “Optimizing Oracle Performance”. Basically, Cary Millsap and Anjo
Kolk introduced method and measurement to performance tuning. One could see where the time is
spent, what kind of event does the application wait for and what to do about it. PostgreSQL has not
developed any such methodology, PostgreSQL tuning is still a black art, of the kind that would get
people burnt at stake in some less happy times. There are some extensions which provide execution
plans for all SQL statements which execute longer than an preset time, but that is very messy indeed.
PostgreSQL log file parsers like PgFouine leave a lot to be desired.
PostgreSQL also has a problem with many connections. One should not try having 300 or so direct
simultaneous connections to a PostgreSQL cluster (a PostgreSQL term for what Oracle calls
“instance”). In such cases, a pooling tool like PgPool-II or PgBounce is a must. To tell the truth, these
two tools do a good job and allow PostgreSQL to scale to very large number of connections.
PostgreSQL can maintain a l standby database, which is extremely useful for high availability and
performance, because reporting can be offloaded to the standby. PostgreSQL standby database is open
for reading, which is similar to the Oracle Active Data Guard, a separately licensed option, available
only with the version 11G.
PostgreSQL is a nice little open source database which can be used for offloading internal application
systems from Oracle. It is very far from being on the same level with Oracle and it cannot replace
Oracle in the enterprise server room, not by any stretch of the imagination. The main problem of
PostgreSQL is the attitude of its developers who are resisting providing features like hints, demanded
by the corporate DBA personnel. That results in very minor usage. Recent MongoDB user group had
many more users and sponsors than the 2011 PgEast in New York City, which was a major flop. It was
that attitude that made me cancel the ambitious porting projects that the company I used to work for has
had. Simply said, providing the necessary level of service and performance wasn't possible in the
allotted time and with the resources I had at my disposal. In order to become really useful PostgreSQL
would need to have the following:
Wait event interface
Much improved partitioning.
Some form of parallelism, at least for the partitioned tables.
While it is true that there are commercial versions of PostgreSQL (EnterpriseDB, VoltDB, Vertica,
Greenplum, Netezza) providing the solution to some of the open source version problems, it is true that
the open source version is rather limited in its present state. If we go commercial, we have to include a
whole lot of other products, like Microsoft SQL Server and IBM DB2 into consideration, which is a
completely different story. I have no experience with the commercial versions of PostgreSQL and I
cannot make any claims about those versions.