PostgreSQL and Oracle Introduction

disturbedoctopusData Management

Nov 27, 2012 (5 years and 1 month ago)

610 views

PostgreSQL and Oracle
Introduction
Recently, driven by the ever more exorbitant prices of Oracle RDBMS components, many companies

started looking for an alternative platform that would help them decrease the license cost of the Oracle

RDBMS. Of course, as it became customary these days, the first place to look for an alternative is the

open source community. Open source community maintains several open source databases, the most

popular among them are MySQL and PostgreSQL. With Oracle Corp. acquisition of SUN

Microsystems, Oracle also acquired MySQL. There are other versions, like MariaDB and Drizzle, but
the general feeling is that MySQL is no longer a serious alternative. That leaves us with PostgreSQL.

This article strives to clarify PostgreSQL strengths and weaknesses for such purpose. I will refrain from

giving a verdict of the database suitability as a general Oracle replacement. It depends on the particular

project and the nature of the database that is being replaced. Overall impression is that Oracle is an 18-
wheeler, while PostgreSQL is a half ton truck. Each is useful its own area, but one cannot realistically

expect to replace one by another.
Characteristics of PostgreSQL
PostgreSQL is, in a way, very similar to Oracle. That similarity is intentional, as there are several

commercial products which use that similarity for marketing purposes. The most frequent complaints

of the Oracle developers, when trying out open source databases, are the problems with the CONNECT

BY clause, subquery factoring and analytic functions, also known as “window functions” to

PostgreSQL developers. PostreSQL supports “subquery factoring” or “Common Table Expressions”

(CTE), known to us oraclites as “WITH clause”. It also supports very decent set of window functions

like row_number, ntile, rank, dense_rank, lead, lag and some others.
Furthermore, recursive “WITH” is a very good replacement for Oracle's “CONNECT BY” query.

Incidentally, Oracle supports recursive “WITH” queries as of version 11.2. Below are both Oracle and

PostgreSQL versions of the same recursive query:
Oracle 11.2:
1 with e(empno,mgr,lev) as (
2 select empno,mgr,0 from emp where empno=7839
3 union all
4 select emp.empno,emp.mgr,e.lev+1
5 from emp,e
6 where emp.mgr=e.empno)
7* select * from e
SQL> /
EMPNO MGR LEV
7839 0
7566 7839 1
7698 7839 1
7782 7839 1
7499 7698 2
7521 7698 2
7654 7698 2
7788 7566 2
7844 7698 2
7900 7698 2
7902 7566 2
7934 7782 2
7369 7902 3
7876 7788 3
14 rows selected.
Oracle requires “UNION ALL” and prohibits the use of “LEVEL” keyword, so I replaced it by “LEV”.

The result is essentially the same as if we did a self join with “CONNECT BY prior empno= mgr”.
PostgreSQL can do the same thing:
PostgreSQL 9.0:
mgogala=# with recursive e(empno,mgr,level) as (
mgogala(# select empno,mgr,0 from emp where empno=7839
mgogala(# union
mgogala(# select emp.empno,emp.mgr,e.level+1
mgogala(# from emp,e
mgogala(# where emp.mgr=e.empno)
mgogala-# select * from e;
empno | mgr | level
-------+------+------
7839 | | 0
7566 | 7839 | 1
7698 | 7839 | 1
7782 | 7839 | 1
7499 | 7698 | 2
7521 | 7698 | 2
7654 | 7698 | 2
7788 | 7566 | 2
7844 | 7698 | 2
7900 | 7698 | 2
7902 | 7566 | 2
7934 | 7782 | 2
7369 | 7902 | 3
7876 | 7788 | 3
(14 rows)
mgogala=#
PostgreSQL doesn't object to my using the word “level” in the query, it's not a keyword in PostgreSQL.

It also doesn't require “UNION ALL”, like Oracle. The query is essentially the same, which simplifies
porting all those proprietary “CONNECT BY” constructs to PostgreSQL. Similarity between the SQL

syntax of Oracle and PostgreSQL is remarkable and should be of great help to Oracle developers,

tasked with developing an application using PostgreSQL database.
PL/PgSQL – Procedural extension
PostgreSQL procedural extension, PL/PgSQL is very similar to Oracle's own PL/SQL, but much
smaller. Again, this similarity in both name and syntax is intentional. However, differences are very

significant and there is no hope of using Oracle PL/SQL objects in the Postgres database. A complete

rewrite is usually required. This rewrite is , however, helped to a significant degree by the similarity of

the two languages.
Here are some of the most notable differences. PL/PgSQL functions cannot contain “COMMIT”

statement and procedures or packages are not supported, only functions. Until recently, when “DO”

statement was introduced in PostgreSQL 9.0, it wasn't possible to execute an anonymous PL/PgSQL.

PostgreSQL, however, is multi-lingual. It is possible to use Perl or Python as internal languages and

write functions or triggers in those languages. There is no collection of administrative and utility

packages like DBMS_LOB, DBMS_SPACE, UTL_FILE or UTL_MAIL.
PL PgSQL is not as powerful as PL/SQL, but it can definitely be used for enforcing a complex

business logic. On the other hand, coding rules are slightly different from Oracle. Below is an actual

trigger function that was used for partitioning in a live, production database:
CREATE FUNCTION moreover_insert_trgfn() RETURNS trigger
LANGUAGE plpgsql
AS $$
DECLARE
V_MONTH SMALLINT;
V_YEAR VARCHAR(4);
V_INS VARCHAR(512):='insert into moreover.moreover_documents_y';
V_VALS VARCHAR(128):=' VALUES($1.*)';
BEGIN
SELECT EXTRACT(MONTH FROM NEW.created_at) INTO V_MONTH;
SELECT EXTRACT(YEAR FROM NEW.created_at) INTO V_YEAR;
IF (V_MONTH>=10) THEN
V_INS:=V_INS||V_YEAR||'m'||V_MONTH||V_VALS;
ELSE
V_INS:=V_INS||V_YEAR||'m0'||V_MONTH||V_VALS;
END IF;
execute V_INS USING NEW;
RETURN NULL;
END;
$$
This function computes the partition into which the row needs to be inserted, based on the month and
year in the CREATED_AT field in the incoming row and executes insert using the variable “NEW” as
a bind value. Oracle is much more rigorous when it comes to coding and doesn't easily allow tricks
like this. PostgreSQL doesn't compile PL/PgSQL code, it stores it as a text literal. The code is parsed
for each session that executes it and cached in the private memory of the parsing process. That means
that there is no code sharing. If one process parses a PL/PgSQL function, other processes will not
benefit from it because PostgreSQL doesn't have anything even remotely like Oracle shared pool. That
applies to all procedural extension to PostgreSQL, not just PL/PgSQL. In other words, database
functions written in Perl or Python are also stored as text literals, parsed as needed and cached on per
process basis.
Partitioning
Partitioning in PostgreSQL leaves much to be desired. It is essentially the same technology used in

Oracle7, with partition views. In PostgreSQL, partitioning is based on the inheritance mechanism.

There is one “parent” table and partitions are separate tables that “inherit” the table structure from the

“parent”. Rows are routed into different partitions using triggers. PostgreSQL can do partition

elimination, but only during the parse phase. That means that it will not do partition pruning with an

already parsed SQL and a bind variable. Also, there is no global indexes, global constraints or any
other mechanism except of the very limited partition pruning. There is no global statistics, so the

optimizer cannot do anything on the table level, it has to descend to partition level and optimize the

SQL on per partition basis. Frequently, optimizer is not up to the task, so using partitions results in

horribly inefficient queries that take a very long time to complete. Partitioning can be useful, but it

requires a very careful design, which involves careful data model consideration and extremely careful

query design. There is no parallel query or parallel DML, not even for the partitioned tables.
PostgreSQL optimizer
PostgreSQL has a very sophisticated cost based optimizer, somewhat different from Oracle's. In
addition to the value histograms, that both databases collect, PostgreSQL also compiles the list of most
frequent column values. The number of the most frequent values and the size of the histograms is
determined by the default_statistics_target parameter and can range from 0 to 10000, much greater
range than Oracle's. In version 9, it is possible to set the histogram size for each table column
separately, using the “ALTER TABLE” statement. The same option is also present in Oracle 11g,

which allows the DBA to set the statistics preferences on the table level.
As all cost based optimizers, PostgreSQL optimizer is also sometimes counter intuitive and stubbornly

refuses to use an index. To alleviate the pain, database designers have made available a myriad of “set

session mode” options which turn on and off particular access methods. There are no hints, which is a

very big problem for porting projects. If a query starts misbehaving, there is no way to just make a

decision for the optimizer to use the particular index, one has to re-engineer sizable portion of the data

model, just to get things running. This makes porting projects prohibitively expensive as one should

expect to be forced to change the data model and essentially rewrite the application from scratch. This

is particularly big problem with Java ORM software, like Hibernate, which tends to generate

cumbersome and complex queries, which can be very tricky to optimize, even under the best of

circumstances. Also, the lack of hints makes the application much harder to maintain. Every

application contains SQL that will start causing troubles over time, because of the underlying data

model changes or simply because of the growth. Oracle allows the developers to quickly plug in such

holes with hints, without necessarily having to re-engineer the entire data model. PostgreSQL doesn't

allow that choice. Performance problems are usually addressed by fixing the underlying data model,

which can be prohibitively expensive and require prolonged downtime.
Optimizer will not use indexes in case of running a parsed query with bind variables because it doesn't

know anything about the values of the bind variables. The attitude of the PostgreSQL designers toward

hints is extremely negative and it is unlikely that this very useful tool will be provided anytime soon. It

is a curiosity, because almost all other relational databases support hints in one form or another, except

the PostgreSQL. Performance of complex queries, involving many tables, can deteriorate dramatically,

especially when ported from Oracle, where such queries are customary. The enmity of the PostgreSQL

developers to hints is almost religious in nature. Curiously, some of the most vocal opponents of hints

work for EnterpriseDB, whose flagship product is a commercial fork of PostgreSQL which supports

hints.
PostgreSQL also doesn't support index organized tables or clusters. It does support bitmap indexes, but

this is a murky subject which is usually avoided like a plague by the PostgreSQL experts. Until version

9, the general advice was that bitmap indexes are used at one's own risk. There is no star schema or star

transformation or any other tool to help with data warehouse type tasks.
PostgreSQL optimizer is geared toward OLTP type applications which are not overly complex and

which do not include large tables, tens of gigabyte in size, or even larger. Of course, there are examples

like Facebook and Skype, which use PostgreSQL as their backbone, but these are still exceptional

things, done by top class professionals and very hard to achieve by any standard. Recently, Skype has

been acquired by Microsoft and they started using SQL Server instead of PostgreSQL. The lack of

parallelism and unwillingness to support hints makes PostgreSQL rather awkward when it comes to

porting applications from Oracle.
Also, PostgreSQL optimizer is rather poorly documented, there no books like those written by Jonathan

Lewis or Tom Kyte in the PostgreSQL world. There is no method for overriding PostgreSQL optimizer

decisions. PostgreSQL may be smart, but it's still not smarter than us humans. Very few optimizers can

adequately deal with correlated data. PostgreSQL optimizer cannot deal with that. What does that

mean, precisely? Let's say that we have a table with the information about a group of people. The

information includes the date of birth and zodiac sign. There is 1/12 probability that any given person is

a Capricorn. There is also 1/12 probability that any given person was born in May. The optimizer

would reach the conclusion that the probability that a person born in May has Capricorn as his zodiac

sign is 1/12 * 1/12 = 1/144. That is still a hefty percent that might result in the full table scan. The

reality is, however, that people with Capricorn zodiac sign are born in January, not in May. Such query

would return no results and PostgreSQL cannot know anything about it. Correlated and interdependent

data columns are a very frequent occurrence, especially with financial data. This is far more frequent

situation than it appears. And the conclusion that the optimizer can reach is always incorrect. Having

no means to override optimizer decision means that we would have to remove all interdependent

columns from our data model, just as the 3
rd
normal form would require. That is a lot of work and time,

especially if the application that is being ported is modestly complex and contains foreign keys,

interdependent columns, triggers, views and procedural code.
MVCC, Vacuum and IO architecture
To start this part, it is worth noticing that PostgreSQL tables are files. Each table is comprised of one or

more files. When the table size exceeds 1GB, a new file is started. PostgreSQL also relies on the file

system for block prefetch. PostgreSQL doesn't do asynchronous I/O or direct I/O. It also doesn't have

any built in parallelism. Now, let's discuss Postgres variant of multi-versioning concurrency control,

MVCC for short.
PostgreSQL, unlike Oracle RDBMS, leaves the old values for the modified row in the table itself. It

doesn't have anything like UNDO segments or UNDO tablespace, PostgreSQL just creates a new

version of the row in the same table. Removal of the old rows from the table is done by process named

vacuuming. Vacuuming also analyzes the table and gathers statistics. PostgreSQL can vacuum

automatically and fairly aggressively, depending on the parameter settings. Despite that, old row
versions can bloat the table significantly and slow down the I/O operations. On the plus side, there is

nothing like “ORA-01555: Snapshot too old” in the PostgreSQL world.
Generally speaking, Oracle I/O is much more sophisticated and noticeably faster than PostgreSQL.

Here is the result from the same machine and the table that was copied from Oracle to PostgreSQL

using a Perl script.
Oracle 11.2:
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 -Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> set timing on
SQL> select count(*) from t1;
COUNT(*)
3000000
Elapsed: 00:00:00.62
Postgres 9.0:
Timing is on.
mgogala=# select count(*) from t1;
count
3000000
(1 row)
Time: 856.847 ms
mgogala=#
The table is extremely simple and was created just for this comparison:
mgogala=# \d t1
Table "public.t1"
Column | Type | Modifiers
--------+------------------------------+----------c1
| bigint | not null
c2 | character varying(30) |
Indexes:
"t1_pk" PRIMARY KEY, btree (c1)
In other words, oracle reads 3M very simple records in 620 milliseconds while PostgreSQL spends 856

milliseconds on the same table. The difference is rather significant. Both databases are running on the

same machine, so the comparison is valid. PostgreSQL performance mailing list is full of complaints

about the slowness of the full table scan but the complaints are not being addressed. The usual advice is

to tune the IO subsystem. The bigger the table gets, the bigger the difference in full table scan speed is.

There is no difference whatsoever when reading tables like EMP and DEPT. If the table is frequently
updated, PostgreSQL full table scan gets even slower. The “tune IO subsystem” advice is often

impractical and requires significant investments in hardware. Frequently, the advice includes solid

state disks, better controllers, more CPU power or something of that nature, that would make most of

the CIOs very unhappy.
Miscellaneous
Every relational database has a data dictionary. Oracle exposes its own data dictionary through the

series of views, which are rather simple and well documented. Postgres dictionary is much more

cryptic and much less documented than Oracle's. For instance, the pg_constraints table does not contain

the id of a referenced primary key or unique key constraint for foreign key constraints. It contains a

column list, which makes navigation through the dependency graph much harder.
Monitoring tools are also significantly less usable than the tools available for Oracle. First and

foremost, the fact that PostgreSQL has no shared pool means that it is not possible to see the execution

plan of a SQL statement being executed by some application. It is possible to run explain plan, but one

never knows what knobs were turned in the application itself and whether the execution environment is

the same for the person monitoring as for the executing application being monitored. A positive thing is

that it is possible to send a “cancel query” signal to a session, without killing the session.
There is no such thing as performance trace, there is no wait event interface. PostgreSQL tuning is still

a form of black magic, not an exact methodology. For Oracle, the methodology was laid out in Cary

Millsap's groundbreaking book “Optimizing Oracle Performance”. Basically, Cary Millsap and Anjo

Kolk introduced method and measurement to performance tuning. One could see where the time is

spent, what kind of event does the application wait for and what to do about it. PostgreSQL has not

developed any such methodology, PostgreSQL tuning is still a black art, of the kind that would get

people burnt at stake in some less happy times. There are some extensions which provide execution

plans for all SQL statements which execute longer than an preset time, but that is very messy indeed.

PostgreSQL log file parsers like PgFouine leave a lot to be desired.
PostgreSQL also has a problem with many connections. One should not try having 300 or so direct

simultaneous connections to a PostgreSQL cluster (a PostgreSQL term for what Oracle calls

“instance”). In such cases, a pooling tool like PgPool-II or PgBounce is a must. To tell the truth, these

two tools do a good job and allow PostgreSQL to scale to very large number of connections.
PostgreSQL can maintain a l standby database, which is extremely useful for high availability and

performance, because reporting can be offloaded to the standby. PostgreSQL standby database is open

for reading, which is similar to the Oracle Active Data Guard, a separately licensed option, available

only with the version 11G.
Conclusion
PostgreSQL is a nice little open source database which can be used for offloading internal application

systems from Oracle. It is very far from being on the same level with Oracle and it cannot replace

Oracle in the enterprise server room, not by any stretch of the imagination. The main problem of

PostgreSQL is the attitude of its developers who are resisting providing features like hints, demanded

by the corporate DBA personnel. That results in very minor usage. Recent MongoDB user group had

many more users and sponsors than the 2011 PgEast in New York City, which was a major flop. It was

that attitude that made me cancel the ambitious porting projects that the company I used to work for has

had. Simply said, providing the necessary level of service and performance wasn't possible in the

allotted time and with the resources I had at my disposal. In order to become really useful PostgreSQL

would need to have the following:
1.
Optimizer hints
2.
Wait event interface
3.
Much improved partitioning.
4.
Some form of parallelism, at least for the partitioned tables.
While it is true that there are commercial versions of PostgreSQL (EnterpriseDB, VoltDB, Vertica,

Greenplum, Netezza) providing the solution to some of the open source version problems, it is true that

the open source version is rather limited in its present state. If we go commercial, we have to include a

whole lot of other products, like Microsoft SQL Server and IBM DB2 into consideration, which is a

completely different story. I have no experience with the commercial versions of PostgreSQL and I

cannot make any claims about those versions.