Adding PostgreSQL functionality

offbeatlossData Management

Nov 22, 2012 (4 years and 10 months ago)

279 views

Adding PostgreSQL functionality
Hans Sj
¨
oberg(c02hsg@cs.umu.se)
Mattias Linde(c02lek@cs.umu.se)
Olov Ylinenp
¨
a
¨
a(c02yov@cs.umu.se)
Patrik Berg(c02bpk@cs.umu.se)
Abstract
This article describes the addition of an aggregate function to PostgreSQL.
Specifically,the addition of an aggregate function to calculate the convex hull,
given a set of tuples consisting of points.
1 Introduction
PostgreSQL is a free OSS
1
database backend application.Applications such as
this are in constant development and because it is OSS,anyone can add func-
tionality to it.An example of functionality PostgreSQL lacks is the calculation
of convex hulls.Although it is possible to calculate a bounding box given a set
of points,sometimes it is necessary to be more specific.
2 Approach
The first step when modifying any existing software is to learn the basics.The
PostgreSQL homepage [1] makes available a general documentation that,al-
though it doesn’t go into the absolute specifics of PostgreSQL internals,can be
used as a first start.The next step is to look at code and learn through reading
it.It is highly suggested to use some sort of source code indexing such as tags
[2] for this as the PostgreSQL source code is distributed over many files using
a large amount of macros and small functions to do its tasks.Having gotten a
general idea about how the system is built up and,specifically,how the process
of adding functionality works,the next step is the actual implementation.Af-
ter implementing the functionality required and adding this to PostgreSQL,the
task is complete.
2.1 Existing documentation
The documentation for PostgreSQL currently available is large,but is only par-
tially useful for understanding PostgreSQL internals.Therefore,reading and
1
Open Source Software
Adding PostgreSQL functionality
understanding the existing source code is vital to adding any kind of function-
ality.Unfortunately,the source code is sparsely commented.Scattered along
the source tree are small textfiles that each explain a small part of the whole.
Although they’re not very common,they contain very technical information,
useful for the task at hand.The source code itself is usually commented with a
small introduction for each function,as well as some small inline comments.
2.2 Implementing aggregate functions
The source code makes available a large amount of general functionality to
accomplish simpler tasks.However,the existence of these is not documented and
to find them,it’s necessary to look at existing examples.An aggregate function,
which is what is to be implemented,consists of one aggregating function (state
transition function) and an optional function that is called when the aggregation
is done (final function).
The aggregation of tuples requires an amount of memory that is unknown be-
forehand.Existing aggregation functionality in PostgreSQL that accumulates
tuples into arrays does this by allocating new memory for each aggregation.The
problem with this naive approach is that it doesn’t scale,for large amounts of
items the time required to do this unacceptably high (table 1,query 2).
For example,if 100,000 tuples are to be aggregated,then at the 99,999th aggre-
gation a memory area of 99,999 plus one elements needs to be allocated,and all
elements copied there.
The solution to this problem is to allocate more memory than required,so that
at each iteration in the aggregation,new memory allocation and copying is not
required.It should be noted that although PostgreSQL contains functionality
for reallocation of memory,this is implemented through the method described
above.PostgreSQL provides functionality for the creation and disassembly of
arrays,however,allocating more memory than is actually required at the point
of array creation isn’t supported.This requires the dissection of existing func-
tionality to extend it.Because PostgreSQL uses a custom format for arrays
without much documentation (since it is meant that users of arrays should use
the provided functionality) this task can be tedious.
PostgreSQL puts references to all internal functions into tables that together
make up the System Catalog.The System Catalog contains all the functions,
datatypes,etc.supported by PostgreSQL.Adding to the SystemCatalog can be
done through use of SQL (e.g.CREATE FUNCTION) or through the editing
of certain files in the PostgreSQL source tree.
2.3 Convex hulls
The convex hull of a set of points is the minimal convex polygon that encloses it.
A common algorithm to calculate the convex hull,having adequate complexity
(O(n*log(n))) called Graham’s Scan [3],which is the algorithm chosen.The
algorithm requires a sorted list of points,the chosen sorting algorithm was 1.
2
Adding PostgreSQL functionality
#Query Result Time
1 select count(*) from test;100000 78.331 ms
2
select array
accum(p)
from test;
(100000 points in an array) 377859.683 ms
3 select hull(p) from test;
(polygon representing a
convex hull)
728.207 ms
4
select * from test where
not p @ (select hull(p)
from test);
(0 rows) 875.772 ms
5
select convexhull(
array[’1,1’::point,
’3,5’::point,’0,8’::point,
’1.5,6’::point]);
((3,5),(1,1),(0,8)) 0.630 ms
6
select point
append(
array[point(’1,1’)],
point(’3,5’));
{”(1,1)”,”(3,5)”} 0.480 ms
Table 1:Time comparison for different queries.
3 Result
After implementing the described functionality,it is possible to,given a set of
tuples of the Point datatype,issue SQL queries as seen in table 1.The queries
in table 1 are executed on a table called test,containing 100,000 elements:
• Query one shows that the table contains 100000 rows.
• Query two shows an aggregation of 100000 points using the built-in func-
tion for appending to arrays,as can be seen this is very inefficient if
done solely using functionality provided by PostgreSQL.The function
array
accum uses the internal function array
append,which is the way
provided by PostgreSQL to do array aggregation.
• Query three demonstrates the aggregate function described in this article
computing the convex hull of 100000 points.As can be seen in the results,
this function is much faster than that in query 2.
• Query four checks if any point is outside of the convex hull.
• Query five shows how to calculate the convex hull when aggregation is not
necessary.
• Query six shows that it is possible to use the aggregation function by itself
to aggregate tuples into an array.
Practical uses for convex hulls aren’t hard to find,in figure 1 it is shown how
convex hulls are used to graphically group points.
3
Adding PostgreSQL functionality
Figure 1:Graphical representation of points grouped into three different regions.
4 Discussion
Having first set out to add area functionality to polygons,only to discover
that it was already implemented in the newest version of PostgreSQL hampered
development progress.Luckily,part of the work already done was learning
PostgreSQL,something that would be required no matter what functionality
was to be implemented.The actual work done was not in the implementation
of Graham’s Scan,but rather in the creation of an aggregate function that
could aggregate tuples into arrays.As shown in table 1,this was something
that needed to be highly optimized to be usable in a database backend.
Because PostgreSQL documentation isn’t targeted towards those who intend
to modify internals,but rather want to understand the basics,much time was
spent understanding what was actually done under the hood.For example,the
custom format of arrays is sufficiently poorly documented that to learn it,it is
best to look at how it is used and interpret it.
As should’ve been expected,much less work was done writing code than was
done learning what code to write.Because of this,the resulting code was only
4
Adding PostgreSQL functionality
a few hundred lines.
Possible future work of the project is to send a patch containing the necessary
addition to PostgreSQL developers for integration.
5 Conclusions
Looking at the PostgreSQL source code,it’s easy to see that efficiency was
sacrificed for an attempt at readability,portability and reuse of code.Thus,if
efficiency is of vital importance,then perhaps PostgreSQL may not be the best
choice.
The large amount of functions and macros used by PostgreSQL is a jungle for
those learning how to add functionality,but is most likely very useful once
learned.
References
[1] http://www.postgresql.org/
[2] http://linuxcommand.org/man
pages/etags1.html
[3] Udi Manber Algorithms:A Creative Approach,Addison-Wesley,1989,
ISBN 0-201-12037-2
5