Remoting Django (and other python) methods to be run inside the PostgreSQL database


Remoting Django (and other python) methods to be run inside the PostgreSQL database
PgWest 2010
Hannu Krosing


About me
Hannu Krosing
hannu@2ndQuadrant.com
skype: hkrosing
* Using python since v. 1.1
* Using linux since v. 0.9
* Using PostgreSQL since it was Postgres 4.2 (using PostQUEL as its query language instead of SQL)
Senior PostgreSQL Consultant
www.2ndQuadrant.com
Technical Advisor
Ambient Sound Investments
www.asi.ee


Python & PostgreSQL Background
A large portion of the python code I have written has had to do with databases, be it
binary protocol or file format hacking, or using python to automate data loading,
backups or just plain data manipulation.
I was the first (and for some years the only) DBA for Skype, where we used
python for most data backend manipulation tasks and where I invented pl/proxy, a
proxying and database partitioning language which lets you build unlimited
scalability into your database backends with relative ease, from both the
developer and the administrator viewpoint.
For the last few years I have been helping companies manage and scale their
PostgreSQL databases as a Database Consultant at 2ndQuadrant
(www.2ndQuadrant.com), as well as evaluating the technical merits of startups at
ASI, the investment company of Skype founding engineers (www.asi.ee).


The 1st problem to solve

● Many people use PostgreSQL from python, but don't want to dive deeply into SQL

● Even when you are good at using SQL from another language, it is still cumbersome, due to the different usage styles and expectations of the two worlds (an “impedance mismatch”)


Solution to the 1st problem

Use an Object Relational Mapper (ORM)

ORMs replace SQL with something syntactically closer to the host language

Django has a good and flexible ORM


ORM vs DB API

Sample DB API update of a table row
cur.execute('update users set password=%s where name=%s',
            ('frog', 'bob'))

Same using the Django ORM
usr = User.objects.get(pk='bob')
usr.password = 'frog'
usr.save()


Other available ORMs

● SQLAlchemy
keeps the Object definition and the Mapper separate, joining them at run time

● STORM (GPL)

● SQLObject (mostly superseded by the more flexible SQLAlchemy)


Main problem with ORMs

● Most ORMs tend to generate lots of distinct database requests and do not make use of most database features, at least when used naively.

● Not using advanced database features favours databases with fewer features.

● Most ORMs have ways to circumvent this, and even let you write your own SQL, but then why use the ORM in the first place?
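As an illustration (not from the original slides), the “many distinct requests” pattern usually looks like the sketch below, using a hypothetical Item model mapped to the items table from the sample schema later in this talk:

# naive ORM usage: one SELECT plus one UPDATE per matching row (N+1 queries)
for item in Item.objects.filter(sale_price__isnull=True):
    item.sale_price = item.price
    item.save()

# the same work pushed to the database: one statement, one round trip
cur.execute("update items set sale_price = price where sale_price is null")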



More problems with ORMs
The next ones are due to not using functions to access the database:

● There is no easy way for a DBA to go in and fix inefficient functionality without changing the source code of the client application.

● You can't use pl/proxy, so it is hard to use partitioning for scaling.


Is this really a problem?

● How bad is the “many db requests” problem?

● As each db request has about 0.4 ms of overhead even when doing nothing, you can easily get a large total runtime for simple things.

● This usually does not show up at development time, when you are the only user.
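To put a number on it: the order-shipping example a few slides ahead issues 300+ requests, so the round-trip overhead alone comes to roughly 300 × 0.4 ms ≈ 120 ms, before any actual query work is done.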


Sample schema
-- 100000 items
create table items(
    id serial primary key,
    item_name text not null,
    price numeric(10,4),
    sale_price numeric(10,4)
);
-- 100000 items_available
create table items_available(
    id serial primary key references items(id),
    quantity int
);
-- orders (the column list was lost from the original slide; a minimal
-- assumed definition, since order_lines references it below)
create table orders(
    id serial primary key
);
-- 1000000 order_lines
create table order_lines(
    id serial primary key,
    order_id int not null references orders(id),
    item_id int not null references items(id),
    quantity_ordered int,
    quantity_shipped int
);


Shipping an order using the Django ORM
order = Order.objects.get(id=1000)
for line in order.order_lines.all():
    item_av = ItemAvailable.objects.get(pk=line.item_id)
    if line.quantity_ordered > item_av.quantity:
        line.quantity_shipped = item_av.quantity
        item_av.quantity = 0
    else:
        line.quantity_shipped = line.quantity_ordered
        item_av.quantity -= line.quantity_shipped
    line.save()
    item_av.save()

● Does 1 query to get the order and another to get its ~100 order lines
● Then does 100 queries to get the corresponding availability data
● And then 200 more to save the order lines and availability back
● That is 300+ queries in total, each one an RPC call to the database
● Likely to take 500 ms or more
● Some of the queries can be avoided by using select_related, but most can't


Shipping an order in pl/python
create or replace function process_order(i_order_id int, out json_result text) as
$$
from cjson import encode, decode
if 'order_plan' not in SD:
    q = 'select * from orders where id = $1'
    SD['order_plan'] = plpy.prepare(q, ['int'])
<..cache other plans here..>
order = plpy.execute(SD['order_plan'], [i_order_id])
order_lines = plpy.execute(SD['order_line_plan'], [i_order_id])
availability_query = SD['availability_plan']
update_quantities_query = SD['update_quantities']
update_order_shipping_query = SD['update_order_shipping']
for order_line in order_lines:
    line_id = order_line['id']
    item_id = order_line['item_id']
    availability = plpy.execute(availability_query, [item_id])[0]
    if availability['quantity'] < order_line['quantity_ordered']:
        # can't ship everything: ship what is available, zero the stock
        plpy.execute(update_quantities_query, [item_id, 0])
        plpy.execute(update_order_shipping_query, [line_id, availability['quantity']])
    else:
        # ship the full ordered quantity, decrement the stock
        plpy.execute(update_quantities_query,
                     [item_id, availability['quantity'] - order_line['quantity_ordered']])
        plpy.execute(update_order_shipping_query, [line_id, order_line['quantity_ordered']])
$$ language plpythonu security definer;


● One RPC call to the database

● Total 30-120 ms to update 2x100 rows
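From the client the whole operation is now a single statement; a minimal sketch, assuming a psycopg2 cursor like the usrcur set up later in this talk:

# one round trip: all per-line work happens inside the database
usrcur.execute("select * from process_order(%s)", (1000,))
print usrcur.fetchone()[0]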


Problems with pl/python

● Maintaining code in 2 places

● Hard to write

● Writing this as 1-2 SQL statements would have performed even faster, so why bother? (see the sketch below)
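A minimal sketch of that set-based version, assuming the sample schema above and at most one order line per item:

-- ship whatever is available for each line of order 1000 ...
update order_lines l
   set quantity_shipped = least(l.quantity_ordered, a.quantity)
  from items_available a
 where a.id = l.item_id
   and l.order_id = 1000;

-- ... then decrement the stock by what was actually shipped
update items_available a
   set quantity = a.quantity - l.quantity_shipped
  from order_lines l
 where l.item_id = a.id
   and l.order_id = 1000;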


Automatic remoting

You can automate moving code into PostgreSQL to be executed by pl/python.

How?

Write a decorator which wraps up the code and its arguments, and passes them for execution to a special pl/python function.


Writing a remoting decorator
All the following samples need these modules and connections
import psycopg2
import json, marshal, base64
# user who can create functions
admcon = psycopg2.connect('dbname=pyrpcdb user=admin')
admcur = admcon.cursor()
# connection for using the function
usrcon = psycopg2.connect('dbname=pyrpcdb user=bob')
usrcur = usrcon.cursor()


Writing a remoting decorator
Decorator as a class
class run_in_database(object):
    def __init__(self, f):
        self.f = f
        self.code = f.func_code
        self.mcode = marshal.dumps(self.code)
        self.code64 = base64.encodestring(self.mcode)
        self.db_name = "%s_%d" % (f.__name__, hash(self.mcode))
    def __call__(self, *args, **kargs):
        json_args = json.dumps((args, kargs))
        usrcur.execute('select * from run_code(%s,%s)',
                       (self.code64, json_args))
        result = json.loads(usrcur.fetchone()[0])
        return result

● serialises the actual code of the function

● serialises the arguments, passes both arguments and code to pl/python

● deserialises the return value


Remote executor in PostgreSQL
A generic pl/python function that executes the shipped code
create or replace function run_code(in code64 text, in json_args text,
                                    out json_result text) as
$$
import json, marshal, base64
args, kargs = json.loads(json_args)
code = marshal.loads(base64.decodestring(code64))
def f(): pass
f.func_code = code
res = f(*args, **kargs)
return json.dumps(res)
$$ language plpythonu security definer;

● Deserialises the arguments and the code

● Creates a dummy function, then attaches the deserialised code to it

● Calls the function, serialises the result and returns it
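One caveat worth noting (not on the original slide): marshal's byte format is tied to the Python version, so the client interpreter and the server's plpythonu interpreter must be the same version for the shipped code object to load.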


Defining and calling the remoted function
@run_in_database
def get_order(i_order_id):
    if 'order_plan' not in SD:
        q = 'select * from orders where id = $1'
        SD['order_plan'] = plpy.prepare(q, ['int'])
    if 'order_line_plan' not in SD:
        q = 'select * from order_lines where order_id = $1'
        SD['order_line_plan'] = plpy.prepare(q, ['int'])
    order = plpy.execute(SD['order_plan'], [i_order_id])[0]
    order_lines = plpy.execute(SD['order_line_plan'], [i_order_id])
    return (order, [order_line for order_line in order_lines])

# call the decorated function; it is executed in the database
res = get_order(1000)

● Main overhead is returning the data, not shipping the code

● Code shipping can be changed to happen only when the code has changed


Some speed numbers

● A simple ”select * from t where id = n” runs in 0.1-0.7 ms

● get_order(n) takes 4.5 ms in the database

● Calling it with automatic remoting from python takes 6.5 ms

This means that the overhead of shipping the function code is negligible.
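A minimal sketch of how such timings can be measured from the client side (the exact numbers above are from the talk's test setup and will vary with hardware and network):

import time

t0 = time.time()
res = get_order(1000)   # one remoted call, one database round trip
print 'elapsed: %.1f ms' % ((time.time() - t0) * 1000)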


Problems with simple remoting

● While running everything via code shipping works, it is not the best option

● No caching of functions

● No caching of query plans

● No further optimisation or scaling possible

● Better to create a corresponding db function for each remoted function


Advanced remoting
A decorator that runs (and first creates, if necessary) a corresponding
function in the database
class run_as_db_function(object):
    def __init__(self, f):
        self.f = f
        self.py_fname = f.__name__
        self.code = f.func_code
        self.mcode = marshal.dumps(self.code)
        self.code64 = base64.encodestring(self.mcode)
        self.db_fname = "%s_%d" % (f.__name__, abs(hash(self.mcode)))
    def __call__(self, *args, **kargs):
        json_args = json.dumps((args, kargs))
        print self.db_fname, (json_args,)
        try:
            # try calling the function in the db
            usrcur.execute("savepoint sp1; select * from %s(%%s)" % self.db_fname,
                           (json_args,))
        except:
            # the function does not exist yet: define it using the admin connection
            usrcur.execute("rollback to savepoint sp1")
            admcur.execute(remote_function_template % (self.db_fname, self.code64))
            admcon.commit()
            # run the function again
            usrcur.execute("select * from %s(%%s)" % self.db_fname, (json_args,))
        result = json.loads(usrcur.fetchone()[0])
        return result


Advanced remoting (cont)
A database function template to be used when creating the function
remote_function_template = """\
CREATE OR REPLACE FUNCTION %s (in json_args text, out json_result text) AS
$$
import json, marshal, base64
code = marshal.loads(base64.decodestring(\"\"\"%s\"\"\"))
args, kargs = json.loads(json_args)
def f(): pass
f.func_code = code
res = f(*args, **kargs)
return json.dumps(res)
$$ language plpythonu security definer;
"""
Create and call a remoted database function
@run_as_db_function
def double(strarg):
    return strarg * 2
double('cat')   # creates double_<hash>() on first call, returns 'catcat'


Advanced remoting (cont)
The function that was actually created in the database
CREATE FUNCTION double_1980513421(json_args text, OUT json_result text) RETURNS text
    LANGUAGE plpythonu SECURITY DEFINER
    AS $$
import json, marshal, base64
code = marshal.loads(base64.decodestring("""
YwEAAAABAAAAAgAAAEMAAABzCAAAAHwAAGQBABRTKAIAAABOaQIAAAAoAAAAACgBAAAAdAYAAABz
dHJhcmcoAAAAACgAAAAAcwcAAAA8c3RkaW4+dAYAAABkb3VibGUBAAAAcwIAAAAAAg==
"""))
args, kargs = json.loads(json_args)
def f(): pass
f.func_code = code
res = f(*args, **kargs)
return json.dumps(res)
$$;


Making it scalable

● The easiest way to achieve scalability with a function-based interface to PostgreSQL is pl/proxy

● But pl/proxy needs a partitioning key in the function signature, while the current approach hides all arguments inside a JSON container

● Time to add the possibility to pass some arguments separately

● While we are at it, let's also cache the internal function

Scalable remoting for pl/proxy
A decorator exposing an explicit partitioning key
def run_as_partitioned_db_function(part_key, part_key_type="text"):
    class _run_as_partitioned_db_function(object):
        def __init__(self, f):
            self.f = f
            self.py_fname = f.__name__
            self.code = f.func_code
            self.mcode = marshal.dumps(self.code)
            self.code64 = base64.encodestring(self.mcode)
            self.db_fname = "%s_%d" % (f.__name__, abs(hash(self.mcode)))
            self.part_key = part_key
            self.part_key_type = part_key_type
            self.part_key_pos = self.code.co_varnames.index(self.part_key)
        def __call__(self, *args, **kargs):
            json_args = json.dumps((args, kargs))
            print 'args:', args
            print 'kargs:', kargs
            print self.db_fname, (args[self.part_key_pos], json_args)
            try:
                # try calling the function in the db
                usrcur.execute("savepoint sp1; select * from %s(%%s, %%s)" % self.db_fname,
                               (args[self.part_key_pos], json_args))
            except:
                # the function does not exist yet: define it using the admin connection
                usrcur.execute("rollback to savepoint sp1")
                admcur.execute(partitioned_function_template
                               % (self.db_fname, self.part_key_type, self.code64))
                admcon.commit()
                # run the function again
                usrcur.execute("select * from %s(%%s, %%s)" % self.db_fname,
                               (args[self.part_key_pos], json_args))
            result = json.loads(usrcur.fetchone()[0])
            return result
    return _run_as_partitioned_db_function


Scalable remoting for pl/proxy (cont.)
A function template for scalable remoting, exposing the key field
partitioned_function_template = """\
CREATE OR REPLACE FUNCTION %s (in part_key %s, in json_args text, out json_result text)
AS
$$
import json
if not SD.has_key('_func_'):
    import marshal, base64
    code = marshal.loads(base64.decodestring(\"\"\"
%s\"\"\"))
    def f(): pass
    f.func_code = code
    SD['_func_'] = f
args, kargs = json.loads(json_args)
res = SD['_func_'](*args, **kargs)
return json.dumps(res)
$$ language plpythonu security definer;
"""
This also caches the function in pl/python's local dictionary SD, so it is not re-created at each call.


Scalable remoting function in database
Defining and running this in python
@run_as_partitioned_db_function('username')
def hello(username, age):
    return 'Hello %s, you are nor %d years old' % (username, age)
hello('bob', 32)
Yields the following function in the database
CREATE OR REPLACE FUNCTION hello_678701658(IN part_key text, json_args text,
                                           OUT json_result text)
    LANGUAGE plpythonu SECURITY DEFINER AS
$$
import json
if not SD.has_key('_func_'):
    import marshal, base64
    code = marshal.loads(base64.decodestring("""
YwIAAAACAAAAAwAAAEMAAABzDgAAAGQBAHwAAHwBAGYCABZTKAIAAABOcyIAAABIZWxsbyAlcywg
eW91IGFyZSBub3IgJWQgeWVhcnMgb2xkKAAAAAAoAgAAAHQIAAAAdXNlcm5hbWV0AwAAAGFnZSgA
AAAAKAAAAABzBwAAADxzdGRpbj50BQAAAGhlbGxvAQAAAHMCAAAAAAI=
"""))
    def f(): pass
    f.func_code = code
    SD['_func_'] = f
args, kargs = json.loads(json_args)
res = SD['_func_'](*args, **kargs)
return json.dumps(res)
$$;


So how would one scale this?
Once you have a function with an exposed partition key, it is easy to use pl/proxy to do the actual partitioning. Just define a proxy function:
CREATE OR REPLACE FUNCTION hello_678701658(part_key text, json_args text,
                                           OUT json_result text)
LANGUAGE plproxy AS $$
    CLUSTER 'usercluster';
    RUN ON hashtext(part_key);
$$;
This function decides, based on part_key, which partition to run the real function in, calls it there with the exact same arguments it was itself called with, and returns the result to the original caller.
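For completeness: pl/proxy looks up the partitions of 'usercluster' through three configuration functions in the plproxy schema. A minimal sketch with two made-up partition connect strings (see the pl/proxy tutorial linked at the end of this talk for the full interface):

CREATE FUNCTION plproxy.get_cluster_version(cluster_name text)
RETURNS int AS $$
    SELECT 1;  -- bump this to invalidate the cached cluster configuration
$$ LANGUAGE sql;

CREATE FUNCTION plproxy.get_cluster_partitions(cluster_name text)
RETURNS SETOF text AS $$
    -- the number of partitions must be a power of two
    SELECT unnest(ARRAY['dbname=userdb_p0 host=10.0.0.1',
                        'dbname=userdb_p1 host=10.0.0.2']);
$$ LANGUAGE sql;

CREATE FUNCTION plproxy.get_cluster_config(cluster_name text,
                                           OUT key text, OUT val text)
RETURNS SETOF record AS $$
    SELECT 'connection_lifetime'::text, '1800'::text;
$$ LANGUAGE sql;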


How far along is Django support for this?
While the aim of this project is support for remoting ORM-based methods, with Django as the first target, we are not there yet.
I have an ugly hacked version of the postgresql database backend which can be run inside the database to execute Django QuerySet objects, but it is far from complete, nor is it easy to use or install.
You can write straight pl/python code inside your Django app and have it remoted using the code from this tutorial.


ToDo for Django support

● Complete Django ORM support on pl/python's SPI interface

● The above needs an SPI-specific database adapter

● And a DB-API compliant SPI interface, a long-time ToDo item in pl/python

● Automatic generation of Django Models from the database Schema


Other nice-to-have features

● Automatic generation of pl/proxy function stubs

● Bypassing SQL when generating the query plans

● Support for other python ORMs

● Support for ORMs from other languages (Ruby on Rails anybody?)


Conclusion

● It is possible to automatically move functions into the database for execution, as demonstrated in this talk.

● It is also possible to move Django methods into the database and make them scalable using pl/proxy.

● Making the Django method shipping easy still needs work (donations welcome ;)


Where to get more info

http://www.2ndQuadrant.com/ - if you want professional paid help

http://www.postSQL.org/ - where this code will be developed and maintained

http://plproxy.projects.postgresql.org/doc/tutorial.html - tutorial for using pl/proxy

How to reach me:

email: hannu@2ndQuadrant.com

Skype: hkrosing