Python, MongoDB, and asynchronous web frameworks


Feb 14, 2012



A. Jesse Jiryu Davis

jesse@10gen.com

emptysquare.net


CPU-bound web service

(Diagram: client connected to the server via a socket)

• No need for async
• Just spawn one process per core

Normal web service

(Diagram: client connected to the server via a socket; the server talks to a backend (DB, web service, SAN, …) over another socket)

• Assume backend is unbounded
• Service is bound by:
    • Context-switching overhead
    • File descriptors
    • Memory!

What’s async for?

• Minimize resources per connection
• Wait for backend as cheaply as possible

CPU- vs. memory-bound

(Diagram: a spectrum from CPU-bound to memory-bound; crypto is CPU-bound, chat is memory-bound, HTML is somewhere in between?)

HTTP long-polling (“COMET”)

• E.g., chat server
• Async’s killer app
• Short-polling is CPU-bound: tradeoff between latency and load
• Long-polling is memory-bound
• “C10K problem”: kegel.com/c10k.html
• Tornado was invented for this

Why is async hard to code?

(Sequence diagram: the client sends a request; the server forwards a request to the backend and must store state until the backend responds; then the server sends the response to the client)

Ways to store state

Method:           Example:
Blocked threads   Django, WSGI
Greenlets         Gevent
Callbacks         Tornado, Node.js

(Chart axes run from “Easy” to “Efficient”)

Tradeoff between coding ease and memory efficiency

What’s a greenlet?

• A.K.A. “green threads”
• A feature of Stackless Python, packaged as a module for standard Python
• Greenlet stacks are stored on heap, copied to / from OS stack on resume / pause
• Cooperative
• Memory-efficient
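The cooperative switching described above can be seen in a minimal sketch using the `greenlet` package (the demo names `log`, `child`, and `main` are mine, not from the talk):

```python
# Minimal demo of cooperative greenlet switching (pip install greenlet).
import greenlet

log = []

def child():
    log.append("child runs")
    main.switch("hello from child")  # pause child, resume main
    log.append("child resumed")      # runs when main switches back in

main = greenlet.getcurrent()
gr = greenlet.greenlet(child)
value = gr.switch()   # start child; returns the value child switched out with
log.append(value)
gr.switch()           # resume child where it paused; it finishes, we return
```

Nothing runs concurrently here: control moves only at explicit `switch()` calls, which is what makes greenlets cooperative and cheap.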

Threads: State stored on OS stacks

# pseudo-Python
sock = listen()
request = parse_http(sock.recv())
mongo_data = db.collection.find()
response = format_response(mongo_data)
sock.sendall(response)
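The blocked-threads model sketched in the pseudo-code can be made concrete with the stdlib; in this rough sketch (the echo payload and `handle` helper are mine) each connection's state lives on its handler thread's OS stack:

```python
# One thread per connection: the thread blocks in recv(), and all
# per-request state lives on that thread's OS stack.
import socket
import threading

def handle(conn):
    data = conn.recv(1024)          # thread blocks here while waiting
    conn.sendall(b"echo: " + data)  # stand-in for format_response()
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
conn, _ = server.accept()
worker = threading.Thread(target=handle, args=(conn,))
worker.start()

client.sendall(b"hi")
reply = client.recv(1024)
worker.join()
```

Easy to write, but each blocked thread costs an OS stack, which is exactly the memory bound discussed earlier.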

Gevent: State stored on greenlet stacks

# pseudo-Python
from gevent import monkey; monkey.patch_all()

sock = listen()
request = parse_http(sock.recv())
mongo_data = db.collection.find()
response = format_response(mongo_data)
sock.sendall(response)

Tornado: State stored in RequestHandler

class MainHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        AsyncHTTPClient().fetch("http://example.com",
                                callback=self.on_response)

    def on_response(self, response):
        formatted = format_response(response)
        self.write(formatted)
        self.finish()

Tornado IOStream

class IOStream(object):
    def read_bytes(self, num_bytes, callback):
        # store on an attribute that doesn't shadow this method
        self.num_bytes = num_bytes
        self.read_callback = callback
        io_loop.add_handler(
            self.socket.fileno(),
            self.handle_events,
            events=READ)

    def handle_events(self, fd, events):
        data = self.socket.recv(self.num_bytes)
        self.read_callback(data)

Tornado IOLoop

class IOLoop(object):
    def add_handler(self, fd, handler, events):
        self._handlers[fd] = handler
        # _impl is epoll or kqueue or ...
        self._impl.register(fd, events)

    def start(self):
        while True:
            event_pairs = self._impl.poll()
            for fd, events in event_pairs:
                self._handlers[fd](fd, events)
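The same register-then-dispatch shape can be tried with the stdlib `selectors` module, which picks epoll, kqueue, etc. for the platform much like Tornado's `_impl`. The `TinyLoop` name and the socketpair echo demo are my own illustration:

```python
# Toy event loop: register a handler per fd, poll, dispatch.
import selectors
import socket

class TinyLoop(object):
    def __init__(self):
        self._impl = selectors.DefaultSelector()  # epoll/kqueue/... wrapper
        self._handlers = {}

    def add_handler(self, sock, handler, events):
        self._handlers[sock.fileno()] = handler
        self._impl.register(sock, events)

    def run_once(self):
        # one poll-and-dispatch pass, like one turn of IOLoop.start()
        for key, events in self._impl.select(timeout=1):
            self._handlers[key.fd](key.fd, events)

a, b = socket.socketpair()
received = []
loop = TinyLoop()
loop.add_handler(b, lambda fd, events: received.append(b.recv(1024)),
                 selectors.EVENT_READ)
a.sendall(b"ping")
loop.run_once()
```

One loop can multiplex thousands of sockets this way, which is how a single process avoids the thread-per-connection memory cost.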

Python, MongoDB, & concurrency

• Threads work great with pymongo
• Gevent works great with pymongo
    • monkey.patch_socket(); monkey.patch_thread()
• Tornado works so-so
    • asyncmongo
        • No replica sets, only first batch, no SON manipulators, no document classes, …
    • pymongo
        • OK if all your queries are fast
        • Use extra Tornado processes

Introducing: “Motor”

• Mongo + Tornado
• Experimental
• Might be official in a few months
• Uses Tornado IOLoop and IOStream
• Presents standard Tornado callback API
• Stores state internally with greenlets
• github.com/ajdavis/mongo-python-driver/tree/tornado_async

Motor

class MainHandler(tornado.web.RequestHandler):
    def __init__(self):
        self.c = TornadoConnection()

    @tornado.web.asynchronous
    def get(self):
        # No-op if already open
        self.c.open(callback=self.connected)

    def connected(self, c, error):
        self.write('[')
        self.cursor = self.c.collection.find(callback=self.found)

    def found(self, result, error):
        for i in result:
            self.write(json.dumps(i))

        if self.cursor.alive:
            self.cursor.get_more(callback=self.found)
        else:
            self.write(']')
            self.finish()

Motor (with Tornado Tasks!)

class MainHandler(tornado.web.RequestHandler):
    def __init__(self):
        self.c = MongoTornadoConnection()

    @tornado.web.asynchronous
    @gen.engine
    def get(self):
        yield gen.Task(self.c.open)
        self.write('[')
        cursor = self.c.db.collection.find(
            callback=(yield gen.Callback('find')))

        while cursor.alive:
            for i in (yield gen.Wait('find')):
                self.write(json.dumps(i))

        self.write(']')
        self.finish()

Motor internals

(Sequence diagram: the client's request reaches the RequestHandler, which starts a greenlet running pymongo; pymongo calls IOStream.send() and IOStream.recv(), each switch()-ing back to the IOLoop while waiting; when data arrives the IOLoop schedules a callback that switches into the greenlet, which parses the Mongo response; finally the handler's callback() writes the HTTP response)

Motor internals: wrapper

class TornadoCollection(pymongo.collection.Collection):
    def find(self, *args, **kwargs):
        callback = kwargs.get('callback')
        del kwargs['callback']
        cursor = super(TornadoCollection, self).find(*args, **kwargs)
        tornado_cursor = TornadoCursor(cursor)
        tornado_cursor.get_more(callback)
        return tornado_cursor

class TornadoCursor(object):
    def __init__(self, cursor):
        self.__sync_cursor = cursor

    def get_more(self, callback):
        def _get_more():
            result = self.__sync_cursor._refresh()
            tornado.ioloop.IOLoop.instance().add_callback(
                lambda: callback(result)
            )

        greenlet.greenlet(_get_more).switch()
        return None

Motor internals: fake socket

class TornadoSocket(object):
    @property
    def stream(self):
        if not self._stream:
            # Tornado's IOStream sets the socket to
            # be non-blocking
            self._stream = tornado.iostream.IOStream(
                self.socket)

        return self._stream

    def recv(self, num_bytes):
        child_gr = greenlet.getcurrent()
        def recv_callback(data):
            child_gr.switch(data)

        self.stream.read_bytes(
            num_bytes, callback=recv_callback)
        return child_gr.parent.switch()

Motor

• Shows a general method for asynchronizing synchronous network APIs in Python
• Who wants to try it with MySQL? Thrift?
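That general method boils down to: run blocking-style code on a child greenlet, and turn each callback into a switch() back into it. A stripped-down sketch of the idea, with the `greenlet` package as the only real API (the `pending` queue stands in for an event loop, and `async_fetch`/`fetch_sync` are my own illustrative names):

```python
import greenlet

pending = []  # stands in for an event loop's callback queue

def async_fetch(callback):
    # Fake callback-style API: delivers its result later, via the "loop".
    pending.append(lambda: callback("data"))

def fetch_sync():
    # Looks synchronous, but only pauses this greenlet while waiting.
    child = greenlet.getcurrent()
    async_fetch(lambda result: child.switch(result))
    return child.parent.switch()   # pause; resumes with the result

results = []

def task():
    results.append(fetch_sync())

greenlet.greenlet(task).switch()   # task starts, then pauses in fetch_sync()
for callback in pending:           # the "event loop" fires the callback
    callback()
```

This is the same shape as TornadoSocket.recv() in the fake-socket slide: register a callback that switches back into the paused greenlet, then switch out to the parent while waiting.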

Questions?

A. Jesse Jiryu Davis

jesse@10gen.com

emptysquare.net