jnosal

On programming and stuff

Not so exceptional anymore


I'm going to share my personal view on a few cases connected, more or less, with Python exception handling.

1. Worst python antipattern revisited

Everyone knows this is harmful:

try:
    <code>
except:
    pass

but this snippet (not so harmful in some people's eyes) is even worse, imho:

try:
    <code>
except Exception as e:
    logging.error(e)

We may assume that someone was calling a complex function whose internals they weren't very familiar with, so they decided to log any exception that occurs. Unless it was a quick script or snippet, that tells us at least three things about this person:

a) they probably didn't bother writing unit tests
b) they didn't care enough to identify the function's internals and the special cases that may occur (ValueErrors, KeyErrors, etc.)
c) they know about the first anti-pattern (except / pass), but are lazy enough to reproduce it anyway, only slightly improving on quality

And unless you're using Sentry or similar software and instead rely purely on log files, I'm almost 100% sure most of those exceptions will be lost in the abyss.
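
If the goal is just to keep going when a known failure occurs, a narrower handler says much more about your intent. A minimal sketch (parse_age and the fallback value are made up for illustration):

import logging


def parse_age(raw_value, default=None):
    # We know exactly which failures we tolerate - anything else propagates.
    try:
        return int(raw_value)
    except (TypeError, ValueError):
        logging.warning("Could not parse age from %r, using %r", raw_value, default)
        return default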

2. Ignoring with ignored

A useful pattern (introduced to me by Raymond Hettinger) for silencing unwanted exceptions - use with caution:

from contextlib import contextmanager


@contextmanager
def ignored(*exceptions):
    try:
        yield
    except exceptions:
        pass


with ignored(ZeroDivisionError):
    print "Where is your God now ?"
    2 / 0
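
Worth noting: since Python 3.4 the standard library ships essentially the same helper as contextlib.suppress, so on a newer interpreter the home-grown version isn't needed:

from contextlib import suppress  # Python 3.4+

with suppress(ZeroDivisionError):
    print("Where is your God now ?")
    2 / 0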

3. Using what's already been done

Unless you're writing a big or very generic library or some fancy project, I'd suggest you rely on the exceptions already defined in the standard library:

a) Raise TypeError: use it when someone tries to play with your business logic in a harmful way - e.g. providing car instances to a function that expects bank accounts to do some money operations.

b) Raise AttributeError: may be used instead of the common ConfigurationExceptions that are thrown whenever something is missing from configuration / settings files.

c) Raise KeyError: whenever a handler is missing for a specific action, like so:

def play_music(device, *args, **kwargs):
    handlers = {
        'radio': radio_handler,
        'computer': computer_handler,
    }

    if device not in handlers:
        raise KeyError(u"Handler definition missing for {0}".format(device))

    handler = handlers[device]
    handler(*args, **kwargs)

d) Raise ValueError: throw it any time the data provided is invalid and doesn't match expectations, for instance you expect a number between 1 and 100 and someone enters 1000 (see the sketch below).

This little gang of four should keep you going for a while ;-)
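
For the ValueError case from point d), a minimal sketch (set_volume and its range are made up for illustration):

def set_volume(level):
    if not 1 <= level <= 100:
        raise ValueError(u"Expected a volume between 1 and 100, got {0}".format(level))
    # apply the new volume here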

4. Don't raise NotImplementedError

I literally hate snippets like this:

class BaseNotificationGateway(object):
    
    def prepare(self):
        raise NotImplementedError
        
    def notify(self):
        raise NotImplementedError

class IAmNotDefinedButIWillNotFailYet(BaseNotificationGateway):
    pass
    
gateway = IAmNotDefinedButIWillNotFailYet()

# later

gateway.notify()

It's time the abc module became more popular - it's a much better and more flexible way to create interfaces.

import abc

class BaseNotificationGateway(object):
    __metaclass__ = abc.ABCMeta
    
    @abc.abstractmethod
    def prepare(self):
        pass
    
    @abc.abstractmethod    
    def notify(self):
        pass
        

class IAmMissingSomething(BaseNotificationGateway):
    pass
    
# will raise TypeError on instance creation

gateway = IAmMissingSomething()

Putting pass in each method is perhaps not the best idea, but this pattern is extremely useful for application factories (http://flask.pocoo.org/docs/0.10/patterns/appfactories/), which can be used to set up different objects at app startup.
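
For context, a rough sketch of such a factory (Flask is used here only as an example; the settings dict and the gateway attribute in the comment are invented):

from flask import Flask


def create_app(settings):
    app = Flask(__name__)
    app.config.update(settings)
    # This is the place where a concrete subclass of BaseNotificationGateway
    # would be chosen and attached, e.g.:
    # app.notification_gateway = EmailNotificationGateway()
    return app


app = create_app({'DEBUG': True})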

5. Drop try/except/else/finally statements and favour context managers

Instead of this:

from app.services import github

client = github.ApiClient()

try:
    response = client.fetch()
except github.SuperCustomException as e:
    # Possibly log the exception
    # Fallback action
    pass
else:
    # Handle the response
    pass
finally:
    # Close what's to close
    # Cleanup
    pass

put it in a context manager and let your API be neat and beautiful:

from app.services import github

with github.ApiClient() as client:
    response = client.fetch()
    # handle response
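
Under the hood such a client is just the same try/except/finally written once, inside the protocol methods. A rough sketch of what github.ApiClient could look like (open_connection and the fetch call are hypothetical placeholders):

import logging


class SuperCustomException(Exception):
    pass


class ApiClient(object):

    def __enter__(self):
        self.session = open_connection()  # acquire whatever needs closing (hypothetical helper)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.session.close()  # cleanup always runs, just like finally
        if exc_type is not None and issubclass(exc_type, SuperCustomException):
            logging.error(exc_val)  # possibly log + fallback action
            return True  # swallow only the exception we expected
        return False  # anything else propagates

    def fetch(self):
        return self.session.get('/repos')  # hypothetical call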

6. Know your library

Django, SQLAlchemy, requests - they all come with a great deal of predefined exceptions: non-existing rows, HTTP errors, timeouts, validation errors - it's all there for you.
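
For example, with requests there's usually no reason to invent your own timeout or HTTP error (a minimal sketch, the URL is made up):

import logging

import requests

try:
    response = requests.get('http://example.com/api/items', timeout=2)
    response.raise_for_status()
except requests.exceptions.Timeout:
    logging.warning("Service timed out")  # retry or degrade gracefully
except requests.exceptions.HTTPError as e:
    logging.error(e)  # non-2xx response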

Enjoy!

Don't miss Python articles on reddit


In case you've ever wanted to build a personal "must read" list of Python-related stuff, or search through articles published a while ago, here is a simple approach to the problem:

Crawl reddit's r/Python:

# -*- coding: utf-8 -*-

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class RedditItem(Item):
    title = Field()
    url = Field()


class RpythonSpider(CrawlSpider):
    name = "rpython"
    allowed_domains = [
        "reddit.com"
    ]
    start_urls = (
        'http://www.reddit.com/r/Python/new/',
    )
    crawled = set()
    crawled_prev_next = set()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        prev_next = hxs.select('//span[@class="nextprev"]//a').\
            select('@href').extract()
        
        if len(prev_next) > 0:
            url = prev_next[-1]
            if url not in self.crawled_prev_next:
                self.crawled_prev_next.add(url)
                yield self.make_requests_from_url(prev_next[-1])

        for link in hxs.select('//*[@id="siteTable"]//div//p[1]/a'):
            
            url = link.select('@href').extract()[0]
            title = link.select('text()').extract()[0]

            if len(url) and len(title) and url not in self.crawled and '/r/Python/' not in url:
                item = RedditItem()
                item['title'] = title 
                item['url'] = url
                self.crawled.add(url)
                yield item

That's a very basic scrapy spider, which you can invoke, for instance, like this:

scrapy crawl rpython -o file.json -t json

and store all the links and titles that lead to external blogs/services. That way you won't miss a single article :-). Hook this script up to a database, add a published_at field and get notified only when there is new stuff.
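
A rough sketch of the "hook this up to a database" idea, using a scrapy item pipeline (the sqlite file and table are made up; register the class in ITEM_PIPELINES to enable it):

# pipelines.py - drop duplicates and remember when we first saw each article
import datetime
import sqlite3


class StoreArticlePipeline(object):

    def open_spider(self, spider):
        self.conn = sqlite3.connect('articles.db')
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles "
            "(url TEXT PRIMARY KEY, title TEXT, published_at TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT OR IGNORE INTO articles VALUES (?, ?, ?)",
            (item['url'], item['title'], datetime.datetime.utcnow().isoformat())
        )
        return item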

Enjoy!

Work on your Python style


Goal

Coding style is important and in a way defines every programmer. Every day we come across well-tested code, code that works but is not generic, overthought code and, of course, crappy code :-). Personally, I had a tendency to value practical code (does the job well and is straightforward) over generic solutions, which I considered 'too much for now'. But as time passed I've learned that there are a couple of Python tools that strike a very good balance between generic and practical while remaining pythonic (which is what the cool kids aim for).

The idea behind this article is to show different approaches to a typical programmer struggle :-). We'll be designing a very simple metrics system (basically a hit counter) for our views or parts of our business logic.

(Keep in mind that a few lines are actually 'pseudo-code'.)

Slow start

from redis import StrictRedis


def my_view(request):
    # possibly database access

    r = StrictRedis()
    amount = r.get('my_view')
    if not amount:
        amount = 1
    else:
        amount = int(amount) + 1
    r.set('my_view', amount)
    # setup context and render your response

    return AwesomeHttpResponse()

So we've decided to use redis to store our metrics - that's a plus. But apart from that, the overall quality is rather poor.

Improvements, improvements ...

First: we did not use INCR so we are not guarding against race conditions. Let's fix that (all imports are skipped for convenience):

def my_view(request):
    # possibly database access

    r = StrictRedis()
    r.incr('my_view')
    # setup context and render your response

    return AwesomeHttpResponse()

Shorter and better. But something does not feel right. In this case connecting to redis is simple because we rely on defaults, but it may require specifying host, port and db, so our code may grow slightly. Apart from that, the key name is hardcoded. So each time we'd like to 'install' this piece of code in another view we'd have to duplicate everything. It's a job for a decorator!

def gimme_metric(f):
    @wraps(f)
    def _gimme_metric(*args, **kwargs):
        response = f(*args, **kwargs)
        r = StrictRedis()
        r.incr(f.__name__)
        return response
    return _gimme_metric


@gimme_metric
def my_view(request):
    return AwesomeHttpResponse()


@gimme_metric
def my_another_view(request):
    return YetAnotherHttpResponse()

That feels great. Not only is it reusable and lets us switch everything on and off - it looks pythonic (which means we're getting there). But still, something is missing. What if I'd like to use this code in a 'regular' class or function, not as a decorator? We need something better:

class metrics(object):

    def __init__(self, key=None, auto=True):
        self.key = key
        self.auto = auto

    def setup_backend(self):
        return StrictRedis()

    def bump_metric(self, key, amount=1):
        backend = self.setup_backend()
        backend.incr(name=key, amount=amount)

    def __enter__(self):
        if self.auto and not self.key:
            raise MetricException(u"Gimme key!")

        elif self.auto:
            self.bump_metric(key=self.key)

        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_val:
            logging.error(exc_val)

    def __call__(self, f):
        @wraps(f)
        def wrapped(*args, **kwargs):
            response = f(*args, **kwargs)
            self.bump_metric(f.__name__)
            return response
        return wrapped

Phew, that's pretty long. But boy, it was worth it. As I believe Raymond Hettinger said, context managers are one of the most underused features of Python, which is strange since they're a good fit for the acquire & release pattern. (Here it's not that obvious, but we could move the redis connection into __enter__ and provide better exception handling; nevertheless, the context manager protocol is ours to use freely.)
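
For completeness, a hedged sketch of that variant - the redis connection becomes the acquired resource, opened in __enter__ and released in __exit__ via redis-py's connection pool (meant for with-block usage):

class pooled_metrics(metrics):

    def __enter__(self):
        self.backend = self.setup_backend()  # acquire the connection on entry
        return super(pooled_metrics, self).__enter__()

    def bump_metric(self, key, amount=1):
        self.backend.incr(name=key, amount=amount)  # reuse the single connection

    def __exit__(self, exc_type, exc_val, exc_tb):
        super(pooled_metrics, self).__exit__(exc_type, exc_val, exc_tb)
        self.backend.connection_pool.disconnect()  # release it on exit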

We can decorate our regular views (function names will be used as keys):

@metrics()
def my_view(request):
    return AwesomeHttpResponse()


@metrics()
def my_another_view(request):
    return YetAnotherHttpResponse()

We can use custom metric around one block of code:

with metrics(key='super-metric'):
    # do something

Or bump a couple of metrics manually:

with metrics(auto=False) as m:
    m.bump_metric('Uno')
    m.bump_metric('Dos')

Last but not least - since it's a class, we can support other backends:

class memcached_metrics(metrics):

    def setup_backend(self):
        return MemcachedSuperDriver()

That was a long road - and there is still room for improvement. As you can see, with some tweaking we can improve the quality of working code and take it to another level. Here, we fixed a race condition, improved reusability by rewriting the component as a decorator and allowed another, more raw usage from any class or function.

Cheers :)

Use strace. Duh!


Really short entry, but I hope someone will benefit from it :)

The background

strace is a Linux tool that can attach to any process and monitor the system calls it makes.

It's simple & complex at the same time and requires a little reading on the topic, but it's totally worth it.

The scenario

Recently I had strange problems with postgresql (well, not with postgresql itself but with a certain postgresql configuration on a certain machine :)) which were pretty hard to debug the "standard" ways - monitoring software, logs, etc. Luckily I remembered a small tool called strace, which to be honest I hadn't used for 2 or 3 years. With that utility, it was pretty easy to find & fix the connection problems.

The point of this article: strace follows the unix philosophy of doing one small thing pretty well, but in a world of many tools & open source software it's rather forgotten (at least it was to me), which is a shame because it can be really helpful. Play with it and maybe one day you'll find it as handy as I did :)

After all the problems I came across an article which gives a more thorough explanation of the topic: Debugging obscure postgres problems.

Locust.io with nanomsg ? Easy peasy.


The Need

locust.io is a modern Python tool for load testing. I have a couple of small services running on nanomsg on a development server; although nanomsg is still in beta, I was tempted to move them to the production environment (basically because most of them did one thing and did it well). But before deployment I wanted to test my architecture, and I needed benchmarks for that.

It turns out that locust, which by default plays nicely with HTTP-based services, also provides a neat way to hook custom clients into its core features. So it's fairly easy to test XML-RPC, zeromq, rabbitmq or nanomsg based apps.

The code at the bottom is pretty straightforward - it's a simple nanomsg client that serves as an example, into which we hook locust events in order to collect locust metrics. 20-30 lines of simple magic and we're done!

We can now run:

$ locust

and enjoy :-)


The Code

import json
import time
import nanomsg

from locust import Locust, events, task, TaskSet


class SteroidSocket(nanomsg.Socket):

    def send_json(self, msg, flags=0, **kwargs):
        msg = json.dumps(msg, **kwargs).encode('utf8')
        self.send(msg, flags)

    def recv_json(self, buf=None, flags=0):
        msg = self.recv(buf, flags)
        return json.loads(msg)


class NanomsgClient(object):
    socket_type = nanomsg.REQ
    default_send_timeout = 100

    def __init__(self, address, **kwargs):
        self.address = address
        self.setup()

    def get_socket(self):
        return SteroidSocket(self.socket_type)

    def setup(self):
        self.socket = self.get_socket()
        self.socket.connect(self.address)
        self.socket._set_send_timeout(self.default_send_timeout)

    def get(self, msg):
        start_time = time.time()
        try:
            self.socket.send_json(msg)
            result = self.socket.recv_json()
            print result
        except nanomsg.NanoMsgAPIError as e:
            total_time = int((time.time() - start_time) * 1000)
            events.request_failure.fire(
                request_type="nanomsg",
                name=msg.get('executable', ''),
                response_time=total_time,
                exception=e
            )
        else:
            total_time = int((time.time() - start_time) * 1000)
            events.request_success.fire(
                request_type="nanomsg",
                name=msg.get('executable', ''),
                response_time=total_time,
                response_length=0
            )

    def close(self):
        self.socket.close()


class NanomsgLocust(Locust):

    def __init__(self, *args, **kwargs):
        super(NanomsgLocust, self).__init__(*args, **kwargs)
        self.client = NanomsgClient(self.address)


class NanomsgUser(NanomsgLocust):
    address = "tcp://127.0.0.1:5001"
    min_wait = 100
    max_wait = 1000

    class task_set(TaskSet):

        @task(1)
        def ping(self):
            self.client.get({'executable': 'ping'})

        @task(1)
        def pong(self):
            self.client.get({'executable': 'pong'})
            

The Conclusion

Whenever you need to test a custom architecture, find single points of failure or simply experiment with your stack - use locust.io to put your code under pressure :-)

Python in Production (II) - Postgresql


Introduction

Database tuning is often considered unnecessary, and many people leave it for the very end of development or skip it completely. I'll be covering logging, backups, indexing and the ORM's role, in order to give you some insight into different database-related tasks.

Logging

The thing that many people don't get right when it comes to configuring postgresql properly is logging.

$ vi postgresql.conf
  log_destination = 'stderr' # replace with csvlog when you use scripts/tools to analyze logs

  logging_collector = on # switch log functionality on

  log_directory = '/var/log/postgresql' # where our logs will go

  log_filename = 'postgresql-%Y-%m-%d.log' # log file naming strategy

  log_rotation_age = 1d # log files older than one day will be rotated

  log_rotation_size = 512MB # log files larger than 512MB will be rotated

  log_min_duration_statement = 50 # each query above 50ms will be logged

  log_line_prefix = '%t [%p]: [%l-1] db=%d,user=%u ' # how our logs will be written

  log_checkpoints = on
  log_connections = on
  log_disconnections = on
  log_hostname = on
  log_lock_waits = on
  log_temp_files = 0  

Connections, disconnections, lock waits, temp files and checkpoints not only give us an overview, but may also help debug other configuration parameters that rely, for instance, on memory.

You can rely on that logging configuration for most of your projects - having the logs rotated daily with reasonable metadata will surely help you analyse the problems and bottlenecks of your DB setup.

Logging - PgBadger

Logging itself is a useful little feature, but we can do better :) We can have our logs analyzed by pgBadger, which will not only aggregate useful information like CRUD statistics, but also provide a nice graphical representation of what's being done in our database.

Typical setup involves:

  1. Downloading & installing pgbadger according to official documentation.
  2. Setting up a cronjob with a bash script that processes the log files from /var/log/postgresql on a daily basis, creates a pgbadger report and stores/uploads it somewhere.

There are many tutorials and guides on how to set up pgbadger and tie it to cron: http://www.antelink.com/blog/using-pgbadger-monitor-your-postgresql-activity.html. In the end you'll be able to get a visual representation of your database queries, which may look, for instance, like this:

(pgBadger report screenshot)

Backups

There is no golden rule here; the backup setup depends on the project's business value and a couple of other factors. For a small project, a small bash script which performs pg_dump and then pushes the output file to a remote destination should be enough.

touch ~/scripts/database_backup.sh && vi ~/scripts/database_backup.sh

Add the following script to the created file:

#!/bin/bash

pg_dump -h localhost -U <user> --no-password -F t <database> > /srv/backups/database/db-backup-`date +"%s"`.sql.tar

Save the file and create a cron entry:

crontab -e
# Database backup

0 0,12,19 * * * sh /home/<user>/scripts/database_backup.sh

It's an anti-pattern to store backups on the same server, so either run the dump from a remote location or create a script that pushes your backups to another destination.

Backups - WAL-e

WAL-E is a tool created by Heroku that provides continuous archiving of WAL segments. To put it simply: if your app can't afford to lose transactions, it's the way to go.

All your WAL files will be stored on S3 as backups, in case you need them. The WAL (write-ahead log) is what postgresql uses to record database changes before they're applied. Archiving WAL segments with WAL-E will allow you to restore your database to its state from just before a crash.

Flock

It's not exactly part of this tutorial, but since I've mentioned creating cronjobs, I think it's a good time to introduce you to flock. Most Linux distributions have a command called flock, which will run a command only if it can get a lock on a certain file.

So changing our entry in crontab to:

0 0,12,19 * * * flock -n /tmp/database_backup.lock -c "sh /home/<user>/scripts/database_backup.sh"

will prevent us from running duplicate cronjobs, which not only slow the server down but may lead to hard-to-track errors. That's how you can easily take care of the problems your cronjobs may cause. Personally, I wrap all crontab entries with flock and I recommend you do the same.

Indexes

No golden rule here. First play with EXPLAIN ANALYZE to find slow queries. If that's not an option, at some point use pgbadger and the logs to find queries that may suffer without indexes.

Use partial indexes (also known as filtered) for stuff like:

  • querying database against a column which is NULL for most of the rows
  • querying database against some business value (salary, point in time, status or kind field)

for instance:

SELECT * from orders where payment_deadline IS NOT NULL;

or

SELECT * from orders WHERE value > 1000;

This will index only a subset of rows, which keeps the index significantly smaller compared to indexing the whole table.

Use composite indexes (also known as multicolumn) for queries that constantly rely on the same filtering conditions, like:

SELECT * from orders where owner_id = <owner_id> AND status = <status>;
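
If you manage your schema from SQLAlchemy (as in the ORM section below), both kinds of index can be declared next to the model - a sketch assuming a mapped Order class with these columns:

from sqlalchemy import Index

# partial index: only rows that actually have a payment deadline get indexed
Index(
    'ix_orders_payment_deadline',
    Order.payment_deadline,
    postgresql_where=Order.payment_deadline.isnot(None)
)

# composite index: matches queries that filter on owner_id AND status together
Index('ix_orders_owner_status', Order.owner_id, Order.status)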

Generally, indexing foreign keys can be considered good practice (some SQL databases do that automatically). For that you may find the following query useful (shamelessly borrowed from here):

WITH fk_actions ( code, action ) AS (
    VALUES ( 'a', 'error' ),
        ( 'r', 'restrict' ),
        ( 'c', 'cascade' ),
        ( 'n', 'set null' ),
        ( 'd', 'set default' )
),
fk_list AS (
    SELECT pg_constraint.oid as fkoid, conrelid, confrelid as parentid,
        conname, relname, nspname,
        fk_actions_update.action as update_action,
        fk_actions_delete.action as delete_action,
        conkey as key_cols
    FROM pg_constraint
        JOIN pg_class ON conrelid = pg_class.oid
        JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
        JOIN fk_actions AS fk_actions_update ON confupdtype = fk_actions_update.code
        JOIN fk_actions AS fk_actions_delete ON confdeltype = fk_actions_delete.code
    WHERE contype = 'f'
),
fk_attributes AS (
    SELECT fkoid, conrelid, attname, attnum
    FROM fk_list
        JOIN pg_attribute
            ON conrelid = attrelid
            AND attnum = ANY( key_cols )
    ORDER BY fkoid, attnum
),
fk_cols_list AS (
    SELECT fkoid, array_agg(attname) as cols_list
    FROM fk_attributes
    GROUP BY fkoid
),
index_list AS (
    SELECT indexrelid as indexid,
        pg_class.relname as indexname,
        indrelid,
        indkey,
        indpred is not null as has_predicate,
        pg_get_indexdef(indexrelid) as indexdef
    FROM pg_index
        JOIN pg_class ON indexrelid = pg_class.oid
    WHERE indisvalid
),
fk_index_match AS (
    SELECT fk_list.*,
        indexid,
        indexname,
        indkey::int[] as indexatts,
        has_predicate,
        indexdef,
        array_length(key_cols, 1) as fk_colcount,
        array_length(indkey,1) as index_colcount,
        round(pg_relation_size(conrelid)/(1024^2)::numeric) as table_mb,
        cols_list
    FROM fk_list
        JOIN fk_cols_list USING (fkoid)
        LEFT OUTER JOIN index_list
            ON conrelid = indrelid
            AND (indkey::int2[])[0:(array_length(key_cols,1) -1)] @> key_cols

),
fk_perfect_match AS (
    SELECT fkoid
    FROM fk_index_match
    WHERE (index_colcount - 1) <= fk_colcount
        AND NOT has_predicate
        AND indexdef LIKE '%USING btree%'
),
fk_index_check AS (
    SELECT 'no index' as issue, *, 1 as issue_sort
    FROM fk_index_match
    WHERE indexid IS NULL
    UNION ALL
    SELECT 'questionable index' as issue, *, 2
    FROM fk_index_match
    WHERE indexid IS NOT NULL
        AND fkoid NOT IN (
            SELECT fkoid
            FROM fk_perfect_match)
),
parent_table_stats AS (
    SELECT fkoid, tabstats.relname as parent_name,
        (n_tup_ins + n_tup_upd + n_tup_del + n_tup_hot_upd) as parent_writes,
        round(pg_relation_size(parentid)/(1024^2)::numeric) as parent_mb
    FROM pg_stat_user_tables AS tabstats
        JOIN fk_list
            ON relid = parentid
),
fk_table_stats AS (
    SELECT fkoid,
        (n_tup_ins + n_tup_upd + n_tup_del + n_tup_hot_upd) as writes,
        seq_scan as table_scans
    FROM pg_stat_user_tables AS tabstats
        JOIN fk_list
            ON relid = conrelid
)
SELECT nspname as schema_name,
    relname as table_name,
    conname as fk_name,
    issue,
    table_mb,
    writes,
    table_scans,
    parent_name,
    parent_mb,
    parent_writes,
    cols_list,
    indexdef
FROM fk_index_check
    JOIN parent_table_stats USING (fkoid)
    JOIN fk_table_stats USING (fkoid)
WHERE table_mb > 9
    AND ( writes > 1000
        OR parent_writes > 1000
        OR parent_mb > 10 )
ORDER BY issue_sort, table_mb DESC, table_name, fk_name;

To put it simply: this query finds foreign keys that are not indexed :)

Know Your ORM

At the very end - I don't intend to post yet another ORM showdown here. What I want to outline is that it's good to learn your ORM. Consider:

We will be playing with a database containing a couple hundred thousand rows and the following SQLAlchemy models. Note that the foreign keys are already indexed.

class Country(Base):
    __tablename__ = 'countries'

    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False)


class Player(Base):
    __tablename__ = 'players'

    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False)
    country_id = Column(ForeignKey(u'countries.id'))
    team_id = Column(ForeignKey(u'teams.id'))

    country = relationship(u'Country')
    team = relationship(u'Team')


class Team(Base):
    __tablename__ = 'teams'

    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False)

First, let's start simple. Let's fetch all player ids, names and their team_ids:

djangodb=> EXPLAIN ANALYZE SELECT id, name, team_id FROM players;
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan on players  (cost=0.00..3274.00 rows=200000 width=15) (actual time=0.010..33.241 rows=200000 loops=1)
 Total runtime: 42.780 ms

OK, let's see how fast SQLAlchemy deals with that query:

session.query(Player).all()

3.5 seconds. No surprise here - we have to create objects, allocate memory and generally process everything. But can we do better?

Let's try fetching particular columns first:

session.query(Player.id, Player.name, Player.team_id).all()

0.6 seconds. Not bad. But wait, we can drop the declarative layer and try using Core:

players = Player.__table__
statement = select([players.c.id, players.c.name, players.c.team_id])
engine.execute(statement).fetchall()

0.4 seconds. Now, we're talking

Now, let's do some joining

EXPLAIN ANALYZE SELECT p.id, p.name, c.name, t.name FROM players p LEFT OUTER JOIN countries c ON p.country_id = c.id LEFT OUTER JOIN teams t ON p.team_id = t.id;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=13118.00..29534.00 rows=200000 width=25) (actual time=160.197..450.810 rows=200000 loops=1)
   Hash Cond: (p.team_id = t.id)
   ->  Hash Left Join  (cost=6559.00..16404.00 rows=200000 width=22) (actual time=85.288..226.012 rows=200000 loops=1)
         Hash Cond: (p.country_id = c.id)
         ->  Seq Scan on players p  (cost=0.00..3274.00 rows=200000 width=19) (actual time=0.016..16.286 rows=200000 loops=1)
         ->  Hash  (cost=3082.00..3082.00 rows=200000 width=11) (actual time=85.046..85.046 rows=200000 loops=1)
               Buckets: 4096  Batches: 16  Memory Usage: 558kB
               ->  Seq Scan on countries c  (cost=0.00..3082.00 rows=200000 width=11) (actual time=0.016..26.853 rows=200000 loops=1)
   ->  Hash  (cost=3082.00..3082.00 rows=200000 width=11) (actual time=74.713..74.713 rows=200000 loops=1)
         Buckets: 4096  Batches: 16  Memory Usage: 558kB
         ->  Seq Scan on teams t  (cost=0.00..3082.00 rows=200000 width=11) (actual time=0.004..23.784 rows=200000 loops=1)
 Total runtime: 458.158 ms

So, using all three of those techniques:

session.query(Player).\
            options(joinedload('country')).\
            options(joinedload('team')).\
            all()
            
session.query(Player.id, Player.name, Country.name, Team.name).\
            outerjoin(Country).outerjoin(Team).all()
            
players = Player.__table__
countries = Country.__table__
teams = Team.__table__
statement = select(
    [players.c.id, players.c.name, countries.c.name, teams.c.name]
).select_from(players.outerjoin(countries).outerjoin(teams))
engine.execute(statement).fetchall()

takes respectively: 9.5, 1.3 and 0.95 seconds.

Apart from the powerful Core that allows you to build low-level queries, SQLAlchemy also comes with yield_per, bundles, a powerful join system and from_statement, which are really handy when queries need to perform a bit better.
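
For instance, yield_per streams rows in batches instead of materialising all of the objects at once (a short sketch on the models above; process_player is a hypothetical function):

# fetch 200k players in batches of 1000 without holding them all in memory
for player in session.query(Player).yield_per(1000):
    process_player(player)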

Stay tuned for part 3!

Python in Production (I) - Linux


Introduction

The aim of this guide (I plan to create a couple of separate parts for different aspects of running a web application in production) is to introduce you to key concepts and problems that may occur on a live server. I'll touch on the server itself, the database, the application server and deployment, and provide you with the configuration files and workflow that I use and rely on. Note that it's extremely subjective, but I think this configuration may be considered at least a "reasonable default". Let's dig in, shall we?

Stack

Most of the posts refer to the stack I work on, which is:
- nginx
- haproxy
- uwsgi
- redis
- postgresql
- python2.7

Linux

By Linux I mean Ubuntu, since that's the distribution I use for most of my projects. Playing with the Linux configuration, we can mostly affect three things:
- I/O
- Nginx
- Memory

Max open files

By default set to 1024. Since sockets are used for communication between the different tools, this limit affects how many concurrent connections our stack is able to handle. If we expect high traffic, it's good practice to tune that setting:

$ ulimit -n 99999
$ vi /etc/security/limits.conf
    nginx       soft    nofile  99999
    nginx       hard    nofile  99999

The ulimit call above raises the limit for the current shell right away; the limits.conf entries apply to new sessions (note that sysctl -p reloads sysctl settings, not limits.conf).
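
If you want to verify the limit from inside the application itself, the standard library exposes it (just a quick sanity check):

import resource

# (soft, hard) limit on open file descriptors for the current process
print(resource.getrlimit(resource.RLIMIT_NOFILE))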

Kernel queue for accepting new connections

By default set to 128, it represents the size of the kernel queue for accepting new connections.

$ sysctl -w net.core.somaxconn=99999
$ vi /etc/sysctl.d/haproxy-tuning.conf
    net.core.somaxconn=99999

Usable ports

By default set to 32768-61000, it represents the range of local ports that can be used by our system, which affects the number of concurrent open connections.

$ sysctl -w net.ipv4.ip_local_port_range="10000 61000"
$ vi /etc/sysctl.d/haproxy-tuning.conf
    net.ipv4.ip_local_port_range=10000 61000

Socket recycling

DO NOT ENABLE IT !

A common misconception is to enable fast recycling (most tuning guides give such advice) so sockets do not stay in TIME_WAIT, like this:

$ vi /etc/sysctl.conf
  net.ipv4.tcp_tw_recycle = 1
  net.ipv4.tcp_tw_reuse = 1

However, as explained here: Click, it's highly discouraged.

Filesystem access

In order to improve I/O we can tell Linux not to store information about last file access or read time (which it keeps by default). To change that, modify the configuration of the partition your files reside on.

Replace

$ vi /etc/fstab
    UUID=<UUID> /               ext4    errors=remount-ro 0       1

with

$ vi /etc/fstab
    UUID=<UUID> /               ext4    noatime,nodiratime,errors=remount-ro 0       1

noatime affects files and nodiratime affects directories.

In memory filesystem for /tmp

Putting

$ vi /etc/fstab
    tmpfs /tmp               tmpfs    defaults,nosuid,noatime 0       0

to your /etc/fstab file replaces the filesystem for the /tmp directory with an in-memory filesystem. This greatly increases I/O performance for file uploads. Note that it may become a bottleneck when the uploaded files are large or when you are short on RAM.

At the very end, mount the new filesystem:

$ mount /tmp

Getting swap right

When you're forced to add some swap to your system, be sure to put these two lines in your sysctl.conf:

$ vi /etc/sysctl.conf
    vm.swappiness=10
    vm.vfs_cache_pressure=50

which tell our system, respectively, not to swap data out of RAM to swap space so eagerly (swappiness) and to keep more filesystem metadata cached so it isn't looked up as frequently (vfs_cache_pressure).

The three things I want you to remember after this part are:

  1. The default configuration of your system is fine, but may not be properly tuned for high loads or for getting the maximum out of the tools in your stack (nginx, haproxy).
  2. No one knows all this stuff by heart (at least I don't), so if you find this article useful, save it somewhere so you can look it up later when it comes to configuration, or create an ansible playbook that deals with it :-).
  3. Stay tuned for part 2 !

Cheers!