Digging Into Django QuerySets

Object-relational mappers (or ORMs for short), such as the one built into Django, make it easy for even new developers to become productive without needing a large body of knowledge about relational databases. They abstract away the details of database access, replacing tables with declarative model classes and queries with chains of method calls. Since this is all done in standard Python, developers can build on top of it further, adding instance methods to a model to wrap reusable pieces of logic. However, the abstraction provided by ORMs is not perfect. There are pitfalls lurking for unwary developers, such as the N + 1 problem. On the bright side, it is not difficult to explore and gain a better understanding of Django's ORM, and taking the time and effort to do so will help you become a better Django developer.

In this article I'll set up a simple example app, consisting of nothing more than a few models, and then use the Django shell to perform various queries and examine the results. You don't have to follow along, but I recommend it.

First, create a clean virtualenv. Here I'll be using Python 3 all of the way, but there should be little difference with Python 2.

$ mkvirtualenv -p $(which python3) querysets
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /home/jrb/.virtualenvs/querysets/bin/python3
Also creating executable in /home/jrb/.virtualenvs/querysets/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.

Next, install Django and IPython,

(querysets) $ pip install django ipython

Create the new project.

(querysets) $ django-admin.py startproject querysets
(querysets) $ cd querysets/
(querysets) $ ./manage.py startapp qs

Update querysets/settings.py to add 'qs' to the end of the INSTALLED_APPS list. Then edit qs/models.py to add the simple models we will be dealing with

from django.db import models


class OnlyOne(models.Model):
    name = models.CharField(max_length=16)


class MainModel(models.Model):
    name = models.CharField(max_length=16)
    one = models.ForeignKey(OnlyOne, on_delete=models.CASCADE)


class RelatedModel(models.Model):
    name = models.CharField(max_length=16)
    main = models.ForeignKey(MainModel, related_name="many", on_delete=models.CASCADE)

Finally, set up the database.

(querysets) jrb@caktus025:~/caktus/querysets$ ./manage.py makemigrations qs
Migrations for 'qs':
  qs/migrations/0001_initial.py:
    - Create model MainModel
    - Create model OnlyOne
    - Create model RelatedModel
    - Add field one to mainmodel
(querysets) jrb@caktus025:~/caktus/querysets$ ./manage.py migrate
...

Running python manage.py shell should now pull up an IPython session.

Now that we have a working project set up, we'll need some means of keeping track of the quantity and the raw SQL of any queries sent to the database. Django's TransactionTestCase class provides an assertNumQueries method, which is interesting but too specific and too tied to the test suite for our needs. However, examining its implementation, we can see that it ultimately makes use of a context manager called CaptureQueriesContext, from the django.test.utils module. This context manager causes a database connection to capture all of the SQL queries sent through it, even if query logging is otherwise turned off (i.e. when DEBUG = False), and makes those queries available on the context object. I find it a useful debugging tool for tracking down code that issues too many queries to the database, in situations where Django Debug Toolbar won't help.
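
(For reference, assertNumQueries is used in a test roughly like the following minimal sketch; the test class name and the expected count of 1 are just illustrative.)

from django.test import TestCase

from qs import models


class QueryCountTests(TestCase):
    def test_main_model_list_is_one_query(self):
        # Hypothetical test: fails if the block issues a different number
        # of queries than the one given.
        with self.assertNumQueries(1):
            list(models.MainModel.objects.select_related('one'))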

At the time of writing, the most recent released version of Django is 1.10.6. I've copied the code for CaptureQueriesContext for this version below, with a few irrelevancies redacted.

class CaptureQueriesContext(object):
    def __init__(self, connection):
        self.connection = connection

    @property
    def captured_queries(self):
        return self.connection.queries[self.initial_queries:self.final_queries]

    def __enter__(self):
        self.force_debug_cursor = self.connection.force_debug_cursor
        self.connection.force_debug_cursor = True
        self.initial_queries = len(self.connection.queries_log)
        self.final_queries = None
        request_started.disconnect(reset_queries)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.connection.force_debug_cursor = self.force_debug_cursor
        request_started.connect(reset_queries)
        if exc_type is not None:
            return
        self.final_queries = len(self.connection.queries_log)

So here we can see several things of interest to us. The context manager keeps a reference to the database connection (as self.connection), it sets and then unsets a flag on the connection (self.connection.force_debug_cursor) which tells the connection to capture queries, it stores the number of queries at the start and at the end (self.initial_queries and self.final_queries), and finally, it provides a slice of the actual queries captured as the property captured_queries. Nothing here restricts its use to the test suite, so we'll be making use of it throughout our IPython session.

Let's try it out now.

In [1]: from django.test.utils import CaptureQueriesContext

In [2]: from django.db import connection

In [3]: from qs import models

In [4]: with CaptureQueriesContext(connection) as context:
   ...:     print(models.MainModel.objects.all())
   ...:
<QuerySet []>

In [5]: print(context.initial_queries, context.final_queries)
0 1

So we can see that there were no queries to start out with, and that a query was issued to the database by our code. Let's see what that looks like,

In [6]: print(context.captured_queries)
[{'time': '0.001', 'sql': 'SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id" FROM "qs_mainmodel" LIMIT 21'}]

This shows us that the captured_queries property gives us a list of dicts, each containing the raw SQL and the time it took to execute. In the above query, note the LIMIT 21. This is there because the repr() of a QuerySet limits itself to showing no more than 20 of the items it contains. The twenty-first item is fetched so that the repr knows whether or not to add an ellipsis at the end indicating that more items are available.
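
You can see this in the QuerySet source; in Django 1.10, the __repr__ implementation in django/db/models/query.py looks roughly like this (REPR_OUTPUT_SIZE is a module-level constant set to 20):

def __repr__(self):
    # Fetch one extra item beyond the display limit to detect truncation.
    data = list(self[:REPR_OUTPUT_SIZE + 1])
    if len(data) > REPR_OUTPUT_SIZE:
        data[-1] = "...(remaining elements truncated)..."
    return '<%s %r>' % (self.__class__.__name__, data)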

Let's create some data. First up, we need a quick and dirty way of populating the name fields

In [7]: import random

In [8]: import string

In [9]: def random_name():
   ...:     return ''.join(random.choice(string.ascii_letters) for i in range(16))
   ...:

In [10]: random_name()
Out[10]: 'nRtybzKaSZWjHOBZ'

Now the objects

In [11]: with CaptureQueriesContext(connection) as context:
    ...:     models.OnlyOne.objects.bulk_create([
    ...:         models.OnlyOne(name=random_name())
    ...:         for i in range(5)
    ...:     ])
    ...:     models.MainModel.objects.bulk_create([
    ...:         models.MainModel(name=random_name(), one_id=i + 1)
    ...:         for i in range(5)
    ...:     ])
    ...:     models.RelatedModel.objects.bulk_create([
    ...:         models.RelatedModel(name=random_name(), main_id=i + 1)
    ...:         for i in range(5)
    ...:         for x in range(7)
    ...:     ])
    ...:

In [12]: print(context.final_queries - context.initial_queries)
6

In [13]: print(context.captured_queries)
[{'sql': 'BEGIN', 'time': '0.000'},
 {'sql': 'INSERT INTO "qs_onlyone" ("name") SELECT \'TSdxoYKuGUnijmVY\' UNION ALL SELECT \'DNcSpIJbnVXjbabq\' UNION ALL SELECT \'suAQQAzEflBqLxuc\' UNION ALL SELECT \'hWtuPkfjdATxZhNV\' UNION ALL SELECT \'GTwPkXTUSpZYBWCT\'', 'time': '0.000'},
 {'sql': 'BEGIN', 'time': '0.000'},
 {'sql': 'INSERT INTO "qs_mainmodel" ("name", "one_id") SELECT \'fsekHOfSJxdiGiqp\', 1 UNION ALL SELECT \'dHdPoqeKZzRCJEql\', 2 UNION ALL SELECT \'MiDwEPvqIuxCEArT\', 3 UNION ALL SELECT \'yCCRaPiLzWnnUewS\', 4 UNION ALL SELECT \'ftfQWWfuZhXblNlF\', 5', 'time': '0.000'},
 {'sql': 'BEGIN', 'time': '0.000'},
 {'sql': 'INSERT INTO "qs_relatedmodel" ("name", "main_id") SELECT \'tMOCzPRjKZHbwBLb\', 1 UNION ALL SELECT \'SLxCmCtmxeCwpkAC\', 1 UNION ALL SELECT \'qccDQKsgmTFCIMsF\', 1 UNION ALL SELECT \'LWAvYdGQvBdlsYjI\', 1 UNION ALL SELECT \'gLjaGTjoLoNMkbDl\', 1 UNION ALL SELECT \'PqjvVLhCFoMOlVfH\', 1 UNION ALL SELECT \'BpHnEzhucSZNryWs\', 1 UNION ALL SELECT \'CYOkzJgsrGYOLOoB\', 2 UNION ALL SELECT \'HheikdsLaQnWpZBj\', 2 UNION ALL SELECT \'mXqccnNYLDrQOoiT\', 2 UNION ALL SELECT \'BipJWoDVlJoyNPxD\', 2 UNION ALL SELECT \'BgvvYHlAHegyRjbF\', 2 UNION ALL SELECT \'GrlOlbMwnqfkPKZX\', 2 UNION ALL SELECT \'OFJchZLVjmXNAHjO\', 2 UNION ALL SELECT \'SYHRkSvBmupzUHXO\', 3 UNION ALL SELECT \'imAQEQUyrNoRNRSG\', 3 UNION ALL SELECT \'ZEvmnPMurchiLfcd\', 3 UNION ALL SELECT \'kYtKQNoeoUuxpYPC\', 3 UNION ALL SELECT \'FvGFRSMariUanWsL\', 3 UNION ALL SELECT \'VKXEeClDnrnruAng\', 3 UNION ALL SELECT \'eDnEaWAqWRWdCvMc\', 3 UNION ALL SELECT \'wmIKiiqHBAJiOkMb\', 4 UNION ALL SELECT \'pzEMvmVqbSkLICVO\', 4 UNION ALL SELECT \'dIclLsVIXaHIyUYk\', 4 UNION ALL SELECT \'nDyHLSYgBAYIZAkP\', 4 UNION ALL SELECT \'GfrOYcPPYRXMBvmC\', 4 UNION ALL SELECT \'PZUiAwekQlmIMJAW\', 4 UNION ALL SELECT \'jnWbngcVgVPFAJNn\', 4 UNION ALL SELECT \'RQQyrDQpTIPxItND\', 5 UNION ALL SELECT \'SaFNLtavfdceqzTE\', 5 UNION ALL SELECT \'CSmoYuPNttJTFdlH\', 5 UNION ALL SELECT \'PxufMeDfIBeMAtQV\', 5 UNION ALL SELECT \'mNaTQepfHkFMRFet\', 5 UNION ALL SELECT \'CHlOqOHXIDyzorfW\', 5 UNION ALL SELECT \'BKgXGwdXJQBMQGJM\', 5', 'time': '0.000'}]

This looks pretty ugly, but we can see that each .bulk_create() results in two queries, a BEGIN starting the transaction, and an INSERT INTO with a crazy set of SELECT and UNION ALL clauses following it.

Ok, now that we are finally all set up, let's explore. What happens if we just create a QuerySet and set it in a variable?

In [14]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.all()
    ...:

In [15]: print(context.final_queries - context.initial_queries)
0

In [16]: print(context.captured_queries)
[]

No queries were sent to the database! This is because a Django QuerySet is a lazy object. It contains all of the information it needs to populate itself from the database, but will not actually do so until the information is needed. Similarly, .filter(), .exclude(), and the other QuerySet-returning methods will not, by themselves, trigger a query sent to the database.

In [17]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.filter(name='foo')
    ...: print(context.final_queries - context.initial_queries)
    ...:
0

In [18]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.filter(name='foo')
    ...:     qs2 = qs.filter(name='bar')
    ...: print(context.final_queries - context.initial_queries)
    ...:
0

Here we see that even chaining a filtered QuerySet off of another QuerySet is insufficient to cause a database access. However, non-QuerySet-returning methods such as .count() will result in a query sent to the database.

In [19]: with CaptureQueriesContext(connection) as context:
    ...:     count = models.MainModel.objects.count()
    ...: print(context.final_queries - context.initial_queries)
    ...:
1

So, when will a QuerySet result in a round-trip to the database? Basically, this happens any time concrete results are needed from the QuerySet, such as when it is looped over, explicitly or implicitly. Here are some of the more typical ones

In [20]: with CaptureQueriesContext(connection) as context:
    ...:     for m in models.MainModel.objects.all():
    ...:         obj = m
    ...:     r = repr(models.OnlyOne.objects.all())
    ...:     l = len(models.RelatedModel.objects.all())
    ...:     list_main = list(models.MainModel.objects.all())
    ...:     b = bool(models.OnlyOne.objects.all())
    ...: print(context.final_queries - context.initial_queries)
    ...:
5

Note that each of these triggers its own query. The Django docs have a full list of the things that cause a QuerySet to trigger a query.

We've now seen that simply instantiating a QuerySet doesn't send a query to the database, and that obtaining data out of it does. The next most obvious question is, will a QuerySet ask for data from the database multiple times? Let's find out

In [21]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.all()
    ...:     L = list(qs)
    ...:     L2 = list(qs)
    ...: print(context.final_queries - context.initial_queries)
    ...:
1

Terrific! Just as we would hope, the QuerySet somehow reuses its previous data when we ask for it again. Keep in mind, though, that if we attempt to further refine a QuerySet,

In [22]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.all()
    ...:     L = list(qs)
    ...:     qs2 = qs.filter(name__startswith='b')
    ...:     L2 = list(qs2)
    ...: print(context.final_queries - context.initial_queries)
    ...:
2

it does not re-use the data. So how does this work? The implementation of QuerySet can be found in django.db.models.query, but in particular, let's look at the implementation of the relevant methods

def __iter__(self):
    self._fetch_all()
    return iter(self._result_cache)

def _fetch_all(self):
    if self._result_cache is None:
        self._result_cache = list(self.iterator())
    if self._prefetch_related_lookups and not self._prefetch_done:
        self._prefetch_related_objects()

def iterator(self):
    return iter(self._iterable_class(self))

So we can see that iterating over a QuerySet will check to see if a cache at ._result_cache is populated yet, and if not, populates it with a list of objects. This list, then, is what will be iterated over. Subsequent iterations will then get the cache, so no further queries are issued. Doing a chained .filter() call, though, results in a new QuerySet that does not share the cache of the previous one.

The iterator() method used above is a documented public method, which returns an iterator over a configurable iterable class of model instances. Note that it does not involve the cache, so subsequent calls will result in a new query to the database. So why is this a public method? Under what circumstances would it be useful to not populate the cache? The iterator() method is most useful when you have memory concerns when iterating over a particularly large QuerySet, or one that has a large amount of data stored in the fields, especially if it is known that the QuerySet will only be used once and then thrown away.
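
For example, a minimal sketch of a one-off pass over a table with .iterator() (the work done per row, summing name lengths here, is just a stand-in):

from qs import models

# .iterator() streams model instances without filling _result_cache, so memory
# use stays roughly constant even if the table is very large.
total = 0
for item in models.RelatedModel.objects.iterator():
    total += len(item.name)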

Interestingly, certain non-QuerySet-returning methods such as .count(),

In [23]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.all()
    ...:     L = list(qs)
    ...:     c = qs.count()
    ...: print(context.final_queries - context.initial_queries)
    ...:
1

can also make use of an already filled cache.
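
A glance at the implementation in django.db.models.query shows why; in Django 1.10, count() looks roughly like this:

def count(self):
    # If the QuerySet has already been evaluated, count the cached results
    # in Python instead of asking the database again.
    if self._result_cache is not None:
        return len(self._result_cache)

    return self.query.get_count(using=self.db)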

A common pattern that you will see is iterating over a QuerySet in a template, and rendering information about each item, which may involve access of related objects. To simulate this, let's loop and set the name of each item's OnlyOne into a variable.

In [24]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.all():
    ...:         name = item.one.name
    ...: print(context.final_queries - context.initial_queries)
    ...:
6

Six queries! What could possibly be going on here?

In [25]: for q in context.captured_queries:
    ...:     print(q['sql'])
    ...:
SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id" FROM "qs_mainmodel"
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 1
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 2
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 3
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 4
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 5

As we can see, we have one query which populates the main QuerySet, but then as each item gets processed, each sends an additional query to get the item's associated OnlyOne object. This is referred to as the N + 1 Problem. But how can we fix it? It turns out that Django comes with a QuerySet method for just this purpose: select_related(). If we adjust our code like this,

In [26]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.select_related('one').all():
    ...:         name = item.one.name
    ...: print(context.final_queries - context.initial_queries)
    ...:
1

we drop back down to only one query again

In [27]: for q in context.captured_queries:
    ...:     print(q['sql'])
    ...:
SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id", "qs_
onlyone"."id", "qs_onlyone"."name" FROM "qs_mainmodel" INNER JOIN "qs_onlyone" O
N ("qs_mainmodel"."one_id" = "qs_onlyone"."id")

So .select_related('one') tells Django to do an INNER JOIN across the foreign key, and to make use of that information when instantiating the objects in Python. Great! The select_related() method can take multiple arguments and will do a join for each of them. You can also join multiple tables deep by using Django's double-underscore syntax; for example, .select_related('foo__bar') would join our main model's table with the table for 'foo', and then further join with the table for 'bar'. Note that other things that would cause a join in the SQL, such as filtering on a field of the related object, will not cause that related object to be made available as a Python object; you still need to specify your .select_related() fields explicitly.
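
For example (a sketch using the models from this article), filtering across the foreign key produces a join in the SQL, but item.one still costs an extra query per row unless select_related() is also given:

# Joins qs_onlyone for the WHERE clause only; accessing item.one afterwards
# still issues one query per MainModel row.
qs = models.MainModel.objects.filter(one__name__startswith='a')

# Same filter, but the joined OnlyOne columns are also used to populate
# item.one, so no further queries are needed.
qs = models.MainModel.objects.filter(one__name__startswith='a').select_related('one')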

This all works if the model we are querying has a foreign key to the other model. What if the relationship runs the other way, resulting in a one-to-many relationship?

In [29]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.all():
    ...:         for related in item.many.all():
    ...:             name = related.name
    ...: print(context.final_queries - context.initial_queries)
    ...:
6

In [30]: for q in context.captured_queries:
    ...:     print(q['sql'])
    ...:
SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id" FROM
"qs_mainmodel"
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 1
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 2
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 3
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 4
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 5

As before, we get 6 queries. However, if we were to try to use .select_related('many'), we would get a FieldError. For this situation, Django provides a different method to mitigate the problem: prefetch_related.

In [31]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.prefetch_related('many').all():
    ...:         for related in item.many.all():
    ...:             name = related.name
    ...: print(context.final_queries - context.initial_queries)
    ...:
2

Two queries; that's at least better. What's going on here, though? Why two? If we take a look at the queries generated, we see

In [32]: for q in context.captured_queries:
    ...:     print(q['sql'])
    ...:
SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id" FROM
"qs_mainmodel"
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" IN (1, 2, 3, 4, 5)

So it turns out that Django first loads up the QuerySet for MainModel, then it determines what primary key values it received, and then does a second query on RelatedModel, filtering on those that have a foreign key to one of those values.
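
In rough outline (a sketch, not Django's actual implementation), the prefetch is equivalent to doing this by hand:

from collections import defaultdict

from qs import models

# Step 1: one query for the parent objects.
mains = list(models.MainModel.objects.all())

# Step 2: one query for all related objects, filtered with IN on the parents'
# primary keys, then grouped by foreign key value in Python.
related = models.RelatedModel.objects.filter(main_id__in=[m.pk for m in mains])
by_main = defaultdict(list)
for r in related:
    by_main[r.main_id].append(r)

# Each parent's children are now available without any further queries.
for m in mains:
    for r in by_main[m.pk]:
        name = r.name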

There is one thing that you should be aware of when prefetching one-to-many relationships in this manner. A fairly typical thing to do is to make use of Django models' object-oriented nature and write instance methods that do some non-trivial computation, sometimes involving looping or filtering on one-to-many or many-to-many relationships. We'll simulate that here by just using a .filter() call in the inner loop

In [33]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.prefetch_related('many').all():
    ...:         for related in item.many.filter(name__startswith='b'):
    ...:             name = related.name
    ...: print(context.final_queries - context.initial_queries)
    ...:
7

And now we find that we're back up to seven queries, despite the use of .prefetch_related(). What's going on here is that the prefetch makes item.many.all() act exactly like an already iterated-over QuerySet from earlier in this article, by filling its cache for later re-use. However, just as in those earlier cases, any further refinement produces a new QuerySet that does not share the cache. In many cases, it would simply be better to iterate over the relationship and filter in Python directly. Additionally, Django 1.7 introduced a Prefetch object, which allows more control over the query used in the prefetch_related() call; a sketch follows below. I advise using tools such as Django Debug Toolbar, with real data, to determine what makes the most sense for your use case.
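
Here is a minimal sketch of Prefetch (the to_attr name 'related_b' is just illustrative): the "name starts with b" filtering is pushed into the prefetch query itself, so the whole loop still costs only two queries.

from django.db.models import Prefetch

from qs import models

# Do the filtering inside the prefetch query and store the results on a
# separate attribute, so iterating over them hits the cache, not the database.
qs = models.MainModel.objects.prefetch_related(
    Prefetch(
        'many',
        queryset=models.RelatedModel.objects.filter(name__startswith='b'),
        to_attr='related_b',
    )
)

for item in qs:
    for related in item.related_b:
        name = related.name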

There is another thing that you should be aware of when encapsulating queries involving one-to-many or many-to-many relationships. You may see code like this

def some_expensive_calculation(self):
    related_objs = RelatedModel.objects.filter(main=self)
    ...

This code, as we should now be able to see, is an anti-pattern: it will always issue a query when called on a MainModel instance, regardless of whatever optimizations have been used on the QuerySet which obtained the MainModel in the first place. It would be better to do this instead

def some_expensive_calculation(self):
    related_objs = self.many.all()
    ...

That way, if we have calling code that does this

for item in models.MainModel.objects.prefetch_related('many'):
    result = item.some_expensive_calculation()
    ...

we should only get the two queries we expect, not one for the main set plus one each for however many items are in that set.

So now we've seen that the QuerySets that you use in your apps can have significant real-world performance implications. However, with some care and understanding of the simple concepts behind Django's QuerySets, you can improve your code and become a better Django developer. But more than that, I hope that you take away from this article the realization that you shouldn't be afraid to read Django's source code to see how something works, or to build minimal working examples or simple tools to explore problems within the Django shell.

