abstract of database

What is Normalized Data?

In the early 2000s, I had a paper test that I gave to developer candidates. It was a single sheet that I printed out from Excel, in landscape, containing order data from a small fictional online bookstore. I asked each candidate to normalize the data from the spreadsheet by drawing and writing on a blank sheet of paper I provided.

Denormalized data.

The person taking the test might end up drawing separate tables for Person, Order, Book, and BookOrder with foreign keys drawn between tables. This is database normalization. We collected flat data in a spreadsheet, where each row contains all the data we needed for our purposes. We normalized it in order to prevent redundancy by designing separate tables with foreign keys.

Tables drawn as an answer to a test.

What is Denormalized Data?

On every web project I've worked on, I'm generally working with normalized data.

Denormalized data is when we actually have data redundancy in our database, which we might do in order to improve performance. If we keep data flat, it's called denormalized. With denormalized data, we don't have to seek the data from related tables, and our database calls might return more quickly than if the data was normalized.

Ok, so, I guess I should go and flatten all my data now, if you're telling me that's how to make my website fast, right?

Not so fast.

Using JSONField to Denormalize

I mean, yes, database-wide denormalization might make things faster, but don't throw out your relational database just yet.

  1. You can actually keep your data normalized and relational, thus preventing data redundancy and making it easier to manage for anyone with Django admin privileges.
  2. You can also denormalize in order to gain performance by using one or more Django JSONFields.
  3. Django JSONField currently supports PostgreSQL, which we use, as well as MariaDB, MySQL 5.7.8+, Oracle, and SQLite, at this time.

Speed up Reports

I work for an energy consulting company which has a mission to protect the environment, called Energy Solutions. The Django project I work on for ES is a bit too complex to use as an example, so I will continue to use the bookstore database example.

One way Django’s JSONField helped us was to speed up reporting. Consider the bookstore example. If we wanted a report on different factors on book sales, such as how well books were selling to customers in different states, we could write a specific Django query. That solution would require a developer to write the query and to implement it on our store website. If, however, we were able to allow users to report directly on a single table, say the Book table, we could let them write the queries themselves as long as all the data was in one place.

Total book sales for one book, across all states.

If we added a JSONField called book_stats for example, each time a book sold, we could capture data on the state of the shipping address and add that to the JSONField which tracked those statistics (and any other statistics).

Class Book():
    book_stats = models.JSONField(
        help_text="Keep a snapshot of book sales data.",

Class BookOrder():
    order = models.ForeignKey(“Order”)
    book = models.ForeignKey(“Book”) 

    def save(self, **args, **kwargs):
        # For each new book-order we save, update the book’s stats JSONField
        us_state = self.order.user.us_state
        self.book.book_stats[us_state][“total_sold”] = 
            self.book.bookstats[us_state][“total_sold”] + 1 
        self.book.book_stats[us_state][“last_purchase_username”] = user.username
        self.book.book_stats[us_state][“last_purchase_date”] = datetime.now()

Our JSONField for book_stats

If we built a front end to allow users to drill down into any data they wanted on the Book model, we could allow them to pull this data down on demand.

Front-end UI for our Book model. Massive appreciation to Scot Hacker for coming up with this speedy reporting feature for one of our projects.

Yes, we can get this data via queries and calculate these numbers on the fly with Django's annotate() and aggregate(). However, if we keep this running tally of statistics in a JSONField in one table, we don't have to write those queries in the code by hand. We can allow our users to report directly from that one table, and the report queries are fast because we aren't joining to multiple relational tables. Additionally, users can create reports without having to ask a developer.

Reporting is just one way we've found performance gains via JSONFields. How do you use JSONFields in your projects?

Hey there, Caktus folks.

  • I’m Erin Mullaney
  • I’m a Caktus alum (2015-2017) who has written a few articles on the blog.
  • I love writing about technical subjects when I want to share something I find interesting.
  • One time, at DjangoCon, someone told me that a blog post I wrote about Wagtail was helpful to them and it was a huge highlight for me.
  • Outside of blogging, I work for Energy Solutions as a codebase lead. We are currently partnering with Caktus on a project to automate testing of one of our Django projects.
blog comments powered by Disqus



You're already subscribed