How to Do Wagtail Data Migrations

Wagtail is a fantastic content management system that does a great job of making it easy for developers to get a new website up and running quickly and painlessly. It’s no wonder that Wagtail has grown to become the leading Django-based CMS. As one of the creators of Wagtail recently said, it makes the initial experience of getting a website set up and running very good. At Caktus, Wagtail is our go-to framework when we need a content management system.

Wagtail StreamFields are a particularly great idea: Rather than editing a big blob of Rich Text, you create a StreamField with the content blocks that you want (paragraphs, images, videos) and let editors create a stream of these blocks containing the content on the page. For example, with a video block, editors don’t have to fiddle with video embed codes; they just insert the video URL or shortcode, and the block handles all of the templating. This makes complex page content possible without requiring site editors to know how to model a complex design in HTML directly. It makes for a much better, more structured, and smoother editing experience. (StreamFields are such a great idea that WordPress has recently launched a similar feature. Inspired by Wagtail?)

But there are some pain points for developers who work on large Wagtail projects. One of those is data migrations, particularly those that involve StreamFields. My informal survey of fellow developers yielded the following helpful comments:

  • “Wagtail migrations are evil.” —Eddie
  • “As a rule, avoid data migrations on streamfields at all costs.” —Neil
  • “It feels like we’re fighting the framework when we’re working programmatically with StreamField data, which is a real head-scratcher.” —name withheld

FWIW, I don’t think we’re exactly fighting the framework, so much as trying to do something that the framework hasn’t yet been optimized for. Wagtail has clearly been optimized to create a fantastic onboarding experience. And it’s really great. But it hasn’t yet been optimized for maintaining page data in an environment of shifting requirements. And so it’s currently really hard to do a data migration correctly.

The Caktus team was recently working on an existing Wagtail installation in which we were asked to migrate a page model from non-StreamFields to use a StreamField, giving editors greater flexibility and normalizing the page data. We were also asked, if possible, to migrate the existing pages’ data into the StreamField. That’s a pretty straightforward use case, and one that would seem to be a fairly common need: People start out their page models with regular ol’ fields, then they decide later (after building and publishing a bunch of pages!) that they want those pages to use StreamFields instead.

Considering all of this a worthy challenge, I rolled up my sleeves, dug in, and created a robust data migration for the project. It worked well, migrated all of the page and revision data successfully, and taught me a lot about Wagtail StreamFields.

At PyCon 2019, I hosted an open session on making Wagtail better for developers, and one of the things we talked about was data migrations (read more in my overview of PyCon 2019: “Be Quick or Eat Potatoes: A Newbie’s Guide to PyCon”). A couple of Wagtail core developers came to the session. I was pleased to learn that the method I used is essentially the same method that the Wagtail team has landed on as the best way to migrate StreamField data. So while this method isn’t yet officially supported in Wagtail, you heard it here first: This is currently the best way to do it.

The following section provides a worked example of the method I used. A repository containing all of the code in this example is available on GitHub @sharrisoncaktus/wagtail-data-migration-example.

Let’s dig in, shall we?

How I migrated a Wagtail page model with a StreamField

Start with an Existing Page Model

To illustrate the method I used, I’ll set up a simple page model with

  • a title (as always)
  • a placed image
  • a body
  • a list of documents (that can be displayed as a grid, for example)

The code for this page model looks like this (omitting all the scaffolding of imports etc.):

# First version of the model.

class ExamplePage(Page):
    image = models.ForeignKey(
        'wagtailimages.Image',
        null=True,
        blank=True,
        on_delete=models.SET_NULL,
        related_name='+'
    )
    body = RichTextField()
    docs = StreamField([
        ('doc', DocumentChooserBlock()),
    ])
    content_panels = Page.content_panels + [
        ImageChooserPanel('image'),
        FieldPanel('body'),
        StreamFieldPanel('docs'),
    ]


[Figure: Example Page: Starting Point]

Situation 1: Add New Fields to the Model without Moving or Renaming Anything

Now let’s suppose the customer wants to add pages to the docs block — they want to be able to display a link to a page in the grid alongside downloadable documents.

Here’s what the model looks like after adding a 'page' block to the 'docs' StreamField:

# Second version of the model: Added page block to the docs StreamField

class ExamplePage(Page):
    image = models.ForeignKey(
        'wagtailimages.Image',
        null=True,
        blank=True,
        on_delete=models.SET_NULL,
        related_name='+')
    body = RichTextField()
    docs = StreamField([
        ('doc', DocumentChooserBlock()),
        ('page', PageChooserBlock()),
    ])

    content_panels = Page.content_panels + [
        ImageChooserPanel('image'),
        FieldPanel('body'),
        StreamFieldPanel('docs'),
    ]

You can create and run this migration, no problem and no worries, because you haven’t moved or changed any existing data.

Rule 1: You can add fields to the model, and new blocks to StreamFields, with impunity — as long as you don’t move or rename anything.

Situation 2: Create Data Migrations to Move Existing Data

Some time later, the customer / site owner / editors have written and published a hundred pages using this model. Then the fateful day arrives: The customer / site owner / editors have enjoyed working with the docs field, and now want to move all the page content into a StreamField so that they can have a lot more flexibility about how they structure the content.

Does this sound familiar?

It’s not hard to write the new model definition.

# End result: The model after content has been migrated to a StreamField:

class ExamplePage(Page):
    content = StreamField([
        ('image', ImageChooserBlock()),
        ('text', RichTextBlock()),
        ('docs', StreamBlock([
            ('doc', DocumentChooserBlock()),
            ('page', PageChooserBlock()),
        ])),
    ])

    content_panels = Page.content_panels + [
        StreamFieldPanel('content'),
    ]

Now, it goes almost without saying: Do not create and run this migration. If you do, you will have a VERY angry customer, because you will have deleted all of their content data.

Instead, you need to break up your migration into several steps.

Rule 2: Split the migration into several steps and verify each before doing the next.

You’ll notice that I chose a different name for the new field — I didn’t, for example, name it “body,” which currently exists as a RichTextField. You want to avoid renaming fields, and you want to do things in an orderly way.

So, here are the steps of a Wagtail data migration.

Step 1: Add fields to the model without moving or renaming anything.

Here’s the non-destructive next version of the model.

# Data Migration Step 1: The model with the `content` StreamField added.

class ExamplePage(Page):
    # new content StreamField
    content = StreamField([
        ('image', ImageChooserBlock()),
        ('text', RichTextBlock()),
        ('docs', StreamBlock([
            ('doc', DocumentChooserBlock()),
            ('page', PageChooserBlock()),
        ])),
    ], blank=True, null=True)

    # old fields retained for now
    image = models.ForeignKey(
        'wagtailimages.Image',
        null=True,
        blank=True,
        on_delete=models.SET_NULL,
        related_name='+')
    body = RichTextField()
    docs = StreamField([
        ('doc', DocumentChooserBlock()),
        ('page', PageChooserBlock()),
    ])

    content_panels = Page.content_panels + [
        StreamFieldPanel('content'),

        # old panels retained for now
        ImageChooserPanel('image'),
        FieldPanel('body'),
        StreamFieldPanel('docs'),
    ]

The content field has to allow null values (null=True), because it’s going to be empty for all existing pages and revisions until we migrate the data.

Step 2: Create a data migration that maps / copies all the data from the old fields to the new fields, without modifying the existing fields. (Treat existing data as immutable at this point.)

This is the hard part, the fateful day, the prospect of which makes Wagtail devs run away screaming.

I’m here to encourage you: You can do it. Although this procedure is not well-documented or supported by Wagtail, it works reliably and well.

So, let’s do this. First you’ll create an empty migration:

$ python manage.py makemigrations APPNAME -n migrate_content_to_streamfield --empty

You’ll end up with an empty migration. For the “forward” migration, you’ll add a RunPython operation that copies all the content data from the existing fields to the new StreamField.

You can also create a “reverse” operation that undoes the changes, but I usually prevent reverse migrations — life is hard enough as it is. However, it’s up to you, and the same kind of procedure can work in reverse.

Here’s what things will look like so far:

from django.db import migrations


def copy_page_data_to_content_streamfield(apps, schema_editor):
    raise NotImplementedError("TODO")


def prevent_reverse_migration(apps, schema_editor):
    raise NotImplementedError(
        "This migration cannot be reversed without"
        + " inordinate expenditure of time. You can"
        + " `--fake` it if you know what you're doing,"
        + " and are a migration ninja."
    )


class Migration(migrations.Migration):
    dependencies = [
        ('home', '0005_add_content_streamfield'),
    ]
    operations = [
        migrations.RunPython(
            copy_page_data_to_content_streamfield,
            prevent_reverse_migration,
        ),
    ]

The copy_page_data_to_content_streamfield(…) function will copy all page and revision data from the existing fields to the new content StreamField. Here’s what it looks like:

import json
from importlib import import_module

from django.core.serializers.json import DjangoJSONEncoder


def copy_page_data_to_content_streamfield(apps, schema_editor):
    """Copy page and revision data into the new content StreamField."""
    # If the ExamplePage model no longer exists, there is nothing to migrate.
    try:
        ExamplePage = import_module('home.models').ExamplePage
    except (ImportError, AttributeError):
        return
    for page in ExamplePage.objects.all():
        page_data = json.loads(page.to_json())
        content_data = page_data_to_content_streamfield_data(page_data)
        if content_data != page.content.stream_data:
            page.content.stream_data = content_data
            page.save()
        for revision in page.revisions.all():
            revision_data = json.loads(revision.content_json)
            content_data = page_data_to_content_streamfield_data(revision_data)
            # StreamField data is stored in revision.content_json as a JSON string,
            # so compare and store the encoded form.
            content_json = json.dumps(content_data)
            if content_json != revision_data.get('content'):
                revision_data['content'] = content_json
                revision.content_json = json.dumps(
                    revision_data, cls=DjangoJSONEncoder)
                revision.save()

There are several things to notice here:

  • We’re importing the ExamplePage definition from home.models rather than via apps.get_model(). This allows us to use the ExamplePage.to_json() method. We have to import the model during the migration using importlib so that future model changes don’t break the migration. (Never import from the app’s models at the module-level of a migration.) We also need to put the import into a try/except block, in case the model is deleted in the future.
  • Using page.to_json() puts the page_data into the same form as the page revision data, which makes it much easier to do a data migration (one function for both page data and revision data).
  • We’re using regular Python data structures – dicts, lists, etc. This turns out to be a lot easier than trying to build StreamValues directly.
  • We’re using the same helper function, page_data_to_content_streamfield_data(…) (which we haven’t yet created) for both the page data and all revisions data. (We’ll develop this function next.) We can use the same helper function for page data and revisions data because the data structures are the same when represented using Python data structures.
  • The content data in revisions is stored in a JSON string. No problem. We just use json.loads() and json.dumps() with the DjangoJSONEncoder (DjangoJSONEncoder is not entirely necessary here because we don’t have any date or datetime fields in this model, but it’s a good practice to use it in Django projects).
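To make the last point concrete, here is a minimal sketch of that double encoding, using only the standard library. The sample revision payload and the helper name `set_revision_content` are assumptions for illustration, and plain `json.dumps` stands in for the `DjangoJSONEncoder` call so that the snippet runs without Django.

```python
import json

# Hypothetical revision payload: the outer document is JSON, and the
# StreamField value inside it is itself a JSON-encoded string.
revision_content_json = json.dumps({
    "title": "Example",
    "body": "<p>Hello</p>",
    "docs": json.dumps([{"type": "doc", "value": 1, "id": "abc"}]),
})


def set_revision_content(content_json, content_data):
    """Return a new revision JSON string with 'content' set to content_data."""
    revision_data = json.loads(content_json)
    # Encode the stream data to a string first, then encode the whole payload.
    revision_data["content"] = json.dumps(content_data)
    return json.dumps(revision_data)


updated_json = set_revision_content(
    revision_content_json,
    [{"type": "text", "value": "<p>Hello</p>"}],
)
updated = json.loads(updated_json)

# The nested value is still a string until it is decoded a second time.
assert isinstance(updated["content"], str)
assert json.loads(updated["content"]) == [{"type": "text", "value": "<p>Hello</p>"}]
```

Forgetting the inner `json.dumps` would store a list where a string is expected, which is an easy mistake to make when page data and revision data look so similar.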

Next, we need to implement the page_data_to_content_streamfield_data() function. This function takes a Python dict as its only argument, representing either the page data or a revision’s data, and returns a Python list, representing the data to be placed in the new content StreamField. It’s a pure function, with no side-effects, and that means it doesn’t mutate the page or revision data (which is only a copy anyway).

To build this function, it’s helpful to start with the definition of the content StreamField, and use it to build a Python data structure that contains the existing data. Here is the content StreamField definition again:

content = StreamField(
    [
        ('image', ImageChooserBlock()),
        ('text', RichTextBlock()),
        ('docs',
         StreamBlock([
             ('doc', DocumentChooserBlock()),
             ('page', PageChooserBlock()),
         ])),
    ],
    blank=True,
    null=True,
)

StreamField definitions use a list of tuples, but the stream_data that we’re building uses a list of dicts, which will look like this:

def page_data_to_content_streamfield_data(page_data):
    """With the given page field data, build and return content stream_data:
    * Copy existing page data into new stream_data.
    * Handle either the main page data or any revision data.
    * page_data is unchanged! (treated as immutable).
    """
    content_data = [
        {'type': 'image', 'value': ...},
        {'type': 'text', 'value': ...},
        {'type': 'docs', 'value': [...]},
    ]
    return content_data

We need to fill in the values for each 'value' field. The 'image' and 'text' are easy: We just need to copy in the 'image' and the 'body' values from the page_data.

The 'docs' value is going to be a little harder, but not much! All we have to do is take the stream_data from the existing ‘docs’ field. Since ‘docs’ is a StreamField, it is stored as a string in the page_data that comes from JSON. When loaded, that field looks like this, a typical stream_data value:

[
  {
    "type": "doc",
    "value": 1,
    "id": "91229f94-7aab-4711-ab47-c07cd71461a7"
  },
  {
    "type": "page",
    "value": 3,
    "id": "cc1d77e3-bc35-4f74-97f0-e00645692004"
  }
]

We’re simply going to load the json value and copy over the data, filtering out the ‘id’ fields (Wagtail will assign new ones for us).
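That filtering step can be tried on its own before it goes into the helper. The sketch below uses the sample value shown above; the function name `strip_block_ids` is hypothetical.

```python
import json

# The 'docs' value as it appears in page_data: a JSON-encoded string.
docs_json = json.dumps([
    {"type": "doc", "value": 1, "id": "91229f94-7aab-4711-ab47-c07cd71461a7"},
    {"type": "page", "value": 3, "id": "cc1d77e3-bc35-4f74-97f0-e00645692004"},
])


def strip_block_ids(stream_json):
    """Decode a stream_data JSON string and keep only 'type' and 'value'."""
    return [
        {key: block[key] for key in ("type", "value")}
        for block in json.loads(stream_json)
    ]


docs_blocks = strip_block_ids(docs_json)
print(docs_blocks)
# [{'type': 'doc', 'value': 1}, {'type': 'page', 'value': 3}]
```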

Here’s what the final version of the function looks like:

def page_data_to_content_streamfield_data(page_data):
    """With the given page field data, build and return content stream_data:
    * Copy existing page data into new stream_data.
    * Handle either the main page data or any revision data.
    * page_data is unchanged! (treated as immutable).
    """
    return [
        {'type': 'image', 'value': page_data['image']},
        {'type': 'text', 'value': page_data['body']},
        {'type': 'docs', 'value': [
            {key: block_data[key] for key in ['type', 'value']}  # no 'id'
            for block_data in json.loads(page_data['docs'])
        ]},
    ]

That’s it! We’re just mapping the existing data to the new data structure, without changing any of the existing data. It reminds me a little bit of using XSLT to declare a transformation from one data schema to another.
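As a quick check, the helper can be exercised outside of a migration with hand-built data. The sample `page_data` below is an assumption about what `page.to_json()` might produce for one page (an image ID, a rich-text string, and the `docs` stream as a JSON string); a real payload contains many more keys.

```python
import copy
import json


def page_data_to_content_streamfield_data(page_data):
    """Same helper as above, repeated so this snippet runs on its own."""
    return [
        {'type': 'image', 'value': page_data['image']},
        {'type': 'text', 'value': page_data['body']},
        {'type': 'docs', 'value': [
            {key: block_data[key] for key in ['type', 'value']}  # no 'id'
            for block_data in json.loads(page_data['docs'])
        ]},
    ]


# Hypothetical sample data for one page.
sample_page_data = {
    'title': 'Example',
    'image': 7,
    'body': '<p>Hello</p>',
    'docs': json.dumps([{'type': 'doc', 'value': 1, 'id': 'abc'}]),
}
original = copy.deepcopy(sample_page_data)

content_data = page_data_to_content_streamfield_data(sample_page_data)

assert content_data == [
    {'type': 'image', 'value': 7},
    {'type': 'text', 'value': '<p>Hello</p>'},
    {'type': 'docs', 'value': [{'type': 'doc', 'value': 1}]},
]
# The input is left untouched, so the old fields remain the source of truth.
assert sample_page_data == original
```

Because the function only reads from a dict and returns a list, it can be unit-tested like this without a database or a Wagtail install.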

The complete migrations can be seen on GitHub.

Now we can run the data migration! When we do so, we see all of the existing page data populating the page content field.

[Figure: Example Page with Added Fields and Data Migration Applied (Steps 1 & 2)]

Step 3: Deploy the migration and let editors review everything, making sure that all the data was correctly copied.
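Manual review can be backed by an automated check. The following is a rough sketch and not part of the original migration: it assumes page data shaped like the `page.to_json()` output used above, with the new `content` value stored as a JSON string, and the function names `expected_content`, `without_ids`, and `find_mismatched_pages` are made up for illustration.

```python
import json


def expected_content(page_data):
    """Rebuild the content blocks that the old fields should have produced."""
    return [
        {'type': 'image', 'value': page_data['image']},
        {'type': 'text', 'value': page_data['body']},
        {'type': 'docs', 'value': [
            {'type': block['type'], 'value': block['value']}
            for block in json.loads(page_data['docs'])
        ]},
    ]


def without_ids(blocks):
    """Drop 'id' keys; assumes any list value is a nested stream of blocks."""
    cleaned = []
    for block in blocks:
        value = block['value']
        if isinstance(value, list):
            value = without_ids(value)
        cleaned.append({'type': block['type'], 'value': value})
    return cleaned


def find_mismatched_pages(pages_data):
    """Return the titles of pages whose content differs from the old fields."""
    mismatched = []
    for page_data in pages_data:
        actual = without_ids(json.loads(page_data.get('content') or '[]'))
        if actual != expected_content(page_data):
            mismatched.append(page_data['title'])
    return mismatched


# Hypothetical data: one page that was migrated and one that was not.
good = {
    'title': 'Migrated', 'image': 7, 'body': '<p>Hi</p>',
    'docs': json.dumps([{'type': 'doc', 'value': 1, 'id': 'a'}]),
    'content': json.dumps([
        {'type': 'image', 'value': 7, 'id': 'x'},
        {'type': 'text', 'value': '<p>Hi</p>', 'id': 'y'},
        {'type': 'docs', 'id': 'w',
         'value': [{'type': 'doc', 'value': 1, 'id': 'z'}]},
    ]),
}
bad = dict(good, title='Not migrated', content=None)

print(find_mismatched_pages([good, bad]))
# ['Not migrated']
```

A check like this only stays valid until editors start changing the new field, so it is most useful immediately after the data migration has run.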

Step 4: Switch the site templates / API to the new fields. By making this a separate step before deleting the old data, we make sure that we haven’t missed anything before we pass the point of no return. (As our CEO and Co-founder, Tobias McNulty, pointed out while reviewing this post: “Extra reviews never hurt; plus, you'll have no way to revert if the new templates introduce some non-trivial breaking changes (and you've already deleted your model fields).”)

It’s a good idea not to delete any production data until the customer / site owner / editors are satisfied. So we deploy the site at this point and wait for them to be satisfied that the old data has migrated to the new fields, and that the site templates / API are correctly using the new fields.

Step 5: Create a final migration that deletes the old data, and deploy it with updated templates that use the new fields. This is the point of no return.

Now your model can finally look like the “end result” above.

# End result: The model after content has been migrated to a StreamField:

class ExamplePage(Page):
    content = StreamField([
        ('image', ImageChooserBlock()),
        ('text', RichTextBlock()),
        ('docs', StreamBlock([
            ('doc', DocumentChooserBlock()),
            ('page', PageChooserBlock()),
        ])),
    ])

    content_panels = Page.content_panels + [
        StreamFieldPanel('content'),
    ]

Creating this final migration is super easy: Just delete all the old fields and content_panels from the model, let Django create the migration, and apply it. We’ve removed blank=True, null=True from the content field definition because, now that the data migration has been applied, every instance of the content field should be non-null. (Django’s makemigrations will ask you what to do about existing null rows. There shouldn’t be any, so you can choose 2) ignore for now… when makemigrations prompts you. Or you can just leave the null=True parameter on the content field.)

[Figure: Example Page in Final Form, with Old Fields Removed (Step 5)]

Summary: Rules for Wagtail Data Migrations

  1. You can add fields to the model, and new blocks to StreamFields, with impunity — as long as you don’t move or rename anything.
  2. If you are moving or renaming data, split the migration into several steps:
    • Step 1: add new fields that will contain the post-migration data
    • Step 2: create a data migration that maps / copies all the data from the old fields to the new fields. Do this without modifying the existing fields. (Treat existing data as immutable at this point.)
    • Step 3: you might want to pause here and let the editors review everything before changing the templates to use the new fields.
    • Step 4: switch the site templates / API to the new fields.
    • Step 5: once the editors are happy, you can create a migration that deletes the old fields from the model.
  3. Data migrations involving StreamFields are best done by writing directly to the stream_data property of the StreamField. This method:
    • allows the use of a json-able dict (Python-native data structure), which is a lot easier than trying to build the StreamValue using Wagtail data structures.
    • allows using the same function for both the page data and the page revisions data, keeping things sane.
    • is not officially supported by Wagtail, but can be said to be sanctioned by at least a couple of Wagtail core developers.

The repository containing the worked example is available on GitHub.

Migrate with Confidence

There’s no question that migrating page data involving Wagtail StreamFields is an involved process, but it doesn’t have to be scary. By doing things in distinct stages and following the methods outlined here, you can migrate your data with security and confidence.
