Database Migration Strategies

Introduction
Database schema management is essential as projects evolve over time.
A challenge arises when the schema, originally designed to accommodate a certain set of requirements, needs to change. These changes can include the introduction of new tables, columns, views, or functions, as well as the modification of existing ones.
Even though these schema changes are part of the natural progression of any database-driven application, they need to be managed carefully to ensure that the application's functionality remains consistent and reliable.
While this article has been written with Python and PostgreSQL in mind, the concepts described likely apply to most combinations of SQL databases and general-purpose programming languages out there.
Traditional Migration Approaches
Traditionally, frameworks such as Django provide tools that allow for database access by describing tables as classes, with instances of those classes mapping almost directly onto rows of the table.
Modifications to the underlying data model can then be made by changing those classes and running a tool that automatically translates them into a migration script, which, when applied, emits the appropriate DDL for the database.
Django
In Django’s approach, the initial draft for the migration is generated by the tool and described in a very high-level manner — focusing fundamentally on changes to the Python model of the data, not on the database table itself. Based on that, the migration script dynamically generates the DDL to move from one state to another (forwards and backwards) at apply time.
# Example of Django migration
# Original url: https://docs.djangoproject.com/en/5.0/topics/migrations/#migration-files
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("migrations", "0001_initial")]
    operations = [
        migrations.DeleteModel("Tribble"),
        migrations.AddField("Author", "rating", models.IntegerField(default=0)),
    ]
However, this ORM-based approach has limitations. It treats the database as little more than a passive storage system, when in fact SQL databases are powerful engines capable of complex data modelling and processing.
SQLAlchemy & Alembic
Python developers feeling this way often move to the SQLAlchemy SQL toolkit, which, at the cost of abstracting away slightly fewer details, offers access to both levels: the ORM and direct SQL constructs. As a subproject, SQLAlchemy offers Alembic for schema management:
# Example of Alembic migration
# Original url: https://alembic.sqlalchemy.org/en/latest/tutorial.html#create-a-migration-script
revision = 'ae1027a6acf'
down_revision = '1975ea83b712'
branch_labels = None

from alembic import op
import sqlalchemy as sa


def upgrade():
    op.create_table(
        'account',
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('name', sa.String(50), nullable=False),
        sa.Column('description', sa.Unicode(200)),
    )


def downgrade():
    op.drop_table('account')
While this approach certainly gives us more control over what's being done in the database, I've often found myself writing SQL directly and then trying to find a way to express it in Alembic's terms — only to have Alembic regenerate the SQL again.
The Raw Approach
In essence, a migration system is not a complex thing. It requires:
an ordered list of migration scripts
a list of which migrations have already been applied
a script to check and apply only pending migrations
Based on this idea, I've decided to create a simple script to handle migrations written in direct SQL:
a simple folder that contains a series of SQL files; each file is named following the pattern <timestamp>_<summary>.sql, so scripts can be executed in creation order (based on the timestamp) and easily found (based on the summary). I've also avoided the up/down migration approach, as I feel rolling back changes is rarely safe and an up/down mechanism can easily destroy data
each application environment contains a table which stores which migrations have been previously applied:
create table migrations (
id bigint primary key not null,
applied_on timestamp with time zone not null default CURRENT_TIMESTAMP
);
create index id on migrations using hash (id);
a script to automate the process:
$ migrations --help
Usage: migrations [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  apply
  check
  init
  new
$
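The article does not show the script's implementation, but its command surface is small enough that a skeleton fits in a few lines. A hypothetical sketch using argparse (the real script may well use a different CLI library):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="migrations")
    sub = parser.add_subparsers(dest="command", required=True)
    # One subcommand per operation shown in the help output above.
    sub.add_parser("init", help="create the migrations bookkeeping table")
    sub.add_parser("new", help="create a new timestamped migration file")
    sub.add_parser("check", help="list migrations not yet applied")
    sub.add_parser("apply", help="apply all pending migrations")
    return parser
```

Each subcommand then dispatches to a function of a few lines: `check` runs the pending-migrations query, `apply` pipes the pending files into psql.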
Checking which migrations need to be applied
To determine which migrations need to be applied, no more than a simple query is needed:
select
    id
from
    (values
        (:timestamp_1),
        (:timestamp_2),
        ...
        (:timestamp_n)
    ) v (id)
    left join migrations m using (id)
where
    m.id is null
order by id;
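The query takes one literal row per local migration file, so it has to be built dynamically. A sketch of how that could look in Python, with one placeholder per timestamp (the function name and `%s` placeholder style are assumptions):

```python
def pending_query(n: int) -> str:
    """Build the pending-migrations query for n local files."""
    # One (%s) placeholder per local migration timestamp; the driver
    # substitutes the actual values safely at execution time.
    rows = ",\n        ".join("(%s)" for _ in range(n))
    return (
        "select id\n"
        "from (values\n"
        f"        {rows}\n"
        "     ) v (id)\n"
        "left join migrations m using (id)\n"
        "where m.id is null\n"
        "order by id;"
    )
```

The left join keeps only timestamps with no matching row in `migrations`, which is exactly the set of scripts still to be applied.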
Applying migrations
Applying migrations is easy: just pipe them into psql:
$ psql -q -e -1 -v ON_ERROR_STOP=1
where,
psql: the command-line interface to interact with PostgreSQL
-q: "quiet" mode; it suppresses the printing of the welcome message, headers, and footers in the output
-e: echoes the queries that psql executes to the standard output; it helps you see the exact query that is being run, which can be beneficial for debugging
-1 or --single-transaction: wraps all the SQL commands that are run inside a single transaction. It's equivalent to issuing "BEGIN" before the first command and "COMMIT" after the last command, provided there are no errors. If there is an error and `ON_ERROR_STOP` is set, as in this command, it issues a "ROLLBACK" instead
-v ON_ERROR_STOP=1: sets the ON_ERROR_STOP variable to 1, telling psql to stop execution immediately if a SQL command results in an error
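This step can be driven from Python with subprocess. The sketch below is an assumption about how the apply step might be wired up (the article only shows the psql invocation itself); it appends the bookkeeping insert to the script so that, thanks to -1, the migration and its record commit or roll back together:

```python
import subprocess
from pathlib import Path

# The exact psql invocation from the article.
PSQL = ["psql", "-q", "-e", "-1", "-v", "ON_ERROR_STOP=1"]


def migration_sql(script: str, timestamp: int) -> str:
    # Record the migration in the same transaction that applies it.
    return script + f"\ninsert into migrations (id) values ({timestamp});\n"


def apply_migration(path: Path, timestamp: int) -> None:
    # Pipe the combined SQL into psql; any error stops execution
    # and rolls the whole file back.
    subprocess.run(
        PSQL,
        input=migration_sql(path.read_text(), timestamp),
        text=True,
        check=True,
    )
```

Connection details (host, database, user) are left to the usual PGHOST/PGDATABASE environment variables that psql already honours.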
Sustainability — Handling Migrations Bloat
Over time, the number of migration scripts will grow; that's just the nature of modern, ever-evolving software projects. For this reason, this method treats individual scripts as naturally ephemeral: they are removed from the repository once they've been applied to all active environments.
For new installations, a master schema is kept by creating a dump (with pg_dump) of the current considered-to-be-correct schema, which is included in the same directory.
This way we can easily install a new instance of the application, while maintaining only a short term backlog of applicable schema changes.
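Producing the master schema dump can also be scripted. A minimal sketch, assuming pg_dump is on the PATH (the helper name and flag selection are choices of this example, not the article's):

```python
import subprocess


def pg_dump_command(dsn: str, out_path: str) -> list[str]:
    # --schema-only emits DDL without table data; --no-owner keeps the
    # dump portable across environments with different role names.
    return ["pg_dump", "--schema-only", "--no-owner", "-f", out_path, dsn]


def dump_master_schema(dsn: str, out_path: str) -> None:
    subprocess.run(pg_dump_command(dsn, out_path), check=True)
```

Running this against an environment that has all current migrations applied refreshes the master schema in one step.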
Limitations
This approach is certainly not without its problems.
First, it does maintain some information duplication:
The master schema, created from the dump.
The Python definitions of the schema, for the benefit of code inspections and correct data type conversion.
The migration scripts.
Also, in my experience, as with any code we write, migration scripts carry their fair share of bugs. Some leeway in SQL allows for schema divergence, as with constraint names and column ordering, for example.
In addition, programming mistakes can cause the actual schema not to match what is specified in the Python code — not to mention the duplicated work of keeping both specifications in sync.
Some Ideas for the Future
If you consider the application repository and the version control system history, the schema migration scripts and the schema dumps do carry some duplicate information. At the same time, they both need to exist, as one describes how to go from A to B, while the other specifies unequivocally where B is.
But if we use some automation, we can use one to help build the other, reducing the amount of errors. To go from migration to master, we can design a migration, apply it to a testing environment and dump the testing environment state — which is mostly what I do today.
But AI technologies such as GPT-4 suggest another approach that could offer interesting results: by showing an LLM the git diff between two schemas, and possibly commenting on it, we could get a draft of what a migration between them could look like. This would not be a rule-based approach, as done by Django or Alembic, but one that adds semantic considerations about how data is migrated between the two states. The draft would then be reviewed and adjusted by developers to suit the exact requirements, combining the benefits of automation with human expertise.
Conclusion
This article documents a novel approach to managing complex database migrations — a method that has been in use for a few years now. The aim is to share this approach's history and rationale not only with my team, but with the wider community, for collective improvement in handling similar challenges.