The Cost of a Bad Database Migration
In fifteen years of shipping backend systems, nothing has aged me faster than bad database migrations. Not outages, not scaling emergencies, not even production hotfixes at 2am. Those feel urgent in the moment but resolve cleanly. A bad migration is different — it's a slow poison that corrupts your data layer, blocks deploys, and makes every subsequent change harder.
I've watched teams lose entire weekends to a migration that "looked simple." I've personally shipped a column rename that took down a payment service for forty minutes. These aren't hypothetical — they're scars. This post is the playbook I wish I'd had.
The Real Cost
Before we talk patterns, let's talk consequences. The incidents I've been involved in or seen firsthand across teams share a shape: hours of engineering time burned on recovery, deploys frozen while the damage is assessed, and in the worst cases, lost revenue and corrupted rows that take weeks to fully reconcile.
And those are the measurable costs. The unmeasurable ones — lost trust from your team, the deploy freeze that slows everyone down for a week, the lingering fear every time someone opens a migration file — those compound.
The Five Migration Anti-Patterns
Most migration disasters follow one of five recurring patterns, each with its own war story and its own fix.
Safe vs. Dangerous: The Migration Timeline
Here's how the same schema change, adding a NOT NULL column with a default, looks when done dangerously versus safely, step by step.
ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending';
On Postgres versions before 11, this statement rewrites the entire table while holding an ACCESS EXCLUSIVE lock, blocking all reads and writes for the duration. On Postgres 11+, a constant default is stored in the catalog, so the rewrite (and the NOT NULL validation scan) is skipped, but the statement still takes an ACCESS EXCLUSIVE lock, and that lock queues behind any long-running transaction on the table, stalling every query behind it while it waits.
On a 50M-row table on an older Postgres, the rewrite means an estimated 3-8 minutes of write blocking. Your order creation endpoint returns 500s. Customers see failed checkouts.
If anything goes wrong during the lock, you have to wait for it to complete or kill the migration — which may leave the schema in a partial state. Rolling back the code deploy doesn't help because the column now exists and old code doesn't know about it.
By adding the column as nullable with a constant default, Postgres stores the default in the catalog without rewriting existing rows. Zero downtime. Old code that doesn't know about this column works fine.
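In SQL, the safe version of step one simply drops the NOT NULL from the dangerous statement above (table and column names follow the running example):

```sql
-- Safe: nullable column with a constant default.
-- On Postgres 11+ the default is stored in the catalog: no rewrite, no scan.
ALTER TABLE orders ADD COLUMN status VARCHAR(20) DEFAULT 'pending';
```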
UPDATE orders SET status = 'pending' WHERE status IS NULL AND id BETWEEN $1 AND $2;
Run this in a script, not a migration. Use batches of 5K-10K rows with short pauses to avoid overwhelming WAL and replication. Monitor pg_stat_activity for lock contention during each batch.
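One way to drive those batches entirely inside the database is a plpgsql procedure; this is a sketch assuming Postgres 11+, which allows COMMIT inside procedures (the procedure name is mine, the table and column are from the running example):

```sql
CREATE PROCEDURE backfill_order_status()
LANGUAGE plpgsql
AS $$
DECLARE
  rows_updated bigint;
BEGIN
  LOOP
    -- Grab up to 5K unfilled rows per batch instead of fixed id ranges
    UPDATE orders SET status = 'pending'
    WHERE id IN (
      SELECT id FROM orders
      WHERE status IS NULL
      LIMIT 5000
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;                 -- release row locks between batches
    PERFORM pg_sleep(0.5);  -- brief pause so WAL and replication can catch up
  END LOOP;
END;
$$;

CALL backfill_order_status();
```

The id-range variant shown in the UPDATE above works just as well; the point either way is small transactions with pauses between them.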
The NOT VALID flag tells Postgres to enforce the constraint on new writes only, without scanning existing rows. This is instant and non-blocking. New inserts/updates must satisfy the constraint.
This lock allows concurrent reads AND writes — only blocking DDL and vacuum. The validation scans rows but doesn't modify them. On a 50M-row table, this takes seconds to a couple minutes but with zero write blocking.
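The two-step constraint in SQL looks like this (the constraint name is mine):

```sql
-- Step 1: instant; enforced for new writes only
ALTER TABLE orders
  ADD CONSTRAINT orders_status_not_null
  CHECK (status IS NOT NULL) NOT VALID;

-- Step 2: scans existing rows under SHARE UPDATE EXCLUSIVE;
-- concurrent reads and writes proceed normally
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;
```

On Postgres 12+, a validated CHECK like this lets a subsequent ALTER COLUMN ... SET NOT NULL skip its own table scan, after which the CHECK constraint can be dropped.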
At any point in this sequence, you can stop and roll back. Column added? Drop it. Backfill partial? No harm. Constraint added NOT VALID? Drop it. Each step is independently reversible. This is the expand-contract pattern in practice.
Postgres Lock Types: A Cheat Sheet
Most migration disasters come from not understanding which DDL operations acquire which locks. Here's the critical subset:
ALTER TABLE ... ADD COLUMN ... NOT NULL: ACCESS EXCLUSIVE. Instant on Postgres 11+ with a constant default; a full table rewrite before that.
ALTER TABLE ... RENAME COLUMN: ACCESS EXCLUSIVE. Metadata-only and fast, but any running code still using the old name breaks immediately.
ALTER TABLE ... SET DATA TYPE: ACCESS EXCLUSIVE. Usually rewrites the whole table; a few conversions, like widening a VARCHAR, are metadata-only.
ALTER TABLE ... ADD FOREIGN KEY: SHARE ROW EXCLUSIVE on both tables. Scans the referencing table to validate unless added NOT VALID.
ALTER TABLE ... ADD COLUMN (nullable): ACCESS EXCLUSIVE, but metadata-only and effectively instant.
ADD CONSTRAINT ... NOT VALID: brief lock, no scan of existing rows (SHARE ROW EXCLUSIVE for foreign keys, ACCESS EXCLUSIVE for CHECK constraints).
VALIDATE CONSTRAINT: SHARE UPDATE EXCLUSIVE. Scans the table but allows concurrent reads and writes.
ALTER TABLE ... SET DEFAULT: ACCESS EXCLUSIVE, but metadata-only and effectively instant.
The rule of thumb: if an operation takes an ACCESS EXCLUSIVE or SHARE ROW EXCLUSIVE lock on a table with more than 1M rows or any concurrent write traffic, it needs the expand-contract treatment.
The Expand-Contract Pattern
This is the single most important pattern for zero-downtime migrations. Every safe migration I've ever run follows this structure:
Phase 1: Expand
Add new schema elements alongside existing ones. The old code continues working because nothing it depends on has changed. New columns are nullable, new tables are empty, new constraints are NOT VALID.
Phase 2: Migrate
Deploy code that writes to both old and new schema. Backfill historical data in batches. Run validation queries to confirm data integrity. This phase might span multiple deploys.
Phase 3: Contract
Once all code reads from the new schema and writes to the new schema, remove the old schema elements. But not immediately — wait a full release cycle. Run a check for any remaining references in application code, background jobs, and analytics queries.
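As a concrete sketch, here is the pattern applied to a column rename at the database level. The column names are hypothetical, the trigger stands in for (or backs up) dual-writes in application code, and EXECUTE FUNCTION assumes Postgres 11+ (older versions use EXECUTE PROCEDURE):

```sql
-- Expand: new column alongside the old one (nullable, effectively instant)
ALTER TABLE orders ADD COLUMN customer_ref BIGINT;

-- Migrate: mirror every write of the old column into the new one
CREATE FUNCTION orders_sync_customer_ref() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  NEW.customer_ref := NEW.customer_id;
  RETURN NEW;
END;
$$;

CREATE TRIGGER orders_customer_ref_sync
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION orders_sync_customer_ref();

-- (backfill customer_ref in batches here, then move readers over)

-- Contract, a full release cycle later, once nothing reads customer_id:
-- DROP TRIGGER orders_customer_ref_sync ON orders;
-- ALTER TABLE orders DROP COLUMN customer_id;
```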
Postgres-Specific Tips
Some hard-won lessons specific to Postgres that you won't find in most "safe migration" guides:
Always build indexes on live tables with CREATE INDEX CONCURRENTLY, and remember that it fails with ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block. Most migration frameworks wrap each migration in a transaction by default, so you need to explicitly opt out: in Knex, use knex.schema.raw() outside the transaction; in Django, set atomic = False on the migration; in Flyway, use non-transactional mode.
1. Check pg_stat_activity Before Migrating
Before running any DDL, check for long-running transactions. A migration that needs an ACCESS EXCLUSIVE lock will queue behind any open transaction on the same table — and block all subsequent transactions while it waits. One idle-in-transaction session can turn a 50ms DDL into a 10-minute outage.
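A query along these lines surfaces the dangerous sessions (these pg_stat_activity columns exist on modern Postgres; the one-minute threshold is an arbitrary example):

```sql
-- Long-open transactions that a pending ACCESS EXCLUSIVE lock would queue behind
SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 minute'
ORDER BY xact_start;
```

Watch in particular for state = 'idle in transaction': those sessions hold locks while doing nothing.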
2. Use Advisory Locks for Backfill Scripts
When running batch backfills, use pg_advisory_lock to ensure only one instance runs at a time. If your deploy spawns multiple replicas that all try to backfill simultaneously, you'll hammer the database with overlapping updates and WAL throughput spikes.
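A minimal guard looks like this; hashing an arbitrary key string keeps the lock ID stable across instances (the key name is mine):

```sql
-- Returns true for exactly one caller; every other instance should exit.
SELECT pg_try_advisory_lock(hashtext('orders_status_backfill')) AS got_lock;

-- ... run the backfill ...

SELECT pg_advisory_unlock(hashtext('orders_status_backfill'));
```

Session-level advisory locks are released automatically if the connection drops, so a crashed backfill can't wedge the lock forever.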
3. Monitor Replication Lag
Large migrations generate significant WAL traffic. If you have read replicas, monitor replication lag during and after migrations. A backfill that generates 10GB of WAL can push a replica minutes behind, causing stale reads in any service using the replica.
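Two quick checks, assuming Postgres 10+ function naming:

```sql
-- On the primary: how far behind each replica is, in bytes of WAL
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On a replica: wall-clock delay since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```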
4. Set statement_timeout on Migration Connections
A runaway migration that holds a lock for 30 minutes is worse than a migration that fails after 5. Set statement_timeout = '60s' on your migration database connection. If a single statement takes longer than that, something is wrong and it's better to fail loud.
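In practice that's one statement at the start of the migration session; lock_timeout is a companion setting worth pairing with it, since the lock queue is usually where the real damage happens:

```sql
-- Run at the start of the migration session
SET statement_timeout = '60s';  -- abort any single statement that runs too long
SET lock_timeout = '5s';        -- give up fast if a lock can't be acquired
```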
The Migration Checklist
I run through a checklist before every production migration; it's saved me more than once. One entry worth spelling out: increasing the length of a VARCHAR column (or changing it to TEXT) doesn't rewrite the table. It only updates catalog metadata. This is one of those cases where the "safe" thing is also the simple thing. Decreasing the length, however, requires a full table rewrite, so know the difference.
The Rollback Question
Every migration review should start with: "How do we roll this back?" If the answer is "we can't," that's not automatically disqualifying — but it means you need a higher bar of confidence before deploying.
For truly irreversible operations (dropping a column that's confirmed unused), I follow a three-step gate:
Step 1: Soft-delete the column — rename it to _deprecated_columnname and deploy. Wait one full release cycle.
Step 2: Audit all queries touching that table — check pg_stat_statements for any reference to the old name. Check background jobs and analytics queries.
Step 3: Only after both steps pass, drop the renamed column in a separate migration.
Three deploys for one column drop feels excessive until the day it saves you.
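The three-step gate in SQL, with a hypothetical column name; the audit query assumes the pg_stat_statements extension is enabled:

```sql
-- Step 1 (deploy N): soft-delete via rename
ALTER TABLE orders RENAME COLUMN legacy_discount TO _deprecated_legacy_discount;

-- Step 2 (between deploys): audit for stragglers still using the old name
SELECT query
FROM pg_stat_statements
WHERE query ILIKE '%legacy_discount%';

-- Step 3 (deploy N+2, only if the audit is clean):
ALTER TABLE orders DROP COLUMN _deprecated_legacy_discount;
```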
Tooling That Helps
A few tools I've found genuinely useful for migration safety:
strong_migrations (Ruby) and django-pg-zero-downtime-migrations (Python) — these will actively reject unsafe migrations and suggest safe alternatives. If you're in one of these ecosystems, install them immediately.
squawk — a Postgres migration linter. Run it in CI to catch unsafe DDL before it hits production. It checks for missing CONCURRENTLY, non-validated constraints, and other footguns.
pgroll — newer tool from Xata that automates the expand-contract pattern. It manages dual-write versioning at the database level. Worth evaluating if you run many migrations.
Finally, read the chapter on explicit locking in the Postgres manual; it's the most important single page in the Postgres documentation.
Closing Thought
Database migrations are one of those rare areas where being "boring" is a feature. The teams that do migrations well aren't the ones with clever tooling — they're the ones with a culture of caution. They review every migration like a code review. They test against production-sized datasets. They ask "what if we need to roll this back at 3am?"
The cost of a bad migration isn't just the downtime. It's the trust erosion — from your team, from your users, from yourself. Treat your schema changes like the production changes they are, and you'll sleep a lot better.
By Pardeep Dhingra