28 Mar 2026 Engineering 10 min read

The Cost of a Bad Database Migration

By Pardeep Dhingra

In fifteen years of shipping backend systems, nothing has aged me faster than bad database migrations. Not outages, not scaling emergencies, not even production hotfixes at 2am. Those feel urgent in the moment but resolve cleanly. A bad migration is different — it's a slow poison that corrupts your data layer, blocks deploys, and makes every subsequent change harder.

I've watched teams lose entire weekends to a migration that "looked simple." I've personally shipped a column rename that took down a payment service for forty minutes. These aren't hypothetical — they're scars. This post is the playbook I wish I'd had.

The Real Cost: By the Numbers

Before we talk patterns, let's talk consequences. These numbers are composites from incidents I've been involved in or seen firsthand across teams:

- Average downtime per bad migration (minutes)
- Average recovery time (engineer-hours)
- Dependent services typically affected

And those are the measurable costs. The unmeasurable ones — lost trust from your team, the deploy freeze that slows everyone down for a week, the lingering fear every time someone opens a migration file — those compound.

The Five Migration Anti-Patterns

Most migration disasters follow one of five patterns. For each, here's the war story and the fix:

1. The Big-Bang Rename

What happens: You rename a column in a single migration — ALTER TABLE users RENAME COLUMN name TO full_name. Looks clean. Deploys fine in staging. In production, your app servers are still running old code that references name. Every query crashes for the duration of the rolling deploy.

Why it hurts: Postgres acquires an ACCESS EXCLUSIVE lock on RENAME. On a table with any traffic, this blocks reads and writes. Combined with a rolling deploy where old code references the old name, you get cascading failures.

The fix: Add a new column, backfill, dual-write for one deploy cycle, then drop the old column in a separate migration after all code is updated. Yes, it's three deploys. That's the point.
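A minimal sketch of the three-deploy rename, assuming a users table with an id primary key (table and column names, and the batch range, are illustrative):

```sql
-- Deploy 1 (migration): add the new column — instant, no table rewrite
ALTER TABLE users ADD COLUMN full_name TEXT;

-- Deploy 1 (application): dual-write to BOTH name and full_name;
-- meanwhile backfill existing rows in batches from a script
UPDATE users SET full_name = name
WHERE full_name IS NULL AND id BETWEEN 1 AND 10000;

-- Deploy 2 (application): switch all reads to full_name

-- Deploy 3 (migration): drop the old column once nothing references it
ALTER TABLE users DROP COLUMN name;
```

Old code keeps working at every step because nothing it depends on disappears until the final deploy.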

2. The Unindexed Foreign Key

What happens: You add a foreign key constraint to an existing table: ALTER TABLE orders ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id). On a 50M-row table, Postgres validates every single row. Your migration holds a SHARE ROW EXCLUSIVE lock for the entire validation.

Why it hurts: On a hot table, this blocks inserts and updates for the entire validation duration — which can be minutes on large tables. I once saw this take 11 minutes on a 200M-row orders table.

The fix: Use NOT VALID to add the constraint without checking existing rows, then run VALIDATE CONSTRAINT separately. The validation step only needs a SHARE UPDATE EXCLUSIVE lock, which doesn't block reads or normal writes.
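The two-step version, sketched against the same orders table:

```sql
-- Step 1 — instant: constraint enforced for new writes only,
-- existing rows are not scanned
ALTER TABLE orders
  ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;

-- Step 2 — separate statement: scans existing rows under a
-- SHARE UPDATE EXCLUSIVE lock, so reads and normal writes keep flowing
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;
```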

3. The Default-Value Trap

What happens: Before Postgres 11, adding a column with a default value rewrote the entire table. In Postgres 11+, this is instant for non-volatile defaults — but teams still hit the volatile variant: ALTER TABLE events ADD COLUMN created_at TIMESTAMPTZ DEFAULT clock_timestamp(). Functions like clock_timestamp(), random(), and gen_random_uuid() are volatile (now() and CURRENT_TIMESTAMP are merely stable, so they take the fast path). A volatile default means a full table rewrite.

Why it hurts: A table rewrite on a 100M-row table can take tens of minutes, during which the table is effectively locked for writes.

The fix: Add the column with no default, or with a constant or stable default (on 11+, a non-volatile default is evaluated once and stored in the catalog — no rewrite). Then backfill in batches using an UPDATE with a WHERE clause that bounds each batch. For genuinely volatile defaults, add the column with a null default and compute the per-row value during the backfill.
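A sketch of that sequence, assuming an events table with an id primary key (names and batch range illustrative); the batched UPDATE runs from a script, not a migration:

```sql
-- Migration: add the column with no default — instant, no rewrite
ALTER TABLE events ADD COLUMN created_at TIMESTAMPTZ;

-- Script, repeated over id ranges: backfill in bounded batches
UPDATE events
SET created_at = now()
WHERE created_at IS NULL AND id BETWEEN 1 AND 10000;

-- Optional follow-up migration: default for future rows only — instant
ALTER TABLE events ALTER COLUMN created_at SET DEFAULT now();
```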

4. The Index That Locks the Table

What happens: You run CREATE INDEX idx_orders_email ON orders(email) inside a migration. Standard CREATE INDEX takes a SHARE lock on the table — blocking all inserts and updates until the index is fully built.

Why it hurts: Index creation on large tables can take minutes. Every write to the table queues behind the lock. Your API starts returning 504s. Your on-call gets paged.

The fix: Always use CREATE INDEX CONCURRENTLY. Yes, it's slower and can't run inside a transaction block. That's why most migration tools support a "non-transactional" flag. Use it.
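The concurrent form, plus the cleanup step people forget — a failed concurrent build leaves an INVALID index behind that must be dropped before retrying:

```sql
-- Must run OUTSIDE a transaction block
CREATE INDEX CONCURRENTLY idx_orders_email ON orders (email);

-- If the build failed, it left an INVALID index; drop it and retry
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_email;
```

You can spot leftover invalid indexes by checking pg_index.indisvalid = false.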

5. The Irreversible Migration

What happens: You drop a column or table, discover the code still needs it, and realize there's no going back. Your "down" migration is a comment that says -- cannot reverse this.

Why it hurts: Without a rollback path, any issue post-deploy becomes a fire drill. You can't revert cleanly, so you're doing emergency data recovery while the production system is degraded.

The fix: Never drop columns immediately. Rename them first (e.g., _deprecated_name), wait a full release cycle to confirm nothing references them, then drop. For tables, consider creating a backup table first. Every destructive migration should have a tested rollback script — even if it's a separate file.
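The rename-then-drop gate as SQL, with an illustrative table and column (not from the original post):

```sql
-- Migration N: soft-delete — anything still using the column fails loudly
ALTER TABLE invoices RENAME COLUMN tax_code TO _deprecated_tax_code;

-- Migration N+1, a full release cycle later, once nothing references it:
ALTER TABLE invoices DROP COLUMN _deprecated_tax_code;
```

A rename is instantly reversible; a drop is not. That asymmetry is the whole point of the gate.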

Safe vs. Dangerous: The Migration Timeline

Here's how the same schema change — adding a NOT NULL column with a default — looks when done dangerously versus safely, step by step:

Deploy 1 — The Only Deploy
Single migration: ADD COLUMN + NOT NULL + DEFAULT
One ALTER statement does everything at once

ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending';

On Postgres <11, this rewrites the entire table. On Postgres 11+, a constant default avoids the rewrite and satisfies the NOT NULL for existing rows — but the ALTER still takes an ACCESS EXCLUSIVE lock, and it queues behind any open transaction on the table, blocking every query that arrives after it. Run the same pattern on an older Postgres, with a volatile default, or as SET NOT NULL on an existing column, and you get a full table scan or rewrite on top of that lock. While the lock is held, writes are blocked.

On a 50M-row table: estimated 3-8 minutes of write blocking. Your order creation endpoint returns 500s. Customers see failed checkouts.

Outcome
Downtime window: 3-8 minutes
No rollback path — column is there, constraint is enforced

If anything goes wrong during the lock, you have to wait for it to complete or kill the migration — which may leave the schema in a partial state. Rolling back the code deploy doesn't help because the column now exists and old code doesn't know about it.

Deploy 1 — Add Column (nullable)
ALTER TABLE orders ADD COLUMN status VARCHAR(20) DEFAULT 'pending'
Instant on Postgres 11+ — no table rewrite, no lock contention

By adding the column as nullable with a constant default, Postgres stores the default in the catalog without rewriting existing rows. Zero downtime. Old code that doesn't know about this column works fine.

Deploy 2 — Backfill + Dual-Write
Batch UPDATE + application code writes to new column
Backfill in chunks of 10K rows with 100ms sleep between batches

UPDATE orders SET status = 'pending' WHERE status IS NULL AND id BETWEEN $1 AND $2;

Run this in a script, not a migration. Use batches of 5K-10K rows with short pauses to avoid overwhelming WAL and replication. Monitor pg_stat_activity for lock contention during each batch.
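Since UPDATE has no LIMIT clause, one common batching shape (a sketch, not the post's exact script) bounds each batch with an id subquery; FOR UPDATE SKIP LOCKED keeps the batch from stalling behind rows locked by live transactions:

```sql
-- One batch; repeat from the script until 0 rows are updated,
-- sleeping ~100ms between iterations
UPDATE orders
SET status = 'pending'
WHERE id IN (
    SELECT id
    FROM orders
    WHERE status IS NULL
    ORDER BY id
    LIMIT 10000
    FOR UPDATE SKIP LOCKED
);
```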

Deploy 3 — Add Constraint
ALTER TABLE orders ADD CONSTRAINT status_not_null CHECK (status IS NOT NULL) NOT VALID
Instant — constraint added but not validated against existing rows

The NOT VALID flag tells Postgres to enforce the constraint on new writes only, without scanning existing rows. This is instant and non-blocking. New inserts/updates must satisfy the constraint.

Deploy 4 — Validate
ALTER TABLE orders VALIDATE CONSTRAINT status_not_null
Scans existing rows, but only takes SHARE UPDATE EXCLUSIVE lock

This lock allows concurrent reads AND writes — only blocking DDL and vacuum. The validation scans rows but doesn't modify them. On a 50M-row table, this takes seconds to a couple minutes but with zero write blocking.

Outcome
Zero downtime. Full rollback at every step.
Four deploys over 2 days vs. 5 minutes of terror

At any point in this sequence, you can stop and roll back. Column added? Drop it. Backfill partial? No harm. Constraint added NOT VALID? Drop it. Each step is independently reversible. This is the expand-contract pattern in practice.

Postgres Lock Types: A Cheat Sheet

Most migration disasters come from not understanding which DDL operations acquire which locks. Here's the critical subset:

Blocks writes:
- CREATE INDEX (non-concurrent)
- ALTER TABLE ... ADD COLUMN ... NOT NULL
- ALTER TABLE ... RENAME COLUMN
- ALTER TABLE ... SET DATA TYPE
- ALTER TABLE ... ADD FOREIGN KEY

Writes continue:
- CREATE INDEX CONCURRENTLY
- ALTER TABLE ... ADD COLUMN (nullable)
- ADD CONSTRAINT ... NOT VALID
- VALIDATE CONSTRAINT
- ALTER TABLE ... SET DEFAULT

Rule of thumb: If a DDL statement takes an ACCESS EXCLUSIVE or SHARE ROW EXCLUSIVE lock on a table with more than 1M rows or any concurrent write traffic, it needs the expand-contract treatment.

The Expand-Contract Pattern

This is the single most important pattern for zero-downtime migrations. Every safe migration I've ever run follows this structure:

Phase 1: Expand

Add new schema elements alongside existing ones. The old code continues working because nothing it depends on has changed. New columns are nullable, new tables are empty, new constraints are NOT VALID.

Phase 2: Migrate

Deploy code that writes to both old and new schema. Backfill historical data in batches. Run validation queries to confirm data integrity. This phase might span multiple deploys.

Phase 3: Contract

Once all code reads from the new schema and writes to the new schema, remove the old schema elements. But not immediately — wait a full release cycle. Run a check for any remaining references in application code, background jobs, and analytics queries.

The most common mistake: Skipping Phase 2 and going straight from Expand to Contract. "The backfill ran, data looks good, let's drop the old column." Then a background worker that processes delayed jobs references the old column at 3am. Ask me how I know.

Postgres-Specific Tips

Some hard-won lessons specific to Postgres that you won't find in most "safe migration" guides:

Quick check: What happens if you run CREATE INDEX CONCURRENTLY inside a transaction block?

Answer: Postgres raises ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block. Most migration frameworks wrap each migration in a transaction by default, so you need to explicitly opt out — in Knex, export config = { transaction: false } from the migration file; in Django, set atomic = False on the migration; in Flyway, disable executeInTransaction for that script.

1. Check pg_stat_activity Before Migrating

Before running any DDL, check for long-running transactions. A migration that needs an ACCESS EXCLUSIVE lock will queue behind any open transaction on the same table — and block all subsequent transactions while it waits. One idle-in-transaction session can turn a 50ms DDL into a 10-minute outage.

check_before_migrate.sql

-- Find transactions older than 30 seconds
SELECT pid,
       now() - xact_start AS duration,
       state,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '30 seconds'
ORDER BY duration DESC;

-- Set a lock timeout so your migration fails fast
-- instead of waiting (and blocking) forever
SET lock_timeout = '5s';

2. Use Advisory Locks for Backfill Scripts

When running batch backfills, use pg_advisory_lock to ensure only one instance runs at a time. If your deploy spawns multiple replicas that all try to backfill simultaneously, you'll hammer the database with overlapping updates and WAL throughput spikes.
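A sketch of the guard, using the non-blocking try variant (the lock key is an arbitrary application-chosen integer; 727374 is illustrative):

```sql
-- Attempt to take a session-level advisory lock
SELECT pg_try_advisory_lock(727374) AS acquired;
-- If acquired is false, another instance is already backfilling: exit.

-- ... run the batched backfill ...

-- Release when done (also released automatically when the session ends)
SELECT pg_advisory_unlock(727374);
```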

3. Monitor Replication Lag

Large migrations generate significant WAL traffic. If you have read replicas, monitor replication lag during and after migrations. A backfill that generates 10GB of WAL can push a replica minutes behind, causing stale reads in any service using the replica.
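Two queries worth watching during a backfill (Postgres 10+ for the replay_lag column):

```sql
-- On the primary: per-replica replication lag
SELECT application_name, state, replay_lag
FROM pg_stat_replication;

-- On a replica: time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replica_delay;
```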

4. Set statement_timeout on Migration Connections

A runaway migration that holds a lock for 30 minutes is worse than a migration that fails after 5. Set statement_timeout = '60s' on your migration database connection. If a single statement takes longer than that, something is wrong and it's better to fail loud.
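Both timeouts belong on the migration session before any DDL runs:

```sql
-- On the migration connection, before running any DDL
SET statement_timeout = '60s';  -- abort any single statement over 60s
SET lock_timeout = '5s';        -- fail fast instead of queueing on a lock
```

A failed migration you can retry is far cheaper than a lock held through a whole deploy window.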

The Migration Checklist

I run through a pre-flight checklist before every production migration — it's saved me more than once. A sample question from it:

Quick check: You need to change a column from VARCHAR(100) to VARCHAR(255) on a 20M-row table. What's the safest approach?

Answer: Just run the ALTER. In Postgres, increasing the length of a VARCHAR column (or changing it to TEXT) doesn't rewrite the table — it only updates catalog metadata. This is one of those cases where the "safe" thing is also the simple thing. Decreasing the length, however, requires a full table rewrite — know the difference.

The Rollback Question

Every migration review should start with: "How do we roll this back?" If the answer is "we can't," that's not automatically disqualifying — but it means you need a higher bar of confidence before deploying.

For truly irreversible operations (dropping a column that's confirmed unused), I follow a three-step gate:

Step 1: Soft-delete the column — rename it to _deprecated_columnname and deploy. Wait one full release cycle.
Step 2: Audit all queries touching that table — check pg_stat_statements for any reference to the old name. Check background jobs and analytics queries.
Step 3: Only after both steps pass, drop the renamed column in a separate migration.
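The Step 2 audit can lean on pg_stat_statements (the extension must be installed; the column name here is illustrative). Note that a pattern match on the bare name also matches the _deprecated_ variant, so treat hits as leads to inspect, not proof of use:

```sql
-- Which recorded statements mention the old column name?
SELECT calls, query
FROM pg_stat_statements
WHERE query ILIKE '%tax_code%'
ORDER BY calls DESC;
```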

Three deploys for one column drop feels excessive until the day it saves you.

Tooling That Helps

A few tools I've found genuinely useful for migration safety:

strong_migrations (Ruby) and django-pg-zero-downtime-migrations (Python) — these will actively reject unsafe migrations and suggest safe alternatives. If you're in one of these ecosystems, install them immediately.

squawk — a Postgres migration linter. Run it in CI to catch unsafe DDL before it hits production. It checks for missing CONCURRENTLY, non-validated constraints, and other footguns.

pgroll — newer tool from Xata that automates the expand-contract pattern. It manages dual-write versioning at the database level. Worth evaluating if you run many migrations.

My general rule: The migration tool doesn't matter as much as the discipline. If you can't explain the lock behavior of every DDL statement in your migration, you shouldn't be running it against production. Review the Postgres docs on explicit locking — it's the most important single page in the Postgres documentation.

Closing Thought

Database migrations are one of those rare areas where being "boring" is a feature. The teams that do migrations well aren't the ones with clever tooling — they're the ones with a culture of caution. They review every migration like a code review. They test against production-sized datasets. They ask "what if we need to roll this back at 3am?"

The cost of a bad migration isn't just the downtime. It's the trust erosion — from your team, from your users, from yourself. Treat your schema changes like the production changes they are, and you'll sleep a lot better.