Infrastructure · January 7, 2025 · 11 min read

Zero-Downtime Migrations: A Field Guide

Strategies for handling production schema changes without locking or service interruptions — battle-tested on high-traffic Go services.

The naive approach to schema migration is: stop the service, run the migration, start the service. This works fine for a personal project. In production, “stop the service” means downtime, and downtime means angry users and broken SLAs.

Zero-downtime migration is a discipline, not a single technique. It requires thinking about the transition window — the period where old code and new code run simultaneously against the same database.

The Expand / Contract Pattern

Every zero-downtime migration follows the same shape:

  1. Expand — add new structure without removing old structure
  2. Migrate — backfill data, dual-write
  3. Contract — remove the old structure once all code uses the new

This can span multiple deployments. That’s fine. Schema cleanup is not urgent.
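The three phases can be written down as a checklist of what application code touches at each step. A minimal sketch in Go (the `phase` struct and plan are illustrative, not from any migration library):

```go
package main

import "fmt"

// phase records which columns application code touches at each step of
// an expand/contract migration. Old and new code overlap during deploys,
// so every phase must be safe to run alongside the phase before it.
type phase struct {
	name     string
	writeOld bool // still writing the old column?
	writeNew bool // writing the new column yet?
	readNew  bool // reading from the new column?
}

var plan = []phase{
	{name: "expand: add column, dual-write", writeOld: true, writeNew: true},
	{name: "migrate: backfill old rows", writeOld: true, writeNew: true},
	{name: "flip: read the new column", writeOld: true, writeNew: true, readNew: true},
	{name: "contract: stop old writes, then drop", writeNew: true, readNew: true},
}

func main() {
	for i, p := range plan {
		fmt.Printf("%d. %s\n", i+1, p.name)
	}
}
```

Note that no phase stops writing a column and removes it at the same time; that invariant is what makes each deploy reversible.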

Renaming a Column Safely

Renaming users.name to users.full_name seems trivial. It is not.

Step 1 — Expand: add the new column

ALTER TABLE users ADD COLUMN full_name TEXT;

Deploy code that writes to both name and full_name. Reads still use name.
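The dual-write itself is a small change in the application. A sketch of the write path in Go (the query-builder shape is illustrative; any `db.Exec`-style API works the same way):

```go
package main

import "fmt"

// dualWriteUpdate builds the UPDATE used during the expand phase: the
// same value goes to both the old column (name) and the new column
// (full_name), so rows stay consistent whichever column is later read.
func dualWriteUpdate(userID int64, fullName string) (query string, args []any) {
	query = "UPDATE users SET name = $1, full_name = $1 WHERE id = $2"
	args = []any{fullName, userID}
	return query, args
}

func main() {
	q, args := dualWriteUpdate(42, "Ada Lovelace")
	fmt.Println(q)
	fmt.Println(len(args))
}
```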

Step 2 — Backfill existing rows

UPDATE users
SET full_name = name
WHERE full_name IS NULL;

Do this in batches on large tables. A single UPDATE on millions of rows holds its row locks, in one giant transaction, for the entire duration:

DO $$
DECLARE
  batch_size INT := 5000;
  updated    INT;
BEGIN
  LOOP
    UPDATE users
    SET    full_name = name
    WHERE  id IN (
      SELECT id FROM users
      WHERE  full_name IS NULL
      LIMIT  batch_size
    );
    GET DIAGNOSTICS updated = ROW_COUNT;
    EXIT WHEN updated = 0;
    COMMIT;                 -- PG 11+: release each batch's locks;
                            -- without this the DO block is one transaction
    PERFORM pg_sleep(0.05); -- breathe
  END LOOP;
END $$;
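The same batching can be driven from the application instead of a DO block. A sketch where execBatch stands in for running the batched UPDATE through database/sql and returning rows affected, so the loop itself needs no live connection:

```go
package main

import (
	"fmt"
	"time"
)

// backfillInBatches runs execBatch until it reports zero affected rows,
// pausing between batches so the backfill never monopolizes the table.
// execBatch is a stand-in for db.Exec of the batched UPDATE; each call
// commits independently, releasing its row locks.
func backfillInBatches(execBatch func(limit int) (int64, error), batchSize int, pause time.Duration) (int64, error) {
	var total int64
	for {
		n, err := execBatch(batchSize)
		if err != nil {
			return total, err
		}
		if n == 0 {
			return total, nil
		}
		total += n
		time.Sleep(pause)
	}
}

func main() {
	remaining := int64(12000) // pretend 12k rows still need backfilling
	fake := func(limit int) (int64, error) {
		n := remaining
		if int64(limit) < n {
			n = int64(limit)
		}
		remaining -= n
		return n, nil
	}
	total, err := backfillInBatches(fake, 5000, 0)
	fmt.Println(total, err)
}
```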

Step 3 — Flip reads

Deploy code that reads full_name. Keep writing to both columns.
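The flip can be a straight deploy, or gated behind a flag so it can be rolled back without another release. A hypothetical sketch (the flag is invented for illustration; because writes still cover both columns, flipping it back is always safe):

```go
package main

import "fmt"

// selectUserSQL picks the read path. During the flip, writes still go
// to both name and full_name, so either branch returns current data.
func selectUserSQL(readFullName bool) string {
	if readFullName {
		return "SELECT full_name FROM users WHERE id = $1"
	}
	return "SELECT name FROM users WHERE id = $1"
}

func main() {
	fmt.Println(selectUserSQL(true))
}
```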

Step 4 — Contract: drop the old column

Deploy code that stops writing to name. Only once that deploy has fully rolled out (no old pods remain) is it safe to drop the column:

ALTER TABLE users DROP COLUMN name;

Adding a NOT NULL Column

Before PostgreSQL 11, adding a column with a DEFAULT rewrote the entire table, and adding a NOT NULL column with no default fails outright on a non-empty table. On modern PostgreSQL a constant default is a metadata-only change, but a volatile default (random(), clock_timestamp()) still triggers a full rewrite.

Safe pattern:

-- 1. Add nullable
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;

-- 2. Backfill
UPDATE orders SET processed_at = created_at WHERE processed_at IS NULL;

-- 3. Enforce the constraint without scanning existing rows
ALTER TABLE orders
  ADD CONSTRAINT orders_processed_at_not_null
  CHECK (processed_at IS NOT NULL) NOT VALID;

-- 4. Validate in the background (takes only a SHARE UPDATE EXCLUSIVE
--    lock; reads and writes continue)
ALTER TABLE orders
  VALIDATE CONSTRAINT orders_processed_at_not_null;

-- 5. On PG 12+, promote to a real NOT NULL: the validated CHECK lets
--    PostgreSQL skip the full-table scan. Then drop the redundant CHECK.
ALTER TABLE orders ALTER COLUMN processed_at SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_processed_at_not_null;

Indexing Without Locking

A plain CREATE INDEX takes a SHARE lock that blocks all writes for the duration of the build. Use CONCURRENTLY:

CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

This is slower (the table is scanned twice, and the build waits out in-flight transactions) but holds only a weaker lock that allows concurrent reads and writes. The trade-off: if the build fails, you're left with an INVALID index that must be dropped and rebuilt.

The Golden Rules

  • Never remove a column in the same deployment that stops writing to it. Old pods may still be running.
  • Backfill in batches, committing each one. A single large UPDATE holds its locks until the entire statement finishes.
  • Use NOT VALID + VALIDATE CONSTRAINT for deferred enforcement.
  • CREATE INDEX CONCURRENTLY — always.
  • Test migrations against a production-sized dataset in staging, not a 10-row dev database.