Database Design · February 22, 2025 · 7 min read

Structural Honesty in Database Schema Design

Why normalization is still the most efficient way to maintain data integrity in complex systems, and when to break the rule intentionally.

There’s a pattern I’ve seen in codebases that grow from startup to scale: the database schema starts clean, then someone denormalizes “for performance,” then someone else adds a redundant column “for simplicity,” and within eighteen months you have a schema that no longer tells the truth about the data.

I call this schema drift. It’s the database equivalent of leaking abstractions.

What Normalization Actually Is

Normalization is not about following rules from a textbook. It’s about ensuring that each fact is stored exactly once. When you store a fact in two places, you’ve created an obligation: every write must update both, or your data is inconsistent.

Third Normal Form (3NF) is the practical target for most OLTP systems. It means:

  1. Every column depends on the key (1NF)
  2. Every column depends on the whole key (2NF)
  3. Every column depends on nothing but the key (3NF)

The third rule is the one people violate most — storing user_city on an orders table because it was “convenient at the time.”

A Real Schema Violation

-- What you often see
CREATE TABLE orders (
    id          UUID PRIMARY KEY,
    user_id     UUID NOT NULL,
    user_email  TEXT NOT NULL,   -- violation: depends on user, not order
    user_city   TEXT NOT NULL,   -- violation: same
    amount      NUMERIC NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Now user_email lives in two tables. A user updates their email. Do you update every order row? What happens during the window between the UPDATE on users and the backfill of orders?
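The anomaly is easy to reproduce. A sketch of the two-step write this schema forces on you (parameter names illustrative):

```sql
-- Step 1: the user changes their email.
UPDATE users
SET    email = 'new@example.com'
WHERE  id = :user_id;

-- Step 2: the backfill. Until this commits, any reader comparing the
-- two tables sees two different "truths" for the same fact.
UPDATE orders
SET    user_email = 'new@example.com'
WHERE  user_id = :user_id;
```

Wrapping both statements in one transaction narrows the window but does not remove the obligation: every code path that ever writes email must remember to run both, forever.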

-- Honest schema
CREATE TABLE orders (
    id         UUID PRIMARY KEY,
    user_id    UUID NOT NULL REFERENCES users(id),
    amount     NUMERIC NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Join to users at read time. PostgreSQL’s query planner handles a single foreign-key join efficiently; at realistic OLTP data volumes you will not feel it.
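For completeness, the read that replaces the copied columns is a plain join (column names assumed from the schemas above):

```sql
SELECT o.id,
       o.amount,
       o.created_at,
       u.email,
       u.city
FROM   orders o
JOIN   users  u ON u.id = o.user_id
WHERE  o.id = :order_id;
```

With the primary key on users(id), this is one index lookup per order row — not the kind of cost that shows up in a profile.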

When Denormalization Is the Right Call

There are real cases where denormalization is justified:

1. Temporal snapshots — An order should capture what the user’s address was at the time of purchase, not what it is today. In this case, copying the address onto the order is semantically correct, not a violation.
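A sketch of a snapshot done deliberately (column names are illustrative):

```sql
CREATE TABLE orders (
    id                UUID PRIMARY KEY,
    user_id           UUID NOT NULL REFERENCES users(id),
    -- Snapshot, not a cache: the address *as it was at purchase time*.
    -- It never changes after insert, so there is no update anomaly.
    shipping_address  TEXT NOT NULL,
    amount            NUMERIC NOT NULL,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

The distinction is ownership: shipping_address here is a fact about the order, not about the user, so it is not the same fact stored twice.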

2. Extreme read throughput — If a query runs millions of times per second and a join is measurably the bottleneck (confirmed with EXPLAIN ANALYZE), a materialized column can be warranted. But benchmark first; the PostgreSQL planner is better than most engineers expect.
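Before denormalizing, measure. One way to check whether the join is actually the cost (exact plan output varies by PostgreSQL version and data):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, u.email
FROM   orders o
JOIN   users  u ON u.id = o.user_id
WHERE  o.created_at > NOW() - INTERVAL '1 day';
```

If the join node's contribution is a rounding error next to the scan time, a materialized column buys you nothing but the update anomaly.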

3. Event-sourced systems — Denormalized read models are the point of CQRS/ES. The source of truth is the event log; the read model is a derived projection.

The Rule of One Source

Before adding any column, ask: is there already a table that owns this fact?

If yes, reference it with a foreign key. If the join is too slow, investigate the query plan before denormalizing. Add an index. Materialize a view. Denormalization should be a deliberate, documented decision — not a default.
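The escalation path above can be sketched in order of increasing commitment (names illustrative):

```sql
-- 1. Cheapest fix: index the join column.
CREATE INDEX idx_orders_user_id ON orders (user_id);

-- 2. Next step: materialize the joined shape, refreshed on a schedule,
--    leaving the base tables normalized.
CREATE MATERIALIZED VIEW order_summaries AS
SELECT o.id, o.amount, o.created_at, u.email, u.city
FROM   orders o
JOIN   users  u ON u.id = o.user_id;

REFRESH MATERIALIZED VIEW order_summaries;  -- run from a scheduled job
```

Only if both fail a benchmark does a redundant column earn its place in the schema — and then with a comment explaining why it is there.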

The schema is documentation. It tells every engineer who reads it what the system believes to be true. Keep it honest.