The Star Schema, Monthly Partitions, and Why I Indexed That Way

The warehouse layer is a star schema: fact_events in the center, four dimension tables around it. This is build log 04: the DDL, the partitioning strategy, and the index choices and their reasoning.

// 01 — THE FACT TABLE

CREATE TABLE warehouse.fact_events (
    event_id        UUID NOT NULL,
    event_hash      TEXT UNIQUE NOT NULL,  -- deterministic dedup key
    account_id      UUID REFERENCES warehouse.dim_accounts,
    user_id         UUID REFERENCES warehouse.dim_users,
    geo_id          INTEGER REFERENCES warehouse.dim_geography,
    event_type_id   INTEGER REFERENCES warehouse.dim_event_types,
    occurred_at     TIMESTAMPTZ NOT NULL,
    event_month     DATE NOT NULL,          -- partition key
    properties      JSONB
) PARTITION BY RANGE (event_month);

event_hash is the idempotency key: MD5(event_id || account_id || occurred_at). The ON CONFLICT (event_hash) DO NOTHING on insert means re-processing an event is a no-op.

properties is kept as JSONB rather than typed columns because event shapes vary. I’d rather deserialize in the analytics query than maintain a wide schema.

// 02 — MONTHLY PARTITIONS

-- created dynamically by Phase 1 of the pipeline
CREATE TABLE warehouse.fact_events_2026_05
    PARTITION OF warehouse.fact_events
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

Partitioning by month means:

Queries filtered by date range (WHERE occurred_at BETWEEN ...) scan only the relevant partitions via partition pruning.
Retention queries that join fact to itself across months (W+1, W+2, W+4) scan only the two or three partitions involved.
Old partitions can be detached and archived without touching the live table.

Monthly granularity was chosen over daily because at current event volume, daily partitions are too granular (too many small tables) and yearly partitions are too coarse (no pruning benefit within a year).

// 03 — THE DIMENSION TABLES

-- representative example
CREATE TABLE warehouse.dim_accounts (
    account_id   UUID PRIMARY KEY,
    account_name TEXT NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW()
);

Dimensions are deliberately small: account, user, geography (country/region/city), event type. Each has a surrogate key the fact table references. Upserts use ON CONFLICT DO UPDATE to keep dimension values current as they evolve.

Geography uses UNIQUE NULLS NOT DISTINCT (country_code, region, city) to handle events where only country_code is known, a PostgreSQL 15 feature that treats NULL = NULL for uniqueness.

// 04 — THE INDEX STRATEGY

-- per partition, created alongside the partition itself
CREATE INDEX ON warehouse.fact_events_2026_05 (account_id, occurred_at);
CREATE INDEX ON warehouse.fact_events_2026_05 (user_id, occurred_at);
CREATE INDEX ON warehouse.fact_events_2026_05 (event_type_id);

The analytics queries have three shapes:

Funnel / retention: filter by account_id, scan forward in time: (account_id, occurred_at).
User-level retention: filter by user_id, scan forward: (user_id, occurred_at).
Event-type aggregation: group by event_type_id: (event_type_id).

All three are covered without a composite index that spans all four columns. Each partition gets its own copy, since partition-local indexes are smaller and faster to build than a global index across all partitions.

event_hash already has a UNIQUE index (from the constraint), which doubles as the dedup lookup.

// 05 — WHAT I SKIPPED

No covering indexes, no partial indexes, no BRIN. At current data volume, the above set is sufficient. I’ll add more as query patterns emerge from the dashboards. Indexes added speculatively are rarely the right ones.

Build log 05: funnels, retention, and revenue as pure SQL against the star schema.

// 01 — THE FACT TABLE

// 02 — MONTHLY PARTITIONS

// 03 — THE DIMENSION TABLES

// 04 — THE INDEX STRATEGY

// 05 — WHAT I SKIPPED

NEXT