DOC: spectrum-s
STATUS: ● PUBLISHED
SYSTEM SPECTRUM

The Star Schema, Monthly Partitions, and Why I Indexed That Way

Building the warehouse layer: fact_events, four dimensions, and the index strategy.

Cover image — The Star Schema, Monthly Partitions, and Why I Indexed That Way

The warehouse layer is a star schema: fact_events in the center, four dimension tables around it. This is build log 04: the DDL, the partitioning strategy, and the index choices and their reasoning.

// 01 — THE FACT TABLE

CREATE TABLE warehouse.fact_events (
    event_id        UUID NOT NULL,
    event_hash      TEXT UNIQUE NOT NULL,  -- deterministic dedup key
    account_id      UUID REFERENCES warehouse.dim_accounts,
    user_id         UUID REFERENCES warehouse.dim_users,
    geo_id          INTEGER REFERENCES warehouse.dim_geography,
    event_type_id   INTEGER REFERENCES warehouse.dim_event_types,
    occurred_at     TIMESTAMPTZ NOT NULL,
    event_month     DATE NOT NULL,          -- partition key
    properties      JSONB
) PARTITION BY RANGE (event_month);

event_hash is the idempotency key: MD5(event_id || account_id || occurred_at). The ON CONFLICT (event_hash) DO NOTHING on insert means re-processing an event is a no-op.

properties is kept as JSONB rather than typed columns because event shapes vary. I’d rather deserialize in the analytics query than maintain a wide schema.

// 02 — MONTHLY PARTITIONS

-- created dynamically by Phase 1 of the pipeline
CREATE TABLE warehouse.fact_events_2026_05
    PARTITION OF warehouse.fact_events
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

Partitioning by month means:

Monthly granularity was chosen over daily because at current event volume, daily partitions are too granular (too many small tables) and yearly partitions are too coarse (no pruning benefit within a year).

// 03 — THE DIMENSION TABLES

-- representative example
CREATE TABLE warehouse.dim_accounts (
    account_id   UUID PRIMARY KEY,
    account_name TEXT NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW()
);

Dimensions are deliberately small: account, user, geography (country/region/city), event type. Each has a surrogate key the fact table references. Upserts use ON CONFLICT DO UPDATE to keep dimension values current as they evolve.

Geography uses UNIQUE NULLS NOT DISTINCT (country_code, region, city) to handle events where only country_code is known, a PostgreSQL 15 feature that treats NULL = NULL for uniqueness.

// 04 — THE INDEX STRATEGY

-- per partition, created alongside the partition itself
CREATE INDEX ON warehouse.fact_events_2026_05 (account_id, occurred_at);
CREATE INDEX ON warehouse.fact_events_2026_05 (user_id, occurred_at);
CREATE INDEX ON warehouse.fact_events_2026_05 (event_type_id);

The analytics queries have three shapes:

  1. Funnel / retention: filter by account_id, scan forward in time: (account_id, occurred_at).
  2. User-level retention: filter by user_id, scan forward: (user_id, occurred_at).
  3. Event-type aggregation: group by event_type_id: (event_type_id).

All three are covered without a composite index that spans all four columns. Each partition gets its own copy, since partition-local indexes are smaller and faster to build than a global index across all partitions.

event_hash already has a UNIQUE index (from the constraint), which doubles as the dedup lookup.

// 05 — WHAT I SKIPPED

No covering indexes, no partial indexes, no BRIN. At current data volume, the above set is sufficient. I’ll add more as query patterns emerge from the dashboards. Indexes added speculatively are rarely the right ones.

NEXT

@frogwebp brand mark
ANTHONY PENA · @FROGWEBP
I build data systems and write about everything around them, the architecture, the failures, what each one teaches me. Documenting in public since 2021: the process, not just the result.

// NEWSLETTER — THE BUILD LOG SIGNAL

When I ship something or learn something worth keeping, it lands here first — build logs, concepts, and the honest process behind them. Come along; no spam, leave anytime.