
Deterministic Chunk IDs: Idempotency as Structure, Not a Flag

Re-running the pipeline updates in place, never duplicates, because of how IDs are made.


Re-ingest a document into ContextFlow and nothing duplicates. The chunks update in place. There’s no “already processed?” check, no dedup pass. Idempotency is built into how chunk IDs are generated, so it can’t be forgotten.

// 01 — THE ID

A chunk’s ID is a hash of its identity, meaning its source, page, and position:

import hashlib
# The key encodes where the chunk lives (source, page, position), not its text
key = f"{source}::p{page_number}::c{chunk_index}"
chunk_id = hashlib.sha256(key.encode()).hexdigest()[:16]

The same document, chunked the same way, always produces the same IDs. The ID isn’t assigned; it’s derived, so a re-processed chunk arrives carrying the identity it had last time.
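As a minimal sketch of that property, the two lines can be wrapped in a function and called twice. The function name `make_chunk_id` and the sample path are assumptions for illustration, not names from ContextFlow.

```python
import hashlib


def make_chunk_id(source: str, page_number: int, chunk_index: int) -> str:
    """Derive a 16-hex-character ID from a chunk's source, page, and position."""
    key = f"{source}::p{page_number}::c{chunk_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]


# Hypothetical sample values; any source string and indices behave the same way.
first_run = make_chunk_id("reports/q3.pdf", 4, 2)
second_run = make_chunk_id("reports/q3.pdf", 4, 2)
neighbor = make_chunk_id("reports/q3.pdf", 4, 3)

print(first_run == second_run)  # True: same inputs, same ID on every run
print(len(first_run))           # 16
```

Nothing is stored between the two calls. The second run arrives at the same ID because the inputs are the same, not because it looked anything up.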

// 02 — UPSERT, NOT ADD

The load step uses collection.upsert(), not add(). Combined with stable IDs, re-running on an already-indexed document overwrites those exact chunk IDs instead of inserting new rows. A document you ingest five times occupies the same space as one you ingest once.
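Here is a rough sketch of that behavior. The `InMemoryCollection` class is a hypothetical stand-in: a dict that mimics a store whose `upsert()` overwrites by ID. It is not the real collection API, and the `ingest` helper, the 1-based page numbering, and the sample data are likewise assumptions.

```python
import hashlib


def make_chunk_id(source, page_number, chunk_index):
    key = f"{source}::p{page_number}::c{chunk_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]


class InMemoryCollection:
    """Hypothetical stand-in for a store whose upsert() overwrites by ID."""

    def __init__(self):
        self.rows = {}

    def upsert(self, ids, documents):
        for chunk_id, text in zip(ids, documents):
            self.rows[chunk_id] = text  # same ID -> same slot


def ingest(collection, source, pages):
    """pages is a list of pages, each page a list of chunk texts."""
    ids, documents = [], []
    for page_number, chunks in enumerate(pages, start=1):  # assumes 1-based pages
        for chunk_index, text in enumerate(chunks):
            ids.append(make_chunk_id(source, page_number, chunk_index))
            documents.append(text)
    collection.upsert(ids=ids, documents=documents)


collection = InMemoryCollection()
pages = [["intro", "scope"], ["method"]]

# Ingest the same document five times.
for _ in range(5):
    ingest(collection, "handbook.pdf", pages)
count_after_five = len(collection.rows)
print(count_after_five)  # 3

# Re-ingest with one chunk edited: the new text replaces the old under the same ID.
ingest(collection, "handbook.pdf", [["intro (revised)", "scope"], ["method"]])
print(len(collection.rows))  # 3
```

Three chunks go in and three rows exist afterward, however many times the loop runs. The edited chunk replaces its predecessor because its position, and therefore its ID, did not change.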

// 03 — WHY STRUCTURAL BEATS A FLAG

You could get idempotency with a tracking table (“have I seen this file?”), but that’s a check you can forget, get wrong, or race. Deriving the ID from the chunk’s identity (source, page, position) rules out duplicates by construction: a second copy of the same chunk would carry the same ID as the first, so the upsert lands on the existing row instead of creating a new one. The guarantee lives in the data model, not in a conditional someone has to remember to write.
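To make the contrast concrete, here is a small sketch of both approaches side by side. Everything in it is illustrative: the two dict stores, the function names, and the `check_seen` switch (which simulates a caller forgetting the check) are assumptions, not ContextFlow code.

```python
import hashlib
import uuid

# Approach A: assigned IDs plus a "have I seen this file?" flag.
store_with_flag = {}
seen_files = set()


def ingest_with_flag(source, chunks, check_seen=True):
    """Safety depends on the seen-check actually running."""
    if check_seen and source in seen_files:
        return
    seen_files.add(source)
    for text in chunks:
        store_with_flag[uuid.uuid4().hex] = text  # fresh ID on every write


# Approach B: IDs derived from source, page, and position.
store_derived = {}


def ingest_derived(source, page_number, chunks):
    """A repeat write lands on the key that already exists."""
    for chunk_index, text in enumerate(chunks):
        key = f"{source}::p{page_number}::c{chunk_index}"
        chunk_id = hashlib.sha256(key.encode()).hexdigest()[:16]
        store_derived[chunk_id] = text


chunks = ["alpha", "beta"]

ingest_with_flag("notes.pdf", chunks)
ingest_with_flag("notes.pdf", chunks, check_seen=False)  # the forgotten check

ingest_derived("notes.pdf", 1, chunks)
ingest_derived("notes.pdf", 1, chunks)  # no check exists to forget

print(len(store_with_flag))  # 4: duplicates slipped in
print(len(store_derived))    # 2: the second run overwrote the first
```

Approach A is correct only while every caller keeps the check in place. Approach B has no check to skip.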

ANTHONY PENA · @FROGWEBP
I build data systems and write about everything around them: the architecture, the failures, and what each one teaches me. Documenting in public since 2021: the process, not just the result.
