Run from the CLI, ContextFlow is a script. Run from Airflow, it’s an orchestrated pipeline with retries, run history, and a failure that fails loudly. The DAG is five tasks, each with a deliberate retry policy.
// 01 — THE TASK GRAPH
validate_inputs >> extract >> transform >> load >> notify
max_active_runs = 1 # one model load at a time on CPU
retries = 2
retry_exponential_backoff = True
max_active_runs = 1 prevents two runs from loading the embedding model onto the CPU simultaneously and starving each other.
// 02 — RETRIES THAT MATCH REALITY
Not every task should retry the same way:
validate_inputs: fails loudly if no PDFs are found, so an empty run never silently “succeeds” with zero output.extract: only 1 retry. A corrupt PDF doesn’t become readable on the second attempt; retrying it is wasted time.transform/load: 2 retries. These cover transient issues (a model download blip, a momentarily unavailable ChromaDB) that a retry genuinely fixes.
The retry count encodes a belief about why a task fails. Retrying a deterministic failure is superstition.
// 03 — THE XCOM FILE-PATH PATTERN
Airflow’s XCom passes data between tasks through its metadata database, capped around 48 KB. Embedded chunks are ~2.5 MB, far too big for this (this caused a real, silent bug; its own anomaly log). The fix: transform writes chunks to a timestamped JSON file on the shared volume and passes only the file path through XCom; load reads the file and deletes it. XCom carries a pointer, never the payload.
TAKEAWAYS
- Set retries per task based on the failure cause. Deterministic failures (corrupt input) get fewer; transient ones (network) get more. A uniform retry count is a missed signal.
- Make validation fail loudly. A pipeline that “succeeds” with zero output is worse than one that errors.
- Never pass large payloads through XCom. Pass a path to the data on shared storage, and clean up after.
NEXT
- Anomaly log: 2.5MB through a 48KB door: the XCom bug in full.
