The Airflow DAG: Validate, Extract, Transform, Load, Notify

Run from the CLI, ContextFlow is a script. Run from Airflow, it’s an orchestrated pipeline with retries, run history, and a failure that fails loudly. The DAG is five tasks, each with a deliberate retry policy.

// 01 — THE TASK GRAPH

validate_inputs >> extract >> transform >> load >> notify

max_active_runs = 1   # one model load at a time on CPU
retries = 2
retry_exponential_backoff = True

max_active_runs = 1 prevents two runs from loading the embedding model onto the CPU simultaneously and starving each other.

// 02 — RETRIES THAT MATCH REALITY

Not every task should retry the same way:

validate_inputs: fails loudly if no PDFs are found, so an empty run never silently “succeeds” with zero output.
extract: only 1 retry. A corrupt PDF doesn’t become readable on the second attempt; retrying it is wasted time.
transform / load: 2 retries. These cover transient issues (a model download blip, a momentarily unavailable ChromaDB) that a retry genuinely fixes.

The retry count encodes a belief about why a task fails. Retrying a deterministic failure is superstition.

// 03 — THE XCOM FILE-PATH PATTERN

Airflow’s XCom passes data between tasks through its metadata database, capped around 48 KB. Embedded chunks are ~2.5 MB, far too big for this (this caused a real, silent bug; its own anomaly log). The fix: transform writes chunks to a timestamped JSON file on the shared volume and passes only the file path through XCom; load reads the file and deletes it. XCom carries a pointer, never the payload.

TAKEAWAYS

Set retries per task based on the failure cause. Deterministic failures (corrupt input) get fewer; transient ones (network) get more. A uniform retry count is a missed signal.
Make validation fail loudly. A pipeline that “succeeds” with zero output is worse than one that errors.
Never pass large payloads through XCom. Pass a path to the data on shared storage, and clean up after.

Anomaly log: 2.5MB through a 48KB door: the XCom bug in full.

// 01 — THE TASK GRAPH

// 02 — RETRIES THAT MATCH REALITY

// 03 — THE XCOM FILE-PATH PATTERN

TAKEAWAYS

NEXT