SYS · VECTOR ETL · SEMANTIC SEARCH · COMPLETE

CONTEXTFLOW

Multilingual PDF to semantic vector ETL. Drop documents in, ask in plain English, get ranked results with exact page provenance in under two seconds.

PRODUCTION READY · 60/60 TESTS

60 TESTS · CI GREEN

384D VECTOR DIMENSIONS

<2s QUERY LATENCY

$0 API COST · LOCAL

Fully local inference. sentence-transformers runs on-device. No external API calls, no data leaving the machine, no per-token billing.

// 01 · THE PROBLEM

INFORMATION TRAPPED IN DOCUMENTS

The standard approach to a library of multilingual academic PDFs is to open them one by one and press Ctrl+F. That is not search. That is manual labor disguised as a workflow.

The harder version: LaTeX-compiled linguistics papers with IPA symbols, UTF-8 encoding edge cases, inconsistent whitespace conventions, and words split across lines at the PDF page-content level. Naive pipelines silently corrupt these. They extract garbage and never report it.

ContextFlow was built for that harder case. A pipeline that correctly handles mis-decoded IPA glyphs and hyphenated line breaks has already solved the problems that ordinary documents pose.

// CORE INSIGHT

Build for the edge case. Everything else becomes easy.

Full idempotency via SHA-256 chunk IDs. Re-running on the same document performs an upsert, never a duplicate. Orchestrated with Apache Airflow: validate, extract, transform, load, notify. max_active_runs=1, retries=2, XCom carries file path only.

// 02 · THE ETL PIPELINE

DATA FLOW

// CONTEXTFLOW_INGEST — 7 STAGES · AIRFLOW DAG · IDEMPOTENT

INPUT data/raw/ · 50MB cap per file

PDF files placed in the watched directory. validate_inputs runs as the first Airflow task before any I/O begins. Fails loudly rather than silently producing an empty run.

EXTRACT pypdf 4.3.1 · stream_pages() generator

Pages yielded as (page_number, text) tuples via a generator. O(1) memory relative to document size. _decode_page_text() recovers IPA symbols via latin-1 to utf-8 re-encode. Pages below 100 chars skipped.
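
The latin-1 to utf-8 recovery described above can be sketched as follows. This is a minimal illustration, not the project's actual _decode_page_text(); the helper name recover_mojibake and its fallback behaviour are assumptions.

```python
def recover_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as latin-1 (hypothetical helper)."""
    try:
        # Re-encode to recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was already correct, or is not recoverable this way: keep it.
        return text


# Simulate the failure: the IPA symbol "ʃ" read with the wrong codec.
broken = "ʃ".encode("utf-8").decode("latin-1")
print(recover_mojibake(broken) == "ʃ")
```

The try/except matters: text that was decoded correctly in the first place either cannot be encoded as latin-1 or does not form valid UTF-8, so it passes through unchanged.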

CLEAN transform.py · clean_text()

Ligature expansion. Unicode NFC normalization. Non-breaking space collapse. Hyphenated linebreak joining. Excess newline compression. Applied in order on every page.
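
The five cleaning steps, applied in the order listed, might look like the sketch below. It is not the project's clean_text(); the ligature table, the regular expressions, and the name clean_text_sketch are assumptions chosen for illustration.

```python
import re
import unicodedata

# Common typographic ligatures (assumed set; the project's table is not shown).
LIGATURES = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}


def clean_text_sketch(text: str) -> str:
    for ligature, plain in LIGATURES.items():      # 1. ligature expansion
        text = text.replace(ligature, plain)
    text = unicodedata.normalize("NFC", text)      # 2. Unicode NFC normalization
    text = text.replace("\u00a0", " ")             # 3. non-breaking space to space
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # 4. join hyphenated line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)         # 5. compress excess newlines
    return text


print(repr(clean_text_sketch("e\ufb03cient pho-\nneme\u00a0x\n\n\n\ny")))
```

One trade-off of step 4 as written: a genuine compound that happens to break at its hyphen is also joined, so a production version may want a dictionary check.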

CHUNK LangChain Text Splitters 0.2.4 · 512c / 64 overlap

RecursiveCharacterTextSplitter with hierarchical separators. Every chunk carries provenance: source, page_number, chunk_index, char_count. The 64-character overlap reduces the chance that a sentence loses its context at a chunk boundary.
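
To show how chunk size, overlap, and provenance fit together, here is a simplified stand-in that uses plain fixed-window chunking. It is deliberately not RecursiveCharacterTextSplitter (no separator hierarchy), and the function name chunk_page is hypothetical; only the 512/64 sizes and the four provenance fields come from the description above.

```python
from typing import Dict, List


def chunk_page(text: str, source: str, page_number: int,
               size: int = 512, overlap: int = 64) -> List[Dict]:
    """Fixed-window chunking with overlap; each chunk carries its provenance."""
    step = size - overlap  # each window starts `step` characters after the last
    chunks: List[Dict] = []
    for chunk_index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + size]
        chunks.append({
            "text": piece,
            "source": source,
            "page_number": page_number,
            "chunk_index": chunk_index,
            "char_count": len(piece),
        })
        if start + size >= len(text):
            break  # this window already reached the end of the page
    return chunks


page = "".join(chr(97 + i % 26) for i in range(1000))
for c in chunk_page(page, "paper.pdf", 3):
    print(c["chunk_index"], c["char_count"])
```

The last 64 characters of each chunk reappear as the first 64 of the next, which is what lets a sentence that straddles a boundary survive in at least one chunk.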

EMBED sentence-transformers 3.0.1 · all-MiniLM-L6-v2

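384-dimensional vectors per chunk. L2-normalized so cosine similarity reduces to a dot product. Single batched call, batch size 64. Fully local, zero API cost, no document text sent to a third party. all-MiniLM-L6-v2 is trained mainly on English text, so retrieval quality on other languages should be checked per corpus.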
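
The claim that L2 normalization turns cosine similarity into a dot product can be checked with a few lines of standard-library Python. The helper names are illustrative; in sentence-transformers the normalization itself is usually requested with normalize_embeddings=True on encode(), though whether the project does it that way is not stated.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    # Assumes a non-zero vector; a zero vector has no direction to normalize.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
# After normalization both norms are 1, so the denominator of cosine() vanishes.
print(math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b)))
```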

LOAD ChromaDB 0.5.3 · _stable_chunk_id() · batch 256

SHA-256 hash of "{source}::p{page}::c{chunk_index}" truncated to 16 hex chars. Same document always produces same IDs. collection.upsert() enforces idempotency across reruns.

QUERY Streamlit 1.36.0 · CLI · cosine similarity

collection.query() with ChromaDB cosine distance. Results include documents, metadata (source, page, chunk), and distances, converted to similarity = 1.0 - distance. Queries return in under 2 seconds; expect latency to grow with corpus size and hardware limits.
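
The distance-to-similarity conversion can be sketched against the shape of a ChromaDB query result, which nests one list per query text under "documents", "metadatas", and "distances". The function name to_ranked_hits and the metadata keys are assumptions; the sample result is hand-built, not real output.

```python
from typing import Dict, List


def to_ranked_hits(result: Dict) -> List[Dict]:
    """Flatten a ChromaDB-style result for a single query into ranked hits."""
    docs = result["documents"][0]
    metas = result["metadatas"][0]
    dists = result["distances"][0]
    hits = [
        # Cosine distance is 1 - cosine similarity, so invert it here.
        {"text": doc, "similarity": 1.0 - dist, **meta}
        for doc, meta, dist in zip(docs, metas, dists)
    ]
    return sorted(hits, key=lambda h: h["similarity"], reverse=True)


sample = {  # hand-built stand-in for collection.query(...) output
    "documents": [["vowel harmony in ...", "tone sandhi rules ..."]],
    "metadatas": [[
        {"source": "a.pdf", "page_number": 4, "chunk_index": 1},
        {"source": "b.pdf", "page_number": 9, "chunk_index": 0},
    ]],
    "distances": [[0.5, 0.25]],
}
for hit in to_ranked_hits(sample):
    print(hit["source"], hit["page_number"], hit["similarity"])
```

The conversion is only meaningful when the collection was created with the cosine distance space; with a different metric, 1.0 - distance is not a similarity.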

Apache Airflow DAG

2.9.3 · contextflow_ingest · TaskFlow API

validate_inputs, extract, transform, load, notify. max_active_runs=1, retries=2. XCom carries file path only to avoid size limits.

SQLite Audit Trail

STDLIB · data/processed/runs.db · RunTimer context

One RunRecord per run: source, status, pages, chunks, duration, error. Written on context exit regardless of success. Zero dependencies.
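
A context manager that writes its record on exit whether or not the body raised could look like this. It is a sketch using only the standard library; the table layout, status strings, and constructor signature are assumptions, and only the RunRecord fields and the RunTimer name come from the description above.

```python
import sqlite3
import time
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass
class RunRecord:
    source: str
    status: str = "running"
    pages: int = 0
    chunks: int = 0
    duration: float = 0.0
    error: Optional[str] = None


class RunTimer:
    """Times a run and persists one RunRecord on exit, success or failure."""

    def __init__(self, db_path: str, source: str) -> None:
        self.db_path = db_path
        self.record = RunRecord(source=source)

    def __enter__(self) -> RunRecord:
        self._start = time.perf_counter()
        return self.record

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.record.duration = time.perf_counter() - self._start
        self.record.status = "failed" if exc_type else "success"
        self.record.error = str(exc) if exc else None
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:  # commits the two statements as one transaction
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS runs (source TEXT, status TEXT,"
                    " pages INTEGER, chunks INTEGER, duration REAL, error TEXT)"
                )
                conn.execute("INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?)",
                             astuple(self.record))
        finally:
            conn.close()
        return False  # never swallow the original exception
```

Returning False from __exit__ keeps the failure visible to the caller (for example an Airflow task) while still leaving an audit row behind.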

// 03 · IDEMPOTENCY

DETERMINISTIC CHUNK IDS

The same document always produces the same chunk IDs. Re-running the pipeline performs an upsert, never a duplicate. This is the mechanism that makes ContextFlow safe to re-trigger.

// src/load.py · _stable_chunk_id() IDEMPOTENT

import hashlib


def _stable_chunk_id(source: str, page: int, chunk_index: int) -> str:
    # Same inputs always produce the same ID.
    # Re-running on the same document: upsert, never a duplicate.
    raw = f"{source}::p{page}::c{chunk_index}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# result: a 16-character hex string, e.g. "a3f292b1c0e4d7f8"
# deterministic across machines, restarts, and reruns for an identical source string
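
Why deterministic IDs make a rerun harmless can be shown with a plain dict standing in for the vector collection. The dict, the upsert helper, and the sample chunks are all illustrative assumptions; only the ID scheme is taken from the function above.

```python
import hashlib
from typing import Dict, List


def stable_chunk_id(source: str, page: int, chunk_index: int) -> str:
    raw = f"{source}::p{page}::c{chunk_index}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


store: Dict[str, str] = {}  # stand-in for a collection keyed by chunk ID


def upsert(chunks: List[dict]) -> None:
    for c in chunks:
        key = stable_chunk_id(c["source"], c["page"], c["chunk_index"])
        store[key] = c["text"]  # insert or overwrite, never append


chunks = [
    {"source": "paper.pdf", "page": 1, "chunk_index": 0, "text": "first chunk"},
    {"source": "paper.pdf", "page": 1, "chunk_index": 1, "text": "second chunk"},
]
upsert(chunks)
upsert(chunks)  # simulated rerun on the same document
print(len(store))  # still one entry per chunk
```

Truncating to 16 hex characters keeps 64 bits of the digest. By the birthday bound, accidental collisions only become likely around 2^32 (roughly four billion) chunks, far beyond a local document library.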

// 04 · TECH STACK

BUILT WITH INTENTION

pypdf

4.3.1

PDF extraction via generator. Handles multilingual encoding edge cases and mis-decoded IPA glyphs via latin-1 to utf-8 re-encode.

LangChain Splitters

0.2.4

RecursiveCharacterTextSplitter with hierarchical separators. Matches academic PDF structure better than naive fixed-character splitting.

sentence-transformers

3.0.1

all-MiniLM-L6-v2. 384-dim, L2-normalized. Zero API cost. Runs fully local, so no document text leaves the machine. Trained mainly on English text; retrieval quality on other languages should be checked per corpus.

ChromaDB

0.5.3

Local-first vector store. PersistentClient in dev, HttpClient in Docker. Cosine similarity native. Upsert semantics enforce idempotency.

Apache Airflow

2.9.3

DAG orchestration via TaskFlow API. Retry logic, XCom, manual and scheduled triggers, run history from a single config.

Streamlit

1.36.0

RAG dashboard: file uploader, ingestion trigger, semantic search with similarity scores, run history. Session state for embedder singleton.

Pydantic Settings

2.3.4

Four nested config classes, one per prefix: EMBED_, CHUNK_, CHROMA_, PIPELINE_. Env-var overrides and .env file support built in.

structlog

24.2.0

JSON output in production inside Docker. Coloured key=value console output in dev. Consistent event keys across all pipeline stages.

Docker Compose

5 SERVICES

ChromaDB, Streamlit, Airflow init, Airflow scheduler, Airflow webserver. Shared volumes, service wiring, reproducible environments.
