SYS · VECTOR ETL · SEMANTIC SEARCH · COMPLETE

CONTEXT
FLOW

Multilingual PDF to semantic vector ETL. Drop documents in, ask in plain English, get ranked results with exact page provenance in under two seconds.

PRODUCTION READY · 60/60 TESTS
60 TESTS · CI GREEN
384D VECTOR DIMENSIONS
<2s QUERY LATENCY
$0 API COST · LOCAL
Fully local inference. sentence-transformers runs on-device. No external API calls, no data leaving the machine, no per-token billing.
// 01 · THE PROBLEM

INFORMATION
TRAPPED IN
DOCUMENTS

The standard approach to a library of multilingual academic PDFs is to open them one by one and press Ctrl+F. That is not search. That is manual labor disguised as a workflow.

The harder version: LaTeX-compiled linguistics papers with IPA symbols, UTF-8 encoding edge cases, inconsistent whitespace conventions, and words split across lines at the PDF page-content level. Naive pipelines silently corrupt these. They extract garbage and never report it.

ContextFlow was built for that harder case. A pipeline that correctly handles mis-decoded IPA glyphs and hyphenated line breaks will handle cleaner documents with little extra work.

// CORE INSIGHT
Build for the edge case. Everything else becomes easy.
Full idempotency via SHA-256 chunk IDs. Re-running on the same document performs an upsert, never a duplicate. Orchestrated with Apache Airflow: validate, extract, transform, load, notify. max_active_runs=1, retries=2, XCom carries file path only.
// 02 · THE ETL PIPELINE

DATA
FLOW

// CONTEXTFLOW_INGEST — 7 STAGES · AIRFLOW DAG · IDEMPOTENT
01
INPUT data/raw/ · 50MB cap per file
PDF files placed in the watched directory. validate_inputs runs as the first Airflow task before any I/O begins. Fails loudly rather than silently producing an empty run.
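A minimal sketch of a fail-loud input check, assuming the 50MB cap means 50 MiB per file and that an empty directory counts as an error. The function body, exception types, and constant name are illustrative assumptions, not the project's code.

```python
from pathlib import Path

MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: "50MB cap" read as 50 MiB


def validate_inputs(raw_dir: str) -> list[str]:
    """Return sorted PDF paths, raising instead of allowing an empty run."""
    directory = Path(raw_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"input directory not found: {raw_dir}")

    pdfs = sorted(p for p in directory.iterdir() if p.suffix.lower() == ".pdf")
    if not pdfs:
        raise ValueError(f"no PDF files in {raw_dir}")

    oversized = [p.name for p in pdfs if p.stat().st_size > MAX_FILE_BYTES]
    if oversized:
        raise ValueError(f"files over the size cap: {oversized}")

    return [str(p) for p in pdfs]
```

Raising here means the orchestrator marks the run as failed before any extraction work starts.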
02
EXTRACT pypdf 4.3.1 · stream_pages() generator
Pages are yielded as (page_number, text) tuples via a generator, so extracted text uses O(1) memory relative to document size. _decode_page_text() recovers IPA symbols by re-encoding mis-decoded text as latin-1 and decoding it as UTF-8. Pages shorter than 100 characters are skipped.
03
CLEAN transform.py · clean_text()
Ligature expansion. Unicode NFC normalization. Non-breaking space collapse. Hyphenated linebreak joining. Excess newline compression. Applied in order on every page.
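A rough sketch of the five cleaning steps in the order listed. The ligature table and the regular expressions are assumptions chosen for illustration; the project's clean_text() may use different patterns.

```python
import re
import unicodedata

# Assumed ligature table; the project's list may differ.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_text(text: str) -> str:
    # 1. Expand typographic ligatures into plain letters.
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    # 2. NFC: compose base letters with combining marks where possible.
    text = unicodedata.normalize("NFC", text)
    # 3. Replace non-breaking spaces, then collapse runs of spaces and tabs.
    text = text.replace("\u00a0", " ")
    text = re.sub(r"[ \t]+", " ", text)
    # 4. Join words hyphenated across a line break: "phonol-\nogy" -> "phonology".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 5. Compress three or more newlines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

One known limitation of step 4 as written: a genuine compound broken at its hyphen ("well-" / "known") is joined as well.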
04
CHUNK LangChain Text Splitters 0.2.4 · 512c / 64 overlap
RecursiveCharacterTextSplitter with hierarchical separators. Every chunk carries provenance: source, page_number, chunk_index, char_count. The 64-character overlap lowers the chance that a sentence loses its context at a chunk boundary.
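To show how overlap and provenance fit together, here is a simplified fixed-window splitter. It is a stand-in, not RecursiveCharacterTextSplitter: it ignores separators entirely and only demonstrates the 512/64 window arithmetic and the per-chunk metadata. The Chunk dataclass and chunk_page() are hypothetical names.

```python
from dataclasses import dataclass

CHUNK_SIZE = 512
CHUNK_OVERLAP = 64


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    page_number: int
    chunk_index: int
    char_count: int


def chunk_page(text: str, source: str, page_number: int) -> list[Chunk]:
    """Fixed-window split; each chunk repeats the last CHUNK_OVERLAP chars of the previous one."""
    step = CHUNK_SIZE - CHUNK_OVERLAP  # 448 new characters per chunk
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + CHUNK_SIZE]
        chunks.append(Chunk(piece, source, page_number, index, len(piece)))
        if start + CHUNK_SIZE >= len(text):
            break  # this window already reached the end of the page
    return chunks
```

Because chunk_index restarts on every page, the triple (source, page_number, chunk_index) identifies a chunk uniquely.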
05
EMBED sentence-transformers 3.0.1 · all-MiniLM-L6-v2
384-dimensional vectors per chunk. L2-normalized so cosine similarity reduces to a dot product. Single batched call, batch size 64. Fully local: zero API cost, no data leaves the machine. The model is trained mainly on English text, so retrieval quality on non-English passages should be verified per corpus.
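Why normalization matters can be checked in a few lines of plain Python. This sketch uses toy 2-dimensional vectors in place of real embeddings; in sentence-transformers the same normalization is available through the normalize_embeddings argument of encode(), though whether the project uses that flag is an assumption.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # a zero vector has no direction to preserve
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Full cosine formula: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
full = cosine_similarity(a, b)
shortcut = dot(l2_normalize(a), l2_normalize(b))  # no division needed at query time
```

Both values agree up to floating-point rounding, which is why the normalization cost is paid once at embedding time.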
06
LOAD ChromaDB 0.5.3 · _stable_chunk_id() · batch 256
SHA-256 hash of "{source}::p{page}::c{chunk_index}" truncated to 16 hex chars. Same document always produces same IDs. collection.upsert() enforces idempotency across reruns.
07
QUERY Streamlit 1.36.0 · CLI · cosine similarity
collection.query() with ChromaDB cosine distance. Results include documents, metadata (source, page, chunk), and distances, converted to similarity = 1.0 - distance. Reported latency is under 2 seconds per query; actual latency depends on corpus size and hardware.
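The distance-to-similarity conversion can be sketched against the nested-list shape that collection.query() returns (one inner list per query text). The sample result below is hand-written for illustration, and to_hits() plus the metadata key names are assumptions based on the provenance fields listed above.

```python
def to_hits(result: dict) -> list[dict]:
    """Flatten the first query's results and convert cosine distance to similarity."""
    rows = zip(result["documents"][0], result["metadatas"][0], result["distances"][0])
    return [
        {
            "text": document,
            "source": metadata["source"],
            "page_number": metadata["page_number"],
            "chunk_index": metadata["chunk_index"],
            "similarity": 1.0 - distance,
        }
        for document, metadata, distance in rows
    ]


# Hand-written sample in the shape of a single-query result (not real output).
sample_result = {
    "ids": [["id-1", "id-2"]],
    "documents": [["first chunk text", "second chunk text"]],
    "metadatas": [[
        {"source": "paper.pdf", "page_number": 4, "chunk_index": 0},
        {"source": "paper.pdf", "page_number": 9, "chunk_index": 2},
    ]],
    "distances": [[0.25, 0.5]],
}

hits = to_hits(sample_result)
```

Results arrive ordered by ascending distance, so the flattened list is already ranked from most to least similar.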
Apache Airflow DAG
2.9.3 · contextflow_ingest · TaskFlow API
validate_inputs, extract, transform, load, notify. max_active_runs=1, retries=2. XCom carries file path only to avoid size limits.
SQLite Audit Trail
STDLIB · data/processed/runs.db · RunTimer context
One RunRecord per run: source, status, pages, chunks, duration, error. Written on context exit regardless of success. Zero dependencies.
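A minimal sketch of the audit-trail pattern using only the standard library. The RunRecord fields follow the list above; the table schema, status strings, and constructor signature are assumptions.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    source: str
    status: str = "running"
    pages: int = 0
    chunks: int = 0
    duration: float = 0.0
    error: Optional[str] = None


class RunTimer:
    """Context manager that writes one RunRecord on exit, on success or failure."""

    def __init__(self, db_path: str, source: str) -> None:
        self.db_path = db_path
        self.record = RunRecord(source=source)

    def __enter__(self) -> RunRecord:
        self._start = time.perf_counter()
        return self.record

    def __exit__(self, exc_type, exc, tb) -> bool:
        r = self.record
        r.duration = time.perf_counter() - self._start
        r.status = "failed" if exc_type else "success"
        r.error = repr(exc) if exc else None
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:  # commits the transaction on success
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS runs (source TEXT, status TEXT, "
                    "pages INTEGER, chunks INTEGER, duration REAL, error TEXT)"
                )
                conn.execute(
                    "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?)",
                    (r.source, r.status, r.pages, r.chunks, r.duration, r.error),
                )
        finally:
            conn.close()
        return False  # never swallow the pipeline's own exception
```

Inside the with block the pipeline updates pages and chunks on the yielded record; the row is written even when the body raises.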
// 03 · IDEMPOTENCY

DETERMINISTIC
CHUNK IDS

The same document always produces the same chunk IDs. Re-running the pipeline performs an upsert, never a duplicate. This is the mechanism that makes ContextFlow safe to re-trigger without consequence.

// src/load.py · _stable_chunk_id() IDEMPOTENT
import hashlib


def _stable_chunk_id(source: str, page: int, chunk_index: int) -> str:
    # Same inputs always produce the same ID.
    # Re-running on the same document: upsert, never a duplicate.
    raw = f"{source}::p{page}::c{chunk_index}"
    # Keep the first 16 hex chars (64 bits) of the SHA-256 digest.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# example result shape: "a3f292b1c0e4d7f8"
# deterministic across machines, restarts, and reruns
// 04 · TECH STACK

BUILT WITH
INTENTION

pypdf
4.3.1
PDF extraction via generator. Handles multilingual encoding edge cases and mis-decoded IPA glyphs via latin-1 to utf-8 re-encode.
LangChain Splitters
0.2.4
RecursiveCharacterTextSplitter with hierarchical separators. Matches academic PDF structure better than naive fixed-character splitting.
sentence-transformers
3.0.1
all-MiniLM-L6-v2. 384-dim, L2-normalized. Zero API cost. Runs fully local, so no data leaves the machine. Trained mainly on English text; verify retrieval quality on non-English passages.
ChromaDB
0.5.3
Local-first vector store. PersistentClient in dev, HttpClient in Docker. Cosine similarity native. Upsert semantics enforce idempotency.
Apache Airflow
2.9.3
DAG orchestration via TaskFlow API. Retry logic, XCom, manual and scheduled triggers, run history from a single config.
Streamlit
1.36.0
RAG dashboard: file uploader, ingestion trigger, semantic search with similarity scores, run history. Session state for embedder singleton.
Pydantic Settings
2.3.4
Four nested config classes, one per prefix: EMBED_, CHUNK_, CHROMA_, PIPELINE_. Env-var overrides and .env file support built in.
structlog
24.2.0
JSON output in production inside Docker. Coloured console in dev. Consistent key=value format across all pipeline stages.
Docker Compose
5 SERVICES
ChromaDB, Streamlit, Airflow init, Airflow scheduler, Airflow webserver. Shared volumes, service wiring, reproducible environments.
// 05 · OUTCOMES

BY THE
NUMBERS

60 TESTS · CI GREEN EVERY PUSH
384 VECTOR DIMENSIONS PER CHUNK
<2s QUERY LATENCY · ANY CORPUS
512c CHUNK SIZE · 64 CHAR OVERLAP
5 DOCKER SERVICES · FULL STACK
$0 API COST · 100% LOCAL
// 06 · IN PRODUCTION

THE SYSTEM
AT WORK

// STREAMLIT RAG DASHBOARD · SEMANTIC SEARCH · RUN HISTORY
// DATA NORMALIZATION · UNICODE REPAIR · IPA RECOVERY · HYPHENATION JOIN
// 07 · SOURCE CODE

OPEN
SOURCE

$ git clone https://github.com/frogwebp/contextflow
$ pip install -r requirements.txt
$ python -m src.pipeline ingest ./data/raw/
$ python -m src.pipeline query "your question" --top 5
# 60/60 tests · ChromaDB · Airflow · Streamlit · Docker

CONTEXTFLOW · MULTILINGUAL PDF TO SEMANTIC VECTOR ETL · @FROGWEBP