Chunking With Provenance: 512 Chars, 64 Overlap, Full Lineage

Embeddings work on chunks, not whole documents, so how you split the text directly shapes what the search can find. ContextFlow chunks with a structure-aware splitter and stamps every chunk with where it came from, so results are both accurate and citable.

// 01 — STRUCTURE-AWARE SPLITTING

A naive splitter cuts every N characters, slicing through sentences and entity names. ContextFlow uses LangChain's RecursiveCharacterTextSplitter with a hierarchy of separators. It tries each separator in order and breaks on the most natural boundary that keeps a chunk within the size limit:

separators = ["\n\n", "\n", ". ", "! ", "? ", " ", ""]
chunk_size = 512    # characters; ~2–3 academic sentences
chunk_overlap = 64  # characters; 12.5% of chunk_size

It prefers paragraph breaks, then line breaks, then sentence ends. It falls back to spaces, and finally to a hard character cut (the empty-string separator), only when it must. Wherever the text allows, chunks land on meaningful boundaries instead of mid-thought.
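To make the separator hierarchy concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not LangChain's implementation: the names split_recursive and hard_cut are hypothetical, it applies no overlap, and it drops the separator at chunk boundaries.

```python
from typing import List, Optional, Sequence

SEPARATORS = ["\n\n", "\n", ". ", "! ", "? ", " ", ""]


def hard_cut(text: str, chunk_size: int) -> List[str]:
    """Last resort: cut every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_recursive(text: str, chunk_size: int = 512,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split on the coarsest separator present, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Pick the first (coarsest) non-empty separator that occurs in the text.
    sep_index: Optional[int] = next(
        (i for i, s in enumerate(separators) if s and s in text), None
    )
    if sep_index is None:
        return hard_cut(text, chunk_size)
    sep = separators[sep_index]
    finer = separators[sep_index + 1:]

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Greedily merge pieces while the merged chunk still fits.
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]
```

Two 300-character paragraphs joined by a blank line come back as two chunks split at the paragraph break, while a 1,000-character run with no separators at all is cut at character 512.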

// 02 — WHY OVERLAP

Each chunk starts with up to 64 characters carried over from the end of the previous one. That overlap preserves cross-boundary context: a named entity or short clause that straddles a split, and fits within the overlap, still appears whole in at least one of the two chunks. Without overlap, a concept that falls on a boundary is cut in two, and neither half may embed well enough to be found.
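The effect of overlap is easiest to see with a plain sliding window. The sketch below is a deliberate simplification: window_chunks is a hypothetical helper that cuts at fixed offsets, whereas a structure-aware splitter typically assembles its overlap from whole separator-delimited pieces, so the real overlap can be shorter than 64 characters. The filler text and the entity are made up for the demonstration.

```python
from typing import List


def window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Fixed-size windows where each window re-reads `overlap` characters."""
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


ENTITY = "Ada Lovelace"  # 12 characters, shorter than the 64-character overlap
# Place the entity so that it straddles the character-512 boundary.
text = "-" * 506 + ENTITY + "-" * 500

no_overlap = window_chunks(text, chunk_size=512, overlap=0)
with_overlap = window_chunks(text, chunk_size=512, overlap=64)

# Without overlap the entity is cut into "Ada Lo" and "velace".
found_without = any(ENTITY in c for c in no_overlap)
# With overlap the second window starts at character 448 and holds it whole.
found_with = any(ENTITY in c for c in with_overlap)
```

In this fixed-window model, any span no longer than the overlap is guaranteed to sit whole inside at least one window. Longer spans can still be cut.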

// 03 — PROVENANCE ON EVERY CHUNK

Every chunk carries metadata: source, page_number, chunk_index, char_count. This is what lets a search result say “page 14 of paper.pdf” instead of just handing back text. Provenance is what makes retrieval trustworthy: you can follow any answer back to the exact page it came from and verify it.
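One way to carry those four fields is a small record attached to each chunk as pages are split. This is a sketch under stated assumptions, not ContextFlow's actual schema: the Chunk class, chunk_pages, and fixed_split are hypothetical names, page numbers are assumed to be 1-based, and chunk_index is assumed to run across the whole document rather than restarting on each page.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str        # file the chunk came from, e.g. "paper.pdf"
    page_number: int   # assumed 1-based
    chunk_index: int   # assumed to count across the whole document
    char_count: int

    def citation(self) -> str:
        return f"page {self.page_number} of {self.source}"


def chunk_pages(source: str, pages: Sequence[str],
                split: Callable[[str], List[str]]) -> List[Chunk]:
    """Split each page separately so every chunk maps to exactly one page."""
    chunks: List[Chunk] = []
    for page_number, page_text in enumerate(pages, start=1):
        for piece in split(page_text):
            chunks.append(Chunk(
                text=piece,
                source=source,
                page_number=page_number,
                chunk_index=len(chunks),
                char_count=len(piece),
            ))
    return chunks


def fixed_split(text: str) -> List[str]:
    """Stand-in splitter for the example; any str -> list[str] splitter works."""
    return [text[i:i + 512] for i in range(0, len(text), 512)]


chunks = chunk_pages("paper.pdf", ["a" * 600, "b" * 100], fixed_split)
last_citation = chunks[-1].citation()
```

Splitting page by page keeps the page_number unambiguous. The trade-off is that no chunk spans a page break, so a sentence that continues onto the next page is divided at that break.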

TAKEAWAYS

Chunk on natural boundaries, not fixed offsets. A separator hierarchy keeps sentences and entities intact, which keeps embeddings meaningful.
Overlap (~10–15%) saves the concepts that fall on a split. It’s cheap insurance against boundary blindness.
Carry provenance through every chunk. A retrieval system you can’t cite is a retrieval system you can’t trust.

NEXT

Build log 04: 384 dimensions, zero API cost: local embeddings.
