
Cleaning Multilingual PDFs: Ligatures, IPA, and Broken Unicode

Six transformations, in an order that matters, to make academic text searchable.

A PDF is a layout format, not a text format. Pulling clean text out of one, especially a LaTeX-compiled linguistics paper full of IPA and accents, is most of the battle in a retrieval pipeline. Garbage text produces garbage embeddings, and the search is only as good as the text underneath it.

// 01 — RECOVERING MIS-DECODED CHARACTERS

pypdf sometimes hands back IPA symbols and accented letters that were mis-decoded at the byte level: UTF-8 bytes read as if they were latin-1. _decode_page_text() recovers them with a latin-1 → UTF-8 round trip, encoding the string back to its original bytes as latin-1 and then decoding those bytes as UTF-8, so ʃ, é, and friends come back intact instead of as mojibake.
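A minimal sketch of that round trip, assuming the damage is the classic "UTF-8 bytes read as latin-1" kind. The helper name recover_mojibake is hypothetical and this is not the actual body of _decode_page_text(); it only illustrates the technique, and it falls back to the input when the text does not fit the pattern.

```python
def recover_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes decoded as latin-1' damage (hypothetical helper)."""
    try:
        # latin-1 maps code points 0-255 one-to-one onto bytes, so encoding
        # gives back the original byte sequence, which is then decoded as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters outside latin-1, or bytes that are not valid UTF-8,
        # mean the text was not damaged this way; leave it untouched.
        return text


# Build a damaged string on purpose, then repair it.
original = "caf\u00e9 \u0283"  # e-acute and the IPA esh
broken = original.encode("utf-8").decode("latin-1")
print(recover_mojibake(broken) == original)  # True
```

The fallback is what makes it safe to run on every page: text that was decoded correctly in the first place raises on one of the two steps and passes through unchanged.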

// 02 — THE CLEANING ORDER

clean_text() applies six transformations, and the order is load-bearing. Each step assumes the previous one ran:

  1. Expand ligatures: ﬁ → fi, ﬂ → fl, ﬀ/ﬃ/ﬄ → ff/ffi/ffl, ﬆ → st. PDF typesetting artifacts.
  2. NFC normalize: collapse combining diacritics to precomposed form, so é (e + ◌́) equals é (single codepoint). Without this, two visually identical strings embed differently.
  3. Strip Unicode spaces: remove non-breaking and zero-width spaces PDF renderers inject.
  4. Join hyphenated line-breaks: seman-\ntic → semantic, the LaTeX line-wrap artifact.
  5. Collapse excess newlines: 3+ newlines → 2, preserving paragraph breaks.
  6. Strip trailing whitespace per line: column layouts pad lines to page width with spaces.
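The six steps above can be sketched as a single function. This is an illustration under assumptions, not the real clean_text(): the ligature table, the exact set of invisible characters, and the regexes are my guesses, and I assume non-breaking spaces become ordinary spaces (so words stay separated) while zero-width characters are dropped.

```python
import re
import unicodedata

# Assumed ligature table: Unicode presentation forms U+FB00..U+FB04 and U+FB06.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

# Assumed set: zero-width space, word joiner, BOM / zero-width no-break space.
ZERO_WIDTH = re.compile("[\u200b\u2060\ufeff]")


def clean_text(text: str) -> str:
    # 1. Expand ligatures into plain letters.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # 2. NFC: base letter + combining mark becomes one precomposed code point.
    text = unicodedata.normalize("NFC", text)
    # 3. Unicode spaces: NBSP becomes a normal space, zero-width ones vanish.
    text = text.replace("\u00a0", " ")
    text = ZERO_WIDTH.sub("", text)
    # 4. Join words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 5. Collapse runs of 3+ newlines down to a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 6. Strip trailing spaces and tabs on every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text
```

One trade-off in this sketch: step 4 cannot tell a line-wrap hyphen from a real one, so a compound like well-known that happens to break at the hyphen is joined too.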

// 03 — WHY ORDER MATTERS

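Expand ligatures before normalizing, so the expander always sees the raw ligature codepoints it was written for. NFC itself leaves ﬁ and ﬂ alone, but a compatibility form such as NFKC rewrites them, and putting the expander first means a later change of normalization form cannot alter what it matches. Join line-breaks before stripping the zero-width spaces and the hyphen match can miss, because an invisible character sitting between the hyphen and the newline breaks the pattern. Each step is cheap; the sequence is the design. Get it wrong and the corruption is subtle: the text looks fine to a human and embeds wrong.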
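The line-break case is easy to reproduce. A small sketch, assuming a zero-width space ended up between the hyphen and the newline; the two function names and both regexes are hypothetical stand-ins for the corresponding cleaning steps.

```python
import re

ZERO_WIDTH = re.compile("[\u200b\u2060\ufeff]")
HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")


def strip_then_join(text: str) -> str:
    # Intended order: remove invisible characters, then join the break.
    text = ZERO_WIDTH.sub("", text)
    return HYPHEN_BREAK.sub(r"\1\2", text)


def join_then_strip(text: str) -> str:
    # Swapped order: the join runs while the invisible character is still there.
    text = HYPHEN_BREAK.sub(r"\1\2", text)
    return ZERO_WIDTH.sub("", text)


# "seman-" + zero-width space + newline + "tic"
sample = "seman-\u200b\ntic"

print(repr(strip_then_join(sample)))  # 'semantic'
print(repr(join_then_strip(sample)))  # 'seman-\ntic'
```

Both outputs look like ordinary text on screen, which is exactly the problem: the second one still carries the broken word into the embedding.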

ANTHONY PENA · @FROGWEBP
I build data systems and write about everything around them, the architecture, the failures, what each one teaches me. Documenting in public since 2021: the process, not just the result.
