Implement Extractor for MediaType::Pdf. The extractor pulls text page by page and emits one Block::Paragraph per page with SourceSpan::Page. Pages whose text extraction fails or returns nothing get an empty paragraph plus Provenance::Warning, so that a later OCR fallback pipeline can pick them up.
Why now / why this size
Strict scope: page text + page numbers. Layout reconstruction (multi-column merge, table extraction) is intentionally NOT in scope; it is an engineering project of its own. This task gets a usable PDF retrieval surface online with minimal moving parts.
Allowed dependencies
kb-core
kb-config
pdf-extract = "0.7" (or current stable)
lopdf = "0.32" for page metadata (count, optional title from /Info)
serde, serde_json
time
tracing
thiserror
Forbidden dependencies
kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-*, kb-embed*, kb-search, kb-llm*, kb-rag, kb-tui, kb-desktop, OCR libraries (OCR fallback is a separate task, not this one)
Page count obtained via lopdf::Document::load_mem; iterate 1..=n.
For each page:
Try pdf_extract::extract_text_from_mem_by_pages(bytes) (or equivalent) to get a Vec<String> aligned with pages. The crate is named pdf-extract, but its Rust path uses an underscore.
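If extraction returns text for page i: produce Block::Paragraph(TextBlock { common, text, inlines: vec![Inline::Text(text)] }) with common.source_span = SourceSpan::Page { page: i, char_start: Some(0), char_end: Some(text.len() as u32) } and common.heading_path = vec![]. Note that text.len() is the UTF-8 byte length, so char_end is a byte offset rather than a character count, and that text has to be cloned to fill both the text field and the inline.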
If text is empty or extraction errored: produce Block::Paragraph with text: "", Provenance::Warning { note: "page<i> empty (scanned candidate)" }.
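The per-page mapping above can be sketched as follows. This is a minimal illustration, not the kb-core API: every type here is a local stand-in, and the real definitions in kb-core may differ. The function name blocks_from_pages is hypothetical, as are the provenance field, the choice to treat whitespace-only text as empty, and the choice to leave inlines empty on warning pages.

```rust
// Stand-in types for illustration only; the real ones live in kb-core.
#[derive(Debug, Clone, PartialEq)]
pub enum SourceSpan {
    Page { page: u32, char_start: Option<u32>, char_end: Option<u32> },
}

#[derive(Debug, Clone, PartialEq)]
pub enum Provenance {
    Warning { note: String },
}

#[derive(Debug, Clone, PartialEq)]
pub enum Inline {
    Text(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Common {
    pub source_span: SourceSpan,
    pub heading_path: Vec<String>,
    pub provenance: Vec<Provenance>, // assumed location for warnings
}

#[derive(Debug, Clone, PartialEq)]
pub struct TextBlock {
    pub common: Common,
    pub text: String,
    pub inlines: Vec<Inline>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Paragraph(TextBlock),
}

/// Maps per-page extraction results to one paragraph block per page.
/// `pages[i]` holds the result for page `i + 1`; `None` means extraction errored.
pub fn blocks_from_pages(pages: &[Option<String>]) -> Vec<Block> {
    pages
        .iter()
        .enumerate()
        .map(|(idx, extracted)| {
            let page = (idx + 1) as u32; // pages are numbered from 1
            // Assumption: whitespace-only output is treated like empty text.
            let text = match extracted {
                Some(t) if !t.trim().is_empty() => t.clone(),
                _ => String::new(),
            };
            let (provenance, inlines) = if text.is_empty() {
                let note = format!("page{} empty (scanned candidate)", page);
                (vec![Provenance::Warning { note }], Vec::new())
            } else {
                (Vec::new(), vec![Inline::Text(text.clone())])
            };
            Block::Paragraph(TextBlock {
                common: Common {
                    source_span: SourceSpan::Page {
                        page,
                        char_start: Some(0),
                        // Byte length of the UTF-8 text, as in the spec above.
                        char_end: Some(text.len() as u32),
                    },
                    heading_path: Vec::new(),
                    provenance,
                },
                text,
                inlines,
            })
        })
        .collect()
}
```

Because every page yields exactly one block, the output index always equals page number minus one, which keeps the later OCR fallback simple: it only has to look for blocks that carry the warning.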
title precedence: /Info/Title from lopdf (when non-empty) → filename without extension.
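A small sketch of that precedence rule, using only the standard library. The function name resolve_title is hypothetical, and it assumes the /Info/Title value has already been read and decoded into a string by the caller. Trimming the title before the emptiness check is also an assumption.

```rust
use std::path::Path;

/// Resolves the document title: a non-empty /Info/Title wins, otherwise the
/// file name without its extension is used.
pub fn resolve_title(info_title: Option<&str>, path: &Path) -> String {
    // Assumption: a whitespace-only title counts as empty.
    if let Some(title) = info_title.map(str::trim).filter(|t| !t.is_empty()) {
        return title.to_string();
    }
    // Fall back to the file stem; a path with no file name yields "".
    path.file_stem()
        .map(|stem| stem.to_string_lossy().into_owned())
        .unwrap_or_default()
}
```

For example, a file docs/report-2023.pdf with a blank /Info/Title would resolve to the title report-2023.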
lang = Lang("und") (PDFs rarely declare a language; lingua detection over the body could be a future enhancement).
metadata.user["pdf"] = { "page_count": n, "producer": "...", "creator": "..." } from /Info.
Out of scope for now: bookmark / outline ingestion (could become heading_path later; note for P+).
Risks / notes
pdf-extract text quality varies wildly. For broken-glyph PDFs, the text may be Unicode noise; downstream embedding still works but quality is poor. Mark such pages with a confidence-style warning when feasible.
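One possible way to detect such pages is a character-class heuristic, sketched below. This is an illustration and not part of the spec: the function names, the set of characters counted as noise, and the 0.3 threshold are all assumptions that would need tuning against real PDFs.

```rust
/// Hypothetical cut-off above which a page is flagged as garbled.
pub const NOISE_THRESHOLD: f64 = 0.3;

/// Fraction of non-whitespace characters that look like extraction noise:
/// U+FFFD replacement characters, control characters, or code points in the
/// Basic Multilingual Plane Private Use Area (U+E000..=U+F8FF).
pub fn noise_ratio(text: &str) -> f64 {
    let mut total = 0usize;
    let mut noisy = 0usize;
    for ch in text.chars().filter(|c| !c.is_whitespace()) {
        total += 1;
        let private_use = ('\u{E000}'..='\u{F8FF}').contains(&ch);
        if ch == '\u{FFFD}' || ch.is_control() || private_use {
            noisy += 1;
        }
    }
    if total == 0 {
        0.0 // empty pages are handled by the empty-page warning instead
    } else {
        noisy as f64 / total as f64
    }
}

/// Returns true when the page text is noisy enough to deserve a warning.
pub fn looks_garbled(text: &str) -> bool {
    noise_ratio(text) > NOISE_THRESHOLD
}
```

A heuristic like this only catches the obvious cases. Broken font mappings that decode to valid but wrong letters would pass unnoticed, so it should be treated as a hint rather than a quality guarantee.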
Some PDFs have layered text (selectable text + scanned image overlay). v1 captures the selectable text only.
For very large PDFs (> 1k pages), memory usage may spike. Document a soft limit (config.pdf.max_pages, default 5000) and refuse PDFs whose page count exceeds it.
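The limit check could look roughly like this. It is a sketch under assumptions: the names check_page_limit, TooManyPages and DEFAULT_MAX_PAGES are hypothetical, and the error type is written by hand to keep the example free of external crates (the real crate could derive it with thiserror, which is on the allowed list).

```rust
use std::fmt;

/// Default for config.pdf.max_pages, as stated in the spec.
pub const DEFAULT_MAX_PAGES: u32 = 5000;

/// Hypothetical error returned when a PDF exceeds the configured page limit.
#[derive(Debug, PartialEq, Eq)]
pub struct TooManyPages {
    pub page_count: u32,
    pub max_pages: u32,
}

impl fmt::Display for TooManyPages {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "PDF has {} pages, which exceeds the limit of {}",
            self.page_count, self.max_pages
        )
    }
}

impl std::error::Error for TooManyPages {}

/// Refuses documents whose page count is strictly greater than `max_pages`.
/// Meant to run right after the page count is known and before any text
/// extraction starts.
pub fn check_page_limit(page_count: u32, max_pages: u32) -> Result<(), TooManyPages> {
    if page_count > max_pages {
        Err(TooManyPages { page_count, max_pages })
    } else {
        Ok(())
    }
}
```

Running the check before extraction keeps the refusal cheap, although loading the document to count its pages still costs memory.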