---
phase: P7
component: kebab-parse-pdf (text extractor)
task_id: p7-1
title: "Text PDF extractor → CanonicalDocument with page-level blocks"
status: completed
depends_on: [p0-1, p1-6]
unblocks: [p7-2]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 SourceSpan::Page, §3.4 Block::Paragraph, §9.2 PDF text extraction, §9 versioning]
---

# p7-1: PDF text extractor

## Goal

Implement `Extractor` for `MediaType::Pdf`. It extracts text page by page and emits one `Block::Paragraph` per page with `SourceSpan::Page`. Pages whose text extraction fails get an empty paragraph plus a `Provenance::Warning`, so that an OCR fallback pipeline can pick them up later.

## Why now / why this size

Strict scope: page text + page numbers. Layout reconstruction (multi-column merge, table extraction) is intentionally NOT in scope; it is its own engineering project. This task gets a usable PDF retrieval surface online with minimal moving parts.

## Allowed dependencies

- `kebab-core`
- `kebab-config`
- `pdf-extract = "0.7"` (or current stable)
- `lopdf = "0.32"` for page metadata (count, optional title from /Info)
- `serde`, `serde_json`
- `time`
- `tracing`
- `thiserror`

## Forbidden dependencies

- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`, OCR libraries (OCR fallback is a separate task, not this one)

## Inputs

| input | type | source |
|-------|------|--------|
| `RawAsset` | `kebab_core::RawAsset` | `kebab-source-fs` |
| PDF bytes | `&[u8]` | filesystem |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `CanonicalDocument` | `kebab_core::CanonicalDocument` | `kebab-chunk` (`pdf-page-v1` chunker in p7-2) |

## Public surface (signatures only, no new types)

```rust
pub struct PdfTextExtractor;

impl kebab_core::Extractor for PdfTextExtractor {
    fn supports(&self, m: &kebab_core::MediaType) -> bool {
        matches!(m, kebab_core::MediaType::Pdf)
    }
    fn parser_version(&self) -> kebab_core::ParserVersion {
        kebab_core::ParserVersion("pdf-text-v1".into())
    }
    fn extract(&self, ctx: &kebab_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kebab_core::CanonicalDocument>;
}
```

## Behavior contract

- `pdf-extract` (0.7+) does NOT expose a per-page Rust API. Its public surface is `pdf_extract::extract_text(path)` and `pdf_extract::extract_text_from_mem(bytes)`; both return a single `String` for the whole document. Per-page text MUST therefore be obtained by loading the document with `lopdf` and iterating its page objects directly:
  1. Load via `lopdf::Document::load_mem(bytes)`.
  2. `doc.get_pages()` → `BTreeMap<u32, ObjectId>` (1-based page numbers).
  3. For each `(page_num, page_id)`: call `doc.extract_text(&[page_num])` (lopdf's per-page text extraction), wrapped in `catch_unwind` to absorb the rare crash on malformed pages.
  4. Treat the returned text as `text` for that page. Empty result OR `Err` → fall through to the "scanned candidate" branch.
- For each page (1-based `i` from above):
  - On success: produce `Block::Paragraph(TextBlock { common, text, inlines: vec![Inline::Text(text)] })` with `common.source_span = SourceSpan::Page { page: i, char_start: Some(0), char_end: Some(text.chars().count() as u32) }` (NOTE: char count, not byte len, so spans match `Citation::Page` fragment semantics) and `common.heading_path = vec![]`.
  - On empty/error: produce `Block::Paragraph` with `text: ""` and `Provenance::Warning { note: format!("page{} empty (scanned candidate)", i) }`. The warning marks the page as a candidate for the OCR fallback pipeline (out of scope for this task).
- The `pdf-extract` whole-document call (`extract_text_from_mem`) MAY still be used as a sanity check to detect catastrophic decoding failure early, but per-page text is sourced from `lopdf` only.
- `title` precedence: `/Info/Title` from `lopdf` (when non-empty) → filename without extension.
- `lang = Lang("und")` (PDFs rarely declare a language; lingua detection over the body could be a future enhancement).
- `metadata.user["pdf"] = { "page_count": n, "producer": "...", "creator": "..." }` from `/Info`.
- `metadata.source_type = SourceType::Paper`; `trust_level = TrustLevel::Primary`.
- `provenance` events: `Discovered`, `Parsed` (per page text or warning).
- `block_id` per design §4.2 with `block_kind = "paragraph"`, `heading_path = []`, `ordinal = page - 1`, `source_span = SourceSpan::Page { page }`.
- Streaming: parse the PDF bytes in memory only once; do not invoke `pdf-extract` per page (that would re-parse the document N times).
- Failure modes:
  - File not a PDF / corrupt header → `anyhow::Error`.
  - Encrypted PDF → `anyhow::Error` with a hint to remove encryption (no decryption attempt in v1).
- Determinism: identical bytes → identical doc/block IDs and text.

## Storage / wire effects

- None directly.

## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| unit | 3-page PDF produces 3 paragraph blocks with `SourceSpan::Page { page: 1..=3 }` | `fixtures/pdf/three-page-en.pdf` |
| unit | PDF with image-only page 2 (no text) emits warning + empty text for page 2 | `fixtures/pdf/scanned-mixed.pdf` |
| unit | encrypted PDF returns error with helpful hint | `fixtures/pdf/encrypted.pdf` |
| unit | corrupt header PDF returns error | `fixtures/pdf/corrupt.pdf` |
| unit | `metadata.user.pdf.page_count` matches actual count | inline |
| unit | Korean text PDF preserved (CID mapping permitting) | `fixtures/pdf/korean.pdf` |
| determinism | identical bytes → identical CanonicalDocument JSON across two runs | inline |
| snapshot | CanonicalDocument JSON for fixture stable | `fixtures/pdf/three-page-en.pdf` |

All tests under `cargo test -p kebab-parse-pdf`.
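
The span-related cases in the test plan (success, empty page, Korean text) all exercise the per-page rule from the behavior contract. A minimal sketch of that rule follows, using stand-in types rather than the real `kebab_core` ones: `PageSpan`, `PageBlock`, and `page_block` are hypothetical names, and the `0..0` span on an empty page is an assumption that the contract does not pin down.

```rust
/// Stand-in for the design's `SourceSpan::Page` (hypothetical; the real type lives in `kebab_core`).
#[derive(Debug, PartialEq)]
struct PageSpan {
    page: u32,
    char_start: Option<u32>,
    char_end: Option<u32>,
}

/// Hypothetical per-page result: text, span, and an optional warning note.
#[derive(Debug, PartialEq)]
struct PageBlock {
    text: String,
    span: PageSpan,
    warning: Option<String>,
}

/// Build the block for 1-based page `page` from the raw extraction result.
/// An empty string and an `Err` both take the "scanned candidate" branch.
fn page_block(page: u32, extracted: Result<String, String>) -> PageBlock {
    match extracted {
        Ok(text) if !text.is_empty() => {
            // Count Unicode scalar values, not bytes, so multi-byte text
            // (e.g. Korean) yields spans that match character offsets.
            let char_end = text.chars().count() as u32;
            PageBlock {
                text,
                span: PageSpan { page, char_start: Some(0), char_end: Some(char_end) },
                warning: None,
            }
        }
        _ => PageBlock {
            text: String::new(),
            // Assumption: an empty page gets a zero-length span.
            span: PageSpan { page, char_start: Some(0), char_end: Some(0) },
            warning: Some(format!("page{} empty (scanned candidate)", page)),
        },
    }
}
```

For example, `page_block(1, Ok("한국어 PDF".to_string()))` holds 13 bytes of UTF-8 but only 7 characters, so `char_end` is `Some(7)`; using `text.len()` instead would produce spans that overshoot on any non-ASCII page.
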
## Definition of Done

- [ ] `cargo check -p kebab-parse-pdf` passes
- [ ] `cargo test -p kebab-parse-pdf` passes
- [ ] No OCR / LLM code present
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4 SourceSpan::Page, §9.2

## Out of scope

- OCR for scanned PDFs (separate future task; reuses p6-2 OCR adapter).
- Layout reconstruction (multi-column reading order, tables).
- Math rendering / formula detection.
- Form-field extraction.
- Bookmark / outline ingestion (could become `heading_path` later; note for P+).

## Risks / notes

- `pdf-extract` text quality varies wildly. For broken-glyph PDFs, the text may be Unicode noise; downstream embedding still works but quality is poor. Mark such pages with a confidence-style warning when feasible.
- Some PDFs have layered text (selectable text + scanned image overlay). v1 captures the selectable text only.
- For very large PDFs (> 1k pages), memory usage may spike. Document a soft limit (`config.pdf.max_pages`, default 5000) and refuse documents beyond it.
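
The soft page limit from the last note could be checked right after the page count is known and before any per-page extraction starts. The sketch below is only illustrative: `DEFAULT_MAX_PAGES`, `PageLimitError`, and `check_page_limit` are hypothetical names, and the real extractor would presumably read the limit from `config.pdf.max_pages` and surface the failure as an `anyhow::Error`.

```rust
use std::fmt;

/// Assumed default for `config.pdf.max_pages` (value taken from the risk note).
const DEFAULT_MAX_PAGES: usize = 5000;

/// Hypothetical error type describing a document that exceeds the soft limit.
#[derive(Debug, PartialEq)]
struct PageLimitError {
    found: usize,
    max: usize,
}

impl fmt::Display for PageLimitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "PDF has {} pages, which exceeds the configured limit of {}",
            self.found, self.max
        )
    }
}

impl std::error::Error for PageLimitError {}

/// Refuse documents whose page count is strictly greater than `max_pages`.
fn check_page_limit(page_count: usize, max_pages: usize) -> Result<(), PageLimitError> {
    if page_count > max_pages {
        Err(PageLimitError { found: page_count, max: max_pages })
    } else {
        Ok(())
    }
}
```

With these assumptions, a document of exactly 5000 pages passes and one of 5001 pages is rejected with a message that names both the actual count and the limit, which keeps the refusal actionable for whoever tunes the config.
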