kebab/tasks/p7/p7-1-pdf-text-extractor.md
kb b999a12ab5 tasks: address PR #1 review
- p3-3: SQLite-first/Lance-second + status marker (V003__embedding_status); drop "best-effort 2PC" misnomer
- p4-3: replace print_stream FnMut closure with mpsc::Sender<String> (RagPipeline stays Send+Sync)
- p4-3: tighten citation regex to strict [#<n>] only — reject [n]/prose/code-block false positives
- p5-2: compare_runs across chunker_version is graceful (doc + span overlap fallback) with chunker_version_match audit field; --strict-chunker-version restores refusal
- p7-1: per-page text via lopdf (pdf-extract has no per-page Rust API); use char count for spans
- p8-1: explicit rubato (FftFixedIn) for 16 kHz mono resample; symphonia decode only
- p9-5: drop cmd_read_pdf_page + pdfium native dep; cmd_read_file_bytes + frontend pdfjs; add traversal tests
2026-04-27 13:10:31 +00:00

---
phase: P7
component: kb-parse-pdf (text extractor)
task_id: p7-1
title: "Text PDF extractor → CanonicalDocument with page-level blocks"
status: planned
depends_on: [p0-1, p1-6]
unblocks: [p7-2]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.4 SourceSpan::Page, §3.4 Block::Paragraph, §9.2 PDF text extraction, §9 versioning]
---
# p7-1 — PDF text extractor
## Goal
Implement `Extractor` for `MediaType::Pdf`. The extractor reads text page by page and emits one `Block::Paragraph` per page with `SourceSpan::Page`. A page whose text extraction fails or yields nothing gets an empty paragraph plus a `Provenance::Warning`, so it can be picked up later by an OCR fallback pipeline.
## Why now / why this size
Strict scope: page text + page numbers. Layout reconstruction (multi-column merge, table extraction) is intentionally NOT in scope — it's its own engineering project. This task gets a usable PDF retrieval surface online with minimal moving parts.
## Allowed dependencies
- `kb-core`
- `kb-config`
- `pdf-extract = "0.7"` (or current stable), used only for the optional whole-document sanity check
- `lopdf = "0.32"` for per-page text extraction and page metadata (count, optional title from /Info)
- `serde`, `serde_json`
- `time`
- `tracing`
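- `thiserror`
- `anyhow` (the `extract` signature and the failure modes return `anyhow::Result` / `anyhow::Error`)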
## Forbidden dependencies
- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`, OCR libraries (OCR fallback is a separate task, not this one)
## Inputs
| input | type | source |
|-------|------|--------|
| `RawAsset` | `kb_core::RawAsset` | `kb-source-fs` |
| PDF bytes | `&[u8]` | filesystem |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `CanonicalDocument` | `kb_core::CanonicalDocument` | `kb-chunk` (`pdf-page-v1` chunker in p7-2) |
## Public surface (signatures only — no new types)
```rust
pub struct PdfTextExtractor;
impl kb_core::Extractor for PdfTextExtractor {
fn supports(&self, m: &kb_core::MediaType) -> bool { matches!(m, kb_core::MediaType::Pdf) }
fn parser_version(&self) -> kb_core::ParserVersion { kb_core::ParserVersion("pdf-text-v1".into()) }
fn extract(&self, ctx: &kb_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kb_core::CanonicalDocument>;
}
```
## Behavior contract
- `pdf-extract` (0.7+) does NOT expose a per-page Rust API. Its public surface is `pdf_extract::extract_text(path)` and `pdf_extract::extract_text_from_mem(bytes)`; both return a single `String` for the whole document. Per-page text MUST therefore be obtained by loading the document with `lopdf` and iterating its pages directly:
  1. Load via `lopdf::Document::load_mem(bytes)`.
  2. `doc.get_pages()` → `BTreeMap<u32, ObjectId>` (1-based page numbers).
  3. For each `(page_num, page_id)`: call `doc.extract_text(&[page_num])` (lopdf's per-page text extraction), wrapped in `std::panic::catch_unwind` to absorb the rare panic on malformed pages.
  4. Treat the returned string as `text` for that page. Empty result OR `Err` → fall through to the "scanned candidate" branch.
- For each page (1-based `i` from above):
  - On success: produce `Block::Paragraph(TextBlock { common, text, inlines: vec![Inline::Text(text)] })` with `common.source_span = SourceSpan::Page { page: i, char_start: Some(0), char_end: Some(text.chars().count() as u32) }` (NOTE: char count, not byte len, so spans match `Citation::Page` fragment semantics) and `common.heading_path = vec![]`.
  - On empty/error: produce `Block::Paragraph` with `text: ""`, `Provenance::Warning { note: format!("page{} empty (scanned candidate)", i) }`. The warning marks the page as a candidate for the OCR fallback pipeline (out of scope for this task).
- `pdf-extract` whole-document call MAY still be used as a sanity check (`extract_text_from_mem`) to detect catastrophic decoding failure early, but per-page text is sourced from `lopdf` only.
- `title` precedence: `/Info/Title` from `lopdf` (when non-empty) → filename without extension.
- `lang = Lang("und")` (PDFs rarely declare; lingua detection over the body could be a future enhancement).
- `metadata.user["pdf"] = { "page_count": n, "producer": "...", "creator": "..." }` from `/Info`.
- `metadata.source_type = SourceType::Paper`; `trust_level = TrustLevel::Primary`.
- `provenance` events: `Discovered`, `Parsed` (per page text or warning).
- `block_id` per design §4.2 with `block_kind = "paragraph"`, `heading_path = []`, `ordinal = page - 1`, `source_span = SourceSpan::Page { page }`.
- Single parse: load and parse the PDF in memory exactly once; do not invoke `pdf-extract` per page (that would re-parse the whole document N times).
- Failure modes:
  - File not a PDF / corrupt header → `anyhow::Error`.
  - Encrypted PDF → `anyhow::Error` with hint to remove encryption (no decryption attempt in v1).
- Determinism: identical bytes → identical doc/block IDs and text.
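
The per-page loop, the char-count span, and the title fallback described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `PageOutcome`, `extract_pages`, and `resolve_title` are hypothetical names, the page extractor is passed in as a closure so the control flow can be exercised without a real PDF, and treating whitespace-only text (or a whitespace-only title) as empty is an assumption of this sketch. In the real extractor the closure would presumably wrap `doc.extract_text(&[page_num])` on a `lopdf::Document`.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::path::Path;

/// Hypothetical stand-in for the real `Block::Paragraph` plus its provenance.
#[derive(Debug, PartialEq)]
pub enum PageOutcome {
    /// Text was recovered. `char_end` counts Unicode scalar values, not bytes.
    Text { page: u32, text: String, char_end: u32 },
    /// Empty result, `Err`, or panic: the page is a candidate for OCR later.
    ScannedCandidate { page: u32, note: String },
}

/// Calls `extract_page` once per page number and isolates failures per page.
/// With lopdf, the closure could be
/// `|n| doc.extract_text(&[n]).map_err(|e| e.to_string())`.
pub fn extract_pages<F>(page_numbers: &[u32], extract_page: F) -> Vec<PageOutcome>
where
    F: Fn(u32) -> Result<String, String>,
{
    page_numbers
        .iter()
        .map(|&page| {
            // A panic on one malformed page must not abort the whole document.
            let result = panic::catch_unwind(AssertUnwindSafe(|| extract_page(page)));
            match result {
                // Assumption: whitespace-only text counts as empty.
                Ok(Ok(text)) if !text.trim().is_empty() => {
                    let char_end = text.chars().count() as u32;
                    PageOutcome::Text { page, text, char_end }
                }
                _ => PageOutcome::ScannedCandidate {
                    page,
                    note: format!("page{} empty (scanned candidate)", page),
                },
            }
        })
        .collect()
}

/// Title precedence: non-empty `/Info/Title`, else the file name without extension.
pub fn resolve_title(info_title: Option<&str>, file_name: &str) -> String {
    match info_title.map(str::trim) {
        Some(title) if !title.is_empty() => title.to_string(),
        _ => Path::new(file_name)
            .file_stem()
            .and_then(|stem| stem.to_str())
            .unwrap_or(file_name)
            .to_string(),
    }
}
```

The char count matters for non-ASCII text: a page containing `한글 ok` is 9 bytes in UTF-8 but 5 characters, and only the latter lines up with character-based citation fragments.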
## Storage / wire effects
- None directly.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | 3-page PDF produces 3 paragraph blocks with `SourceSpan::Page { page: 1..=3 }` | `fixtures/pdf/three-page-en.pdf` |
| unit | PDF with image-only page 2 (no text) emits warning + empty text for page 2 | `fixtures/pdf/scanned-mixed.pdf` |
| unit | encrypted PDF returns error with helpful hint | `fixtures/pdf/encrypted.pdf` |
| unit | corrupt header PDF returns error | `fixtures/pdf/corrupt.pdf` |
| unit | `metadata.user.pdf.page_count` matches actual count | inline |
| unit | Korean text PDF preserved (CID mapping permitting) | `fixtures/pdf/korean.pdf` |
| determinism | identical bytes → identical CanonicalDocument JSON across two runs | inline |
| snapshot | CanonicalDocument JSON for fixture stable | `fixtures/pdf/three-page-en.pdf` |
All tests under `cargo test -p kb-parse-pdf`.
## Definition of Done
- [ ] `cargo check -p kb-parse-pdf` passes
- [ ] `cargo test -p kb-parse-pdf` passes
- [ ] No OCR / LLM code present
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4 SourceSpan::Page, §9.2
## Out of scope
- OCR for scanned PDFs (separate future task; reuses p6-2 OCR adapter).
- Layout reconstruction (multi-column reading order, tables).
- Math rendering / formula detection.
- Form-field extraction.
- Bookmark / outline ingestion (could become heading_path later — note for P+).
## Risks / notes
- Extracted text quality varies widely between PDFs, whether it comes from `lopdf` or from `pdf-extract`. For broken-glyph PDFs the text may be Unicode noise; downstream embedding still works, but retrieval quality is poor. Mark such pages with a confidence-style warning when feasible.
- Some PDFs have layered text (selectable text + scanned image overlay). v1 captures the selectable text only.
- For very large PDFs (> 1k pages), memory usage may spike. Document a soft limit (`config.pdf.max_pages` default 5000) and refuse beyond it.
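
A minimal sketch of the soft page limit mentioned above, assuming the check runs right after the page count is known and before any per-page extraction. `DEFAULT_MAX_PAGES`, `TooManyPages`, and `check_page_limit` are hypothetical names; only the `config.pdf.max_pages` key and its default of 5000 come from this task.

```rust
use std::fmt;

/// Hypothetical constant mirroring the suggested `config.pdf.max_pages` default.
pub const DEFAULT_MAX_PAGES: usize = 5000;

/// Error returned when a document exceeds the configured soft limit.
#[derive(Debug, PartialEq)]
pub struct TooManyPages {
    pub page_count: usize,
    pub max_pages: usize,
}

impl fmt::Display for TooManyPages {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "PDF has {} pages, exceeding config.pdf.max_pages = {}; refusing to extract",
            self.page_count, self.max_pages
        )
    }
}

impl std::error::Error for TooManyPages {}

/// Accepts documents with at most `max_pages` pages and refuses anything larger.
pub fn check_page_limit(page_count: usize, max_pages: usize) -> Result<(), TooManyPages> {
    if page_count > max_pages {
        Err(TooManyPages { page_count, max_pages })
    } else {
        Ok(())
    }
}
```

Because `TooManyPages` implements `std::error::Error`, it converts into `anyhow::Error` with `?` inside `extract`, so the limit surfaces through the same error path as the other failure modes.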