kebab/tasks/p7/p7-1-pdf-text-extractor.md
altair823 5a158d7343 feat(kebab-parse-pdf): P7-1 text PDF extractor — per-page CanonicalDocument
`PdfTextExtractor` (`MediaType::Pdf`): lopdf-based per-page text extraction.
Emits one `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }`
per page. Pages whose body is empty or whose extraction panics are marked with an
empty paragraph + `Provenance::Warning` ("scanned candidate"); these become the
input to a later OCR fallback (separate task).

Key behavior:
- `lopdf::Document::load_mem` + `is_encrypted()` → encrypted PDFs fail with an
  explicit error (pointing to `qpdf --decrypt`).
- Per-page `extract_text(&[page])` is wrapped in `catch_unwind`, turning
  malformed-page panics into recoverable warnings.
- Best-effort extraction of Title/Producer/Creator from the `/Info` dict. Strings
  prefixed with a UTF-16BE BOM are also decoded (non-ASCII titles such as Korean
  are handled correctly).
- 9 integration tests: 3-page emit, scanned-mixed warning, encrypted refuse,
  corrupt header error, page_count metadata, UTF-16BE Title, filename
  fallback, determinism, snapshot.

`parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7`
(as in the original spec). Multilingual OCR fallback for body text is a follow-up
task under §9.2 (out of scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 08:34:55 +00:00


---
phase: P7
component: kebab-parse-pdf (text extractor)
task_id: p7-1
title: "Text PDF extractor → CanonicalDocument with page-level blocks"
status: completed
depends_on: [p0-1, p1-6]
unblocks: [p7-2]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 SourceSpan::Page, §3.4 Block::Paragraph, §9.2 PDF text extraction, §9 versioning]
---
# p7-1 — PDF text extractor
## Goal
Implement `Extractor` for `MediaType::Pdf`. Extracts text page-by-page, emits one `Block::Paragraph` per page with `SourceSpan::Page`. Failed-text pages get an empty paragraph + `Provenance::Warning` so they can be picked up later by an OCR fallback pipeline.
## Why now / why this size
Strict scope: page text + page numbers. Layout reconstruction (multi-column merge, table extraction) is intentionally NOT in scope — it's its own engineering project. This task gets a usable PDF retrieval surface online with minimal moving parts.
## Allowed dependencies
- `kebab-core`
- `kebab-config`
- `pdf-extract = "0.7"` (or current stable)
- `lopdf = "0.32"` for page metadata (count, optional title from /Info)
- `serde`, `serde_json`
- `time`
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`, OCR libraries (OCR fallback is a separate task, not this one)
## Inputs
| input | type | source |
|-------|------|--------|
| `RawAsset` | `kebab_core::RawAsset` | `kebab-source-fs` |
| PDF bytes | `&[u8]` | filesystem |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `CanonicalDocument` | `kebab_core::CanonicalDocument` | `kebab-chunk` (`pdf-page-v1` chunker in p7-2) |
## Public surface (signatures only — no new types)
```rust
pub struct PdfTextExtractor;
impl kebab_core::Extractor for PdfTextExtractor {
    fn supports(&self, m: &kebab_core::MediaType) -> bool { matches!(m, kebab_core::MediaType::Pdf) }
    fn parser_version(&self) -> kebab_core::ParserVersion { kebab_core::ParserVersion("pdf-text-v1".into()) }
    fn extract(&self, ctx: &kebab_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kebab_core::CanonicalDocument>;
}
```
## Behavior contract
- `pdf-extract` (0.7+) does NOT expose a per-page Rust API. Its public surface is `pdf_extract::extract_text(path)` and `pdf_extract::extract_text_from_mem(bytes)` — both return a single `String` for the whole document. Per-page text MUST therefore be obtained by iterating `lopdf::Document::load_mem(bytes)` page objects directly:
  1. Load via `lopdf::Document::load_mem(bytes)`.
  2. `doc.get_pages()` → `BTreeMap<u32, ObjectId>` (1-based page numbers).
  3. For each `(page_num, page_id)`: call `doc.extract_text(&[page_num])` (lopdf's per-page text extraction), wrap with `catch_unwind` to absorb the rare crash on malformed pages.
  4. Treat returned text as `text` for that page. Empty result OR Err → fall through to "scanned candidate" branch.
- For each page (1-based `i` from above):
  - On success: produce `Block::Paragraph(TextBlock { common, text, inlines: vec![Inline::Text(text)] })` with `common.source_span = SourceSpan::Page { page: i, char_start: Some(0), char_end: Some(text.chars().count() as u32) }` (NOTE: char count, not byte len, so spans match `Citation::Page` fragment semantics) and `common.heading_path = vec![]`.
  - On empty/error: produce `Block::Paragraph` with `text: ""`, `Provenance::Warning { note: format!("page{} empty (scanned candidate)", i) }`. The warning marks the page as a candidate for the OCR fallback pipeline (out of scope for this task).
- `pdf-extract` whole-document call MAY still be used as a sanity check (`extract_text_from_mem`) to detect catastrophic decoding failure early, but per-page text is sourced from `lopdf` only.
- `title` precedence: `/Info/Title` from `lopdf` (when non-empty) → filename without extension.
- `lang = Lang("und")` (PDFs rarely declare a language; lingua detection over the body could be a future enhancement).
- `metadata.user["pdf"] = { "page_count": n, "producer": "...", "creator": "..." }` from `/Info`.
- `metadata.source_type = SourceType::Paper`; `trust_level = TrustLevel::Primary`.
- `provenance` events: `Discovered`, `Parsed` (per page text or warning).
- `block_id` per design §4.2 with `block_kind = "paragraph"`, `heading_path = []`, `ordinal = page - 1`, `source_span = SourceSpan::Page { page }`.
- Streaming: parse the PDF in memory only once; do not invoke `pdf-extract` per page (that would re-parse the document N times).
- Failure modes:
  - File not a PDF / corrupt header → `anyhow::Error`.
  - Encrypted PDF → `anyhow::Error` with hint to remove encryption (no decryption attempt in v1).
- Determinism: identical bytes → identical doc/block IDs and text.
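
The per-page branch logic above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `PageBlock` and `extract_pages` are hypothetical names, the `extract_page` closure stands in for `doc.extract_text(&[page_num])`, and treating whitespace-only text as empty is an assumption the contract does not state.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical, simplified stand-in for one emitted page paragraph.
#[derive(Debug, PartialEq)]
struct PageBlock {
    page: u32,
    text: String,
    /// Character count (not byte length) of `text`.
    char_end: u32,
    /// Set when the page is a scanned candidate.
    warning: Option<String>,
}

/// Calls `extract_page` once per page number. A panic, an `Err`, or empty
/// text all collapse into the same "scanned candidate" warning block.
fn extract_pages<F>(page_nums: &[u32], extract_page: F) -> Vec<PageBlock>
where
    F: Fn(u32) -> Result<String, String>,
{
    page_nums
        .iter()
        .map(|&page| {
            // catch_unwind turns a panic inside the extractor into an Err value.
            let outcome = catch_unwind(AssertUnwindSafe(|| extract_page(page)));
            match outcome {
                // Assumption: whitespace-only text counts as empty.
                Ok(Ok(text)) if !text.trim().is_empty() => {
                    let char_end = text.chars().count() as u32;
                    PageBlock { page, text, char_end, warning: None }
                }
                _ => PageBlock {
                    page,
                    text: String::new(),
                    char_end: 0,
                    warning: Some(format!("page{} empty (scanned candidate)", page)),
                },
            }
        })
        .collect()
}
```

Passing the extractor as a closure keeps the sketch independent of `lopdf`; in the real extractor the page numbers would come from the keys of `doc.get_pages()`, which iterate in ascending order because the map is a `BTreeMap`.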
## Storage / wire effects
- None directly.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | 3-page PDF produces 3 paragraph blocks with `SourceSpan::Page { page: 1..=3 }` | `fixtures/pdf/three-page-en.pdf` |
| unit | PDF with image-only page 2 (no text) emits warning + empty text for page 2 | `fixtures/pdf/scanned-mixed.pdf` |
| unit | encrypted PDF returns error with helpful hint | `fixtures/pdf/encrypted.pdf` |
| unit | corrupt header PDF returns error | `fixtures/pdf/corrupt.pdf` |
| unit | `metadata.user.pdf.page_count` matches actual count | inline |
| unit | Korean text PDF preserved (CID mapping permitting) | `fixtures/pdf/korean.pdf` |
| determinism | identical bytes → identical CanonicalDocument JSON across two runs | inline |
| snapshot | CanonicalDocument JSON for fixture stable | `fixtures/pdf/three-page-en.pdf` |
All tests under `cargo test -p kebab-parse-pdf`.
## Definition of Done
- [ ] `cargo check -p kebab-parse-pdf` passes
- [ ] `cargo test -p kebab-parse-pdf` passes
- [ ] No OCR / LLM code present
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4 SourceSpan::Page, §9.2
## Out of scope
- OCR for scanned PDFs (separate future task; reuses p6-2 OCR adapter).
- Layout reconstruction (multi-column reading order, tables).
- Math rendering / formula detection.
- Form-field extraction.
- Bookmark / outline ingestion (could become heading_path later — note for P+).
## Risks / notes
- `pdf-extract` text quality varies wildly. For broken-glyph PDFs, the text may be Unicode noise; downstream embedding still works but quality is poor. Mark such pages with a confidence-style warning when feasible.
- Some PDFs have layered text (selectable text + scanned image overlay). v1 captures the selectable text only.
- For very large PDFs (> 1k pages), memory usage may spike. Document a soft limit (`config.pdf.max_pages` default 5000) and refuse beyond it.
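
The soft page limit mentioned above could be enforced with a check along these lines before any per-page work starts. This is a sketch under assumptions: `check_page_limit` and `DEFAULT_MAX_PAGES` are hypothetical names, and the error is a plain `String` rather than whatever error type the crate actually uses.

```rust
/// Hypothetical default mirroring `config.pdf.max_pages`.
const DEFAULT_MAX_PAGES: usize = 5000;

/// Refuses documents whose page count exceeds the configured soft limit.
/// A count exactly equal to the limit is still accepted.
fn check_page_limit(page_count: usize, max_pages: usize) -> Result<(), String> {
    if page_count > max_pages {
        Err(format!(
            "PDF has {page_count} pages, exceeding pdf.max_pages = {max_pages}"
        ))
    } else {
        Ok(())
    }
}
```

Running the check right after the page map is obtained would keep the refusal cheap, since no page text has been extracted at that point.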