`PdfTextExtractor` (`MediaType::Pdf`): lopdf-based per-page text extraction.
Emits one `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }`
per page. A page whose body is empty or whose extraction panics is emitted as an
empty paragraph with `Provenance::Warning` ("scanned candidate"); such pages are
the input for a later OCR fallback (separate task).
Key behavior:
- `lopdf::Document::load_mem` + `is_encrypted()` → encrypted PDFs are rejected
with an explicit error (pointing the user to `qpdf --decrypt`).
- Per-page `extract_text(&[page])` is wrapped in `catch_unwind`, turning a
malformed-page panic into a recoverable warning.
- Best-effort extraction of Title/Producer/Creator from the `/Info` dict. Strings
prefixed with a UTF-16BE BOM are decoded too (non-ASCII titles such as Korean
are handled correctly).
- 9 integration tests: 3-page emit, scanned-mixed warning, encrypted refuse,
corrupt header error, page_count metadata, UTF-16BE Title, filename
fallback, determinism, snapshot.
`parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7`
(unchanged from the original spec). Multilingual OCR fallback for body text is a
§9.2 follow-up task (out of scope).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
phase: P7
component: kebab-parse-pdf (text extractor)
task_id: p7-1
title: "Text PDF extractor → CanonicalDocument with page-level blocks"
status: completed
depends_on: [p0-1, p1-6]
unblocks: [p7-2]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 SourceSpan::Page, §3.4 Block::Paragraph, §9.2 PDF text extraction, §9 versioning]
---

# p7-1: PDF text extractor

## Goal

Implement `Extractor` for `MediaType::Pdf`. The extractor reads text page by page and emits one `Block::Paragraph` per page, each carrying a `SourceSpan::Page`. Pages whose text extraction fails or yields nothing get an empty paragraph plus a `Provenance::Warning`, so that a later OCR fallback pipeline can pick them up.

## Why now / why this size

Strict scope: page text + page numbers. Layout reconstruction (multi-column merge, table extraction) is intentionally NOT in scope; it is its own engineering project. This task gets a usable PDF retrieval surface online with minimal moving parts.

## Allowed dependencies

- `kebab-core`
- `kebab-config`
- `pdf-extract = "0.7"` (or current stable)
- `lopdf = "0.32"` for per-page text extraction and page metadata (count, optional title from `/Info`)
- `serde`, `serde_json`
- `time`
- `tracing`
- `thiserror`

## Forbidden dependencies

- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`, OCR libraries (OCR fallback is a separate task, not this one)

## Inputs

| input | type | source |
|-------|------|--------|
| `RawAsset` | `kebab_core::RawAsset` | `kebab-source-fs` |
| PDF bytes | `&[u8]` | filesystem |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `CanonicalDocument` | `kebab_core::CanonicalDocument` | `kebab-chunk` (`pdf-page-v1` chunker in p7-2) |

## Public surface (signatures only, no new types)

```rust
pub struct PdfTextExtractor;

impl kebab_core::Extractor for PdfTextExtractor {
    fn supports(&self, m: &kebab_core::MediaType) -> bool { matches!(m, kebab_core::MediaType::Pdf) }
    fn parser_version(&self) -> kebab_core::ParserVersion { kebab_core::ParserVersion("pdf-text-v1".into()) }
    fn extract(&self, ctx: &kebab_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kebab_core::CanonicalDocument>;
}
```

## Behavior contract

- `pdf-extract` (0.7+) does NOT expose a per-page Rust API. Its public surface is `pdf_extract::extract_text(path)` and `pdf_extract::extract_text_from_mem(bytes)`, and both return a single `String` for the whole document. Per-page text MUST therefore be obtained by iterating the page objects of `lopdf::Document::load_mem(bytes)` directly:
  1. Load via `lopdf::Document::load_mem(bytes)`.
  2. `doc.get_pages()` → `BTreeMap<u32, ObjectId>` (1-based page numbers).
  3. For each `(page_num, page_id)`: call `doc.extract_text(&[page_num])` (lopdf's per-page text extraction), wrapped in `catch_unwind` to absorb the rare panic on malformed pages.
  4. Treat the returned text as `text` for that page. An empty result, an `Err`, or a caught panic falls through to the "scanned candidate" branch.
- For each page (1-based `i` from above):
  - On success: produce `Block::Paragraph(TextBlock { common, text, inlines: vec![Inline::Text(text)] })` with `common.source_span = SourceSpan::Page { page: i, char_start: Some(0), char_end: Some(text.chars().count() as u32) }` (NOTE: char count, not byte len, so spans match `Citation::Page` fragment semantics) and `common.heading_path = vec![]`.
  - On empty/error: produce `Block::Paragraph` with `text: ""`, `Provenance::Warning { note: format!("page{} empty (scanned candidate)", i) }`. The warning marks the page as a candidate for the OCR fallback pipeline (out of scope for this task).
- The `pdf-extract` whole-document call (`extract_text_from_mem`) MAY still be used as a sanity check to detect catastrophic decoding failure early, but per-page text is sourced from `lopdf` only.
- `title` precedence: `/Info/Title` from `lopdf` (when non-empty) → filename without extension.
- `lang = Lang("und")` (PDFs rarely declare a language; lingua detection over the body could be a future enhancement).
- `metadata.user["pdf"] = { "page_count": n, "producer": "...", "creator": "..." }` from `/Info`.
- `metadata.source_type = SourceType::Paper`; `trust_level = TrustLevel::Primary`.
- `provenance` events: `Discovered`, `Parsed` (per page text or warning).
- `block_id` per design §4.2 with `block_kind = "paragraph"`, `heading_path = []`, `ordinal = page - 1`, `source_span = SourceSpan::Page { page }`.
- Streaming: parse the PDF bytes in memory only once; do not invoke `pdf-extract` per page (that would re-parse the document N times).
- Failure modes:
  - File not a PDF / corrupt header → `anyhow::Error`.
  - Encrypted PDF → `anyhow::Error` with a hint to remove encryption (no decryption attempt in v1).
- Determinism: identical bytes → identical doc/block IDs and text.

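The per-page branch described in the list above (success versus "scanned candidate") can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `PageOutcome` and `extract_page` are hypothetical names, and the `extract` closure stands in for the `doc.extract_text(&[page_num])` call, so no `lopdf` or `kebab_core` types appear here.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Simplified stand-in for the real block + provenance types (hypothetical).
#[derive(Debug, PartialEq)]
enum PageOutcome {
    /// Page produced text; `char_end` is a char count, not a byte length.
    Text { page: u32, ordinal: u32, text: String, char_end: u32 },
    /// Page produced nothing usable; flagged for a later OCR pass.
    ScannedCandidate { page: u32, ordinal: u32, note: String },
}

/// Runs `extract` for one 1-based page number, absorbing `Err` results
/// and panics alike. `extract` is an assumed stand-in for the per-page
/// text call on the loaded document.
fn extract_page<F>(page: u32, extract: F) -> PageOutcome
where
    F: FnOnce(u32) -> Result<String, String>,
{
    let ordinal = page - 1;
    // AssertUnwindSafe: the closure is consumed here and never observed
    // again after a panic, so broken invariants cannot leak out.
    let result = catch_unwind(AssertUnwindSafe(|| extract(page)));
    match result {
        Ok(Ok(text)) if !text.is_empty() => {
            let char_end = text.chars().count() as u32;
            PageOutcome::Text { page, ordinal, text, char_end }
        }
        // Empty text, Err, or a caught panic all take the same branch.
        _ => PageOutcome::ScannedCandidate {
            page,
            ordinal,
            note: format!("page{} empty (scanned candidate)", page),
        },
    }
}
```

Counting `chars()` instead of taking `len()` matters for non-ASCII text: a five-character string containing `é` is six bytes long in UTF-8 but still yields `char_end == 5`.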
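The `title` precedence rule can be illustrated in the same spirit. The sketch below assumes the raw `/Info/Title` bytes have already been pulled out of the document, and that they are either a UTF-16BE string prefixed with the `FE FF` byte-order mark (which PDF text strings permit) or something that can be read as UTF-8. The second case is a simplification, since real PDFs may use PDFDocEncoding. `decode_info_string` and `resolve_title` are hypothetical helper names.

```rust
use std::path::Path;

/// Decodes a raw `/Info` string. Handles the UTF-16BE BOM case; anything
/// else is read as lossy UTF-8 (a simplification of PDFDocEncoding).
fn decode_info_string(raw: &[u8]) -> String {
    if raw.starts_with(&[0xFE, 0xFF]) {
        let units: Vec<u16> = raw[2..]
            .chunks_exact(2) // a trailing odd byte is silently dropped
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        String::from_utf8_lossy(raw).into_owned()
    }
}

/// `/Info/Title` when non-empty, otherwise the file name without extension.
fn resolve_title(info_title: Option<&[u8]>, file_name: &str) -> String {
    let from_info = info_title.map(decode_info_string).unwrap_or_default();
    let from_info = from_info.trim();
    if !from_info.is_empty() {
        return from_info.to_string();
    }
    Path::new(file_name)
        .file_stem()
        .map(|stem| stem.to_string_lossy().into_owned())
        .unwrap_or_else(|| file_name.to_string())
}
```

Trimming before the emptiness check is a choice of this sketch; it treats a whitespace-only title the same as a missing one.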
## Storage / wire effects

- None directly.

## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| unit | 3-page PDF produces 3 paragraph blocks with `SourceSpan::Page { page: 1..=3 }` | `fixtures/pdf/three-page-en.pdf` |
| unit | PDF with image-only page 2 (no text) emits warning + empty text for page 2 | `fixtures/pdf/scanned-mixed.pdf` |
| unit | encrypted PDF returns error with helpful hint | `fixtures/pdf/encrypted.pdf` |
| unit | corrupt header PDF returns error | `fixtures/pdf/corrupt.pdf` |
| unit | `metadata.user.pdf.page_count` matches actual count | inline |
| unit | Korean text PDF preserved (CID mapping permitting) | `fixtures/pdf/korean.pdf` |
| determinism | identical bytes → identical CanonicalDocument JSON across two runs | inline |
| snapshot | CanonicalDocument JSON for fixture stable | `fixtures/pdf/three-page-en.pdf` |

All tests under `cargo test -p kebab-parse-pdf`.

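For the encrypted and corrupt-header rows, one way to picture the error surface the tests assert against is a small rejection type. This is only a sketch under stated assumptions: `PdfRejection` and `preflight` are hypothetical names, the `is_encrypted` flag is assumed to come from `lopdf`'s `is_encrypted()` after loading, and the `%PDF-` prefix check is a simplified stand-in for the parser's own header validation.

```rust
use std::fmt;

/// Hypothetical rejection reasons for inputs the extractor refuses.
#[derive(Debug, PartialEq)]
enum PdfRejection {
    /// Bytes do not begin with the `%PDF-` header.
    NotPdf,
    /// Document is encrypted; no decryption is attempted.
    Encrypted,
}

impl fmt::Display for PdfRejection {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfRejection::NotPdf => write!(f, "not a PDF: missing %PDF- header"),
            PdfRejection::Encrypted => write!(
                f,
                "encrypted PDF is not supported; remove encryption first \
                 (e.g. `qpdf --decrypt in.pdf out.pdf`)"
            ),
        }
    }
}

impl std::error::Error for PdfRejection {}

/// Cheap pre-flight check run before any per-page work.
fn preflight(bytes: &[u8], is_encrypted: bool) -> Result<(), PdfRejection> {
    if !bytes.starts_with(b"%PDF-") {
        return Err(PdfRejection::NotPdf);
    }
    if is_encrypted {
        return Err(PdfRejection::Encrypted);
    }
    Ok(())
}
```

A test for the "helpful hint" row could then assert that the error's display string mentions `qpdf --decrypt`, without pinning the full message.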
## Definition of Done

- [ ] `cargo check -p kebab-parse-pdf` passes
- [ ] `cargo test -p kebab-parse-pdf` passes
- [ ] No OCR / LLM code present
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4 SourceSpan::Page, §9.2

## Out of scope

- OCR for scanned PDFs (separate future task; reuses p6-2 OCR adapter).
- Layout reconstruction (multi-column reading order, tables).
- Math rendering / formula detection.
- Form-field extraction.
- Bookmark / outline ingestion (could become `heading_path` later; note for P+).

## Risks / notes

- Extracted text quality varies widely across PDFs. For broken-glyph PDFs, the text may be Unicode noise; downstream embedding still works but quality is poor. Mark such pages with a confidence-style warning when feasible.
- Some PDFs have layered text (selectable text + scanned image overlay). v1 captures the selectable text only.
- For very large PDFs (> 1k pages), memory usage may spike. Document a soft limit (`config.pdf.max_pages`, default 5000) and refuse documents beyond it.

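The soft page limit from the last note could be enforced with a check along these lines, run right after the page count is known. The names `DEFAULT_MAX_PAGES`, `TooManyPages`, and `enforce_page_limit` are hypothetical; only the `config.pdf.max_pages` key and its default of 5000 come from the note itself.

```rust
/// Default for `config.pdf.max_pages`, as stated in the note above.
const DEFAULT_MAX_PAGES: usize = 5000;

/// Hypothetical error returned when a document exceeds the soft limit.
#[derive(Debug, PartialEq)]
struct TooManyPages {
    page_count: usize,
    max_pages: usize,
}

/// Refuses documents whose page count is strictly above `max_pages`.
fn enforce_page_limit(page_count: usize, max_pages: usize) -> Result<(), TooManyPages> {
    if page_count > max_pages {
        Err(TooManyPages { page_count, max_pages })
    } else {
        Ok(())
    }
}
```

Whether a document with exactly `max_pages` pages is accepted is a choice of this sketch; the note only says to refuse "beyond" the limit.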