
altair823 5a158d7343 feat(kebab-parse-pdf): P7-1 text PDF extractor — per-page CanonicalDocument

`PdfTextExtractor` (`MediaType::Pdf`): per-page text extraction built on lopdf.
Emits one `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }`
per page. A page whose body is empty or whose extraction panics is marked with an
empty paragraph + `Provenance::Warning` ("scanned candidate"); these pages are the
input for a later OCR fallback (separate task).

Key behavior:
- `lopdf::Document::load_mem` + `is_encrypted()` → encrypted PDFs produce an
  explicit error (pointing to `qpdf --decrypt`).
- Per-page `extract_text(&[page])` is wrapped in `catch_unwind`, turning
  malformed-page panics into recoverable warnings.
- Title/Producer/Creator are extracted best-effort from the `/Info` dict.
  UTF-16BE BOM-prefixed strings are decoded as well (non-ASCII titles such as
  Korean are handled correctly).
- 9 integration tests: 3-page emit, scanned-mixed warning, encrypted refuse,
  corrupt header error, page_count metadata, UTF-16BE Title, filename
  fallback, determinism, snapshot.

`parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7`
(as in the original spec). Multilingual OCR fallback for body text is a §9.2
follow-up task (out of scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 08:34:55 +00:00

6.7 KiB


field	value
phase	
component	kebab-parse-pdf (text extractor)
task_id	p7-1
title	Text PDF extractor → CanonicalDocument with page-level blocks
status	completed
depends_on	p0-1, p1-6
unblocks	p7-2
contract_source	../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections	§3.4 SourceSpan::Page; §3.4 Block::Paragraph; §9.2 PDF text extraction; §9 versioning

p7-1 — PDF text extractor

Goal

Implement Extractor for MediaType::Pdf. The extractor reads text page by page and emits one Block::Paragraph per page with SourceSpan::Page. Pages whose text extraction fails get an empty paragraph + Provenance::Warning so that an OCR fallback pipeline can pick them up later.

Why now / why this size

Strict scope: page text + page numbers. Layout reconstruction (multi-column merge, table extraction) is intentionally NOT in scope — it's its own engineering project. This task gets a usable PDF retrieval surface online with minimal moving parts.

Allowed dependencies

kebab-core
kebab-config
pdf-extract = "0.7" (or current stable)
lopdf = "0.32" for page metadata (count, optional title from /Info)
serde, serde_json
time
tracing
thiserror

Forbidden dependencies

kebab-source-fs, kebab-parse-md, kebab-normalize, kebab-chunk, kebab-store-*, kebab-embed*, kebab-search, kebab-llm*, kebab-rag, kebab-tui, kebab-desktop, OCR libraries (OCR fallback is a separate task, not this one)

Inputs

input	type	source
`RawAsset`	`kebab_core::RawAsset`	`kebab-source-fs`
PDF bytes	`&[u8]`	filesystem

Outputs

output	type	downstream
`CanonicalDocument`	`kebab_core::CanonicalDocument`	`kebab-chunk` (`pdf-page-v1` chunker in p7-2)

Public surface (signatures only — no new types)

pub struct PdfTextExtractor;

impl kebab_core::Extractor for PdfTextExtractor {
    fn supports(&self, m: &kebab_core::MediaType) -> bool { matches!(m, kebab_core::MediaType::Pdf) }
    fn parser_version(&self) -> kebab_core::ParserVersion { kebab_core::ParserVersion("pdf-text-v1".into()) }
    fn extract(&self, ctx: &kebab_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kebab_core::CanonicalDocument>;
}

Behavior contract

pdf-extract (0.7+) does NOT expose a per-page Rust API. Its public surface is pdf_extract::extract_text(path) and pdf_extract::extract_text_from_mem(bytes) — both return a single String for the whole document. Per-page text MUST therefore be obtained by iterating lopdf::Document::load_mem(bytes) page objects directly:
1. Load via lopdf::Document::load_mem(bytes).
2. doc.get_pages() → BTreeMap<u32, ObjectId> (1-based page numbers).
3. For each (page_num, page_id): call doc.extract_text(&[page_num]) (lopdf's per-page text extraction), wrap with catch_unwind to absorb the rare crash on malformed pages.
4. Treat the returned text as the text for that page. An empty result OR Err → fall through to the "scanned candidate" branch.
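The guard in steps 3 and 4 can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual code: `PageText` and `extract_page_guarded` are hypothetical names, the closure stands in for `|| doc.extract_text(&[page_num])`, and treating whitespace-only text as empty is an assumption.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Outcome of one page's text extraction (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum PageText {
    /// Non-empty text was extracted.
    Text(String),
    /// Empty text, an `Err`, or a panic: the page is a scanned candidate.
    ScannedCandidate,
}

/// Runs `extract` for one page and folds every failure mode into
/// `PageText::ScannedCandidate`. `extract` is a stand-in for the lopdf call.
pub fn extract_page_guarded<F, E>(extract: F) -> PageText
where
    F: FnOnce() -> Result<String, E>,
{
    // AssertUnwindSafe: nothing captured by the closure is reused after a panic.
    match catch_unwind(AssertUnwindSafe(extract)) {
        // Assumption: whitespace-only text counts as empty.
        Ok(Ok(text)) if !text.trim().is_empty() => PageText::Text(text),
        // Empty text, Ok(Err(_)), and a caught panic all end up here.
        _ => PageText::ScannedCandidate,
    }
}
```

Note that `catch_unwind` only works when the binary is built with the default unwinding panic strategy; with `panic = "abort"` a malformed page would still terminate the process.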
For each page (1-based i from above):
- On success: produce Block::Paragraph(TextBlock { common, text, inlines: vec![Inline::Text(text)] }) with common.source_span = SourceSpan::Page { page: i, char_start: Some(0), char_end: Some(text.chars().count() as u32) } (NOTE: char count, not byte len, so spans match Citation::Page fragment semantics) and common.heading_path = vec![].
- On empty/error: produce Block::Paragraph with text: "", Provenance::Warning { note: format!("page{} empty (scanned candidate)", i) }. The warning marks the page as a candidate for the OCR fallback pipeline (out of scope for this task).
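The char-count rule for the span is easy to get wrong with non-ASCII text, so here is a small sketch. `SourceSpan` below is a local stand-in that mirrors the field names in the contract; the real type lives in `kebab_core`, and `page_span` is a hypothetical helper.

```rust
/// Local stand-in for `kebab_core::SourceSpan::Page` (assumed shape).
#[derive(Debug, PartialEq)]
pub enum SourceSpan {
    Page {
        page: u32,
        char_start: Option<u32>,
        char_end: Option<u32>,
    },
}

/// Builds the span for a successfully extracted page.
pub fn page_span(page: u32, text: &str) -> SourceSpan {
    SourceSpan::Page {
        page,
        char_start: Some(0),
        // Count Unicode scalar values, not UTF-8 bytes (`text.len()`).
        char_end: Some(text.chars().count() as u32),
    }
}
```

For the three-syllable string "한국어", `text.len()` is 9 bytes while `text.chars().count()` is 3, so a byte-based span would overshoot by a factor of three.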
pdf-extract whole-document call MAY still be used as a sanity check (extract_text_from_mem) to detect catastrophic decoding failure early, but per-page text is sourced from lopdf only.
title precedence: /Info/Title from lopdf (when non-empty) → filename without extension.
lang = Lang("und") (PDFs rarely declare a language; lingua detection over the body could be a future enhancement).
metadata.user["pdf"] = { "page_count": n, "producer": "...", "creator": "..." } from /Info.
metadata.source_type = SourceType::Paper; trust_level = TrustLevel::Primary.
provenance events: Discovered, Parsed (per page text or warning).
block_id per design §4.2 with block_kind = "paragraph", heading_path = [], ordinal = page - 1, source_span = SourceSpan::Page { page }.
Streaming: parse the in-memory PDF only once; do not invoke pdf-extract per page, since that re-parses the whole document N times.
Failure modes:
- File not a PDF / corrupt header → anyhow::Error.
- Encrypted PDF → anyhow::Error with hint to remove encryption (no decryption attempt in v1).
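The two failure modes could be surfaced roughly as below. This is a sketch under assumptions: `PdfRejection` and `precheck` are hypothetical names, the `%PDF-` prefix test is a simplified stand-in for the error lopdf returns on a bad header, and `is_encrypted` is expected to come from the loaded document. The real extractor returns `anyhow::Error`; a plain `std::error::Error` type is used here to stay dependency-free.

```rust
use std::fmt;

/// Reasons the extractor refuses a file (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum PdfRejection {
    NotAPdf,
    Encrypted,
}

impl fmt::Display for PdfRejection {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfRejection::NotAPdf => write!(f, "not a PDF or corrupt header"),
            PdfRejection::Encrypted => write!(
                f,
                "encrypted PDF is not supported; remove encryption first \
                 (e.g. `qpdf --decrypt in.pdf out.pdf`)"
            ),
        }
    }
}

impl std::error::Error for PdfRejection {}

/// Simplified pre-check: header magic first, then the encryption flag.
pub fn precheck(bytes: &[u8], is_encrypted: bool) -> Result<(), PdfRejection> {
    if !bytes.starts_with(b"%PDF-") {
        return Err(PdfRejection::NotAPdf);
    }
    if is_encrypted {
        // v1 makes no decryption attempt; it only reports the hint.
        return Err(PdfRejection::Encrypted);
    }
    Ok(())
}
```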
Determinism: identical bytes → identical doc/block IDs and text.
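The title precedence rule in this contract, together with the UTF-16BE handling mentioned in the commit message, might look like the following sketch. Both function names are hypothetical. Two simplifications are assumed: strings without a BOM are read as UTF-8 rather than PDFDocEncoding, and a whitespace-only title counts as empty.

```rust
use std::path::Path;

/// Decodes a raw `/Info` string. A leading FE FF byte-order mark means
/// UTF-16BE; anything else is read as UTF-8 (simplification, see lead-in).
pub fn decode_info_string(raw: &[u8]) -> Option<String> {
    if raw.starts_with(&[0xFE, 0xFF]) {
        let body = &raw[2..];
        if body.len() % 2 != 0 {
            return None; // a truncated code unit cannot be decoded
        }
        let units: Vec<u16> = body
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16(&units).ok()
    } else {
        std::str::from_utf8(raw).ok().map(str::to_owned)
    }
}

/// `/Info/Title` when non-empty, otherwise the filename without extension.
pub fn resolve_title(info_title: Option<&str>, path: &Path) -> String {
    match info_title.map(str::trim) {
        Some(title) if !title.is_empty() => title.to_string(),
        _ => path
            .file_stem()
            .map(|stem| stem.to_string_lossy().into_owned())
            .unwrap_or_default(),
    }
}
```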

Storage / wire effects

None directly.

Test plan

kind	description	fixture / data
unit	3-page PDF produces 3 paragraph blocks with `SourceSpan::Page { page: 1..=3 }`	`fixtures/pdf/three-page-en.pdf`
unit	PDF with image-only page 2 (no text) emits warning + empty text for page 2	`fixtures/pdf/scanned-mixed.pdf`
unit	encrypted PDF returns error with helpful hint	`fixtures/pdf/encrypted.pdf`
unit	corrupt header PDF returns error	`fixtures/pdf/corrupt.pdf`
unit	`metadata.user.pdf.page_count` matches actual count	inline
unit	Korean text PDF preserved (CID mapping permitting)	`fixtures/pdf/korean.pdf`
determinism	identical bytes → identical CanonicalDocument JSON across two runs	inline
snapshot	CanonicalDocument JSON for fixture stable	`fixtures/pdf/three-page-en.pdf`

All tests under cargo test -p kebab-parse-pdf.

Definition of Done

cargo check -p kebab-parse-pdf passes
cargo test -p kebab-parse-pdf passes
No OCR / LLM code present
No imports outside Allowed dependencies
PR links design §3.4 SourceSpan::Page, §9.2

Out of scope

OCR for scanned PDFs (separate future task; reuses p6-2 OCR adapter).
Layout reconstruction (multi-column reading order, tables).
Math rendering / formula detection.
Form-field extraction.
Bookmark / outline ingestion (could become heading_path later — note for P+).

Risks / notes

Text-extraction quality varies widely across PDFs. For broken-glyph PDFs, the text may be Unicode noise; downstream embedding still works but quality is poor. Mark such pages with a confidence-style warning when feasible.
Some PDFs have layered text (selectable text + scanned image overlay). v1 captures the selectable text only.
For very large PDFs (> 1k pages), memory usage may spike. Document a soft limit (config.pdf.max_pages default 5000) and refuse beyond it.
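The soft page limit could be enforced right after the page map is read and before any text extraction. A minimal sketch, assuming the limit is passed in from `config.pdf.max_pages`; the function and constant names are hypothetical, and the real extractor would return `anyhow::Error` rather than `String`.

```rust
/// Default for `config.pdf.max_pages`, as suggested in the note above.
pub const DEFAULT_MAX_PAGES: usize = 5000;

/// Refuses documents whose page count exceeds the configured soft limit.
pub fn check_page_limit(page_count: usize, max_pages: usize) -> Result<(), String> {
    if page_count > max_pages {
        Err(format!(
            "PDF has {page_count} pages, above the configured limit of {max_pages}"
        ))
    } else {
        Ok(())
    }
}
```

Checking the count up front keeps the refusal cheap, since no per-page extraction has run yet.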
