kebab/tasks/p7/p7-1-pdf-text-extractor.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
Final commit. Bulk-updates every occurrence of the word `kb` in all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the
  Rust module path notation `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix applies
  when P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`. Both fenced code blocks and inline backticks.
- XDG paths + env vars + binary path (`target/release/kb` →
  `target/release/kebab`) synchronized.
- All references unified across the design doc / initial report /
  SMOKE / HOTFIXES / phase epics / task specs.
- `git -c user.name=kb` in task-decomposition.md is author information
  recording past git history, so it is left unchanged (the author in
  the actual git history cannot be changed).
- `tasks/phase-5-evaluation.md`: `status: planned` → `completed` as
  well (not yet reflected after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (excluding the git author in
  task-decomposition.md).
- All file path references resolve (renamed files are all updated to
  their new paths).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00


phase: P7
component: kebab-parse-pdf (text extractor)
task_id: p7-1
title: Text PDF extractor → CanonicalDocument with page-level blocks
status: planned
depends_on: p0-1, p1-6
unblocks: p7-2
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: §3.4 SourceSpan::Page, §3.4 Block::Paragraph, §9.2 PDF text extraction, §9 versioning

p7-1 — PDF text extractor

Goal

Implement Extractor for MediaType::Pdf. Extracts text page-by-page, emits one Block::Paragraph per page with SourceSpan::Page. Failed-text pages get an empty paragraph + Provenance::Warning so they can be picked up later by an OCR fallback pipeline.

Why now / why this size

Strict scope: page text + page numbers. Layout reconstruction (multi-column merge, table extraction) is intentionally NOT in scope — it's its own engineering project. This task gets a usable PDF retrieval surface online with minimal moving parts.

Allowed dependencies

  • kebab-core
  • kebab-config
  • pdf-extract = "0.7" (or current stable)
  • lopdf = "0.32" for page metadata (count, optional title from /Info)
  • serde, serde_json
  • time
  • tracing
  • thiserror

Forbidden dependencies

  • kebab-source-fs, kebab-parse-md, kebab-normalize, kebab-chunk, kebab-store-*, kebab-embed*, kebab-search, kebab-llm*, kebab-rag, kebab-tui, kebab-desktop, OCR libraries (OCR fallback is a separate task, not this one)

Inputs

| input | type | source |
| --- | --- | --- |
| RawAsset | kebab_core::RawAsset | kebab-source-fs |
| PDF bytes | &[u8] | filesystem |

Outputs

| output | type | downstream |
| --- | --- | --- |
| CanonicalDocument | kebab_core::CanonicalDocument | kebab-chunk (pdf-page-v1 chunker in p7-2) |

Public surface (signatures only — no new types)

pub struct PdfTextExtractor;

impl kebab_core::Extractor for PdfTextExtractor {
    fn supports(&self, m: &kebab_core::MediaType) -> bool { matches!(m, kebab_core::MediaType::Pdf) }
    fn parser_version(&self) -> kebab_core::ParserVersion { kebab_core::ParserVersion("pdf-text-v1".into()) }
    fn extract(&self, ctx: &kebab_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kebab_core::CanonicalDocument>;
}

Behavior contract

  • pdf-extract (0.7+) does NOT expose a per-page Rust API. Its public surface is pdf_extract::extract_text(path) and pdf_extract::extract_text_from_mem(bytes) — both return a single String for the whole document. Per-page text MUST therefore be obtained by iterating lopdf::Document::load_mem(bytes) page objects directly:
    1. Load via lopdf::Document::load_mem(bytes).
    2. doc.get_pages() → BTreeMap<u32, ObjectId> (1-based page numbers).
    3. For each (page_num, page_id): call doc.extract_text(&[page_num]) (lopdf's per-page text extraction), wrapped in catch_unwind to absorb the rare panic on malformed pages.
    4. Treat the returned text as that page's text. An empty result OR an Err falls through to the "scanned candidate" branch.
  • For each page (1-based i from above):
    • On success: produce Block::Paragraph(TextBlock { common, text, inlines: vec![Inline::Text(text)] }) with common.source_span = SourceSpan::Page { page: i, char_start: Some(0), char_end: Some(text.chars().count() as u32) } (NOTE: char count, not byte len, so spans match Citation::Page fragment semantics) and common.heading_path = vec![].
    • On empty/error: produce Block::Paragraph with text: "", Provenance::Warning { note: format!("page{} empty (scanned candidate)", i) }. The warning marks the page as a candidate for the OCR fallback pipeline (out of scope for this task).
  • pdf-extract whole-document call MAY still be used as a sanity check (extract_text_from_mem) to detect catastrophic decoding failure early, but per-page text is sourced from lopdf only.
  • title precedence: /Info/Title from lopdf (when non-empty) → filename without extension.
  • lang = Lang("und") (PDFs rarely declare; lingua detection over the body could be a future enhancement).
  • metadata.user["pdf"] = { "page_count": n, "producer": "...", "creator": "..." } from /Info.
  • metadata.source_type = SourceType::Paper; trust_level = TrustLevel::Primary.
  • provenance events: Discovered, Parsed (per page text or warning).
  • block_id per design §4.2 with block_kind = "paragraph", heading_path = [], ordinal = page - 1, source_span = SourceSpan::Page { page }.
  • Single parse: load the PDF bytes into a lopdf::Document once and reuse it for every page; do not invoke pdf-extract per page (that re-parses the whole document N times).
  • Failure modes:
    • File not a PDF / corrupt header → anyhow::Error.
    • Encrypted PDF → anyhow::Error with hint to remove encryption (no decryption attempt in v1).
  • Determinism: identical bytes → identical doc/block IDs and text.
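
The per-page branch above (success vs. "scanned candidate") can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: PageBlock is a hypothetical stand-in for the kebab_core block types, and the extract closure stands in for the call to lopdf's doc.extract_text(&[page_num]). It shows the three things the contract asks for: panics are absorbed with catch_unwind, the span end is a char count rather than a byte length, and empty or failed pages carry the warning note.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical stand-in for the page-level block described above;
/// the real types (Block::Paragraph, SourceSpan::Page, Provenance::Warning)
/// live in kebab_core.
#[derive(Debug, PartialEq)]
struct PageBlock {
    page: u32,               // 1-based page number
    text: String,            // "" for scanned candidates
    char_end: u32,           // text.chars().count(), not text.len()
    warning: Option<String>, // Some(..) marks an OCR-fallback candidate
}

/// Builds the block for one page. `extract` is an assumed stand-in for the
/// per-page text extraction call; it may return Err or panic.
fn page_block<F>(page: u32, extract: F) -> PageBlock
where
    F: FnOnce(u32) -> Result<String, String>,
{
    // A panic on a malformed page is handled like an extraction error.
    let outcome = catch_unwind(AssertUnwindSafe(|| extract(page)));
    let text = match outcome {
        Ok(Ok(text)) => text,
        _ => String::new(),
    };

    if text.is_empty() {
        // Empty result or error: emit an empty paragraph plus a warning.
        PageBlock {
            page,
            text,
            char_end: 0,
            warning: Some(format!("page{} empty (scanned candidate)", page)),
        }
    } else {
        // Count Unicode scalar values so the span is not inflated by
        // multi-byte UTF-8 sequences.
        let char_end = text.chars().count() as u32;
        PageBlock { page, text, char_end, warning: None }
    }
}
```

Calling page_block once per key of the page map, in ascending order, would yield one block per page. Because the closure is the only source of variation, identical inputs produce identical blocks.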

Storage / wire effects

  • None directly.

Test plan

| kind | description | fixture / data |
| --- | --- | --- |
| unit | 3-page PDF produces 3 paragraph blocks with SourceSpan::Page { page: 1..=3 } | fixtures/pdf/three-page-en.pdf |
| unit | PDF with image-only page 2 (no text) emits warning + empty text for page 2 | fixtures/pdf/scanned-mixed.pdf |
| unit | encrypted PDF returns error with helpful hint | fixtures/pdf/encrypted.pdf |
| unit | corrupt header PDF returns error | fixtures/pdf/corrupt.pdf |
| unit | metadata.user.pdf.page_count matches actual count | inline |
| unit | Korean text PDF preserved (CID mapping permitting) | fixtures/pdf/korean.pdf |
| determinism | identical bytes → identical CanonicalDocument JSON across two runs | inline |
| snapshot | CanonicalDocument JSON for fixture stable | fixtures/pdf/three-page-en.pdf |

All tests under cargo test -p kebab-parse-pdf.
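
The determinism row in the table above could be backed by a small helper along these lines. This is only a sketch: same_output_twice is a hypothetical name, and the extract closure stands in for "run the extractor, then serialize the CanonicalDocument to JSON".

```rust
/// Runs `extract` twice over the same bytes and compares the serialized
/// output. `extract` is an assumed stand-in for "extract, then serialize".
fn same_output_twice<F>(bytes: &[u8], extract: F) -> bool
where
    F: Fn(&[u8]) -> String,
{
    let first = extract(bytes);
    let second = extract(bytes);
    // Byte-for-byte comparison: any timestamp, random ID, or unordered
    // map iteration leaking into the output makes this false.
    first == second
}
```

Comparing the serialized string rather than individual fields keeps the check sensitive to ordering differences as well as value differences.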

Definition of Done

  • cargo check -p kebab-parse-pdf passes
  • cargo test -p kebab-parse-pdf passes
  • No OCR / LLM code present
  • No imports outside Allowed dependencies
  • PR links design §3.4 SourceSpan::Page, §9.2

Out of scope

  • OCR for scanned PDFs (separate future task; reuses p6-2 OCR adapter).
  • Layout reconstruction (multi-column reading order, tables).
  • Math rendering / formula detection.
  • Form-field extraction.
  • Bookmark / outline ingestion (could become heading_path later — note for P+).

Risks / notes

  • Extracted text quality varies wildly across PDFs. For broken-glyph PDFs, the text may be Unicode noise; downstream embedding still runs, but retrieval quality is poor. Mark such pages with a confidence-style warning when feasible.
  • Some PDFs have layered text (selectable text + scanned image overlay). v1 captures the selectable text only.
  • For very large PDFs (> 1k pages), memory usage may spike. Document a soft limit (config.pdf.max_pages default 5000) and refuse beyond it.
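
The soft limit in the last note could be enforced right after the page map is read and before any text is extracted. A minimal sketch, assuming the limit arrives as a plain usize: check_page_limit and DEFAULT_MAX_PAGES are hypothetical names, and the real error type would presumably be anyhow::Error rather than String.

```rust
/// Hypothetical default mirroring `config.pdf.max_pages` from the note above.
const DEFAULT_MAX_PAGES: usize = 5000;

/// Refuses documents whose page count exceeds the configured soft limit.
/// A document with exactly `max_pages` pages is still accepted.
fn check_page_limit(page_count: usize, max_pages: usize) -> Result<(), String> {
    if page_count > max_pages {
        return Err(format!(
            "PDF has {page_count} pages; limit is {max_pages} (config.pdf.max_pages)"
        ));
    }
    Ok(())
}
```

Naming the config key in the error message tells the user which setting to raise if the refusal is unwanted.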