Wire the P7-1 (`PdfTextExtractor`) and P7-2 (`PdfPageV1Chunker`) libraries into `kebab-app::ingest_with_config`. Assets that `kebab-source-fs` already classified from `*.pdf` as `MediaType::Pdf` are now indexed as searchable docs. This parallels the P6-4 image wiring pattern: add a `MediaType::Pdf` arm to `ingest_one_asset` and branch into the new private fn `ingest_one_pdf_asset`.

Key behavior:
- Per-medium chunker selection: PDF assets hard-code `PdfPageV1Chunker` (compile-time match). `config.chunking.chunker_version` represents markdown only; PDF is always `pdf-page-v1`. The deviation is recorded in HOTFIXES entry `2026-05-02 P7-3`.
- Encrypted PDF / corrupt PDF → `errors+=1`, with P7-1's `qpdf --decrypt` hint preserved verbatim in `IngestItem.error`.
- Empty / scanned-candidate pages → 0 chunks; P7-1's `Provenance::Warning` passes through unchanged. Not searchable in v1, pending the P+ scanned-PDF OCR fallback.
- Determinism stress: no additional `now()` call between extract and chunk (inherits the P6-4 invariant). PDF doc_id and chunk_id are both deterministic.

Integration tests (`tests/pdf_pipeline.rs`, 8 passed + 1 ignored):
- 3-page text PDF → 1 doc + 3 chunks + Page span verified
- identical re-ingest → Updated, same doc_id
- encrypted PDF → Error + `qpdf` hint preserved
- corrupt-header PDF → Error + not stored
- mixed pages (page 2 empty) → 2 chunks + 1 Warning
- IngestReport arithmetic invariant
- 50-page long PDF → ≥50 chunks
- inspect doc → SourceSpan::Page round-trip
- (ignored) edited-bytes re-ingest → exposes the storage UNIQUE bug, awaiting a P+ fix

Additional finding (HOTFIXES `2026-05-02 P7-3`): there is a gap between the UNIQUE constraint on `assets.workspace_path` and `upsert_asset_row`, which handles only `ON CONFLICT(asset_id)`. When the bytes change, the new asset_id collides on the same workspace_path. md / image / pdf are all affected. The P7-3 integration tests were the first to expose it. This PR does not fix it; it is a P+ storage task.

`docs/SMOKE.md` gains a PDF section, a verification checklist, and 4 known behaviors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| phase | component | task_id | title | status | depends_on | unblocks | contract_source | contract_sections |
|---|---|---|---|---|---|---|---|---|
| P7 | kebab-app (PDF ingest dispatch + chunker selection) | p7-3 | Wire PdfTextExtractor + PdfPageV1Chunker into kebab-app::ingest end-to-end | completed | | | ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md | |
p7-3 — PDF ingest wiring (kebab-app)
Goal
Make kebab ingest end-to-end functional for PDF assets. P7-1 (PdfTextExtractor) and P7-2 (PdfPageV1Chunker) each ship a tested library; this task connects the wires from kebab-source-fs (which already classifies .pdf as MediaType::Pdf) through kebab-app::ingest_with_config to kebab-store-sqlite + kebab-store-vector, so a user running kebab ingest against a workspace containing PDF papers / books / reports sees those assets indexed and searchable, with page-level citations preserved.
Why now / why this size
P7-1 / P7-2 deliver value only after kebab-app::ingest learns to dispatch on MediaType::Pdf. The wiring is small — one new dispatch arm in ingest_one_asset, one new private helper ingest_one_pdf_asset, and a per-medium chunker selection step — but it is materially user-facing: without it, the entire P7 phase is invisible from the CLI. P6-4 (image wiring) established the pattern; P7-3 follows it identically with two structural differences:
- PDF chunking uses a separate chunker (pdf-page-v1) rather than an in-place branch inside md-heading-v1. The chunker selection happens at dispatch time keyed on MediaType.
- PDF documents commonly produce many chunks per asset (one per page, sometimes more for long pages); the embedder loop must scale to hundreds of chunks per doc without breaking the per-asset transaction.
Pulling this into its own task keeps P7-1 / P7-2 specs frozen as written while letting integration evolve under its own contract.
Allowed dependencies
The current kebab-app Cargo.toml, plus one added line for kebab-parse-pdf.
- kebab-core
- kebab-config
- kebab-source-fs
- kebab-parse-md, kebab-parse-types
- kebab-normalize
- kebab-chunk (uses both MdHeadingV1Chunker AND PdfPageV1Chunker; chunker selection at dispatch time)
- kebab-store-sqlite, kebab-store-vector
- kebab-search
- kebab-embed, kebab-embed-local
- kebab-llm, kebab-llm-local
- kebab-rag
- kebab-parse-image
- kebab-parse-pdf (NEW, added by this task)
- anyhow, serde_json, tracing
Forbidden dependencies
- kebab-tui, kebab-desktop (P9 has not started; a UI crate calling ingest would violate layering).
- kebab-eval (cycle risk: eval calls ingest).
- Do not reimplement PDF parsing / chunking logic inside this task. Only thin dispatch + glue over kebab-parse-pdf + kebab-chunk::PdfPageV1Chunker is allowed.
Inputs
| input | type | source |
|---|---|---|
| workspace assets | RawAsset stream | kebab-source-fs::SourceConnector::scan (already classifies MediaType::Pdf) |
| PDF bytes | &[u8] | filesystem read in kebab-app |
| Config | kebab_config::Config | CLI --config flag → Config::load |
| ChunkPolicy | kebab_core::ChunkPolicy | derived from config.chunking (existing) |
Outputs
| output | type | downstream |
|---|---|---|
| CanonicalDocument per PDF | written via DocumentStore::put_document | kebab-store-sqlite |
| Vec<Chunk> per PDF (≥1 chunk per non-empty page) | kebab-store-sqlite::put_chunks + kebab-store-vector::upsert | kebab-store-sqlite + kebab-store-vector |
| Updated IngestReport counters | scanned / new / updated / skipped / errors | wire output (ingest_report.v1) |
Public surface
No new public types. The wiring exists inside kebab-app::ingest_with_config (and its private helpers). One new private function:
    // crates/kebab-app/src/lib.rs (additions only, sketch)
    /// P7-3: process one `MediaType::Pdf` asset end-to-end.
    fn ingest_one_pdf_asset(
        app: &App,
        asset: &RawAsset,
        chunk_policy: &ChunkPolicy,
        embedder: Option<&Arc<dyn Embedder + Send + Sync>>,
        vector_store: Option<&Arc<kebab_store_vector::LanceVectorStore>>,
        existing_doc_ids: &std::collections::HashSet<String>,
    ) -> anyhow::Result<kebab_core::IngestItem> { ... }
ingest_one_asset gets a new match arm:
    match &asset.media_type {
        MediaType::Markdown => { /* existing fall-through */ }
        MediaType::Image(_) => return ingest_one_image_asset(...),
        MediaType::Pdf => return ingest_one_pdf_asset(...), // NEW
        _ => return Ok(IngestItem { kind: Skipped, ... }),
    }
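The routing decision can be isolated from the surrounding ingest code and tested on its own. The following is a minimal sketch under stated assumptions: `MediaType` here is a stand-in whose payload types are invented for illustration, and `Route` / `route_for` are hypothetical names that do not exist in kebab-app.

```rust
// Stand-in for kebab_core::MediaType; the String payloads are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum MediaType {
    Markdown,
    Image(String),
    Pdf,
    Audio(String),
    Other(String),
}

// Hypothetical label for the branch that handles an asset.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Route {
    Markdown,
    Image,
    Pdf,
    Skipped,
}

/// Decide which ingest branch handles an asset of the given medium.
fn route_for(media_type: &MediaType) -> Route {
    match media_type {
        MediaType::Markdown => Route::Markdown,
        MediaType::Image(_) => Route::Image,
        // Before this task, PDFs fell through to the skip arm.
        MediaType::Pdf => Route::Pdf,
        MediaType::Audio(_) | MediaType::Other(_) => Route::Skipped,
    }
}
```

Listing the remaining variants explicitly, rather than using `_`, would make the compiler flag any future medium that has no branch yet; the spec's sketch keeps the wildcard arm.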
Behavior contract
Ingest dispatch (kebab-app)
- For each RawAsset:
  - MediaType::Markdown → existing markdown extractor path (unchanged).
  - MediaType::Image(_) → P6-4 image branch (unchanged).
  - MediaType::Pdf → new PDF branch (this task).
  - MediaType::Audio(_) | MediaType::Other(_) → skipped += 1 (existing behaviour).
- Operational jump: before this task, MediaType::Pdf fell into the _ arm and was counted as Skipped. After merge, every PDF asset shifts from skipped to scanned / new / updated. A workspace with N PDF files reports skipped decreasing by N and scanned (and new on the first ingest after merge) increasing by N. Flag this in the implementation PR description so eval / smoke / runtime users can interpret the one-time discontinuity.
- PDF branch (ingest_one_pdf_asset):
  - Read bytes via std::fs::read (consistent with markdown / image branches).
  - Build kebab_core::ExtractContext { asset, workspace_root, config: &ExtractConfig::default() }.
  - Call PdfTextExtractor::new().extract(&ctx, &bytes). Failure (Err(_)) → IngestItemKind::Error with the formatted error in IngestItem.error. Continue to next asset (do not abort the whole ingest).
    - Encrypted PDFs hit this branch (P7-1 returns Err with the qpdf --decrypt hint preserved verbatim, so the operator sees the actionable message in the kebab ingest output).
    - Corrupt / non-PDF bytes likewise.
  - The returned CanonicalDocument may carry per-page Provenance::Warning events (P7-1 emits one for each empty / extract-failed page, marked "scanned candidate"). Pass these through unchanged: the chunker correctly emits 0 chunks for empty pages, so the asset is still indexed but those pages are not searchable until a future scanned-PDF OCR fallback lands (out of scope here).
  - Pass the CanonicalDocument to PdfPageV1Chunker (NOT MdHeadingV1Chunker; chunker selection is keyed on MediaType::Pdf). The chunker validates that every block is Block::Paragraph with SourceSpan::Page; if validation fails (which would mean P7-1's contract drifted), the chunker's error propagates up as IngestItemKind::Error.
  - Persist CanonicalDocument + Vec<Chunk> via the same DocumentStore::put_document + put_chunks calls the markdown branch uses.
  - Embed each chunk if embedder.is_some(). Each PDF chunk gets one vector; the embedder loop processes them in batches of config.indexing.max_parallel_embeddings like markdown chunks (no PDF-specific batching).
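The control flow of the PDF branch (extract, turn any extractor error into an error item, emit no chunk for empty pages, and keep going with the next asset) can be sketched without the real crates. Everything below is a toy model under loud assumptions: the "PDF" format is a `%PDF-` header followed by form-feed-separated page text, the error wording is a placeholder (only the `qpdf --decrypt` hint comes from the spec), and `Item`, `ItemKind`, `extract_pages`, `ingest_one`, and `ingest_all` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum ItemKind {
    New,
    Error,
}

#[derive(Debug)]
struct Item {
    kind: ItemKind,
    error: Option<String>,
    chunks: usize,
    warnings: usize,
}

/// Toy extractor: one string per page. The real extractor parses PDF objects.
fn extract_pages(bytes: &[u8]) -> Result<Vec<String>, String> {
    if !bytes.starts_with(b"%PDF-") {
        return Err("not a PDF: missing %PDF- header".to_string());
    }
    let text = String::from_utf8_lossy(&bytes[5..]);
    if text.contains("/Encrypt") {
        // Placeholder wording; only the hint itself is taken from the spec.
        return Err("encrypted PDF: run `qpdf --decrypt` first".to_string());
    }
    Ok(text.split('\u{000C}').map(|p| p.trim().to_string()).collect())
}

/// Extract, then chunk. An extractor error becomes an `Error` item.
fn ingest_one(bytes: &[u8]) -> Item {
    match extract_pages(bytes) {
        Err(e) => Item { kind: ItemKind::Error, error: Some(e), chunks: 0, warnings: 0 },
        Ok(pages) => {
            // One chunk per non-empty page; an empty page yields only a warning.
            let chunks = pages.iter().filter(|p| !p.is_empty()).count();
            Item { kind: ItemKind::New, error: None, chunks, warnings: pages.len() - chunks }
        }
    }
}

/// Process every asset; a failing asset never aborts the run.
fn ingest_all(assets: &[&[u8]]) -> Vec<Item> {
    assets.iter().map(|bytes| ingest_one(bytes)).collect()
}
```

The point of the shape is that `ingest_one` returns a value for every outcome instead of propagating `Err`, which is what lets the caller count `errors` and continue.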
Chunker selection
- Per-medium chunker selection is the new architectural piece. Today the markdown branch hard-codes MdHeadingV1Chunker; the PDF branch hard-codes PdfPageV1Chunker. There is no runtime config switch: the medium → chunker mapping is compiled in.
- A future task (P+ "chunker registry") may make this configurable, at which point the mapping moves to Config::chunking.chunker_for_media. P7-3 deliberately does not introduce that config slot, because it would be premature config surface.
- config.chunking.chunker_version is a fingerprint, not a dispatcher. Markdown sets it to "md-heading-v1", PDF would set it to "pdf-page-v1", but in the current Config schema the field is single-valued and serves the markdown path only. Deviation logged in HOTFIXES: PDF ingest ignores config.chunking.chunker_version and hard-codes pdf-page-v1. A future P+ task either splits the config field per medium or builds the chunker registry above.
Determinism stress
- The existing markdown path's kb-normalize::build_canonical_document and the image path's ingest_one_image_asset share one OffsetDateTime::now_utc() reading per Provenance event group. P7-1's extractor already shares one now reading across its Discovered + Parsed + per-page Warning events. The PDF branch must not insert a second now() between extract and chunk. Chunking is a pure function of the CanonicalDocument, so this constraint is structural, not stylistic.
- Re-ingest of the same PDF bytes produces identical doc_id, identical block_ids (per-page deterministic via id_for_block(doc_id, "paragraph", &[], page-1, span)), and identical chunk_ids (P7-2's per-chunk policy_hash variant #c{char_start} makes chunk_id deterministic-and-collision-free across the document).
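The determinism claim boils down to "the id is a pure function of stable inputs, and `char_start` is one of those inputs". The sketch below illustrates that property only. It is not the project's id scheme: the real ids are blake3-based, whereas this uses a hand-written FNV-1a hash so the example has no dependencies, and the input layout (`doc_id|policy_hash#c{char_start}`) is an assumption.

```rust
/// 64-bit FNV-1a, used here only as a dependency-free stand-in for blake3.
fn fnv1a64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Hypothetical chunk id: no clock, no randomness, only the inputs.
/// Appending `#c{char_start}` separates chunks that share doc and policy.
fn chunk_id(doc_id: &str, policy_hash: &str, char_start: usize) -> String {
    let input = format!("{}|{}#c{}", doc_id, policy_hash, char_start);
    format!("{:016x}", fnv1a64(input.as_bytes()))
}
```

Calling `chunk_id` twice with the same arguments returns the same string, which is the property the re-ingest test relies on; two chunks of one document that start at different offsets hash different inputs.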
Failure semantics summary
| failure | counter | doc stored? | provenance |
|---|---|---|---|
| PdfTextExtractor::extract Err (corrupt header / not a PDF) | errors+=1 | no | n/a (no doc emitted) |
| PdfTextExtractor::extract Err (encrypted PDF) | errors+=1 | no | n/a; error message includes qpdf --decrypt hint |
| PdfPageV1Chunker::chunk Err (validates non-Page span / non-Paragraph block) | errors+=1 | no | n/a; defensive validation: fires on P7-1 contract drift OR future routing bug (e.g. a chunker registry mis-routes a markdown doc here) |
| Per-page text extraction Err (panic absorbed by catch_unwind) | unchanged | yes | Provenance::Warning (page N "scanned candidate"), emitted by P7-1, propagated through |
| Empty page (no /Contents stream) | unchanged | yes | Provenance::Warning (page N "empty (scanned candidate)"), emitted by P7-1 |
| Embedder::embed(...) Err (any chunk) | errors+=1 | yes (doc + chunk rows already written before embed call; see below) | n/a |
The embedding call sits after put_document / put_blocks / put_chunks in kebab-app::ingest_one_asset (markdown path, lines 615+), so a failed embed leaves doc + chunk rows on disk while no vector exists. This is consistent with the markdown path and accepted as v1 behaviour — re-running kebab ingest re-attempts the embed for any chunk whose embedding_id is missing from the vector store. Whole-asset rollback on embed-fail is a P+ task (atomic ingest transaction).
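The retry behaviour described above amounts to a set difference between the chunk rows on disk and the ids present in the vector store. A minimal sketch, assuming chunk ids are plain strings and using the hypothetical helper name `chunks_missing_vectors`:

```rust
use std::collections::HashSet;

/// Chunk ids that have a stored row but no vector yet, in stored order.
/// These are the chunks a re-run would need to embed again.
fn chunks_missing_vectors(stored: &[String], embedded: &HashSet<String>) -> Vec<String> {
    stored
        .iter()
        .filter(|id| !embedded.contains(*id))
        .cloned()
        .collect()
}
```

If an embed call failed after the first chunk of a three-chunk document, the helper returns the remaining two ids and leaves the already-embedded one alone.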
Storage / wire effects
- kebab-store-sqlite::documents table gains rows whose parser_version = "pdf-text-v1".
- kebab-store-sqlite::blocks table gains rows of block_kind = "paragraph" whose source_span is Page { page, char_start, char_end }. This is the first time the workspace stores blocks with SourceSpan::Page; any downstream reader that did not handle this variant is exposed.
- kebab-store-sqlite::chunks table gains rows whose chunker_version = "pdf-page-v1" AND whose source_spans[0] is SourceSpan::Page. Same exposure note for downstream readers.
- kebab-store-vector::chunk_embeddings_<model>_<dim> gains one vector per PDF chunk.
- IngestReport (wire ingest_report.v1) counters update naturally: scanned includes PDFs, new / updated track PDF docs, errors counts decode / encryption / corrupt failures, skipped continues to count audio / unknown formats.
- No new wire schemas or kebab-core types.
Citation surface
- Citation (search hits / RAG answers) already carries SourceSpan::Page. The CLI / wire layer must render it; the current kebab search JSON output passes source_span through verbatim, so no change is needed in this task. UI rendering of "page 12" labels for PDF citations is the responsibility of P9 (TUI / desktop) or whatever consumer is reading the wire.
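For a consumer of the wire output, turning a page span into a label is a single match. The sketch below is illustrative only: the field names follow the `Page { page, char_start, char_end }` shape quoted in this spec, but the field types, the `Other` placeholder variant, and the `citation_label` helper are assumptions, and the label format is not prescribed anywhere.

```rust
// Stand-in for kebab_core::SourceSpan; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum SourceSpan {
    Page { page: u32, char_start: usize, char_end: usize },
    /// Placeholder for the non-PDF variants, which this sketch does not model.
    Other,
}

/// Render a human-readable label for page spans; other spans get none here.
fn citation_label(span: &SourceSpan) -> Option<String> {
    match span {
        SourceSpan::Page { page, char_start, char_end } => {
            Some(format!("page {} (chars {}..{})", page, char_start, char_end))
        }
        SourceSpan::Other => None,
    }
}
```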
Test plan
| kind | description | fixture / data |
|---|---|---|
| integration | TempDir KB + 1 small text PDF (3 pages) → kebab ingest produces 1 doc + 3 chunks; each chunk's source_spans[0] is Page { page: i, .. }; chunks stored + embedded | kebab-app/tests/pdf_pipeline.rs (new), in-memory PDF via lopdf builder (mirrors kebab-parse-pdf::tests::common) |
| integration | Re-ingest same PDF (identical bytes) → identical doc_id and identical chunk_id set (P1 idempotency contract) | inline |
| integration | Edit a PDF (replace bytes: different blake3 → different asset_id → different doc_id) and re-ingest → new+=1 for the new doc_id; old doc_id row remains untouched (orphan handling is a P+ task) | inline |
| integration | Encrypted PDF → asset NOT stored; errors+=1; IngestItem.error mentions qpdf / decrypt | inline (lopdf builder + fake /Encrypt trailer) |
| integration | Corrupt header PDF → asset NOT stored; errors+=1; error message mentions PDF parse failure | inline |
| integration | Mixed page PDF (page 1 text, page 2 empty / scanned, page 3 text) → asset stored; 2 chunks (pages 1 + 3); doc.provenance.events contains exactly 1 Warning for page 2 marked "scanned candidate" | inline |
| integration | kebab inspect doc <pdf_doc_id> returns the PDF CanonicalDocument with per-page Block::Paragraph and SourceSpan::Page intact | inline |
| integration | Hybrid search across mixed corpus (1 markdown + 1 PDF) returns the PDF chunk for a query whose terms appear only in the PDF body | inline (real multilingual-e5-small embedding) |
| integration | IngestReport invariant scanned == new + updated + skipped + errors holds when ingesting a mixed corpus including a corrupt PDF | inline |
| integration | Long PDF (50 pages × ~1.5 KB body each = ~75 KB) produces ≥50 chunks (≥1 per page); embedding loop completes; storage round-trips | inline |
| smoke | Update docs/SMOKE.md so the runbook ingests at least one PDF fixture and verifies search-by-page-text works + inspect doc shows SourceSpan::Page | docs change |
The opt-in real-Ollama integration tests stay in kebab-llm-local / kebab-parse-image. P7-3 adds zero LM dependency — PDF text extraction is local-only, so there is no equivalent hermetic-vs-real split to manage.
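The IngestReport arithmetic invariant in the test plan is easy to state as code. This is a sketch, not the real type: the struct mirrors the five counter names from this spec, while `Outcome`, `record`, and `is_consistent` are hypothetical helpers introduced for illustration.

```rust
// Stand-in for the ingest_report.v1 counters named in this spec.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct IngestReport {
    scanned: u64,
    new: u64,
    updated: u64,
    skipped: u64,
    errors: u64,
}

// Hypothetical per-asset outcome.
#[derive(Debug, Clone, Copy)]
enum Outcome {
    New,
    Updated,
    Skipped,
    Error,
}

impl IngestReport {
    /// Count one asset: `scanned` always moves, plus exactly one bucket.
    fn record(&mut self, outcome: Outcome) {
        self.scanned += 1;
        match outcome {
            Outcome::New => self.new += 1,
            Outcome::Updated => self.updated += 1,
            Outcome::Skipped => self.skipped += 1,
            Outcome::Error => self.errors += 1,
        }
    }

    /// The invariant: scanned == new + updated + skipped + errors.
    fn is_consistent(&self) -> bool {
        self.scanned == self.new + self.updated + self.skipped + self.errors
    }
}
```

Because every recorded asset increments `scanned` and exactly one other counter, the invariant holds by construction; a corrupt PDF simply lands in `errors`.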
Definition of Done
Spec PR (this PR — spec/p7-3-pdf-ingest-wiring)
- tasks/p7/p7-3-pdf-ingest-wiring.md written + self-reviewed (placeholders / contradictions / ambiguity / scope)
- tasks/INDEX.md updated to reflect "P7 — 3 components"
- PR body links design §3.4, §3.5, §6.1, §9.2
- Deviations that need a HOTFIXES note (chunker selection; config.chunking.chunker_version ignored for PDF) are stated in the PR body
Implementation PR (follow-up — feat/p7-3-pdf-ingest-wiring)
- cargo check --workspace passes
- cargo test --workspace --no-fail-fast -j 1 passes (all new integration tests green)
- cargo clippy --workspace --all-targets -- -D warnings passes
- kebab ingest against a TempDir KB containing 1 markdown + 1 image + 1 PDF produces scanned 3 / new 3 / errors 0
- kebab search --mode hybrid "<text from PDF page 7>" returns a chunk whose source_span is Page { page: 7, .. }
- kebab inspect doc <pdf_doc_id> shows per-page Block::Paragraph with SourceSpan::Page
- HOTFIXES entry written for the config.chunking.chunker_version deviation (PDF ignores; hard-coded pdf-page-v1)
- docs/SMOKE.md includes a PDF-fixture step
Out of scope
- RAG kebab ask PDF citation: verifying that Answer::Citation::source_span = Page { ... } round-trips through the RAG pipeline is structurally a P4-3 (RAG pipeline) responsibility, not a P7-3 ingest-wiring responsibility. P4-3 already exercises the citation contract over markdown chunks; PDF chunks share the exact same Citation shape (the difference is only the source_span variant). A future PR can bolt on a "PDF chunks survive RAG citation" assertion to either P4-3's existing tests or a dedicated kebab-rag integration test. Bringing wiremock + RAG fixture infrastructure into kebab-app integration tests is out of proportion for the P7-3 invariant (which is "PDF chunks emerge from search with Page spans"). Captured here so reviewers can find this decision later.
- Scanned PDF OCR fallback: empty / extract-failed pages stay searchable=false in v1. A future task ("P+ scanned-PDF-ocr") routes those pages through P6-2's OllamaVisionOcr after rasterising the page via a PDF renderer (e.g. mupdf-rs, pdfium-render). Excluded here because (a) it requires a new system / Rust dep we don't have yet, and (b) the v1 user scenario is text-embed PDFs (papers, exported reports).
- Multi-column reading order / table extraction / formula detection / form-field extraction / bookmark or outline ingestion: all deferred to a future PDF-layout task. P7-1 already lists these as out of scope and the wiring inherits.
- Body multilingual via CID font support: handled at the parser layer (P7-1). UTF-16BE Title metadata works today; non-Latin body text depends on the PDF's font CID mapping.
- Per-medium chunker_version config: the current Config::chunking.chunker_version is single-valued and serves markdown only. PDF ingest ignores it (hard-codes pdf-page-v1). A future P+ task either splits the field per medium or introduces a chunker registry. Logged as a deviation in tasks/HOTFIXES.md once implementation lands.
- Chunker dispatch as a runtime registry: current dispatch is a compile-time match on MediaType. Adequate while the workspace has 3 chunkers (md-heading-v1, pdf-page-v1, future audio); a registry makes sense once the count grows.
- Parallelism beyond config.indexing.max_parallel_extractors: the existing knob applies. Per-PDF parallelism is a P+ scale-hardening task (a 500-page book produces ≥500 chunks; embedding throughput is the bottleneck, not extraction).
- Progress reporting (kebab ingest --progress): a 500-page book produces visible-but-silent work; the UX gap is acknowledged but is a P+ enhancement.
- Wire schema bump: no new wire types; ingest_report.v1 counters absorb PDF events naturally; search_hit.v1 already carries source_span polymorphically.
- kebab-store-sqlite schema migration for SourceSpan::Page columns: the existing blocks.source_span_kind / source_span_payload columns store the JSON discriminator polymorphically (per P1-6). PDF rows reuse the existing schema without alteration.
Risks / notes
- config.chunking.chunker_version becomes ambiguous. A user reading their config.toml sees chunker_version = "md-heading-v1" and reasonably assumes PDFs use the same. They don't. The implementation PR must either log a tracing::info! at ingest start ("PDF assets use chunker_version=pdf-page-v1 regardless of config.chunking.chunker_version") OR leave a TODO to address in the chunker-registry task. The HOTFIXES entry documents the deviation persistently.
- A 500-page book produces 500+ chunks in one transaction. The existing put_chunks call already loops chunks but the SQLite transaction boundary may need tuning. The implementation PR should benchmark and decide whether to chunk-batch the writes (e.g. 100 chunks per transaction) or trust the existing path. Not a correctness risk, only a throughput / WAL-size risk.
- Encrypted-PDF error message is the only operator-visible signal. The kebab ingest summary shows errors=1 and the IngestItem.error message; a user who didn't read the full output may miss the qpdf --decrypt hint. Acceptable in v1: a future kebab inspect ingest <run_id> (P9) renders structured per-asset errors. For now, ensure the test asserts the hint is preserved verbatim from P7-1's bail string.
- Determinism stress with now() calls. Same constraint as P6-4: the extract → chunk pipeline must not insert wall-clock reads between steps. P7-1 emits its own now() once for all per-page Provenance events; the PDF wiring branch must add no further now()s inside the per-asset path.
- pdf-page-v1 per-chunk hash variant #c{char_start} is opaque. Downstream tools comparing chunk_ids by exact match work fine (it's still a deterministic blake3 input). Tools attempting to derive a stable position from the chunk_id alone would fail; they must read chunk.source_spans[0].char_start. Documented in P7-2's HOTFIXES entry; cross-referenced here for findability.
- HEIC / RAW image and PDF/A subspecies share MediaType::Other. PDF/A is detected by kebab-source-fs as MediaType::Pdf (header is still %PDF-); PDF subtype variants do not branch separately. Acceptable for v1.
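If benchmarking does favour batched writes for large documents, the splitting itself is a one-liner over slices. A minimal sketch, assuming a hypothetical `write_batches` helper; how each batch maps onto a SQLite transaction is left to the implementation PR and is not shown here.

```rust
/// Split items into write batches of at most `batch_size` elements.
/// Each returned slice would be written inside its own transaction.
fn write_batches<T>(items: &[T], batch_size: usize) -> Vec<&[T]> {
    assert!(batch_size > 0, "batch_size must be positive");
    items.chunks(batch_size).collect()
}
```

For a 520-chunk book and a batch size of 100 this yields five full batches and a final batch of 20, so a failure mid-document leaves at most one partial batch to redo.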