feat(wire): additive minor, pdf_ocr_* IngestEvent kinds + ingest_report.items[] pdf_ocr_pages/ms_total + skipped field carry (Step 6 M-4/M-2)

Implements Step 7 (Group G) of the v0.20.0 sub-item 1 (scanned PDF OCR)
plan and fixes two Step 6 code-review findings: Important M-4 (skipped
field carry) and Minor M-2 (ordering invariant doc).

G3: JSON Schema sync (additive minor; schema_version preserved):

ingest_progress.schema.json:
- kind enum: 2 values added, pdf_ocr_started + pdf_ocr_finished.
- New fields: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- Updated the descriptions of the existing ms / chars fields (now also carried by pdf_ocr_finished).
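
For illustration, a pdf_ocr_finished event carrying the fields listed above
could be rendered as below. This is a minimal sketch, not the project's
serializer: the helper name pdf_ocr_finished_json is hypothetical, the JSON is
hand-rolled instead of using whatever serialization the crate actually uses,
and any other envelope fields the schema may require are left out.

```rust
/// Hypothetical helper: render a `pdf_ocr_finished` event as one JSON object
/// using only the fields named in this commit. Other envelope fields, if the
/// schema requires any, are deliberately omitted from this sketch.
pub fn pdf_ocr_finished_json(
    page: u32,
    ms: u64,
    chars: u32,
    ocr_engine: &str,
    skipped: bool,
) -> String {
    // Minimal escaping for the only string field; enough for plain engine names.
    let engine = ocr_engine.replace('\\', "\\\\").replace('"', "\\\"");
    format!(
        "{{\"kind\":\"pdf_ocr_finished\",\"page\":{page},\"ms\":{ms},\"chars\":{chars},\"ocr_engine\":\"{engine}\",\"skipped\":{skipped}}}"
    )
}
```

A skipped page and a page whose OCR returned no text differ only in the
skipped value, which is why the field is carried on the wire.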

ingest_report.schema.json:
- items.items.properties newly defined (previously only the stub type ["array", "null"]).
- pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- All existing IngestItem fields are now spelled out as well (kind, doc_path, byte_len, ...).

Step 6 reviewer M-4 (Important): skipped field carry:
- Added skipped: bool to IngestEvent::PdfOcrFinished.
- The emit closure in ingest_one_pdf_asset (lib.rs:~1864) now propagates
  skipped from the source PdfOcrProgress::Finished { skipped } instead of
  discarding it.
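
The carry described above can be sketched in isolation. The two enums below are
trimmed-down, hypothetical mirrors of PdfOcrProgress and IngestEvent (the
Started variant on PdfOcrProgress and the helper names to_event and
is_empty_ocr_result are assumptions made for this example, not code from the
repository).

```rust
// Hypothetical, trimmed-down mirrors of the two types named in this commit.
// Only the fields relevant to the `skipped` carry are modelled.
#[derive(Debug, Clone, PartialEq)]
pub enum PdfOcrProgress {
    Started { page: u32 },
    Finished { page: u32, ms: u64, chars: u32, skipped: bool },
}

#[derive(Debug, Clone, PartialEq)]
pub enum IngestEvent {
    PdfOcrStarted { page: u32 },
    PdfOcrFinished { page: u32, ms: u64, chars: u32, ocr_engine: String, skipped: bool },
}

/// Map a source progress value to a wire event, binding `skipped`
/// (rather than ignoring it with `skipped: _`) so it reaches consumers.
pub fn to_event(p: PdfOcrProgress, engine_name: &str) -> IngestEvent {
    match p {
        PdfOcrProgress::Started { page } => IngestEvent::PdfOcrStarted { page },
        PdfOcrProgress::Finished { page, ms, chars, skipped } => IngestEvent::PdfOcrFinished {
            page,
            ms,
            chars,
            ocr_engine: engine_name.to_string(),
            skipped,
        },
    }
}

/// Consumer-side check: a page counts as "OCR ran but found no text"
/// only when it was not skipped.
pub fn is_empty_ocr_result(ev: &IngestEvent) -> bool {
    matches!(ev, IngestEvent::PdfOcrFinished { chars: 0, skipped: false, .. })
}
```

With skipped dropped at the mapping step, both cases would reach the consumer
as chars = 0 and could not be told apart.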

Step 6 reviewer M-2 (Minor): ordering invariant doc:
- Updated the ordering text in crates/kebab-app/src/ingest_progress.rs:
  ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
  PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).
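
Read literally, the invariant above is a small regular grammar, so a test
helper could check an event stream with a state machine. The sketch below is
an assumption-laden illustration: Kind, State and check_order are hypothetical
names, the reserved embed-batch events are not modelled, and Completed /
Aborted are accepted only between assets, exactly as the text is written.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    ScanStarted,
    ScanCompleted,
    AssetStarted,
    PdfOcrStarted,
    PdfOcrFinished,
    AssetFinished,
    Completed,
    Aborted,
}

#[derive(Clone, Copy)]
enum State {
    Start,    // nothing seen yet
    Scanning, // after ScanStarted
    Idle,     // between assets
    InAsset,  // after AssetStarted, no OCR page in flight
    InOcr,    // after PdfOcrStarted, awaiting PdfOcrFinished
    Done,     // after Completed or Aborted
}

/// Returns Ok(()) if `events` is a complete run matching the grammar,
/// Err(i) with the index of the first offending event, or
/// Err(events.len()) if the stream ends before Completed / Aborted.
pub fn check_order(events: &[Kind]) -> Result<(), usize> {
    use Kind::*;
    let mut state = State::Start;
    for (i, ev) in events.iter().enumerate() {
        state = match (state, *ev) {
            (State::Start, ScanStarted) => State::Scanning,
            (State::Scanning, ScanCompleted) => State::Idle,
            (State::Idle, AssetStarted) => State::InAsset,
            (State::InAsset, PdfOcrStarted) => State::InOcr,
            (State::InOcr, PdfOcrFinished) => State::InAsset,
            (State::InAsset, AssetFinished) => State::Idle,
            (State::Idle, Completed) | (State::Idle, Aborted) => State::Done,
            _ => return Err(i),
        };
    }
    match state {
        State::Done => Ok(()),
        _ => Err(events.len()),
    }
}
```

The InOcr state is what enforces that every PdfOcrStarted is closed by a
PdfOcrFinished before the enclosing asset finishes.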

No .md docs exist under docs/wire-schema/v1/*.md, so the .md deliverable
of plan §3 Step 7 G3 is retroactively N/A (0 such files).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: b9ee09f (Step 6 wiring) + Step 6 reviewer M-4/M-2 recommendations
contract: §9 (additive minor wire bump; schema_version preserved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 08:51:51 +00:00
parent b9ee09f176
commit 4c5ccd5447
4 changed files with 39 additions and 5 deletions

@@ -46,10 +46,13 @@ pub struct AggregateCounts {
/// Ordering invariant per design §2.4a:
///
/// ```text
-/// ScanStarted < ScanCompleted < (AssetStarted < AssetFinished)*
-/// < (Completed | Aborted)
+/// ScanStarted < ScanCompleted
+/// < (AssetStarted [< (PdfOcrStarted < PdfOcrFinished)*] < AssetFinished)*
+/// < (Completed | Aborted)
/// ```
///
+/// `[]` = optional, per-PDF asset only (v0.20.0 sub-item 1).
+///
/// Embed-batch events (`embed_batch_started` / `embed_batch_finished`
/// in §2.4a) are reserved for a future iteration and are not emitted
/// by this task; the spec calls them out as "임의 위치" ("arbitrary position", i.e. optional).
@@ -88,11 +91,14 @@ pub enum IngestEvent {
/// Emitted when OCR starts for a PDF page. v0.20.0 sub-item 1.
PdfOcrStarted { page: u32 },
/// Emitted when OCR finishes for a PDF page. v0.20.0 sub-item 1.
+/// When `skipped` is `true`, OCR was not performed (DCTDecode absent or engine failure).
+/// `chars = 0` alone cannot distinguish a skip from a 0-char OCR result; the `skipped` field makes it explicit.
PdfOcrFinished {
page: u32,
ms: u64,
chars: u32,
ocr_engine: String,
+skipped: bool,
},
}

@@ -1865,7 +1865,7 @@ fn ingest_one_pdf_asset(
page,
ms,
chars,
-skipped: _,
+skipped,
} => {
if let Some(sender) = progress {
let _ = sender.send(
@@ -1874,6 +1874,7 @@ fn ingest_one_pdf_asset(
ms,
chars,
ocr_engine: engine.engine_name().to_string(),
+skipped,
},
);
}

@@ -16,6 +16,8 @@
"asset_finished",
"embed_batch_started",
"embed_batch_finished",
+"pdf_ocr_started",
+"pdf_ocr_finished",
"completed",
"aborted"
]
@@ -33,7 +35,11 @@
},
"chunks": { "type": "integer", "minimum": 0, "description": "asset_finished: chunk count produced for this asset." },
"n_chunks": { "type": "integer", "minimum": 0, "description": "embed_batch_started / embed_batch_finished: chunks in this embedding batch." },
-"ms": { "type": "integer", "minimum": 0, "description": "embed_batch_finished: wall-clock duration of the batch." },
+"ms": { "type": "integer", "minimum": 0, "description": "embed_batch_finished / pdf_ocr_finished: wall-clock duration (ms). On the pdf_ocr_finished skip path the meaning is mixed (0 when DCTDecode is absent; latency-before-bail on engine failure)." },
+"chars": { "type": "integer", "minimum": 0, "description": "pdf_ocr_finished: char count of OCR result. 0 when skipped." },
+"page": { "type": "integer", "minimum": 1, "description": "pdf_ocr_started / pdf_ocr_finished: 1-based PDF page number under OCR." },
+"ocr_engine": { "type": "string", "description": "pdf_ocr_finished: engine_name (e.g. 'ollama-vision')." },
+"skipped": { "type": "boolean", "description": "pdf_ocr_finished: true means OCR was not performed (DCTDecode absent or engine failure). chars=0 alone cannot distinguish a skip from a 0-char result." },
"counts": {
"type": "object",
"description": "completed / aborted: aggregate counters at the moment the run ended (mirrors fields on `ingest_report.v1`).",

@@ -38,7 +38,28 @@
},
"description": "p9-fb-25: per-extension skip count. Key = lowercase extension without leading dot (e.g. 'docx'). Files without extension key under '<no-ext>'."
},
-"items": { "type": ["array", "null"] },
+"items": {
+"type": ["array", "null"],
+"items": {
+"type": "object",
+"required": ["kind", "doc_path"],
+"properties": {
+"kind": { "type": "string", "enum": ["new", "updated", "skipped", "unchanged", "error"] },
+"doc_id": { "type": ["string", "null"] },
+"doc_path": { "type": "string" },
+"asset_id": { "type": ["string", "null"] },
+"byte_len": { "type": ["integer", "null"], "minimum": 0 },
+"block_count": { "type": ["integer", "null"], "minimum": 0 },
+"chunk_count": { "type": ["integer", "null"], "minimum": 0 },
+"parser_version": { "type": ["string", "null"] },
+"chunker_version": { "type": ["string", "null"] },
+"warnings": { "type": "array", "items": { "type": "string" } },
+"pdf_ocr_pages": { "type": ["integer", "null"], "minimum": 0, "description": "v0.20.0 sub-item 1: number of PDF pages that went through the OCR pipeline. null = OCR disabled or non-PDF asset." },
+"pdf_ocr_ms_total": { "type": ["integer", "null"], "minimum": 0, "description": "v0.20.0 sub-item 1: cumulative OCR engine wall-clock duration (ms). null = OCR disabled or non-PDF asset." },
+"error": { "type": ["string", "null"] }
+}
+}
+},
"skipped_gitignore": { "type": "integer", "minimum": 0 },
"skipped_kebabignore": { "type": "integer", "minimum": 0 },
"skipped_builtin_blacklist": { "type": "integer", "minimum": 0 },