Step 7 (Group G) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 6 code reviewer Important M-4 (skipped field carry) + Minor M-2
(ordering invariant doc) fix.
G3: JSON Schema sync (additive minor; schema_version preserved):
ingest_progress.schema.json:
- Add 2 kind enum values: pdf_ocr_started + pdf_ocr_finished.
- New fields: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- Update the descriptions of the existing ms / chars fields (pdf_ocr_finished now carries them too).
ingest_report.schema.json:
- Define items.items.properties (previously only a ["array", "null"] stub).
- Add pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- Spell out all existing IngestItem fields as well (kind, doc_path, byte_len, ...).
Step 6 reviewer M-4 (Important): skipped field carry:
- Add skipped: bool to IngestEvent::PdfOcrFinished.
- The emit closure in ingest_one_pdf_asset (lib.rs:~1864) now propagates
  skipped from the source PdfOcrProgress::Finished { skipped } instead of
  discarding it.
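To illustrate why the carried field matters on the consumer side, here is a minimal Python sketch that tells a skipped page apart from a genuine 0-char OCR result. It is not part of this change: classify_ocr_result is a hypothetical helper, the sample field values are made up, and treating a missing skipped as false is an assumption.

```python
import json


def classify_ocr_result(line: str) -> str:
    """Return 'skipped', 'empty', or 'text' for one pdf_ocr_finished line."""
    event = json.loads(line)
    if event.get("kind") != "pdf_ocr_finished":
        raise ValueError(f"unexpected kind: {event.get('kind')!r}")
    # Assumption: an absent `skipped` (older producer) is treated as False.
    if event.get("skipped", False):
        return "skipped"
    # Not skipped: chars == 0 now really means the engine returned no text.
    return "empty" if event.get("chars", 0) == 0 else "text"


# Sample events; every field value here is made up for illustration.
base = {"schema_version": "ingest_progress.v1", "kind": "pdf_ocr_finished",
        "ts": "2026-05-27T00:00:00Z", "page": 1, "ocr_engine": "ollama-vision"}
skipped_line = json.dumps({**base, "ms": 0, "chars": 0, "skipped": True})
empty_line = json.dumps({**base, "ms": 420, "chars": 0, "skipped": False})

print(classify_ocr_result(skipped_line))  # skipped
print(classify_ocr_result(empty_line))    # empty
```

Both sample events have chars=0, so only the skipped flag separates them.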
Step 6 reviewer M-2 (Minor): ordering invariant doc:
- Update the ordering text in crates/kebab-app/src/ingest_progress.rs:
  ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
  PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).
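The ordering above can be read as a regular grammar over event kinds. The following Python sketch checks a captured sequence of wire kind values against it. It is only an illustration: follows_ordering is a hypothetical helper, and two assumptions are made, namely that embed_batch_* events (absent from the ordering text) may appear anywhere and are ignored, and that the ordering is read strictly, so a run aborted in the middle of an asset would be rejected.

```python
import re

# One letter per event kind so the ordering can be written as a regex.
_CODES = {
    "scan_started": "S",
    "scan_completed": "C",
    "asset_started": "A",
    "pdf_ocr_started": "O",
    "pdf_ocr_finished": "P",
    "asset_finished": "F",
    "completed": "D",
    "aborted": "X",
}
# Assumption: embed batch events are not constrained by the ordering text.
_IGNORED = {"embed_batch_started", "embed_batch_finished"}
# ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
# PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted)
_ORDER = re.compile(r"SC(A(OP)*F)*[DX]")


def follows_ordering(kinds) -> bool:
    """Return True if the sequence of `kind` values matches the ordering."""
    codes = []
    for kind in kinds:
        if kind in _IGNORED:
            continue
        code = _CODES.get(kind)
        if code is None:
            return False  # unknown kind
        codes.append(code)
    return _ORDER.fullmatch("".join(codes)) is not None


run = ["scan_started", "scan_completed", "asset_started",
       "pdf_ocr_started", "pdf_ocr_finished", "asset_finished", "completed"]
print(follows_ordering(run))  # True
```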
No .md docs exist (docs/wire-schema/v1/*.md), so the .md deliverable from
plan §3 Step 7 G3 is retroactively N/A (0 matching files).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: b9ee09f (Step 6 wiring) + Step 6 reviewer M-4/M-2 recommendations
contract: §9 (additive minor wire bump; schema_version preserved)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://kb.local/wire/v1/ingest_progress.schema.json",
  "title": "IngestProgressEvent v1",
  "description": "Streaming progress event emitted by `kebab ingest --json`. One event per line (line-delimited JSON). Discriminated by `kind`. The terminal events are `completed` and `aborted`; every ingest run ends with exactly one of them. The final stdout line of a `--json` ingest is still the existing `ingest_report.v1` for backwards compatibility; progress events stream above it.",
  "type": "object",
  "required": ["schema_version", "kind", "ts"],
  "properties": {
    "schema_version": { "const": "ingest_progress.v1" },
    "kind": {
      "type": "string",
      "enum": [
        "scan_started",
        "scan_completed",
        "asset_started",
        "asset_finished",
        "embed_batch_started",
        "embed_batch_finished",
        "pdf_ocr_started",
        "pdf_ocr_finished",
        "completed",
        "aborted"
      ]
    },
    "ts": { "type": "string", "format": "date-time", "description": "RFC 3339 timestamp at the moment the event was emitted." },
    "root": { "type": "string", "description": "scan_started: workspace root being walked." },
    "total": { "type": "integer", "minimum": 0, "description": "scan_completed / asset_started / asset_finished: total assets discovered." },
    "idx": { "type": "integer", "minimum": 1, "description": "asset_started / asset_finished: 1-based index of the current asset within the scan." },
    "path": { "type": "string", "description": "asset_started: workspace-relative path of the asset being processed." },
    "media": { "type": "string", "description": "asset_started: media kind label (e.g. `markdown`, `pdf`, `image`)." },
    "result": {
      "type": "string",
      "enum": ["new", "updated", "skipped", "error"],
      "description": "asset_finished: per-asset outcome (mirrors `ingest_report.v1.items[].kind`)."
    },
    "chunks": { "type": "integer", "minimum": 0, "description": "asset_finished: chunk count produced for this asset." },
    "n_chunks": { "type": "integer", "minimum": 0, "description": "embed_batch_started / embed_batch_finished: chunks in this embedding batch." },
    "ms": { "type": "integer", "minimum": 0, "description": "embed_batch_finished / pdf_ocr_finished: wall-clock duration (ms). On the pdf_ocr_finished skip path the meaning is mixed: 0 when DCTDecode is absent, latency before bail-out when the engine fails." },
    "chars": { "type": "integer", "minimum": 0, "description": "pdf_ocr_finished: character count of the OCR result. 0 when skipped." },
    "page": { "type": "integer", "minimum": 1, "description": "pdf_ocr_started / pdf_ocr_finished: 1-based PDF page number under OCR." },
    "ocr_engine": { "type": "string", "description": "pdf_ocr_finished: engine_name (e.g. 'ollama-vision')." },
    "skipped": { "type": "boolean", "description": "pdf_ocr_finished: true when OCR was not performed (DCTDecode absent or engine failure). chars=0 alone cannot distinguish a skip from a 0-char result." },
    "counts": {
      "type": "object",
      "description": "completed / aborted: aggregate counters at the moment the run ended (mirrors fields on `ingest_report.v1`).",
      "properties": {
        "scanned": { "type": "integer", "minimum": 0 },
        "new": { "type": "integer", "minimum": 0 },
        "updated": { "type": "integer", "minimum": 0 },
        "skipped": { "type": "integer", "minimum": 0 },
        "errors": { "type": "integer", "minimum": 0 },
        "chunks_indexed": { "type": "integer", "minimum": 0 },
        "embeddings_indexed": { "type": "integer", "minimum": 0 }
      }
    }
  }
}
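As a rough illustration of what the schema requires of each event, the Python sketch below hand-checks a parsed event against a few of its constraints (required fields, the schema_version const, the kind enum, and the types of page and skipped). It is a simplified stand-in for a real JSON Schema validator and does not cover the whole schema; check_event is a hypothetical helper and the sample timestamps are made up.

```python
KINDS = {
    "scan_started", "scan_completed", "asset_started", "asset_finished",
    "embed_batch_started", "embed_batch_finished",
    "pdf_ocr_started", "pdf_ocr_finished", "completed", "aborted",
}
REQUIRED = ("schema_version", "kind", "ts")


def check_event(event: dict) -> list:
    """Return a list of human-readable problems; empty means no problem found."""
    problems = []
    for key in REQUIRED:
        if key not in event:
            problems.append(f"missing required field: {key}")
    if "schema_version" in event and event["schema_version"] != "ingest_progress.v1":
        problems.append("schema_version must be ingest_progress.v1")
    if "kind" in event and event["kind"] not in KINDS:
        problems.append(f"unknown kind: {event['kind']!r}")
    if "page" in event:
        page = event["page"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(page, int) or isinstance(page, bool) or page < 1:
            problems.append("page must be an integer >= 1")
    if "skipped" in event and not isinstance(event["skipped"], bool):
        problems.append("skipped must be a boolean")
    return problems


good = {"schema_version": "ingest_progress.v1", "kind": "pdf_ocr_started",
        "ts": "2026-05-27T00:00:00Z", "page": 3}
bad = {"schema_version": "ingest_progress.v1", "kind": "pdf_ocr_started", "page": 0}

print(check_event(good))  # []
print(check_event(bad))   # ['missing required field: ts', 'page must be an integer >= 1']
```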