415227bf76
test(app): ingest_log_smoke integration test (AC-9)
...
New file crates/kebab-app/tests/ingest_log_smoke.rs:
* ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF →
ingest → assert that the log file exists, every line is valid
JSON, every kind ∈ {ocr,parse_error,skip,error,summary}, the
last line has kind=summary, and scanned>0.
* ingest_log_disabled_emits_no_file (AC-6): with enabled=false,
verify that log_dir contains zero ingest-*.ndjson files.
fixture: reuses ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf
(OCR disabled, so the smoke test runs without Ollama).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:06:43 +00:00
f9dc0f749f
feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc<Mutex> sync)
...
Ingest pipeline wiring for the v0.20.x ingest log feature. 5 emit hooks:
Hook 1: ingest_with_config_opts entry/exit (writer init + summary write + flush)
Hook 2: apply_ocr_to_pdf_pages closure (PdfOcrProgress::Finished → LogEvent::Ocr)
Hook 3: ingest_one_*_asset Err arm (LogEvent::Error)
Hook 4: enumerate fs_skips.events right after scan (LogEvent::Skip)
Hook 5: (merged into Hook 3) per-asset fatal error → LogEvent::Error
To carry the skip events for Hook 4, add an events: Vec<FsSkipEvent>
field to FsScanSkips in kebab-source-fs (kebab-source-fs does not call
back into kebab-app, which avoids a dependency cycle).
Ownership: a single Option<Arc<Mutex<IngestLogWriter>>> binding; each of
the 5 hooks does clone+lock+write. ocr_ms_samples (Vec<u64>, success-only)
is shared via Arc<Mutex>, and the summary stage sorts it and computes
p50/p90/max. The per-asset loop is single-threaded, so there is no
deadlock or contention risk.
A writer failure does not fail the ingest itself (tracing::warn, then continue).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:05:07 +00:00
bef0c98867
feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields
...
Wire side of the v0.20.x ingest log feature. Additive minor cascade:
* 4 new fields on PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished:
- image_byte_size: Option<u64>
- image_width: Option<u32>
- image_height: Option<u32>
- failure_reason: Option<String>
* docs/wire-schema/v1/ingest_progress.schema.json: 4 added properties
(all optional, required unchanged = additive minor)
* integrations/claude-code/kebab/SKILL.md: wire schema description synced
Existing ingest_progress.v1 consumers (CLI wire dump, integration test
fixture, kebab-cli wire_search/wire_ask) stay backward-compatible because
the 4 added fields default to Option::None. No version bump (an additive
minor is not a binary-version cascade trigger per CLAUDE.md §Versioning cascade).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:57:59 +00:00
f8a4c79727
feat(app): IngestLogWriter + LogEvent enum (per-ingest-run ndjson log)
...
Module side of the v0.20.x ingest log surface. New file
crates/kebab-app/src/ingest_log.rs:
* IngestLogWriter: open/write_event/write_summary/flush + flush on Drop
* LogEvent enum, 4 variants (ocr / parse_error / skip / error)
* IngestSummary struct (kind="summary" literal + 11 stat fields)
* generate_run_id (ISO 8601 prefix + last 8 hex digits of a uuid v7)
* expand_log_dir (hand-rolled expansion of the {state_dir} placeholder)
* now_ts (Rfc3339 UTC helper)
* percentiles helper (sorted Vec p50/p90/max)
uuid v7 is already a workspace dep, which avoids adding rand as a new
dependency (spec §6 R-5).
This step is a self-contained writer + 5 unit tests. Wiring the 5 emit
hooks into the ingest pipeline is step 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:53:09 +00:00
f60304beb4
feat(config): add [logging] section (ingest_log_enabled + ingest_log_dir)
...
Config side of the v0.20.x ingest log surface. New LoggingCfg struct:
* ingest_log_enabled (bool, default true)
* ingest_log_dir (PathBuf, default "{state_dir}/logs")
With the #[serde(default)] attribute, a pre-v0.20 config that lacks the
[logging] section is initialized with LoggingCfg::default() automatically
(AC-10 backward compat).
The actual expansion of the {state_dir} placeholder is handled by the
expand_log_dir helper in step 2 (IngestLogWriter); expand_path_with_base
in kebab-config does not support {state_dir} (spec §6 R-3).
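A hand-rolled placeholder expansion of the kind mentioned above could look
like the sketch below. The signature is an assumption; the real
expand_log_dir helper may take different arguments or handle more cases.

```rust
use std::path::{Path, PathBuf};

/// Replaces a literal `{state_dir}` placeholder with the given directory.
/// Text without the placeholder is returned unchanged.
/// Hypothetical signature, for illustration only.
fn expand_log_dir(raw: &str, state_dir: &Path) -> PathBuf {
    PathBuf::from(raw.replace("{state_dir}", &state_dir.to_string_lossy()))
}
```

With the default value "{state_dir}/logs", the result is the `logs`
directory under whatever state directory the caller passes in.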
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:44:21 +00:00
6a9551e0fa
fix(config): pdf.ocr.request_timeout_secs default 60 → 180 (Bug #11 follow-up)
...
In the Round 3 final dogfood (2026-05-28), the 60s default forced OCR
timeouts on dense Korean pages (metro-korea.pdf page 8/9/13), losing 1 more
indexed page than round 2. From the user's perspective, at 60s the cost vs
coverage trade-off cuts too deeply into coverage.
Adopt a gradual-reduction policy toward the sweet spot: start from a
conservative 180s and reduce step by step based on dogfood evidence
(distribution of mean OCR ms). Do not jump straight to a short default
such as 60s.
- crates/kebab-config/src/lib.rs::default_pdf_ocr_request_timeout_secs() = 180
- unit test rename (_is_60s → _is_180s) + assertion 180
- crates/kebab-config/tests/pdf_ocr.rs assert_eq 180
- tasks/HOTFIXES.md: add 2026-05-28 follow-up entry
The user override path is preserved: users can tune the value directly
with config.toml [pdf.ocr] request_timeout_secs = N.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:40:23 +00:00
46e99470eb
docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md
...
Artifacts of the 3-round dogfood-driven fix cycle:
- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the full end-to-end dogfood checklist (§0 environment through §13 reference corpus)
Each round's spec/plan was frozen after critic + verifier round 2 closure ACCEPT, based on dogfood-driven evidence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00
9b44e27dfe
test(app): update schema_report assertion for streaming_ask=true (Bug #9 follow-up)
...
schema_report_reflects_freshly_ingested_kb asserted `!streaming_ask`, but
the Bug #9 fix (760eee8) corrected streaming_ask to true. Invert the assertion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:58:10 +00:00
854a180365
fix(cli): add active_parsers + active_chunkers to Models test fixture in wire.rs (Bug #13)
...
The Models struct extension in Step 4 (active_parsers / active_chunkers added)
missed the test fixture initialization in crates/kebab-cli/src/wire.rs → E0063 compile error.
#[serde(default)] applies only to serde deserialization; a struct literal still needs every field.
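One way to keep such fixtures compiling when fields are added later is struct
update syntax. The sketch below assumes a trimmed-down `Models` that derives
Default; whether the real struct derives Default is not stated in this commit,
and the version strings are placeholders.

```rust
/// Hypothetical, reduced stand-in for the Models struct.
#[derive(Debug, Default, PartialEq)]
struct Models {
    parser_version: String,
    chunker_version: String,
    active_parsers: Vec<String>,
    active_chunkers: Vec<String>,
}

/// Test-fixture constructor that names only the fields the test cares about.
fn fixture_models() -> Models {
    Models {
        parser_version: "parser-x".to_string(),
        chunker_version: "chunker-x".to_string(),
        // Fields not listed here (including ones added in the future)
        // take their Default value, so the literal keeps compiling.
        ..Default::default()
    }
}
```

The trade-off is that a newly added field is silently defaulted in fixtures,
so this suits tests that do not assert on the new field.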
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:18:27 +00:00
5bba95fd71
docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation
...
Frozen-spec deviation handoff for Bug #11 (the earlier commit
`fix(config): pdf.ocr.request_timeout_secs default 600 → 60`).
- tasks/HOTFIXES.md: subsection dated 2026-05-27 in the 5-field format
Discovered / Symptom / Root cause / Fix / Amends (matches existing entries).
- docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md: 2 inline HTML
comment lines as cross-links, at PDF OCR config block line 1000 (default
value) and at OQ-1 line 1628.
No prose changes: the parent spec's frozen contract is preserved, and HTML
comments are invisible when the markdown is rendered.
The HOTFIXES entry is the live source of truth (CLAUDE.md "Spec contract" rule).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:16:18 +00:00
2c7fa7142a
fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14)
...
Before: `kebab search "" --json` / `kebab search " " --json` / `kebab ask "" --json`
all returned exit=0 with either silent 0 hits (search) or an empty-prompt
LLM round-trip (ask). User mistakes (typos, shell expansion slips) stayed
silent, which raised debugging cost.
After: in both arms, `query.trim().is_empty()` → kebab_app::StructuredError
(ErrorV1, code=invalid_input, with hint). exit=2 (StructuredError → the
existing generic non-zero path of exit_code()).
--bulk mode is unaffected (the bulk arm ignores query).
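The guard described above can be sketched as follows. `InvalidInput` and
`validate_query` are hypothetical names used only for illustration; the real
code returns kebab_app::StructuredError, whose shape is not shown in this
commit, and the hint text here is made up.

```rust
/// Hypothetical error type standing in for the structured error.v1 payload.
#[derive(Debug, PartialEq)]
struct InvalidInput {
    code: &'static str,
    hint: &'static str,
}

/// Rejects queries that are empty or whitespace-only before any search or
/// LLM round-trip happens; otherwise passes the query through unchanged.
fn validate_query(query: &str) -> Result<&str, InvalidInput> {
    if query.trim().is_empty() {
        return Err(InvalidInput {
            code: "invalid_input",
            hint: "query must contain at least one non-whitespace character",
        });
    }
    Ok(query)
}
```

Calling the same check from both the search and the ask arm keeps the two
commands consistent.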
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:16:08 +00:00
d9c7aabce1
feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13)
...
Before: schema.v1.models reported only the single values parser_version /
chunker_version → risk of gaps in a version cascade audit of a multi-medium
corpus (md + pdf + code Rust/Python + dockerfile + k8s + manifest).
After: additive minor. Add active_parsers + active_chunkers (Vec<String>)
to the Models struct. Backward compat: the existing single fields are kept
(markdown default), and the new arrays are optional (#[serde(default)] +
not listed in JSON schema required).
source:
- kebab_store_sqlite::fetch_distinct_parser_versions() returns
documents.parser_version with DISTINCT + ORDER BY.
- fetch_distinct_chunker_versions() follows the same pattern for chunks.chunker_version.
- collect_models recomputes on every schema call (no cache, which resolves R-3 automatically).
The wire schema change is additive only, so no major bump is needed; a v0.20.1 minor is enough.
integrations/claude-code/kebab/SKILL.md updated in sync.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:15:58 +00:00
10b0e2f4f2
fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11)
...
metro-korea.pdf v0.20 final-dogfood (2026-05-27):
- page 8 and page 13 both ran to a full timeout at the 600s default
(`ms: 600000, chars: 0, skipped: true`)
- result: body text not indexed + 20 min of cost wasted per page
Measured per-page throughput on cloud GPU Ollama is 6-32s (much faster than
the 105s assumed by the parent spec). 60s is a production-friendly upper
bound. For dense / high-resolution pages, users can raise it with a
config.toml override (`[pdf.ocr] request_timeout_secs = N`). HOTFIXES +
parent spec cross-link follow in Step 6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:15:21 +00:00
28f513795e
fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10)
...
Before: `kebab search "rust" --config /tmp/nonexistent.toml --json` returned
exit=0 + `{"hits":[]}`, a silent fallback to the XDG default. A typo or wrong
path surfaced only as 0 hits, which is a debugging nightmare.
After: add kebab_config::ConfigNotFound (thiserror::Error); the
`Some(p) if !p.exists()` arm of Config::load returns
anyhow::Error::new(ConfigNotFound { path }). kebab_app::error_wire::classify
downcasts it → ErrorV1 code=config_not_found, fills hint and details.path,
and emits it to stderr as ndjson.
R-1 (relative path): std::path::Path::exists() is cwd-relative, so absolute
and relative paths are both covered with no extra work. Verified by two integration tests.
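A std-only sketch of the guard arm described above. The real code uses
thiserror and anyhow; here the error trait is implemented by hand, and
`resolve_config_path` is a hypothetical helper rather than the actual
Config::load.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Typed error so a caller can distinguish "explicit path is missing"
/// from other load failures.
#[derive(Debug, PartialEq)]
struct ConfigNotFound {
    path: PathBuf,
}

impl fmt::Display for ConfigNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "config file not found: {}", self.path.display())
    }
}

impl std::error::Error for ConfigNotFound {}

/// Picks the config file to load. An explicit path that does not exist is
/// an error instead of a silent fallback to the default location.
fn resolve_config_path(
    explicit: Option<&Path>,
    default_path: &Path,
) -> Result<PathBuf, ConfigNotFound> {
    match explicit {
        // Path::exists resolves relative paths against the current directory.
        Some(p) if !p.exists() => Err(ConfigNotFound { path: p.to_path_buf() }),
        Some(p) => Ok(p.to_path_buf()),
        None => Ok(default_path.to_path_buf()),
    }
}
```

Only the explicit-path case fails hard; omitting the flag still falls back to
the default location.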
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:14:54 +00:00
760eee89c8
fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9)
...
capabilities_snapshot() reported streaming_ask + single_file_ingest as
hardcoded false, but the actual implementation was production-grade in the
v0.20 final-dogfood:
- kebab ask --stream → answer_event.v1 ndjson, 191 events emitted correctly
- kebab ingest-file <path> / kebab ingest-stdin --title <T> → ingest_report.v1 correct
When agents such as the MCP host + Claude Code skill route on
schema.capabilities, this is a false negative → users wrongly conclude that
working features are unavailable.
http_daemon stays false (not implemented; it belongs to a separate sub-item).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:13:57 +00:00
f763049923
test(cli): assert 'code' in search --help output (Bug #7 regression pin)
...
Why: the Step 2 doc-comment edit risks silently disappearing if someone
later reorders the value list or splits it into an alias section. Because
clap's --help rendering is free-form doc-comment text, a grep-only smoke
test is the only way to detect this.
Change: new test file (follows the kebab-cli `cli_*` prefix convention).
Runs a fresh binary via CARGO_BIN_EXE_kebab and asserts the `code` substring
in stdout. Maps 1:1 to the acceptance row in spec §4.4.
Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4
/ §5 (acceptance row 4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:47:55 +00:00
8cf73d1f43
docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)
...
Why: kebab search --media code has been functionally supported since
v0.18.0 (handled first-class via a path outside MEDIA_KINDS;
schema.v1.media_breakdown.code exists). However, the value list in the
SearchArgs clap doc-comment and in SKILL.md line 57 was stale: `code` was
missing. A user reading only --help could conclude that code is unsupported.
Change: sync 2 surfaces: the multi-line clap doc-comment at main.rs line
158-160 + integrations/claude-code/kebab/SKILL.md line 57.
No change to the Rust binary surface / wire schema.
Out of scope (follow-up): crates/kebab-mcp/tools/search.rs:44,
crates/kebab-core/src/search.rs:32+52,
crates/kebab-app/src/ingest_progress.rs:69, and
crates/kebab-cli/tests/wire_schema_breakdowns.rs:35 carry the same stale
list. They fall outside the grep boundary of spec ACCEPT (round 1c), so
this round does not include them.
Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.3 / §4.3a.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:47:16 +00:00
a58ee10dfb
fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)
...
Why: ingest of metro-korea.pdf (Identity-H CID font without ToUnicode CMap)
ended incorrectly with pdf_ocr_pages=0. The `?Identity-H Unimplemented?`
marker emitted by lopdf 0.32.0 (28 ASCII chars) passes the 0x0020..=0x007E
range of is_valid_text_char() → ratio=1.0 → bypasses the 0.5 threshold for
the OCR fallback.
Change: MOJIBAKE_MARKERS const + a 4-stage compute_valid_char_ratio()
(strip → zero if empty after trim → dominance cap at 0.3 → existing ratio).
The marker list is extensible. No change to the body of is_valid_text_char().
Tests: +2 unit (dominance + minority) on top of the existing 8. No
parser_version / wire schema change.
Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.1 / §4.2 / §6 R-1.
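The 4-stage ratio described above might look roughly like the sketch below.
Several details are assumptions made for illustration: the exact character
classes accepted by is_valid_text_char (only the ASCII range comes from the
commit; whitespace and Hangul syllables are guesses), and the dominance rule
(here: marker characters outnumber the remaining characters).

```rust
const MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"];
const DOMINANCE_CAP: f64 = 0.3;

/// Assumed validity test: printable ASCII, newline/tab, Hangul syllables.
fn is_valid_text_char(c: char) -> bool {
    matches!(c, '\u{0020}'..='\u{007E}' | '\n' | '\t' | '\u{AC00}'..='\u{D7A3}')
}

fn compute_valid_char_ratio(text: &str) -> f64 {
    // Stage 1: strip every known marker and remember how much was removed.
    let mut stripped = text.to_string();
    for marker in MOJIBAKE_MARKERS {
        stripped = stripped.replace(marker, "");
    }
    let marker_chars = text.chars().count() - stripped.chars().count();

    // Stage 2: nothing but markers and whitespace means no usable text.
    if stripped.trim().is_empty() {
        return 0.0;
    }

    // Stage 4 (computed first so stage 3 can cap it): plain valid-char ratio.
    let total = stripped.chars().count();
    let valid = stripped.chars().filter(|c| is_valid_text_char(*c)).count();
    let ratio = valid as f64 / total as f64;

    // Stage 3: if markers dominated the page, cap the ratio below the
    // OCR threshold so the page still falls back to OCR.
    if marker_chars > total {
        ratio.min(DOMINANCE_CAP)
    } else {
        ratio
    }
}
```

With a 0.5 threshold, a page that is mostly marker text lands at or below 0.3
and triggers OCR, while a page with one stray marker keeps its normal ratio.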
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:42:59 +00:00
e674ff474b
fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)
...
Bug #4 from the v0.20.0 sub-item 1 dogfood report: lopdf `get_pages()`
count = 0 for F4 mojibake.pdf (Pages tree broken). Root cause: the previous
byte-level `re.sub` + manual startxref edit passes lopdf strict load but
breaks the `/Kids` reference in the Pages dict.
- `tests/fixtures/_synth/mojibake.py`: full rewrite. Replace byte-level
`re.sub` + manual startxref with pikepdf open + inject dummy ToUnicode +
del + save (auto xref regen). HYSMyeongJo-Medium CID font: the CID font
does not generate a ToUnicode of its own, so a dummy stream is injected
and then stripped (removed=1 invariant). Exit codes 2/3/4 for invariant failures.
- `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerated via
pikepdf: 1 valid page, no /ToUnicode marker, byte-identically reproducible.
- `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`:
regen via 2-run cargo test pattern (hand-rolled unwrap_or_else baseline
bootstrap, no insta crate).
- `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3
invariant tests: (1) lopdf 1-page, (2) /ToUnicode marker absent,
(3) PdfTextExtractor 1-block invariant.
- `crates/kebab-parse-pdf/src/text_quality.rs`: f4_fixture_ratio_under_threshold
threshold 0.3 → 0.5 (the production valid_ratio_threshold default). The old
broken fixture (pages=0) gave extract_text="" → ratio=0.0; the new fixed
fixture decodes via the CID 2-byte fallback → ratio≈0.375, which still
meets the OCR trigger condition.
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4)
prior: 241ded5 (Step 3 integration test)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:02:17 +00:00
241ded59df
test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
...
v0.20.0 sub-item 1 bugfix Step 3 (Group C) — integration-level regression
for Bug #3 (intra-doc chunk_id collision under aggressive overlap).
- `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append.
- `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift +
`single` / `per_page` ctor (backward-compat single + per-page cursor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs`: remove the inline MockOcrEngine +
`mod common; use common::mock_ocr::MockOcrEngine;` import. 10 ctor call
site migration (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs`
(new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " +
500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet
dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape).
- C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock
injection unavailable (OllamaVisionOcr hardcoded) — helper-level chain test
(apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5.
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3)
prior: 436fd01 (Step 2 Bug #3 chunk_id fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:45:38 +00:00
436fd015a2
fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)
...
Bug #3 (Critical) from the v0.20.0 sub-item 1 dogfood report. Ingesting
scanned_page2.pdf (1580 chars of OCR text) violated the `chunks.chunk_id`
PRIMARY KEY: `per_chunk_hash = #c{char_start}` used the post-overlap
`actual_start`, and the overlap walk floor collapsed to `prev_min` →
segments 1/2 shared the same `#c0`.
- `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple
(segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash`
suffix uses `segment_start` (pre-overlap boundary, strictly increasing)
instead of `char_start` (post-overlap, may collapse to prev_min).
- VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade,
explicit user-facing audit trail). The hardcoded literals at
`crates/kebab-app/tests/pdf_pipeline.rs:168, 368` are updated to v1.1 as well.
- module docs (`pdf_page_v1.rs:47-60`): update the `#c{char_start}`
reference in the workaround description to `#c{segment_start}` + state the
segment_start invariant explicitly + HOTFIXES.md cross-ref.
- `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
regression pin (10 char "가" + ". " + 500 char "나" — multi-chunk +
overlap walk collapse trigger).
- `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR,
intra-doc collision root cause, second-iteration patch rationale).
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2)
prior: d9acda5 (Step 1 Bug #2 walker fix)
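The collision and its fix can be reproduced with a deliberately simplified
splitter. This is a sketch, not the real chunk_page: segments are fixed-size,
the overlap walk is a plain saturating subtraction floored at 0, and the
tuple omits the slice. All function names are hypothetical.

```rust
use std::collections::HashSet;

/// (segment_start, actual_start, chunk_end), all in char offsets.
type Span = (usize, usize, usize);

/// Fixed-size segments, each extended backwards by up to `overlap` chars.
fn chunk_spans(total_chars: usize, segment_len: usize, overlap: usize) -> Vec<Span> {
    assert!(segment_len > 0, "segment_len must be positive");
    let mut spans = Vec::new();
    let mut segment_start = 0;
    while segment_start < total_chars {
        let chunk_end = (segment_start + segment_len).min(total_chars);
        // Post-overlap start: may collapse to the floor for several segments.
        let actual_start = segment_start.saturating_sub(overlap);
        spans.push((segment_start, actual_start, chunk_end));
        segment_start = chunk_end;
    }
    spans
}

/// Old behaviour: suffix derived from the post-overlap start.
fn ids_from_actual_start(spans: &[Span]) -> Vec<String> {
    spans.iter().map(|s| format!("#c{}", s.1)).collect()
}

/// Fixed behaviour: suffix derived from the strictly increasing boundary.
fn ids_from_segment_start(spans: &[Span]) -> Vec<String> {
    spans.iter().map(|s| format!("#c{}", s.0)).collect()
}

fn all_unique(ids: &[String]) -> bool {
    ids.iter().collect::<HashSet<_>>().len() == ids.len()
}
```

With a short first segment and a large overlap, several segments share the
same post-overlap start, so only the segment_start suffix stays unique.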
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:32:09 +00:00
d9acda517a
fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)
...
Bug #2 from the v0.20.0 sub-item 1 dogfood report: `[ingest.code].max_file_bytes`
was applied uniformly to every file at the walker stage → most
PDF/image/markdown files (256 KB+) were skipped by the walker before
extraction. Fix:
- `crates/kebab-source-fs/src/code_meta.rs`: add `pub(crate) fn is_code_file(path)
-> bool` helper (= `code_lang_for_path(path).is_some()`).
- `crates/kebab-source-fs/src/connector.rs:168-190`: the walker size-cap check
short-circuits on `is_code_file(&abs_path) && is_oversized(...)`. PDF/image/
markdown bypass the walker cap; the parsers' own size controls (lopdf
load_mem, image OCR max_pixels) cover them.
- `crates/kebab-source-fs/src/connector.rs`, added inside the existing mod tests:
`size_cap_skips_only_code_files`, which checks the walker result for a
300 KB PDF + MD + .rs. No regression in the existing sibling tests
(huge.rs / longfile.rs, fixture names ending in `.rs`).
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§3)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 1)
prior: b4d9e60 (PR #189)
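The short-circuit described above, as a self-contained sketch. The extension
list is hypothetical (the real helper delegates to code_lang_for_path), and
`walker_should_skip` is an illustrative name, not the connector's actual
function.

```rust
use std::path::Path;

/// Hypothetical extension-based check; the real crate asks
/// code_lang_for_path(path).is_some() instead.
fn is_code_file(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|e| e.to_str()),
        Some("rs" | "py" | "go" | "ts")
    )
}

fn is_oversized(len: u64, max_file_bytes: u64) -> bool {
    len > max_file_bytes
}

/// The size cap applies only to code files; everything else is passed on
/// to its parser, which enforces its own limits.
fn walker_should_skip(path: &Path, len: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && is_oversized(len, max_file_bytes)
}
```

Because `&&` evaluates left to right, the size comparison is never consulted
for non-code files.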
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:20:38 +00:00
b4d9e60816
chore(release): bump version 0.19.0 → 0.20.0 — v0.20.0 sub-item 1 scanned PDF OCR
...
# v0.20.0 — scanned PDF OCR via Ollama vision LLM
The core change in v0.20.0 is OCR ingest of scanned PDFs that have no
embedded text (book scans, receipts, camera-captured pages). In the PoC
comparison of 5 engines (Tesseract / EasyOCR / PaddleOCR / gemma4:e4b /
qwen2.5vl:3b), qwen2.5vl:3b reached alnum 94.79% (page1) / 81.56% (batchim,
i.e. final-consonant-heavy text) and beat every other engine, so it is this
release's default vision OCR.
## 1. OCR opt-in usage
Enable with `enabled = true` in the `[pdf.ocr]` config or with the
`KEBAB_PDF_OCR_ENABLED=true` env var. Default off: the cost of 45-100s per
OCR page (qwen2.5vl:3b on CPU, remote Ollama) is a poor fit for non-OCR KBs
other than book archives.
```toml
[pdf.ocr]
enabled = true
model = "qwen2.5vl:3b"
# see the README for the other defaults
```
Pull qwen2.5vl:3b with Ollama:
```bash
ollama pull qwen2.5vl:3b # 3GB Ollama image
```
## 2. Force-reingest of scanned PDFs indexed by v0.19
A KB whose scanned PDFs were ingested with the v0.19 binary does not enter
the OCR path automatically, because parser_version "pdf-text-v1" is
preserved (a decision to avoid the CLAUDE.md §Versioning cascade trigger,
H-4). After only upgrading to the v0.20 binary and setting
`pdf.ocr.enabled = true`, the Unchanged path of try_skip_unchanged therefore
skips OCR. Reprocess explicitly:
```bash
kebab ingest --root /path/to/kb --force
```
## 3. DCTDecode-only v1 scope (handling of FlateDecode / CCITTFax pages)
PDF page image extraction in v0.20.0 covers only lopdf image XObjects with
/Filter == DCTDecode (JPEG passthrough). Other encodings (FlateDecode raw
pixel, CCITTFaxDecode bilevel, JPXDecode JPEG2000) emit a warning event and
the page is skipped.
If some pages of a scanned PDF are encoded with FlateDecode or CCITTFax:
```bash
qpdf --object-streams=disable --recompress-flate input.pdf normalized.pdf
```
The intent of v1 is the single-binary principle (no image crate introduced).
Multi-filter support will be considered for v1.1+ or a separate sub-item.
## 4. Family asymmetry (image OCR gemma4:e4b vs PDF OCR qwen2.5vl:3b)
The image OCR (P6) default stays gemma4:e4b (no change). Only PDF OCR
(v0.20) uses qwen2.5vl:3b. Users can unify the two with
[image.ocr] model = "qwen2.5vl:3b", but the default keeps the family asymmetry.
## Dogfood + test results
- workspace test: 178 result lines, 0 failure.
- workspace clippy (-D warnings): exit 0.
- alnum e2e (real Ollama, manual invoke):
- F1 (Korean page1): 94.79% (≥ 0.85 threshold).
- F2 (batchim-intensive): 81.56% (≥ 0.70 threshold).
- integration smoke + vector PDF regression: pass.
## Changed surface
- new config: [pdf.ocr] (11 field) + 11 env override KEBAB_PDF_OCR_*.
- new wire: IngestEvent::PdfOcrStarted/Finished (additive minor).
- new wire: IngestItem.pdf_ocr_pages/ms_total (additive minor).
- new CLI line: "📷 OCR page N..." / "✓ OCR page N (chars chars, msms via ollama-vision)".
- new module: kebab-parse-pdf::{page_image, text_quality} + kebab-app::pdf_ocr_apply.
- dep: consolidated on workspace lopdf = "0.32".
- fixture: 5 PDF (F1/F2/F4/F6/F7) under crates/kebab-parse-pdf/tests/fixtures/.
## Unchanged surface (invariant)
- Extractor::extract trait body byte-identical (PR #187).
- PdfTextExtractor body unchanged; split out as a post-extract enrichment pattern.
- parser_version "pdf-text-v1" preserved.
- chunker_version "pdf-page-v1" preserved.
- No change to the production dep graph of workspace.dependencies (-e normal baseline preserved).
## 11-commit history of the sub-item
9d7faab Step 1: foundation + cargo tree baselines
aeeff36 Step 2: lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7)
fb3952d Step 2 fix: F7 conversion engine record correction
c2cd3a7 Step 3: page_image + text_quality modules (10 test)
8d81bc1 Step 3 fix: clippy pedantic in page_image
9f003ef Step 4: pdf_ocr_apply helper (10 test, F7 split + cancel)
fd918a6 Step 5: [pdf.ocr] config section + PdfOcrOpts doc
4672cba Step 5 fix: clippy::bool_assert_comparison in pdf_ocr tests
b9ee09f Step 6: wire PDF OCR enrichment + cancel propagation
4c5ccd5 Step 7: wire schema additive — IngestEvent + IngestItem + skipped
c9e0594 Step 8: CLI printer activation + ingest_progress test + spec literal
4819768 Step 9: integration smoke + vector regression + alnum e2e
1d4e301 Step 9 follow-up: Cargo.lock for dev-dep additions
90726ab Step 10: docs sync (README + HANDOFF + ARCHITECTURE + SMOKE)
## § Acceptance §9 verifier evidence
All 15 rows of the K5 scriptable verifier are green (or, for manual real-Ollama rows, the results are reported):
- Row #4 (vector PDF byte-identical): pass.
- Row #5 (Extractor::extract trait byte-identical): 0 line diff.
- Row #6 (wire schema additive): jq + diff exit 0.
- Row #7-#8 (clippy / workspace test): exit 0.
- Row #9-#10 (dep graph baseline -e normal): empty diff.
- Row #11 (docs sync): grep evidence.
- Row #12 (version bump): "0.20.0" + Cargo.lock cascade ≥ 22.
- Row #14 (PR #187 invariant): extract_for(&asset.media_type) ≥ 1.
- Row #15 (DCTDecode-only v1, F6/F7 skip): test green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:03:44 +00:00
90726ab283
docs(v0.20): sync README + HANDOFF + ARCHITECTURE + SMOKE for scanned PDF OCR (post-extract enrichment, qwen2.5vl:3b, DCTDecode-only v1)
...
Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
J0: release notes path decision: commit body (no RELEASE_NOTES.md /
docs/RELEASE_NOTES_*.md exists; follows the commit-body release notes
format of the v0.17.x/v0.18.0 patterns). Inlined in the Step 11 K1 commit body.
J1: README.md:
- add `[pdf.ocr]` to the toml table list in the Configuration section.
- new sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11-field
toml example + 11 `KEBAB_PDF_OCR_*` env overrides + force-reingest UX
("scanned PDFs indexed by v0.19 do not get OCR automatically after the
v0.20 upgrade; `kebab ingest --force` is required").
J2: HANDOFF.md:
- extend the phase status P7 row: 3/3 component + post-extract OCR enrichment
(v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "Decisions discovered after merge" entry: design + scope of v0.20 sub-item 1
(H-1 post-extract pattern + DCTDecode-only v1 + parser_version preserved + H-4 UX).
J3: docs/ARCHITECTURE.md:
- split the OCR row: `OCR (image)` (gemma4:e4b unchanged) + `OCR (PDF, v0.20.0+)`
(qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply,
DCTDecode-only v1, family asymmetry: PoC alnum 94.79% vs gemma4 27%).
- extend the PDF parser row: page_image::extract_dctdecode_page_image (v0.20.0) +
parser_version "pdf-text-v1" preserved + differentiated provenance events.
J3: docs/SMOKE.md:
- isolated `[pdf.ocr]` config example (enabled=true, model=qwen2.5vl:3b).
- new dogfood section `### v0.20 force-reingest (scanned PDF OCR)`:
explicit `kebab ingest --force` invocation for the v0.19 → v0.20 upgrade path.
J4: release notes draft (source of the Step 11 K1 commit body):
- recorded in the result file (4 topics: opt-in + force-reingest + DCTDecode-only +
family asymmetry + dogfood/test results).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:34:24 +00:00
1d4e301e5e
chore(deps): Cargo.lock for Step 9 dev-dep additions (strsim + kebab-parse-image)
...
Cargo.lock cascade for the dev-deps added in Step 9 (commit 4819768):
strsim 0.11 + kebab-parse-image path. The worker did not include it in the
explicit commit, so the lock is synced in this follow-up commit.
No impact on the dep graph baseline (-e normal); only dev-deps were added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:12:38 +00:00
48197687b7
test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1
...
Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
I3: crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new):
- ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]` real Ollama,
ingest_with_config production path + IngestItem.pdf_ocr_pages verify.
- ocr_text_indexed_and_searchable: `#[ignore]` real Ollama, verifies via
app.search that the OCR text is indexed (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf: production cancel chain (pre-set
cancel=true + dummy endpoint, no panic/deadlock verify).
I4: crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new):
- vector_pdf_extract_byte_identical_to_baseline: keeps the canonical output
of the vector PDF path for F4 mojibake.pdf byte-identical (invariant
before and after all Step 1-8 changes).
- new baseline = tests/snapshots/vector_pdf_canonical.json (created on first run).
- normalize_provenance_timestamps inline helper (R-3 mitigation; no such
helper existed anywhere in the workspace, 12 new lines).
I5: crates/kebab-parse-pdf/tests/ocr_e2e.rs (new):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]` real
Ollama qwen2.5vl:3b, implementation of § Acceptance §9 #3.
- alnum metric = strsim::levenshtein (dev-dep added).
- truth file copy from PoC scratch (page1.txt + page2-batchim.txt) →
scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image dev-dep added (needed to call OllamaVisionOcr::from_parts).
A dev-dep exception to the parser isolation invariant (spec §3.1, dep graph
baseline -e normal preserved).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 10:10:58 +00:00
c9e05941c5
feat(cli): activate per-page PDF OCR progress printer + test(app): ingest_progress emit verify + spec(pdf-ocr): align §4.6.1 literal with option_A (ms/chars)
...
Step 8 (Group H) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 reviewer concern fix (spec literal deviation).
H1: kebab-cli/src/progress.rs printer activation:
- activate the old no-op stub `IngestEvent::PdfOcr* { .. } => {}` (Step 6
placeholder) as a human-friendly stderr line printer.
- wording exactly as in spec §4.6.1 line 1085-1086:
- PdfOcrStarted → ` 📷 OCR page {page}...`
- PdfOcrFinished (skipped=false) → ` ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})`
- PdfOcrFinished (skipped=true) → ` ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)` (uses the skipped field carried by M-4)
- consistent with the `!quiet` gate (mirrors the AssetStarted/Finished pattern).
H2: new test in crates/kebab-app/tests/ingest_progress.rs:
- pdf_ocr_progress_emits_started_finished_events (depends on real Ollama, `#[ignore]`).
- verifies that ingesting the F1 fixture (scanned_page1.pdf) emits
pdf_ocr_started + pdf_ocr_finished events. Invariant: Started count == Finished count.
- Manual invoke: `KEBAB_PDF_OCR_ENABLED=true cargo test -p kebab-app --test
ingest_progress --ignored`.
- no mock OcrEngine injection path exists (eager build in Step 6); same as
the ocr_e2e pattern of Step 9 I5 (real Ollama + `#[ignore]`).
Step 7 reviewer concern fix: spec §4.6.1 literal:
- update the `ocr_ms` / `ocr_chars` literals at line 1076-1077 to the actual
wire schema field names `ms` / `chars` (option_A, consistent with Rust serde).
- printer wording at line 1087 likewise: `{ocr_chars}` / `{ocr_ms}` → `{chars}` / `{ms}`.
- rationale reference at line 1556: `pdf_ocr_finished.ocr_ms` → `.ms`.
- the `skipped` field is stated explicitly as well (result of Step 6 reviewer M-4).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.6.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 8 H1+H2)
prior: 4c5ccd5 (Step 7 wire schema), fixes Step 7 reviewer concern 1
contract: §9 (additive minor wire bump, completed in the Step 7 commit)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 09:18:49 +00:00
4c5ccd5447
feat(wire): additive minor: pdf_ocr_* IngestEvent kinds + pdf_ocr_pages/ms_total in ingest_report.items[] + skipped field carry (Step 6 M-4/M-2)
...
Step 7 (Group G) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 6 code reviewer Important M-4 (skipped field carry) + Minor M-2
(ordering invariant doc) fix.
G3: JSON Schema sync (additive minor, schema_version preserved):
ingest_progress.schema.json:
- 2 kind enum values added: pdf_ocr_started + pdf_ocr_finished.
- new fields: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- descriptions of the existing ms / chars fields updated (now also carried by pdf_ocr_finished).
ingest_report.schema.json:
- items.items.properties newly defined (previously only a stub ["array", "null"]).
- pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- all existing IngestItem fields are now spelled out too (kind, doc_path, byte_len, ...).
Step 6 reviewer M-4 (Important): skipped field carry:
- add skipped: bool to IngestEvent::PdfOcrFinished.
- the emit closure of ingest_one_pdf_asset (lib.rs:~1864) propagates the
source PdfOcrProgress::Finished { skipped } instead of discarding it.
Step 6 reviewer M-2 (Minor): ordering invariant doc:
- update the ordering text in crates/kebab-app/src/ingest_progress.rs:
ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).
No .md docs exist (docs/wire-schema/v1/*.md), so the .md deliverable of
plan §3 Step 7 G3 is retroactively N/A (0 such files).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: b9ee09f (Step 6 wiring) + Step 6 reviewer M-4/M-2 recommendations
contract: §9 (additive minor wire bump, schema_version preserved)
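The ordering invariant above reads like a small grammar, so it can be checked
with a state machine. The sketch below is a literal reading of that text with
hypothetical `Ev` and `State` enums; it is not the project's own checker, and
it assumes Aborted only occurs between assets, as the written grammar states.

```rust
/// Reduced event kinds, named after the variants in the ordering text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ev {
    ScanStarted,
    ScanCompleted,
    AssetStarted,
    PdfOcrStarted,
    PdfOcrFinished,
    AssetFinished,
    Completed,
    Aborted,
}

#[derive(Clone, Copy, PartialEq)]
enum State {
    Start,
    Scanning,
    BetweenAssets,
    InAsset,
    InOcrPage,
    Done,
}

/// Returns true when the sequence matches
/// ScanStarted, ScanCompleted, (AssetStarted (PdfOcrStarted PdfOcrFinished)*
/// AssetFinished)*, then Completed or Aborted, with nothing after it.
fn is_valid_order(events: &[Ev]) -> bool {
    let mut state = State::Start;
    for &ev in events {
        state = match (state, ev) {
            (State::Start, Ev::ScanStarted) => State::Scanning,
            (State::Scanning, Ev::ScanCompleted) => State::BetweenAssets,
            (State::BetweenAssets, Ev::AssetStarted) => State::InAsset,
            (State::InAsset, Ev::PdfOcrStarted) => State::InOcrPage,
            (State::InOcrPage, Ev::PdfOcrFinished) => State::InAsset,
            (State::InAsset, Ev::AssetFinished) => State::BetweenAssets,
            (State::BetweenAssets, Ev::Completed | Ev::Aborted) => State::Done,
            // Any other transition violates the documented ordering.
            _ => return false,
        };
    }
    state == State::Done
}
```

A check like this could back a test that collects emitted events and asserts
the stream is well-formed, including Started/Finished pairing per page.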
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 08:51:51 +00:00
b9ee09f176
feat(app): wire PDF OCR enrichment + cancel propagation into ingest_one_pdf_asset (H-5 eager init + post-extract hook + per-page cancel) + workspace lopdf dep (Step 4 M-4)
...
Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 spillover (IngestEvent variant + IngestItem field for compile
boundary) + Step 4 reviewer Minor M-4 fix.
E1: eager PDF OCR engine build at `ingest_with_config_opts` entry,
mirroring the image OCR pattern (lib.rs:338-347). When `pdf.ocr.enabled ||
always_on`, call `OllamaVisionOcr::from_parts(endpoint, model, ...)`
and fail fast with `?`. No App field added (local var only, consistent with
spec L-1 / Step 1 A1 cosmetic fix).
E2 — `ingest_one_pdf_asset` signature extension: +3 param
(`pdf_ocr_engine: Option<&OllamaVisionOcr>`, `progress: Option<&
mpsc::Sender<IngestEvent>>`, `cancel: Option<&Arc<AtomicBool>>`).
`ingest_one_asset` dispatch wrapper + caller (dispatch loop) update.
E3: post-extract enrichment block right after `extract_for` (line 1779).
When `pdf.ocr.enabled || always_on`, call `apply_ocr_to_pdf_pages`,
map PdfOcrProgress to IngestEvent emits (PdfOcrStarted / PdfOcrFinished
with ocr_engine), and carry the summary's pages_ocrd/ms_total into
IngestItem fields. The PR #187 registry dispatch invariant is preserved
(`extract_for(&asset.media_type, ...)` unchanged).
E4 — cancel handle propagation: ingest_with_config_cancellable →
IngestOpts.cancel → ingest_with_config_opts → ingest_one_asset →
ingest_one_pdf_asset (new `cancel` param) → PdfOcrOpts.cancel chain.
spec §4.8 line 1159 production wiring.
Step 7 spillover (compile boundary):
- `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } +
PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant
`pdf_ocr_started` / `pdf_ocr_finished` (matches the Step 7 G3 wire schema).
- `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> +
pdf_ocr_ms_total: Option<u64> (placed between warnings and error). The 11 non-PDF
IngestItem construction sites fill in `None`.
- `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`:
no-op handlers for the new variants (v1 does not surface per-page progress;
they can be enabled in a future refinement).
- `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot
`ingest_report.snapshot.json`: new fields added to 2 IngestItem fixtures.
- The Step 7 JSON Schema update + CLI printer activation + snapshot
regeneration land in separate commits (G3/H1/H2 deliverables).
M-4 (Step 4 reviewer Minor), consolidate lopdf as a workspace dep:
- workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`.
- Direct deps in kebab-app + kebab-parse-pdf → `{ workspace = true }`.
Verifier evidence:
- workspace test (`cargo test --workspace --no-fail-fast -j 1`):
175 test result summary lines, 0 failures, 0 FAILED.
- workspace clippy (`-D warnings`): exit 0, 0 warning.
- dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`):
empty diff for both.
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2)
prior: 4672cba (Step 5 fix) + fd918a6 (Step 5) + 9f003ef (Step 4 helper)
contract: §9 (additive minor wire bump, once the Step 7 JSON Schema is complete)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-27 08:18:34 +00:00
4672cba6c6
fix(config): satisfy clippy::bool_assert_comparison in pdf_ocr tests
...
Four lines of the form `assert_eq!(bool_field, true|false)` in the F2 test file
from fd918a6 (crates/kebab-config/tests/pdf_ocr.rs) violate
`bool_assert_comparison` under the workspace clippy pedantic setup → the CI gate
`cargo clippy --workspace --all-targets -- -D warnings` exits 1.
Applied the canonical form to each assertion:
- assert_eq!(x, false) → assert!(!x)
- assert_eq!(x, true) → assert!(x)
Semantics and behavior are identical; 4-line edit, no logic change.
review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-05-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-05-spec-review-result.md (I-1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-27 07:17:46 +00:00
fd918a60ce
feat(config): add [pdf.ocr] section — qwen2.5vl:3b default, opt-in + env overrides + doc(app): PdfOcrOpts field doc (Step 4 I-1)
...
Step 5 (Group F) of v0.20.0 sub-item 1 (scanned PDF OCR) plan,
bundled with the Step 4 reviewer Important I-1 fix (PdfOcrOpts field doc).
F1 — `kebab-config::PdfCfg` + `PdfOcrCfg` + 4 default fn:
- PdfCfg { ocr: PdfOcrCfg }.
- PdfOcrCfg with 11 field (enabled/always_on/engine/model/endpoint/
languages/max_pixels/request_timeout_secs/valid_ratio_threshold/
min_char_count/lang_hint).
- defaults: opt-in (enabled=false), qwen2.5vl:3b, 0.5 threshold, 20 char.
- mirror of image OCR cfg pattern (spec §4.5).
Config struct extension:
- `pdf: PdfCfg` field with `#[serde(default = "PdfCfg::defaults")]`.
11 env var override (parallel to KEBAB_IMAGE_OCR_*):
KEBAB_PDF_OCR_{ENABLED,ALWAYS_ON,ENGINE,MODEL,ENDPOINT,LANGUAGES,
MAX_PIXELS,REQUEST_TIMEOUT_SECS,VALID_RATIO_THRESHOLD,MIN_CHAR_COUNT,
LANG_HINT}.
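The defaults and env overrides described above could look roughly like the sketch below. It is an illustration under assumptions: `PdfOcrCfgSketch` and `apply_env_overrides` are hypothetical names, only 4 of the 11 fields are shown, the field types are guesses, and the parsing rules (unparseable values are ignored) are not taken from the commit. Only the default values and the `KEBAB_PDF_OCR_*` variable names come from the message.

```rust
/// Hypothetical subset of the [pdf.ocr] section (4 of the 11 fields).
#[derive(Debug, Clone, PartialEq)]
pub struct PdfOcrCfgSketch {
    pub enabled: bool,
    pub model: String,
    pub valid_ratio_threshold: f32,
    pub min_char_count: usize,
}

impl Default for PdfOcrCfgSketch {
    fn default() -> Self {
        Self {
            enabled: false, // opt-in
            model: "qwen2.5vl:3b".to_string(),
            valid_ratio_threshold: 0.5,
            min_char_count: 20,
        }
    }
}

/// Applies overrides from a lookup function (assumed behavior: a value
/// that fails to parse leaves the existing setting untouched).
pub fn apply_env_overrides<F>(mut cfg: PdfOcrCfgSketch, lookup: F) -> PdfOcrCfgSketch
where
    F: Fn(&str) -> Option<String>,
{
    if let Some(v) = lookup("KEBAB_PDF_OCR_ENABLED") {
        if let Ok(b) = v.trim().parse::<bool>() {
            cfg.enabled = b;
        }
    }
    if let Some(v) = lookup("KEBAB_PDF_OCR_MODEL") {
        cfg.model = v;
    }
    if let Some(v) = lookup("KEBAB_PDF_OCR_VALID_RATIO_THRESHOLD") {
        if let Ok(x) = v.trim().parse::<f32>() {
            cfg.valid_ratio_threshold = x;
        }
    }
    if let Some(v) = lookup("KEBAB_PDF_OCR_MIN_CHAR_COUNT") {
        if let Ok(n) = v.trim().parse::<usize>() {
            cfg.min_char_count = n;
        }
    }
    cfg
}
```

In production the lookup would typically be `|k| std::env::var(k).ok()`; taking it as a parameter keeps tests free of process-wide environment mutation.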
F2: `crates/kebab-config/tests/pdf_ocr.rs` (new):
- toml roundtrip (11 fields).
- defaults (opt-in + qwen2.5vl:3b).
- env override (4 key sample + default preservation).
F3 (Step 4 I-1): doc comments for 4 public items in `pdf_ocr_apply.rs`:
- PdfOcrOpts struct + 6 fields.
- PdfOcrSummary struct + 2 fields.
- apply_ocr_to_pdf_pages fn (including an Errors block).
- PdfOcrProgress enum + 2 variants + 5 fields.
No body changes; doc-only.
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 5 F1+F2)
prior: 9f003ef (Step 4) — code reviewer Important I-1 resolution
contract: §9 (additive minor wire bump — Step 7)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-27 07:07:18 +00:00
9f003ef1cd
feat(app): add pdf_ocr_apply helper (10 test, F7 split + cancel) — post-extract OCR enrichment for PDF (H-1 resolution)
...
Step 4 (Group D) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
D1: `apply_ocr_to_pdf_pages(&mut canonical, &dyn OcrEngine, &bytes, &opts, emit_progress)`
in `kebab-app::pdf_ocr_apply`. The body follows spec §4.1 lines 381-599 as-is, plus the
PdfOcrOpts.cancel field and a per-page cancel check (verifier LOW L-1).
Post-extract enrichment pattern (H-1 resolution): kebab-parse-pdf does not
import kebab-parse-image::OcrEngine (parser isolation preserved).
The helper lives inside the kebab-app facade, avoiding a cross-import between the two parser crates.
Per-page decision matrix (spec §4.1 line 459-464):
- always_on=true → OCR every page (dual-block, ordinal = page-1 + page_count).
- always_on=false + needs_ocr → in-place OCR (mutates the text-detect block).
- needs_ocr=false → skip.
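The decision matrix above is small enough to express as a pure function. This is a minimal sketch under assumptions: `PageAction` and `decide_page` are hypothetical names, `needs_ocr` is taken as an input because its exact computation is not spelled out here, and `page` is assumed to be 1-based.

```rust
/// Hypothetical outcome of the per-page decision.
#[derive(Debug, PartialEq, Eq)]
pub enum PageAction {
    /// always_on: keep the text block and push an extra OCR block.
    DualBlock { ordinal: usize },
    /// Replace the text-detect block with the OCR result.
    InPlace,
    /// Leave the page untouched.
    Skip,
}

/// `page` is 1-based; `page_count` is the number of pages in the PDF.
pub fn decide_page(always_on: bool, needs_ocr: bool, page: usize, page_count: usize) -> PageAction {
    debug_assert!(page >= 1, "page numbers are assumed to be 1-based");
    if always_on {
        // ordinal = page-1 + page_count places OCR blocks after all text blocks
        PageAction::DualBlock { ordinal: (page - 1) + page_count }
    } else if needs_ocr {
        PageAction::InPlace
    } else {
        PageAction::Skip
    }
}
```

With this offset, the OCR block ordinals for an N-page document fall in N..2N-1, so they stay unique and do not collide with text block ordinals if those occupy 0..N-1 (the latter is an assumption about the existing blocks).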
DCTDecode-only v1 (H-3): for FlateDecode / CCITTFaxDecode pages,
extract_dctdecode_page_image returns None → Warning event + skip + emit_progress(skipped=true).
OcrEngine.recognize failure → Warning event + skip + emit_progress(skipped=true).
D3: per-page cancel handle (verifier LOW L-1 + spec §4.8 line 1159):
PdfOcrOpts.cancel: Option<Arc<AtomicBool>>. Once it is set to true,
`anyhow::bail!("PDF OCR cancelled mid-PDF at page N")`.
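A per-page cancel check of this kind might be sketched as follows. The function name `ocr_pages_with_cancel` is hypothetical, a plain `String` error stands in for `anyhow` to keep the example std-only, and the `Relaxed` load ordering is an assumption rather than something the commit states.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Runs `ocr_page` for pages 1..=page_count, checking the cancel flag
/// before each page. Returns the number of pages processed.
pub fn ocr_pages_with_cancel<F>(
    page_count: u32,
    cancel: Option<&Arc<AtomicBool>>,
    mut ocr_page: F,
) -> Result<u32, String>
where
    F: FnMut(u32),
{
    let mut done = 0;
    for page in 1..=page_count {
        if let Some(flag) = cancel {
            // Assumed ordering: the flag carries no other data, so Relaxed suffices.
            if flag.load(Ordering::Relaxed) {
                return Err(format!("PDF OCR cancelled mid-PDF at page {page}"));
            }
        }
        ocr_page(page);
        done += 1;
    }
    Ok(done)
}
```

Checking once per page bounds the cancellation latency by the duration of a single page's OCR call, which matches the "mid-PDF" wording of the error message.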
lopdf = "0.32" added to [dependencies] (already transitive via kebab-parse-pdf;
no new crate introduced — dep graph kebab-parse-* baseline unchanged).
Integration test (`tests/pdf_ocr_apply.rs`, 10 test):
- f1_input_with_ocr_enabled_replaces_empty_block — in-place mutate.
- f3_input_with_ocr_enabled_keeps_text_detect_blocks — vector PDF skip.
- f1_input_with_ocr_disabled_keeps_empty_block — disabled no-op.
- f4_input_with_ocr_enabled_replaces_mojibake_block — mojibake → in-place mutate.
- f3_input_with_always_on_pushes_dual_blocks — always_on dual-block.
- f6_flatedecode_skipped_with_warning — FlateDecode skip + Warning event.
- f7_ccittfax_skipped_with_warning — CCITTFax skip + Warning event (verifier M-4 split).
- ocr_engine_failure_surfaces_as_warning — OCR failure → Warning event.
- dual_block_ordinals_are_deterministic_and_unique — ordinal invariant.
- cancel_handle_aborts_mid_pdf — production source of the cancel handle (D3).
MockOcrEngine fixture: spec §5.5 line 1284-1299. No F3 fixture exists →
mock CanonicalDocument construction + F1 bytes reuse pattern (Option B:
build the canonical document via the real production path, PdfTextExtractor::extract).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 + §5.5)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 4 D1+D2+D3)
prior: c2cd3a7 (Step 3) + 8d81bc1 (Step 3 clippy fix)
contract: §9 (additive minor wire bump; follow-up step)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-27 06:42:01 +00:00
8d81bc1071
style(parse-pdf): satisfy clippy pedantic in page_image (uninlined_format_args + map_unwrap_or)
...
Two workspace clippy pedantic violations remained in `extract_dctdecode_page_image` from c2cd3a7
→ the CI gate (cargo clippy --workspace --all-targets -- -D warnings) fails.
Both lints are 1-line edits with identical semantics; no logic change.
- L20 uninlined_format_args: format!("page {} not in get_pages()", page_num)
→ format!("page {page_num} not in get_pages()")
- L48-52 map_unwrap_or: .map(|n| n == b"Image").unwrap_or(false)
→ .is_some_and(|n| n == b"Image")
cargo clippy --workspace --all-targets -j 4 -- -D warnings → exit 0.
cargo test -p kebab-parse-pdf -j 4 → 21 passed (regression 0).
review trail:
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-03-spec-review-result.md (SPEC_COMPLIANT)
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-03-code-review-result.md (Critical C-1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-27 06:14:00 +00:00
c2cd3a7ab7
feat(parse-pdf): add page_image (DCTDecode passthrough, 2 test) + text_quality (valid char ratio, 8 unit test) modules
...
Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
C1: `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` ->
Result<Option<Vec<u8>>>. Traverses Resources/XObject via lopdf, inspects the /Filter
of the first image XObject (covering both the single Name and the Array form, spec §4.1
line 642-664), and returns the raw bytes when the DCTDecode + JPEG magic checks pass.
Returns Ok(None) for any other encoding or when no image XObject exists. v1 scope = DCTDecode
passthrough only (H-3 invariant; the image crate is not introduced).
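The filter and magic checks described above can be sketched without lopdf. Everything here is illustrative: the `Filter` enum is a hypothetical stand-in for the two /Filter shapes, and treating only a single-element array as passthrough-eligible is an assumption (a chain such as ASCII85Decode + DCTDecode would not yield raw JPEG bytes).

```rust
/// Hypothetical stand-in for the two shapes a PDF /Filter entry can take.
pub enum Filter<'a> {
    Name(&'a [u8]),
    Array(Vec<&'a [u8]>),
}

const DCT: &[u8] = b"DCTDecode";

/// True when the stream is encoded with DCTDecode and nothing else.
pub fn is_dctdecode_only(filter: &Filter) -> bool {
    match filter {
        Filter::Name(n) => *n == DCT,
        Filter::Array(names) => names.len() == 1 && names[0] == DCT,
    }
}

/// JPEG streams start with the SOI marker FF D8.
pub fn has_jpeg_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xFF, 0xD8])
}

/// Returns the raw stream bytes when they can be handed to an OCR engine
/// as a JPEG without decoding; None otherwise.
pub fn passthrough_jpeg(filter: &Filter, stream: &[u8]) -> Option<Vec<u8>> {
    if is_dctdecode_only(filter) && has_jpeg_magic(stream) {
        Some(stream.to_vec())
    } else {
        None
    }
}
```

Returning `None` instead of an error for unsupported encodings lets the caller treat such pages as a skip with a warning rather than a failed ingest.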
Integration test (`tests/page_image.rs`, 2 test):
- f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path.
- flate_raw_fixture_yields_none — F6 fixture negative path.
C2: `text_quality::compute_valid_char_ratio(s) -> f32`. valid char =
ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin
Extended + common Korean punctuation. Empty string → 0.0. The caller
(`kebab-app::pdf_ocr_apply`) compares the ratio against the threshold (default 0.5).
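A valid-character ratio along these lines might be computed as below. This is a sketch, not the module's code: `valid_char_ratio_sketch` is a hypothetical name, the exact Unicode ranges are assumptions chosen to match the categories listed (in particular, the CJK Symbols and Punctuation block stands in for "common Korean punctuation"), and whitespace other than the ASCII space is counted as invalid here.

```rust
/// Assumed character classes; the real module may use different ranges.
fn is_valid_char(c: char) -> bool {
    matches!(c,
        ' '..='~'                   // ASCII printable
        | '\u{00C0}'..='\u{024F}'   // Latin-1 letters + Latin Extended-A/B
        | '\u{1100}'..='\u{11FF}'   // Hangul Jamo
        | '\u{3130}'..='\u{318F}'   // Hangul Compatibility Jamo
        | '\u{AC00}'..='\u{D7A3}'   // Hangul Syllables
        | '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{3000}'..='\u{303F}'   // CJK Symbols and Punctuation
    )
}

/// Fraction of characters in `s` that fall into a valid class.
/// An empty string yields 0.0, so it always falls below any positive threshold.
pub fn valid_char_ratio_sketch(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = s.chars().filter(|&c| is_valid_char(c)).count();
    valid as f32 / total as f32
}
```

Text extracted from a PDF whose fonts lack a ToUnicode CMap tends to land in the Private Use Area or come back empty, so it scores near 0.0 and falls under the threshold.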
Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A: lopdf extract_text returns
an empty string when the ToUnicode CMap is missing → valid_ratio = 0.0000 < 0.3).
Also: updated the Cargo.toml description ("Text PDF extractor + scanned-page
image extract helpers ...", deferred from Step 1 A2).
fixture fix: startxref in mojibake.pdf 22130 → 22114 (corrects a 16-byte offset
error, fixing the bug where lopdf's strict parser could not find the xref).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump; follow-up step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-27 05:59:10 +00:00
fb3952d54f
docs(pdf-ocr): correct F7 conversion engine record in PoC doc (gs, not ImageMagick)
...
The PoC doc append in aeeff36 (engine-comparison.md L134, L141) recorded the
conversion engine for F7 (`ccitt.pdf`) as "ImageMagick `convert -compress Group4`",
but the actual tests/fixtures/_synth/flate_ccittfax.sh:77-83 uses the
`gs -sDEVICE=pdfwrite -dMonoImageFilter=/CCITTFaxEncode -dEncodeMonoImages=true`
flags (0 ImageMagick `convert` calls).
The fixture binary (`/Filter [ /CCITTFaxDecode ]`, 2060 bytes) does satisfy the invariant
(verified by the Step 2 spec compliance + code quality reviews). This is a
factual correction of the historical record only.
review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-02-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-02-spec-review-result.md
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-02-code-review-result.md (I1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-27 05:36:56 +00:00
aeeff3635b
poc+test(pdf-ocr): lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) for v0.20 sub-item 1
...
Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).
Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
ImageMagick `convert -compress Group4` used for F7 CCITTFax.
B2: 5 fixtures synthesized and committed under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean).
- F2 scanned_page2.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, Korean final consonants (batchim)).
- F4 mojibake.pdf — DejaVu TTF + ToUnicode CMap stripped (count=0);
Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).
Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump; follow-up step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-27 04:04:47 +00:00
9d7faab650
docs+chore(plan-bootstrap): apply spec L-1 cosmetic fix + capture cargo tree baselines for v0.20 sub-item 1 verifier gates
...
Step 1 (Group A) of v0.20.0 sub-item 1 (scanned PDF OCR) implementation plan.
A1: spec §4.2 line 740 prose pseudo-code fix: `app.pdf_ocr_engine.as_ref()`
→ local `pdf_ocr_engine: Option<OllamaVisionOcr>` built in
`ingest_with_config_opts` (consistent with §4.4 eager init; no App field introduced).
A2: Cargo.toml dep invariant verified (image crate not introduced, preserving the H-3
DCTDecode-only v1 invariant; kebab-parse-pdf + kebab-parse-image are already deps of
kebab-app). The description update is deferred to Step 3 (after the modules are added).
A3: captured cargo tree baselines, the ground truth for K5 rows #9/#10
(.omc/state/pdf-ocr-{app-parse,parse-pdf}-deps.baseline.txt). They serve as the
verifier's baseline for the invariant that the other steps of this sub-item make no dep graph changes.
Note: .omc/ is gitignored, so the baseline files exist only as local files.
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (round 1c ACCEPT)
contract: §9 (additive minor wire bump; follow-up step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-27 03:40:49 +00:00
bcd1e37dab
chore(repo): .omc/ ignore + AGENTS·GEMINI symlinks + stronger release notes writing guide
...
- .gitignore: added .omc/ (OMC state directory), on par with .claude/ and .superpowers/
- AGENTS.md / GEMINI.md: symlinks to CLAUDE.md so that Codex / Gemini CLI follow the same instructions
- CLAUDE.md release procedure: strengthened the guide so that release notes contain user-friendly explanations plus dogfooding/test results rather than a bare list of commit subjects
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-26 23:58:04 +00:00
e7a4330798
Merge pull request 'docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill' ( #188 ) from docs/v0-20-image-pdf-handoff into main
...
Reviewed-on: #188
2026-05-26 23:36:49 +00:00
574e1b1ca1
docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill
...
Handoff document for the next session after the v0.19.0 release, plus an after-the-fact backfill.
- docs/superpowers/handoffs/2026-05-26-v0.20-image-pdf-normalize-handoff.md (540 lines, 9 sections)
- merge results for sub-items 1/2/3 + dogfooding baseline (1781 doc / 9050 chunks) + user memory + OMC workflow + build environment
- current implementation state (v0.19.0, image+pdf): exact file:line + struct/fn signatures + flow
- details of 8 TODOs (problem + scope + affected files + risk + trigger conditions)
- priorities + recommended sequencing + suggested first steps for the new session
- docs/superpowers/specs/2026-05-26-extractor-dispatch-unification-spec.md (sub-item 3 spec)
- docs/superpowers/plans/2026-05-26-extractor-dispatch-unification-plan.md (sub-item 3 plan)
When PR #187 was merged, only the source code went in and the spec/plan were left out, so that PR's reference links 404 on main. This commit backfills them.
Assisted-by: Claude Code
2026-05-26 23:34:17 +00:00
c1e82cca92
Merge pull request 'refactor(app): extract dispatch polymorphism — App.extract_for(...) + 11 Extractor registry' ( #187 ) from refactor/extractor-dispatch-unification into main
...
Reviewed-on: #187
v0.19.0
2026-05-26 21:07:20 +00:00
2c05dbd0dd
refactor(app): extract dispatch polymorphism — App.extract_for(...) + 11 Extractor registry
...
Unifies kebab-app's hardcoded extract dispatch (`ImageExtractor` + `PdfTextExtractor` + 9 AST `*Extractor`: 11 `::new().extract(…)` callsites + a 9-AST-arm match) into a single polymorphic call, `App::extract_for(&MediaType, &ExtractContext, &[u8])`. No trait changes, no parser source changes, no wire schema changes (success path).
Key changes:
- Added a `pub(crate) extractors: Vec<Box<dyn Extractor + Send + Sync>>` field + a `pub(crate) fn extract_for(...)` helper method to the App struct.
- Registry init in App::open_with_config = 11 Extractors (image + pdf + 9 AST).
- Removed the `extractor: &'a ImageExtractor` field from the ImagePipeline struct + deleted the lib.rs:356 local and the lib.rs:1235 alias (atomic block).
- 9 AST arms (the 12 arms at lib.rs:2012-2047 = 11 explicit + 1 wildcard) → 4 arms (9 AST grouped + 7 manifest + 1 shell + 1 other-bail).
- In-crate unit tests (`mod tests_extractor_dispatch` in app.rs), 3 classes: registry length 11 / mutually-exclusive supports() grid (16 sample MediaTypes) / extract_for error path (Audio).
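The registry dispatch described above follows a common trait-object pattern. The sketch below is a simplified illustration, not the crate's API: `AppSketch`, the three-variant `MediaType`, and the `extract` signature (no `ExtractContext`, `String` results) are all hypothetical; only the `Vec<Box<dyn Extractor + Send + Sync>>` field shape and the `supports()` / `extract_for` names come from the commit message.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MediaType {
    Image,
    Pdf,
    Audio,
}

/// Simplified stand-in for the Extractor trait.
pub trait Extractor {
    fn supports(&self, media: &MediaType) -> bool;
    fn extract(&self, bytes: &[u8]) -> Result<String, String>;
}

struct ImageExtractor;
impl Extractor for ImageExtractor {
    fn supports(&self, media: &MediaType) -> bool {
        *media == MediaType::Image
    }
    fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        Ok(format!("image:{}", bytes.len()))
    }
}

struct PdfTextExtractor;
impl Extractor for PdfTextExtractor {
    fn supports(&self, media: &MediaType) -> bool {
        *media == MediaType::Pdf
    }
    fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        Ok(format!("pdf:{}", bytes.len()))
    }
}

pub struct AppSketch {
    extractors: Vec<Box<dyn Extractor + Send + Sync>>,
}

impl AppSketch {
    pub fn new() -> Self {
        Self {
            extractors: vec![Box::new(ImageExtractor), Box::new(PdfTextExtractor)],
        }
    }

    /// Picks the first extractor whose supports() accepts the media type.
    pub fn extract_for(&self, media: &MediaType, bytes: &[u8]) -> Result<String, String> {
        let ex = self
            .extractors
            .iter()
            .find(|e| e.supports(media))
            .ok_or_else(|| format!("no extractor registered for {media:?}"))?;
        ex.extract(bytes)
    }
}
```

Because dispatch takes the first match, the pattern relies on `supports()` being mutually exclusive across extractors, which is presumably why the commit adds a grid test for that property.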
scope = AST 9-arm + image + pdf extract callsite only. MarkdownExtractor / Tier 2/3 / outer 4-arm / inner 4 match / Chunker dispatch are all deferred to the future (separate PRs, spec §11).
No wire schema (success path) changes: ingest_report.v1 / search_response.v1 / answer.v1 are byte-identical (verified by a 4-medium SMOKE comparison). The wording change in the internal context string of error.v1.message (e.g. `kb-parse-image::ImageExtractor::extract` → `kb-app::extract_for (image)`) is accepted as a risk under spec §5.5: `error.v1.code` + `error.v1.schema_version` are preserved, and the string is outside the user-visible surface. No Cargo workspace.version bump.
Refs:
- docs/superpowers/specs/2026-05-26-extractor-dispatch-unification-spec.md (2 round APPROVE)
- docs/superpowers/plans/2026-05-26-extractor-dispatch-unification-plan.md (3 round APPROVE)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com >
2026-05-26 17:43:44 +00:00
96766406aa
Merge pull request 'refactor(parse-md): absorb kebab-normalize + kebab-parse-types (24 → 22 crates) + §3.7b rewrite' ( #186 ) from refactor/normalize-absorption into main
...
Reviewed-on: #186
2026-05-26 15:37:21 +00:00
710945c4b0
refactor(parse-md): absorb kebab-normalize + kebab-parse-types (24 → 22 crates) + §3.7b rewrite
...
In practice, only 1 of the 4 parsers (markdown) goes through the lift of the thin layer
in design §3.7b (ParsedBlock and friends); with fan-in and fan-out both at 1, the layer
has lost its purpose. Both kebab-normalize (1097 LOC) and kebab-parse-types (98 LOC)
are absorbed into kebab-parse-md.
design: docs/superpowers/specs/2026-05-26-normalize-absorption-spec.md
plan: docs/superpowers/plans/2026-05-26-normalize-absorption-plan.md
HOTFIXES: the 2026-05-26 entry in tasks/HOTFIXES.md (design deviation)
- 5 in-use types + 3 forward-declared structs → explicit pub re-exports from the kebab-parse-md::types module.
- build_canonical_document + derive_title + warning_agent → kebab-parse-md::normalize module.
- The 4 hard-coded agent literals (lib.rs:122/128/134/153) + the warning_agent body return value + the tracing target literal are all preserved for stage label consistency.
- kebab-app callsites (lib.rs:51 use + :1119 context string) + removal of 2 deps (regular + dead) from Cargo.toml.
- Removed kebab-normalize from the [dev-dependencies] of kebab-chunk + kebab-store-sqlite (replaced by kebab-parse-md). Shifted the `use` paths in the integration test sources.
- Moved the test file (kebab-normalize/tests/normalize_snapshot.rs → kebab-parse-md/tests/).
- workspace Cargo.toml: Hunk (a) deletes 2 members entries + Hunk (b) bumps version 0.18.0 → 0.19.0 (frozen contract change).
- Rewrote design §3.7b as 4 paragraphs (original intent preserved + current state + preserved surface + future re-extraction trigger).
- Updated the design §8 graph (3 edges removed + meaning of 2 forbidden bullets updated + commentary).
- ARCHITECTURE.md: mechanical update of the crate graph + directory tree.
- tasks/INDEX.md L169 closure mention + new "Future work / deferred" section (image/pdf normalize integration entry).
- tasks/HOTFIXES.md: new entry (4-block; design deviation Symptom).
- HANDOFF.md: one cross-link line.
- The 3 dead structs (ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment) are kept as the future surface for the v0.20+ image/pdf normalize integration (spec §11).
Wire / surface impact: none. CLI / TUI / MCP / --json output / config / XDG path /
parser_version are all unchanged. The wire-invisible provenance.events[].agent + the tracing target
literal "kb-normalize" are also preserved, keeping the audit log consistent between old and new DB rows.
Verification: cargo test --workspace --no-fail-fast -j 1 → 1313 passed / 0 failed (172 result blocks).
cargo clippy --workspace --all-targets -j 1 -- -D warnings → 0 warning (5m 46s).
cargo metadata --no-deps --format-version 1 | jq '.workspace_members | length' = 22.
cargo tree -p kebab-app --depth 2 | grep -E "kebab_(parse_types|normalize)" = 0 lines.
2026-05-26 15:00:59 +00:00
d4395a306b
Merge pull request 'refactor(source-fs): drop kebab-parse-code dep (removes the drag of 9 tree-sitter grammars)' ( #185 ) from refactor/source-fs-dep-lightening into main
...
Reviewed-on: #185
2026-05-26 12:31:29 +00:00
bd48baa19a
refactor(source-fs): drop kebab-parse-code dep (removes the drag of 9 tree-sitter grammars)
...
Removes the heavy dependency through which kebab-source-fs dragged in kebab-parse-code's 9 tree-sitter grammars. Only 4 surfaces (code_lang_for_path / is_generated_file / is_oversized / BUILTIN_BLACKLIST) were used, yet all 9 grammars were linked in the dep graph → moved them to kebab-source-fs::code_meta + cleaned up the kebab-parse-code side.
Key changes:
- New kebab-source-fs::code_meta: the 4 surfaces moved here (BUILTIN_BLACKLIST `pub` for frozen contract + 3 helper fn `pub(crate)`). Added 1 line, `pub use code_meta::BUILTIN_BLACKLIST`, to lib.rs (Option A: no unjustified expansion of other mod surfaces).
- callsite migration: media.rs (1) + walker.rs (2) + connector.rs (2), all updated to `kebab_source_fs::code_meta::*`.
- kebab-parse-code side cleanup: deleted skip.rs + narrow edit to lang.rs (removed the code_lang_for_path body + 2 unit tests + the Path import; module_path_for_* kept) + rewrote the lib.rs header doc (including a migration breadcrumb).
- Moved the 13 tests from tests/{lang,skip}.rs: 12 unit (`src/code_meta.rs::tests`) + 1 integration (`tests/code_meta.rs` for BUILTIN_BLACKLIST frozen contract).
- design §8 graph: edge removed + p10-2 inline note. Updated 1 line of prose in ARCHITECTURE.md. Corrected the stale dep reference at kebab-core::metadata.rs:36.
G1+G5: cargo tree -p kebab-source-fs | grep tree-sitter = 0 lines.
G2+G3: no workspace test regressions + 13 tests moved 1:1.
G4: design §8 + ARCHITECTURE.md updated.
Wire impact: none (internal Rust crate-API surface only, nothing user-facing). No Cargo workspace.version bump needed.
Refs:
- docs/superpowers/specs/2026-05-26-source-fs-dep-lightening-spec.md (v3, 4-round APPROVE)
- docs/superpowers/plans/2026-05-26-source-fs-dep-lightening-plan.md (v4, 4-round ACCEPT)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-26 12:19:32 +00:00
b02ac8200e
Merge pull request 'fix(rag): S3 NLI unavailable — hypothesis char budget + token-count fallback retry' ( #184 ) from fix/s3-nli-model-unavailable-diagnose into main
...
Reviewed-on: #184
2026-05-26 09:17:12 +00:00
336962715a
fix(rag): S3 NLI unavailable — hypothesis char budget + token-count fallback retry
...
Root cause of the consistent `nli_model_unavailable` failure on the S3 dogfood query = the mDeBERTa-v3 tokenizer's `OnlyFirst` strategy + a 949-token hypothesis. The earlier char-budget-only fix did not resolve the KR-extreme density case → token-count fallback retry + RC1-residual trait dispatch alignment.
Key changes:
- kebab-nli::NliVerifier: added the `hypothesis_token_count(&str) -> Result<usize>` trait method (default `Ok(0)` for backward compatibility). `OnnxNliVerifier` overrides it with real mDeBERTa tokenization inside the *trait impl block*, which guarantees vtable registration (round-3 critic RC1-residual closure).
- kebab-rag::pipeline: `MAX_NLI_HYPOTHESIS_CHARS_INITIAL = 1200` + `MAX_NLI_HYPOTHESIS_CHARS_MIN = 150` consts + `pub(crate) fn truncate_chars` pure fn + `pub fn truncate_hypothesis_for_nli_with_budget` retry helper (retries with the char budget halved; graceful unavailable at the min floor). The callsite in the step 8.5 hook uses an explicit `match` + `return self.refuse_nli_model_unavailable` pattern (`?` is forbidden; round-2 plan critic CRITICAL #1 closure).
- New SpyNliVerifier helper (closure score_fn + hypothesis_token_count_fn, 2-arg constructor).
- 2 ignored tests from §5.1 (EN-long err + vtable dispatch RC1-residual pin) + 4 boundary tests from §5.2 (truncate_chars) + 3 mock multi-hop tests from §5.3 (long_en_grounded / long_kr_retries / unrelenting_fallback). +7 new tests (2 ignored, skipped by default).
- New dated entry in tasks/HOTFIXES.md, `## 2026-05-26 — S3 NLI unavailable ...`: the 4 blocks Symptom / Root cause / Action / Amends.
- spec + plan (`docs/superpowers/{specs,plans}/2026-05-26-s3-nli-model-unavailable-diagnose-*.md`): artifacts of the 4-round spec + 3-round plan OMC reviewer ACCEPT.
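The halving retry described in the list above can be sketched as follows. This is an illustration under assumptions: only the two constants and the `truncate_chars` name come from the commit message, while the signatures, the helper name `truncate_with_budget`, the `max_tokens` parameter (the model's token limit is not stated here), and the token-counting closure are hypothetical.

```rust
pub const MAX_NLI_HYPOTHESIS_CHARS_INITIAL: usize = 1200;
pub const MAX_NLI_HYPOTHESIS_CHARS_MIN: usize = 150;

/// Keeps at most `max_chars` Unicode scalar values, never splitting a char.
pub fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // already within budget
    }
}

/// Halves the char budget until the truncated hypothesis fits `max_tokens`.
/// Returns None when even the minimum budget does not fit, so the caller
/// can fall back to the "unavailable" path.
pub fn truncate_with_budget<F>(hypothesis: &str, max_tokens: usize, token_count: F) -> Option<String>
where
    F: Fn(&str) -> usize,
{
    let mut budget = MAX_NLI_HYPOTHESIS_CHARS_INITIAL;
    loop {
        let candidate = truncate_chars(hypothesis, budget);
        if token_count(candidate) <= max_tokens {
            return Some(candidate.to_string());
        }
        if budget <= MAX_NLI_HYPOTHESIS_CHARS_MIN {
            return None; // min floor reached
        }
        budget = (budget / 2).max(MAX_NLI_HYPOTHESIS_CHARS_MIN);
    }
}
```

With these constants the budgets tried are 1200, 600, 300, and 150 characters. Counting tokens instead of relying on a fixed character budget matters for dense scripts such as Korean, where one character can map to more than one token.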
Verification:
- cargo test -p kebab-nli -j 1 → 11/11 pass + 7 ignored (skipped by default).
- cargo test -p kebab-rag -j 1 → 19+3+3+... all pass + 3 new mock + 4 new boundary.
- cargo test --workspace --no-fail-fast -j 1 → **1313 pass (+7 new)**, 0 failed. No regressions (HOTFIX #15 already fixed, no remaining flaky tests).
- cargo clippy --workspace --all-targets -j 1 -- -D warnings clean (type_complexity allow on Arc<dyn Fn> type aliases).
KR safe (token-count retry path) + graceful fallback (at the min floor the existing unavailable wire is kept; no regressions). No wire impact (additive trait method). No Cargo bump needed.
Refs:
- spec: docs/superpowers/specs/2026-05-26-s3-nli-model-unavailable-diagnose-spec.md (4 round APPROVE — analyst → critic + verifier × 4 rounds)
- plan: docs/superpowers/plans/2026-05-26-s3-nli-model-unavailable-diagnose-plan.md (3 round ACCEPT — planner → critic-plan + verifier-plan × 3 rounds)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-26 09:12:21 +00:00
1a224bf983
Merge pull request 'fix(mcp): HOTFIX #15, MCP ask multi_hop dispatch-divergence assertion (fixture reinforced)' ( #183 ) from fix/hotfix-15-mcp-ask-multi-hop-flaky into main
...
Reviewed-on: #183
2026-05-26 07:02:26 +00:00
a210bf5d52
docs(rag): HOTFIX #15 spec + plan (3 round OMC reviewer approve)
...
Spec + plan authoring and review artifacts from the OMC team `hotfix-15-mcp-flaky`.
- spec: diagnosed by the analyst (root cause = PR-7 probe-first mismatches the stale empty-KB contract of the PR-5 test) + Option A recommended (test-only fix). 3 review rounds (critic + verifier): closure of CRITICAL C1 (fixture/query FTS5 0 hits) + MAJOR M1/M2 and others.
- plan: the planner wrote 7 steps + subagent dispatch tasks. 3 review rounds (critic-plan + verifier-plan): empirical SQLite REPL verification, level-1 dated entry placement, alignment with the actual KebabHandler/KebabAppState pattern.
implementation = 429287f commit (executor).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-26 06:52:04 +00:00