Commit Graph

919 Commits

Author SHA1 Message Date
10b0e2f4f2 fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11)
metro-korea.pdf v0.20 final-dogfood (2026-05-27):
- page 8 + page 13 양쪽 모두 600s default 까지 완전 timeout
  (`ms: 600000, chars: 0, skipped: true`)
- 결과: 본문 indexed 안 됨 + page 당 20분 cost 낭비

cloud GPU Ollama 의 실측 per-page throughput 는 6-32s (parent spec 가정 105s 보다
훨씬 빠름). 60s 면 production-friendly upper-bound. dense/고해상도 page 는
config.toml override (`[pdf.ocr] request_timeout_secs = N`) 로 user 가 늘릴 수
있음 — Step 6 에서 HOTFIXES + parent spec cross-link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:15:21 +00:00
28f513795e fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10)
이전: `kebab search "rust" --config /tmp/nonexistent.toml --json` 가 exit=0 +
`{"hits":[]}` silent fallback to XDG default. typo / wrong path 가 0-hit 으로만
surface — debugging nightmare.

이후: kebab_config::ConfigNotFound thiserror::Error 추가, Config::load 의
`Some(p) if !p.exists()` arm 이 anyhow::Error::new(ConfigNotFound { path })
return. kebab_app::error_wire::classify 가 downcast → ErrorV1 code=config_not_found,
hint, details.path 채워서 stderr 에 ndjson 으로 emit.

R-1 (relative path): std::path::Path::exists() 는 cwd-relative — 별도 작업 없이
absolute + relative 모두 cover. integration test 두 개로 검증.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:14:54 +00:00
760eee89c8 fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9)
capabilities_snapshot() 가 streaming_ask + single_file_ingest 를 hardcoded false 로
보고했으나 실제 구현은 v0.20 final-dogfood 에서 production-grade:
- kebab ask --stream → answer_event.v1 ndjson 191 event 정상 emit
- kebab ingest-file <path> / kebab ingest-stdin --title <T> → ingest_report.v1 정상

MCP host + Claude Code skill 등 agent 가 schema.capabilities 로 routing 결정 시
false negative → 사용자가 실제 동작 feature 를 사용 불가능하다고 오인.

http_daemon 은 false 유지 (별도 sub-item 의 non-impl).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:13:57 +00:00
f763049923 test(cli): assert 'code' in search --help output (Bug #7 regression pin)
Why: Step 2 의 doc-comment edit 가 향후 누군가 value list 를 재정렬
하거나 alias section 으로 분리할 때 silently 사라질 risk. clap 의
--help 렌더링 가 doc-comment 의 free-form text 라 grep-only smoke 가
유일한 검출 수단.

Change: 신규 test file (kebab-cli convention `cli_*` prefix 답습).
CARGO_BIN_EXE_kebab 으로 fresh binary 실행, stdout 의 `code` substring
assert. spec §4.4 의 acceptance row 1:1 mapping.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4
/ §5 (acceptance row 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:47:55 +00:00
8cf73d1f43 docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)
Why: kebab search --media code 가 v0.18.0 부터 functional support 됨
(MEDIA_KINDS 외 path 로 first-class 처리, schema.v1.media_breakdown.code
존재). 그러나 SearchArgs 의 clap doc-comment + SKILL.md line 57 의
value list 가 stale — `code` 누락. user 가 --help 만 보고 code 미지원이라
오해 가능.

Change: 2 surface 동기 — main.rs line 158-160 의 multi-line clap
doc-comment + integrations/claude-code/kebab/SKILL.md line 57.
Rust binary surface / wire schema 변경 0.

Out of scope (follow-up): crates/kebab-mcp/tools/search.rs:44,
crates/kebab-core/src/search.rs:32+52, crates/kebab-app/src/
ingest_progress.rs:69, crates/kebab-cli/tests/wire_schema_breakdowns.rs:35
도 동일 stale list 보유. spec ACCEPT (round 1c) 의 grep boundary
밖이므로 본 round 미포함.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.3 / §4.3a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:47:16 +00:00
a58ee10dfb fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)
Why: metro-korea.pdf (Identity-H CID font without ToUnicode CMap) 의
ingest 가 pdf_ocr_pages=0 으로 잘못 종료. lopdf 0.32.0 의 emit
`?Identity-H Unimplemented?` marker 28 ASCII char 가 is_valid_text_char()
의 0x0020..=0x007E range 통과 → ratio=1.0 → OCR fallback 0.5
threshold bypass.

Change: MOJIBAKE_MARKERS const + compute_valid_char_ratio() 4-단계
(strip → trim-empty zero → dominance cap-0.3 → 기존 ratio). marker
list extensible. is_valid_text_char() 본체 변경 0.

Tests: +2 unit (dominance + minority) on top of 기존 8. parser_version
/ wire schema 변경 0.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.1 / §4.2 / §6 R-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:42:59 +00:00
e674ff474b fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)
v0.20.0 sub-item 1 dogfood report 의 Bug #4 — F4 mojibake.pdf 의 lopdf
`get_pages()` count = 0 (Pages tree broken). root cause = 기존 byte-
level `re.sub` + manual startxref edit 가 lopdf strict load 통과시키지만
Pages dict 의 `/Kids` reference 깨짐.

- `tests/fixtures/_synth/mojibake.py`: full rewrite — replace byte-level
  `re.sub` + manual startxref with pikepdf open+inject-dummy-ToUnicode+
  del+save (auto xref regen). HYSMyeongJo-Medium CID font: CID font 이
  ToUnicode 를 자체 생성하지 않아 dummy stream 을 inject 후 strip
  (removed=1 invariant). Exit codes 2/3/4 for invariant fail.
- `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerate via
  pikepdf — 1 valid page, no /ToUnicode marker, byte-identical 후 reproducible.
- `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`:
  regen via 2-run cargo test pattern (hand-rolled unwrap_or_else baseline
  bootstrap, no insta crate).
- `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3
  invariant test — (1) lopdf 1-page, (2) /ToUnicode marker absent,
  (3) PdfTextExtractor 1-block invariant.
- `crates/kebab-parse-pdf/src/text_quality.rs`: f4_fixture_ratio_under_threshold
  threshold 0.3 → 0.5 (production valid_ratio_threshold 기본값). 구 broken
  fixture (pages=0) 는 extract_text="" → ratio=0.0; 신 fixed fixture 는
  CID 2-byte fallback decode → ratio≈0.375 — 여전히 OCR trigger 조건 충족.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4)
prior: 241ded5 (Step 3 integration test)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:02:17 +00:00
241ded59df test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
v0.20.0 sub-item 1 bugfix Step 3 (Group C) — integration-level regression
for Bug #3 (intra-doc chunk_id collision under aggressive overlap).

- `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append.
- `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift +
  `single` / `per_page` ctor (backward-compat single + per-page cursor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs`: inline MockOcrEngine 제거 +
  `mod common; use common::mock_ocr::MockOcrEngine;` import. 10 ctor call
  site migration (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs`
  (new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " +
  500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet
  dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape).
- C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock
  injection unavailable (OllamaVisionOcr hardcoded) — helper-level chain test
  (apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3)
prior: 436fd01 (Step 2 Bug #3 chunk_id fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:45:38 +00:00
436fd015a2 fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)
v0.20.0 sub-item 1 dogfood report 의 Bug #3 (Critical). scanned_page2.pdf
(1580 char OCR text) ingest 시 `chunks.chunk_id` PRIMARY KEY violation —
`per_chunk_hash = #c{char_start}` 가 post-overlap `actual_start` 사용 +
overlap walk floor 가 `prev_min` 으로 collapse → segment 1/2 동일 `#c0`.

- `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple
  (segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash`
  suffix uses `segment_start` (pre-overlap boundary, strictly increasing)
  instead of `char_start` (post-overlap, may collapse to prev_min).
- VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade,
  explicit user-facing audit trail). `crates/kebab-app/tests/pdf_pipeline.rs:
  168, 368` 의 hardcoded literal 도 v1.1 로 갱신.
- module docs (`pdf_page_v1.rs:47-60`): workaround description 의
  `#c{char_start}` reference 를 `#c{segment_start}` 로 갱신 + segment_start
  invariant 명문 + HOTFIXES.md cross-ref.
- `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
  regression pin (10 char "가" + ". " + 500 char "나" — multi-chunk +
  overlap walk collapse trigger).
- `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR,
  intra-doc collision root cause, second-iteration patch rationale).

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2)
prior: d9acda5 (Step 1 Bug #2 walker fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:32:09 +00:00
d9acda517a fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)
v0.20.0 sub-item 1 dogfood report 의 Bug #2 — `[ingest.code].max_file_bytes`
가 walker 단계의 모든 file 에 일률 적용 → PDF/image/markdown 의 대부분 (256 KB+)
이 walker pre-extract skip. fix:

- `crates/kebab-source-fs/src/code_meta.rs`: `pub(crate) fn is_code_file(path)
  -> bool` helper 추가 (= `code_lang_for_path(path).is_some()`).
- `crates/kebab-source-fs/src/connector.rs:168-190`: walker size-cap check 가
  `is_code_file(&abs_path) && is_oversized(...)` short-circuit. PDF/image/
  markdown 는 walker bypass — parser 의 자체 size control (lopdf load_mem,
  image OCR max_pixels) 가 cover.
- `crates/kebab-source-fs/src/connector.rs` 기존 mod tests 안 추가:
  `size_cap_skips_only_code_files` — 300 KB PDF + MD + .rs 의 walker 결과
  검증. 기존 sibling test (huge.rs / longfile.rs, fixture 명 `.rs`) regression 0.

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§3)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 1)
prior: b4d9e60 (PR #189)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:20:38 +00:00
b4d9e60816 chore(release): bump version 0.19.0 → 0.20.0 — v0.20.0 sub-item 1 scanned PDF OCR
# v0.20.0 — scanned PDF OCR via Ollama vision LLM

v0.20.0 의 핵심 변경 = embedded text 가 없는 scanned PDF (책 스캔, 영수증,
카메라 page) 의 OCR ingest. PoC 의 5 engine 비교 (Tesseract / EasyOCR /
PaddleOCR / gemma4:e4b / qwen2.5vl:3b) 에서 qwen2.5vl:3b 의 alnum 94.79%
(page1) / 81.56% (받침) 가 모든 다른 engine 을 능가 — 본 release 의 default
vision OCR.

## 1. OCR opt-in 사용법

`[pdf.ocr]` config 의 `enabled = true` 또는 `KEBAB_PDF_OCR_ENABLED=true` env
로 활성화. default off — OCR 한 page 당 45-100s (qwen2.5vl:3b on CPU,
remote Ollama) 의 cost 가 책 archive 외 비-OCR KB 에 부적합.

```toml
[pdf.ocr]
enabled = true
model = "qwen2.5vl:3b"
# 다른 default 는 README 참조
```

qwen2.5vl:3b 의 Ollama pull:

```bash
ollama pull qwen2.5vl:3b   # 3GB Ollama image
```

## 2. v0.19 indexed scanned PDF 의 force-reingest

v0.19 binary 로 scanned PDF 를 ingest 한 KB 는 자동으로 OCR path 진입 안
함 — parser_version "pdf-text-v1" 보존 (CLAUDE.md §Versioning cascade 의
trigger 회피 결정, H-4). 따라서 v0.20 binary upgrade + config
`pdf.ocr.enabled = true` 만 적용 시 try_skip_unchanged 의 Unchanged path 가
OCR 실행을 skip. 명시적 재처리:

```bash
kebab ingest --root /path/to/kb --force
```

## 3. DCTDecode-only v1 scope (FlateDecode / CCITTFax page 처리)

v0.20.0 의 PDF page image extract = lopdf 의 image XObject 의 /Filter ==
DCTDecode 만 cover (JPEG passthrough). 다른 encoding (FlateDecode raw
pixel, CCITTFaxDecode bilevel, JPXDecode JPEG2000) 은 warning event 발행 +
해당 page skip.

scanned PDF 의 일부 page 가 FlateDecode 또는 CCITTFax 로 encoded 시:

```bash
qpdf --object-streams=disable --recompress-flate input.pdf normalized.pdf
```

v1 의 의도 = single binary 원칙 (image crate 도입 0). v1.1+ 또는 별
sub-item 에서 multi-filter 지원 검토.

## 4. Family asymmetry (image OCR gemma4:e4b vs PDF OCR qwen2.5vl:3b)

image OCR (P6) 의 default 는 gemma4:e4b 그대로 (변경 0). PDF OCR (v0.20)
만 qwen2.5vl:3b. 사용자가 [image.ocr] model = "qwen2.5vl:3b" 으로 통일
가능 단 default 는 family asymmetric 보존.

## Dogfood + test 결과

- workspace test: 178 result lines, 0 failure.
- workspace clippy (-D warnings): exit 0.
- alnum e2e (real Ollama, manual invoke):
  - F1 (한국어 page1): 94.79% (≥ 0.85 threshold).
  - F2 (받침-intensive): 81.56% (≥ 0.70 threshold).
- integration smoke + vector PDF regression: pass.

## 변경된 surface

- new config: [pdf.ocr] (11 field) + 11 env override KEBAB_PDF_OCR_*.
- new wire: IngestEvent::PdfOcrStarted/Finished (additive minor).
- new wire: IngestItem.pdf_ocr_pages/ms_total (additive minor).
- new CLI line: "📷 OCR page N..." / "✓ OCR page N (chars chars, msms via ollama-vision)".
- new module: kebab-parse-pdf::{page_image, text_quality} + kebab-app::pdf_ocr_apply.
- dep: workspace lopdf = "0.32" 통합.
- fixture: 5 PDF (F1/F2/F4/F6/F7) under crates/kebab-parse-pdf/tests/fixtures/.

## 변경되지 않은 surface (invariant)

- Extractor::extract trait body byte-identical (PR #187).
- PdfTextExtractor body 변경 0 — post-extract enrichment pattern 으로 분리.
- parser_version "pdf-text-v1" 보존.
- chunker_version "pdf-page-v1" 보존.
- workspace.dependencies 의 production dep graph 변경 0 (-e normal baseline 보존).

## sub-item 의 11 commit history

9d7faab Step 1: foundation + cargo tree baselines
aeeff36 Step 2: lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7)
fb3952d Step 2 fix: F7 conversion engine record correction
c2cd3a7 Step 3: page_image + text_quality modules (10 test)
8d81bc1 Step 3 fix: clippy pedantic in page_image
9f003ef Step 4: pdf_ocr_apply helper (10 test, F7 split + cancel)
fd918a6 Step 5: [pdf.ocr] config section + PdfOcrOpts doc
4672cba Step 5 fix: clippy::bool_assert_comparison in pdf_ocr tests
b9ee09f Step 6: wire PDF OCR enrichment + cancel propagation
4c5ccd5 Step 7: wire schema additive — IngestEvent + IngestItem + skipped
c9e0594 Step 8: CLI printer activation + ingest_progress test + spec literal
4819768 Step 9: integration smoke + vector regression + alnum e2e
1d4e301 Step 9 follow-up: Cargo.lock for dev-dep additions
90726ab Step 10: docs sync (README + HANDOFF + ARCHITECTURE + SMOKE)

## § Acceptance §9 verifier evidence

K5 의 15 row scriptable verifier 모두 green (또는 manual real-Ollama row 의 결과 보고):
- Row #4 (vector PDF byte-identical): pass.
- Row #5 (Extractor::extract trait byte-identical): 0 line diff.
- Row #6 (wire schema additive): jq + diff exit 0.
- Row #7-#8 (clippy / workspace test): exit 0.
- Row #9-#10 (dep graph baseline -e normal): empty diff.
- Row #11 (docs sync): grep evidence.
- Row #12 (version bump): "0.20.0" + Cargo.lock cascade ≥ 22.
- Row #14 (PR #187 invariant): extract_for(&asset.media_type) ≥ 1.
- Row #15 (DCTDecode-only v1, F6/F7 skip): test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:03:44 +00:00
90726ab283 docs(v0.20): sync README + HANDOFF + ARCHITECTURE + SMOKE for scanned PDF OCR (post-extract enrichment, qwen2.5vl:3b, DCTDecode-only v1)
Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

J0 — release notes path decision: commit body (RELEASE_NOTES.md /
docs/RELEASE_NOTES_*.md 부재, v0.17.x/v0.18.0 patterns 의 commit body
release notes 형식 따름). Step 11 K1 commit body 안 inline.

J1 — README.md:
- Configuration section 의 toml table list 에 `[pdf.ocr]` 추가.
- 새 sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11 field
  toml example + `KEBAB_PDF_OCR_*` 11 env override + force-reingest UX
  ("v0.19 indexed scanned PDF 가 v0.20 upgrade 후 자동 OCR 미적용,
  `kebab ingest --force` 필요").

J2 — HANDOFF.md:
- phase status P7 row 확장: 3/3 component + post-extract OCR enrichment
  (v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "머지 후 발견된 결정" entry: v0.20 sub-item 1 의 design + scope
  (H-1 post-extract pattern + DCTDecode-only v1 + parser_version 보존 + H-4 UX).

J3 — docs/ARCHITECTURE.md:
- OCR row 분리: `OCR (image)` (gemma4:e4b 그대로) + `OCR (PDF, v0.20.0+)`
  (qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply,
  DCTDecode-only v1, family asymmetry — PoC alnum 94.79% vs gemma4 27%).
- PDF parser row 확장: page_image::extract_dctdecode_page_image (v0.20.0) +
  parser_version "pdf-text-v1" 보존 + provenance event 차별화.

J3 — docs/SMOKE.md:
- `[pdf.ocr]` 격리 config example (enabled=true, model=qwen2.5vl:3b).
- 새 dogfood section `### v0.20 force-reingest (scanned PDF OCR)`:
  v0.19 → v0.20 upgrade path 의 명시적 `kebab ingest --force` invoke.

J4 — release notes draft (Step 11 K1 commit body 의 source):
- result file 안 record (4 topic: opt-in + force-reingest + DCTDecode-only +
  family asymmetry + dogfood/test 결과).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:34:24 +00:00
1d4e301e5e chore(deps): Cargo.lock for Step 9 dev-dep additions (strsim + kebab-parse-image)
Step 9 (commit 4819768) 에서 추가된 dev-dep (strsim 0.11 + kebab-parse-image
path) 의 Cargo.lock cascade. worker 가 명시적 commit 에 포함 안 함 — follow-up
commit 으로 lock 동기화.

dep graph baseline (-e normal) 영향 0 (dev-dep 만 추가).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:12:38 +00:00
48197687b7 test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1
Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

I3 — crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (신규):
- ingest_with_mock_ocr_yields_pdf_ocr_summary — `#[ignore]` real Ollama,
  ingest_with_config production path + IngestItem.pdf_ocr_pages verify.
- ocr_text_indexed_and_searchable — `#[ignore]` real Ollama, app.search
  의 OCR text indexed verify (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf — production cancel chain (pre-set
  cancel=true + dummy endpoint, no panic/deadlock verify).

I4 — crates/kebab-parse-pdf/tests/text_extractor_regression.rs (신규):
- vector_pdf_extract_byte_identical_to_baseline — F4 mojibake.pdf 의 vector
  PDF path canonical 의 byte-identical 보존 (Step 1-8 모든 변경 전후 invariant).
- baseline 신규 = tests/snapshots/vector_pdf_canonical.json (first run create).
- normalize_provenance_timestamps inline helper (R-3 mitigation, workspace
  전체 부재 — 신규 12-line).

I5 — crates/kebab-parse-pdf/tests/ocr_e2e.rs (신규):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70 — `#[ignore]` real
  Ollama qwen2.5vl:3b, § Acceptance §9 #3 의 implementation.
- alnum metric = strsim::levenshtein (dev-dep 추가).
- truth file copy from PoC scratch (page1.txt + page2-batchim.txt) →
  scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image dev-dep 추가 (OllamaVisionOcr::from_parts 호출용).
  parser isolation invariant 의 dev-dep exception (spec §3.1, dep graph
  baseline -e normal 보존).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 10:10:58 +00:00
c9e05941c5 feat(cli): activate per-page PDF OCR progress printer + test(app): ingest_progress emit verify + spec(pdf-ocr): align §4.6.1 literal with option_A (ms/chars)
Step 8 (Group H) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 reviewer concern fix (spec literal deviation).

H1 — kebab-cli/src/progress.rs printer activation:
- 구 no-op stub `IngestEvent::PdfOcr* { .. } => {}` (Step 6 placeholder)
  를 사람-친화 stderr line printer 로 활성화.
- spec §4.6.1 line 1085-1086 wording 그대로:
  - PdfOcrStarted → `  📷 OCR page {page}...`
  - PdfOcrFinished (skipped=false) → `  ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})`
  - PdfOcrFinished (skipped=true)  → `  ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)` (M-4 의 skipped field carry 활용)
- `!quiet` gate 정합 (AssetStarted/Finished pattern mirror).

H2 — crates/kebab-app/tests/ingest_progress.rs 의 새 test:
- pdf_ocr_progress_emits_started_finished_events (real Ollama 의존, `#[ignore]`).
- F1 fixture (scanned_page1.pdf) ingest 시 pdf_ocr_started + pdf_ocr_finished
  event 가 emit 됨을 verify. Started count == Finished count invariant.
- Manual invoke: `KEBAB_PDF_OCR_ENABLED=true cargo test -p kebab-app --test
  ingest_progress --ignored`.
- mock OcrEngine inject path 부재 (Step 6 의 eager build), Step 9 I5 의
  ocr_e2e pattern (real Ollama + `#[ignore]`) 와 동일.

Step 7 reviewer concern fix — spec §4.6.1 literal:
- line 1076-1077 의 `ocr_ms` / `ocr_chars` literal 을 wire schema 의 실제
  field name `ms` / `chars` (option_A, Rust serde 와 정합) 로 갱신.
- line 1087 의 printer wording 도 `{ocr_chars}` / `{ocr_ms}` → `{chars}` / `{ms}`.
- line 1556 의 rationale 참조 `pdf_ocr_finished.ocr_ms` → `.ms`.
- `skipped` field 도 명시 (Step 6 reviewer M-4 결과).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.6.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 8 H1+H2)
prior: 4c5ccd5 (Step 7 wire schema) — Step 7 reviewer concern 1 의 fix
contract: §9 (additive minor wire bump — Step 7 commit 에서 완료)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 09:18:49 +00:00
4c5ccd5447 feat(wire): additive minor — IngestEvent kind 의 pdf_ocr_* + ingest_report.items[] 의 pdf_ocr_pages/ms_total + skipped field carry (Step 6 M-4/M-2)
Step 7 (Group G) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 6 code reviewer Important M-4 (skipped field carry) + Minor M-2
(ordering invariant doc) fix.

G3 — JSON Schema sync (additive minor — schema_version 보존):

ingest_progress.schema.json:
- kind enum 2 추가: pdf_ocr_started + pdf_ocr_finished.
- 새 field: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- 기존 ms / chars field 의 description 갱신 (pdf_ocr_finished carry 추가).

ingest_report.schema.json:
- items.items.properties 신규 정의 (이전 stub ["array", "null"] 만).
- pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- 모든 기존 IngestItem field 도 명시화 (kind, doc_path, byte_len, ...).

Step 6 reviewer M-4 (Important) — skipped field carry:
- IngestEvent::PdfOcrFinished 에 skipped: bool 추가.
- ingest_one_pdf_asset 의 emit closure (lib.rs:~1864) 가 source
  PdfOcrProgress::Finished { skipped } 를 discard 않고 propagate.

Step 6 reviewer M-2 (Minor) — ordering invariant doc:
- crates/kebab-app/src/ingest_progress.rs 의 ordering text 갱신:
  ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
  PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).

.md doc (docs/wire-schema/v1/*.md) 부재 — plan §3 Step 7 G3 의 .md
deliverable retro N/A (해당 file 0).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: b9ee09f (Step 6 wiring) + Step 6 reviewer M-4/M-2 권고
contract: §9 (additive minor wire bump — schema_version 보존)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 08:51:51 +00:00
b9ee09f176 feat(app): wire PDF OCR enrichment + cancel propagation into ingest_one_pdf_asset (H-5 eager init + post-extract hook + per-page cancel) + workspace lopdf dep (Step 4 M-4)
Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 spillover (IngestEvent variant + IngestItem field for compile
boundary) + Step 4 reviewer Minor M-4 fix.

E1 — eager PDF OCR engine build at `ingest_with_config_opts` entry,
mirror of image OCR pattern (lib.rs:338-347). `pdf.ocr.enabled ||
always_on` 시 `OllamaVisionOcr::from_parts(endpoint, model, ...)` 호출
+ fail-fast `?`. App field 추가 0 (local var only, spec L-1 / Step 1
A1 cosmetic fix 정합).

E2 — `ingest_one_pdf_asset` signature extension: +3 param
(`pdf_ocr_engine: Option<&OllamaVisionOcr>`, `progress: Option<&
mpsc::Sender<IngestEvent>>`, `cancel: Option<&Arc<AtomicBool>>`).
`ingest_one_asset` dispatch wrapper + caller (dispatch loop) update.

E3 — post-extract enrichment block at `extract_for` 직후 (line 1779).
`pdf.ocr.enabled || always_on` 시 `apply_ocr_to_pdf_pages` 호출,
PdfOcrProgress → IngestEvent emit (PdfOcrStarted / PdfOcrFinished
with ocr_engine), summary 의 pages_ocrd/ms_total 을 IngestItem field
로 carry. PR #187 registry dispatch invariant 보존
(`extract_for(&asset.media_type, ...)` 그대로).

E4 — cancel handle propagation: ingest_with_config_cancellable →
IngestOpts.cancel → ingest_with_config_opts → ingest_one_asset →
ingest_one_pdf_asset (new `cancel` param) → PdfOcrOpts.cancel chain.
spec §4.8 line 1159 production wiring.

Step 7 spillover (compile boundary):
- `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } +
  PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant
  `pdf_ocr_started` / `pdf_ocr_finished` (Step 7 G3 wire schema 와 일치).
- `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> +
  pdf_ocr_ms_total: Option<u64> (warnings/error 사이). 11 non-PDF
  IngestItem construct site 가 `None` 채움.
- `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`:
  새 variant no-op handler (v1에서 per-page progress 미노출, future
  refinement 시 활성화 가능).
- `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot
  `ingest_report.snapshot.json`: 2 IngestItem fixture 의 새 field 추가.
- Step 7 의 JSON Schema 갱신 + CLI printer activation + snapshot
  regenerate 는 별 commit (G3/H1/H2 deliverable).

M-4 (Step 4 reviewer Minor) — lopdf workspace dep 통합:
- workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`.
- kebab-app + kebab-parse-pdf 의 direct dep → `{ workspace = true }`.

Verifier evidence:
- workspace test (`cargo test --workspace --no-fail-fast -j 1`):
  175 test result summary lines, 0 failures, 0 FAILED.
- workspace clippy (`-D warnings`): exit 0, 0 warning.
- dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`):
  empty diff for both.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2)
prior: 4672cba (Step 5 fix) + fd918a6 (Step 5) + 9f003ef (Step 4 helper)
contract: §9 (additive minor wire bump — Step 7 JSON Schema 완료 시)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 08:18:34 +00:00
4672cba6c6 fix(config): satisfy clippy::bool_assert_comparison in pdf_ocr tests
fd918a6 의 F2 test file (crates/kebab-config/tests/pdf_ocr.rs) 의 4 line
`assert_eq!(bool_field, true|false)` 가 workspace clippy pedantic
의 `bool_assert_comparison` 위반 → CI gate
`cargo clippy --workspace --all-targets -- -D warnings` exit 1.

각 assertion 의 canonical form 적용:
- assert_eq!(x, false) → assert!(!x)
- assert_eq!(x, true)  → assert!(x)

semantic + behavior 동일, 4 line edit, logic 변경 0.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-05-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-05-spec-review-result.md (I-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:17:46 +00:00
fd918a60ce feat(config): add [pdf.ocr] section — qwen2.5vl:3b default, opt-in + env overrides + doc(app): PdfOcrOpts field doc (Step 4 I-1)
Step 5 (Group F) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 4 reviewer Important I-1 fix (PdfOcrOpts field doc) 동봉.

F1 — `kebab-config::PdfCfg` + `PdfOcrCfg` + 4 default fn:
- PdfCfg { ocr: PdfOcrCfg }.
- PdfOcrCfg with 11 field (enabled/always_on/engine/model/endpoint/
  languages/max_pixels/request_timeout_secs/valid_ratio_threshold/
  min_char_count/lang_hint).
- defaults: opt-in (enabled=false), qwen2.5vl:3b, 0.5 threshold, 20 char.
- mirror of image OCR cfg pattern (spec §4.5).

Config struct extension:
- `pdf: PdfCfg` field with `#[serde(default = "PdfCfg::defaults")]`.

11 env var override (parallel to KEBAB_IMAGE_OCR_*):
KEBAB_PDF_OCR_{ENABLED,ALWAYS_ON,ENGINE,MODEL,ENDPOINT,LANGUAGES,
MAX_PIXELS,REQUEST_TIMEOUT_SECS,VALID_RATIO_THRESHOLD,MIN_CHAR_COUNT,
LANG_HINT}.

F2 — `crates/kebab-config/tests/pdf_ocr.rs` (신규):
- toml roundtrip (11 field).
- defaults (opt-in + qwen2.5vl:3b).
- env override (4 key sample + default preservation).

F3 (Step 4 I-1) — `pdf_ocr_apply.rs` 4 public item 의 doc comment:
- PdfOcrOpts struct + 6 field.
- PdfOcrSummary struct + 2 field.
- apply_ocr_to_pdf_pages fn (Errors block 포함).
- PdfOcrProgress enum + 2 variant + 5 field.
body 변경 0, doc-only.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 5 F1+F2)
prior: 9f003ef (Step 4) — code reviewer Important I-1 resolution
contract: §9 (additive minor wire bump — Step 7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:07:18 +00:00
9f003ef1cd feat(app): add pdf_ocr_apply helper (10 test, F7 split + cancel) — post-extract OCR enrichment for PDF (H-1 resolution)
Step 4 (Group D) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

D1 — `apply_ocr_to_pdf_pages(&mut canonical, &dyn OcrEngine, &bytes, &opts, emit_progress)`
in `kebab-app::pdf_ocr_apply`. spec §4.1 line 381-599 body 그대로 +
PdfOcrOpts.cancel field + per-page cancel check (verifier LOW L-1).

post-extract enrichment pattern (H-1 resolution): kebab-parse-pdf 가
kebab-parse-image::OcrEngine 을 import 하지 않음 (parser isolation 보존).
helper 가 kebab-app 의 facade 안 — both parser crate 의 cross-import 회피.

Per-page decision matrix (spec §4.1 line 459-464):
- always_on=true → 모든 page OCR (dual-block, ordinal = page-1 + page_count).
- always_on=false + needs_ocr → in-place OCR (text-detect block mutate).
- needs_ocr=false → skip.

DCTDecode-only v1 (H-3): FlateDecode / CCITTFaxDecode page 는
extract_dctdecode_page_image=None → Warning event + skip + emit_progress(skipped=true).

OcrEngine.recognize 실패 → Warning event + skip + emit_progress(skipped=true).

D3 — per-page cancel handle (verifier LOW L-1 + spec §4.8 line 1159):
PdfOcrOpts.cancel: Option<Arc<AtomicBool>>. set→true 시
`anyhow::bail!("PDF OCR cancelled mid-PDF at page N")`.

lopdf = "0.32" added to [dependencies] (already transitive via kebab-parse-pdf;
no new crate introduced — dep graph kebab-parse-* baseline unchanged).

Integration test (`tests/pdf_ocr_apply.rs`, 10 test):
- f1_input_with_ocr_enabled_replaces_empty_block — in-place mutate.
- f3_input_with_ocr_enabled_keeps_text_detect_blocks — vector PDF skip.
- f1_input_with_ocr_disabled_keeps_empty_block — disabled no-op.
- f4_input_with_ocr_enabled_replaces_mojibake_block — mojibake → in-place mutate.
- f3_input_with_always_on_pushes_dual_blocks — always_on dual-block.
- f6_flatedecode_skipped_with_warning — FlateDecode skip + Warning event.
- f7_ccittfax_skipped_with_warning — CCITTFax skip + Warning event (verifier M-4 split).
- ocr_engine_failure_surfaces_as_warning — OCR failure → Warning event.
- dual_block_ordinals_are_deterministic_and_unique — ordinal invariant.
- cancel_handle_aborts_mid_pdf — cancel handle 의 production source (D3).

MockOcrEngine fixture: spec §5.5 line 1284-1299. F3 fixture 부재 →
mock CanonicalDocument construction + F1 bytes reuse pattern (Option B:
PdfTextExtractor::extract 를 통한 실제 production path canonical 생성).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 + §5.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 4 D1+D2+D3)
prior: c2cd3a7 (Step 3) + 8d81bc1 (Step 3 clippy fix)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 06:42:01 +00:00
8d81bc1071 style(parse-pdf): satisfy clippy pedantic in page_image (uninlined_format_args + map_unwrap_or)
c2cd3a7 의 `extract_dctdecode_page_image` 에 workspace clippy pedantic 위반 2 건
잔존 → CI gate (cargo clippy --workspace --all-targets -- -D warnings) fail.
두 lint 모두 1-line edit + semantic 동일, logic 변경 0.

- L20  uninlined_format_args: format!("page {} not in get_pages()", page_num)
                              → format!("page {page_num} not in get_pages()")
- L48-52 map_unwrap_or:        .map(|n| n == b"Image").unwrap_or(false)
                              → .is_some_and(|n| n == b"Image")

cargo clippy --workspace --all-targets -j 4 -- -D warnings → exit 0.
cargo test -p kebab-parse-pdf -j 4 → 21 passed (regression 0).

review trail:
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-03-spec-review-result.md (SPEC_COMPLIANT)
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-03-code-review-result.md (Critical C-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 06:14:00 +00:00
c2cd3a7ab7 feat(parse-pdf): add page_image (DCTDecode passthrough, 2 test) + text_quality (valid char ratio, 8 unit test) modules
Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

C1 — `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` ->
Result<Option<Vec<u8>>>. lopdf 의 Resources/XObject traverse, 첫 image
XObject 의 /Filter 검사 (single Name OR Array form 모두 cover, spec §4.1
line 642-664), DCTDecode + JPEG magic 검증 통과 시 raw bytes 반환. 다른
encoding 또는 image XObject 부재 시 Ok(None). v1 scope = DCTDecode
passthrough only (H-3 invariant, image crate 도입 0).

Integration test (`tests/page_image.rs`, 2 test):
- f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path.
- flate_raw_fixture_yields_none — F6 fixture negative path.

C2 — `text_quality::compute_valid_char_ratio(s) -> f32`. valid char =
ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin
Extended + common Korean punctuation. 빈 string → 0.0. caller
(`kebab-app::pdf_ocr_apply`) 가 threshold 와 비교 (default 0.5).

Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A — lopdf extract_text 가
  ToUnicode CMap 부재 시 빈 string 반환 → valid_ratio = 0.0000 < 0.3).

Also: Cargo.toml description 갱신 ("Text PDF extractor + scanned-page
image extract helpers ...", Step 1 A2 이연분).

fixture fix: mojibake.pdf 의 startxref 22130 → 22114 (16-byte offset 오차
수정 — lopdf strict parser 가 xref 를 찾지 못하는 버그 해결).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:59:10 +00:00
fb3952d54f docs(pdf-ocr): correct F7 conversion engine record in PoC doc (gs, not ImageMagick)
aeeff36 의 PoC doc append (engine-comparison.md L134, L141) 가 F7 (`ccitt.pdf`)
의 conversion engine 을 "ImageMagick `convert -compress Group4`" 로 기록했으나,
실제 tests/fixtures/_synth/flate_ccittfax.sh:77-83 은
`gs -sDEVICE=pdfwrite -dMonoImageFilter=/CCITTFaxEncode -dEncodeMonoImages=true`
flag 사용 (ImageMagick `convert` 호출 0회).

fixture binary (`/Filter [ /CCITTFaxDecode ]`, 2060 bytes) 는 invariant 충족 OK
(Step 2 spec compliance + code quality review verified). historical record 의
factual correction only.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-02-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-02-spec-review-result.md
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-02-code-review-result.md (I1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:36:56 +00:00
aeeff3635b poc+test(pdf-ocr): lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) for v0.20 sub-item 1
Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).

Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
  useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
  manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
  ImageMagick `convert -compress Group4` used for F7 CCITTFax.

B2 — 5 fixture 합성·commit under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, 한국어).
- F2 scanned_page2.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, 받침).
- F4 mojibake.pdf       — DejaVu TTF + ToUnicode CMap stripped (count=0);
                          Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf      — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf          — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).

Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py    — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py       — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 04:04:47 +00:00
9d7faab650 docs+chore(plan-bootstrap): apply spec L-1 cosmetic fix + capture cargo tree baselines for v0.20 sub-item 1 verifier gates
Step 1 (Group A) of v0.20.0 sub-item 1 (scanned PDF OCR) implementation plan.

A1 — spec §4.2 line 740 prose pseudo-code fix: `app.pdf_ocr_engine.as_ref()`
→ local `pdf_ocr_engine: Option<OllamaVisionOcr>` built in
`ingest_with_config_opts` (정합 with §4.4 eager init, App field 도입 0).

A2 — Cargo.toml dep invariant verified (image crate 미도입 — H-3 DCTDecode-only
v1 invariant 보존; kebab-parse-pdf + kebab-parse-image 가 kebab-app 의 기존
dep). description 갱신은 Step 3 (module 추가 후) 으로 이연.

A3 — cargo tree baseline 캡처 — K5 row #9/#10 의 ground-truth
(.omc/state/pdf-ocr-{app-parse,parse-pdf}-deps.baseline.txt). 본 sub-item
의 다른 step 의 dep graph 변경 0 invariant 의 verifier 의 baseline.
Note: .omc/ 는 .gitignore 대상 — baseline files 는 로컬 파일로 존재.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (round 1c ACCEPT)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 03:40:49 +00:00
bcd1e37dab chore(repo): .omc/ ignore + AGENTS·GEMINI symlinks + release notes 작성 가이드 강화
- .gitignore: .omc/ (OMC state directory) 추가 — .claude/, .superpowers/ 와 동급
- AGENTS.md / GEMINI.md: CLAUDE.md 로의 symlink — Codex / Gemini CLI 도 동일 지침 따르도록
- CLAUDE.md release 절차: release notes 가 commit subject 단순 나열 대신 사용자 친절한 설명 + 도그푸딩/테스트 결과 포함하도록 가이드 강화

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 23:58:04 +00:00
e7a4330798 Merge pull request 'docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill' (#188) from docs/v0-20-image-pdf-handoff into main
Reviewed-on: #188
2026-05-26 23:36:49 +00:00
574e1b1ca1 docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill
v0.19.0 release 후 다음 session 인계용 handoff 문서 + 사후 backfill.

- docs/superpowers/handoffs/2026-05-26-v0.20-image-pdf-normalize-handoff.md (540 lines, 9 section)
  - sub-item 1/2/3 머지 결과 + 도그푸딩 baseline (1781 doc / 9050 chunks) + user memory + OMC workflow + 빌드 환경
  - 현재 구현 상태 (v0.19.0, image+pdf) — 정확한 file:line + struct/fn signature + flow
  - 8 TODO 상세 (problem + scope + affected files + risk + trigger 조건)
  - 우선순위 + sequencing 권장 + 새 session 첫 단계 제안

- docs/superpowers/specs/2026-05-26-extractor-dispatch-unification-spec.md (sub-item 3 spec)
- docs/superpowers/plans/2026-05-26-extractor-dispatch-unification-plan.md (sub-item 3 plan)

PR #187 머지 시 source code 만 들어가고 spec/plan 누락 — 동일 PR 의 reference link 가 main 에서 404. 본 commit 으로 backfill.

Assisted-by: Claude Code
2026-05-26 23:34:17 +00:00
c1e82cca92 Merge pull request 'refactor(app): extract dispatch polymorphism — App.extract_for(...) + 11 Extractor registry' (#187) from refactor/extractor-dispatch-unification into main
Reviewed-on: #187
v0.19.0
2026-05-26 21:07:20 +00:00
2c05dbd0dd refactor(app): extract dispatch polymorphism — App.extract_for(...) + 11 Extractor registry
kebab-app 의 hardcoded extract dispatch (`ImageExtractor` + `PdfTextExtractor` + 9 AST `*Extractor` 의 `::new().extract(…)` callsite 11곳 + 9 AST arm match) 를 `App::extract_for(&MediaType, &ExtractContext, &[u8])` 단일 polymorphic call 로 통합. trait 변경 0, parser source 변경 0, wire schema 변경 0 (success path).

핵심 변경:
- App struct 에 `pub(crate) extractors: Vec<Box<dyn Extractor + Send + Sync>>` field + `pub(crate) fn extract_for(...)` helper method.
- App::open_with_config 의 registry init = 11 Extractor (image + pdf + 9 AST).
- ImagePipeline struct 의 `extractor: &'a ImageExtractor` field 제거 + lib.rs:356 local + lib.rs:1235 alias 삭제 (atomic block).
- 9 AST arm (lib.rs:2012-2047 의 12 arm = 11 explicit + 1 wildcard) → 4 arm (9 AST grouped + 7 manifest + 1 shell + 1 other-bail).
- in-crate unit test (app.rs 의 `mod tests_extractor_dispatch`) 3 class: registry length 11 / mutually-exclusive supports() grid (16 sample MediaType) / extract_for error path (Audio).

scope = AST 9-arm + image + pdf extract callsite only. MarkdownExtractor / Tier 2/3 / outer 4-arm / inner 4 match / Chunker dispatch 모두 future-defer (별 PR — spec §11).

Wire schema (success path) 변경 0 — ingest_report.v1 / search_response.v1 / answer.v1 byte-identical (4-medium SMOKE 비교 검증). error.v1.message 의 internal context string wording 변경 (예: `kb-parse-image::ImageExtractor::extract` → `kb-app::extract_for (image)`) 은 spec §5.5 risk acceptance — `error.v1.code` + `error.v1.schema_version` 보존, user-visible surface 외. Cargo workspace.version bump 0.

Refs:
- docs/superpowers/specs/2026-05-26-extractor-dispatch-unification-spec.md (2 round APPROVE)
- docs/superpowers/plans/2026-05-26-extractor-dispatch-unification-plan.md (3 round APPROVE)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 17:43:44 +00:00
96766406aa Merge pull request 'refactor(parse-md): absorb kebab-normalize + kebab-parse-types — 24 → 22 crates + §3.7b 재작성' (#186) from refactor/normalize-absorption into main
Reviewed-on: #186
2026-05-26 15:37:21 +00:00
710945c4b0 refactor(parse-md): absorb kebab-normalize + kebab-parse-types — 24 → 22 crates + §3.7b 재작성
design §3.7b 의 thin layer (ParsedBlock 류) 가 4 parser 중 1개 (markdown) 만 lift 를
경유하는 현실 — fan-in/fan-out 모두 1 → layer 의미 잃음. kebab-normalize (1097 LOC)
+ kebab-parse-types (98 LOC) 둘을 kebab-parse-md 로 흡수.

설계: docs/superpowers/specs/2026-05-26-normalize-absorption-spec.md
플랜: docs/superpowers/plans/2026-05-26-normalize-absorption-plan.md
HOTFIXES: tasks/HOTFIXES.md 의 2026-05-26 entry (design deviation)

- 5 사용 type + 3 forward-declared struct → kebab-parse-md::types module 의 pub explicit re-export.
- build_canonical_document + derive_title + warning_agent → kebab-parse-md::normalize module.
- 4 hard-coded agent literal (lib.rs:122/128/134/153) + warning_agent body return + tracing target literal 모두 보존 — stage label 일관성.
- kebab-app callsite (lib.rs:51 use + :1119 context string) + Cargo.toml 의 2 dep (regular + dead) 제거.
- kebab-chunk + kebab-store-sqlite 의 [dev-dependencies] kebab-normalize → 제거 (kebab-parse-md 로 갈음). 통합 test source 의 use shift.
- test file 이동 (kebab-normalize/tests/normalize_snapshot.rs → kebab-parse-md/tests/).
- workspace Cargo.toml: Hunk (a) members 2 entry 삭제 + Hunk (b) version 0.18.0 → 0.19.0 (frozen contract 변경).
- design §3.7b 4-단락 재작성 (원래 intent 보존 + 현재 상태 + 보존된 surface + future re-extraction trigger).
- design §8 graph 갱신 (3 edge 제거 + 2 forbidden bullet 의미 갱신 + commentary).
- ARCHITECTURE.md crate graph + directory tree mechanical 갱신.
- tasks/INDEX.md L169 closure mention + "Future work / deferred" 섹션 신설 (image/pdf normalize integration entry).
- tasks/HOTFIXES.md 신규 entry (4-block — design deviation Symptom).
- HANDOFF.md cross-link 한 줄.
- 3 dead struct (ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment) 는 보존 — v0.20+ image/pdf normalize integration 의 future surface (spec §11).

Wire / surface impact: 0건. CLI / TUI / MCP / --json 출력 / config / XDG path /
parser_version 모두 unchanged. wire-invisible provenance.events[].agent + tracing target
literal "kb-normalize" 도 보존 — old DB row 와 new DB row 의 audit log 일관성.

Verification: cargo test --workspace --no-fail-fast -j 1 → 1313 passed / 0 failed (172 result blocks).
cargo clippy --workspace --all-targets -j 1 -- -D warnings → 0 warning (5m 46s).
cargo metadata --no-deps --format-version 1 | jq '.workspace_members | length' = 22.
cargo tree -p kebab-app --depth 2 | grep -E "kebab_(parse_types|normalize)" = 0 줄.
2026-05-26 15:00:59 +00:00
d4395a306b Merge pull request 'refactor(source-fs): drop kebab-parse-code dep — 9 tree-sitter grammars drag 제거' (#185) from refactor/source-fs-dep-lightening into main
Reviewed-on: #185
2026-05-26 12:31:29 +00:00
bd48baa19a refactor(source-fs): drop kebab-parse-code dep — 9 tree-sitter grammars drag 제거
kebab-source-fs 가 kebab-parse-code 의 9 tree-sitter grammars 를 drag 했던 무거운 의존성 제거. 4 surface (code_lang_for_path / is_generated_file / is_oversized / BUILTIN_BLACKLIST) 만 사용하지만 dep 그래프에서 9 grammar 전체 link → kebab-source-fs::code_meta 로 이전 + kebab-parse-code 측 cleanup.

핵심 변경:
- kebab-source-fs::code_meta 신설: 4 surface 이전 (BUILTIN_BLACKLIST `pub` for frozen contract + 3 helper fn `pub(crate)`). lib.rs 의 `pub use code_meta::BUILTIN_BLACKLIST` 1 줄 추가 (Option A — 다른 mod surface 무근거 확장 0).
- callsite migration: media.rs (1) + walker.rs (2) + connector.rs (2) 모두 `kebab_source_fs::code_meta::*` 로 갱신.
- kebab-parse-code 측 cleanup: skip.rs 삭제 + lang.rs narrow edit (code_lang_for_path body + unit test 2 + Path import 삭제, module_path_for_* 보존) + lib.rs 헤더 doc rewrite (migration breadcrumb 포함).
- tests/{lang,skip}.rs 13 test 이동 — 12 unit (`src/code_meta.rs::tests`) + 1 integration (`tests/code_meta.rs` for BUILTIN_BLACKLIST frozen contract).
- design §8 graph: edge 제거 + p10-2 inline note. ARCHITECTURE.md 산문 1 줄 갱신. kebab-core::metadata.rs:36 stale dep reference 정정.

G1+G5: cargo tree -p kebab-source-fs | grep tree-sitter = 0 줄.
G2+G3: workspace test 회귀 0 + 13 test 1:1 이동.
G4: design §8 + ARCHITECTURE.md 갱신.

Wire 영향: 없음 (internal Rust crate-API surface 만, user-facing 0). Cargo workspace.version bump 불필요.

Refs:
- docs/superpowers/specs/2026-05-26-source-fs-dep-lightening-spec.md (v3, 4-round APPROVE)
- docs/superpowers/plans/2026-05-26-source-fs-dep-lightening-plan.md (v4, 4-round ACCEPT)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:19:32 +00:00
b02ac8200e Merge pull request 'fix(rag): S3 NLI unavailable — hypothesis char budget + token-count fallback retry' (#184) from fix/s3-nli-model-unavailable-diagnose into main
Reviewed-on: #184
2026-05-26 09:17:12 +00:00
336962715a fix(rag): S3 NLI unavailable — hypothesis char budget + token-count fallback retry
S3 dogfood query 의 `nli_model_unavailable` consistent fail root cause = mDeBERTa-v3 tokenizer 의 `OnlyFirst` strategy + 949-token hypothesis. 기존 char-budget 단독 fix 의 KR-extreme density 미해결 → token-count fallback retry + RC1-residual trait dispatch 정합.

핵심 변경:
- kebab-nli::NliVerifier: `hypothesis_token_count(&str) -> Result<usize>` trait method 추가 (default `Ok(0)` backward-compat). `OnnxNliVerifier` 가 *trait impl block* 안에서 real mDeBERTa tokenize override — vtable 등록 보장 (round-3 critic RC1-residual closure).
- kebab-rag::pipeline: `MAX_NLI_HYPOTHESIS_CHARS_INITIAL = 1200` + `MAX_NLI_HYPOTHESIS_CHARS_MIN = 150` const + `pub(crate) fn truncate_chars` pure-fn + `pub fn truncate_hypothesis_for_nli_with_budget` retry helper (char budget 반감 retry, min floor 시 graceful unavailable). step 8.5 hook 의 callsite explicit `match` + `return self.refuse_nli_model_unavailable` 패턴 (`?` 금지 — round-2 plan critic CRITICAL #1 closure).
- SpyNliVerifier 신규 helper (closure score_fn + hypothesis_token_count_fn, 2-arg constructor).
- §5.1 의 2 ignored test (EN-long err + vtable dispatch RC1-residual pin) + §5.2 의 4 boundary test (truncate_chars) + §5.3 의 3 mock multi-hop test (long_en_grounded / long_kr_retries / unrelenting_fallback). +7 new tests (2 ignored default skip).
- tasks/HOTFIXES.md 신규 dated entry `## 2026-05-26 — S3 NLI unavailable ...` — Symptom / Root cause / Action / Amends 4-block.
- spec + plan (`docs/superpowers/{specs,plans}/2026-05-26-s3-nli-model-unavailable-diagnose-*.md`) — 4 round spec + 3 round plan OMC reviewer ACCEPT 산출물.

검증:
- cargo test -p kebab-nli -j 1 → 11/11 pass + 7 ignored default skip.
- cargo test -p kebab-rag -j 1 → 19+3+3+... 전체 pass + 3 new mock + 4 new boundary.
- cargo test --workspace --no-fail-fast -j 1 → **1313 pass (+7 new)**, 0 failed. 회귀 0 (HOTFIX #15 이미 fixed, no remaining flaky).
- cargo clippy --workspace --all-targets -j 1 -- -D warnings clean (type_complexity allow on Arc<dyn Fn> type aliases).

KR safe (token-count retry path) + graceful fallback (min floor 시 기존 unavailable wire 유지, regression 0). Wire 영향 없음 (additive trait method). Cargo bump 불필요.

Refs:
- spec: docs/superpowers/specs/2026-05-26-s3-nli-model-unavailable-diagnose-spec.md (4 round APPROVE — analyst → critic + verifier × 4 rounds)
- plan: docs/superpowers/plans/2026-05-26-s3-nli-model-unavailable-diagnose-plan.md (3 round ACCEPT — planner → critic-plan + verifier-plan × 3 rounds)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 09:12:21 +00:00
1a224bf983 Merge pull request 'fix(mcp): HOTFIX #15 — MCP ask multi_hop dispatch-divergence assertion (fixture 보강)' (#183) from fix/hotfix-15-mcp-ask-multi-hop-flaky into main
Reviewed-on: #183
2026-05-26 07:02:26 +00:00
a210bf5d52 docs(rag): HOTFIX #15 spec + plan (3 round OMC reviewer approve)
OMC team `hotfix-15-mcp-flaky` 의 spec + plan 작성 + 리뷰 산출물.

- spec: analyst 가 진단 (root cause = PR-7 probe-first 가 PR-5 test 의 stale empty-KB contract 와 mismatch) + Option A 권장 (test-only fix). 3 round review (critic + verifier): CRITICAL C1 (fixture/query FTS5 0 hits) + MAJOR M1/M2 + 등 closure.
- plan: planner 가 7 steps + subagent dispatch task 작성. 3 round review (critic-plan + verifier-plan): empirical SQLite REPL 검증, level-1 dated entry placement, actual KebabHandler/KebabAppState pattern 정합.

implementation = 429287f commit (executor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 06:52:04 +00:00
429287f6cb fix(mcp,tests): HOTFIX #15 — MCP ask multi_hop dispatch-divergence assertion (fixture 보강)
PR-7 (v0.18 dogfood probe-first) 머지 후 PR-5 의 test `ask_tool_routes_multi_hop_true_to_decompose_first` 가 stale empty-KB contract 로 deterministic fail. test-only fix — production code 0 touch.

- `minimal_config`: `score_gate = 0.0` (probe 의 second gate `top_score < score_gate` 우회, test config isolation).
- fixture `workspace_root/note.md`: "This note is about a compound containing X and Y in detail." — build_match_string 의 token_and branch (FTS5 implicit-AND) 가 `compound` + `about` + `and` 셋 다 매칭 필요. empirical SQLite REPL (V007 trigram DDL) 로 1 hit 확정.
- 기존 assertion 보존, single-pass branch 도 query "anything" 으로 fixture 미매칭 → NoChunks refusal 유지.
- 신규 `_multi_hop_short_circuits_when_probe_empty` test (REQUIRED — round-1 critic HIGH + verifier 격상): probe-empty short-circuit 의 MCP-layer wire shape pin (kebab-rag::multi_hop_empty_probe_pool_refuses_before_any_llm_call 은 RAG-layer 만 pin, MCP-layer 안전망 부재).
- module doc 갱신: 두 test 가 각각 pin 하는 contract enumerate. inline 주석 (line 94-101) 도 새 contract 정합.
- HOTFIXES.md 신규 dated entry \`## 2026-05-26 — HOTFIX #15 ...\` (date-top convention).

검증: cargo test --workspace -j 1 — 회귀 0 (known flaky 1 → 0). cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.

Wire / behavior / version cascade: 0.

Refs: docs/superpowers/specs/2026-05-26-hotfix-15-mcp-ask-multi-hop-flaky-spec.md (review 3 rounds APPROVE)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 06:51:06 +00:00
08495eb425 Merge pull request 'chore(release): bump version 0.17.2 → 0.18.0 + cut fb-41 multi-hop' (#182) from chore/v0-18-0-cut into main
Reviewed-on: #182
v0.18.0
2026-05-26 05:36:05 +00:00
98cf4e8a04 chore(release): bump version 0.17.2 → 0.18.0 + cut fb-41 multi-hop
v0.18.0 cut PR. fb-41 multi-hop RAG + NLI verification 의 user-visible surface (PR #176-180) + post-PR9 cleanup/refactor (PR #181) ship 마무리.

## 변경 사항

### Version
- workspace `Cargo.toml`: 0.17.2 → 0.18.0. Cargo.lock 자동 cascade (24 kebab-* crate 모두 0.18.0).

### Frozen design contract
- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`:
  - §3.8 RAG types — RefusalReason 에 NliVerificationFailed + NliModelUnavailable + MultiHopDecomposeFailed 추가 + Multi-hop RAG + NLI verification 의 ask_multi_hop facade + step 8.5 NLI hook + HopRecord / VerificationSummary 명시.
  - §9 versioning rules 표 — nli_model_version row 신규 (선택 — v0.19+ second adapter 시 wire surface candidate).

### Status transitions
- `docs/superpowers/specs/2026-05-25-p9-fb-41-finalize-spec.md`: status approved-by-team → completed.
- `docs/superpowers/plans/2026-05-25-p9-fb-41-finalize-plan.md`: status approved-by-team → completed (spec_status 도).

### User-facing docs
- `README.md`: 명령 표의 `kebab ask` row 에 `--multi-hop` flag + NLI 옵션 안내 한 단락 (mDeBERTa-v3 XNLI 280 MB 자동 다운로드 / RAM peak ~7-8 GB / threshold tuning 0.5 prod / 0.0 disable).
- `docs/SMOKE.md`: `[rag] nli_threshold = 0.0` config 예시 + 활성화 절차 + first-run download + RAM 권장 inline 안내.

### Handoff + dashboard
- `HANDOFF.md`: 한 줄 요약 의 현재 version 0.17.2 → 0.18.0. v0.18.0 cut entry 추가 (fb-41 multi-hop + NLI + cleanup ship). Component 카운트 단락에 fb-41 PR-9 의 kebab-nli + ask_multi_hop 추가 명시. 머지 후 결정 절 맨 위에 v0.18.0 fb-41 entry 신규.
- `tasks/INDEX.md`: p9-fb-41  머지 (v0.18.0). v0.18.0 subsection 신규 — PR #176-181 의 6 sub-PR + cleanup 각 한 줄 요약.

## 비범위 / 별 작업
- HOTFIXES.md 의 fb-41 entry 는 이미 PR #180 (PR-9d closure) 에서 작성 완료 — 본 cut PR 에서 추가 anchor 불필요.
- SKILL.md 의 v0.18+ NLI 안내는 이미 PR-9c-2 에서 inline 추가 완료.

## 검증
- `cargo check --workspace -j 1` 통과 (모든 24 crate v0.18.0 확인).
- frozen design 의 RefusalReason enum 확장이 kebab-core 의 production code 와 정합 (PR-9c-1 시점부터 동일 variants 있음).

Wire 영향: 없음 (additive minor 는 PR-9c-1 에서 이미 ship, 본 commit 은 documentation cascade only).
Behavior 영향: 없음.

머지 후 `gitea-release v0.18.0` 으로 tag + release notes 작성.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 05:18:08 +00:00
4030f04f37 Merge pull request 'chore: workspace-wide cleanup — clippy::pedantic baseline + auto-fix' (#181) from chore/workspace-wide-cleanup-pre-v0-18 into main
Reviewed-on: #181
2026-05-26 04:48:50 +00:00
7c27633df2 chore(rag): post-PR9 refactor — H1/H2/H3/D/E + test coverage + post-refactor dogfood retest
OMC team `post-pr9-refactor` 의 architectural cleanup. architect priorities 분석 후 executor + test-engineer 가 file edits, system-architect 가 component-level review 로 *pre-cut nothing — all v0.18.1+ defer* 결론.

## Executor 작업 (H1/H2/H3/D/E)

- **H1** (kebab-nli/src/onnx.rs): `[models.nli]` config wire 활성화. `DEFAULT_MODEL_ID` const 제거 (kebab-config 의 NliCfg::defaults 가 single source). OnnxNliVerifier::new 가 config.models.nli.model 읽고 config.models.nli.provider 가 "onnx" 아니면 anyhow::bail. 3 stale "PR-9c-1 will wire this" 코멘트 제거. 2 unit test 추가 (`new_uses_config_model_id`, `new_rejects_unsupported_provider`).
- **H2** (kebab-rag/src/pipeline.rs): `truncate_for_nli(premise: &str, _hypothesis: &str)` → `truncate_for_nli(premise: &str)`. v0.18.1 placeholder doc 제거. 4 callsite (tests/multi_hop.rs) 갱신 + test rename `multi_hop_truncate_for_nli_preserves_hypothesis` → `multi_hop_truncate_for_nli_char_budget` (contract 정합).
- **H3** (kebab-rag/src/pipeline.rs:1041): `was_truncated` 가 tracing::debug! 으로 surface (observability 추가, signature 보존 — caller logging contract).
- **D** (kebab-mcp/tests/tools_call_ask_multi_hop.rs): request_timeout_secs 2 → 5 (slow CI 안정성), `mh_code` discriminator 제거. dispatch contract = `mh.is_error.unwrap_or(false)` (기존 assertion 으로 충분).
- **E** (tasks/HOTFIXES.md + pipeline.rs:1633-1638): fb-41 PR-9 closure entry 의 sibling 으로 "### PR-9 NLI refusal: terminal Synthesize hop omitted from hops trace" subsection 추가. pipeline 의 "cleanup deferred to a follow-up" → "// See tasks/HOTFIXES.md ... for follow-up" cross-link.

## Test-engineer 작업 (T1/T2/T3/T4, 9 new tests)

- **T1** (kebab-nli/src/onnx.rs::tests): sanitize_model_id 3 unit (replaces_slash / idempotent / leaves_other_chars).
- **T2** (kebab-rag/tests/multi_hop_nli_panic.rs 신규): 2 panic-path tests — facade invariant (`expect("verifier must be Some when nli_threshold > 0.0")`) 의 #[should_panic] + threshold=0 의 companion.
- **T3** (kebab-rag/tests/multi_hop_nli_stream.rs 신규): 2 StreamEvent::Final tests — refuse_nli_verification + refuse_nli_model_unavailable 의 stream_sink Final 분기 wire shape pinning.
- **T4** (kebab-app/tests/open_with_config_nli.rs 신규): 2 NLI failure path — model_dir 가 unwritable 일 때 App::open_with_config 의 Result<App> Err (with "OnnxNliVerifier" in chain) + threshold=0 일 때 graceful skip.

## System-architect 결론

3 lenses (absorption / duplication / under-engineered interface) 분석 결과 — *pre-cut nothing*. Top-3 items 모두 v0.18.1+ defer:
- Lens 1: kebab-normalize + kebab-parse-types 흡수 가능 (parse-md 만 사용, 5 parsers 우회) → v0.18.1+.
- Lens 3: Extractor + Chunker trait 의 dead polymorphism (모든 callsite 가 hardcoded) → v0.18.1+.
- Lens 1 bundled: kebab-source-fs 가 kebab-parse-code 의 9 tree-sitter grammars drag → low-risk dep-graph win, v0.18.1+ bundled.
- Defer-with-intent: LanguageModel async refactor (cloud-LLM 시), NliVerifier::score_batch + typed NliError (2nd impl 시), compute_stale → kebab-core::stale.

보고서: /build/cache/tmp/post-pr9-refactor-priorities.md, /build/cache/tmp/system-architecture-priorities.md (둘 다 repo 외 — analysis 보존).

## 검증

- cargo test -p kebab-nli -j 1 → 11/11 pass.
- cargo test -p kebab-rag -j 1 → 75/75 pass (5 NLI multi-hop + 4 신규 T2/T3 포함).
- cargo test -p kebab-app -j 1 → 23 pass + 2 ignored (T4 의 2 포함).
- cargo test -p kebab-mcp --test tools_call_ask_multi_hop -j 1 → 1 pass + 1 pre-existing flaky (HOTFIX #15, no_chunks short-circuit, executor D fix 와 무관 — line 86 의 base assertion 이 fixture 없어서 fail).
- cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.
- cargo test --workspace --no-fail-fast -j 1 → 1304 passed (+11 new) + 1 pre-existing flaky 동일.
- **Post-refactor dogfood retest byte-identical** (PR-9d / post-cleanup / post-refactor 3번 모두): S7 0.0035389824770390987, S1 0.058334656059741974, S10 0.0027875436935573816, S3 nli_model_unavailable.

docs/dogfood/v0.18.0/SUMMARY.md 에 "Post-architectural-refactor retest" section 추가.

Wire 영향: 없음.
Behavior 영향: 없음 (H1 의 config wiring 가 default 와 같은 model → byte-identical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 04:42:37 +00:00
3712d005cc chore(rag): post-cleanup dogfood retest — byte-identical 회귀 0
workspace-wide cleanup commit 직후 동일 dogfood S7/S1/S10/S3 재실행. NLI score byte-identical 확인:

- S7: nli_verification_failed, 0.0035389824770390987 (PR-9d 동일)
- S1: nli_verification_failed, 0.058334656059741974 (PR-9d 동일)
- S10: nli_verification_failed, 0.0027875436935573816 (PR-9d 동일)
- S3: nli_model_unavailable (PR-9d 동일, cleanup 무관 — v0.18.1 follow-up)

cleanup = mechanical refactor only. behavior 회귀 0. cut PR v0.18.0 진행 가능.

docs/dogfood/v0.18.0/SUMMARY.md 에 "Post-cleanup retest" section 추가.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 03:19:15 +00:00
7c85de065a chore: workspace-wide cleanup — clippy::pedantic baseline + auto-fix
cut PR v0.18.0 전 마지막 정리. 사용자 요청: "전체 코드베이스를 깔끔하고 알아보기 쉽게".

## Workspace lints

- `Cargo.toml` 의 `[workspace.lints.clippy]` 에 `pedantic = "warn"` (priority -1) + 의도적 allow-list 추가:
  - cast_possible_truncation / cast_possible_wrap / cast_sign_loss / cast_precision_loss — ONNX i64 / hash modular reduction 등 의도적 truncation.
  - doc_markdown / missing_errors_doc / missing_panics_doc — cosmetic doc style.
  - too_many_lines / module_name_repetitions / must_use_candidate / needless_pass_by_value / manual_let_else / items_after_statements / similar_names — informational only.
  - format_collect / match_wildcard_for_single_variants / trivially_copy_pass_by_ref / unnecessary_wraps — intentional patterns (exhaustive match, future Result variants 등).
  - default_trait_access — `Foo::default()` 가 idiomatic.
  - float_cmp — NLI / RRF score 의 explicit threshold 비교 의도.
  - struct_excessive_bools / case_sensitive_file_extension_comparisons / naive_bytecount / ignore_without_reason — domain-specific 의도.
  - format_push_string / return_self_not_must_use / match_same_arms — builder / wire-label / hot-path 패턴 보존.
  - needless_continue / used_underscore_binding / nonminimal_bool / unreadable_literal / many_single_char_names / doc_link_with_quotes / assigning_clones / collapsible_str_replace / trivial_regex / elidable_lifetime_names / range_plus_one / explicit_iter_loop / implicit_hasher / ref_option — remaining low-value style.
- 각 24 crate `Cargo.toml` 에 `[lints] workspace = true` 추가.

## Auto-fix

`cargo clippy --workspace --all-targets --fix` 적용 — 128 files changed, 552 insertions / 472 deletions. 주로:
- uninlined_format_args (~18): `format!("{}", x)` → `format!("{x}")`.
- redundant_closure_for_method_calls (~33): `.map(|x| x.foo())` → `.map(T::foo)`.
- 그 외 mechanical refactor.

## 검증

- `cargo clippy --workspace --all-targets -j 1 -- -D warnings` clean (pedantic + 모든 lint group).
- `cargo test --workspace --no-fail-fast -j 1` — **1293 tests pass + 1 pre-existing flaky fail** (`kebab-mcp::tools_call_ask_multi_hop::ask_tool_routes_multi_hop_true_to_decompose_first`, HOTFIX candidate, cleanup 무관). 회귀 0.

Wire 영향: 없음.
Behavior 영향: 없음 (mechanical refactor only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 03:01:58 +00:00
a0ccc7b021 Merge pull request 'feat(rag): fb-41 PR-9d — dogfood retest + HOTFIXES closure + corpus 보존' (#180) from feat/fb-41-pr-9d-dogfood-retest into main
Reviewed-on: #180
2026-05-26 02:01:56 +00:00
a8fd6994d2 chore(rag): PR-9d SUMMARY 의 latency 표 정정
PR-8 baseline 의 S1/S10 latency 추정값 (~150s, ~80s) 이 부정확. `results/s1-multihop.json` + `results/s10-multihop.json` 가 실제로 614s / 589s (`jq '.usage.latency_ms'` 측정) — *PR-8 시점 baseline 이 아닌 더 이전 timeline*. S7 만 `results/post-pr8/` 에 retest 보존되어 비교 의미 있음 (158s baseline → PR-9 241s with NLI first-run download).

SUMMARY.md 의 latency 표를 정정 — S1/S10 의 *동일 시점 baseline 부재* 명시 + S7 의 단일 비교만 의미 있음 caveat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 01:47:38 +00:00
505b3889fb feat(rag): fb-41 PR-9d — dogfood retest + HOTFIXES PR-9 closure + docs/dogfood/v0.18.0/ 보존
PR-9 의 진짜 작동 확인 — PR-1~PR-9c-2 머지 후 `/build/cache/dogfood-v018/` corpus 의 S7/S1/S3/S10 multi-hop retest.

핵심 결과: **S7 hallucination root cause 해결 확정**.
- PR-8 baseline: `grounded=true, refusal_reason=null`, **답변=Adam gradient 공식** (caffeine 질문에 무관 hallucination, silent).
- PR-9 retest: `refusal_reason=nli_verification_failed, nli_score=0.0035` (graceful refuse, NLI 가 entailment 0.35% 검출).

전체 비교 (4 case):
- S7  hallucination FIXED.
- S1  둘 다 reject, NLI 가 더 deterministic (0.058).
- S3 ⚠ consistent fail (`nli_model_unavailable`, 313s) — *v0.18.1 follow-up* (kebab-nli 의 특정 input 의존 fail, debug log emit 안 됨 → 진단 어려움).
- S10  둘 다 reject, NLI 가 더 deterministic (0.0028).

- docs/dogfood/v0.18.0/SUMMARY.md (sanitized 보고서) + s{1,3,7,10}-multihop-post-pr9.json (sample wire output, repo 보존).
- tasks/HOTFIXES.md 의 fb-41 PR-9 entry: "예정" → "완료 (2026-05-26)" + 비교 표 inline + S3 follow-up subsection (v0.18.1 candidate).

RAM: 5-6 GB → 7-8 GB (ONNX session ~600 MB), 16 GB 안전.
Disk: NLI model cache 1.1 GB (XDG default 또는 storage.model_dir).
Wire 영향: 없음 (PR-9c-1 의 schema 변경만 + 측정값 sample 보존).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 01:44:57 +00:00
772575d8f0 Merge pull request 'feat(rag): fb-41 PR-9c-2 — pipeline integration + mock test + SKILL.md (★ NLI 실 활성화)' (#179) from feat/fb-41-pr-9c-2-pipeline-integration into main
Reviewed-on: #179
2026-05-26 01:03:18 +00:00
00ffe9c792 feat(rag): fb-41 PR-9c-2 — pipeline integration + mock test + SKILL.md (★ NLI 실 활성화)
PR-9c-1 의 wire surface 위에 behavior 활성화 — `ask_multi_hop` 의 step 8.5 hook 가 `[rag] nli_threshold > 0` 일 때 NLI 검증 실 수행. **첫 user-visible behavior change** in PR-9.

- crates/kebab-rag/src/pipeline.rs:
  - ask_multi_hop step 8.5 NLI hook (empty answer 가드 + truncate_for_nli + verifier.score + verification field + refusal 분기).
  - refuse_nli_verification helper (verification: Some(...) + RefusalReason::NliVerificationFailed).
  - refuse_nli_model_unavailable helper (verification: None + RefusalReason::NliModelUnavailable).
  - truncate_for_nli helper (module-level pub fn, MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600 chars 의 chars-based budget, _hypothesis 미사용 placeholder — v0.18.1 token-budget 갱신 candidate).
  - PR-9c-1 의 #[allow(dead_code)] 두 곳 제거 (verifier field + with_verifier builder; doc 의 transitional sentence 도 정리). round-1 PR-9c-1 review N1 carry-forward closure.
- crates/kebab-app/src/app.rs:
  - App::open_with_config 의 NliVerifier construction — config.rag.nli_threshold > 0 → OnnxNliVerifier::new + Arc::new wrap + 후속 RagPipeline 초기화 시 with_verifier 호출. 실패 시 ? 전파 (시그니처 Result<Self> 그대로 — caller cascading 0).
  - kebab-app/Cargo.toml 에 kebab-nli path 의존 추가.
- crates/kebab-rag/tests/multi_hop.rs + tests/common/mod.rs:
  - MockNliVerifier (pass / fail / err 생성자 + score call_count instrumented).
  - multi_hop_nli_pass_keeps_grounded — entailment 0.9 → grounded=true, verification.nli_passed=true.
  - multi_hop_nli_fail_refuses — entailment 0.1 → refusal=NliVerificationFailed.
  - multi_hop_nli_disabled_skip_verify — threshold 0.0 → verify skip, verification=None.
  - multi_hop_nli_model_unavailable_refuses — verifier Err → refusal=NliModelUnavailable.
  - multi_hop_truncate_for_nli_preserves_hypothesis — long premise truncation + hypothesis 보전.
- integrations/claude-code/kebab/SKILL.md: mcp__kebab__ask 절에 NLI 안내 한 단락 (verification.nli_passed 의미 + threshold tuning + nli_verification_failed/nli_model_unavailable refusal handling).

검증: cargo test --workspace -j 1 — 5 신규 multi-hop pass + 회귀 0 (pre-existing kebab-mcp::tools_call_ask_multi_hop 동일 flaky). cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.
Wire 영향: PR-9c-1 의 schema 변경에 *behavior wiring* — answer.v1.verification field 가 multi-hop happy path + refuse path 양쪽에서 채움.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 00:55:02 +00:00