1125 Commits

Author SHA1 Message Date
2b4ba8e104 docs(spec): design for v0.20.2 dogfood findings + apply round-1 critic feedback
Designs the 8 todos found in the full v0.20.1 dogfood run (Ask response
language rag-v3 / doc.lang docs / bulk input / list title /
fusion_score·score_kind / schema index_version / Ollama hint) as a
single patch release.

Writer worker draft → incorporates the opus critic round-1 review:
- B1: top-level score changed from placeholder to final (score_kind declares its meaning, search.rs:95-99)
- M1: state explicitly that forcing the image caption language is out of scope
- M2: note that the config default test (lib.rs:1316) needs updating
- M3: full field list for bulk input (query/mode/k/trust_min/ingested_after/media/tag/lang)
- M4: cascade impact of rag-v3 on eval_runs.config_snapshot_json

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 20:37:42 +00:00
b08941d6ab docs(handoff): mark brainstorming-needed items in v0.20.1 findings todo
Adds a fix path classification to each todo:
- 🧠 required: needs a design / user expectation decision; use the brainstorming skill first
- 🔧 mechanical: spec drift or an obvious fix; no separate brainstorming needed
- 📝 mild discussion: the trade-off can be decided in the spec drafter's self-review

Classification:
- 🧠 required (2): #1 Ask English→Korean response policy, #4 doc.lang semantics
- 🔧 mechanical (4): #2 bulk schema, #5/#6 docs sync, #7 schema rename
- 📝 mild (2): #3 list title, #8 Ollama default

Additional brainstorming candidates (beyond the direct findings):
- BS-A: HTML corpus support (1415 files skipped)
- BS-B: Tier 1/2/3 chunker UX visibility
- BS-C: kebab dogfood subcommand (automation)
- BS-D: efficiency of tokenized_korean_text for English code chunks
- BS-E: expose the builtin_blacklist specification

Recommended workflow:
1. brainstorming stage first (#1, #4 + review each BS-x item)
2. mechanical batch (#2, #5, #6, #7) in a single PR
3. mild discussion batch (#3, #8)
4. dogfood retest → v0.20.2 patch release
2026-05-28 19:45:53 +00:00
6bf4e82e62 docs(handoff): v0.20.1 full dogfood findings todo for next session
Collects the findings from the post-merge full dogfood of v0.20.1 (the
user's real corpus of 6293 files, a 3.5-hour ingest, scenarios §1~§11)
into a self-contained todo handoff for a new session.

P0 (bug / behavior differs from intent):
- #1 Ask with an English query → Korean response (forced by the rag-v2 prompt template)
- #2 bulk search input format is unclear (not specified in the wire schema)
- #3 duplicate titles in list docs (heading-based; needs doc_path as a supplement)
- #4 doc.lang = und for 53% (lang detection fails on code files)

P1 (docs drift):
- #5 location of fusion_score (.retrieval.fusion_score)
- #6 meaning of score_kind="bm25" (the fusion_score in lexical mode)
- #7 confusion between schema index_version and lexical_index_version

P2 (setup):
- #8 Ollama endpoint defaults to localhost (the user's environment is remote)

Each todo lists severity, scenario, suspected location, and action item.
Also includes the command to start the new session, a branch
recommendation, the dogfood re-run procedure, and a cumulative findings
table.

Repo state: main HEAD=a0c7fa3, clean. v0.20.1 binary OK. /build/dogfood/
KB (3940 docs, 34896 chunks) preserved for regression test.
2026-05-28 19:40:10 +00:00
a0c7fa3d1a docs(claude): consolidate the dogfood store into /build/dogfood/ + state the classification policy
Updates the dogfood data layout following the user's correction:

1. Location: /build/cache/dogfood/ → /build/dogfood/. /build/cache is
   semantically a cache (regeneratable downloads/models), not test data.
   /build/dogfood was created with sudo + chown.

2. Classification policy: no kebab version / creation time / scenario
   name prefixes (do not create directories such as v0.20.1-dogfood/ or
   dogfood-v018/). All classification is by document meaning / kind /
   format only. See /build/dogfood/README.md for the detailed layout.

3. Single-directory policy: source documents + KB state + logs all live
   under /build/dogfood/. Each dogfood run resets only kb/; no separate
   directory is created.

4. Forbidden locations: no new use of /tmp/kebab-*,
   /build/cache/dogfood*, /home/altair823/KnowledgeBase, or XDG paths.

Source dirs cleanup (done as separate work outside this commit):
- /build/cache/dogfood{,-p10b,-v017,-v018,-v0.19.0} all deleted (after the move).
- /home/altair823/KnowledgeBase and kebab-dogfooding also moved to /build/dogfood/.
- XDG paths snapshotted to /build/dogfood/_archive/xdg-state/.

Final corpus: 6293 files (markdown/code/html/manifest/resources), 554M.
2026-05-28 14:44:23 +00:00
ebc6bf45c4 Merge pull request 'feat(v0.20.1): Korean morphological tokenizer (V009) + N-gram supplement + eager backfill' (#191) from feat/korean-morphological-tokenizer into main
v0.20.1: V009 Korean morphological tokenizer + N-gram supplement (Bug #8 closure).

22 implementation commits + 2 docs polish commits (dogfood data consolidation + CLAUDE.md trigger policy).

Dogfood evidence: KnowledgeBase backfill of 1781 docs / 9050 chunks in 26.6 s; '한국' query 0 → 10 hits.

PR-level review verdict: APPROVED WITH NOTES (4 minor docs polish, all addressed).
2026-05-28 14:17:13 +00:00
d8fdc815be docs(claude): dogfood trigger + consolidated data store policy
User request: after the user collected the accumulated ad-hoc dogfood
data into the single location /build/cache/dogfood/, infer when
dogfooding is needed and add a policy section to CLAUDE.md.

New section `## Dogfood trigger` (between Release and Naming):
- When dogfooding is needed (6 trigger categories: schema/migration, wire
  schema/CLI, search/RAG, performance, language/locale, file/asset).
- Release-level: evidence must be stated before the bump commit.
- Dogfood data store: directory structure of /build/cache/dogfood/ +
  README.md cross-link + no new use of /tmp/kebab-*.
- Recording dogfood results: HOTFIXES dated entry + a 4-paragraph
  write-up in the release notes draft + DOGFOOD.md scenario catalog cascade.

Actual work:
- Moved (mv) the 5 /build/cache/tmp/v0.20.1-* directories, the 2
  /tmp/dogfood-* directories, and the related log files into
  /build/cache/dogfood/. Hard-coded paths in config.toml were replaced
  automatically with sed.
- New /build/cache/dogfood/README.md: directory structure + procedure
  for starting a new scenario + V007 simulation pattern + cleanup policy.

Expected effect: beyond the git-tracked HOTFIXES and draft release notes
that hold the dogfood evidence, the raw data can be freely reused from
one place. Dogfooding a new release can be checked incrementally on top
of the previous KB.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9 (도그푸딩 evidence cascade)
Plan: post-implementation infrastructure
2026-05-28 14:06:24 +00:00
9f2a56d091 docs(hotfixes): large-scale KnowledgeBase dogfood evidence (N-gram supplement)
Re-ran the backfill on the user's real /home/altair823/KnowledgeBase/
(1781 markdown / 9050 chunks) with the v0.20.1 binary including the
N-gram supplement:

- Backfill duration: 26.6 s (9050 chunks, OnceLock cache + 1000-row
  batch transaction). ~3 ms/chunk amortized.
- '한국' query: 0 hits on V007 → 10 hits on V009 + N-gram (measured
  verification of the Bug #8 functional closure).
- '한국어' query: 5 → 10 hits (morpheme and N-gram both match).
- English whole-token: 'token'/'pipeline'/'config' = 10 hits each
  (as expected under the V009 regression).

Snippet evidence: the "문서를 한국어로 다시 정리하기" pattern in the KB's
testdata/coding-md-corpus/*/...md demonstrates the '한국' query matching
via ko-dic segmentation + the N-gram window.

The 0 hits for other Korean terms (서울, 지하철, 대한민국, etc.) are because
the words themselves are absent from the KB corpus: a data limitation,
not a V009 implementation limitation.

Test data locations:
- /home/altair823/KnowledgeBase/ (the user's real KB, 1781 markdown)
- /build/cache/tmp/v0.20.1-dogfood/kb/ (ingested SQLite + LanceDB)
- /build/cache/tmp/v0.20.1-dogfood2/corpus/ (Korean wiki fixture)
- /build/cache/tmp/v0.20.1-v007strict/corpus/no-space.md (whitespace-less)
- /build/cache/tmp/v0.20.1-ngram/corpus/extra.md (대한민국, 한국정부, 주민등록번호)

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9 + Appendix B
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (dogfood evidence final)
2026-05-28 14:02:02 +00:00
fe20be8195 feat(chunk): N-gram supplement (Option β) — sub-token emit for Korean compounds
#4 (user request): promotes Option β of spec §6.2 (emit additional
sub-tokens) from a v0.21.x P9 follow-up to the v0.20.1 implementation.
Resolves the ko-dic compound noun limitation seen in dogfood (the
single-token policy for `대한민국`, `한국정부`, `주민등록번호`, etc.).

Implementation (`crates/kebab-chunk/src/lib.rs::tokenize_korean_morphological`):
- New helper `is_hangul()`: tests for Hangul syllables (U+AC00..D7A3) +
  jamo (U+1100..11FF, U+3130..318F).
- For each morpheme in the lindera output, if it is Hangul-only and has
  length ≥ 3, additionally emit sliding-window 2-grams. The token list
  expands to the form `[한국정부, 한국, 국정, 정부]`.
- English / numeric / mixed tokens get no supplement (to avoid false positives).
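
The supplement rule above can be sketched roughly as follows. This is a
minimal illustration, not the code in crates/kebab-chunk/src/lib.rs: the
function name `expand_with_bigrams` and its signature are hypothetical, and
the morphemes are passed in as plain strings instead of coming from lindera.

```rust
/// True for Hangul syllables (U+AC00..=U+D7A3) and jamo
/// (U+1100..=U+11FF, U+3130..=U+318F), the ranges named in the commit.
fn is_hangul(c: char) -> bool {
    matches!(
        c,
        '\u{AC00}'..='\u{D7A3}' | '\u{1100}'..='\u{11FF}' | '\u{3130}'..='\u{318F}'
    )
}

/// Hypothetical helper: keeps every morpheme and, for Hangul-only
/// morphemes of 3 or more characters, appends sliding-window 2-grams.
fn expand_with_bigrams(morphemes: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    for m in morphemes {
        out.push(m.to_string());
        // Count characters, not bytes: a Hangul syllable is 3 bytes in UTF-8.
        let chars: Vec<char> = m.chars().collect();
        if chars.len() >= 3 && chars.iter().all(|&c| is_hangul(c)) {
            for window in chars.windows(2) {
                out.push(window.iter().collect());
            }
        }
    }
    out
}
```

Under these assumptions, `expand_with_bigrams(&["한국정부"])` yields the
`[한국정부, 한국, 국정, 정부]` form quoted above, while a 2-character morpheme
such as `서울` or an English token such as `Rust` passes through unchanged.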

Tests (`crates/kebab-chunk/tests/tokenize_korean.rs`):
- `tokenize_korean_morphological_emits_2gram_for_long_morpheme`: confirms
  the cases among the 5 probe fixtures where the supplement fires
  (measured: `서울특별시` → `[서울, 특별시, 특별, 별시]`, `대한민국` →
  `[대한민국, 대한, 한민, 민국]`).
- `tokenize_korean_morphological_no_2gram_for_english`: guarantees that no
  English substrings (`Rus`, `ust`, `imi`) are emitted for the Rust
  optimization fixture.

Dogfood evidence (added to the `tasks/HOTFIXES.md` 2026-05-28 entry):
- Queries '대한', '한민', '민국' all hit (sliding window over 대한민국).
- Sub-token queries such as '특별', '주민', '등록' hit.
- The English 'tokenizer' query gets 0 hits because it is absent from the corpus (no supplement).
- Trade-off: DB size +20-30% (Korean-heavy corpora), small risk of false positives.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 (Option β promote)
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (post-implementation enhancement)
2026-05-28 13:48:05 +00:00
028d9ad4ea docs(release): v0.20.1 release notes draft + spec/plan dogfood cross-link
#1 (user request): write the release notes draft + strengthen the dogfood
evidence cross-links in the spec/plan.

docs/release-notes/v0.20.1-draft.md (new):
- 4-paragraph body (support for 2-character Korean queries + English
  substring regression + automatic V007→V009 backfill + ingest performance impact).
- Migration cascade table (lexical_index_version, corpus_revision,
  wire schema shape preservation).
- API + dependency changes (lindera v3, lindera-ko-dic v3, retired
  short_query_hint helper, new facade APIs).
- Breaking changes stated (English substring regression, first-boot
  latency, increased DB/binary size).
- Upgrade procedure + known limitations + reference to the 14 dogfood scenarios.

spec Appendix B (segmentation evidence):
- "Empirical verification (2026-05-28 dogfood — post-merge update)"
  subsection (new). Table of prior-knowledge assumptions vs measured
  results. Scenarios 1-4 all marked verified. States the evidence for
  ko-dic segmenting '서울특별시' → '[서울, 특별시]'.

plan Changelog:
- post-implementation entry: 22 commit on branch, S3 blockers, S7
  cascade, S11 sanity regression updates, opus PR review 4 finding
  fixes.
- dogfood evidence entry: 14 scenarios pass verification, ko-dic
  segmentation evidence, HOTFIXES + spec Appendix B cross-link.

Spec: …spec…md Appendix B
Plan: …plan…md (post-implementation + dogfood evidence Changelog)
Release notes: docs/release-notes/v0.20.1-draft.md
2026-05-28 13:34:33 +00:00
a3513c9110 docs(hotfixes): V009 dogfood verification evidence (2026-05-28)
Adds the dogfood verification results for the V009 Korean morphological
tokenizer to the HOTFIXES 2026-05-28 entry. States the hit counts for the
14 scenarios + evidence of ko-dic compound noun segmentation (서울특별시 →
[서울, 특별시]) + the known limitation of accepting Option α.

Reference corpus: korea-overview.md + korea-compound.md from DOGFOOD.md
§2.1bis (10 KB total, 2 markdown files). KB ingest + verification of the
14 queries all matched expectations.

Separates this from the reference fixture evidence to make clear that 0
Korean lexical hits are normal on an English/code-centric KB such as the
user's KnowledgeBase, so users are not misled.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11 + dogfood evidence)
2026-05-28 13:24:29 +00:00
f2a76cfe94 docs(dogfood): V009 morphological tokenizer scenarios + fixture evidence
Reflects the fixtures + scenarios of the v0.20.1 dogfood verification in
DOGFOOD.md / SMOKE.md. States that 0 Korean hits are normal (tokens
absent) on an English/code-centric KB such as the user's KnowledgeBase,
and provides inline the reference fixtures (korea-overview.md +
korea-compound.md) for verifying ko-dic morpheme segmentation.

DOGFOOD.md §2.1 updates:
- description: trigram → unicode61 + morpheme column.
- scenarios: expanded to 7 cases, including Korean 2-char (한국, 서울) +
  compound noun (서울특별시) + English whole-token regression + 1-char filter.
- New §2.1bis: V009 dogfood evidence reference corpus + verification
  commands + expected snippets (lindera segmentation evidence) + known
  limitation (ko-dic single-token policy for compounds, Option α acceptance).

SMOKE.md 'V009 morphological 검색' (V009 morphological search) updates:
- Removed the trigram-era hint advisory + the scenario recommending keywords of 3+ characters.
- Replaced with the v0.20.1 scenarios: 2-char Korean / compound noun /
  1-char filter / English whole-token regression.

Reference fixtures (measured, verification passed):
- korea-overview.md: '한국' / '서울' / '지하철' all hit.
- korea-compound.md: '한국어' / '한국문화' / '서울특별시' compounds hit.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (dogfood evidence)
2026-05-28 13:22:43 +00:00
8c56ef3010 docs(superpowers): v0.20.x C Korean morphological tokenizer spec + plan artifacts
This commit archives the 5 artifact files from the 4-stage workflow of
the v0.20.x C task (Bug #8: 2-character Korean queries get 0 hits):

- spec.md (668 lines, status=accepted): comparison of Options A/B/C +
  decision for lindera Path A (accepting the English substring regression)
  + 12 sections + 4 appendices (B segmentation evidence, C cost evidence,
  D license evidence).
- spec-critic-r1.md: 3 critical + 6 major findings (NEEDS_REWRITE).
- spec-critic-r2.md: traceability matrix after the r1c rewrite (ACCEPT).
- plan.md (750 lines, status=accepted): 11 steps + dependencies +
  cost optimization routing + 9 closure micro-patches applied.
- plan-closure-r1.md: traceability matrix + origin of the 9 MPs.

These artifacts are a frozen reference once the implementation is merged.
For later deviations, tasks/HOTFIXES.md is the source of truth.

Workflow stage:
1. spec drafter (omc team writer, opus)
2. spec critic R1 (omc team critic, opus) — NEEDS_REWRITE
3. spec rewriter r1c (omc team writer, opus) — 7 item fix
4. spec closure R2 (in-process verifier, sonnet) — ACCEPT
5. plan drafter (omc team planner, opus)
6. plan closure (in-process verifier, sonnet) — ACCEPT + 9 MP
7. subagent-driven-development implementation (11 step + 5 follow-up
   + 1 docs polish = 17 commit)
8. PR-level final code review (in-process code-reviewer, opus) —
   Approved with notes (4 minor docs finding, merge as-is)

Branch: feat/korean-morphological-tokenizer
Version: 0.20.1
2026-05-28 12:53:31 +00:00
5d9ea588ed docs(v0.20.1): polish PR-review findings (README/HOTFIXES/schema/SKILL)
Mechanical fixes for the 4 minor findings from the opus PR-level final
review (Approved with notes):

1. README.md: the English substring matching wording in the `kebab search`
   row was still from the V007 era. Now states honestly the V009
   whole-token regression (substring → V002 behavior) + recommends
   vector/hybrid mode.
2. tasks/HOTFIXES.md: corrected file paths in the 2026-05-28 entry.
   lexical.rs does not call lindera; it only updates MIN_QUERY_CHARS 3→2
   in build_match_string. The actual owner of the lindera helper is
   kebab-chunk/src/lib.rs. ingest.rs is outside this PR's scope; the eager
   backfill hook lives in kebab-app/src/app.rs::App::open_with_config.
3. docs/wire-schema/v1/search_response.schema.json: the `hint` field
   description still carried the advisory signature from the V007 trigram
   3-char minimum era. Now states that in v0.20.1 the helper is retired
   and the field is always omitted (only the field is kept in the schema
   for forward compatibility).
4. integrations/claude-code/kebab/SKILL.md: resolved the
   self-contradiction in the `hint` field description ("present only with
   trigram in edge cases" vs "Korean 2-char now supported"). States that
   it is retired and may be reused.

PR-level reviewer recommendation: "Merge as-is, not a blocking issue (all
findings minor)". This commit adopts the reviewer's option 1 (a separate
docs hotfix commit).

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (PR-level finding follow-up)
2026-05-28 12:53:00 +00:00
53ec9b4dc5 test(chunk): regenerate AST + long-section snapshots for V009 chunk field
The S3 update to the Chunk struct (adding the tokenized_korean_text:
Option<String> field in kebab-core) changes the serde-serialized output
of every chunk snapshot JSON. Regenerates the baselines of the 10
snapshot fixtures (9 AST chunkers + markdown long-section) in V009 form.

The change in each snapshot is the added `"tokenized_korean_text": null`
field in every chunk JSON (most fixtures are English code, so lindera
falls back to None). No behavior change, only a cascade of the serde
representation.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up via S11 sanity)
2026-05-28 12:27:37 +00:00
21b52bc285 style: cargo fmt --all (S3+S4+S5+S7 follow-up)
Formatting cleanup for the V009 morphological tokenizer work (S3 chunk +
S4 backfill + S5 short_query_hint removal + S7 new tests). No behavior change.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11)
2026-05-28 12:06:01 +00:00
97fd895a10 chore(release): bump version 0.20.0 → 0.20.1
Both triggers in CLAUDE.md §Release / binary version bump are hit:
- User dogfooding needed (Bug #8 fix for 2-character Korean queries:
  verify searches for '한국', '서울', '지하철').
- Frozen design contract changed (§5.5 chunks_fts: unicode61 + CASE
  expression triggers + tokenized_korean_text column).

Besides the V009 + lindera ko-dic morphological analyzer integration, the
v0.20.x logging round 2 enhancement (PR #190) belongs to the same v0.20.x
series and is cut together in the v0.20.1 patch release.

Build verification: ./target/release/kebab --version → kebab 0.20.1.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §12.1
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S10)
2026-05-28 12:05:31 +00:00
d13eb87401 docs(v0.20.x): sync README + HANDOFF + ARCH + SKILL + HOTFIXES for V009
Cascades the user-visible surface changes of the V009 Korean
morphological tokenizer + the release notes scope across 5 docs.

- README.md: the kebab search command row now states support for 2-character Korean queries.
- integrations/claude-code/kebab/SKILL.md: removed the V007 3-char hint +
  one line on V009 support for 2-character Korean queries.
- HANDOFF.md: flipped the C task status to done + added this change to
  the v0.20.1 release notes scope + a summary row for post-merge findings.
- docs/ARCHITECTURE.md: embedding upgrade (e5-small → e5-large),
  lindera-ko-dic FTS5 Korean support, added version notes.
- tasks/HOTFIXES.md: 2026-05-28 entry: Bug #8 resolved by V009,
  lindera-ko-dic is the actual crate name (spec deviation), cargo-deny
  deferred, Path A English substring regression stated.

Spec: tasks/p9/p9-9-v0.20.x-korean-morphological-tokenizer-spec.md §7.4
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-28 11:55:25 +00:00
26f3a7756c test(eval): regenerate runner_per_query_snapshot for V009 baseline
An intended cascade of the BM25 distribution + hit ordering changes
caused by the V009 FTS5 tokenizer (trigram → unicode61 + morphemes).
Regenerates the eval runner's per-query snapshot (run-1.json) as the V009 baseline.

Regenerate procedure: UPDATE_SNAPSHOTS=1 cargo test -p kebab-eval
--test runner runner_per_query_snapshot_matches_fixture -j 4.
Zero regressions: all tests in the kebab-eval workspace (runner / loader /
metrics_and_compare) pass.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S8)
2026-05-28 11:51:49 +00:00
881f949fcb test(search): regenerate lexical_snapshot_run_1 for V009 BM25 distribution
With the V009 FTS5 tokenizer updated from trigram → unicode61 + the
tokenized_korean_text column, the BM25 IDF calculation changes, so the
floating-point fusion_score values differ slightly (hit order, snippets,
and structure are identical). Regenerates the snapshot, whose update was
missed right after the S1 merge, as the V009 baseline.

No regressions: `cargo test -p kebab-search --test lexical lexical_snapshot_run_1 -j 4`
exit 0.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1 follow-up via S7 reviewer)
2026-05-28 11:49:15 +00:00
c5de5f812b test(fts,app): V009 morphological tokenizer integration tests
Adds 4 new tests:

- crates/kebab-store-sqlite/tests/fts.rs:
  - fts_v009_korean_morphological_2char_query_hits: the 2-char query
    '한국' hits a chunk whose tokenized_korean_text column is populated.
  - fts_v009_english_whole_token_only: regression of V007 trigram
    substring matching (Path A): the 'token' query gets 0 hits on a 'tokenizer' chunk.
- crates/kebab-app/tests/search_korean.rs:
  - korean_morphological_2char_query_lexical_mode: end-to-end: ingest the
    Korean wiki fixture → '한국' / '서울' queries hit.
  - korean_morphological_mixed_english_korean_query: 'Rust' English
    whole-token + '최적화' Korean morpheme hit.

crates/kebab-search/src/lexical.rs:
  - build_match_string(): MIN_TRIGRAM_CHARS(3) → MIN_QUERY_CHARS(2).
    V009 unicode61 has no minimum token length, so 2-character Korean
    morpheme queries must pass through. A lone single character is still filtered.
  - Updated 2 related unit tests to the V009 behavior.
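
As a rough illustration of the MIN_QUERY_CHARS(2) threshold, the filter can
be sketched as below. Only the constant name and its value come from this
commit; `keep_query_tokens` is a hypothetical stand-in, and the real
build_match_string() also builds the FTS5 MATCH string, which this sketch
does not attempt.

```rust
/// Minimum query token length, per the commit (was 3 under V007 trigram).
const MIN_QUERY_CHARS: usize = 2;

/// Hypothetical helper: splits a query on whitespace and drops tokens
/// shorter than MIN_QUERY_CHARS.
fn keep_query_tokens(query: &str) -> Vec<&str> {
    query
        .split_whitespace()
        // Compare character counts, not byte lengths: '한국' is 2
        // characters but 6 bytes in UTF-8.
        .filter(|token| token.chars().count() >= MIN_QUERY_CHARS)
        .collect()
}
```

With this sketch, a query such as "한국 a Rust 최" keeps `한국` and `Rust` and
drops the two single-character tokens.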

The fixture text depends on the actual segmentation behavior of lindera
ko-dic (predicted from prior knowledge in spec Appendix B). The fixtures
may be adjusted after measurement.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9.1, §9.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:38:52 +00:00
f94e0c4a9b feat(app): bump lexical_index_version to V009 (fts5-v009-korean-morphological)
The V009 FTS5 tokenizer is updated from trigram → unicode61 + a Korean
morpheme segmentation column. Adds the `fts5-v009-korean-morphological`
suffix to the lexical_index_version format to distinguish it from the
V007 baseline. The eval runner's config_snapshot and search cache
invalidation pick this up automatically.

Old format: lex:{chunker_version}
New format: lex:{chunker_version}:fts5-v009-korean-morphological
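
A minimal sketch of the format change, assuming the version is assembled
with a plain format string. The function names and the example chunker
version "md-heading-v1" are hypothetical; only the two format shapes come
from this commit.

```rust
/// Suffix introduced by this commit to mark the V009 FTS5 layout.
const FTS_SUFFIX: &str = "fts5-v009-korean-morphological";

/// Hypothetical: the pre-V009 shape, `lex:{chunker_version}`.
fn lexical_index_version_v007(chunker_version: &str) -> String {
    format!("lex:{}", chunker_version)
}

/// Hypothetical: the V009 shape, `lex:{chunker_version}:{suffix}`.
fn lexical_index_version_v009(chunker_version: &str) -> String {
    format!("lex:{}:{}", chunker_version, FTS_SUFFIX)
}
```

Because the two strings differ for any chunker version, anything keyed on
the version string (such as a cache or a config snapshot) would see a V009
index as distinct from a V007 one.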

No change to the wire schema shape (only the string content of
SearchHit.index_version changes). The
lexical_index_version_is_returned_unchanged test uses an arbitrary
IndexVersion string, so it is unchanged.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.1, §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S6)
2026-05-28 11:23:13 +00:00
923b959610 refactor(app): retire short_query_hint helper, keep wire field as None
With the V009 unicode61 + morphological tokenizer, 2-char Korean queries
can now hit, so the V007-era "3 or more characters recommended" hint is
obsolete. The SearchResponse.hint field stays in the struct to preserve
the wire schema and is always None.

- kebab-app/src/app.rs: deleted the short_query_hint function +
  doc-comment. The 2 call sites now set hint = None.
- kebab-app/src/lib.rs: removed short_query_hint from the re-exports.
- kebab-tui/{app.rs,search.rs,run.rs}: removed the short_query_hint
  field + the cascade of 4 calls.
- kebab-cli/tests/wire_search_response.rs:
  deleted the search_plain_emits_short_query_hint_to_stderr test.
  Replaced search_json_emits_hint_field_for_short_query →
  search_json_hint_absent_for_short_query_v009
  (verifies hint is always None).
- kebab-search/src/lexical.rs::build_match_string: the V007 trigram
  multi-token OR-combine branch is redundant under V009 but is kept
  (for future extensibility); added a 1-line doc-comment.

No change to the wire schema shape (the hint field at
search_response.schema.json:33 is kept and always set to None in the struct).

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §7.2, §7.3, §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:13:45 +00:00
b63af20b72 feat(app): first-boot eager backfill for tokenized_korean_text
After a V007 → V009 upgrade, tokenized_korean_text is NULL for existing
chunks. On the first App::open_with_config call they are automatically
segmented with lindera ko-dic and UPDATEd. The chunks_au trigger
re-indexes chunks_fts automatically. Users do not need to re-ingest.

- crates/kebab-store-sqlite/src/store.rs:
  backfill_tokenized_korean_text(progress_cb, tokenize) API. Commits +
  invokes the progress callback every 1000 rows. Idempotent (the IS NULL
  filter makes re-running after partial completion safe). Takes the
  tokenizer as a parameter to keep the §8 dependency boundary.
- crates/kebab-app/src/app.rs::open_with_config: calls the backfill right
  after run_migrations. On failure only a warn log is emitted (App open
  still succeeds, so vector/hybrid mode remains usable). Logs progress at
  info level every 500 rows.
- crates/kebab-store-sqlite/tests/fts.rs:
  backfill_tokenized_korean_text_populates_nullable_rows unit test
  (including idempotency).
- Fixed pre-existing clippy errors (redundant_closure, map_unwrap_or,
  cast_lossless, uninlined_format_args in kebab-app/ingest_log.rs,
  pdf_ocr_apply.rs, app.rs, tests/ocr_inspect_smoke.rs).
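
The batching and idempotency idea can be sketched without a database. This
is an in-memory stand-in under stated assumptions, not the store.rs
implementation: `ChunkRow`, `backfill`, and the callback shapes are
hypothetical, a `None` field plays the role of SQL NULL, and the per-batch
commit is reduced to a progress call.

```rust
/// Hypothetical in-memory row; `None` stands in for a NULL column.
struct ChunkRow {
    text: String,
    tokenized_korean_text: Option<String>,
}

/// Fills only rows whose field is still `None` (the IS NULL filter),
/// reporting the cumulative count once per full batch and once at the end.
/// The tokenizer is injected so this layer needs no tokenizer dependency.
fn backfill<T, P>(rows: &mut [ChunkRow], batch_size: usize, tokenize: T, mut progress: P) -> usize
where
    T: Fn(&str) -> Option<String>,
    P: FnMut(usize),
{
    let mut updated = 0;
    let mut in_batch = 0;
    for row in rows.iter_mut().filter(|r| r.tokenized_korean_text.is_none()) {
        if let Some(tokens) = tokenize(&row.text) {
            row.tokenized_korean_text = Some(tokens);
            updated += 1;
            in_batch += 1;
            if in_batch == batch_size {
                progress(updated); // a real store would commit here
                in_batch = 0;
            }
        }
    }
    if in_batch > 0 {
        progress(updated); // flush the final partial batch
    }
    updated
}
```

Running the sketch a second time over the same rows updates nothing, which
is the property that makes a re-run after partial completion safe.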

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §8.1, §8.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:01:00 +00:00
e8f44a57e3 fix(test): update first_ingest_bumps_corpus_revision baseline for V009
V004 seeds corpus_revision=0, V009 migration bumps to 1 (spec §5.2 —
LRU cache invalidation). Test previously asserted fresh store = 0;
now reads post-migration baseline dynamically and verifies that the
ingest commit increments past it.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:47:09 +00:00
4b4a8cbb3a fix(test): retire V007 trigram-only fts tests, add V009 unicode61 sanity
The 3 tests that verified substring matching of the V007 trigram
tokenizer are obsolete, because V009 unicode61 introduces an intended
regression (spec §3 Non-Goals Path A):

- fts_trigram_korean_3char_substring_hits: the '발생한' → '발생한다' hit
  relies on trigram substring matching, so it fails under V009 whole-token matching.
- fts_trigram_korean_short_query_zero_hit_pinned: the 0-hit behavior of
  2-char Korean queries is resolved by the V009 morpheme column, so this
  pin itself is obsolete (S7 replaces it with a new 2-char hit test).
- fts_trigram_english_substring_hits: the 'token' → 'tokenizer' hit
  fails under V009 unicode61 whole-token-only matching.

Newly added:
- fts_v009_unicode61_space_separated_korean_token_hits: sanity check of
  V009 unicode61 whole-token matching (token '충돌은' hits, substring
  '발생한' gets 0 hits). A baseline separate from the morphological
  verification tests that S7 will add.

S7 (plan §2 Step 7) will add v009_korean_morphological_2char_query_hits +
v009_english_whole_token_only to pin both the regression and the new
behavior.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §3, §9.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:36:32 +00:00
4dc1c10be1 fix(test): update corpus_revision baseline to post-V009 migration
V009 migration bumps corpus_revision by 1 at apply time (spec §5.2 —
invalidates pre-V009 LRU search cache). Existing tests assumed V004
seed (0) was the final baseline; updated to expect 1 after migration:

- fresh_store_starts_at_zero → fresh_store_starts_at_post_migration_baseline
- bump_increments_monotonically: expected 1,2,3 → 2,3,4 (post-baseline)
- revision_persists_across_reopen: 2 → 3 (V009 +1, +bump,+bump)

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:33:50 +00:00
bd86f61c9c fix(chunk): close S3 reviewer blockers — get_chunk read + AST chunker cascade
The S3 spec compliance reviewer (sonnet) found 2 blockers:

1. crates/kebab-store-sqlite/src/documents.rs: the get_chunk SELECT did
   not read the tokenized_korean_text column, so the DB value was lost on
   read. Added it to the SELECT column list + a row.get index in the
   row → Chunk conversion. Cascades to the ChunkRow struct +
   chunk_row_from_sql + Chunk construction in get_chunk.

2. crates/kebab-chunk/src/code_*_ast_v1.rs (9 files): make_chunk
   hard-coded tokenized_korean_text: None, so code files with Korean
   comments got no FTS hits. Now calls
   tokenize_korean_morphological(text) in the same pattern as tier2_shared.

This commit is a rework of S3, made as a separate commit rather than an
amend (keeps the S3 boundary). Satisfies the spec §6.2 invariant ("every
chunker calls tokenize right before emitting a chunk").

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 rework)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:30:53 +00:00
b134ae9dd5 feat(chunk): integrate lindera korean morphological tokenizer
Segments text with lindera ko-dic into the morpheme sequence that goes
into the V009 tokenized_korean_text column. Called right after chunk
creation in the chunk builder pipeline → pre-fills the field on the chunk
struct → the store's put_chunks INSERTs it within a single transaction.

- crates/kebab-core/src/chunk.rs: added the
  tokenized_korean_text: Option<String> field to the Chunk struct (#[serde(default)]).
- crates/kebab-chunk/src/lib.rs: tokenize_korean_morphological()
  helper + OnceLock caching + fallback (None) policy.
- crates/kebab-chunk/Cargo.toml: added lindera features = ["embed-ko-dic"]
  (needed to enable DictionaryKind::KoDic).
- All chunkers (tier2_shared, md_heading_v1, pdf_page_v1, the 9 code AST
  v1 chunkers): pre-fill tokenized_korean_text in the Chunk literal.
- crates/kebab-store-sqlite/src/documents.rs::put_chunks: updated the
  INSERT SQL column list + placeholders + bindings (12th column).
- crates/kebab-chunk/tests/tokenize_korean.rs: 2 unit tests.

lindera 3.0.7 API corrections: load_dictionary_from_kind →
load_embedded_dictionary, Token.text → Token.surface.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:22:15 +00:00
597d8b70ad feat(deps): add lindera + lindera-ko-dic for korean morphological tokenizer
Adds only the workspace dependencies; actual use comes in S3 with the
kebab-chunk tokenize_korean_morphological() helper.

- Cargo.toml (workspace): added lindera = "3", lindera-ko-dic = "3".
- crates/kebab-chunk/Cargo.toml: per-crate dep (enables the embedded
  KO-DIC dictionary blob via the embed-ko-dic feature on lindera-ko-dic).
- crates/kebab-app/Cargo.toml: added fts_korean_morphological to [features]
  (spec §6.3 Option A: marker role only, no disable path).

License: lindera = MIT, lindera-ko-dic = MIT
(confirmed with cargo info). Adopting cargo deny is a P9 follow-up.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.1, §10.1
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:03:58 +00:00
b106120e93 feat(fts): add V009 korean morphological tokenizer migration
Adds the V009 migration to resolve the V007 trigram tokenizer's
limitation of 0 hits for 2-character Korean queries (Bug #8). The approach
reverts to the unicode61 tokenizer and pre-fills the Korean morpheme
segmentation result into a separate column, `tokenized_korean_text`.

- migrations/V009__fts_korean_morphological.sql (new): column ADD,
  chunks_fts DROP + redefinition, CASE expressions in 3 triggers,
  backfill INSERT, corpus_revision bump.
- design §5.5 updated: trigram → unicode61 + morpheme column. CASE
  expression trigger bodies.
- crates/kebab-store-sqlite/tests/fts.rs: renamed the V007 verbatim test
  to the V009 source of truth. Added the v009_bumps_corpus_revision unit
  test.
- store.rs: fixed existing clippy bool_to_int_with_if + cast_lossless
  warnings (pdf_ocr_events-related code, found during the S1 work).

English substring matching regresses to V002 (whole-token only); this is
stated honestly in spec §3 Non-Goals + the upcoming release notes (v0.20.1).

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 09:48:46 +00:00
43366b1b15 docs(handoff): new-session handoff for C Korean morphological tokenizer (Bug #8)
Self-contained context for the next priority, C (Korean morphological
tokenizer), after merging v0.20.0 sub-item 1 + bugfixes 1~4 + ingest log r1+r2.

First step for the new session + workflow patterns + environment/memory
references + cascade risk + possible fix paths (Option A jieba-rs / B
bi-gram supplement / C query-side expansion). Same pattern as the
spec/plan/executor cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:18:04 +00:00
70507e94ca docs(handoff): logging round 2 closed (PR #190 merged) + update v0.20.1 release notes scope
After merging PR #190 (2026-05-28 commit 7bbdc89a):
- "Logging feature future enhancements" section → closed (all 4 enhancements shipped in PR #190).
- Updated the release notes scope in the G (v0.20.1 release) section: added V008 migration + pdf_ocr_events + CLI inspect + retention.

Next work priorities unchanged: C → B → A → G.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:15:07 +00:00
7bbdc89ae3 Merge pull request 'feat: ingest log round 2 — image_w/h + V008 SQLite mirror + CLI inspect + retention' (#190) from feat/ingest-log-round2-enhancements into main
Reviewed-on: #190
2026-05-28 08:12:25 +00:00
7c24734cc7 docs(superpowers): v0.20.x logging r2 spec + plan artifacts
Round artifacts for logging round 2 (4 enhancements: image_w/h + V008 SQLite mirror + CLI inspect + retention) after spec/plan ACCEPT.

- spec: 751 lines (ACCEPT, 7/7 critic round 1 findings + 7/7 closure r2 traceability)
- plan: 576 lines (ACCEPT, 6/6 steps + 13/13 AC + G1/G2/G3 plan-level resolve)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:04:32 +00:00
9a36a06f97 style: cargo fmt --all (v0.20.x logging r2 feature follow-up)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:34:01 +00:00
35c987df1c feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4)
LoggingCfg gains two fields with serde defaults: keep_recent_runs
(default 100, top-N file retention) and retention_days (default 30,
time-based retention for both ndjson files and the SQLite mirror).

IngestLogWriter::open now runs cleanup_old_logs before creating a new
ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <=
cutoff). ingest_with_config_opts also calls
SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so
the SQLite mirror tracks the same retention window.
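
The delete rule above can be sketched as a pure function. This is a minimal
illustration, not the code in this commit: LogFile and files_to_delete are
hypothetical names, and it assumes idx is the file's rank when sorted newest
first and that cutoff is now minus retention_days.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical view of one log file: its name and last-modified time.
#[derive(Debug, Clone, PartialEq)]
pub struct LogFile {
    pub name: String,
    pub modified: SystemTime,
}

/// Returns the names of the files the retention rule would delete.
/// A file is deleted iff it falls outside the newest `keep_recent` files
/// OR its mtime is at or before `now - retention_days`.
pub fn files_to_delete(
    mut files: Vec<LogFile>,
    keep_recent: usize,
    retention_days: u64,
    now: SystemTime,
) -> Vec<String> {
    // Newest first, so the index doubles as the recency rank (assumption).
    files.sort_by(|a, b| b.modified.cmp(&a.modified));
    let window = Duration::from_secs(retention_days.saturating_mul(86_400));
    let cutoff = now.checked_sub(window).unwrap_or(SystemTime::UNIX_EPOCH);
    files
        .into_iter()
        .enumerate()
        .filter(|(idx, f)| *idx >= keep_recent || f.modified <= cutoff)
        .map(|(_, f)| f.name)
        .collect()
}
```

Because the two conditions are OR-ed, a stale file is removed even when it
still ranks inside the newest keep_recent entries.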

Backward compat (AC-9): both new fields use #[serde(default = ...)],
so a pre-v0.20.x config with only [logging] ingest_log_enabled +
ingest_log_dir parses unchanged. kebab init writes the new defaults
automatically via Config::default() -> toml::to_string_pretty (AC-12).

docs/SMOKE.md config example synced.

Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:17:47 +00:00
d9ec7b8dc3 feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor)
Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide
aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine,
top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or
corpus-wide recent failures, with --doc-id + --limit). Both ship via
new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`.

App gains four facade methods: inspect_ocr_stats /
inspect_ocr_failures plus their *_with_config companions — required by
CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI
dispatch arms thread cfg explicitly into the _with_config form.

Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two
entries; the meta JSON Schema (schema.schema.json) is untouched
because its wire.schemas is pattern-based, not enum-based.

ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces
only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile.
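
A minimal sketch of a four-value percentile helper, for orientation only. The
nearest-rank method, the Option return type, and the nearest_rank helper are
assumptions; the commit does not state how ingest_log::percentiles
interpolates or what it returns for an empty sample set.

```rust
/// Nearest-rank percentile over an already sorted, non-empty slice
/// (assumed method; the real helper may interpolate differently).
fn nearest_rank(sorted: &[u64], pct: u64) -> u64 {
    // rank = ceil(pct/100 * n), clamped to 1..=n, then used as a 0-based index.
    let n = sorted.len() as u64;
    let rank = ((pct * n + 99) / 100).clamp(1, n);
    sorted[(rank - 1) as usize]
}

/// Returns (p50, p90, p99, max), or None when there are no samples.
pub fn percentiles(samples: &[u64]) -> Option<(u64, u64, u64, u64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let max = *sorted.last()?;
    Some((
        nearest_rank(&sorted, 50),
        nearest_rank(&sorted, 90),
        nearest_rank(&sorted, 99),
        max,
    ))
}
```

A 3-percentile consumer such as the round 1 summary can keep using the first
two values plus max and ignore p99.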

SKILL.md synced with the two new schemas (AC-13).

Closure r2 G2 (facade *_with_config pair) + G3 (runtime emit, not
meta schema file) + closure r1 F4 (p99) resolved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:13:08 +00:00
4e451c9f7c feat(app): dual-write PDF OCR events to SQLite + ndjson (Enhancement 2 wiring)
Pre-capture canonical.doc_id and Arc<SqliteStore> before the OCR
emit_progress closure so both the ndjson file and the SQLite mirror
carry the same doc_id for every event. File write is durable
(errors propagate); SQLite insert is non-critical (tracing::warn on
failure, ingest does not abort) per spec R-1.
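
The durable-file / best-effort-mirror split can be sketched as follows. This
is an illustration under stated assumptions: the Mirror trait, VecMirror, and
dual_write are hypothetical stand-ins for SqliteStore and the real emit
closure, and a warnings vector stands in for tracing::warn.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the SQLite mirror.
pub trait Mirror {
    fn insert(&mut self, doc_id: &str, line: &str) -> Result<(), String>;
}

/// Hypothetical in-memory mirror that can be told to fail.
pub struct VecMirror {
    pub fail: bool,
    pub rows: Vec<(String, String)>,
}

impl Mirror for VecMirror {
    fn insert(&mut self, doc_id: &str, line: &str) -> Result<(), String> {
        if self.fail {
            return Err("database is locked".to_string());
        }
        self.rows.push((doc_id.to_string(), line.to_string()));
        Ok(())
    }
}

/// Writes one event to the durable sink, then to the best-effort mirror.
/// Returns Err only when the file write fails; a mirror failure is recorded
/// in `warnings` and the caller carries on.
pub fn dual_write<W: Write, M: Mirror>(
    file: &mut W,
    mirror: &mut M,
    doc_id: &str,
    line: &str,
    warnings: &mut Vec<String>,
) -> io::Result<()> {
    writeln!(file, "{}", line)?; // durable path: errors propagate
    if let Err(e) = mirror.insert(doc_id, line) {
        warnings.push(format!("mirror insert failed for {}: {}", doc_id, e));
    }
    Ok(())
}
```

Both sinks receive the same doc_id because it is passed in by the caller
rather than looked up separately on each side.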

LogEvent::Ocr gains a doc_id: Option<&str> field as an additive
Serde change — round 1 ndjson logs deserialize with doc_id=None.

Closure r1 F1: doc_id NULL in dual-write resolved via
let doc_id_for_log = canonical.doc_id.0.clone() pre-capture.
Closure r2 G1: Arc::clone(&app.sqlite) reused instead of opening a
second SqliteStore — eliminates double-open lock contention and
duplicate migration runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:06:03 +00:00
6482bf1321 feat(store): V008 pdf_ocr_events migration + record/prune API (Enhancement 2)
Add migrations/V008__pdf_ocr_events.sql with the events table + 3
indices (doc_id, run_id, ts). SqliteStore gains two pub fn:
record_pdf_ocr_event (insert one OCR sample) and prune_pdf_ocr_events
(delete rows older than retention_days; returns the affected row
count). Both follow the existing Mutex<Connection> lock pattern.

Wiring into ingest path lands in the next commit.

Closure r1 F2: explicit lock acquisition in both methods.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 05:56:54 +00:00
5977c8cdf1 feat(app): capture image_width/height in PDF OCR raster decode (Enhancement 1)
Add extract_image_dimensions(bytes) helper using image::ImageReader
and fill the 2 PdfOcrProgress::Finished emit points in pdf_ocr_apply.rs
where page_image_bytes is in scope (OCR error path + success path).
The no-DCTDecode skip path leaves None as page_image_bytes is absent.
Result: LogEvent::Ocr carries non-null image_width/image_height on
successful raster decode, enabling future size-conditioned timeout tuning.

Closure r1 F3: kebab-app/Cargo.toml image features += "jpeg" added as
direct [dependencies] entry (not relying on feature unification via
kebab-parse-image).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 05:54:55 +00:00
89d334a92b docs(handoff): next priorities after the v0.20.0 sub-item 1 merge (C/B/A/G order)
Records the next-work priorities after PR #189 (2026-05-28 commit 09333d0) merged:

- C (next up): Korean morphological tokenizer (follow-up to the Bug #8 V007 trigram 2-char limitation)
- B: OCR dense page coverage (metro-korea.pdf page 8/13 timeout; dynamic max_pixels / column-level OCR / gradual timeout reduction)
- A: deferred v0.20 sub-items (sub-item 2 multi-region image / sub-item 3 PDF normalize integration / TODO #4 figure-table / TODO #5 Enricher trait)
- G: v0.20.1 patch release + release notes (180s timeout + [logging] section + wire schema additive + CLI fix)

Non-blocking known: Bug #12 falsified, ask phrasing-sensitive refusal.

Logging feature follow-up enhancements (low priority): image_width/height capture, SQLite mirror, CLI inspect query, log retention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:48:44 +00:00
09333d0b05 Merge pull request 'feat(pdf): scanned PDF OCR via qwen2.5vl:3b vision LLM (v0.20.0 sub-item 1)' (#189) from feat/pdf-scanned-ocr into main
Reviewed-on: #189
2026-05-28 04:37:41 +00:00
685007789a style: cargo fmt --all (round 4 ingest log feature follow-up)
The Phase C4 executor's last `fix(test): clippy + fmt fixes` commit applied
fmt only to the test files. Workspace-wide fmt was found to be missing →
applied cargo fmt --all. All imports reordered alphabetically and line
wrapping made consistent.

Additional untracked artifacts committed alongside:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 line, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 line, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00
445b096215 fix(test): clippy + fmt fixes for logging_roundtrip and ingest_log_smoke
* kebab-config/tests/logging_roundtrip.rs: r#"..."# → plain string
    (clippy::unnecessary_hashes).
  * kebab-app/tests/ingest_log_smoke.rs: |e| e.ok() → Result::ok,
    |s| s.as_u64() → Value::as_u64 (clippy::redundant_closure).
  * cargo fmt --all applied to pre-existing formatting drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:26:42 +00:00
415227bf76 test(app): ingest_log_smoke integration test (AC-9)
New crates/kebab-app/tests/ingest_log_smoke.rs:

  * ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF →
    ingest → assert log file exists + each line is valid JSON +
    each kind ∈ {ocr,parse_error,skip,error,summary} + last
    line kind=summary + scanned>0.

  * ingest_log_disabled_emits_no_file (AC-6): with enabled=false,
    verify that log_dir contains 0 ingest-*.ndjson files.

fixture: reuses ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf
(OCR disabled; the smoke test runs without Ollama).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:06:43 +00:00
f9dc0f749f feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc<Mutex> sync)
Ingest pipeline wiring for the v0.20.x ingest log feature. 5 emit hooks:

  Hook 1: ingest_with_config_opts entry/exit (writer init + summary write + flush)
  Hook 2: apply_ocr_to_pdf_pages closure (PdfOcrProgress::Finished → LogEvent::Ocr)
  Hook 3: ingest_one_*_asset Err arm (LogEvent::Error)
  Hook 4: enumerate fs_skips.events right after scan (LogEvent::Skip)
  Hook 5: (merged into Hook 3) per-asset fatal error → LogEvent::Error

To carry Hook 4's skip events, FsScanSkips in kebab-source-fs gains an
events: Vec<FsSkipEvent> field (kebab-source-fs does not call back into
kebab-app, which avoids a dependency cycle).

Ownership: a single Option<Arc<Mutex<IngestLogWriter>>> binding; the 5 hooks
clone+lock+write. ocr_ms_samples (Vec<u64>, success-only) is shared via
Arc<Mutex>, and the summary stage sorts it and computes p50/p90/max. The
per-asset loop is single-threaded, so there is no deadlock/contention risk.

Writer failures do not fail the ingest itself (tracing::warn + continue).
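
The ownership pattern above can be sketched with std types only. MemWriter,
emit, and run_hooks are hypothetical names used for illustration; the real
IngestLogWriter writes ndjson to a file and reports failures via
tracing::warn, whereas this sketch silently skips a poisoned lock.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical in-memory stand-in for IngestLogWriter.
#[derive(Default)]
pub struct MemWriter {
    pub lines: Vec<String>,
}

impl MemWriter {
    pub fn write_event(&mut self, kind: &str) {
        self.lines.push(format!("{{\"kind\":\"{}\"}}", kind));
    }
}

/// One emit hook: a no-op when logging is disabled, otherwise lock + write.
/// A poisoned lock is skipped so a logging problem never fails the ingest.
pub fn emit(writer: &Option<Arc<Mutex<MemWriter>>>, kind: &str) {
    if let Some(w) = writer {
        if let Ok(mut guard) = w.lock() {
            guard.write_event(kind);
        }
    }
}

/// Runs a toy ingest in which each hook holds its own clone of the handle.
pub fn run_hooks(enabled: bool) -> Vec<String> {
    // Single binding; None means the ingest log is disabled.
    let writer = enabled.then(|| Arc::new(Mutex::new(MemWriter::default())));
    let ocr_hook = writer.clone();
    let skip_hook = writer.clone();
    emit(&skip_hook, "skip");
    emit(&ocr_hook, "ocr");
    emit(&writer, "summary");
    let mut out = Vec::new();
    if let Some(w) = &writer {
        if let Ok(guard) = w.lock() {
            out = guard.lines.clone();
        }
    }
    out
}
```

Each lock is taken and released inside a single emit call, so hooks running
one after another in a single-threaded loop never hold two guards at once.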

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:05:07 +00:00
bef0c98867 feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields
Wire side of the v0.20.x ingest log feature. Additive minor cascade:

  * 4 fields added to PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished:
      - image_byte_size: Option<u64>
      - image_width:     Option<u32>
      - image_height:    Option<u32>
      - failure_reason:  Option<String>
  * docs/wire-schema/v1/ingest_progress.schema.json: 4 added properties
    (all optional, required unchanged = additive minor)
  * integrations/claude-code/kebab/SKILL.md: wire schema description synced

Existing ingest_progress.v1 consumers (CLI wire dump, integration test
fixture, kebab-cli wire_search/wire_ask) stay backward-compatible because
the 4 added fields default to Option::None. No version bump (an additive
minor does not trigger the binary-version cascade, per CLAUDE.md
§Versioning cascade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:57:59 +00:00
f8a4c79727 feat(app): IngestLogWriter + LogEvent enum (per-ingest-run ndjson log)
Module side of the v0.20.x ingest log surface. New
crates/kebab-app/src/ingest_log.rs:
  * IngestLogWriter: open/write_event/write_summary/flush + Drop flush
  * LogEvent enum, 4 variants (ocr / parse_error / skip / error)
  * IngestSummary struct (kind="summary" literal + 11 stat fields)
  * generate_run_id (ISO 8601 prefix + last 8 hex chars of a uuid v7)
  * expand_log_dir (hand-rolled expansion of the {state_dir} placeholder)
  * now_ts (Rfc3339 UTC helper)
  * percentiles helper (sorted Vec p50/p90/max)

uuid v7 is already a workspace dep, which avoids adding rand as a new
dependency (spec §6 R-5).

This step is the self-contained writer + 5 unit tests. Wiring the 5 emit
hooks into the ingest pipeline is step 4.
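
A minimal sketch of hand-rolled {state_dir} expansion. The helper name comes
from the list above, but the signature shown here is an assumption, as is the
choice to replace every occurrence of the placeholder.

```rust
use std::path::{Path, PathBuf};

/// Replaces each literal "{state_dir}" in `raw` with the given directory.
/// Paths without the placeholder are returned unchanged.
/// (Assumed signature; the real helper may take the config struct instead.)
pub fn expand_log_dir(raw: &str, state_dir: &Path) -> PathBuf {
    let expanded = raw.replace("{state_dir}", &state_dir.to_string_lossy());
    PathBuf::from(expanded)
}
```

With the default ingest_log_dir of "{state_dir}/logs", this yields a logs
directory directly under whatever state directory the caller resolves.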

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:53:09 +00:00
f60304beb4 feat(config): add [logging] section (ingest_log_enabled + ingest_log_dir)
Config side of the v0.20.x ingest log surface. New LoggingCfg struct:
  * ingest_log_enabled (bool, default true)
  * ingest_log_dir (PathBuf, default "{state_dir}/logs")

With the #[serde(default)] attribute, a pre-v0.20 config that has no
[logging] section is initialized automatically from LoggingCfg::default()
(AC-10 backward compat).

The actual expansion of the {state_dir} placeholder is handled by the
expand_log_dir helper in step 2 (IngestLogWriter); kebab-config's
expand_path_with_base does not support {state_dir} (spec §6 R-3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:44:21 +00:00
6a9551e0fa fix(config): pdf.ocr.request_timeout_secs default 60 → 180 (Bug #11 follow-up)
In the round 3 final dogfood (2026-05-28), the 60s default forced OCR
timeouts on dense Korean pages (metro-korea.pdf page 8/9/13), losing 1 more
indexed page than round 2. From the user's perspective, at 60s the cost vs
coverage trade-off cuts too far into coverage.

Adopt a gradual-reduction policy toward the sweet spot: start from a
conservative 180s and reduce step by step based on dogfood evidence
(distribution of mean OCR ms). Do not jump straight to a short default such
as 60s.

- crates/kebab-config/src/lib.rs::default_pdf_ocr_request_timeout_secs() = 180
- unit test rename (_is_60s → _is_180s) + assertion 180
- crates/kebab-config/tests/pdf_ocr.rs assert_eq 180
- tasks/HOTFIXES.md: 2026-05-28 follow-up entry added

User override path preserved: users can tune the value directly via
config.toml [pdf.ocr] request_timeout_secs = N.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:40:23 +00:00