kebab

Author	SHA1	Message	Date
altair823	825543549d	docs(plan): implementation plan for separate dense alias vectors. ALIAS_SUFFIX(core) → embed_aliases flag → ingest sentinel vectors + purge → VectorRetriever strip + dedup → measurement. TDD, complete code. doc-side expansion PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 10:28:43 +00:00
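altair823	bcb8b93751	docs(spec): design spec for separate dense alias vectors. PoC (concat) measurement: dense aliases scored 6/0/2/0.25 (showing that descriptive queries are where dense retrieval is strongest), but two English descriptive queries were not recovered because concatenation diluted the alias signal in the body text. Remedy: index aliases as separate vectors under sentinel chunk_ids (body vectors unchanged = regression-safe; pure alias signal). Flag ingest.expansion.embed_aliases, default off. The lexical relaxation approach is dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 10:26:24 +00:00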
altair823	69b53d1c97	docs(spec): correct the doc-side expansion search mechanism to match the shipped implementation. Task 6 review MINOR-1: the spec body described a single UNION ALL + GROUP BY query, but the shipped code uses two queries (run_query + run_alias_query) plus a Rust merge_body_alias (body first). Absolute bm25 values from different FTS tables are not comparable, so a body-first merge is cleaner. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 03:20:13 +00:00
altair823	467a974901	docs(plan): doc-side expansion implementation plan + spec refinement (separate FTS table). spec: to avoid the chunks_fts §5.5 verbatim conflict, refined to a separate chunk_aliases_fts table plus a body+alias merge inside the lexical retriever (RetrievalDetail and wire schema unchanged). plan: 7 TDD tasks (Chunk field → V010 → config → ExpansionGenerator → ingest hook → lexical merge → measurement/docs). Complete code + build conventions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 02:04:58 +00:00
altair823	098413922b	docs(spec): design spec for index-time doc-side expansion (Phase 2). Decisions from brainstorming: generate aliases per chunk (same language + KO↔EN translation), additive with manual reindexing, simple first-pass quality control. Separate FTS5 aliases channel → 3-channel RRF fusion. Flag off by default; on/off measured with kebab eval variants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 01:54:46 +00:00
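altair823	8bb7c276d0	docs: Phase 2 doc-side expansion kickoff + implementation methodology handoff. A context document that lets a new session pick up Phase 2 (index-time doc-side expansion) on its own. It covers the background (rerank falsified → goal redefined → Phase 1 diagnosis, B dominant → deep research → PoC), the design direction (KO↔EN translated aliases + separate FTS5 field + RRF, flag off), the measurement tools already built (kebab eval variants + dogfood golden), and the same implementation methodology used so far (brainstorm → spec → plan → OMC teammate sequential implementation + review + independent verification, model routing, build redirect + exit, measurement = variant eval with no proxies, gitea-pr review loop). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 01:19:14 +00:00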
altair823	b6ad947378	docs: slim down the README command table + move details to ARCHITECTURE and sync. Shrinks the README's oversized cells (ingest 2891→544, search 2952→687, ask 1244→415, tui 2300→453 characters) to "what + key flags + pointer". Structural details removed from the README move to ARCHITECTURE: - add Go/Java/Kotlin/C/C++ to the symbol path format + code chunk provenance (citation.kind/code_lang/repo) - Markdown title auto-fill order (md-frontmatter-v2) - new decision row for RAG groundedness verification (mDeBERTa-v3 XNLI, nli_threshold gate) - TUI row updated to P9-1~4 complete + F1 cheatsheet (stale "planned" removed). The exhaustive flag list is delegated to --help and the TUI keys to the in-app F1 cheatsheet (the authoritative runtime source), to prevent staleness. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 01:10:39 +00:00
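altair823	5ad1f98227	docs(handoff): doc-side expansion deep research + PoC results (Phase 2 direction confirmed). Deep research (104 agents): the best remedy for vocabulary-gap pool misses is index-time doc-side expansion. PoC (dogfood KB): three queries that had recall@50=0 came back at rank 1~2 once aliases were added (hybrid + vector; the aliases were not verbatim copies of the golden queries, so the effect generalizes). The key previously unverified link is now quantitatively confirmed on a real corpus. Phase 2 = index-time doc-side expansion (KO↔EN translated aliases) → separate FTS5 field → RRF, flag off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 00:53:24 +00:00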
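altair823	a58cae2ff3	docs(research): deep-research reference on resolving vocabulary-gap pool misses. deep-research workflow (104 agents, 5 angles, 22 sources, 25 claims verified by 3-vote, 22 confirmed / 3 killed). Conclusion: index-time doc-side expansion (doc2query) is the best remedy for pool misses: it enlarges the candidate pool itself, adds ~0 per-query latency (done once at index time), and preserves exact matching (appended as a separate field). However, vanilla mt5 generates in the same language, so closing the KO/EN gap requires generating KO↔EN alternative queries at index time. Query-side approaches (HyDE = the already rejected per-query LLM call; Vector-PRF = recall claim rejected) are unsuitable. Verification is possible with the existing variant eval. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 00:53:24 +00:00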
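altair823	7a1dff1684	docs(handoff): query-paraphrase robustness Phase 1 complete + (A)/(B) diagnosis. Measured 8 groups × 4 variants (dogfood): groups=8 A_dominant=2 B_dominant=4 spread@10=0.750. Diagnosis: the problem is lexical distance, not Korean vs English (English paraphrased sentences also miss, while some Korean queries are fine). B (vocabulary gap, recall@50=0, cannot be fixed by reranking) dominates → signals a query expansion/translation remedy. A (rank fluctuation) appears in only 2 groups, cap_theorem and vector_database. Quantitatively validates the "measure first" thesis (reranking alone is a partial solution). Phase 2 remedy decision pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 00:53:24 +00:00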
altair823	fe4c854673	docs(plan): query-paraphrase robustness Phase 1 implementation plan. 5 tasks: (1) GoldenQuery.group + group consistency validation, (2) variant-consistency metrics module + A/B (rank fluctuation / vocabulary gap) classification, (3) kebab eval variants CLI, (4) curation of dogfood golden variant groups, (5) measurement + diagnostic report. Bite-sized TDD, complete code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 00:53:24 +00:00
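altair823	1de3f4ffca	docs(spec): design of the query-paraphrase robustness evaluation framework (measure first). Goal redefined: from KO/EN overlap to consistent answer quality across different phrasings of the same meaning (synonyms, different vocabulary, paraphrased sentences, KO/EN). Reflects the lesson from the previous reranker experiment, which spun its wheels optimizing an overlap proxy: before prescribing a fix, start with an evaluation that directly measures the real metric (variant consistency). Phase 1 (implemented by this spec): add variant groups (intent groups) to the kebab-eval golden suite + variant-consistency metrics (recall_spread, answer_consistency) + automatic classification of (A) rank fluctuation vs (B) vocabulary gap using recall@pool vs recall@k. Phase 2 (the remedy) is conditional, gated on the measurement results. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 00:53:24 +00:00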
altair823	166b1404e4	docs(release-notes): correct refusal detection mechanism + O-2 phrasing. Leader review of the writer draft: refusal detection is based on the presence or absence of a citation marker (`[#number]`), not a special `<REFUSE>` marker. The O-2 wording example is also corrected to match the actual rag-v3 rules. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 04:58:08 +00:00
altair823	4afcaf96d2	docs(release-notes): add v0.20.2-draft (rag-v3 response language + search-quality eval infrastructure). Drafts the v0.20.2 release notes. Each finding is described in a four-paragraph user-impact structure. - Finding #1/O-2: rag-v3 automatic response-language matching + language-neutral refusal - Finding #2: bulk search input schema finalized (15 fields) - Finding #3: list docs gains a human-readable path - Finding #7: distinguish the two index_version fields (vector vs FTS5) - eval --config facade + search-quality baseline (hybrid hit@3=1.0 / MRR=0.833) - Finding #4/#5/#6/#8: docs/schema cleanup - version cascade caveat (rag-v3 → eval compare) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 04:54:19 +00:00
altair823	40d7faee71	docs(dogfood): add §10.2 search quality baseline scenario (v0.20.2 golden suite). Now that the eval --config facade patch allows the dogfood KB to be evaluated directly, adds a §10.2 search-quality baseline section under §10 Eval. - golden suite run commands (hybrid + lexical eval run → aggregate) - v0.20.2 metric baseline table (hybrid hit@3=1.0 / MRR=0.833) - qualitative checklist (Korean 2-character hit@3, empty=0, MRR threshold) - golden curation procedure + lessons from the dispatch.py error - existing basic eval run reorganized as §10.1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 04:54:03 +00:00
altair823	2429189447	docs(spec): reflect search-quality critic round-1 (eval --config, lang-filter non-goal, curation). Incorporates all critic (opus) round-1 findings into the dogfood search-quality eval design spec: BLOCKER-1: §4.4 execution commands now use --config /build/dogfood/config.toml (Task A facade-rule patch makes this the canonical path). §5.1 re-titled from "(follow-up patch)" to "Applied by Task A: recommended operating path"; XDG workarounds demoted to "pre-patch fallback". Intro paragraph updated accordingly. MAJOR-1: §3 Non-Goals gains an explicit bullet: lang/media/code_lang SearchFilters validation is out of scope for this harness (runner uses SearchFilters::default(), runner.rs:151). §4.1 "code search" row no longer claims code_lang filter coverage. MINOR-1: §4.3 step 3 now names kebab inspect doc <id> as the primary chunk-selection path (breaks chunk-level curation loop); search hits demoted to "for secondary confirmation". MINOR-2: §4.1 golden category table gains two new rows: Korean N-gram fallback query (compound word / neologism coverage) and English whole-token exact query (separates substring artefacts). MINOR-3: §4.1 YAML header note added: record corpus_revision in golden file so stale-bail root cause is immediately traceable. NIT: §9 References line numbers corrected (runner.rs:31, metrics.rs:116/144); runner.rs:151 SearchFilters::default() reference added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 03:43:00 +00:00
altair823	571996938c	docs(contract): bump default prompt_template_version to rag-v3 (Todo #1). line 899: only V1 legacy → both V1 and V2 legacy; declares rag-v3 the default from v0.20.2. line 1349 (★): config example default rag-v2 → rag-v3. line 1533 (★): §9 cascade table code constant rag-v2 → rag-v3. After line 287: add a historical-snapshot comment to the answer.v1 example block (n1: model + ptv are stale, values left unchanged). Task spec grep decision: the two lines mentioning rag-v2 in tasks/p9/p9-fb-15 are historical descriptions from when rag-v2 was introduced → kept frozen. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:45:13 +00:00
altair823	be79bdb83d	docs(#7): distinguish vector-store vs FTS5 index_version (Todo #7). schema.schema.json models.index_version: state explicitly that this is the vector store (LanceDB) version. search_hit.schema.json index_version: state explicitly that this is the lexical (FTS5) version. search_hit.schema.json retrieval: add the list of inner fields + a description of the hybrid-only fusion (shared hunk). README kebab schema row: add a caution that index_version means different things in the two places. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:45:04 +00:00
altair823	4e76f103c1	docs(#5,#6): clarify retrieval.* nesting + single-mode score relation (Todo #5/#6). Add an explanation of the score ↔ retrieval.* structure to the README's Score interpretation section: - fusion_score/lexical_score/vector_score/lexical_rank/vector_rank live inside retrieval (not top-level). - In single-mode, it is expected that score == fusion_score == lexical/vector_score all hold the same value (Finding X). Add to the search_hit.schema.json score field its relation to score_kind and the reason the values coincide in single-mode. The search_hit.schema.json retrieval/index_version descriptions are included in the Task 12 commit (same hunk). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:44:56 +00:00
altair823	4fd672193f	docs(#4): clarify lang vs code_lang semantic and und=code (Todo #4). Add to the lang_breakdown description that natural-language detection is not run on code documents (lang="und" is expected). Add a new README section explaining lang vs code_lang. Task spec grep: the rag-v2 mentions in tasks/p9/p9-fb-15 are historical descriptions → kept frozen. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:44:34 +00:00
altair823	dece5e89fc	feat(bulk): document bulk search input schema + error shape hint (Todo #2 ) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:16:21 +00:00
altair823	f5ff823984	docs(dogfood): add RAG response-language scenario (Todo #1 verification)	2026-05-28 21:28:22 +00:00
altair823	85efeeca3e	docs(plan): implementation plan for the v0.20.2 dogfood findings (15 tasks). Written by the planner (opus) → critic review attempted → coordinates verified by the leader. 8 todos → 15 tasks: 4 code tasks (rag-v3 / list docs / bulk / init) + 4 full-dogfooding verification tasks, one after each finding + 3 docs-only + contract + HOTFIXES/release-notes + version bump. Plan critic round-1 misdiagnosed coordinate blockers (B-1/B-2/M-1/M-2) because of broken tooling in its environment → the leader grep-verified pipeline.rs/config/cli/bulk/Cargo.toml directly, confirmed that the plan's coordinates are accurate, and added a header for the executor with a "re-check anchors with grep" note and a binary-path warning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 21:09:39 +00:00
altair823	2b4ba8e104	docs(spec): v0.20.2 dogfood findings design + round-1 critic feedback. Designs the 8 todos found during the v0.20.1 full dogfooding (Ask response language rag-v3 / doc.lang docs / bulk input / list title / fusion_score and score_kind / schema index_version / Ollama hint) as a single patch release. Writer worker draft → opus critic round-1 review incorporated: - B1: top-level score placeholder → finalized (score_kind declares the meaning, search.rs:95-99) - M1: state that forcing the image caption language is out of scope - M2: state that the config default test (lib.rs:1316) needs updating - M3: full set of bulk input fields (query/mode/k/trust_min/ingested_after/media/tag/lang) - M4: cascade impact of rag-v3 on eval_runs.config_snapshot_json Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 20:37:42 +00:00
altair823	b08941d6ab	docs(handoff): mark brainstorming-needed items in v0.20.1 findings todo. Add a fix-path classification to each todo: - 🧠 required: needs a design / user-expectation decision; use the brainstorming skill first - 🔧 mechanical: spec drift or an obvious fix; no separate brainstorming - 📝 mild discussion: the trade-off can be settled by the spec drafter's self-review. Classification: - 🧠 required (2): #1 Ask English→Korean response policy, #4 doc.lang semantic - 🔧 mechanical (4): #2 bulk schema, #5/#6 docs sync, #7 schema rename - 📝 mild (3): #3 list title, #8 Ollama default. Additional brainstorming candidates (beyond the direct findings): - BS-A: HTML corpus support (1415 files skipped) - BS-B: Tier 1/2/3 chunker UX visibility - BS-C: kebab dogfood subcommand (automation) - BS-D: efficiency of tokenized_korean_text for English code chunks - BS-E: exposing the builtin_blacklist specification. Recommended workflow: 1. brainstorming stage first (#1, #4 + review each BS-x) 2. mechanical batch (#2, #5, #6, #7) in one PR 3. mild discussion batch (#3, #8) 4. dogfood retest → v0.20.2 patch release	2026-05-28 19:45:53 +00:00
altair823	6bf4e82e62	docs(handoff): v0.20.1 full dogfood findings todo for next session. Collects the findings from the post-merge full dogfood of v0.20.1 (the user's real corpus of 6293 files, 3.5-hour ingest, scenarios §1~§11) into a self-contained todo handoff for a new session. P0 (bug / behavior differs from intent): - #1 Ask with an English query → Korean response (forced by the rag-v2 prompt template) - #2 bulk search input format unclear (not specified in the wire schema) - #3 duplicate list docs titles (heading-based; needs doc_path as a secondary label) - #4 doc.lang = und for 53% (lang detection fails on code files) P1 (docs drift): - #5 location of fusion_score (.retrieval.fusion_score) - #6 meaning of score_kind="bm25" (the fusion_score in lexical mode) - #7 confusion between schema index_version and lexical_index_version P2 (setup): - #8 Ollama endpoint defaults to localhost (the user's environment is remote). Each todo lists severity, scenario, suspected location, and action items. Includes the start command for the new session + recommended branch + dogfood re-run procedure + cumulative findings table. Repo state: main HEAD=a0c7fa3, clean. v0.20.1 binary OK. /build/dogfood/ KB (3940 docs, 34896 chunks) preserved for regression test.	2026-05-28 19:40:10 +00:00
altair823	028d9ad4ea	docs(release): v0.20.1 release notes draft + spec/plan dogfood cross-link. #1 (user request): write the release notes draft + strengthen the dogfood evidence cross-links in the spec and plan. docs/release-notes/v0.20.1-draft.md (new): - 4-paragraph body (Korean 2-character query support + English substring-matching regression + automatic V007→V009 backfill + ingest performance impact). - Migration cascade table (lexical_index_version, corpus_revision, wire schema shape preservation). - API + dependency changes (lindera v3, lindera-ko-dic v3, retired short_query_hint helper, new facade APIs). - Breaking changes stated (English substring-matching regression, first-boot latency, larger DB/binary size). - Upgrade procedure + Known limitation + 14 dogfood scenario reference. spec Appendix B (segmentation evidence): - new "Empirical verification (2026-05-28 dogfood — post-merge update)" subsection. Table of prior-knowledge assumptions vs measured results. Scenarios 1-4 all marked verified. Records the evidence that ko-dic splits '서울특별시' → '[서울, 특별시]'. plan Changelog: - post-implementation entry: 22 commit on branch, S3 blockers, S7 cascade, S11 sanity regression updates, opus PR review 4 finding fixes. - dogfood evidence entry: 14 scenarios verified as passing, ko-dic segmentation evidence, HOTFIXES + spec Appendix B cross-link. Spec: …spec…md Appendix B Plan: …plan…md (post-implementation + dogfood evidence Changelog) Release notes: docs/release-notes/v0.20.1-draft.md	2026-05-28 13:34:33 +00:00
altair823	f2a76cfe94	docs(dogfood): V009 morphological tokenizer scenarios + fixture evidence. Reflects the fixtures and scenarios from the v0.20.1 dogfood verification in DOGFOOD.md / SMOKE.md. States that 0 hits for Korean queries is expected (the tokens are absent) in an English/code-centric KB such as the user's KnowledgeBase, and provides inline reference fixtures (korea-overview.md + korea-compound.md) for verifying ko-dic's morpheme segmentation behavior. DOGFOOD.md §2.1 updated: - description: trigram → unicode61 + morpheme column. - scenarios: expanded to 7 cases, including Korean 2-char (한국, 서울) + compound noun (서울특별시) + English whole-token regression + 1-char filter. - §2.1bis (new): V009 dogfood evidence reference corpus + verification commands + expected snippets (evidence of lindera segmentation) + known limitation (ko-dic's single-token policy for compounds, Option α acceptance). SMOKE.md 'V009 morphological search' updated: - removed the trigram-era hint advisory + the scenario recommending 3-character keywords. - replaced with the v0.20.1 scenarios: 2-char Korean / compound noun / 1-char filter / English whole-token regression. Reference fixtures (verified as passing in actual runs): - korea-overview.md: '한국' / '서울' / '지하철' all hit. - korea-compound.md: '한국어' / '한국문화' / '서울특별시' compound hit. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (dogfood evidence)	2026-05-28 13:22:43 +00:00
altair823	8c56ef3010	docs(superpowers): v0.20.x C Korean morphological tokenizer spec + plan artifacts. This commit archives the 5 artifact files from the 4-stage workflow for the v0.20.x C task (Bug #8: 0 hits for Korean 2-character queries): - spec.md (668 lines, status=accepted): comparison of Options A/B/C + decision for lindera Path A (accepting the English substring-matching regression) + 12 sections + 4 Appendices (B segmentation evidence, C cost evidence, D license evidence). - spec-critic-r1.md: 3 critical + 6 major findings (NEEDS_REWRITE). - spec-critic-r2.md: traceability matrix after the r1c rewrite (ACCEPT). - plan.md (750 lines, status=accepted): 11 steps + dependencies + cost optimization routing + 9 closure micro-patches applied. - plan-closure-r1.md: traceability matrix + origin of the 9 MPs. These artifacts are frozen references once the implementation is merged. For later deviations, tasks/HOTFIXES.md is the source of truth. Workflow stage: 1. spec drafter (omc team writer, opus) 2. spec critic R1 (omc team critic, opus): NEEDS_REWRITE 3. spec rewriter r1c (omc team writer, opus): 7 item fix 4. spec closure R2 (in-process verifier, sonnet): ACCEPT 5. plan drafter (omc team planner, opus) 6. plan closure (in-process verifier, sonnet): ACCEPT + 9 MP 7. subagent-driven-development implementation (11 step + 5 follow-up + 1 docs polish = 17 commit) 8. PR-level final code review (in-process code-reviewer, opus): Approved with notes (4 minor docs finding, merge as-is) Branch: feat/korean-morphological-tokenizer Version: 0.20.1	2026-05-28 12:53:31 +00:00
altair823	5d9ea588ed	docs(v0.20.1): polish PR-review findings (README/HOTFIXES/schema/SKILL). Mechanical fixes for the 4 minor findings from the opus PR-level final review (Approved with notes): 1. README.md: the `kebab search` row still described English substring matching as it worked under V007. It now states honestly that V009 regresses to whole-token matching (substring → V002 behavior) and recommends vector/hybrid mode. 2. tasks/HOTFIXES.md: corrects the file paths in the 2026-05-28 entry. lexical.rs does not call lindera; it only updates MIN_QUERY_CHARS 3→2 in build_match_string. The lindera helper actually lives in kebab-chunk/src/lib.rs. ingest.rs is outside this PR's scope; the eager backfill hook is in kebab-app/src/app.rs::App::open_with_config. 3. docs/wire-schema/v1/search_response.schema.json: the `hint` field description still carried the advisory wording from the V007 trigram 3-char-minimum era. It now states that in v0.20.1 the helper is retired and the field is always omitted (the field stays in the schema only for forward compatibility). 4. integrations/claude-code/kebab/SKILL.md: resolves the self-contradiction in the `hint` field description ("present only with trigram in edge cases" vs "Korean 2-char now supported"). States that the field is retired and may be reused. PR-level reviewer recommendation: "Merge as-is; not a blocking issue (all findings are minor)". This commit takes the reviewer's option 1 (a separate docs hotfix commit). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (PR-level finding follow-up)	2026-05-28 12:53:00 +00:00
altair823	d13eb87401	docs(v0.20.x): sync README + HANDOFF + ARCH + SKILL + HOTFIXES for V009. Cascades the user-visible surface changes of the V009 Korean morphological tokenizer and the release notes scope across 5 docs. - README.md: state Korean 2-character query support in the kebab search command row. - integrations/claude-code/kebab/SKILL.md: remove the V007 3-char hint + add one line on V009 2-character Korean query support. - HANDOFF.md: flip the C task status to complete + add this change to the v0.20.1 release notes scope + add a row to the post-merge findings summary. - docs/ARCHITECTURE.md: add the embedding upgrade (e5-small → e5-large), lindera-ko-dic FTS5 Korean support, and version notes. - tasks/HOTFIXES.md: 2026-05-28 entry covering Bug #8 resolved by V009, lindera-ko-dic as the actual crate name (spec deviation), cargo-deny deferred, and the Path A English substring-matching regression. Spec: tasks/p9/p9-9-v0.20.x-korean-morphological-tokenizer-spec.md §7.4 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-05-28 11:55:25 +00:00
altair823	b106120e93	feat(fts): add V009 korean morphological tokenizer migration. Adds the V009 migration to resolve the V007 trigram tokenizer's limitation of 0 hits for Korean 2-character queries (Bug #8). It reverts to the unicode61 tokenizer and pre-fills the Korean morphological segmentation result into a separate `tokenized_korean_text` column. - migrations/V009__fts_korean_morphological.sql (new): column ADD, chunks_fts DROP + redefinition, 3 triggers with CASE expressions, backfill INSERT, corpus_revision bump. - design §5.5 updated: trigram → unicode61 + morpheme column. Trigger bodies with CASE expressions. - crates/kebab-store-sqlite/tests/fts.rs: rename the V007 verbatim test so that V009 is the source of truth. Add the v009_bumps_corpus_revision unit test. - store.rs: fix pre-existing clippy bool_to_int_with_if + cast_lossless warnings (pdf_ocr_events-related code, found during S1 work). English substring matching regresses to V002 behavior (whole-token only), stated honestly in spec §3 Non-Goals and the upcoming release notes (v0.20.1). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 09:48:46 +00:00
altair823	43366b1b15	docs(handoff): new-session handoff for C, the Korean morphological tokenizer (Bug #8). Self-contained context for the next priority, C (Korean morphological tokenizer), after merging v0.20.0 sub-item 1 + bugfix 1~4 + ingest log r1+r2. Covers the new session's first step + workflow patterns + environment/memory references + cascade risk + possible fix paths (Option A jieba-rs / B bi-gram supplement / C query-side expansion). Same spec/plan/executor cycle pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:18:04 +00:00
altair823	7c24734cc7	docs(superpowers): v0.20.x logging r2 spec + plan artifacts. Round artifacts recorded after the spec/plan for logging round 2 (4 enhancements: image_w/h + V008 SQLite mirror + CLI inspect + retention) reached ACCEPT. - spec: 751 lines (ACCEPT, 7/7 critic round 1 findings + 7/7 closure r2 traceability) - plan: 576 lines (ACCEPT, 6/6 steps + 13/13 AC + G1/G2/G3 resolved at plan level) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:04:32 +00:00
altair823	35c987df1c	feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4) LoggingCfg gains two fields with serde defaults: keep_recent_runs (default 100, top-N file retention) and retention_days (default 30, time-based retention for both ndjson files and the SQLite mirror). IngestLogWriter::open now runs cleanup_old_logs before creating a new ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <= cutoff). ingest_with_config_opts also calls SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so the SQLite mirror tracks the same retention window. Backward compat (AC-9): both new fields use #[serde(default = ...)], so a pre-v0.20.x config with only [logging] ingest_log_enabled + ingest_log_dir parses unchanged. kebab init writes the new defaults automatically via Config::default() -> toml::to_string_pretty (AC-12). docs/SMOKE.md config example synced. Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:17:47 +00:00
altair823	d9ec7b8dc3	feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor) Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine, top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or corpus-wide recent failures, with --doc-id + --limit). Both ship via new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`. App gains four facade methods: inspect_ocr_stats / inspect_ocr_failures plus their _with_config companions — required by CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI dispatch arms thread cfg explicitly into the _with_config form. Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two entries; the meta JSON Schema (schema.schema.json) is untouched because its wire.schemas is pattern-based, not enum-based. ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile. SKILL.md synced with the two new schemas (AC-13). Closure r2 G2 (facade _with_config pair) + G3 (runtime emit, not meta schema file) + closure r1 F4 (p99) resolved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:13:08 +00:00
altair823	685007789a	style: cargo fmt --all (round 4 ingest log feature follow-up). The Phase C4 executor's last commit, `fix(test): clippy + fmt fixes`, applied fmt only to the test files. Formatting turned out to be missing for the rest of the workspace → applied cargo fmt --all. All imports reordered alphabetically + line wrapping made consistent. Additional untracked artifacts committed at the same time: - docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT) - docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT) workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 04:18:40 +00:00
altair823	bef0c98867	feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields. The wire side of the v0.20.x ingest log feature. Additive minor cascade: * 4 fields on PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished: - image_byte_size: Option<u64> - image_width: Option<u32> - image_height: Option<u32> - failure_reason: Option<String> * docs/wire-schema/v1/ingest_progress.schema.json: 4 added properties (all optional, required unchanged = additive minor) * integrations/claude-code/kebab/SKILL.md: wire schema description synced. Existing ingest_progress.v1 consumers (CLI wire dump, integration test fixture, kebab-cli wire_search/wire_ask) stay backward-compatible because the 4 added fields are Option::None. No version bump (additive minor = not a binary-version cascade trigger per CLAUDE.md §Versioning cascade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 02:57:59 +00:00
altair823	46e99470eb	docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md. Outputs of the 3-round dogfood-driven fix cycle: - bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines - bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines - bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines - docs/DOGFOOD.md: the full comprehensive dogfood checklist (§0 environment ~ §13 reference corpus). Each round's spec/plan was frozen after critic + verifier round 2 closure ACCEPT. Based on dogfood-driven evidence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 01:21:34 +00:00
altair823	5bba95fd71	docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation. Frozen-spec deviation handoff for Bug #11 (previous commit `fix(config): pdf.ocr.request_timeout_secs default 600 → 60`). - tasks/HOTFIXES.md: subsection dated 2026-05-27 in the 5-field format Discovered / Symptom / Root cause / Fix / Amends (matching existing entries). - docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md: 2 inline HTML comment lines as cross-links, at PDF OCR config block line 1000 (default value) and OQ-1 line 1628. Zero prose changes: the parent spec's frozen contract is preserved, and HTML comments are invisible when the markdown is rendered. The HOTFIXES entry is the live source of truth (CLAUDE.md "Spec contract" rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:16:18 +00:00
altair823	d9c7aabce1	feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13). Before: schema.v1.models reported only single parser_version / chunker_version values → risk of gaps in a version cascade audit for a multi-medium corpus (md + pdf + code Rust/Python + dockerfile + k8s + manifest). After: additive minor, adding active_parsers + active_chunkers as Vec<String> to the Models struct. Backward compat: the existing single fields are kept (markdown default), and the new arrays are optional (#[serde(default)] + not listed in the JSON schema's required). source: - kebab_store_sqlite::fetch_distinct_parser_versions() returns documents.parser_version with DISTINCT + ORDER BY. - fetch_distinct_chunker_versions() follows the same pattern for chunks.chunker_version. - collect_models recomputes on every schema call (no cache, which resolves R-3 automatically). The wire schema change is additive only, so no major bump is needed; a v0.20.1 minor is sufficient. integrations/claude-code/kebab/SKILL.md updated in sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:15:58 +00:00
altair823	b4d9e60816	chore(release): bump version 0.19.0 → 0.20.0: v0.20.0 sub-item 1 scanned PDF OCR # v0.20.0: scanned PDF OCR via Ollama vision LLM The core change in v0.20.0 is OCR ingest for scanned PDFs that have no embedded text (book scans, receipts, camera-captured pages). In the PoC's 5-engine comparison (Tesseract / EasyOCR / PaddleOCR / gemma4:e4b / qwen2.5vl:3b), qwen2.5vl:3b's alnum scores of 94.79% (page1) / 81.56% (batchim, i.e. final-consonant-heavy text) beat every other engine, so it is this release's default vision OCR. ## 1. Opting in to OCR Enable with `enabled = true` in the `[pdf.ocr]` config or with the `KEBAB_PDF_OCR_ENABLED=true` env variable. Default off: the cost of 45-100s per OCR page (qwen2.5vl:3b on CPU, remote Ollama) is unsuitable for non-OCR KBs other than book archives. ```toml [pdf.ocr] enabled = true model = "qwen2.5vl:3b" # see README for the other defaults ``` Pulling qwen2.5vl:3b with Ollama: ```bash ollama pull qwen2.5vl:3b # 3GB Ollama image ``` ## 2. Force-reingesting scanned PDFs indexed by v0.19 A KB whose scanned PDFs were ingested with the v0.19 binary does not enter the OCR path automatically, because parser_version "pdf-text-v1" is preserved (a decision to avoid triggering CLAUDE.md §Versioning cascade, H-4). As a result, if you only upgrade to the v0.20 binary and set `pdf.ocr.enabled = true`, the Unchanged path of try_skip_unchanged skips OCR. Explicit reprocessing: ```bash kebab ingest --root /path/to/kb --force ``` ## 3. DCTDecode-only v1 scope (handling FlateDecode / CCITTFax pages) PDF page image extraction in v0.20.0 covers only lopdf image XObjects with /Filter == DCTDecode (JPEG passthrough). Other encodings (FlateDecode raw pixel, CCITTFaxDecode bilevel, JPXDecode JPEG2000) emit a warning event and the page is skipped. When some pages of a scanned PDF are encoded with FlateDecode or CCITTFax: ```bash qpdf --object-streams=disable --recompress-flate input.pdf normalized.pdf ``` The intent of v1 is the single-binary principle (no image crate introduced). Multi-filter support will be considered in v1.1+ or a separate sub-item. ## 4. Family asymmetry (image OCR gemma4:e4b vs PDF OCR qwen2.5vl:3b) The image OCR (P6) default remains gemma4:e4b (no change). Only PDF OCR (v0.20) uses qwen2.5vl:3b. Users can unify the two by setting [image.ocr] model = "qwen2.5vl:3b", but the defaults keep the family asymmetry. 
## Dogfood + test results - workspace test: 178 result lines, 0 failure. - workspace clippy (-D warnings): exit 0. - alnum e2e (real Ollama, manual invoke): - F1 (Korean page1): 94.79% (≥ 0.85 threshold). - F2 (batchim-intensive): 81.56% (≥ 0.70 threshold). - integration smoke + vector PDF regression: pass. ## Changed surface - new config: [pdf.ocr] (11 field) + 11 env override KEBAB_PDF_OCR_*. - new wire: IngestEvent::PdfOcrStarted/Finished (additive minor). - new wire: IngestItem.pdf_ocr_pages/ms_total (additive minor). - new CLI line: "📷 OCR page N..." / "✓ OCR page N (chars chars, msms via ollama-vision)". - new module: kebab-parse-pdf::{page_image, text_quality} + kebab-app::pdf_ocr_apply. - dep: workspace lopdf = "0.32" unified. - fixture: 5 PDF (F1/F2/F4/F6/F7) under crates/kebab-parse-pdf/tests/fixtures/. ## Unchanged surface (invariant) - Extractor::extract trait body byte-identical (PR #187). - PdfTextExtractor body unchanged; the OCR logic is split out as a post-extract enrichment pattern. - parser_version "pdf-text-v1" preserved. - chunker_version "pdf-page-v1" preserved. - No change to the production dep graph in workspace.dependencies (-e normal baseline preserved). 
## 11-commit history of the sub-item `9d7faab` Step 1: foundation + cargo tree baselines `aeeff36` Step 2: lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) `fb3952d` Step 2 fix: F7 conversion engine record correction `c2cd3a7` Step 3: page_image + text_quality modules (10 test) `8d81bc1` Step 3 fix: clippy pedantic in page_image `9f003ef` Step 4: pdf_ocr_apply helper (10 test, F7 split + cancel) `fd918a6` Step 5: [pdf.ocr] config section + PdfOcrOpts doc `4672cba` Step 5 fix: clippy::bool_assert_comparison in pdf_ocr tests `b9ee09f` Step 6: wire PDF OCR enrichment + cancel propagation `4c5ccd5` Step 7: wire schema additive (IngestEvent + IngestItem + skipped) `c9e0594` Step 8: CLI printer activation + ingest_progress test + spec literal `4819768` Step 9: integration smoke + vector regression + alnum e2e `1d4e301` Step 9 follow-up: Cargo.lock for dev-dep additions `90726ab` Step 10: docs sync (README + HANDOFF + ARCHITECTURE + SMOKE) ## § Acceptance §9 verifier evidence All 15 rows of the K5 scriptable verifier are green (or, for the manual real-Ollama rows, their results are reported): - Row #4 (vector PDF byte-identical): pass. - Row #5 (Extractor::extract trait byte-identical): 0 line diff. - Row #6 (wire schema additive): jq + diff exit 0. - Row #7-#8 (clippy / workspace test): exit 0. - Row #9-#10 (dep graph baseline -e normal): empty diff. - Row #11 (docs sync): grep evidence. - Row #12 (version bump): "0.20.0" + Cargo.lock cascade ≥ 22. - Row #14 (PR #187 invariant): extract_for(&asset.media_type) ≥ 1. - Row #15 (DCTDecode-only v1, F6/F7 skip): test green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 11:03:44 +00:00
altair823	90726ab283	docs(v0.20): sync README + HANDOFF + ARCHITECTURE + SMOKE for scanned PDF OCR (post-extract enrichment, qwen2.5vl:3b, DCTDecode-only v1)

Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

J0 - release notes path decision: commit body (RELEASE_NOTES.md / docs/RELEASE_NOTES_.md do not exist; follows the commit-body release-notes format of the v0.17.x/v0.18.0 pattern). Inlined in the Step 11 K1 commit body.

J1 - README.md:
- Added `[pdf.ocr]` to the toml table list in the Configuration section.
- New sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11-field toml example + 11 `KEBAB_PDF_OCR_` env overrides + force-reingest UX ("scanned PDFs indexed under v0.19 do not get OCR applied automatically after the v0.20 upgrade; `kebab ingest --force` is required").

J2 - HANDOFF.md:
- Phase status P7 row extended: 3/3 components + post-extract OCR enrichment (v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "Decisions discovered after merge" ("머지 후 발견된 결정") entry: design + scope of v0.20 sub-item 1 (H-1 post-extract pattern + DCTDecode-only v1 + parser_version preserved + H-4 UX).

J3 - docs/ARCHITECTURE.md:
- OCR row split: `OCR (image)` (gemma4:e4b, unchanged) + `OCR (PDF, v0.20.0+)` (qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply, DCTDecode-only v1, family asymmetry: PoC alnum 94.79% vs gemma4 27%).
- PDF parser row extended: page_image::extract_dctdecode_page_image (v0.20.0) + parser_version "pdf-text-v1" preserved + differentiated provenance events.

J3 - docs/SMOKE.md:
- Isolated `[pdf.ocr]` config example (enabled=true, model=qwen2.5vl:3b).
- New dogfood section `### v0.20 force-reingest (scanned PDF OCR)`: explicit `kebab ingest --force` invocation for the v0.19 → v0.20 upgrade path.

J4 - release notes draft (source for the Step 11 K1 commit body):
- Recorded in the result file (4 topic: opt-in + force-reingest + DCTDecode-only + family asymmetry + dogfood/test results).

spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: `1d4e301` (Step 9 + Cargo.lock follow-up)
contract: §9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:34:24 +00:00
altair823	c9e05941c5	feat(cli): activate per-page PDF OCR progress printer + test(app): ingest_progress emit verify + spec(pdf-ocr): align §4.6.1 literal with option_A (ms/chars)

Step 8 (Group H) of v0.20.0 sub-item 1 (scanned PDF OCR) plan + Step 7 reviewer concern fix (spec literal deviation).

H1 - kebab-cli/src/progress.rs printer activation:
- The old no-op stub `IngestEvent::PdfOcr* { .. } => {}` (Step 6 placeholder) is now a human-friendly stderr line printer.
- Wording taken verbatim from spec §4.6.1 lines 1085-1086:
  - PdfOcrStarted → ` 📷 OCR page {page}...`
  - PdfOcrFinished (skipped=false) → ` ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})`
  - PdfOcrFinished (skipped=true) → ` ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)` (uses the skipped field carried by M-4)
- Consistent with the `!quiet` gate (mirrors the AssetStarted/Finished pattern).

H2 - new test in crates/kebab-app/tests/ingest_progress.rs:
- pdf_ocr_progress_emits_started_finished_events (depends on real Ollama, `#[ignore]`).
- Verifies that ingesting the F1 fixture (scanned_page1.pdf) emits pdf_ocr_started + pdf_ocr_finished events. Invariant: Started count == Finished count.
- Manual invoke: `KEBAB_PDF_OCR_ENABLED=true cargo test -p kebab-app --test ingest_progress --ignored`.
- There is no path for injecting a mock OcrEngine (Step 6 builds the engine eagerly), so this follows the same ocr_e2e pattern as Step 9 I5 (real Ollama + `#[ignore]`).

Step 7 reviewer concern fix - spec §4.6.1 literal:
- The `ocr_ms` / `ocr_chars` literals on lines 1076-1077 are updated to the actual wire schema field names `ms` / `chars` (option_A, consistent with Rust serde).
- The printer wording on line 1087 likewise changes `{ocr_chars}` / `{ocr_ms}` → `{chars}` / `{ms}`.
- The rationale reference on line 1556 changes `pdf_ocr_finished.ocr_ms` → `.ms`.
- The `skipped` field is now stated explicitly (outcome of Step 6 reviewer M-4).

spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.6.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 8 H1+H2)
prior: `4c5ccd5` (Step 7 wire schema); fix for Step 7 reviewer concern 1
contract: §9 (additive minor wire bump, completed in the Step 7 commit)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 09:18:49 +00:00
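Editor's sketch for the `c9e0594` entry above: the three progress lines it quotes could be rendered roughly as follows. `PdfOcrEvent` and `render` are hypothetical stand-ins for illustration, not the actual `IngestEvent` type or the printer in kebab-cli/src/progress.rs, and the sketch omits any leading indentation; only the wording and the field names (`page`, `chars`, `ms`, `ocr_engine`, `skipped`) come from the commit message.

```rust
/// Hypothetical, simplified mirror of the two PDF OCR progress events.
/// The real `IngestEvent` enum has more variants and may differ in shape.
#[derive(Debug, Clone)]
enum PdfOcrEvent {
    Started { page: u32 },
    Finished { page: u32, chars: usize, ms: u64, ocr_engine: String, skipped: bool },
}

/// Render one event as a human-readable progress line.
/// Returns `None` when `quiet` is set, mirroring the `!quiet` gate.
fn render(event: &PdfOcrEvent, quiet: bool) -> Option<String> {
    if quiet {
        return None;
    }
    Some(match event {
        PdfOcrEvent::Started { page } => format!("📷 OCR page {page}..."),
        // A skipped page reports only the elapsed time.
        PdfOcrEvent::Finished { page, ms, skipped: true, .. } => {
            format!("⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)")
        }
        // A successful page reports character count, time, and engine name.
        PdfOcrEvent::Finished { page, chars, ms, ocr_engine, .. } => {
            format!("✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})")
        }
    })
}
```

Matching the `skipped: true` arm before the general `Finished` arm keeps the two outcomes from sharing a format string, which is one way to make sure a skipped page never prints a misleading character count.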
altair823	4c5ccd5447	feat(wire): additive minor - pdf_ocr_* in IngestEvent kind + pdf_ocr_pages/ms_total in ingest_report.items[] + skipped field carry (Step 6 M-4/M-2)

Step 7 (Group G) of v0.20.0 sub-item 1 (scanned PDF OCR) plan + fixes for Step 6 code reviewer Important M-4 (skipped field carry) and Minor M-2 (ordering invariant doc).

G3 - JSON Schema sync (additive minor, schema_version preserved):

ingest_progress.schema.json:
- 2 values added to the kind enum: pdf_ocr_started + pdf_ocr_finished.
- New fields: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- Descriptions of the existing ms / chars fields updated (now also carried by pdf_ocr_finished).

ingest_report.schema.json:
- items.items.properties newly defined (previously only the stub ["array", "null"]).
- pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- All existing IngestItem fields are now spelled out as well (kind, doc_path, byte_len, ...).

Step 6 reviewer M-4 (Important) - skipped field carry:
- Added skipped: bool to IngestEvent::PdfOcrFinished.
- The emit closure in ingest_one_pdf_asset (lib.rs:~1864) now propagates the source PdfOcrProgress::Finished { skipped } instead of discarding it.

Step 6 reviewer M-2 (Minor) - ordering invariant doc:
- Ordering text in crates/kebab-app/src/ingest_progress.rs updated: ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted < PdfOcrFinished)] < AssetFinished) < (Completed | Aborted).

No .md docs (docs/wire-schema/v1/*.md) exist, so the .md deliverable of plan §3 Step 7 G3 is retroactively N/A (0 such files).

spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: `b9ee09f` (Step 6 wiring) + Step 6 reviewer M-4/M-2 recommendations
contract: §9 (additive minor wire bump, schema_version preserved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 08:51:51 +00:00
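Editor's sketch for the `4c5ccd5` entry above: the ordering invariant it documents can be read as a small state machine. The checker below is illustrative only. `Kind`, `State`, and `follows_ordering` are hypothetical names, and the sketch assumes (the commit message does not say) that the asset group may repeat zero or more times and that the OCR pair may repeat once per page inside an asset.

```rust
/// Hypothetical event kinds, named after the variants in the ordering text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    ScanStarted,
    ScanCompleted,
    AssetStarted,
    PdfOcrStarted,
    PdfOcrFinished,
    AssetFinished,
    Completed,
    Aborted,
}

/// Internal checker state (an assumption of this sketch, not part of kebab).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Init,
    Scanning,
    BetweenAssets,
    InAsset,
    InOcr,
    Done,
}

/// Returns true when `events` respects
/// ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted < PdfOcrFinished)] < AssetFinished) < (Completed | Aborted).
fn follows_ordering(events: &[Kind]) -> bool {
    use Kind::*;
    let mut state = State::Init;
    for &kind in events {
        state = match (state, kind) {
            (State::Init, ScanStarted) => State::Scanning,
            (State::Scanning, ScanCompleted) => State::BetweenAssets,
            (State::BetweenAssets, AssetStarted) => State::InAsset,
            // Each OCR page is a Started/Finished pair nested inside an asset.
            (State::InAsset, PdfOcrStarted) => State::InOcr,
            (State::InOcr, PdfOcrFinished) => State::InAsset,
            (State::InAsset, AssetFinished) => State::BetweenAssets,
            (State::BetweenAssets, Completed | Aborted) => State::Done,
            // Any other transition violates the documented order.
            _ => return false,
        };
    }
    state == State::Done
}
```

Under these assumptions a stream with a dangling `PdfOcrStarted` is rejected, which matches the "Started count == Finished count" invariant that the Step 8 test checks.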
altair823	fb3952d54f	docs(pdf-ocr): correct F7 conversion engine record in PoC doc (gs, not ImageMagick)

The PoC doc text appended in `aeeff36` (engine-comparison.md L134, L141) recorded the conversion engine for F7 (`ccitt.pdf`) as "ImageMagick `convert -compress Group4`", but the actual tests/fixtures/_synth/flate_ccittfax.sh:77-83 uses the `gs -sDEVICE=pdfwrite -dMonoImageFilter=/CCITTFaxEncode -dEncodeMonoImages=true` flags (ImageMagick `convert` is invoked 0 times). The fixture binary (`/Filter [ /CCITTFaxDecode ]`, 2060 bytes) satisfies the invariant (verified by the Step 2 spec compliance + code quality reviews). This is a factual correction of the historical record only.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-02-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-02-spec-review-result.md
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-02-code-review-result.md (I1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 05:36:56 +00:00
altair823	aeeff3635b	poc+test(pdf-ocr): lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) for v0.20 sub-item 1

Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

B1 - lopdf /Filter probe (Python re + shell grep on synthesized fixtures, result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md). Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ]; useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode, so F6 FlateDecode requires manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*); ImageMagick `convert -compress Group4` used for F7 CCITTFax.

B2 - 5 fixtures synthesized and committed under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf: /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean).
- F2 scanned_page2.pdf: /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, batchim).
- F4 mojibake.pdf: DejaVu TTF + ToUnicode CMap stripped (count=0); Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf: /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf: /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).

Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py: F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py: F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh: F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.

spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 04:04:47 +00:00
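Editor's sketch for the `aeeff36` entry above: the probe findings suggest two cheap checks before handing a page image to OCR, one on the `/Filter` chain and one on the stream's leading bytes. The helpers below are hypothetical and are not the real `page_image::extract_dctdecode_page_image`. In particular, treating only a chain of exactly `[DCTDecode]` as eligible (so the reportlab default `[ASCII85Decode, DCTDecode]` is skipped) is an assumption about what "DCTDecode-only v1" means.

```rust
/// True when the stream starts with the JPEG start-of-image marker (FF D8).
/// The fixtures in the commit report the JFIF prefix ffd8ffe0, which also passes.
fn has_jpeg_magic(stream: &[u8]) -> bool {
    stream.starts_with(&[0xFF, 0xD8])
}

/// Assumption: a page image is eligible only when its filter chain is
/// exactly one DCTDecode entry. Filter names are given without the leading '/'.
fn is_dctdecode_only(filters: &[&str]) -> bool {
    filters.len() == 1 && filters[0] == "DCTDecode"
}

/// Combined gate: pages failing either check take the skip path.
fn is_ocr_candidate(filters: &[&str], stream: &[u8]) -> bool {
    is_dctdecode_only(filters) && has_jpeg_magic(stream)
}
```

With this gate, F1/F2-style streams pass, while F6 (`FlateDecode`) and F7 (`CCITTFaxDecode`) fall through to the skip path the fixtures are meant to exercise.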
altair823	9d7faab650	docs+chore(plan-bootstrap): apply spec L-1 cosmetic fix + capture cargo tree baselines for v0.20 sub-item 1 verifier gates

Step 1 (Group A) of v0.20.0 sub-item 1 (scanned PDF OCR) implementation plan.

A1 - spec §4.2 line 740 prose pseudo-code fix: `app.pdf_ocr_engine.as_ref()` → local `pdf_ocr_engine: Option<OllamaVisionOcr>` built in `ingest_with_config_opts` (consistent with the §4.4 eager init; no App field is introduced).

A2 - Cargo.toml dep invariant verified (the image crate is not introduced, preserving the H-3 DCTDecode-only v1 invariant; kebab-parse-pdf + kebab-parse-image are already deps of kebab-app). The description update is deferred to Step 3 (after the modules are added).

A3 - cargo tree baselines captured as the ground truth for K5 rows #9/#10 (.omc/state/pdf-ocr-{app-parse,parse-pdf}-deps.baseline.txt). They are the baseline for the verifier of the "0 dep graph changes" invariant in the other steps of this sub-item. Note: .omc/ is covered by .gitignore, so the baseline files exist only as local files.

spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (round 1c ACCEPT)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 03:40:49 +00:00
altair823	574e1b1ca1	docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill

Handoff document for the next session after the v0.19.0 release, plus an after-the-fact backfill.

- docs/superpowers/handoffs/2026-05-26-v0.20-image-pdf-normalize-handoff.md (540 lines, 9 sections)
  - Merge results of sub-items 1/2/3 + dogfooding baseline (1781 docs / 9050 chunks) + user memory + OMC workflow + build environment
  - Current implementation state (v0.19.0, image+pdf): exact file:line + struct/fn signatures + flow
  - 8 TODOs in detail (problem + scope + affected files + risk + trigger conditions)
  - Priorities + recommended sequencing + suggested first step for the new session
- docs/superpowers/specs/2026-05-26-extractor-dispatch-unification-spec.md (sub-item 3 spec)
- docs/superpowers/plans/2026-05-26-extractor-dispatch-unification-plan.md (sub-item 3 plan)

When PR #187 was merged, only the source code went in and the spec/plan were left out, so the reference links in that PR return 404 on main. This commit backfills them.

Assisted-by: Claude Code	2026-05-26 23:34:17 +00:00
altair823	710945c4b0	refactor(parse-md): absorb kebab-normalize + kebab-parse-types - 24 → 22 crates + §3.7b rewrite

In practice, only 1 of the 4 parsers (markdown) goes through the lift of the thin layer in design §3.7b (the ParsedBlock family). Fan-in and fan-out are both 1, so the layer has lost its purpose. Both kebab-normalize (1097 LOC) and kebab-parse-types (98 LOC) are absorbed into kebab-parse-md.

Design: docs/superpowers/specs/2026-05-26-normalize-absorption-spec.md
Plan: docs/superpowers/plans/2026-05-26-normalize-absorption-plan.md
HOTFIXES: 2026-05-26 entry in tasks/HOTFIXES.md (design deviation)

- 5 used types + 3 forward-declared structs → explicit pub re-exports from the kebab-parse-md::types module.
- build_canonical_document + derive_title + warning_agent → kebab-parse-md::normalize module.
- The 4 hard-coded agent literals (lib.rs:122/128/134/153), the warning_agent body return, and the tracing target literal are all preserved, for stage label consistency.
- kebab-app callsite (lib.rs:51 use + :1119 context string) updated; 2 deps (regular + dead) removed from Cargo.toml.
- [dev-dependencies] kebab-normalize removed from kebab-chunk + kebab-store-sqlite (replaced by kebab-parse-md). `use` paths in the integration test sources shifted accordingly.
- Test file moved (kebab-normalize/tests/normalize_snapshot.rs → kebab-parse-md/tests/).
- workspace Cargo.toml: Hunk (a) 2 members entries deleted + Hunk (b) version 0.18.0 → 0.19.0 (frozen contract change).
- design §3.7b rewritten in 4 paragraphs (original intent preserved + current state + preserved surface + future re-extraction trigger).
- design §8 graph updated (3 edges removed + meaning of 2 forbidden bullets updated + commentary).
- ARCHITECTURE.md crate graph + directory tree updated mechanically.
- tasks/INDEX.md L169 closure mention + new "Future work / deferred" section (image/pdf normalize integration entry).
- New tasks/HOTFIXES.md entry (4-block, design deviation Symptom).
- One-line cross-link in HANDOFF.md.
- The 3 dead structs (ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment) are preserved as the future surface for the v0.20+ image/pdf normalize integration (spec §11).

Wire / surface impact: none. CLI / TUI / MCP / --json output / config / XDG path / parser_version are all unchanged. The wire-invisible provenance.events[].agent and the tracing target literal "kb-normalize" are also preserved, so that audit logs stay consistent between old and new DB rows.

Verification:
cargo test --workspace --no-fail-fast -j 1 → 1313 passed / 0 failed (172 result blocks).
cargo clippy --workspace --all-targets -j 1 -- -D warnings → 0 warnings (5m 46s).
cargo metadata --no-deps --format-version 1 | jq '.workspace_members | length' = 22.
cargo tree -p kebab-app --depth 2 | grep -E "kebab_(parse_types|normalize)" = 0 lines.	2026-05-26 15:00:59 +00:00


197 Commits