From 43366b1b1542742dc3ddbcca50d887d8f273fa4a Mon Sep 17 00:00:00 2001
From: altair823
Date: Thu, 28 May 2026 08:18:04 +0000
Subject: [PATCH] docs(handoff): C Korean morphological tokenizer (Bug #8) new session handoff
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Self-contained context for the next priority, C (Korean morphological
tokenizer), after merging v0.20.0 sub-item 1 + bugfix 1~4 + ingest log
r1+r2. Covers the new session's first steps + workflow patterns +
environment/memory references + cascade risk + possible fix paths
(Option A jieba-rs / B bi-gram supplement / C query-side expansion).
Same spec/plan/executor cycle pattern.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 ...-korean-morphological-tokenizer-handoff.md | 199 ++++++++++++++++++
 1 file changed, 199 insertions(+)
 create mode 100644 docs/superpowers/handoffs/2026-05-28-v0.20.x-c-korean-morphological-tokenizer-handoff.md

diff --git a/docs/superpowers/handoffs/2026-05-28-v0.20.x-c-korean-morphological-tokenizer-handoff.md b/docs/superpowers/handoffs/2026-05-28-v0.20.x-c-korean-morphological-tokenizer-handoff.md
new file mode 100644
index 0000000..950af9a
--- /dev/null
+++ b/docs/superpowers/handoffs/2026-05-28-v0.20.x-c-korean-morphological-tokenizer-handoff.md
@@ -0,0 +1,199 @@
+---
+title: v0.20.x next session handoff - C Korean morphological tokenizer
+created: 2026-05-28
+source_session: 2026-05-27 ~ 2026-05-28 (v0.20.0 sub-item 1 + bugfix 1~4 + ingest log r1+r2)
+target_branch_candidate: feat/korean-morphological-tokenizer (tentative)
+priority: C (order under HANDOFF.md "priorities after the v0.20.0 sub-item 1 merge" is C → B → A → G)
+---
+
+# Next session handoff: C Korean morphological tokenizer (Bug #8 follow-up)
+
+Self-contained context for the next task that a new session will pick up. Order: **C first, then B → A → G**.
+
+---
+
+## 0. Repo state (at handoff time)
+
+- working directory: `/home/altair823/kebab`
+- branch: `main` (HEAD `70507e9`, the handoff doc update commit).
+- most recent merge: PR #190 (`7bbdc89a`, the logging round 2 merge with 4 enhancements).
+- env: `export CARGO_TARGET_DIR=/build/out/cargo-target/target`
+- fresh release binary: `/build/out/cargo-target/target/release/kebab` (built from main HEAD).
+- workspace tests: 1370+ passed (after adding logging r2); clippy `-D warnings` clean.
+
+Cumulative surface changes from the previous session (see the release notes scope in the G section of HANDOFF.md):
+- v0.20.0 sub-item 1 (PDF scanned OCR via qwen2.5vl:3b): PR #189.
+- Bug #2/#3/#4 (size cap / chunker_version / mojibake fixture).
+- Bug #6/#7 (Identity-H marker / CLI --media help).
+- Bug #9/#10/#11/#13/#14 (capabilities flag / config_not_found / OCR timeout / schema.models / empty query).
+- Bug #11 follow-up: OCR timeout 60s → 180s (to be narrowed gradually toward a sweet spot).
+- Ingest log feature round 1 (file ndjson + LoggingCfg + per-run rotation).
+- Ingest log feature round 2 (image_w/h capture + V008 SQLite mirror + CLI inspect ocr-stats/ocr-failures + retention): PR #190.
+
+Items classified as falsified or as design constraints (not fixed):
+- Bug #8 → **the subject of this C item**.
+- Bug #12 (Code block wire `.code` field, not `.text`).
+- ask: Korean phrasing-sensitive refusal (RAG corner case, NLI gate).
+
+---
+
+## 1. Deliverable scope of this task (C)
+
+### 1.1 Current state of Bug #8
+
+The `V007 trigram` FTS5 tokenizer (`migrations/V007__fts_trigram.sql`, 2026-05-23 v0.17.0 release) replaced SQLite FTS5's default `unicode61` tokenizer with trigram. Effects:
+- Korean substring search works for ≥3 chars ('서울특별시' hit, '검색' hit).
+- **Limitation: only queries of ≥3 chars are guaranteed**. A query such as '한국' (2 chars) has no trigram bucket, so it returns 0 hits.
+- In user dogfooding (round 3 / round 4), `'한국'` / `'서울'` / `'지하철'` repeatedly returned 0 hits: **the most visible problem in the user search experience**.
+
+### 1.2 Bug #8 design constraint vs follow-up
+
+Design rationale from `HOTFIXES 2026-05-22` and the body of the V007 migration:
+> trigram indexes 3-character grams, enabling Korean partial matches. ... DB size grows (~2-10×), English lexical also moves to substring match (recall↑, precision↓), BM25 score distribution shifts.
+
+In other words, adopting trigram was itself a choice to prioritize **substring matching**. The 2-char query limit, however, ends up being a major source of noise in the search experience of Korean users.
+
+### 1.3 Fix paths for this sub-item (to be decided by the spec drafter)
+
+**Option A: adopt a morphological tokenizer**:
+- jieba-rs (Rust; jieba is a Chinese word segmenter, so its Korean coverage must be verified), or a Korean morphological analyzer such as koma / komoran.
+- SQLite FTS5 external content + a custom tokenizer (via rusqlite's `create_module` or a C-style binding).
+- Alternatively, FTS5 `tokenize='unicode61'` plus a dual index over a separate tokenized table (the token list is pre-segmented and stored in a column).
+- Korean morpheme segmentation → even a 2-char query can match when it aligns with a morpheme boundary.
+
+**Option B: keep trigram + add a bi-gram supplement**:
+- For 2-char queries, a bi-gram fallback index in addition to trigram: a dual-table setup with a separate `tokenize='unicode61'` table alongside the SQLite FTS5 `tokenize='trigram'` table.
+- The query analyzer falls back to the bi-gram table when the query is 2 chars.
+
+**Option C: query-side expansion**:
+- On a 2-char query, show the user a hint ("enter at least 3 chars") and/or fall back to vector search.
+- Low scope, zero design changes. However, it is a workaround rather than an actual fix.
+
+Recommendation: the spec drafter prototypes with jieba-rs (or something similar) and then decides. Expect a corpus_revision bump + a V009 migration cascade.
+
+### 1.4 Cascade risk
+
+design §9 version cascade:
+- Changing the FTS5 tokenizer = a new V009 migration cascade.
+- Keep the existing V007 trigram index (for backward compatibility) vs DROP + replace with a new index: to be decided in the spec.
+- Zero changes to the search wire schema (the tokenizer is internal). User-visible benefit = hits for 2-char queries such as '한국'.
+
+---
+
+## 2. First steps for the new session (recommended)
+
+### 2.1 Command to start work
+
+When starting the new session:
+
+```
+@/home/altair823/kebab/docs/superpowers/handoffs/2026-05-28-v0.20.x-c-korean-morphological-tokenizer-handoff.md
+```
+
+Read the file above and proceed. Follow the **same cycle as PR #189 / #190** (spec → critic → drafter rewrite → closure → plan → plan closure → executor → dogfood → PR).
+
+### 2.2 branch + omc-teams
+
+Create a new branch:
+```bash
+git checkout -b feat/korean-morphological-tokenizer
+```
+
+`omc team` spawn pattern (per memory `feedback_omc_teams_usage`):
+- sequential single-team only (multi-team spawn fails).
+- after spawning, start a background polling shell (`until omc team status | grep -qE "phase=(completed|failed|done|stopped)"; do sleep 30; done`) with run_in_background=true.
+- spawn syntax: `omc team 1:claude:writer --no-decompose "spec drafter ..."`.
+- model routing: do not change it at spawn time (memory: "OMC teams usage: model routing unresolved").
+
+### 2.3 Key points when writing the spec drafter brief
+
+- spec deliverable path: `docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md`.
+- code probe:
+  - `migrations/V007__fts_trigram.sql` (current trigram migration).
+  - `tasks/HOTFIXES.md` 2026-05-22 + 2026-05-24 PR-A entries (trigram history).
+  - `crates/kebab-store-sqlite/src/fts.rs` (FTS5 rebuild helper).
+  - `crates/kebab-cli/src/main.rs::Search` (search args + query path such as `--match-string`).
+- review the jieba-rs crate's license + maintenance + tokenization quality (in a separate sub-section).
+
+---
+
+## 3. Workflow patterns (following the previous 4 rounds)
+
+1. **spec drafter** (omc-teams `writer` role) → 500-800 line spec.
+2. **closure critic round 1** (omc-teams `critic` role): design feasibility + parent spec invariant + wire schema cascade + new substantive finding hunt. verdict = ACCEPT / NEEDS_REWRITE.
+3. **drafter rewrite (r1c)** if the verdict is NEEDS_REWRITE.
+4. **closure critic round 2**: ACCEPT only with a mechanical traceability matrix + 0 substantive new findings.
+5. **plan drafter** (omc-teams `planner` role) → 500-700 line plan, step decomposition + AC verifier.
+6. **plan closure pass** (omc-teams `verifier` role): step → spec traceability + verifier actionability + where possible, resolve closure findings directly at the plan level.
+7. **executor** (omc-teams `executor` role): N sequential steps + N commits + workspace test verification.
+8. **dogfood retest** (run directly or in a separate background bash): verify actual user-visible behavior.
+9. **Force-push to the PR or create a new PR** (PR #189 = force-push pattern, PR #190 = new-PR pattern; decide based on sub-item scope).
+10. **User decides on the merge**.
+
+---
+
+## 4. Environment + memory references
+
+### 4.1 Build environment
+
+```bash
+export CARGO_TARGET_DIR=/build/out/cargo-target/target
+
+# build:
+cargo build --release -j 4 -p kebab-cli
+
+# workspace test:
+cargo test --workspace --no-fail-fast -j 1
+
+# clippy:
+cargo clippy --workspace --all-targets -j 4 -- -D warnings
+
+# per-crate test:
+cargo test -p kebab-store-sqlite -j 4
+cargo test -p kebab-cli -j 4
+```
+
+### 4.2 Dogfood corpus
+
+Existing KB:
+- the corpus under `/build/cache/tmp/v0.20-r5-dogfood/` or `v0.20-r4-dogfood/` can be reused (it was cleaned up after the R5 + R6 measurements, so it must be set up again).
+- corpus composition: markdown × 2 + code (Rust / Python / Dockerfile / k8s / shell) + PDF (mojibake / scanned / metro-korea).
+- Ollama remote: `http://192.168.0.47:11434` (qwen2.5vl:3b OCR, bge-m3 embed, gemma4:e4b LLM).
+
+### 4.3 Memory references
+
+Memory entries from this session (already in memory):
+- `feedback_pr_workflow.md`: every task = gitea-pr + review loop mode.
+- `feedback_omc_teams_usage.md`: procedure for using the omc team CLI (sequential single-team only).
+- `feedback_worker_completion_polling.md`: right after spawning a worker, start a background polling shell with run_in_background=true.
+- `feedback_teammate_spawn_mode.md`: agents are spawned via omc-teams into a separate tmux pane.
+- `feedback_serial_build_only.md`: no concurrent background cargo runs (serial only); `-j 4` by default.
+- `feedback_cargo_clean_policy.md`: do not routinely clean when using the 4TB HDD at `/build`.
+- `feedback_user_review_gates.md`: skip the user confirmation steps of brainstorming + writing-plans. Use AskUserQuestion only for key trade-off decisions.
+- `feedback_readme_sync_rule.md`: when the surface changes, the implementation PR updates README + HANDOFF + ARCHITECTURE together.
+- `feedback_no_caveman.md`: no caveman-style speech; write natural Korean prose.
+
+### 4.4 Release strategy
+
+After this C task: zero wire schema changes (the tokenizer is internal to the search query path). User-visible behavior does change, however (2-char queries can hit). **Add a HOTFIXES entry + cross-link the parent spec** (CLAUDE.md "Spec contract" rule). After C is complete, either cut a separate release or ship it together with the G (v0.20.1) patch release.
+
+---
+
+## 5. Preview of possible next-next tasks
+
+After C is complete:
+- **B: OCR dense page coverage** (resolve the metro-korea page 8/13 timeout). Paths: dynamic per-page max_pixels / column-level OCR / model upgrade. Details are in HANDOFF.md.
+- **A: deferred v0.20 sub-items** (sub-item 2 multi-region image / sub-item 3 PDF normalize / TODO #4 figure-table / TODO #5 Enricher trait).
+- **G: v0.20.1 patch release** (release notes consolidating the round 1+2 surface).
+
+---
+
+## 6. Path + commit of this handoff doc
+
+This file: `docs/superpowers/handoffs/2026-05-28-v0.20.x-c-korean-morphological-tokenizer-handoff.md`.
+
+First command in the new session:
+```
+@/home/altair823/kebab/docs/superpowers/handoffs/2026-05-28-v0.20.x-c-korean-morphological-tokenizer-handoff.md
+```
+Or simply paste the contents of the file above.
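The 2-char limit from section 1.1 and the Option B fallback routing from section 1.3 can be illustrated with a minimal Rust sketch. It models tokenization as overlapping character n-grams, which is a simplification and not SQLite's actual FTS5 implementation. The names `char_ngrams`, `IndexChoice`, and `choose_index` are hypothetical and do not come from the kebab codebase.

```rust
/// Split `text` into overlapping n-grams of `n` Unicode scalar values.
/// Returns an empty vector when `n == 0` or the text has fewer than `n` chars.
/// (Hypothetical helper: a simplified model, not the FTS5 tokenizer itself.)
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Which index a query would be routed to under an Option B style design.
/// (Hypothetical type, assumed for illustration only.)
#[derive(Debug, PartialEq)]
enum IndexChoice {
    Trigram,
    BigramFallback,
    Unsearchable,
}

/// Route by character count: 3+ chars use trigram, exactly 2 chars fall back
/// to the bi-gram index, and anything shorter has no usable gram.
fn choose_index(query: &str) -> IndexChoice {
    match query.trim().chars().count() {
        0 | 1 => IndexChoice::Unsearchable,
        2 => IndexChoice::BigramFallback,
        _ => IndexChoice::Trigram,
    }
}

fn main() {
    for query in ["한국", "서울특별시"] {
        println!(
            "{query}: trigrams={:?} bigrams={:?} route={:?}",
            char_ngrams(query, 3),
            char_ngrams(query, 2),
            choose_index(query)
        );
    }
}
```

Under this model '한국' produces no trigram but exactly one bi-gram, which is why a supplementary bi-gram index (Option B) or a morpheme-level tokenizer (Option A) would be needed for such a query to hit.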