#4 (user request): promote Option β (emit additional sub-tokens) from spec §6.2,
moving it from the v0.21.x P9 follow-up into the v0.20.1 implementation.
Resolves the ko-dic compound-noun limitation found in dogfooding (the
single-token policy for `대한민국`, `한국정부`, `주민등록번호`, etc.).
Implementation (`crates/kebab-chunk/src/lib.rs::tokenize_korean_morphological`):
- New helper `is_hangul()`: checks for Hangul syllables (U+AC00..D7A3) and
jamo (U+1100..11FF, U+3130..318F).
- For each morpheme in the lindera output that is Hangul-only and has
length ≥ 3, additionally emit sliding-window 2-grams. The token list expands
to the form `[한국정부, 한국, 국정, 정부]`.
- English / numeric / mixed tokens get no supplement (avoids false positives).
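The expansion rule above can be sketched roughly as follows. This is a minimal illustration, not the code in `tokenize_korean_morphological`: `expand_with_bigrams` is a hypothetical name, the input is assumed to be morphemes already segmented by lindera, and length is assumed to be counted in characters.

```rust
/// True for Hangul syllables and the two jamo blocks listed in the commit.
fn is_hangul(c: char) -> bool {
    matches!(
        c,
        '\u{AC00}'..='\u{D7A3}'       // precomposed Hangul syllables
            | '\u{1100}'..='\u{11FF}' // Hangul Jamo
            | '\u{3130}'..='\u{318F}' // Hangul Compatibility Jamo
    )
}

/// Hypothetical helper: keeps each morpheme and, for Hangul-only morphemes of
/// 3+ characters, appends its sliding-window 2-grams right after it.
fn expand_with_bigrams(morphemes: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    for m in morphemes {
        out.push(m.to_string());
        let chars: Vec<char> = m.chars().collect();
        if chars.len() >= 3 && chars.iter().all(|&c| is_hangul(c)) {
            for pair in chars.windows(2) {
                out.push(pair.iter().collect());
            }
        }
    }
    out
}
```

Under these assumptions, `expand_with_bigrams(&["대한민국"])` yields the morpheme followed by `대한`, `한민`, `민국`, while an English token such as `Rust` passes through without any supplement.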
Tests (`crates/kebab-chunk/tests/tokenize_korean.rs`):
- `tokenize_korean_morphological_emits_2gram_for_long_morpheme`: confirms the
cases among the 5 probe fixtures where the supplement fires (measured:
`서울특별시` → `[서울, 특별시, 특별, 별시]`, `대한민국` → `[대한민국, 대한,
한민, 민국]`).
- `tokenize_korean_morphological_no_2gram_for_english`: guarantees that no
English substrings (`Rus`, `ust`, `imi`) are emitted for the Rust optimization
fixture.
Dogfood evidence (supplements the 2026-05-28 entry in `tasks/HOTFIXES.md`):
- Queries '대한', '한민', and '민국' all hit (the sliding window over 대한민국).
- Sub-token queries such as '특별', '주민', and '등록' hit.
- The English query 'tokenizer' returns 0 hits because the term is absent
from the corpus (no supplement).
- Trade-off: DB size +20-30% (Korean-heavy corpora), plus a small
false-positive risk.
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 (Option β promote)
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (post-implementation enhancement)
The S3 update to the Chunk struct (adding the tokenized_korean_text:
Option<String> field in kebab-core) changes the serde-serialized output of
every chunk snapshot JSON. Regenerate the baselines of 10 snapshot fixtures
(9 AST chunkers + markdown long-section) in their V009 form.
The change in each snapshot is an added `"tokenized_korean_text": null`
field per chunk JSON (most fixtures are English code, so lindera falls back
to None). No behavior change; only the serde representation cascades.
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up via S11 sanity)
Formatting cleanup for the V009 morphological tokenizer work (S3 chunk + S4
backfill + S5 short_query_hint removal + S7 new tests). No behavior change.
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11)
The S3 spec compliance reviewer (sonnet) found 2 blockers:
1. crates/kebab-store-sqlite/src/documents.rs: the get_chunk SELECT did not
query the tokenized_korean_text column, so the value stored in the DB was
lost on read. Added the column to the SELECT list and the matching row.get
index in the row → Chunk conversion. The change cascades through the
ChunkRow struct, chunk_row_from_sql, and the Chunk construction in get_chunk.
2. crates/kebab-chunk/src/code_*_ast_v1.rs (9 files): make_chunk hardcoded
tokenized_korean_text: None, so code files with Korean comments got no FTS
hits. They now call tokenize_korean_morphological(text), following the same
pattern as tier2_shared.
This commit is a rework of S3, made as a separate commit rather than an
amend (to preserve the S3 boundary). It satisfies the spec §6.2 invariant
("every chunker calls tokenize immediately before emitting a chunk").
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 rework)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Decompose text with lindera ko-dic into the morpheme sequence that goes into
V009's tokenized_korean_text column. The call happens right after chunk
creation in the chunk builder pipeline and pre-fills the field on the chunk
struct; the store's put_chunks then INSERTs it within a single transaction.
- crates/kebab-core/src/chunk.rs: add the tokenized_korean_text:
Option<String> field to the Chunk struct (#[serde(default)]).
- crates/kebab-chunk/src/lib.rs: tokenize_korean_morphological()
helper + OnceLock caching + fallback (None) policy.
- crates/kebab-chunk/Cargo.toml: add lindera features = ["embed-ko-dic"]
(required to enable DictionaryKind::KoDic).
- All chunkers (tier2_shared, md_heading_v1, pdf_page_v1, the 9
code AST v1 chunkers): pre-fill tokenized_korean_text in the Chunk literal.
- crates/kebab-store-sqlite/src/documents.rs::put_chunks: update the INSERT
SQL column list, placeholders, and bindings (12th column).
- crates/kebab-chunk/tests/tokenize_korean.rs: 2 unit tests.
lindera 3.0.7 API corrections: load_dictionary_from_kind →
load_embedded_dictionary, Token.text → Token.surface.
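The "OnceLock caching + fallback (None)" shape might look roughly like the sketch below. It deliberately avoids the lindera API: `StubTokenizer` is a hypothetical whitespace-splitting stand-in, `tokenize_or_none` is an illustrative name, and the exact conditions under which the real helper returns None are an assumption here (tokenizer unavailable, or no tokens produced).

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the lindera tokenizer; splits on whitespace only.
struct StubTokenizer;

impl StubTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

/// Builds the tokenizer at most once per process. A `None` inside the cell
/// would model a dictionary that failed to load (assumption).
fn tokenizer() -> Option<&'static StubTokenizer> {
    static TOKENIZER: OnceLock<Option<StubTokenizer>> = OnceLock::new();
    TOKENIZER.get_or_init(|| Some(StubTokenizer)).as_ref()
}

/// Returns the space-joined token sequence, or `None` as the fallback.
fn tokenize_or_none(text: &str) -> Option<String> {
    let tokens = tokenizer()?.tokenize(text);
    if tokens.is_empty() {
        None
    } else {
        Some(tokens.join(" "))
    }
}
```

Caching the `Option` itself means a failed initialization is also remembered, so every later chunk falls back to None cheaply instead of retrying the load.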
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
UNIQUE constraint failed: chunks.chunk_id
Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.
Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).
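To make the collision concrete, here is a toy model of the id derivation. It is only a sketch: `toy_chunk_id` is a hypothetical stand-in that hashes with the standard library's `DefaultHasher`, not the project's actual recipe, and the argument names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the id recipe: document + policy hash + split key.
fn toy_chunk_id(doc_id: &str, base_policy_hash: &str, split_key: Option<u32>) -> u64 {
    let mut hasher = DefaultHasher::new();
    doc_id.hash(&mut hasher);
    base_policy_hash.hash(&mut hasher);
    split_key.hash(&mut hasher);
    hasher.finish()
}

/// Ids for every resource in one file, keyed by each resource's first line.
fn resource_ids(doc_id: &str, policy: &str, line_starts: &[u32], keyed: bool) -> Vec<u64> {
    line_starts
        .iter()
        .map(|&line| toy_chunk_id(doc_id, policy, keyed.then_some(line)))
        .collect()
}
```

With `keyed = false` every resource in the file hashes the same inputs and gets the same id, which is the UNIQUE violation described above; with `keyed = true` the `line_start` value separates them.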
Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewer nit #3: the hand-built fixed_doc() only verified chunker 1:1
mapping. New tests invoke CppAstExtractor against tests/fixtures/sample.cpp
and snapshot the real extractor → chunker pipeline (14 blocks emitted
covering namespace::chunk::Class, ctor/dtor/operator/template/free-fn
convention, glue <top-level> blocks between units).
Adds kebab-parse-code as a dev-dep of kebab-chunk (same precedent as
kebab-parse-md). Both the existing hand-built test AND the new
extractor-driven tests are kept — the former for fast chunker-only
validation, the latter for end-to-end regression detection.
Added tests:
- code_cpp_ast_extractor_snapshot: asserts all 8 named symbol units are present
- code_cpp_ast_extractor_chunks_deterministic: chunker output is stable
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code-go-ast-v1's chunker pattern. Snapshot test against
tests/fixtures/sample.c (function + typedef struct + typedef enum +
preprocessor) verifies symbol list + lang=c stamping.
Chunks produced (4 total):
- <top-level> glue: includes, defines, static vars, typedefs (lines 1-18)
- parse_record function (lines 20-23)
- print_record function (lines 25-27)
- main function (lines 29-33)
All chunks stamped with lang=c and chunker_version=code-c-ast-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
Citation::Code { symbol: None, lang: "shell" }, chunker_version
"code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
(no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
chunker_version "code-text-paragraph-v1".
Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.
Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.
4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.
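The segmentation and window arithmetic described above can be sketched as two small functions. The names `paragraphs` and `line_windows` are illustrative, not the chunker's own, and line numbers are assumed to be 1-based and inclusive.

```rust
/// Returns (start, end) line ranges of paragraphs separated by
/// whitespace-only lines; blank lines never fall inside any range.
fn paragraphs(text: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    let mut last = 0;
    for (i, line) in text.lines().enumerate() {
        let n = i + 1;
        if line.trim().is_empty() {
            // A blank line closes the paragraph in progress, if any.
            if let Some(s) = start.take() {
                out.push((s, last));
            }
        } else {
            if start.is_none() {
                start = Some(n);
            }
            last = n;
        }
    }
    if let Some(s) = start {
        out.push((s, last));
    }
    out
}

/// Splits `first..=last` into windows of `window` lines advancing by `stride`.
fn line_windows(first: usize, last: usize, window: usize, stride: usize) -> Vec<(usize, usize)> {
    assert!(window > 0 && stride > 0 && first <= last);
    let mut out = Vec::new();
    let mut start = first;
    loop {
        let end = (start + window - 1).min(last);
        out.push((start, end));
        if end == last {
            break;
        }
        start += stride;
    }
    out
}
```

With a window of 80 and a stride of 60 (80 minus the 20-line overlap), a 200-line paragraph produces the ranges 1-80, 61-140, and 121-200 listed in the unit test description.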
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec p10-2 risks section calls out a "huge ConfigMap" ("거대 ConfigMap") but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).
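The contiguity assertion can be expressed as a small predicate over line ranges. The sketch below pairs it with a hypothetical non-overlapping splitter; `partition_lines` and its 200-line window are assumptions for illustration, since the window size used by `push_chunks_with_oversize` is not stated here.

```rust
/// Hypothetical non-overlapping split into windows of at most `max_lines`.
fn partition_lines(first: u32, last: u32, max_lines: u32) -> Vec<(u32, u32)> {
    assert!(max_lines > 0 && first <= last);
    let mut out = Vec::new();
    let mut start = first;
    while start <= last {
        let end = (start + max_lines - 1).min(last);
        out.push((start, end));
        start = end + 1;
    }
    out
}

/// True when each range starts exactly one line after the previous one ends.
fn is_contiguous_partition(ranges: &[(u32, u32)]) -> bool {
    ranges.windows(2).all(|w| w[0].1 + 1 == w[1].0)
}
```

Overlapping windows such as 1-80 followed by 61-140 fail this predicate, which is the difference between this split and the Tier 3 overlap windows.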
Reviewer nit #1 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reads entire Dockerfile / Dockerfile.* / *.dockerfile content and emits a
single Chunk with symbol "<dockerfile>", code_lang "dockerfile", line range
1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
tier2_shared::push_chunks_with_oversize.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits multi-document YAML by ^---\s*$, requires apiVersion + kind string
fields per document, emits 1 chunk per recognized k8s resource. Symbol =
<kind>/<namespace>/<name> or <kind>/<name> (cluster-scoped). Invalid YAML
returns 0 chunks (handled by p10-3 paragraph fallback). Oversize >200 lines
splits into line-windows sharing the same symbol.
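A rough sketch of the document split and symbol format follows. It does not parse YAML: `split_yaml_documents` only handles the `^---\s*$` separator, `resource_symbol` only formats the symbol, and both names are illustrative rather than taken from the chunker.

```rust
/// Splits text on separator lines: three dashes plus optional trailing whitespace.
/// Documents that contain only blank lines are dropped.
fn split_yaml_documents(text: &str) -> Vec<String> {
    let mut docs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in text.lines() {
        if line.trim_end() == "---" {
            if current.iter().any(|l| !l.trim().is_empty()) {
                docs.push(current.join("\n"));
            }
            current.clear();
        } else {
            current.push(line);
        }
    }
    if current.iter().any(|l| !l.trim().is_empty()) {
        docs.push(current.join("\n"));
    }
    docs
}

/// `<kind>/<namespace>/<name>`, or `<kind>/<name>` for cluster-scoped resources.
fn resource_symbol(kind: &str, namespace: Option<&str>, name: &str) -> String {
    match namespace {
        Some(ns) => format!("{kind}/{ns}/{name}"),
        None => format!("{kind}/{name}"),
    }
}
```

For example, a namespaced ConfigMap named `big` in `prod` gets the symbol `ConfigMap/prod/big`.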
tier2_shared module hosts the oversize fallback + Chunk-construction helper
mirroring code_rust_ast_v1's Chunk shape. Task E (dockerfile) and Task F
(manifest) will reuse it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code_go_ast_snapshot pattern. In-memory CanonicalDocument (no
kebab-parse-code dep — boundary §6.3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-java-ast-v1 with language-agnostic body unchanged. Cross-
chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-rust-ast-v1 / code-go-ast-v1 with language-agnostic body
unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-rust-ast-v1 / code-{python,ts,js}-ast-v1 with language-agnostic
body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code_rust_ast_snapshot pattern. In-memory CanonicalDocument build so
no kebab-parse-code dep (boundary §6.3 respected).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-rust-ast-v1 / code-python-ast-v1 with language-agnostic body unchanged.
Cross-chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-rust-ast-v1 with language-agnostic body unchanged. Cross-chunker
policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four optional, serde-skipped-when-None fields added to `Metadata` for
code ingest context. All 11 downstream construction sites patched with
`repo: None, git_branch: None, git_commit: None, code_lang: None`.
Full workspace check (`--tests`) and per-crate test suite pass clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add an overlap clamp at chunk entry (upper bound `target_bytes / 2`). This
blocks the risk that, under a pathological policy (`overlap_tokens >=
target_tokens`), a chunk would completely re-emit the previous chunk's text.
Matches the `seed_budget = overlap_tokens.min(target/2)` guard pattern in
md-heading-v1. Adds regression test
`overlap_clamped_when_overlap_exceeds_target`, which verifies that
`actual_start` strictly increases between adjacent chunks.
- `char_start as u32` / `char_end as u32` silent truncation →
`try_from::expect`, which panics explicitly on corrupted input.
- State two items in the module doc's `## Splitting policy`: the abbreviation
cases (`Mr.` / `i.e.` etc.) and the overlap clamp.
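Both guards are small enough to sketch directly. The function names below are hypothetical, and `BYTES_PER_TOKEN = 3` is taken from the pdf-page-v1 description elsewhere in this log; treat the rest as an illustration of the pattern rather than the chunker's code.

```rust
const BYTES_PER_TOKEN: usize = 3;

/// Overlap budget in bytes, capped at half the target so a chunk can never
/// consist entirely of text re-emitted from the previous chunk.
fn clamped_overlap_bytes(target_tokens: usize, overlap_tokens: usize) -> usize {
    let target_bytes = target_tokens * BYTES_PER_TOKEN;
    (overlap_tokens * BYTES_PER_TOKEN).min(target_bytes / 2)
}

/// Checked narrowing: panics with a message instead of silently truncating
/// the way `offset as u32` would.
fn offset_to_u32(offset: usize) -> u32 {
    u32::try_from(offset).expect("char offset exceeds u32::MAX")
}
```

With `target_tokens = 100`, a pathological `overlap_tokens = 500` is clamped to 150 bytes, while a normal `overlap_tokens = 20` keeps its full 60 bytes.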
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`PdfPageV1Chunker` takes the `CanonicalDocument` emitted by `kebab-parse-pdf`
(one block per page, all `SourceSpan::Page`) and produces `Chunk`s that never
cross a page boundary. `chunker_version = "pdf-page-v1"`.
Core behavior:
- If the page text fits within `target_tokens × BYTES_PER_TOKEN`
(`BYTES_PER_TOKEN` = 3), it becomes a single chunk. Otherwise, boundaries at
`\n\n` (paragraph) or sentence-end punctuation + whitespace delimit segments,
which are accumulated greedily, with at least one segment per chunk by default.
- Prepend the previous chunk's tail, `overlap_tokens × BYTES_PER_TOKEN` worth,
as the prefix of the next chunk (char-granular; never backtracks past the
start of the previous chunk).
- Empty / whitespace-only pages yield 0 chunks (already flagged at the
`kebab-parse-pdf` stage via the page's `Provenance::Warning`).
- Non-PDF docs (block is not Block::Paragraph, or SourceSpan is not Page) →
explicit error.
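The greedy accumulation step can be illustrated with a simplified packer. This sketch assumes the page has already been cut into segments and leaves out the overlap prefix entirely; `pack_segments` is a hypothetical name.

```rust
/// Greedily packs pre-split segments into chunks of at most `budget_bytes`.
/// A segment larger than the budget still becomes its own chunk, so every
/// chunk holds at least one segment.
fn pack_segments(segments: &[&str], budget_bytes: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for seg in segments {
        // Flush before adding a segment that would overflow a non-empty chunk.
        if !current.is_empty() && current.len() + seg.len() > budget_bytes {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(seg);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Because segments are concatenated in order and never reordered or dropped, joining the output chunks reproduces the input text.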
Spec deviation (HOTFIXES 2026-05-02 P7-2):
- `chunk_id` collision guard: when one page yields multiple chunks, their
`block_ids` are all identical, so the §4.2 recipe collides. Avoided by varying
the `policy_hash` input to `id_for_chunk` per chunk as
`format!("{base}#c{char_start}")`. The recipe itself is unchanged, and the
`Chunk.policy_hash` field keeps the base value.
- `BYTES_PER_TOKEN = 3` (matches the actual md-heading-v1 code). The spec text
says "/ 4", but that itself disagrees with the actual md-heading-v1 code, so
consistency was chosen. Locked in by a cross-chunker `policy_hash` identity
unit test.
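A toy model of the collision guard: only the `format!("{base}#c{char_start}")` expression comes from the commit. `toy_chunk_id` is a hypothetical stand-in for the §4.2 recipe that uses the standard library hasher, and its argument list is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Per-chunk variant of the policy hash, used only as id input.
fn salted_policy_hash(base: &str, char_start: u32) -> String {
    format!("{base}#c{char_start}")
}

/// Hypothetical stand-in for the id recipe: block ids + policy hash input.
fn toy_chunk_id(block_ids: &[&str], policy_hash_input: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    block_ids.hash(&mut hasher);
    policy_hash_input.hash(&mut hasher);
    hasher.finish()
}
```

Two chunks from the same page share `block_ids`, so feeding the unsalted base yields identical ids, while the salted inputs differ whenever `char_start` differs. The stored `policy_hash` field would still carry the base value.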
Tests (10 new):
- chunker_version label, 3-page small, 1-page huge + overlap + chunk_id
uniqueness, empty page skip, whitespace-only skip, non-PDF error,
no chunk ever crosses a page boundary, determinism (1000 runs), snapshot
shape stability, policy_hash identical to md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the unfinished stretch in which the P6-1/P6-2/P6-3 libraries
(`ImageExtractor`, `OllamaVisionOcr`, `apply_caption`) were not visible from
the CLI. `kebab ingest` now indexes image assets end-to-end in addition to
markdown, and `kebab search` / `kebab ask` match and cite images via OCR
text + caption.
## kebab-app
- Add `kebab-parse-image` to `[dependencies]`.
- On entry to `ingest_with_config`, build `OllamaVisionOcr` /
`OllamaLanguageModel` **once per ingest session** according to the
`image.ocr.enabled` / `image.caption.enabled` flags. They are shared across
the asset loop as trait objects. Thanks to the internal Arc in
reqwest::blocking::Client, allocation cost is independent of the number of
assets.
- Bundle the two adapters + ImageExtractor into an `ImagePipeline` struct to
keep the `ingest_one_asset` parameter list from exploding (addresses
clippy::too_many_arguments).
- Replace the markdown-only guard in `ingest_one_asset` with `match
media_type`: Markdown takes the existing path, Image(_) branches to the new
`ingest_one_image_asset`, and PDF/Audio/Other are skipped as before.
- New `ingest_one_image_asset`:
- read bytes → `ImageExtractor::extract` (on failure the caller does errors+=1)
- `apply_ocr` (Lenient: on failure, a ProvenanceKind::Warning event +
"ocr_failed: ..." in `IngestItem.warnings`; `block.ocr` stays None)
- `apply_caption` (same Lenient policy)
- call the existing `MdHeadingV1Chunker`; the chunker already emits
`Block::ImageRef` as a single chunk
- existing persist + embed sequence unchanged (byte-identical to markdown)
- `lang_hint_from_doc`: maps `Lang("und")` or an empty string to None
(done up front on the caller side so that the image-pipeline adapter's
build_prompt does not silently drop "und").
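The language-hint mapping amounts to a two-case match. The sketch below works on a plain `&str` rather than the project's `Lang` type, and `lang_hint` is a hypothetical name, so it shows the rule only.

```rust
/// Maps the "undetermined" tag or an empty string to `None`; anything else
/// is passed through as the hint.
fn lang_hint(lang: &str) -> Option<&str> {
    match lang {
        "" | "und" => None,
        other => Some(other),
    }
}
```

Doing this on the caller side means the adapter only ever sees a real language tag or no hint at all.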
## kebab-chunk
- Replace the `Block::ImageRef` branch of `render_block_text` with the P6-4 (β)
plain concat policy: join `[alt, ocr.joined, caption.text]` with `\n\n`,
dropping empty parts. If alt is empty, fall back to the basename of `src`
(a defensive guard for the P6-1 contract).
- New unit test `image_ref_p6_4_plain_concat_drops_empty_parts`: verifies all
five cases: alt-only / alt+ocr / alt+caption / alt+ocr+caption / empty alt →
src fallback.
- The existing `image_ref_emits_own_chunk_zero_tokens` passes unchanged; the
chunker's per-block dispatch is untouched and only text rendering is updated.
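The plain concat policy can be sketched as a free function. This is an approximation under stated assumptions: `render_image_ref` is a hypothetical name, the fields are passed as plain string slices rather than the real block type, whitespace-only parts are treated as empty, and the basename is taken by splitting on `/`.

```rust
/// Joins alt, OCR text, and caption with blank lines, dropping empty parts.
/// An empty alt falls back to the basename of `src`.
fn render_image_ref(
    alt: &str,
    src: &str,
    ocr_joined: Option<&str>,
    caption: Option<&str>,
) -> String {
    let alt = if alt.trim().is_empty() {
        // `rsplit` always yields at least one item, so this never panics.
        src.rsplit('/').next().unwrap_or(src)
    } else {
        alt
    };
    [Some(alt), ocr_joined, caption]
        .iter()
        .copied()
        .flatten()
        .filter(|part| !part.trim().is_empty())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

An asset with neither OCR nor caption therefore renders as just its alt text or filename, which lines up with the "chunk text=filename" case in the integration tests.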
## Integration tests (kebab-app/tests/image_pipeline.rs)
Ollama is stubbed with wiremock. 5 cases:
1. OCR-only happy path: 1 PNG + ocr.enabled → 1 doc + 1 chunk emitted;
`block.ocr.joined` is the mock's "Hello World 2026".
2. OCR + caption both enabled: both fields are filled and the chunk text
contains all three parts, alt + ocr + caption.
3. Lenient failure check: on OCR 503 the asset is still indexed (kind=New),
`errors=0`, ProvenanceKind::Warning attributed to "kb-app", and an
"ocr_failed:" note in `IngestItem.warnings`.
4. Both disabled: even with `image.ocr.enabled=false && image.caption.enabled=false`
the asset is indexed as 1 chunk (chunk text=filename), with EXIF +
dimensions still populated.
5. Determinism (re-ingest): ingesting the same PNG twice yields
`Updated` + the same `doc_id` the second time.
## SMOKE.md
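Add a `kebab search --mode lexical "Hello World"` step to the command
sequence. Add example `[image.ocr]` / `[image.caption]` config sections + an
ingest time estimate (~5-10 s per asset). Put the "books go to the P7 PDF
line" ("책은 P7 PDF 라인으로") guidance in both the verification checklist
and the "known behavior" ("알려진 동작") section.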
## Real Ollama integration check
Against 192.168.0.47 + gemma4:e4b:
```
$ kebab --config /tmp/kebab-smoke/config.toml ingest
scanned 2 new 2 updated 0 skipped 0 errors 0 (18395 ms)
$ kebab inspect doc <image_doc_id>
parser_version: image-meta-v1
blocks: [{
alt: "hello.png",
ocr: "Hello World 2026",
caption: "The image displays the text \"Hello World 2026\" in a large, black, sans-serif font."
}]
$ kebab --json ask "Hello World 텍스트가 어디에 있나?" --mode hybrid
grounded: true
citations: [{marker: "[1]", doc_path: "hello.png"}]
```
## Verification
- `cargo test --workspace --no-fail-fast -j 1`: all pass
- `cargo clippy --workspace --all-targets -- -D warnings`: pass
- `cargo test -p kebab-chunk image_ref`: 2 pass (P1-5 regression + P6-4
new unit)
- `cargo test -p kebab-app --test image_pipeline`: 5 pass
## Dependency boundaries
- `kebab-app` adds `kebab-parse-image`, as permitted by the spec's Allowed deps.
- No new forbidden-dependency violations (`kebab-tui` / `kebab-desktop` /
`kebab-eval` remain unreferenced).
- This task introduces 0 lines of image-specific business logic; all of it is
delegated to `kebab-parse-image`.
`tasks/p6/p6-4-image-ingest-wiring.md` status: planned → completed.
contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
sections: §3.4 ImageRefBlock, §6.1 ingest pipeline, §7.2
Extractor/Chunker traits, §9.1 image extraction policy.