# p10-3: Tier 3 paragraph + line-window fallback chunker

**Status:** 🟡 In progress

**Contract sections:** §3.3 (chunker_version `code-text-paragraph-v1`), §3.5 (code_lang routing: activate `shell` and clarify "unsupported / Tier 3 fallback"), §6.2 (`kebab-chunk/src/code_text_paragraph_v1.rs`), §6.3 (expose `tier2_shared::build_chunk` as `pub(crate)`), §9.3 (Tier 3 definition), §10.1 (one-line deactivation log entry).

**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1.3 (Phase 3) + §9.3.

**Plan:** [2026-05-20-p10-3-tier3-paragraph-fallback.md](../../docs/superpowers/plans/2026-05-20-p10-3-tier3-paragraph-fallback.md).

## Goal

Activate the Tier 3 paragraph fallback chunker on top of the p10-1A-2 / 1B / 1C / 1A-1 framework and the p10-2 Tier 2 infrastructure. Single PR. From the merge onward:

- `.sh` / `.bash` / `.zsh` files are indexed paragraph by paragraph.
- 0-chunk results from p10-2 (non-k8s YAML, invalid YAML, Tier 1 AST extractor failures, and so on) automatically fall back to Tier 3 and get indexed, so files that were previously skipped become searchable.

## Frozen design decisions (finalized by this task)

### chunker (`code-text-paragraph-v1`)

- **Input**: `Document` with a single `Block::Code { text, lang, ... }`. Same shape as the output of Tier 2's `synthesize_tier2_document`, so the fallback wrapper reuses the same doc.
- **VERSION_LABEL**: `"code-text-paragraph-v1"`.
- **Paragraph splitting**: iterate over `text.lines()`. Blank lines (exactly empty or whitespace-only) act as paragraph boundaries. A blank line itself belongs to no paragraph (it is excluded from every chunk's line range). Empty paragraphs (all whitespace) are skipped.
- **Paragraph size rule** (design §9.3 defaults as-is, hardcoded):
  - paragraph line count ≤ 80 → emit 1 chunk.
  - paragraph line count > 80 → line-window split with window size 80 / overlap 20 (stride 60), i.e. lines 1-80, 61-140, 121-200, … The last window runs to EOF (≤ 80 lines).
  - `FALLBACK_LINES_PER_CHUNK = 80` and `FALLBACK_LINES_OVERLAP = 20` are both hardcoded constants (same pattern as 1A-2's `AST_CHUNK_MAX_LINES = 200`: not exposed in user config; exposure to be reconsidered in a future HOTFIXES pass).
- **Citation**: `SourceSpan::Code { line_start, line_end, symbol: None, lang: }`. `symbol = None` uniformly (Tier 3 does not identify semantic units). `lang` is preserved as-is from the input Document's `Block::Code.lang`: shell → `"shell"`, k8s skip → `"yaml"`, Rust extractor failure → `"rust"`, etc.
- **chunk_id collision avoidance**: when a single paragraph is line-window split, pass `window_start` as the `split_key` of `id_for_chunk` (same as the Tier 2 `#L{k}` pattern).
- **Edge cases**:
  - File consists only of blank lines → emit 0 chunks (no fallback for the fallback). `tracing::warn!`.
  - Single paragraph with ≤ 80 lines → 1 chunk, line range 1..N.
  - Huge file with no blank lines (the whole file is one paragraph) → line-window split.

### Routing / fallback wrapper

- **`code_lang_for_path`**: no change (the shell mapping has existed since 1A-1).
- **`ingest_one_code_asset` allowlist** (`crates/kebab-app/src/lib.rs:953`): add `"shell"`.
- **4-arm match (parser_version / chunker_version / extract / chunks)**: add a `"shell"` arm:
  - parser_version = `"none-v1"` (reuses the Tier 2 sentinel).
  - chunker_version = `CodeTextParagraphV1Chunker.chunker_version()`.
  - extract = `synthesize_tier2_document(asset, &bytes, "shell", &parser_version)?` (reused).
  - chunks = `CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)?`.
- **Fallback wrapper** (the core new logic): post-processing right after the chunks match:
  - If the Tier 1/2 result for a lang is `Err(_)` or `Ok(empty_vec)`, retry with Tier 3.
  - On retry:
    - Swap `chunker_version` to `code-text-paragraph-v1` (so downstream stamping is accurate).
    - Also swap `canonical.parser_version` to `"none-v1"` (Tier 1's `RUST_PARSER_VERSION` and the like would be misleading).
    - Run `CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)`.
    - Log the failure reason with `tracing::warn!("tier1/2 emitted 0 chunks or errored for {workspace_path} ({code_lang}); falling back to tier 3")`.
  - **If Tier 3 itself yields 0 chunks or Err**, fail/skip as-is (no fallback for the fallback).

### Exposing `tier2_shared::build_chunk`

- Currently a module-private `fn build_chunk`. Tier 3 calls it so that Chunks are built identically (consistent hash / token / policy_hash), so only the visibility changes, to `pub(crate) fn build_chunk(...)`. The signature stays the same.

### Lang preservation policy

- A Tier 3 chunk's `Citation::Code.lang` equals the input Document's `Block::Code.lang`, unchanged. Spelled out as a table:

| Source | input lang | Tier 3 output lang |
|--------|-----------|----------|
| shell direct | `"shell"` | `"shell"` |
| k8s 0-chunk fallback | `"yaml"` | `"yaml"` |
| Rust AST failure fallback | `"rust"` | `"rust"` |
| manifest 0-chunk (theoretical; almost never happens) | `"toml"` etc. | preserved |

- At search time, `--code-lang shell` / `--code-lang yaml` etc. also match fallback chunks, so the search filter behaves naturally.

### Non-scope

- **Wiring unsupported extensions**: `.txt` / `.log` / `.scala` / `.rb` etc. are outside this PR's scope. The `code_lang_for_path` mapping is unchanged. The Tier 3 chunker itself is built now and is picked up automatically when a new lang is later added to `code_lang_for_path` (1A-2 pattern).
- **config exposure**: `FALLBACK_LINES_PER_CHUNK` / `FALLBACK_LINES_OVERLAP` are hardcoded. Not exposed in config.toml.

### Frozen design updates

- `docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md` §10.1: one activation log line.
- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` §10: one activation log line.
- The §3.5 wording "unsupported / Tier 3 fallback → null" stays as-is (it is the accurate meaning for this phase: Tier 3 chunks preserve the input lang, so "null" applies only once unsupported extensions are wired).

## Acceptance criteria

- `cargo test --workspace --no-fail-fast -j 1` PASS (memory-conscious `-j 1`).
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- 4 new unit tests in `crates/kebab-chunk/tests/code_text_paragraph_v1.rs`:
  - `shell_multi_paragraph_splits_on_blank_lines`: 3-paragraph fixture → 3 chunks, symbol=None, lang=shell, contiguous (exclusive of blank lines).
  - `single_long_paragraph_line_window_split`: 200+ line single paragraph → window split, distinct chunk_ids, expected line ranges (1-80, 61-140, 121-200, …).
  - `empty_file_emits_zero_chunks`: empty text → `Ok(vec![])`.
  - `lang_field_preserved_from_input_doc`: lang=yaml input → emitted chunk has lang=yaml.
- 2 new integration tests in `crates/kebab-app/tests/code_ingest_smoke.rs`:
  - `tier3_shell_ingest_searchable`: ingest a `.sh` file → search with `--code-lang shell` → `Citation::Code { symbol: None, lang: "shell" }`, `chunker_version: "code-text-paragraph-v1"`.
  - `tier3_yaml_fallback_picks_up_non_k8s_yaml`: ingest a yaml file without apiVersion+kind → fallback fires → `Citation::Code { symbol: None, lang: "yaml" }`, chunker_version `code-text-paragraph-v1`.
- Existing 12 smoke tests + 2 new = 14-test surface (Tier 1: 9, Tier 2: 3, Tier 3: 2).
- `kebab schema --json | jq .stats.code_lang_breakdown` shows a `"shell"` count (after ingesting a .sh file). Non-k8s YAML also accumulates into the `"yaml"` count (Tier 2 and Tier 3 share the same lang).
- Update README + HANDOFF + docs/ARCHITECTURE + docs/SMOKE + tasks/INDEX + tasks/p10/INDEX.
- One line each in frozen design §10.1 and the §10 activation log.
- workspace `Cargo.toml` minor bump (0.14.0 → 0.15.0), gitea-release v0.15.0.

## Allowed dependencies

- New `kebab-chunk` module `code_text_paragraph_v1.rs`: kebab-core + anyhow + tracing. Calls `build_chunk` from tier2_shared (exposed with `pub(crate)` visibility). Does not use tree-sitter / serde_yaml.
- `kebab-app::ingest_one_code_asset`: extends the 4-arm match, the allowlist, and the fallback wrapper. No new crate deps.
- `kebab-parse-code`: no change (the shell mapping in lang.rs has existed since 1A-1).
- `kebab-source-fs`: no change (media.rs already delegates to `code_lang_for_path`).
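
The paragraph and line-window rules that the unit tests above exercise can be sketched with nothing beyond `std`. This is a minimal illustration, not the actual `code_text_paragraph_v1.rs`: the helper names `paragraph_ranges`, `line_windows`, and `chunk_line_ranges` are hypothetical, and the sketch only computes 1-based inclusive line ranges, whereas the real chunker would presumably hand each range to `tier2_shared::build_chunk`.

```rust
// Constants as named in the spec (design §9.3 defaults).
const FALLBACK_LINES_PER_CHUNK: usize = 80;
const FALLBACK_LINES_OVERLAP: usize = 20;

/// Hypothetical helper: 1-based inclusive (start, end) ranges of paragraphs.
/// A line that is empty or whitespace-only is a boundary and belongs to no range.
fn paragraph_ranges(text: &str) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start: Option<usize> = None;
    let mut last = 0;
    for (idx, line) in text.lines().enumerate() {
        let line_no = idx + 1;
        last = line_no;
        if line.trim().is_empty() {
            // Close the paragraph that ended on the previous line, if any.
            if let Some(s) = start.take() {
                ranges.push((s, line_no - 1));
            }
        } else if start.is_none() {
            start = Some(line_no);
        }
    }
    // A paragraph still open at the end of the text closes on the last line.
    if let Some(s) = start {
        ranges.push((s, last));
    }
    ranges
}

/// Hypothetical helper: split one paragraph range into windows of at most
/// 80 lines with a 20-line overlap (stride 60). Requires start <= end.
fn line_windows(start: usize, end: usize) -> Vec<(usize, usize)> {
    if end - start + 1 <= FALLBACK_LINES_PER_CHUNK {
        return vec![(start, end)];
    }
    let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
    let mut windows = Vec::new();
    let mut window_start = start;
    loop {
        let window_end = (window_start + FALLBACK_LINES_PER_CHUNK - 1).min(end);
        windows.push((window_start, window_end));
        if window_end == end {
            break;
        }
        window_start += stride;
    }
    windows
}

/// Hypothetical helper: all chunk line ranges for a text, in document order.
fn chunk_line_ranges(text: &str) -> Vec<(usize, usize)> {
    paragraph_ranges(text)
        .into_iter()
        .flat_map(|(start, end)| line_windows(start, end))
        .collect()
}
```

In this sketch each window's first line is unique within a paragraph, which is why passing `window_start` as the `split_key` is enough to keep chunk ids distinct. A text made only of blank lines yields an empty vector, matching the 0-chunk edge case.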
## Forbidden dependencies

- `kebab-chunk` must not directly import store / embed / llm / rag / tree-sitter (keeps boundary §6.3).
- UI crates (`kebab-cli` / `kebab-mcp` / `kebab-tui` / `kebab-desktop`) must not directly import `kebab-parse-code` / `kebab-chunk`; they go through the `kebab-app` facade only.

## Risks / notes

- **Preventing a fallback infinite loop**: if Tier 3 itself yields 0 chunks or Err, fail/skip as-is; there is no fallback for the fallback. Explicitly specified.
- **`try_skip_unchanged` consistency when chunker_version is swapped**: after the fallback fires, the stored chunker_version is `code-text-paragraph-v1`. On the next ingest of the same file, the lookup matches on the same chunker_version (skip behaves correctly). If a Tier 1 chunker starts working later (e.g. a tree-sitter grammar fix), the cascade rule causes an incremental cache miss → automatic reprocessing works as expected.
- **Lang preservation vs fallback semantics**: fallback chunks keep the original lang, so the search filter `--code-lang yaml` matches both Tier 2 and Tier 3 chunks. This is intended: when a user searches "yaml files", all yaml results are shown.
- **Meaning of line-window overlap**: 80/20 (stride 60) is the design §9.3 default. The same algorithm applies to huge paragraphs (e.g. single-line minified JSON), but since the threshold counts lines (80), a single physical line never triggers a split. For minified input, one chunk ends up holding a very long text; this is an inherent limitation of the paragraph splitting policy. To be revisited in a future HOTFIXES pass.
- **Blank line handling**: lines matching `^\s*$` (whitespace-only) are paragraph boundaries. Edge cases such as tab-only lines and CR-only lines are verified with fixtures.
- **Shell line-comment handling**: a `# comment` line in a shell script is an ordinary line. It does not affect paragraph splitting (it is not a blank line) and is preserved verbatim inside the chunk.
- **`canonical.parser_version` mutation in the fallback wrapper**: the Document's parser_version is swapped to `"none-v1"` on Tier 3 fallback. The CanonicalDocument must be bound as `mut`. It is already `let mut canonical = match ...`, so mutation is possible. To be verified at the plan stage.
- **Post-merge deviations** go into the dated log in `tasks/HOTFIXES.md`, cross-linked from this spec's `Risks / notes`.
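
The fallback wrapper's control flow, including the "no fallback for the fallback" rule and the two version swaps noted above, might look roughly like the following. This is a sketch under stated assumptions rather than the real `ingest_one_code_asset` code: `Chunk` and `CanonicalDocument` are stand-in types, `chunk_with_tier3_fallback` is a hypothetical name, errors are plain `String` instead of `anyhow`, and `eprintln!` stands in for `tracing::warn!` so the block has no external dependencies.

```rust
/// Stand-in for the real chunk type (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    text: String,
}

/// Stand-in for the canonical document (hypothetical, for illustration only).
struct CanonicalDocument {
    parser_version: String,
    text: String,
}

// Version labels as given in the spec.
const TIER3_CHUNKER_VERSION: &str = "code-text-paragraph-v1";
const NONE_PARSER_VERSION: &str = "none-v1";

/// Runs the primary (Tier 1/2) chunker; on `Err(_)` or an empty result,
/// swaps both version stamps and retries once with the Tier 3 chunker.
/// Whatever Tier 3 returns is passed through unchanged, so an empty or
/// failed Tier 3 result can never trigger a second retry.
fn chunk_with_tier3_fallback<P, F>(
    canonical: &mut CanonicalDocument,
    chunker_version: &mut String,
    primary: P,
    tier3: F,
) -> Result<Vec<Chunk>, String>
where
    P: FnOnce(&CanonicalDocument) -> Result<Vec<Chunk>, String>,
    F: FnOnce(&CanonicalDocument) -> Result<Vec<Chunk>, String>,
{
    match primary(&*canonical) {
        Ok(chunks) if !chunks.is_empty() => Ok(chunks),
        _ => {
            // The real wrapper would log through tracing::warn! here.
            eprintln!("tier1/2 emitted 0 chunks or errored; falling back to tier 3");
            // Stamp what actually produced the chunks.
            *chunker_version = TIER3_CHUNKER_VERSION.to_string();
            canonical.parser_version = NONE_PARSER_VERSION.to_string();
            tier3(&*canonical)
        }
    }
}
```

Taking the document by `&mut` mirrors the note that `canonical` must be a mutable binding for the `parser_version` swap. Both swaps happen before Tier 3 runs, so any stamping derived from them reflects the chunker that was actually used.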