feat(p10-3): Tier 3 paragraph + line-window fallback chunker: shell direct + Tier 1/2 0-chunk/Err automatically picked up #155

Merged
altair823 merged 10 commits from feat/p10-3-tier3-paragraph into main 2026-05-21 12:27:21 +00:00
Owner

Summary

  • code-text-paragraph-v1 chunker activated (design §9.3). Splits the file into paragraphs at blank (whitespace-only) lines; a paragraph longer than 80 lines is further split into 80-line windows with a 20-line overlap (stride 60). symbol = None; lang is preserved from the input.
  • shell direct routing: .sh / .bash / .zsh files go through the new "shell" arm of ingest_one_code_asset straight to the Tier 3 chunker. parser_version = "none-v1" (reuses the Tier 2 sentinel).
  • Tier 1/2 → Tier 3 fallback wrapper: if chunks is Ok(empty) or Err, retry with Tier 3. If extract returns Err (Tier 1 extractor failure), a synthesized Tier 3 fallback is used as well. This automatically picks up the non-k8s YAML / invalid YAML / Rust AST failure cases from p10-2. On retry, chunker_version → "code-text-paragraph-v1" and canonical.parser_version → "none-v1" are swapped so that downstream stamping and the try_skip_unchanged key stay consistent.
  • shell exempt from fallback chain: it is already Tier 3 direct, so this prevents a retry double-up.
  • tier2_shared::build_chunk_no_symbol helper added: the shared internals are factored out into a build_chunk_from_span helper, so it shares code with build_chunk. Chunk id/hash/token/policy_hash semantics are identical to Tier 1/2.
  • bug fix during impl: short paragraphs (≤80 lines) were built with split_key: None → paragraphs of the same doc shared one chunk_id → SQLite UNIQUE violation. split_key is now always Some(para.line_start), which guarantees distinct ids.
  • frozen design: one activation entry line in 2026-04-27 §10. (No mapping change in 2026-05-15 §3.5; shell has been mapped since 1A-1.)
  • docs: README (Tier 3 ingest row + --code-lang shell + Mermaid diagram) / HANDOFF (p10-3 ✅) / docs/ARCHITECTURE (chunker tree + flowchart) / docs/SMOKE (P10-3 walkthrough) / tasks/INDEX + tasks/p10/INDEX (✅).
  • version bump: 0.14.0 → 0.15.0 (additive minor).

Test plan

  • [x] cargo test --workspace --no-fail-fast -j 1 PASS (memory-conscious -j 1)
  • [x] cargo clippy --workspace --all-targets -- -D warnings clean
  • [x] kebab-chunk: 4 new unit tests: multi-paragraph split / 200-line line-window (1-80, 61-140, 121-200) / empty file yields 0 chunks / lang preservation
  • [x] kebab-app code_ingest_smoke: 2 new integration tests (tier3_shell_ingest_searchable + tier3_yaml_fallback_picks_up_non_k8s_yaml), 12 + 2 = 14 integration tests
  • [ ] post-merge dogfood: add .sh + docker-compose.yml + invalid yaml to a multi-root KB and check the --code-lang shell / --code-lang yaml search results
  • [ ] post-merge gitea-release v0.15.0

Branch

feat/p10-3-tier3-paragraph (head: 46f408d). 9 commits A-I. Spec: tasks/p10/p10-3-tier3-paragraph-fallback.md. Plan: docs/superpowers/plans/2026-05-21-p10-3-tier3-paragraph-fallback.md.

🤖 Generated with Claude Code

altair823 added 10 commits 2026-05-21 12:06:34 +00:00
Frozen contract for p10-3 single PR: code-text-paragraph-v1 chunker
(blank-line paragraph split + 80-line/20-overlap line-window for oversize),
shell direct routing, Tier 1/2 fallback wrapper (0-chunk or Err → Tier 3
retry with chunker_version + parser_version swap), tier2_shared::build_chunk
pub(crate) exposure, frozen design §10.1 + §10 deltas, version bump
0.14.0 → 0.15.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tasks: tier2_shared visibility upgrade / Tier 3 chunker + 4 unit tests /
shell direct routing / Tier 1/2 fallback wrapper / 2 smoke tests / frozen
design §10.1+§10 / docs sync (6 files) / workspace test gate / version
bump 0.14.0→0.15.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.

4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
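
The segmentation described in the commit above can be sketched as follows. This is a minimal illustration, not the code in kebab-chunk: it only computes 1-based inclusive line ranges, the names `paragraphs`, `windows` and `chunk_spans` are hypothetical, and the rule that windowing stops as soon as a window reaches the last line of the paragraph is an assumption inferred from the 200-line test case (1-80 / 61-140 / 121-200).

```rust
/// Inclusive, 1-based line range of a chunk (hypothetical type for this sketch).
type Span = (usize, usize);

const WINDOW: usize = 80;
const OVERLAP: usize = 20;
const STRIDE: usize = WINDOW - OVERLAP; // 60

/// Split text into paragraphs at whitespace-only lines.
/// Blank lines are boundaries and never belong to any span.
fn paragraphs(text: &str) -> Vec<Span> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    let mut last = 0;
    for (i, line) in text.lines().enumerate() {
        let n = i + 1;
        last = n;
        if line.trim().is_empty() {
            // Close the paragraph that ended on the previous line, if any.
            if let Some(s) = start.take() {
                out.push((s, n - 1));
            }
        } else if start.is_none() {
            start = Some(n);
        }
    }
    if let Some(s) = start {
        out.push((s, last));
    }
    out
}

/// Keep paragraphs of at most WINDOW lines whole; split longer ones into
/// overlapping windows. Assumption: stop once a window ends on the last line.
fn windows((first, last): Span) -> Vec<Span> {
    if last - first + 1 <= WINDOW {
        return vec![(first, last)];
    }
    let mut out = Vec::new();
    let mut start = first;
    loop {
        let end = (start + WINDOW - 1).min(last);
        out.push((start, end));
        if end == last {
            break;
        }
        start += STRIDE;
    }
    out
}

/// Paragraph split followed by line-window split of oversize paragraphs.
fn chunk_spans(text: &str) -> Vec<Span> {
    paragraphs(text).into_iter().flat_map(windows).collect()
}
```

With a single 200-line paragraph this yields the three ranges named in the unit test; an empty input yields no spans.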
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.

Tier 1/2 fallback wrapper lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
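
As a rough picture of the shell routing in the commit above, the sketch below maps a path to the version strings the PR names. It is an assumption-laden illustration: the `Route` struct and `route_shell` function are hypothetical, and the real extension-to-lang mapping lives elsewhere (design §3.5) rather than next to the `ingest_one_code_asset` match.

```rust
use std::path::Path;

/// Hypothetical bundle of the values the "shell" arm selects.
#[derive(Debug, PartialEq)]
struct Route {
    code_lang: &'static str,
    parser_version: &'static str,
    chunker_version: &'static str,
}

/// Returns the Tier 3 route for .sh / .bash / .zsh files, None otherwise.
fn route_shell(path: &Path) -> Option<Route> {
    let ext = path.extension()?.to_str()?;
    match ext {
        "sh" | "bash" | "zsh" => Some(Route {
            code_lang: "shell",
            // Tier 2 sentinel reused: no parser runs for shell files.
            parser_version: "none-v1",
            chunker_version: "code-text-paragraph-v1",
        }),
        _ => None,
    }
}
```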
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.

Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly (extract_fell_back guard).

shell direct path + Tier 2 extract synthesize_tier2_document failures
are exempted from the fallback chain (they ARE Tier 3 already, or the
error is real).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
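
The retry rule from the commit above, reduced to its decision logic. This is a sketch under assumptions: `Chunk`, `Outcome` and `with_tier3_fallback` are hypothetical names, errors are plain strings, and the Tier 3 chunker is passed in as a closure; the real wrapper sits inside `ingest_one_code_asset` and also handles the extract-failure path.

```rust
/// Hypothetical stand-ins for the real chunk and result types.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    text: String,
}

#[derive(Debug, PartialEq)]
struct Outcome {
    chunks: Vec<Chunk>,
    chunker_version: String,
    parser_version: String,
}

const TIER3_CHUNKER: &str = "code-text-paragraph-v1";
const NO_PARSER: &str = "none-v1";

/// Retry with Tier 3 when the primary chunker returned Ok(empty) or Err.
/// "shell" is exempt because it already went through Tier 3 directly.
fn with_tier3_fallback(
    code_lang: &str,
    primary: Result<Vec<Chunk>, String>,
    primary_chunker_version: &str,
    primary_parser_version: &str,
    tier3: impl FnOnce() -> Result<Vec<Chunk>, String>,
) -> Result<Outcome, String> {
    let primary_usable = matches!(&primary, Ok(chunks) if !chunks.is_empty());
    if code_lang != "shell" && !primary_usable {
        // Swap both version stamps so downstream stamping and the
        // skip-unchanged key describe the chunker that actually ran.
        return Ok(Outcome {
            chunks: tier3()?,
            chunker_version: TIER3_CHUNKER.to_string(),
            parser_version: NO_PARSER.to_string(),
        });
    }
    Ok(Outcome {
        chunks: primary?,
        chunker_version: primary_chunker_version.to_string(),
        parser_version: primary_parser_version.to_string(),
    })
}
```

Swapping both version strings together matters: if only the chunks changed, a later run could treat the document as unchanged under the Tier 1/2 versions even though it was stored with Tier 3 chunks.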
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
  Citation::Code { symbol: None, lang: "shell" }, chunker_version
  "code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
  (no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
  retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
  chunker_version "code-text-paragraph-v1".

Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.

Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
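
Why `split_key=None` collided, as a toy model. The real id derivation in `tier2_shared` is not shown in this PR, so the `chunk_id` function below is a hypothetical stand-in that simply concatenates its inputs; the only property it shares with the real one (by assumption) is that the id depends on the document, the chunker version and the split key, and on nothing else that differs between paragraphs.

```rust
use std::collections::HashSet;

/// Hypothetical id derivation: document + chunker version + optional split key.
fn chunk_id(doc_id: &str, chunker_version: &str, split_key: Option<usize>) -> String {
    match split_key {
        Some(k) => format!("{doc_id}/{chunker_version}/{k}"),
        None => format!("{doc_id}/{chunker_version}"),
    }
}

/// Count distinct ids produced for the paragraphs of one document.
fn distinct_ids(doc_id: &str, paragraph_starts: &[usize], use_line_start: bool) -> usize {
    paragraph_starts
        .iter()
        .map(|&start| {
            // Before the fix: None for every short paragraph.
            // After the fix: always Some(para.line_start).
            let split_key = use_line_start.then_some(start);
            chunk_id(doc_id, "code-text-paragraph-v1", split_key)
        })
        .collect::<HashSet<_>>()
        .len()
}
```

For four paragraphs starting at lines 1, 5, 9 and 14, the pre-fix variant produces a single id shared by all four, which is what tripped the UNIQUE constraint at put_chunks; keying on the start line produces four.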
Add p10-3 activation log entry for Tier 3 paragraph fallback chunker
(code-text-paragraph-v1) with shell direct routing and fallback wrapper
for invalid YAML / AST failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README adds Tier 3 to the ingest row (shell + fallback) and the Mermaid
  chunker enumeration; --code-lang shell admitted.
- HANDOFF flips p10-3 to ✅ (v0.15.0) and updates the one-line summary + next
  candidates.
- ARCHITECTURE adds Tier 3 to the code-parser row, extends the flowchart
  pcode node, and lists code_text_paragraph_v1.rs in the chunker tree.
- SMOKE adds a P10-3 walkthrough (shell + non-k8s YAML fallback) and a
  verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-3 to ✅.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor bump — additive new chunker_version "code-text-paragraph-v1" + new
routing lang "shell" + new Tier 1/2 → Tier 3 fallback wrapper behavior.
No DB migration, no wire schema major bump (Citation::Code.lang values
remain a free string field).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
altair823 merged commit 7a90df1485 into main 2026-05-21 12:27:21 +00:00
altair823 deleted branch feat/p10-3-tier3-paragraph 2026-05-21 12:27:22 +00:00
Reference: altair823-org/kebab#155