Tasks: tier2_shared visibility upgrade / Tier 3 chunker + 4 unit tests / shell direct routing / Tier 1/2 fallback wrapper / 2 smoke tests / frozen design §10.1+§10 / docs sync (6 files) / workspace test gate / version bump 0.14.0→0.15.0 + gitea PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
51 KiB
p10-3 Tier 3 Paragraph Fallback Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Activate the code-text-paragraph-v1 chunker — paragraph + line-window fallback for shell scripts and for Tier 1/2 0-chunk / Err results (non-k8s YAML, invalid YAML, AST extractor failures).
Architecture: Single new chunker module crates/kebab-chunk/src/code_text_paragraph_v1.rs using blank-line paragraph segmentation and 80-line / 20-overlap line-window split for oversize paragraphs. tier2_shared::build_chunk is exposed as pub(crate) so Tier 3 shares the same Chunk-construction semantics as Tier 1/2. ingest_one_code_asset gains a "shell" arm in its 4-arm match plus a post-match fallback wrapper that retries Tier 1/2 results in Ok(empty) / Err(_) shape against Tier 3, swapping chunker_version + parser_version for downstream stamping.
Tech Stack: Rust 2024 workspace. No new external deps (string operations only). Reuses tier2_shared::build_chunk (Task D of p10-2 / commit 8996e73).
Memory note: Host has been OOM'd previously. Per-crate cargo only. ONE full-suite + clippy gate at Task I. NO cargo test --workspace outside that gate.
Pre-flight
Branch feat/p10-3-tier3-paragraph already exists (spec commit 9d4a60a).
- Disk hygiene:
df -h /점검. 80% 넘으면cargo clean.
Reference files (read on-demand per task):
tasks/p10/p10-3-tier3-paragraph-fallback.md— frozen contract.crates/kebab-chunk/src/tier2_shared.rs—build_chunksource; the visibility upgrade lives here.crates/kebab-chunk/src/k8s_manifest_resource_v1.rs— closest Tier 2 chunker template (usestier2_shared::push_chunks_with_oversize).crates/kebab-chunk/tests/k8s_manifest_resource_v1.rs—yaml_dochelper pattern +policy()helper; Tier 3 tests mirror this shape.crates/kebab-app/src/lib.rslines 950-970 (allowlist) + 1794-2040 (ingest_one_code_asset).crates/kebab-app/tests/code_ingest_smoke.rs— 12 existing tests; Tier 3 tests mirror theTestEnv::lexical_only()pattern.
Task A: expose tier2_shared::build_chunk as pub(crate)
Files:
-
Modify:
crates/kebab-chunk/src/tier2_shared.rs -
Step 1: Read
crates/kebab-chunk/src/tier2_shared.rsto confirmbuild_chunk's current visibility (likely module-privatefn). -
Step 2: Change
fn build_chunk(...)topub(crate) fn build_chunk(...). Signature unchanged:
pub(crate) fn build_chunk(
doc: &Document,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
symbol: &str,
lang: &str,
chunker_version: &str,
) -> Result<Chunk> {
// body unchanged
}
- Step 3: Per-crate build sanity:
cargo build -p kebab-chunk 2>&1 | tail -3
Expected: clean.
- Step 4: Commit:
git add crates/kebab-chunk/src/tier2_shared.rs
git commit -m "$(cat <<'EOF'
refactor(p10-3): expose tier2_shared::build_chunk as pub(crate)
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
Task B: code-text-paragraph-v1 chunker (TDD)
Files:
- Create:
crates/kebab-chunk/src/code_text_paragraph_v1.rs - Create:
crates/kebab-chunk/tests/fixtures/sample_shell.sh - Create:
crates/kebab-chunk/tests/fixtures/sample_long_paragraph.txt - Create:
crates/kebab-chunk/tests/code_text_paragraph_v1.rs - Modify:
crates/kebab-chunk/src/lib.rs(pub mod + pub use)
B.1 — fixtures
- Step 1: Create
crates/kebab-chunk/tests/fixtures/sample_shell.sh(3-paragraph shell, each < 80 lines):
#!/usr/bin/env bash
set -euo pipefail
# First paragraph: env setup
export KEBAB_HOME="${KEBAB_HOME:-$HOME/.local/share/kebab}"
mkdir -p "$KEBAB_HOME"
cd "$KEBAB_HOME"
# Second paragraph: ingest
echo "ingesting workspace..."
kebab ingest --config /etc/kebab/config.toml
# Third paragraph: report
echo "done"
kebab schema --json | jq '.stats'
Note: blank lines BETWEEN the three logical sections are the paragraph boundaries. Each section starts with a # comment and runs ~3-4 lines. Total file ~13 lines.
- Step 2: Create
crates/kebab-chunk/tests/fixtures/sample_long_paragraph.txt(single 200-line paragraph, no blank lines — exercises line-window split):
# generate with a small loop — content is irrelevant, the line count is the test
python3 -c 'print("\n".join(f"line {i:03d}" for i in range(1, 201)))' \
> crates/kebab-chunk/tests/fixtures/sample_long_paragraph.txt
Verify line count = 200:
wc -l crates/kebab-chunk/tests/fixtures/sample_long_paragraph.txt
Expected: 200 crates/kebab-chunk/tests/fixtures/sample_long_paragraph.txt.
B.2 — failing tests
- Step 3: Create
crates/kebab-chunk/tests/code_text_paragraph_v1.rswith the same helper structure astests/k8s_manifest_resource_v1.rs(which has ayaml_doc(text) -> CanonicalDocumenthelper). Mirror it astext_doc(lang, text) -> CanonicalDocument. Then four tests:
//! Behavioural tests for `CodeTextParagraphV1Chunker`.
use std::path::PathBuf;
use kebab_chunk::{ChunkPolicy, Chunker, CodeTextParagraphV1Chunker};
use kebab_core::{
AssetId, Block, CanonicalDocument, CodeBlock, CommonBlock, Lang, Metadata, ParserVersion,
Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn text_doc(lang: &str, text: &str) -> CanonicalDocument {
let wp = WorkspacePath(format!("script.{lang}"));
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("none-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some(lang.into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some(lang.into()),
code: text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: format!("script.{lang}"),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some(lang.into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy::default()
}
#[test]
fn shell_multi_paragraph_splits_on_blank_lines() {
let text = std::fs::read_to_string(fixtures_dir().join("sample_shell.sh"))
.expect("read sample_shell.sh");
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker.chunk(&doc, &policy()).expect("chunk");
// 3 paragraphs separated by blank lines.
assert_eq!(chunks.len(), 3, "expected 3 paragraph chunks, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(symbol.as_deref(), None, "Tier 3 symbol must be None");
assert_eq!(lang.as_deref(), Some("shell"));
}
other => panic!("expected Code span, got {other:?}"),
}
}
// Line ranges must be ascending and not overlap (blank lines are NOT in any chunk).
let ranges: Vec<(u32, u32)> = chunks.iter().map(|c| match &c.source_spans[0] {
SourceSpan::Code { line_start, line_end, .. } => (*line_start, *line_end),
_ => unreachable!(),
}).collect();
for w in ranges.windows(2) {
assert!(w[0].1 < w[1].0, "paragraph ranges must be strictly ascending; got {:?}", ranges);
}
}
#[test]
fn single_long_paragraph_line_window_split() {
let text = std::fs::read_to_string(fixtures_dir().join("sample_long_paragraph.txt"))
.expect("read sample_long_paragraph.txt");
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker.chunk(&doc, &policy()).expect("chunk");
// 200 lines / window 80 / overlap 20 / stride 60
// chunk[0] = 1..80 (80 lines)
// chunk[1] = 61..140 (80 lines)
// chunk[2] = 121..200 (80 lines)
// → exactly 3 chunks.
assert_eq!(chunks.len(), 3, "expected 3 windows for 200-line paragraph, got {}", chunks.len());
let ranges: Vec<(u32, u32)> = chunks.iter().map(|c| match &c.source_spans[0] {
SourceSpan::Code { line_start, line_end, .. } => (*line_start, *line_end),
_ => unreachable!(),
}).collect();
assert_eq!(ranges, vec![(1, 80), (61, 140), (121, 200)]);
// chunk_ids must all differ (id_for_chunk's split_key suffix).
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(ids.len(), 3, "line-window chunks must have distinct chunk_ids");
}
#[test]
fn empty_file_emits_zero_chunks() {
let doc = text_doc("shell", "");
let chunks = CodeTextParagraphV1Chunker.chunk(&doc, &policy()).expect("chunk");
assert!(chunks.is_empty(), "empty text → 0 chunks");
}
#[test]
fn lang_field_preserved_from_input_doc() {
let yaml = "key1: value1\nkey2: value2\n";
let doc = text_doc("yaml", yaml);
let chunks = CodeTextParagraphV1Chunker.chunk(&doc, &policy()).expect("chunk");
assert_eq!(chunks.len(), 1);
match &chunks[0].source_spans[0] {
SourceSpan::Code { lang, symbol, .. } => {
assert_eq!(symbol.as_deref(), None);
assert_eq!(lang.as_deref(), Some("yaml"), "Tier 3 must preserve input lang");
}
other => panic!("expected Code span, got {other:?}"),
}
}
- Step 4: Run tests → FAIL (module/struct not yet defined):
cargo test -p kebab-chunk --test code_text_paragraph_v1 -- --nocapture 2>&1 | tail -10
Expected: compile error CodeTextParagraphV1Chunker not found.
B.3 — chunker implementation
- Step 5: Create
crates/kebab-chunk/src/code_text_paragraph_v1.rs:
//! p10-3: Tier 3 paragraph + line-window fallback chunker.
//!
//! Triggered for shell scripts (`.sh`/`.bash`/`.zsh`) directly, and as a
//! fallback when Tier 1/2 chunkers return `Ok(empty)` or `Err`. Splits by
//! blank lines into paragraphs; paragraphs > 80 lines are further split
//! into 80-line windows with 20-line overlap.
use crate::tier2_shared::build_chunk;
use crate::{Chunker, ChunkPolicy};
use anyhow::Result;
use kebab_core::{Block, Chunk, Document};
pub const VERSION_LABEL: &str = "code-text-paragraph-v1";
const FALLBACK_LINES_PER_CHUNK: usize = 80;
const FALLBACK_LINES_OVERLAP: usize = 20;
// stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP = 60.
pub struct CodeTextParagraphV1Chunker;
impl Chunker for CodeTextParagraphV1Chunker {
fn chunker_version(&self) -> &'static str { VERSION_LABEL }
fn chunk(&self, doc: &Document, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
let Some(Block::Code { text, lang, .. }) = doc.blocks.first() else {
return Ok(vec![]);
};
let lang_str = lang.as_deref().unwrap_or("");
let mut chunks = Vec::new();
for para in split_paragraphs(text) {
push_paragraph(&mut chunks, doc, policy, ¶, lang_str)?;
}
Ok(chunks)
}
}
/// Single paragraph + its 1-indexed line range.
struct Paragraph<'a> {
text: String, // joined lines (no trailing newline)
line_start: u32,
line_end: u32,
// unused but kept for future ergonomics:
_src: std::marker::PhantomData<&'a ()>,
}
fn split_paragraphs(text: &str) -> Vec<Paragraph<'_>> {
let mut paragraphs = Vec::new();
let mut current: Vec<&str> = Vec::new();
let mut current_start: Option<u32> = None; // 1-indexed line number where current paragraph began
for (idx, line) in text.lines().enumerate() {
let line_no = (idx + 1) as u32; // 1-indexed
let is_blank = line.trim().is_empty();
if is_blank {
// Boundary: flush current paragraph.
if let Some(start) = current_start.take() {
let end = start + current.len() as u32 - 1;
paragraphs.push(Paragraph {
text: current.join("\n"),
line_start: start,
line_end: end,
_src: std::marker::PhantomData,
});
current.clear();
}
} else {
if current_start.is_none() {
current_start = Some(line_no);
}
current.push(line);
}
}
// Trailing paragraph at EOF (no boundary blank line).
if let Some(start) = current_start.take() {
let end = start + current.len() as u32 - 1;
paragraphs.push(Paragraph {
text: current.join("\n"),
line_start: start,
line_end: end,
_src: std::marker::PhantomData,
});
}
paragraphs
}
fn push_paragraph(
out: &mut Vec<Chunk>,
doc: &Document,
policy: &ChunkPolicy,
para: &Paragraph<'_>,
lang: &str,
) -> Result<()> {
let n_lines = (para.line_end - para.line_start + 1) as usize;
if n_lines <= FALLBACK_LINES_PER_CHUNK {
out.push(build_chunk(
doc, policy,
¶.text,
para.line_start, para.line_end,
"", // empty symbol — build_chunk wraps as Some(""); see Step 7 note
lang,
VERSION_LABEL,
)?);
return Ok(());
}
// Line-window split. Stride = window - overlap = 60.
let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
let lines: Vec<&str> = para.text.lines().collect();
let mut i = 0usize;
while i < lines.len() {
let end = (i + FALLBACK_LINES_PER_CHUNK).min(lines.len());
let window_text = lines[i..end].join("\n");
let window_start = para.line_start + i as u32;
let window_end = para.line_start + (end as u32) - 1;
out.push(build_chunk(
doc, policy,
&window_text,
window_start, window_end,
"",
lang,
VERSION_LABEL,
)?);
if end == lines.len() {
break;
}
i += stride;
}
Ok(())
}
- Step 6: Register the module in
crates/kebab-chunk/src/lib.rs(next to existing Tier 2 chunker exports):
pub mod code_text_paragraph_v1;
pub use code_text_paragraph_v1::CodeTextParagraphV1Chunker;
- Step 7:
symbol = ""vssymbol = Nonecorrection. The spec says Tier 3 chunks must haveCitation::Code.symbol = None. Buttier2_shared::build_chunktakessymbol: &strand likely wraps it asSome(s.to_string()). Two options:- (preferred) Add a sibling helper
build_chunk_no_symbol(doc, policy, text, line_start, line_end, lang, chunker_version) -> Result<Chunk>intier2_shared.rsthat constructsSourceSpan::Code { ..., symbol: None, lang: Some(lang.to_string()) }. The currentbuild_chunkkeeps wrappingSome(symbol). - (alternative) Change
build_chunk's symbol parameter toOption<&str>. More disruption (Tier 2 callers need an update).
- (preferred) Add a sibling helper
Take the preferred path. Edit crates/kebab-chunk/src/tier2_shared.rs: add (next to build_chunk):
/// Like `build_chunk` but emits `symbol: None`. Used by Tier 3.
#[allow(clippy::too_many_arguments)]
pub(crate) fn build_chunk_no_symbol(
doc: &Document,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
lang: &str,
chunker_version: &str,
) -> Result<Chunk> {
// Mirror build_chunk's body but with symbol: None in the span.
// The simplest implementation calls build_chunk's underlying machinery —
// pull the body into a helper if needed, or inline minimally:
let span = SourceSpan::Code {
line_start,
line_end,
symbol: None,
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, policy, text, span, chunker_version)
}
If build_chunk has a _from_span substructure already (see tier2_shared.rs), call into it. If not, extract a build_chunk_from_span(doc, policy, text, span, chunker_version) -> Result<Chunk> private helper, then have both build_chunk (with symbol: Some(...)) and build_chunk_no_symbol call into it. The diff stays small.
Update code_text_paragraph_v1.rs::push_paragraph to call build_chunk_no_symbol instead of build_chunk(..., "", ...):
use crate::tier2_shared::build_chunk_no_symbol;
// ...
out.push(build_chunk_no_symbol(
doc, policy,
¶.text,
para.line_start, para.line_end,
lang,
VERSION_LABEL,
)?);
And the same swap in the line-window branch.
- Step 8: Run tests:
cargo test -p kebab-chunk --test code_text_paragraph_v1 -- --nocapture 2>&1 | tail -25
Expected: 4 PASS.
- Step 9: Run all kebab-chunk tests (no regression on Tier 1/2 from
tier2_sharededit):
cargo test -p kebab-chunk -- --nocapture 2>&1 | tail -20
Expected: all PASS.
- Step 10: Clippy + commit:
cargo clippy -p kebab-chunk --all-targets -- -D warnings
git add crates/kebab-chunk/src/code_text_paragraph_v1.rs \
crates/kebab-chunk/src/tier2_shared.rs \
crates/kebab-chunk/src/lib.rs \
crates/kebab-chunk/tests/fixtures/sample_shell.sh \
crates/kebab-chunk/tests/fixtures/sample_long_paragraph.txt \
crates/kebab-chunk/tests/code_text_paragraph_v1.rs
git commit -m "$(cat <<'EOF'
feat(p10-3): code-text-paragraph-v1 chunker — paragraph + line-window fallback
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2.
4 unit tests cover multi-paragraph shell, 200-line oversize line-window
split (chunks 1-80 / 61-140 / 121-200), empty file, and lang preservation
when input is yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
Task C: shell direct routing in ingest_one_code_asset
Files:
- Modify:
crates/kebab-app/src/lib.rs
C.1 — allowlist + 4-arm match
- Step 1: Open
crates/kebab-app/src/lib.rsline 953. Current allowlist:
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod")
Add "shell":
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "shell")
- Step 2: At the top-of-file
use kebab_chunk::{...}line, addCodeTextParagraphV1Chunker:
use kebab_chunk::{
/* existing items */,
CodeTextParagraphV1Chunker,
};
- Step 3: parser_version match (line ~1825):
let parser_version = match code_lang {
// ... existing 7 Tier 1 arms ...
"kotlin" => ParserVersion(kebab_parse_code::KOTLIN_PARSER_VERSION.to_string()),
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
=> ParserVersion("none-v1".to_string()),
// p10-3: shell also uses Tier 3 (no parse step).
"shell" => ParserVersion("none-v1".to_string()),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
- Step 4: chunker_version match:
let chunker_version = match code_lang {
// ... existing arms ...
"toml" | "json" | "xml" | "groovy" | "go-mod"
=> ManifestFileV1Chunker.chunker_version(),
// p10-3:
"shell" => CodeTextParagraphV1Chunker.chunker_version(),
other => anyhow::bail!("unreachable chunker_version: {other}"),
};
- Step 5: extract match (canonical Document construction):
let mut canonical = match code_lang {
// ... existing Tier 1 + Tier 2 arms ...
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" => {
synthesize_tier2_document(asset, &bytes, code_lang, &parser_version)?
}
// p10-3: shell reuses the same synthesizer — single Block::Code with raw text.
"shell" => synthesize_tier2_document(asset, &bytes, "shell", &parser_version)?,
other => anyhow::bail!("unreachable (extract): {other}"),
};
- Step 6: chunks match:
let chunks = match code_lang {
// ... existing Tier 1 + Tier 2 arms ...
"toml" | "json" | "xml" | "groovy" | "go-mod"
=> ManifestFileV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::ManifestFileV1Chunker::chunk")?,
// p10-3:
"shell" => CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (code:shell)")?,
other => anyhow::bail!("unreachable (chunk): {other}"),
};
- Step 7: Build:
cargo build -p kebab-app 2>&1 | tail -5
Expected: clean.
- Step 8: Clippy + interim commit (allowlist + 4 arms only; fallback wrapper is the next task):
cargo clippy -p kebab-app --all-targets -- -D warnings
git add crates/kebab-app/src/lib.rs
git commit -m "$(cat <<'EOF'
feat(p10-3): activate shell direct routing through Tier 3 chunker
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.
Tier 1/2 fallback wrapper lands in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
Task D: Tier 1/2 → Tier 3 fallback wrapper
Files:
- Modify:
crates/kebab-app/src/lib.rs
The post-chunk fallback wrapper. After the chunks match resolves, if the result is Ok(empty) (Tier 2 invalid YAML / non-k8s YAML) or Err(_) (Tier 1 extractor / chunker failure), retry with Tier 3.
- Step 1: Reshape the
chunksmatch to NOT use?, instead bind aResult<Vec<Chunk>>:
let chunks_result: anyhow::Result<Vec<Chunk>> = match code_lang {
"rust" => CodeRustAstV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)"),
"python" => CodePythonAstV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)"),
// ... existing arms similarly bind Result, no `?` ...
"shell" => CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (code:shell)"),
other => anyhow::bail!("unreachable (chunk): {other}"),
};
(Every existing arm: replace its .context(...)? with .context(...) — drop the trailing ?. The result of the whole match is now anyhow::Result<Vec<Chunk>>.)
- Step 2: Add the fallback wrapper directly after the match:
// p10-3: Tier 1/2 0-chunk OR error → Tier 3 fallback retry.
// "shell" is direct Tier 3 already; don't retry-double-up.
let chunks = match chunks_result {
Ok(v) if !v.is_empty() => v,
other if code_lang == "shell" => {
// shell direct call already IS Tier 3 — don't retry. Propagate.
other?
}
Ok(_empty) => {
tracing::warn!(
workspace_path = %asset.workspace_path,
code_lang = code_lang,
"tier1/2 emitted 0 chunks; falling back to tier 3 (code-text-paragraph-v1)"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
canonical.parser_version = ParserVersion("none-v1".to_string());
CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback)")?
}
Err(e) => {
tracing::warn!(
workspace_path = %asset.workspace_path,
code_lang = code_lang,
error = %e,
"tier1/2 errored; falling back to tier 3 (code-text-paragraph-v1)"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
canonical.parser_version = ParserVersion("none-v1".to_string());
CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback after error)")?
}
};
Notes:
-
chunker_versionmust already be alet mutbinding (Task C step 4). If it's currentlylet, change tolet mut. -
canonical.parser_versionmutation requirescanonicalto belet mut(it already is per Task G of p10-2 —let mut canonical = match ...). -
The
workspace_pathfield onassetis aWorkspacePath(String)newtype; the%formatter uses itsDisplayimpl. Verify bygit grep "fn fmt" crates/kebab-core/src/ids.rsif needed —WorkspacePathderivesDisplay. -
Step 3: When fallback fires, the extract step for a Tier 1 lang (
"rust"/"python"/ ...) didn't run (because the Tier 1extractcall errored before reachingchunks_result). Socanonicalmay already be set up correctly for the Tier 3 chunker — IF the extract step succeeded but chunking returned empty. But if extract itself errored (e.g. tree-sitter parse failure),canonicalwas never built and the chunks_result match arm never executed.Reshape the extract match similarly:
let canonical_result: anyhow::Result<kebab_core::CanonicalDocument> = match code_lang {
"rust" => RustAstExtractor::new().extract(&ctx, &bytes)
.context("kb-parse-code::RustAstExtractor::extract (code:rust)"),
// ... existing arms ... → drop trailing `?`
"shell" => Ok(synthesize_tier2_document(asset, &bytes, "shell", &parser_version)?),
// (synthesize_tier2_document returns anyhow::Result<CanonicalDocument>; the `?` here
// is fine because the Tier 2 synthesizer call is itself the inner Result. If it
// failed, we want to propagate the synthesizer error — synthesize_tier2_document
// can fail on non-utf8 bytes; falling back from a non-utf8 file makes no sense.)
other => anyhow::bail!("unreachable (extract): {other}"),
};
// p10-3: extract failure (e.g. tree-sitter parse error) → Tier 3 fallback with
// a synthesized Document.
let mut canonical = match canonical_result {
Ok(d) => d,
Err(_) if code_lang == "shell" => {
// shell's extract goes through synthesize_tier2_document — if THAT fails (non-utf8),
// there's nothing to fall back to. Propagate.
canonical_result?
}
Err(e) => {
tracing::warn!(
workspace_path = %asset.workspace_path,
code_lang = code_lang,
error = %e,
"tier1/2 extract errored; falling back to tier 3 synthesized doc"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
// Build the Tier 3 doc from raw bytes. parser_version was originally Tier 1's
// (e.g. RUST_PARSER_VERSION); swap to "none-v1" so try_skip_unchanged keys correctly.
let tier3_parser_version = ParserVersion("none-v1".to_string());
let mut tier3_doc = synthesize_tier2_document(asset, &bytes, code_lang, &tier3_parser_version)?;
tier3_doc.parser_version = tier3_parser_version;
tier3_doc
}
};
If after extract fallback the original chunks_result match was going to run a Tier 1 chunker against a Tier 3 doc — that would crash because Tier 1 chunkers expect AST output. So when extract fell back, the chunks match must also use Tier 3 directly. Solution: drop the chunks match into an if-else flow:
let extract_fell_back = matches!(canonical.parser_version.0.as_str(), "none-v1")
&& !matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" | "shell");
let chunks = if extract_fell_back {
// Extract already fell back to Tier 3 doc shape; run Tier 3 chunker directly.
CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 after extract fallback)")?
} else {
// Normal path — Tier 1/2/3 chunker per code_lang.
let chunks_result: anyhow::Result<Vec<Chunk>> = match code_lang {
// ... arms above ...
};
// ... fallback wrapper from Step 2 ...
};
This is getting complex; consider a helper. Refactor signal: extract this logic into a single function tier3_fallback_chunks(asset, bytes, code_lang, chunk_policy, original_parser_version, &mut canonical, &mut chunker_version) -> Result<Vec<Chunk>> if the inline becomes hard to read.
For the plan, keep it inline but readable. The reviewer will catch readability issues. If a subagent reports DONE_WITH_CONCERNS citing complexity, refactor in a follow-up step.
- Step 4: Build:
cargo build -p kebab-app 2>&1 | tail -5
Expected: clean.
- Step 5: Run existing kebab-app unit tests (no regression):
cargo test -p kebab-app --lib -- --nocapture 2>&1 | tail -10
Expected: 52 PASS (matching the count after Task G of p10-2).
- Step 6: Clippy + commit:
cargo clippy -p kebab-app --all-targets -- -D warnings
git add crates/kebab-app/src/lib.rs
git commit -m "$(cat <<'EOF'
feat(p10-3): Tier 1/2 → Tier 3 fallback wrapper in ingest_one_code_asset
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.
Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly.
shell direct path is exempted from the fallback chain (it IS Tier 3
already).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
Task E: integration smoke tests (Tier 3)
Files:
- Modify:
crates/kebab-app/tests/code_ingest_smoke.rs
Two new tests. Mirror the pattern from p10-2's three Tier 2 tests (commit 166e1dd).
-
Step 1: Read the existing
tests/code_ingest_smoke.rsfirst — especially the three p10-2 Tier 2 tests near the end (tier2_k8s_yaml_ingest_searchable,tier2_dockerfile_ingest_searchable,tier2_cargo_toml_ingest_searchable). Replicate theTestEnv::lexical_only()+ ingest_with_config + search_with_config pattern. -
Step 2: Append two tests at the end of the file:
#[test]
fn tier3_shell_ingest_searchable() {
let env = TestEnv::lexical_only();
let workspace = env.workspace_root();
std::fs::write(
workspace.join("deploy.sh"),
"#!/usr/bin/env bash\nset -e\necho hello\n\nkebab ingest --json\n",
)
.unwrap();
let report = env.ingest().expect("ingest");
assert!(report.new_docs >= 1, "expected at least 1 new doc, got {}", report.new_docs);
let hits = env.search_code_lang("shell", "kebab").expect("search");
assert!(!hits.is_empty(), "expected at least 1 shell hit");
let citation = match &hits[0].citation {
Citation::Code { symbol, lang, .. } => (symbol.clone(), lang.clone()),
other => panic!("expected Citation::Code, got {other:?}"),
};
assert_eq!(citation.0, None, "Tier 3 symbol must be None");
assert_eq!(citation.1.as_deref(), Some("shell"));
// chunker_version should be code-text-paragraph-v1.
assert_eq!(
hits[0].chunker_version.as_deref(),
Some("code-text-paragraph-v1"),
"shell chunks must be stamped with the Tier 3 chunker_version"
);
}
#[test]
fn tier3_yaml_fallback_picks_up_non_k8s_yaml() {
let env = TestEnv::lexical_only();
let workspace = env.workspace_root();
// docker-compose-shaped YAML — has `version:` and `services:` but no apiVersion/kind.
// k8s chunker will return Ok(vec![]); the Tier 3 fallback should pick this up.
std::fs::write(
workspace.join("docker-compose.yml"),
"version: '3'\nservices:\n api:\n image: nginx:latest\n ports:\n - 8080:80\n",
)
.unwrap();
let report = env.ingest().expect("ingest");
assert!(report.new_docs >= 1, "expected the non-k8s yaml to be ingested via Tier 3, got {} new docs", report.new_docs);
let hits = env.search_code_lang("yaml", "nginx").expect("search");
assert!(!hits.is_empty(), "expected at least 1 yaml fallback hit");
let (symbol, lang) = match &hits[0].citation {
Citation::Code { symbol, lang, .. } => (symbol.clone(), lang.clone()),
other => panic!("expected Citation::Code, got {other:?}"),
};
assert_eq!(symbol, None, "Tier 3 fallback symbol must be None");
assert_eq!(lang.as_deref(), Some("yaml"), "lang preserved through fallback");
assert_eq!(
hits[0].chunker_version.as_deref(),
Some("code-text-paragraph-v1"),
"non-k8s yaml fallback must be stamped code-text-paragraph-v1"
);
}
(The helpers TestEnv::lexical_only(), workspace_root(), ingest(), search_code_lang(lang, query) — verify their actual names by reading the file. The first test in code_ingest_smoke.rs uses whatever the established API is; mirror it precisely.)
- Step 3: Run targeted tests:
cargo test -p kebab-app --test code_ingest_smoke tier3 -- --nocapture 2>&1 | tail -30
Expected: 2 PASS.
- Step 4: Run the entire smoke file:
cargo test -p kebab-app --test code_ingest_smoke -- --nocapture 2>&1 | tail -30
Expected: 14 PASS (12 existing + 2 new).
- Step 5: Clippy + commit:
cargo clippy -p kebab-app --tests -- -D warnings
git add crates/kebab-app/tests/code_ingest_smoke.rs
git commit -m "$(cat <<'EOF'
test(p10-3): integration smoke tests for Tier 3 (shell + yaml fallback)
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
Citation::Code { symbol: None, lang: "shell" }, chunker_version
"code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
(no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
chunker_version "code-text-paragraph-v1".
Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
Task F: frozen design §10.1 + §10 activation log
Files:
-
Modify:
docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md -
Modify:
docs/superpowers/specs/2026-04-27-kebab-final-form-design.md -
Step 1: Read §10.1 of
docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.mdto find the existing activation log format (Task I of p10-2 added the p10-2 entry there). Add a sibling entry right after:
| p10-3 | Tier 3 활성화 — code-text-paragraph-v1 active. shell direct routing + Tier 1/2 fallback wrapper (0-chunk or Err → Tier 3 retry). 비-k8s YAML / invalid YAML 자동 picked up. | 2026-05-21 |
(Match the table style exactly. If the existing entries are bullets, use a bullet.)
- Step 2: Read §10 of
docs/superpowers/specs/2026-04-27-kebab-final-form-design.md(around line 1552 — the p10-2 entry is there). Add right after:
**p10-3 활성화 (Tier 3 paragraph fallback) (2026-05-21)**: Tier 3 chunker `code-text-paragraph-v1` 활성화. shell script (`.sh`/`.bash`/`.zsh`) direct routing + Tier 1/2 가 0 chunk 또는 Err 시 자동 fallback 으로 retry. 비-k8s YAML / invalid YAML / AST 실패 케이스 모두 picked up. lang 은 입력 보존 (shell → "shell", yaml → "yaml" 등), symbol 은 항상 None.
- Step 3: Commit:
git add docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md \
docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
git commit -m "$(cat <<'EOF'
docs(p10-3): activate Tier 3 in frozen design §10.1 + §10
§10.1 (code-ingest design): add deactivation log entry for p10-3.
§10 (final-form design): mirror entry in the activation log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
Task G: README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX
Files:
- Modify:
README.md - Modify:
HANDOFF.md - Modify:
docs/ARCHITECTURE.md - Modify:
docs/SMOKE.md - Modify:
tasks/INDEX.md - Modify:
tasks/p10/INDEX.md
G.1 — README
- Step 1: Open
README.md. Find thekebab ingestrow in the 명령 table (line ~73). Extend the supported-langs list with shell + the fallback note. Sample patch (adjust to actual current wording):
Find:
**소스코드** (`.rs` → `code-rust-ast-v1`, `.py` → ... , `.kt`/`.kts` → `code-kotlin-ast-v1` — 모두 tree-sitter AST chunker; **Tier 2 리소스 파일**: ...)
After the Tier 2 리소스 파일: ... clause, insert before the closing ):
; **Tier 3 paragraph fallback** (`.sh`/`.bash`/`.zsh` → `code-text-paragraph-v1`, blank-line paragraph split + 80-line/20-overlap line-window. Tier 1/2 가 0 chunk 또는 Err 시 자동 fallback — 비-k8s YAML 같은 케이스 picked up. symbol = None, lang 은 원본 보존.)
Also extend the --code-lang enumeration:
--code-lang ... / --code-lang shell / ...
- Step 2: Find the Mermaid diagram's chunker node (line ~135). Replace:
chunker["chunker (md-heading-v1, pdf-page-v1, code-{rust,python,ts,js,go,java,kotlin}-ast-v1, k8s-manifest-resource-v1, dockerfile-file-v1, manifest-file-v1)"]
with (add code-text-paragraph-v1):
chunker["chunker (md-heading-v1, pdf-page-v1, code-{rust,python,ts,js,go,java,kotlin}-ast-v1, k8s-manifest-resource-v1, dockerfile-file-v1, manifest-file-v1, code-text-paragraph-v1)"]
G.2 — HANDOFF
- Step 3: Open
HANDOFF.md. Find the phase table row for P10. The Tier 2 row was**2 ✅ (Tier 2 resource-aware: ... — v0.14.0)**. Add a sibling for Tier 3:
, **3 ✅ (Tier 3 paragraph fallback: code-text-paragraph-v1 — v0.15.0)**
(Insert into the same P10 list cell — match the comma + bold styling used by neighbors.)
Update the 한 줄 요약 at the top: replace Tier 2 리소스 파일 (yaml/k8s / dockerfile / toml / json / xml / groovy / go-mod) 처리 with ... + Tier 3 paragraph fallback (shell / 비-k8s YAML / AST 실패) 처리. And in 다음 후보: drop p10-3, leave P10-1D (C/C++) 또는 P9-5 (desktop tauri) 또는 보류 중인 P8 (audio).
G.3 — ARCHITECTURE
- Step 4: Open
docs/ARCHITECTURE.md. Find the code parser table row (line ~25 per Task J of p10-2). After the Tier 2 sentence, add:
**Tier 3 (p10-3)**: shell scripts (`.sh`/`.bash`/`.zsh`) direct → `code-text-paragraph-v1` (blank-line paragraph segmentation + 80-line / 20-overlap line-window for oversize). Same chunker also serves as fallback when Tier 1/2 emit 0 chunks or Err — non-k8s YAML / invalid YAML / AST extractor failures all picked up. symbol = None; lang preserved from input doc.
-
Step 5: Find the
flowchart TBblock (line ~52). Thepcodenode currently says(P10-1A-2 + P10-1B + P10-1C-Go + P10-1C-JK + P10-2). Update to(P10-1A-2 + P10-1B + P10-1C-Go + P10-1C-JK + P10-2 + P10-3). -
Step 6: Find the
crates/kebab-chunk/src/tree (line ~165). Add an entry forcode_text_paragraph_v1.rs:
│ │ ├── code_*_ast_v1.rs # Tier 1 AST chunkers (rust/python/ts/js/go/java/kotlin)
│ │ ├── k8s_manifest_resource_v1.rs # Tier 2 (p10-2): YAML multi-doc, apiVersion+kind per resource
│ │ ├── dockerfile_file_v1.rs # Tier 2 (p10-2): whole-file Dockerfile
│ │ ├── manifest_file_v1.rs # Tier 2 (p10-2): whole-file Cargo.toml / go.mod / .json / .xml / .groovy
│ │ ├── code_text_paragraph_v1.rs # Tier 3 (p10-3): blank-line paragraph + 80/20 line-window fallback
│ │ └── tier2_shared.rs # Tier 2 (p10-2): shared oversize fallback + Chunk builder helpers
G.4 — SMOKE
- Step 7: Open
docs/SMOKE.md. After the "P10-2 Tier 2 리소스 파일 색인" section, add a "P10-3 Tier 3 paragraph fallback" section:
## P10-3 Tier 3 paragraph fallback
P10-2 와 동일한 격리 KB 설정. `.sh` 파일은 direct, 비-k8s YAML 은 fallback 으로 들어간다.
```bash
# 1) shell script (direct Tier 3)
cat > /tmp/kebab-smoke/workspace/deploy.sh <<'EOF'
#!/usr/bin/env bash
set -e
echo "ingesting..."
kebab ingest
echo "done"
kebab schema --json | jq '.stats'
EOF
# 2) 비-k8s YAML (Tier 2 가 0 chunk → Tier 3 fallback)
cat > /tmp/kebab-smoke/workspace/docker-compose.yml <<'EOF'
version: '3'
services:
api:
image: nginx:latest
ports:
- 8080:80
EOF
# 3) ingest
KB ingest
# 4) 언어별 검색 (citation.symbol = None 확인)
KB search --mode hybrid "ingest" --code-lang shell --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
# 기대: symbol = null, lang = "shell", chunker_version = "code-text-paragraph-v1"
KB search --mode hybrid "nginx" --code-lang yaml --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
# 기대: symbol = null, lang = "yaml", chunker_version = "code-text-paragraph-v1"
# 5) schema stats 에 shell 카운트 확인
KB --json schema | jq '.stats.code_lang_breakdown'
# 기대: {"shell": N, "yaml": M, ...} (M 은 k8s yaml + Tier 3 fallback yaml 합계)
Tier 3 citation.symbol 컨벤션: 항상 null. 의미 단위 식별 안 함. lang 은 원본 lang 보존 (shell → "shell", yaml → "yaml" 등).
Append a P10-3 entry to the 검증 체크리스트 at the bottom:
- (P10-3)
.sh/.bash/.zsh파일은 direct Tier 3 (code-text-paragraph-v1). 비-k8s YAML (apiVersion+kind 없는 yaml) 은 k8s chunker 가 0 chunk → Tier 3 fallback 으로 picked up.--code-lang shell/--code-lang yaml검색이citation.symbol = null,chunker_version = "code-text-paragraph-v1"결과를 반환하면 wiring 정상.kebab schema --json | jq .stats.code_lang_breakdown에"shell": N등장 확인.
### G.5 — INDEX files
- [ ] **Step 8**: `tasks/INDEX.md` — flip the p10-3 row to ✅:
Find:
- p10-3 Tier 3 paragraph + line-window fallback — ⏳
Replace with:
- p10-3 Tier 3 paragraph + line-window fallback — ✅ 머지 (v0.15.0,
code-text-paragraph-v1)
- [ ] **Step 9**: `tasks/p10/INDEX.md` — same row, change to ✅:
| 3 | Tier 3 paragraph + line-window fallback | ✅ 머지 (v0.15.0) |
### G.6 — commit
- [ ] **Step 10**: Single commit for all 6 docs:
```bash
git add README.md HANDOFF.md docs/ARCHITECTURE.md docs/SMOKE.md tasks/INDEX.md tasks/p10/INDEX.md
git commit -m "$(cat <<'EOF'
docs(p10-3): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
- README adds Tier 3 to the ingest row (shell + fallback) and the Mermaid
chunker enumeration; --code-lang shell admitted.
- HANDOFF flips p10-3 to ✅ (v0.15.0) and updates the 한 줄 요약 + next
candidates.
- ARCHITECTURE adds Tier 3 to the code-parser row, extends the flowchart
pcode node, and lists code_text_paragraph_v1.rs in the chunker tree.
- SMOKE adds a P10-3 walkthrough (shell + non-k8s YAML fallback) and a
verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-3 to ✅.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
Task H: workspace test gate + clippy
Files: (none — gates only)
- Step 1: Disk check:
df -h /
If usage > 80%, run cargo clean first.
- Step 2: Workspace test gate (memory-conscious
-j 1):
cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -80
Expected: ALL PASS. Especially:
kebab-chunk: 4 new Tier 3 tests + existing.kebab-app: 14 tests incode_ingest_smoke(12 + 2 new).
If FAIL: common modes:
-
A Tier 1/2 test inadvertently relied on the chunks match's prior
?-propagation behavior — Task D's restructuring shouldn't change observable behavior but check. -
A test that expected
Errfrom Tier 1 (e.g. invalid input fixture) now getsOk(vec![chunk])(Tier 3 fallback). Such tests would be tests-of-failure-mode rather than tests-of-success — likely intentional regression coverage. Review case-by-case. -
Step 3: Workspace clippy:
cargo clippy --workspace --all-targets -- -D warnings 2>&1 | tail -30
Expected: clean.
Task I: workspace version bump + gitea PR
Files:
-
Modify:
Cargo.toml -
Modify:
Cargo.lock(auto) -
Step 1: Edit
Cargo.tomlworkspaceversion = "0.14.0"→"0.15.0". -
Step 2: Refresh Cargo.lock:
cargo build -p kebab-cli 2>&1 | tail -5
Expected: clean. Cargo.lock cascades all 22 kebab-* crates to 0.15.0.
- Step 3: Commit:
git add Cargo.toml Cargo.lock
git commit -m "$(cat <<'EOF'
chore: bump version 0.14.0 → 0.15.0 (p10-3 Tier 3 paragraph fallback)
Minor bump — additive new chunker_version "code-text-paragraph-v1" + new
routing lang "shell" + new fallback wrapper behavior. No DB migration, no
wire schema major bump (Citation::Code.lang values were already a free
string field).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
- Step 4: Push branch + open gitea PR via REST API (per CLAUDE.md). Title:
feat(p10-3): Tier 3 paragraph + line-window fallback chunker — shell + 비-k8s YAML / AST 실패 자동 picked up
Body summary:
code-text-paragraph-v1chunker activated (design §9.3).- shell scripts (
.sh/.bash/.zsh) ingest directly via Tier 3. - Tier 1/2 0-chunk or Err results retry with Tier 3 — non-k8s YAML, invalid YAML, AST extractor failures all picked up.
chunker_version+parser_versionswap on fallback. - 4 unit tests + 2 smoke tests = 6 new testing surfaces.
- frozen design §10.1 + §10 deltas.
- 0.14.0 → 0.15.0.
Test plan checkboxes:
-
cargo test --workspace --no-fail-fast -j 1PASS -
cargo clippy --workspace --all-targets -- -D warningsclean -
kebab-chunk 4 new Tier 3 unit tests PASS
-
kebab-app code_ingest_smoke 14 tests PASS (12 + 2 new)
-
post-merge dogfood: multi-root KB ingest with mixed .sh + non-k8s yaml — verify --code-lang shell results and schema breakdown
-
post-merge gitea-release v0.15.0
-
Step 5: Wait for code-reviewer APPROVE, then merge via the gitea REST API (
POST /repos/altair823-org/kebab/pulls/<N>/merge) and cutgitea-release v0.15.0.
Verification matrix (final, after Task I merge)
| 검증 | 명령 | 기대 |
|---|---|---|
| shell direct | kebab ingest /tmp/kebab-smoke/workspace/deploy.sh + kebab search --code-lang shell --json |
Citation::Code { symbol: null, lang: "shell" }, chunker_version: "code-text-paragraph-v1" |
| 비-k8s YAML fallback | kebab ingest /tmp/kebab-smoke/workspace/docker-compose.yml + kebab search --code-lang yaml --json |
Citation::Code { symbol: null, lang: "yaml" }, chunker_version: "code-text-paragraph-v1" |
| invalid YAML fallback | malformed yaml ingest → search | Tier 3 chunks emitted (non-empty) |
| AST extractor 실패 fallback | (hard to trigger artificially — relies on tree-sitter parse failure on otherwise-valid Rust; this is dogfood territory) | Citation::Code { symbol: null, lang: "rust" }, chunker_version: "code-text-paragraph-v1" |
code_lang_breakdown |
`kebab schema --json | jq .stats.code_lang_breakdown` |
Risks reminder (구현 중 주의)
-
Fallback wrapper 의 복잡도: Task D 가 가장 위험 — extract failure + chunks failure 두 path 가 얽힘. 한 helper 로 추출하는 게 깔끔할 수 있음. 가독성이 나빠지면 subagent 가 DONE_WITH_CONCERNS 로 보고 → 후속 cleanup commit.
-
tier2_shared::build_chunk_no_symbol추가: 기존build_chunk의 body 를 재사용하려면build_chunk_from_span내부 helper 분리 가능. 분리하면 build_chunk + build_chunk_no_symbol 둘 다 한 곳에서 Chunk 구성. -
shell fixture line count 정확성: Task B Step 1 의
sample_shell.sh가 정확히 3 paragraph 가 되도록 —#줄과 명령 줄 사이 빈 줄이 없어야 같은 paragraph, paragraph 사이엔 정확히 1 빈 줄. fixture 생성 후cat -A로 확인 권장. -
200-line fixture stride 계산: Step 2 의
sample_long_paragraph.txt가 정확히 200 lines. window 80 / stride 60 → 1-80, 61-140, 121-200. 마지막 window 의 시작이 121 인 이유 = 121 + 80 - 1 = 200 ≤ 200. 다음 stride (181) 의 시작은 181 인데 181 + 80 - 1 = 260 > 200 이라 그냥 EOF 까지 = 181-200 (20 lines). 즉 알고리즘에 따라 chunk 수가 3 또는 4 가 됨. Task B Step 5 의 impl 확인 —while i < lines.len()에서 마지막 stride 가 EOF 를 넘어서면 짧은 마지막 window emit. 확인 결과 expectation 이 정확히 3 chunk 인지 4 chunk 인지 (3 이 맞으면break조건이if end == lines.len() { break }로 처리) 검증.- 재계산: 200 lines, window 80, stride 60.
- i=0, end=min(80, 200)=80 → chunk 1-80, break? end != len(200), 계속, i += 60 → i=60.
- i=60, end=min(140, 200)=140 → chunk 61-140, end != 200, i=120.
- i=120, end=min(200, 200)=200 → chunk 121-200, end == 200 → break.
- 정확히 3 chunk. Test expectation 맞음.
- 재계산: 200 lines, window 80, stride 60.
-
canonical.parser_versionmutation 의 영향 범위: try_skip_unchanged 는 ingest_one_code_asset 시작부에서parser_version(캡처된 값) 으로 체크. fallback 후 stored 가none-v1로 변경 — 다음 ingest 시 동일 lang 이면 동일parser_version으로 키 → skip 동작. Tier 1 chunker 가 미래에 정상 동작하기 시작하면 Tier 1 path 가RUST_PARSER_VERSION으로 새 키 → cache miss → reprocess. cascade rule 정상. -
머지 후 deviation 은
tasks/HOTFIXES.mddated 로그 + 본 specRisks / notescross-link.