Compare commits

8 Commits

Author SHA1 Message Date
a0c0dca321 fix(dogfood): k8s multi-resource YAML chunk_id collision (#158) 2026-05-21 23:57:49 +00:00
667495ae6a docs(dogfood): HOTFIXES entry for k8s multi-resource chunk_id collision
PR #158 code-reviewer recommendation. Records the dogfood-discovered
k8s multi-resource chunk_id collision + the deliberate decision NOT to
bump chunker_version (dogfood-only stage, single-resource k8s chunk_id
shift is benign churn). Cross-link added to p10-2 spec Risks/notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:57:34 +00:00
08d72a12e0 chore: bump version 0.16.0 → 0.16.1 (k8s multi-resource chunk_id fix)
Patch bump — bug fix only (P10 dogfood-discovered k8s multi-resource
chunk_id collision). New binary needed to resume dogfooding. No wire
schema change, no DB migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:54:33 +00:00
1969c8e3b5 fix(dogfood): k8s multi-resource YAML chunk_id collision
P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
  UNIQUE constraint failed: chunks.chunk_id

Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.

Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).

Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 23:49:37 +00:00
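
Illustration of the root cause above, as a minimal sketch rather than the real derivation: `toy_chunk_id` and `ids_for_resources` are hypothetical names, and std's `DefaultHasher` stands in for whatever hashing kebab actually uses. It only shows why identical inputs with `split_key = None` collapse to one id, and why keying on `line_start` separates them.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Toy stand-in for the chunk_id derivation (assumption: the real one hashes
/// more fields and does not use DefaultHasher).
fn toy_chunk_id(
    doc_id: &str,
    chunker_version: &str,
    policy_hash: &str,
    split_key: Option<u32>,
) -> u64 {
    let mut h = DefaultHasher::new();
    doc_id.hash(&mut h);
    chunker_version.hash(&mut h);
    policy_hash.hash(&mut h);
    split_key.hash(&mut h);
    h.finish()
}

/// One id per resource. `keyed = false` mimics the old hardcoded `None`;
/// `keyed = true` mimics passing `Some(line_start)`.
fn ids_for_resources(line_starts: &[u32], keyed: bool) -> Vec<u64> {
    line_starts
        .iter()
        .map(|&ls| {
            let split_key = if keyed { Some(ls) } else { None };
            toy_chunk_id("doc-1", "k8s-manifest-resource-v1", "policy-abc", split_key)
        })
        .collect()
}

/// Number of distinct ids in the slice.
fn distinct(ids: &[u64]) -> usize {
    ids.iter().collect::<HashSet<_>>().len()
}

fn main() {
    // Hypothetical line starts for a Deployment and a Service in one file.
    let line_starts = [1, 9];
    // Old behavior: every resource hashes the same inputs, so one id.
    assert_eq!(distinct(&ids_for_resources(&line_starts, false)), 1);
    // Fixed behavior: the line start differs per resource, so ids differ.
    assert_eq!(distinct(&ids_for_resources(&line_starts, true)), 2);
    println!("collision reproduced and resolved in the toy model");
}
```

Keying on the resource's start line is cheap because the chunker already has it, and it stays stable as long as the resource does not move within the file.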
c6207d196e chore(p10-1d-followup): reviewer nit cleanup — C extractor tests + HOTFIXES + cpp snapshot (#157) 2026-05-21 22:47:38 +00:00
840c6c40a6 test(p10-1d-followup): cpp snapshot exercises actual CppAstExtractor
Reviewer nit #3: the hand-built fixed_doc() only verified chunker 1:1
mapping. New tests invoke CppAstExtractor against tests/fixtures/sample.cpp
and snapshot the real extractor → chunker pipeline (14 blocks emitted
covering namespace::chunk::Class, ctor/dtor/operator/template/free-fn
convention, glue <top-level> blocks between units).

Adds kebab-parse-code as a dev-dep of kebab-chunk (same precedent as
kebab-parse-md). Both the existing hand-built test AND the new
extractor-driven tests are kept — the former for fast chunker-only
validation, the latter for end-to-end regression detection.

Added tests:
- code_cpp_ast_extractor_snapshot: asserts all 8 named symbol units are present
- code_cpp_ast_extractor_chunks_deterministic: chunker output is stable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:43:57 +00:00
b81574afa9 docs(p10-1d-followup): HOTFIXES entry — typedef-wrapped struct/enum in C falls into glue
PR #156 reviewer nit #2. Documents the tension between the spec body
("struct_specifier (named, top-level) → 1 unit") and the actual behavior
for the C idiom `typedef struct { ... } Foo;`: the inner struct_specifier
is anonymous, so the whole declaration falls into glue. Next step:
revisit after dogfooding if this proves to be a frequent pain point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:40:04 +00:00
6beff35a2f test(p10-1d-followup): add in-file unit tests to C AST extractor
Mirrors the cpp.rs 15-test pattern. Covers function_definition (incl.
pointer-return, static/extern/inline), struct_specifier / enum_specifier /
union_specifier (named), anonymous struct/enum/union → glue, typedef-wrapped
struct → glue (per spec risks note), preprocessor directives → glue, empty
file → <module> post-pass, preprocessor-only → <module>, mixed fn + glue →
<top-level> present, determinism (20 runs). 17 tests total.

Reviewer nit #1 (PR #156 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:39:36 +00:00
14 changed files with 533 additions and 41 deletions

47
Cargo.lock generated

@@ -4127,7 +4127,7 @@ dependencies = [
[[package]]
name = "kebab-app"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"base64 0.22.1",
@@ -4172,12 +4172,13 @@ dependencies = [
[[package]]
name = "kebab-chunk"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
"kebab-core",
"kebab-normalize",
"kebab-parse-code",
"kebab-parse-md",
"serde_json",
"serde_json_canonicalizer",
@@ -4188,7 +4189,7 @@ dependencies = [
[[package]]
name = "kebab-cli"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"clap",
@@ -4209,7 +4210,7 @@ dependencies = [
[[package]]
name = "kebab-config"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"dirs 5.0.1",
@@ -4224,7 +4225,7 @@ dependencies = [
[[package]]
name = "kebab-core"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4238,7 +4239,7 @@ dependencies = [
[[package]]
name = "kebab-embed"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4252,7 +4253,7 @@ dependencies = [
[[package]]
name = "kebab-embed-local"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"fastembed",
@@ -4265,7 +4266,7 @@ dependencies = [
[[package]]
name = "kebab-eval"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-app",
@@ -4284,7 +4285,7 @@ dependencies = [
[[package]]
name = "kebab-llm"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4293,7 +4294,7 @@ dependencies = [
[[package]]
name = "kebab-llm-local"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-config",
@@ -4310,7 +4311,7 @@ dependencies = [
[[package]]
name = "kebab-mcp"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-app",
@@ -4328,7 +4329,7 @@ dependencies = [
[[package]]
name = "kebab-normalize"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4343,7 +4344,7 @@ dependencies = [
[[package]]
name = "kebab-parse-code"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"gix",
@@ -4366,7 +4367,7 @@ dependencies = [
[[package]]
name = "kebab-parse-image"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"ab_glyph",
"anyhow",
@@ -4390,7 +4391,7 @@ dependencies = [
[[package]]
name = "kebab-parse-md"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4407,7 +4408,7 @@ dependencies = [
[[package]]
name = "kebab-parse-pdf"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4420,7 +4421,7 @@ dependencies = [
[[package]]
name = "kebab-parse-types"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"kebab-core",
"serde",
@@ -4428,7 +4429,7 @@ dependencies = [
[[package]]
name = "kebab-rag"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4449,7 +4450,7 @@ dependencies = [
[[package]]
name = "kebab-search"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"globset",
@@ -4468,7 +4469,7 @@ dependencies = [
[[package]]
name = "kebab-source-fs"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4487,7 +4488,7 @@ dependencies = [
[[package]]
name = "kebab-store-sqlite"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4508,7 +4509,7 @@ dependencies = [
[[package]]
name = "kebab-store-vector"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"arrow",
@@ -4532,7 +4533,7 @@ dependencies = [
[[package]]
name = "kebab-tui"
version = "0.16.0"
version = "0.16.1"
dependencies = [
"anyhow",
"crossterm",


@@ -31,7 +31,7 @@ edition = "2024"
rust-version = "1.85"
license = "MIT OR Apache-2.0"
repository = "https://github.com/altair823/kebab"
version = "0.16.0"
version = "0.16.1"
[workspace.dependencies]
anyhow = "1"


@@ -1286,6 +1286,64 @@ fn tier1_cpp_ingest_searchable() {
);
}
/// P10 dogfood regression: a k8s YAML with 2 documents (Deployment + Service
/// separated by `---`) must ingest without a UNIQUE constraint violation.
/// Before the fix, push_chunks_with_oversize emitted split_key=None for each
/// resource, giving every resource chunk the same id_hash → identical chunk_id
/// → SQLite UNIQUE constraint failure on the second resource.
#[test]
fn tier2_k8s_multi_resource_yaml_ingests_without_collision() {
let env = TestEnv::lexical_only();
let k8s_dir = env.workspace_root.join("k8s");
std::fs::create_dir_all(&k8s_dir).unwrap();
std::fs::write(
k8s_dir.join("k8s-multi.yaml"),
"apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: api\n namespace: prod\nspec:\n replicas: 2\n---\napiVersion: v1\nkind: Service\nmetadata:\n name: api\n namespace: prod\nspec:\n selector:\n app: api\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
// The bug: this would land in report with an error + UNIQUE constraint message.
let item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("k8s-multi.yaml"))
.expect("k8s-multi.yaml in report");
assert!(
item.error.is_none(),
"multi-resource k8s yaml must ingest without error, got: {:?}",
item.error
);
assert!(
matches!(item.kind, IngestItemKind::New),
"expected New, got {:?}",
item.kind
);
// Both resources must be searchable (≥2 hits: Deployment/prod/api + Service/prod/api).
let query = kebab_core::SearchQuery {
text: "api".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["yaml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
assert!(
hits.len() >= 2,
"expected ≥2 hits (Deployment + Service), got {}",
hits.len()
);
}
/// p10-3 fix regression: a shell file (direct Tier 3, not a fallback)
/// must also report Unchanged on re-ingest. Shell goes straight to
/// CodeTextParagraphV1Chunker so `stored_is_tier3_fallback` is false


@@ -16,12 +16,13 @@ tracing = { workspace = true }
serde_yaml = { workspace = true }
[dev-dependencies]
# kb-parse-md / kb-normalize are dev-only — used by the snapshot integration
# test to build a CanonicalDocument from a fixture Markdown file. Forbidden as
# regular deps per design §8 (chunker consumes CanonicalDocument from kb-core
# only); `cargo tree -p kb-chunk --depth 1` (default scope, excludes dev-deps)
# confirms this.
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-normalize = { path = "../kebab-normalize" }
serde_json = { workspace = true }
time = { workspace = true }
# kb-parse-md / kb-normalize / kb-parse-code are dev-only — used by the
# snapshot integration tests to build a CanonicalDocument from fixture files.
# Forbidden as regular deps per design §8 (chunker consumes CanonicalDocument
# from kb-core only); `cargo tree -p kb-chunk --depth 1` (default scope,
# excludes dev-deps) confirms this.
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-parse-code = { path = "../kebab-parse-code" }
kebab-normalize = { path = "../kebab-normalize" }
serde_json = { workspace = true }
time = { workspace = true }


@@ -43,6 +43,7 @@ impl Chunker for DockerfileFileV1Chunker {
"<dockerfile>",
"dockerfile",
VERSION_LABEL,
None,
)?;
tracing::debug!(


@@ -85,6 +85,7 @@ impl Chunker for K8sManifestResourceV1Chunker {
&symbol,
"yaml",
VERSION_LABEL,
Some(slice.line_start),
)?;
}


@@ -44,6 +44,7 @@ impl Chunker for ManifestFileV1Chunker {
"<manifest>",
lang,
VERSION_LABEL,
None,
)?;
tracing::debug!(


@@ -25,6 +25,13 @@ pub(crate) fn policy_hash(policy: &ChunkPolicy) -> String {
/// Emit one chunk for `(text, line_start..=line_end, symbol, lang)`, splitting
/// into line-windows of at most `AST_CHUNK_MAX_LINES` if the slice is oversize.
/// Mirrors the oversize path in `code_rust_ast_v1`'s `chunk` impl.
///
/// `base_split_key` is used as the `split_key` for the non-oversize single-chunk
/// case. Callers that emit multiple chunks from the same document (e.g.
/// `K8sManifestResourceV1Chunker` — one call per k8s resource) MUST pass
/// `Some(line_start)` so that each call produces a distinct `chunk_id`.
/// Single-chunk callers (dockerfile-file-v1, manifest-file-v1) pass `None` to
/// keep chunk_ids stable (no sibling can collide when there's only one chunk).
#[allow(clippy::too_many_arguments)]
pub(crate) fn push_chunks_with_oversize(
out: &mut Vec<Chunk>,
@@ -36,6 +43,7 @@ pub(crate) fn push_chunks_with_oversize(
symbol: &str,
lang: &str,
chunker_version: &str,
base_split_key: Option<u32>,
) -> Result<()> {
let n_lines = (line_end - line_start + 1).max(1);
let cv = ChunkerVersion(chunker_version.to_string());
@@ -51,7 +59,7 @@ pub(crate) fn push_chunks_with_oversize(
line_end,
symbol,
lang,
None,
base_split_key,
));
return Ok(());
}


@@ -1,11 +1,13 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative C++ code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_c_ast_v1.rs`'s
//! internal `code_doc` test helper.
//! Two complementary tests:
//! 1. `code_cpp_ast_chunks_snapshot` — hand-built `fixed_doc()` validates the
//! chunker's 1:1 mapping (design §6.3 / §8 boundary: no parse-code dep needed).
//! 2. `code_cpp_ast_extractor_snapshot` — invokes `CppAstExtractor` against the
//! real `tests/fixtures/sample.cpp` fixture, validating the extractor → chunker
//! end-to-end pipeline. `kebab-parse-code` is a dev-dep (same pattern as
//! `kebab-parse-md` in Markdown snapshot tests).
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
@@ -17,6 +19,7 @@ use kebab_core::{
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use kebab_parse_code::CppAstExtractor;
use serde_json::Value;
use time::OffsetDateTime;
@@ -134,6 +137,47 @@ fn fixed_policy() -> ChunkPolicy {
}
}
// ---------------------------------------------------------------------------
// Helper: run the real CppAstExtractor against tests/fixtures/sample.cpp
// ---------------------------------------------------------------------------
fn extract_cpp_fixture() -> CanonicalDocument {
use kebab_core::{
AssetId, AssetStorage, Checksum, ExtractConfig, ExtractContext, Extractor, RawAsset,
SourceUri, WorkspacePath,
};
use std::path::PathBuf;
let bytes = std::fs::read(fixtures_dir().join("sample.cpp")).expect("read sample.cpp fixture");
let src = String::from_utf8(bytes).expect("fixture is valid UTF-8");
let wp = WorkspacePath("tests/fixtures/sample.cpp".to_string());
let asset = RawAsset {
asset_id: AssetId("e".repeat(64)),
source_uri: SourceUri::File(PathBuf::from("tests/fixtures/sample.cpp")),
workspace_path: wp,
media_type: kebab_core::MediaType::Code("cpp".to_string()),
byte_len: src.len() as u64,
checksum: Checksum("f".repeat(64)),
discovered_at: time::OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from("tests/fixtures/sample.cpp"),
sha: Checksum("f".repeat(64)),
},
};
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CppAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
}
// ---------------------------------------------------------------------------
// Test 1 (hand-built): chunker-only 1:1 mapping validation
// ---------------------------------------------------------------------------
#[test]
fn code_cpp_ast_chunks_snapshot() {
let doc = fixed_doc();
@@ -198,3 +242,84 @@ fn code_cpp_ast_chunks_are_deterministic() {
assert_eq!(again, baseline);
}
}
// ---------------------------------------------------------------------------
// Test 2 (real extractor): end-to-end extractor → chunker pipeline
// ---------------------------------------------------------------------------
/// Validates that the real `CppAstExtractor` processes `sample.cpp` and
/// emits the expected set of symbols through the full chunker pipeline.
///
/// `sample.cpp` contains:
/// - `#include` directives + nested namespace `kebab::chunk` → glue + struct unit
/// - `class MdHeadingV1Chunker` with methods (ctor, dtor, chunk_doc, operator())
/// - `template <typename T> T identity(T value)` (template fn)
/// - `void kebab::global_helper()` (free fn in namespace)
/// - `int main()` (global free fn)
#[test]
fn code_cpp_ast_extractor_snapshot() {
let doc = extract_cpp_fixture();
// Verify the extractor emits all expected named units.
let block_syms: Vec<Option<String>> = doc.blocks.iter().filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => Some(symbol.clone()),
_ => None,
},
_ => None,
}).collect();
// Must include namespace-qualified class and its methods
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker")),
"class unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker")),
"ctor unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker")),
"dtor unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::chunk_doc")),
"chunk_doc unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::operator()")),
"operator() unit missing: {block_syms:?}"
);
// Template function (inside kebab::chunk namespace in the fixture)
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::identity")),
"identity template fn unit missing: {block_syms:?}"
);
// Free function in outer namespace
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::global_helper")),
"global_helper unit missing: {block_syms:?}"
);
// Global main
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("main")),
"main unit missing: {block_syms:?}"
);
}
/// End-to-end chunker output from real extractor is deterministic.
#[test]
fn code_cpp_ast_extractor_chunks_deterministic() {
let doc1 = extract_cpp_fixture();
let doc2 = extract_cpp_fixture();
assert_eq!(doc1.blocks, doc2.blocks, "extractor output non-deterministic");
let policy = fixed_policy();
let chunks1 = CodeCppAstV1Chunker.chunk(&doc1, &policy).unwrap();
let chunks2 = CodeCppAstV1Chunker.chunk(&doc2, &policy).unwrap();
assert_eq!(
chunks1.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
chunks2.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
"chunker output non-deterministic"
);
}


@@ -140,6 +140,17 @@ fn k8s_multi_doc_emits_one_chunk_per_resource() {
for chunk in &chunks {
assert_eq!(chunk.chunker_version.0, "k8s-manifest-resource-v1");
}
// Every chunk from a multi-resource file must have a distinct chunk_id.
// Without the fix, all non-oversize resources get split_key=None which
// collapses to the same id_hash (= base_policy_hash) → UNIQUE constraint
// violation on the second resource.
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"every k8s resource chunk must have a distinct chunk_id (multi-resource collision regression)"
);
}
/// A YAML document with an indentation error (tab in a space-indented context)


@@ -333,5 +333,260 @@ fn flush_glue(glue: &mut Vec<(u32, u32)>, units: &mut Vec<(String, u32, u32, boo
glue.clear();
}
// Tests for CAstExtractor (snapshot + unit assertions) are added in Task D
// alongside the C fixture file. This module is intentionally empty until then.
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
pub(crate) mod tests_support {
use kebab_core::*;
use std::path::PathBuf;
use time::OffsetDateTime;
pub fn fixed_code_asset(workspace_path: &str, lang: &str) -> RawAsset {
RawAsset {
asset_id: AssetId("a".repeat(64)),
source_uri: SourceUri::File(PathBuf::from(workspace_path)),
workspace_path: WorkspacePath(workspace_path.to_string()),
media_type: MediaType::Code(lang.to_string()),
byte_len: 0,
checksum: Checksum("b".repeat(64)),
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from(workspace_path),
sha: Checksum("b".repeat(64)),
},
}
}
pub fn extract_c(src: &str, path: &str) -> kebab_core::CanonicalDocument {
use super::CAstExtractor;
use kebab_core::Extractor;
let asset = fixed_code_asset(path, "c");
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
}
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn syms(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
doc.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => symbol.clone(),
_ => None,
},
_ => None,
})
.collect()
}
#[test]
fn extractor_supports_only_media_code_c() {
let e = CAstExtractor::new();
assert!(e.supports(&MediaType::Code("c".into())));
assert!(!e.supports(&MediaType::Code("cpp".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn c_extractor_simple_function() {
let src = "int add(int a, int b) { return a + b; }\n";
let doc = tests_support::extract_c(src, "x/math.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "add"), "got {s:?}");
}
#[test]
fn c_extractor_pointer_return_function() {
let src = "int *find(int *arr, int n) { return arr; }\n";
let doc = tests_support::extract_c(src, "x/find.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "find"), "ptr-return fn missing: {s:?}");
}
#[test]
fn c_extractor_static_function() {
let src = "static void helper(void) {}\n";
let doc = tests_support::extract_c(src, "x/helper.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "helper"), "static fn missing: {s:?}");
}
#[test]
fn c_extractor_extern_function() {
let src = "extern int compute(int x);\n";
// extern prototype is a declaration → glue
let doc = tests_support::extract_c(src, "x/compute.c");
let s = syms(&doc);
// declaration (prototype) falls into glue → "<module>"
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for extern proto: {s:?}"
);
}
#[test]
fn c_extractor_inline_function() {
let src = "inline int square(int x) { return x * x; }\n";
let doc = tests_support::extract_c(src, "x/square.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "square"), "inline fn missing: {s:?}");
}
#[test]
fn c_extractor_named_struct() {
let src = "struct Point { int x; int y; };\n";
let doc = tests_support::extract_c(src, "x/point.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Point"), "struct missing: {s:?}");
}
#[test]
fn c_extractor_named_enum() {
let src = "enum Color { RED, GREEN, BLUE };\n";
let doc = tests_support::extract_c(src, "x/color.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Color"), "enum missing: {s:?}");
}
#[test]
fn c_extractor_named_union() {
let src = "union Data { int i; float f; };\n";
let doc = tests_support::extract_c(src, "x/data.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Data"), "union missing: {s:?}");
}
#[test]
fn c_extractor_anonymous_struct_falls_into_glue() {
// Anonymous struct (no name field) → glue → "<module>" (only glue, no real unit)
let src = "struct { int x; int y; } origin;\n";
let doc = tests_support::extract_c(src, "x/anon.c");
let s = syms(&doc);
// anonymous struct is a declaration containing anonymous struct_specifier → glue
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for anon struct: {s:?}"
);
// Must NOT emit a unit named after anything else
assert!(
!s.iter().any(|x| x == "origin"),
"unexpected 'origin' unit: {s:?}"
);
}
#[test]
fn c_extractor_typedef_struct_falls_into_glue() {
// typedef struct { ... } Foo; — inner struct_specifier is anonymous,
// outer node is type_definition → glue. See HOTFIXES.md 2026-05-21.
let src = "typedef struct { int x; int y; } Point;\n";
let doc = tests_support::extract_c(src, "x/typedef.c");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for typedef struct: {s:?}"
);
// The typedef alias should NOT surface as a Code symbol
assert!(
!s.iter().any(|x| x == "Point"),
"unexpected 'Point' unit for typedef struct: {s:?}"
);
}
#[test]
fn c_extractor_preprocessor_directives_are_glue() {
let src = "#include <stdio.h>\n#define MAX 100\n#ifdef DEBUG\n#endif\n";
let doc = tests_support::extract_c(src, "x/macros.c");
let s = syms(&doc);
// Only preprocessor → no real unit → "<module>"
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for preproc-only file: {s:?}"
);
assert_eq!(s.len(), 1, "expected exactly 1 block: {s:?}");
}
#[test]
fn c_extractor_multiple_functions_correct_count() {
let src = "int foo(void) { return 1; }\nint bar(void) { return 2; }\nint baz(void) { return 3; }\n";
let doc = tests_support::extract_c(src, "x/multi.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "foo"), "foo missing: {s:?}");
assert!(s.iter().any(|x| x == "bar"), "bar missing: {s:?}");
assert!(s.iter().any(|x| x == "baz"), "baz missing: {s:?}");
assert_eq!(s.len(), 3, "expected 3 units: {s:?}");
}
#[test]
fn c_extractor_empty_file_produces_module() {
let src = "";
let doc = tests_support::extract_c(src, "x/empty.c");
let s = syms(&doc);
assert_eq!(s, vec!["<module>"], "expected <module>: got {s:?}");
}
#[test]
fn c_extractor_preprocessor_only_produces_module() {
let src = "#include <stdlib.h>\n#define VERSION \"1.0\"\n";
let doc = tests_support::extract_c(src, "x/header.c");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for preproc-only file: {s:?}"
);
}
#[test]
fn c_extractor_mixed_functions_and_glue() {
let src = r#"#include <stdio.h>
int compute(int x) {
return x * 2;
}
extern int lookup(int key);
void print_result(int v) {
printf("%d\n", v);
}
"#;
let doc = tests_support::extract_c(src, "x/mixed.c");
let s = syms(&doc);
// Two real functions + one glue block
assert!(s.iter().any(|x| x == "compute"), "compute missing: {s:?}");
assert!(s.iter().any(|x| x == "print_result"), "print_result missing: {s:?}");
assert!(
s.iter().any(|x| x == "<top-level>"),
"<top-level> glue missing: {s:?}"
);
}
#[test]
fn c_extractor_deterministic_across_runs() {
let src = r#"
struct Node { int val; };
int sum(int a, int b) { return a + b; }
void noop(void) {}
"#;
let a = tests_support::extract_c(src, "x/det.c");
for _ in 0..20 {
assert_eq!(
tests_support::extract_c(src, "x/det.c").blocks,
a.blocks
);
}
}
}


@@ -14,6 +14,34 @@ historical contract that was implemented; this file accumulates the
deltas so phase 5+ readers can find the live behavior without diffing
git history.
## 2026-05-21 — p10-2: k8s multi-resource YAML chunk_id collision
**Origin**: P10 comprehensive dogfooding (`/tmp/kebab-p10-dogfood/`, 16 files). A YAML file containing 2+ k8s documents (Deployment + Service, separated by `---`) failed to ingest.
**Symptom**: `DocumentStore::put_chunks (code): UNIQUE constraint failed: chunks.chunk_id`. The document row is created but with 0 chunks, so the file is unsearchable. The p10-2 integration test `tier2_k8s_yaml_ingest_searchable` used only a single-Deployment fixture, so it did not catch this.
**Root cause**: the non-oversize branch of `tier2_shared::push_chunks_with_oversize` hardcoded `split_key = None`. `K8sManifestResourceV1Chunker` calls it once per resource; every resource in the same document shares `doc_id` + `chunker_version` + `base_policy_hash`, and with `split_key = None` they all get the same `id_hash` and therefore the same `chunk_id`. p10-3's `code_text_paragraph_v1` had the same bug and was fixed in `df3c5b8`, but that path calls `build_chunk_no_symbol` directly; the `push_chunks_with_oversize` path was left unfixed.
**Fix** (PR #158, v0.16.1): added `base_split_key: Option<u32>` to `push_chunks_with_oversize`. The k8s chunker passes `Some(resource.line_start)`, so each resource gets a distinct chunk_id. dockerfile / manifest pass `None` (1 chunk per file, no collision, chunk_id unchanged).
**Deviation note**: the chunk_id of a single-resource k8s YAML also changes, because its split key goes `None → Some(1)` (`id_hash`: `base_policy_hash` → `base_policy_hash#L1`). `chunker_version` (`k8s-manifest-resource-v1`) is intentionally not bumped: p10-2 was merged in v0.14.0 (~1 week ago) and is still in the dogfood stage, so there is no prod KB. A KB that indexed single-resource k8s files between v0.14.0 and v0.16.0 may be left with orphaned old chunks on re-ingest (not a UNIQUE collision, since the ids differ), but `kebab reset` or a re-ingest sweep cleans them up. Because this is a dogfood-only stage, this is a lighter choice than a chunker_version bump (full re-process).
Cross-link: `tasks/p10/p10-2-tier2-resource-aware.md` Risks/notes section.
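
The orphaning described in the deviation note can be sketched as a set difference. This is an illustration only: `orphaned_ids` is a hypothetical helper, not kebab's sweep implementation, and the id strings merely follow the `base_policy_hash` / `base_policy_hash#L1` shapes mentioned in this entry.

```rust
use std::collections::HashSet;

/// Ids that exist in the store but are no longer produced by the chunker.
/// Hypothetical helper for illustration; sorted for stable output.
fn orphaned_ids(stored: &HashSet<String>, fresh: &HashSet<String>) -> Vec<String> {
    let mut orphans: Vec<String> = stored.difference(fresh).cloned().collect();
    orphans.sort();
    orphans
}

fn main() {
    // Indexed before the fix: split_key = None, so no "#L1" suffix.
    let stored: HashSet<String> = ["abc".to_string()].into_iter().collect();
    // Re-ingested after the fix: split_key = Some(1).
    let fresh: HashSet<String> = ["abc#L1".to_string()].into_iter().collect();

    // The old id differs from the new one, so there is no UNIQUE collision;
    // the old row is simply left behind until something removes it.
    let orphans = orphaned_ids(&stored, &fresh);
    assert_eq!(orphans, vec!["abc".to_string()]);
    println!("orphans to sweep: {orphans:?}");
}
```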
## 2026-05-21 — p10-1D: typedef-wrapped struct/enum in C falls into glue
**Origin**: PR #156 (p10-1d) code-reviewer review. Verified during dogfood.
**Symptom**: `typedef struct { ... } Foo;` in a `.c` file does NOT emit a struct-level unit. tree-sitter-c classifies the construct as a top-level `type_definition` with an *anonymous* inner `struct_specifier` (no `name` field), so the extractor's `struct_specifier` arm doesn't fire — the whole declaration falls into `<top-level>` glue. The named typedef alias `Foo` is therefore not searchable as a symbol.
**Status**: Consistent with spec p10-1d-c-cpp-ast-chunker.md's Risks/notes ("Anonymous union / struct … anonymous → glue"), but the spec's main body line 22 ("struct_specifier (named, top-level) → 1 unit") suggests this idiom WOULD emit. Tension noted, not yet fixed.
**Workaround**: search the struct by its field/function names, or use `--code-lang c` to broaden scope. Typedef-aliased struct names won't surface as `Citation::Code.symbol`.
**Next step**: dogfood real C code for a week+; if this turns out to be a frequent pain point (kernel-style code, libuv, etc.), revisit the extractor to detect `type_definition` → inner `struct_specifier` and emit a synthetic unit named after the typedef alias.
Cross-link: `tasks/p10/p10-1d-c-cpp-ast-chunker.md` Risks/notes section.
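
A toy model of the behavior and of the possible follow-up described under **Next step**. Everything here is hypothetical: the `TopLevel` enum is a simplified stand-in, not tree-sitter's node API, and `proposed_unit` is only one way the synthetic unit could be named.

```rust
/// Simplified stand-in for the top-level C constructs discussed above.
enum TopLevel {
    /// `struct Point { ... };`
    NamedStruct(String),
    /// `struct { ... } origin;`
    AnonymousStructVar,
    /// `typedef struct { ... } Foo;`
    TypedefAnonStruct { alias: String },
}

/// Behavior as documented in this entry: only a named struct becomes a
/// unit; `None` means the declaration falls into glue.
fn current_unit(node: &TopLevel) -> Option<String> {
    match node {
        TopLevel::NamedStruct(name) => Some(name.clone()),
        TopLevel::AnonymousStructVar | TopLevel::TypedefAnonStruct { .. } => None,
    }
}

/// Sketch of the possible follow-up: name a synthetic unit after the
/// typedef alias, leaving plain anonymous structs as glue.
fn proposed_unit(node: &TopLevel) -> Option<String> {
    match node {
        TopLevel::NamedStruct(name) => Some(name.clone()),
        TopLevel::TypedefAnonStruct { alias } => Some(alias.clone()),
        TopLevel::AnonymousStructVar => None,
    }
}

fn main() {
    let named = TopLevel::NamedStruct("Point".to_string());
    let anon = TopLevel::AnonymousStructVar;
    let typedef = TopLevel::TypedefAnonStruct { alias: "Foo".to_string() };

    // Named structs are units either way; anonymous variables stay glue.
    assert_eq!(current_unit(&named), proposed_unit(&named));
    assert_eq!(proposed_unit(&anon), None);
    // Today the alias is lost; the sketch would surface it as a symbol.
    assert_eq!(current_unit(&typedef), None);
    assert_eq!(proposed_unit(&typedef), Some("Foo".to_string()));
}
```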
## 2026-05-20 — p10-1B: Rust 1A-2 symbol path is file-scope-only; 1B+ uses workspace path → module prefix
**What changed**: the symbols generated by P10-1A-2's Rust `code-rust-ast-v1` chunker use only file-scope mod-path nesting (e.g. `Foo::double`). From P10-1B onward, Python / TypeScript / JavaScript symbols include a module-path prefix derived from the workspace path (e.g. `kebab_eval.metrics.compute_mrr`, `src/Foo.Foo.search`).


@@ -113,6 +113,7 @@ crates/kebab-parse-code/Cargo.toml [edit]: new entries for the 2 deps above.
- **Template specialization** (`template<> class Foo<int>`): only the `class_specifier` name inside tree-sitter-cpp's `template_declaration` is extracted, so only `Foo` goes into the symbol and `<int>` is not included. Consistent with the design's rule of ignoring generics.
- **fn inside an `extern "C"` block**: handled as a regular fn. The outer wrapping block is glue.
- **Anonymous union / struct** (`struct { int x; }` inside a variable declaration): uncommon, and only named ones become units. Anonymous ones are glue.
- **typedef-wrapped struct/enum idiom** (`typedef struct { ... } Foo;`): anonymous inner struct → glue. The named typedef alias is not captured. To be reviewed via HOTFIXES after dogfooding. See [HOTFIXES.md 2026-05-21 entry](../HOTFIXES.md).
- **Macro-heavy code** (Linux kernel, etc.): even when a `#define FOO(x) ...` macro is function-like, the parser does not recognize it as a fn. It is handled as preprocessor glue, so no symbol is captured. This is intended behavior (the parser does no macro expansion).
- **`__attribute__((...))`** annotations: tree-sitter-c's attribute node is a sibling next to the declarator. It can be ignored and does not affect function name extraction.
- **Fixture size**: sample.c is ~30 lines (top-level fn + struct + enum + preprocessor), sample.cpp is ~50 lines (nested namespace + class + method + template + free fn). Separate verification of the oversize fallback is already covered by 1A-2's long_section_snapshot pattern (separate fixture if needed).


@@ -118,3 +118,4 @@ _ → skip (placeholder for the p10-3 fallback)
- **`pom.xml` aggregate parent POM**: very large (hundreds to thousands of lines). Split via the oversize fallback. Verify once with a huge fixture.
- **`media.rs` cleanup**: replace the inline `match extension` duplication accumulated since 1A-1 with calls to `code_lang_for_path`. Existing unit-test behavior is preserved (the tests only look at result values, so they should pass).
- **Post-merge deviations** go in the dated log in `tasks/HOTFIXES.md`, plus a one-line cross-link in this spec's `Risks / notes`.
- **[HOTFIXES 2026-05-21]** multi-resource k8s YAML (2+ documents) failed to ingest due to a `chunk_id` collision: the non-oversize branch of `push_chunks_with_oversize` hardcoded `split_key = None`. Fixed in PR #158 (v0.16.1) with the `base_split_key` parameter. See the `tasks/HOTFIXES.md` 2026-05-21 entry.