P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:

    UNIQUE constraint failed: chunks.chunk_id

Root cause: the non-oversize branch of tier2_shared::push_chunks_with_oversize
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource, so with split_key = None every resource from the same document gets
the same id_hash (= base_policy_hash) and therefore an identical chunk_id.
p10-3's code_text_paragraph_v1 had the same bug (fixed in df3c5b8), but it
calls build_chunk_no_symbol directly, so the push_chunks_with_oversize path
was never fixed.

Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. The k8s chunker passes
Some(resource.line_start) so each resource gets a distinct chunk_id; the
dockerfile and manifest chunkers pass None (one chunk per file, so there is
no sibling collision and chunk_id stays stable).
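
A minimal sketch of the collision and the fix, for illustration only. The
function names, the "doc-1" / "policy-abc" inputs, and the use of
DefaultHasher are assumptions made for this example; the real chunk_id
derivation in tier2_shared is not shown in this commit and may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the real chunk_id derivation: hashes the
/// document id, the base policy hash, and an optional split key.
fn sketch_chunk_id(doc_id: &str, base_policy_hash: &str, split_key: Option<u32>) -> u64 {
    let mut hasher = DefaultHasher::new();
    doc_id.hash(&mut hasher);
    base_policy_hash.hash(&mut hasher);
    split_key.hash(&mut hasher);
    hasher.finish()
}

/// Computes one id per resource in the same document.
/// With `use_line_start == false` every resource gets split_key = None
/// (the old behaviour); with `true` each resource is keyed by its
/// starting line (the behaviour after the fix).
fn ids_for_resources(line_starts: &[u32], use_line_start: bool) -> Vec<u64> {
    line_starts
        .iter()
        .map(|&line_start| {
            let split_key = if use_line_start { Some(line_start) } else { None };
            sketch_chunk_id("doc-1", "policy-abc", split_key)
        })
        .collect()
}
```

Calling ids_for_resources(&[1, 20], false) yields two equal ids, which is the
shape of the UNIQUE constraint failure; passing true keys each id by the
resource's starting line, so the two ids differ.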

Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
that chunk_ids are distinct, and a new integration test,
tier2_k8s_multi_resource_yaml_ingests_without_collision, ingests a real
2-document YAML end-to-end.
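
A distinctness assertion of this kind could look roughly like the helper
below. This is a sketch, not the code of the tests named above: the helper
name is hypothetical, and chunk ids are modelled as plain string slices.

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns true when no id appears more than once.
fn all_chunk_ids_distinct(ids: &[&str]) -> bool {
    // A set drops duplicates, so equal lengths mean every id was unique.
    let unique: HashSet<&str> = ids.iter().copied().collect();
    unique.len() == ids.len()
}
```

A test would collect the chunk_id of every emitted chunk and assert that the
helper returns true; two resources sharing an id would make it return false.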

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
59 lines
1.7 KiB
Rust
//! p10-2: dockerfile whole-file chunker (Tier 2).
//!
//! Reads entire Dockerfile content and emits a single Chunk with symbol
//! "<dockerfile>", code_lang "dockerfile", line range 1..EOF.
//! Oversize >200 lines splits into line-windows sharing the symbol via
//! tier2_shared::push_chunks_with_oversize.

use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};

pub const VERSION_LABEL: &str = "dockerfile-file-v1";

#[derive(Clone, Copy, Debug, Default)]
pub struct DockerfileFileV1Chunker;

impl Chunker for DockerfileFileV1Chunker {
    fn chunker_version(&self) -> ChunkerVersion {
        ChunkerVersion(VERSION_LABEL.to_string())
    }

    fn policy_hash(&self, policy: &ChunkPolicy) -> String {
        policy_hash(policy)
    }

    fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
        // Expect a single Block::Code carrying the full Dockerfile text.
        let text = match doc.blocks.first() {
            Some(Block::Code(cb)) => cb.code.as_str(),
            _ => return Ok(vec![]),
        };

        let total_lines = text.lines().count().max(1) as u32;
        let mut chunks = Vec::new();

        push_chunks_with_oversize(
            &mut chunks,
            doc,
            policy,
            text,
            1, // first line of the range (1..EOF)
            total_lines, // last line of the range
            "<dockerfile>", // symbol
            "dockerfile", // code_lang
            VERSION_LABEL,
            None, // base_split_key: one chunk per file, so no sibling collision
        )?;

        tracing::debug!(
            target: "kebab-chunk",
            doc_id = %doc.doc_id,
            chunks = chunks.len(),
            "dockerfile-file-v1 chunked",
        );

        Ok(chunks)
    }
}