Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.
Tier 1/2 fallback wrapper lands in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.
4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec p10-2 risks section calls out "거대 ConfigMap" but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).
Reviewer nit #1 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier comment claimed the function "mirrors RustAstExtractor pattern" but
the two differ: RustAstExtractor joins ctx.workspace_root to handle relative
paths, while Tier 2 trusts FsSourceConnector's absolute-path invariant.
Rephrase to document the actual rationale + the Kb URI fallback.
Reviewer nit #3 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
p10-1C-JK 이후 누락된 code-java-ast-v1 / code-kotlin-ast-v1 + p10-2 의
k8s-manifest-resource-v1 / dockerfile-file-v1 / manifest-file-v1 추가.
표기 단순화를 위해 code-* 는 brace 묶음.
Reviewer nit #2 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-visible surface sync per the docs-split rule:
- README adds Tier 2 langs (yaml / dockerfile / toml / json / xml / groovy / go-mod) to the ingest支援 list and --code-lang options.
- HANDOFF flips p10-2 phase row to ✅ (v0.14.0) and updates the next-task candidates.
- ARCHITECTURE extends crates/kebab-chunk/src/ tree with k8s_manifest_resource_v1.rs / dockerfile_file_v1.rs / manifest_file_v1.rs / tier2_shared.rs, plus a Tier 2 note on the code-parser row and flowchart node.
- SMOKE adds a Tier 2 smoke walkthrough (k8s yaml + Dockerfile + Cargo.toml ingest + --code-lang search) and a P10-2 entry in the verification checklist.
- tasks/INDEX + tasks/p10/INDEX flip p10-2 to ✅ (v0.14.0).
Workspace test gate (-j 1) + clippy --workspace pass cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds yaml / dockerfile / toml / json / xml / groovy / go-mod arms to the
existing 7-arm AST match. parser_version unified to "none-v1" for Tier 2.
synthesize_tier2_document builds a minimal Document (single Block::Code
with raw file text) since Tier 2 has no parse step. allowlist in
ingest_one_asset extended to admit Tier 2 langs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reads entire Dockerfile / Dockerfile.* / *.dockerfile content and emits a
single Chunk with symbol "<dockerfile>", code_lang "dockerfile", line range
1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
tier2_shared::push_chunks_with_oversize.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits multi-document YAML by ^---\s*$, requires apiVersion + kind string
fields per document, emits 1 chunk per recognized k8s resource. Symbol =
<kind>/<namespace>/<name> or <kind>/<name> (cluster-scoped). Invalid YAML
returns 0 chunks (handled by p10-3 paragraph fallback). Oversize >200 lines
splits into line-windows sharing the same symbol.
tier2_shared module hosts the oversize fallback + Chunk-construction helper
mirroring code_rust_ast_v1's Chunk shape. Task E (dockerfile) and Task F
(manifest) will reuse it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces 1A-1 era inline match block with a single call to
kebab_parse_code::code_lang_for_path, per design §3.5 single-source-of-truth
rule. Adds Tier 2 routing test (yaml / dockerfile / toml / json / xml /
groovy / go-mod) and preserves all non-code extension branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frozen contract for the p10-2 single PR: 3 chunker activation, k8s
identification via apiVersion+kind, Dockerfile/manifest basename matching,
code_lang_for_path source-of-truth consolidation, frozen design §3.5 +
§10.1 deltas, and version bump 0.13.0 → 0.14.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code_go_ast_snapshot pattern. In-memory CanonicalDocument (no
kebab-parse-code dep — boundary §6.3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-java-ast-v1 with language-agnostic body unchanged. Cross-
chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses tree-sitter-kotlin-ng (bare tree-sitter-kotlin is stuck on tree-sitter
0.21-0.23, incompatible with our 0.26). Mirrors JavaAstExtractor (JVM family,
source-side package extraction + class-nesting) with Kotlin grammar quirks:
- Root is `source_file`, not `program`.
- `package_header` child is `qualified_identifier` (its slice text is the
dotted path); the bare `identifier` shape is also accepted as a fallback.
- `class_declaration` is the single node kind for `class` / `data class` /
`sealed class` / `interface` / `enum class` — distinguished only by its
`modifiers` child. Body is `class_body` for non-enum, `enum_class_body`
for enum class; neither carries a `body` field name, so the extractor
looks the body up by node kind rather than `child_by_field_name("body")`.
- `companion_object` is its own node kind (NOT object_declaration with a
modifier); its `name` field is optional, so the extractor fills in the
implicit Kotlin convention name `Companion`.
- `function_declaration` is allowed at top level (unlike Java), emitted as
`<pkg>.<fn_name>`; the same node kind nested in `class_body` becomes
`<pkg>.<...>.<Class>.<method>` via the same mod_path mechanism.
- `secondary_constructor` has no `name` field; symbol uses the enclosing
class name (Java duplication convention: `<pkg>.<...>.<Class>.<Class>`).
- Enum bodies (`enum_class_body`) are NOT recursed — `enum_entry` is not
emitted as a unit (matches Java 1차 scope).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-rust-ast-v1 / code-go-ast-v1 with language-agnostic body
unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bare tree-sitter-kotlin v0.3.8 requires tree-sitter >=0.21,<0.23 which
conflicts with the workspace's tree-sitter 0.26 (links = "tree-sitter"
is a singleton). tree-sitter-kotlin-ng v1.1.0 (from
tree-sitter-grammars/tree-sitter-kotlin) uses the tree-sitter-language
0.1 shim which is compatible with tree-sitter 0.26. Using
tree-sitter-kotlin-ng as the Kotlin grammar crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds
go_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=go, symbol=chunk.ParseDoc, code_lang=go.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-rust-ast-v1 / code-{python,ts,js}-ast-v1 with language-agnostic
body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
assets.workspace_path is INTENTIONALLY 'last-registered path' for twin
files (identical content at different paths share one asset row PK'd by
blake3 content hash). PR #146 made try_skip_unchanged document-centric;
PR #149 made reset --orphans-only document-centric; this PR removes the
last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span,
which used it to reject PDF/audio media — for twins this could read the
wrong asset's media_type and pick the wrong branch).
Replaced with the natural 2-step lookup: get_document_by_workspace_path
(PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id
is PRIMARY KEY so flip-flop-immune by construction).
Then removed get_asset_by_workspace_path trait method + SqliteStore impl
— 0 callers after the refactor.
UPSERT doc-comment refreshed in store.rs to make the 'last-registered'
semantics explicit so future readers don't try to 'fix' the flip-flop.
Dogfood follow-up (PR #142 1B + multi-root corpus).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>