Compare commits

55 Commits

Author SHA1 Message Date
7a90df1485 feat(p10-3): Tier 3 paragraph + line-window fallback chunker - shell direct + Tier 1/2 0-chunk/Err picked up automatically (#155) 2026-05-21 12:27:18 +00:00
46f408dc0f chore: bump version 0.14.0 → 0.15.0 (p10-3 Tier 3 paragraph fallback)
Minor bump — additive new chunker_version "code-text-paragraph-v1" + new
routing lang "shell" + new Tier 1/2 → Tier 3 fallback wrapper behavior.
No DB migration, no wire schema major bump (Citation::Code.lang values
remain a free string field).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:05:53 +00:00
49e60fb314 docs(p10-3): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
- README adds Tier 3 to the ingest row (shell + fallback) and the Mermaid
  chunker enumeration; --code-lang shell admitted.
- HANDOFF flips p10-3 to ✅ (v0.15.0) and updates the one-line summary + next
  candidates.
- ARCHITECTURE adds Tier 3 to the code-parser row, extends the flowchart
  pcode node, and lists code_text_paragraph_v1.rs in the chunker tree.
- SMOKE adds a P10-3 walkthrough (shell + non-k8s YAML fallback) and a
  verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-3 to ✅.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:43:38 +00:00
6bc7a83d3c docs(p10-3): activate Tier 3 in frozen design §10.1
Add p10-3 activation log entry for Tier 3 paragraph fallback chunker
(code-text-paragraph-v1) with shell direct routing and fallback wrapper
for invalid YAML / AST failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:39:49 +00:00
df3c5b8caf test(p10-3): integration smoke tests for Tier 3 (shell + yaml fallback)
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
  Citation::Code { symbol: None, lang: "shell" }, chunker_version
  "code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
  (no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
  retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
  chunker_version "code-text-paragraph-v1".

Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.

Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:37:44 +00:00
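The split_key fix described in the commit above can be illustrated with a small sketch. The id scheme here is hypothetical: `chunk_id`, `Paragraph`, and the two helper functions are illustration-only names, and the `#L{k}` suffix is borrowed from the suffix mentioned elsewhere in this history rather than copied from the real `id_for_chunk`. The point is only that paragraphs sharing a document and carrying `split_key = None` collapse onto one id, while keying every paragraph by its start line keeps ids distinct.

```rust
/// A paragraph as an inclusive, 1-based line range (illustrative type).
pub struct Paragraph {
    pub line_start: usize,
    pub line_end: usize,
}

/// Hypothetical id scheme: the document id plus an optional `#L{k}` suffix.
/// The real implementation may hash more inputs; this is an assumption.
pub fn chunk_id(doc_id: &str, split_key: Option<usize>) -> String {
    match split_key {
        Some(line_start) => format!("{doc_id}#L{line_start}"),
        None => doc_id.to_string(),
    }
}

/// Before the fix: only oversize paragraphs (> 80 lines) received a split_key,
/// so every short paragraph of one document produced the same id.
pub fn ids_before_fix(doc_id: &str, paragraphs: &[Paragraph]) -> Vec<String> {
    paragraphs
        .iter()
        .map(|p| {
            let len = p.line_end - p.line_start + 1;
            let split_key = if len > 80 { Some(p.line_start) } else { None };
            chunk_id(doc_id, split_key)
        })
        .collect()
}

/// After the fix: every paragraph is keyed by its start line.
pub fn ids_after_fix(doc_id: &str, paragraphs: &[Paragraph]) -> Vec<String> {
    paragraphs
        .iter()
        .map(|p| chunk_id(doc_id, Some(p.line_start)))
        .collect()
}
```

Because paragraphs never overlap, their start lines are unique within a document, which is what makes `line_start` a safe disambiguator.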
5051ea7534 feat(p10-3): Tier 1/2 → Tier 3 fallback wrapper in ingest_one_code_asset
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.

Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly (extract_fell_back guard).

shell direct path + Tier 2 extract synthesize_tier2_document failures
are exempted from the fallback chain (they ARE Tier 3 already, or the
error is real).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:32:49 +00:00
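A minimal sketch of the fallback rule in the commit above, assuming simplified stand-in types. `Chunk`, `Chunked`, and `chunk_with_fallback` are hypothetical names, and errors are plain strings; the actual `ingest_one_code_asset` is not shown in this history. What the sketch keeps from the commit message is the trigger (an empty `Ok` or an `Err`) and the version swap to `"code-text-paragraph-v1"` / `"none-v1"` on retry.

```rust
/// Simplified stand-in for a chunk (the real Chunk carries more fields).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub text: String,
}

/// Chunks plus the version labels that downstream stamping would record.
#[derive(Debug, PartialEq, Eq)]
pub struct Chunked {
    pub chunks: Vec<Chunk>,
    pub chunker_version: &'static str,
    pub parser_version: &'static str,
}

/// Keep the primary (Tier 1/2) result when it produced chunks; otherwise
/// retry with the Tier 3 chunker and swap both version labels.
pub fn chunk_with_fallback(
    primary: Result<Vec<Chunk>, String>,
    primary_chunker_version: &'static str,
    primary_parser_version: &'static str,
    tier3: impl FnOnce() -> Result<Vec<Chunk>, String>,
) -> Result<Chunked, String> {
    match primary {
        Ok(chunks) if !chunks.is_empty() => Ok(Chunked {
            chunks,
            chunker_version: primary_chunker_version,
            parser_version: primary_parser_version,
        }),
        // Ok(empty) (e.g. non-k8s YAML) or Err (e.g. AST failure): retry.
        _ => Ok(Chunked {
            chunks: tier3()?,
            chunker_version: "code-text-paragraph-v1",
            parser_version: "none-v1",
        }),
    }
}
```

Swapping the labels together with the chunks matters because, per the commit message, the skip-unchanged check compares stored versions; a mismatch would presumably cause needless re-ingests.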
88d7fbc182 feat(p10-3): activate shell direct routing through Tier 3 chunker
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.

Tier 1/2 fallback wrapper lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:28:41 +00:00
0b7d8af759 feat(p10-3): code-text-paragraph-v1 chunker — paragraph + line-window fallback
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.

4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:22:48 +00:00
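The segmentation rule in the commit above can be sketched as two small functions. This is not the code of `CodeTextParagraphV1Chunker`: `LineRange`, `paragraphs`, and `windows` are illustration-only names, and the handling of a short final window is an assumption (the commit only states the 200-line case). The constants follow the commit message: 80-line windows, 20-line overlap, stride 60.

```rust
/// Inclusive, 1-based line range.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LineRange {
    pub start: usize,
    pub end: usize,
}

const WINDOW: usize = 80;
const OVERLAP: usize = 20;
const STRIDE: usize = WINDOW - OVERLAP; // 60

/// Group consecutive non-blank lines into paragraphs. Whitespace-only lines
/// act as boundaries and never fall inside any returned range.
pub fn paragraphs(text: &str) -> Vec<LineRange> {
    let mut out = Vec::new();
    let mut current: Option<LineRange> = None;
    for (idx, line) in text.lines().enumerate() {
        let n = idx + 1;
        if line.trim().is_empty() {
            if let Some(p) = current.take() {
                out.push(p);
            }
        } else {
            current = Some(match current {
                Some(p) => LineRange { start: p.start, end: n },
                None => LineRange { start: n, end: n },
            });
        }
    }
    if let Some(p) = current {
        out.push(p);
    }
    out
}

/// Split a paragraph longer than WINDOW lines into overlapping windows.
/// Assumption: the last window is simply clamped to the paragraph end.
pub fn windows(p: LineRange) -> Vec<LineRange> {
    if p.end - p.start + 1 <= WINDOW {
        return vec![p];
    }
    let mut out = Vec::new();
    let mut start = p.start;
    loop {
        let end = (start + WINDOW - 1).min(p.end);
        out.push(LineRange { start, end });
        if end == p.end {
            break;
        }
        start += STRIDE;
    }
    out
}
```

For a 200-line paragraph this yields 1-80, 61-140, and 121-200, matching the ranges asserted by the unit test described in the commit.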
9342b9543f refactor(p10-3): expose tier2_shared::build_chunk as pub(crate)
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:17:51 +00:00
a8aa03042f docs(p10-3): implementation plan (9 tasks A-I, subagent-driven)
Tasks: tier2_shared visibility upgrade / Tier 3 chunker + 4 unit tests /
shell direct routing / Tier 1/2 fallback wrapper / 2 smoke tests / frozen
design §10.1+§10 / docs sync (6 files) / workspace test gate / version
bump 0.14.0→0.15.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:16:55 +00:00
9d4a60aac5 docs(p10-3): task spec for Tier 3 paragraph + line-window fallback chunker
Frozen contract for p10-3 single PR: code-text-paragraph-v1 chunker
(blank-line paragraph split + 80-line/20-overlap line-window for oversize),
shell direct routing, Tier 1/2 fallback wrapper (0-chunk or Err → Tier 3
retry with chunker_version + parser_version swap), tier2_shared::build_chunk
pub(crate) exposure, frozen design §10.1 + §10 deltas, version bump
0.14.0 → 0.15.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:55:16 +00:00
8ce7a911ee chore(p10-2-followup): reviewer nit cleanup: Mermaid + comments + oversize test (#154) 2026-05-20 14:44:39 +00:00
75c1c7b911 test(p10-2-followup): cover tier2_shared oversize fallback with >200-line k8s ConfigMap
Spec p10-2 risks section calls out "huge ConfigMap" but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).

Reviewer nit #1 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:41:16 +00:00
b5c12ecb6f docs(p10-2-followup): clarify synthesize_tier2_document path resolution comment
Earlier comment claimed the function "mirrors RustAstExtractor pattern" but
the two differ: RustAstExtractor joins ctx.workspace_root to handle relative
paths, while Tier 2 trusts FsSourceConnector's absolute-path invariant.
Rephrase to document the actual rationale + the Kb URI fallback.

Reviewer nit #3 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:39:02 +00:00
a1192ce3b2 docs(p10-2-followup): README Mermaid chunker_version list — Java/Kotlin + Tier 2
Add code-java-ast-v1 / code-kotlin-ast-v1, missing since p10-1C-JK, plus
p10-2's k8s-manifest-resource-v1 / dockerfile-file-v1 / manifest-file-v1.
The code-* entries are grouped in braces to simplify the notation.

Reviewer nit #2 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:35:20 +00:00
17ee400fd5 feat(p10-2): Tier 2 resource-aware chunkers (k8s + Dockerfile + manifest): enable resource files beyond code indexing (#153) 2026-05-20 14:22:55 +00:00
217dddb4ba chore: bump version 0.13.0 → 0.14.0 (p10-2 Tier 2 resource-aware)
Minor bump — additive code_lang values (xml / groovy / go-mod) + 3 new
chunker_version labels (k8s-manifest-resource-v1 / dockerfile-file-v1 /
manifest-file-v1) + frozen design §3.5 / §10.1 deltas. No DB migration,
no wire schema major bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:14:38 +00:00
308666dbd5 docs(p10-2): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync + tasks/p10/INDEX
User-visible surface sync per the docs-split rule:
- README adds Tier 2 langs (yaml / dockerfile / toml / json / xml / groovy / go-mod) to the ingest support list and --code-lang options.
- HANDOFF flips p10-2 phase row to ✅ (v0.14.0) and updates the next-task candidates.
- ARCHITECTURE extends crates/kebab-chunk/src/ tree with k8s_manifest_resource_v1.rs / dockerfile_file_v1.rs / manifest_file_v1.rs / tier2_shared.rs, plus a Tier 2 note on the code-parser row and flowchart node.
- SMOKE adds a Tier 2 smoke walkthrough (k8s yaml + Dockerfile + Cargo.toml ingest + --code-lang search) and a P10-2 entry in the verification checklist.
- tasks/INDEX + tasks/p10/INDEX flip p10-2 to ✅ (v0.14.0).

Workspace test gate (-j 1) + clippy --workspace pass cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:10:13 +00:00
522ae7b8bc docs(p10-2): activate Tier 2 in code-ingest design §10.1 + §3.5 mappings
§3.5: add code_lang_for_path mappings xml / groovy / go-mod.
§10.1: add deactivation log entry for p10-2 (3 Tier 2 chunkers active).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:24:16 +00:00
166e1ddfaf test(p10-2): integration smoke tests for Tier 2 (k8s yaml + Dockerfile + Cargo.toml)
Three new tests in code_ingest_smoke.rs verifying isolated-TempDir ingest +
--code-lang filter + Citation::Code.lang / .symbol shape for each Tier 2
chunker. Brings the suite to 12 tests (Rust 3 + Python 1 + TS 1 + JS 1 +
Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:23:01 +00:00
226ce8b744 feat(p10-2): activate Tier 2 chunkers in ingest_one_code_asset dispatch
Adds yaml / dockerfile / toml / json / xml / groovy / go-mod arms to the
existing 7-arm AST match. parser_version unified to "none-v1" for Tier 2.
synthesize_tier2_document builds a minimal Document (single Block::Code
with raw file text) since Tier 2 has no parse step. allowlist in
ingest_one_asset extended to admit Tier 2 langs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:19:54 +00:00
22d4161728 feat(p10-2): manifest-file-v1 chunker (whole-file 1 chunk, symbol <manifest>)
Emits 1 Chunk per manifest file (Cargo.toml / pyproject.toml / package.json /
tsconfig.json / pom.xml / build.gradle / go.mod). Symbol unified to
"<manifest>"; manifest type distinguished by code_lang (toml / json / xml /
groovy / go-mod) read from Block::Code.lang. Oversize >200 lines splits via
tier2_shared::push_chunks_with_oversize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:11:46 +00:00
51004ac593 feat(p10-2): dockerfile-file-v1 chunker (whole-file 1 chunk, symbol <dockerfile>)
Reads entire Dockerfile / Dockerfile.* / *.dockerfile content and emits a
single Chunk with symbol "<dockerfile>", code_lang "dockerfile", line range
1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
tier2_shared::push_chunks_with_oversize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:09:13 +00:00
8996e73282 feat(p10-2): k8s-manifest-resource-v1 chunker + tier2_shared helper
Splits multi-document YAML by ^---\s*$, requires apiVersion + kind string
fields per document, emits 1 chunk per recognized k8s resource. Symbol =
<kind>/<namespace>/<name> or <kind>/<name> (cluster-scoped). Invalid YAML
returns 0 chunks (handled by p10-3 paragraph fallback). Oversize >200 lines
splits into line-windows sharing the same symbol.

tier2_shared module hosts the oversize fallback + Chunk-construction helper
mirroring code_rust_ast_v1's Chunk shape. Task E (dockerfile) and Task F
(manifest) will reuse it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:06:47 +00:00
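The document split and symbol shape described in the commit above might look roughly like the following, using only the standard library. `split_documents` and `resource_symbol` are hypothetical names; the real chunker additionally parses each document with serde_yaml to check `apiVersion` + `kind` and tracks line ranges, none of which is reproduced here.

```rust
/// Split multi-document YAML on separator lines matching `^---\s*$`.
/// Documents that are empty or whitespace-only are dropped.
pub fn split_documents(text: &str) -> Vec<String> {
    let mut docs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in text.lines() {
        // A separator is exactly `---` followed by optional whitespace.
        let is_separator = line
            .strip_prefix("---")
            .is_some_and(|rest| rest.trim().is_empty());
        if is_separator {
            docs.push(current.join("\n"));
            current.clear();
        } else {
            current.push(line);
        }
    }
    docs.push(current.join("\n"));
    docs.retain(|d| !d.trim().is_empty());
    docs
}

/// Build the chunk symbol: `<kind>/<namespace>/<name>` for namespaced
/// resources, `<kind>/<name>` for cluster-scoped ones.
pub fn resource_symbol(kind: &str, namespace: Option<&str>, name: &str) -> String {
    match namespace {
        Some(ns) => format!("{kind}/{ns}/{name}"),
        None => format!("{kind}/{name}"),
    }
}
```

With these helpers, a ConfigMap named `big` in namespace `prod` maps to the symbol `ConfigMap/prod/big`, the same shape asserted by the oversize follow-up test.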
22dba09857 refactor(p10-2): media.rs delegates code lang to code_lang_for_path
Replaces 1A-1 era inline match block with a single call to
kebab_parse_code::code_lang_for_path, per design §3.5 single-source-of-truth
rule. Adds Tier 2 routing test (yaml / dockerfile / toml / json / xml /
groovy / go-mod) and preserves all non-code extension branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:01:14 +00:00
aaa90b1754 feat(p10-2): extend code_lang_for_path with Tier 2 basenames + extensions
Adds basename-first matching for Dockerfile / Cargo.toml / pyproject.toml /
package.json / tsconfig.json / go.mod / pom.xml / build.gradle plus
Dockerfile.* prefix variant. Extension fallback adds .yaml/.yml/.dockerfile/
.toml/.json/.xml/.gradle → yaml/dockerfile/toml/json/xml/groovy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:59:11 +00:00
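As a rough illustration of the basename-first matching in the commit above, the sketch below covers only the Tier 2 names listed in the message. The function name `code_lang_for_path_sketch` and its `Option<&'static str>` return type are assumptions; the real `kebab_parse_code::code_lang_for_path` also handles the AST languages and may have a different signature.

```rust
use std::path::Path;

/// Resolve a Tier 2 code_lang: exact basenames first, then the
/// `Dockerfile.*` prefix variant, then the file extension.
pub fn code_lang_for_path_sketch(path: &Path) -> Option<&'static str> {
    let basename = path.file_name()?.to_str()?;
    let by_basename = match basename {
        "Dockerfile" => Some("dockerfile"),
        "Cargo.toml" | "pyproject.toml" => Some("toml"),
        "package.json" | "tsconfig.json" => Some("json"),
        "go.mod" => Some("go-mod"),
        "pom.xml" => Some("xml"),
        "build.gradle" => Some("groovy"),
        _ if basename.starts_with("Dockerfile.") => Some("dockerfile"),
        _ => None,
    };
    if by_basename.is_some() {
        return by_basename;
    }
    // Extension fallback for files without a recognized basename.
    match path.extension()?.to_str()? {
        "yaml" | "yml" => Some("yaml"),
        "dockerfile" => Some("dockerfile"),
        "toml" => Some("toml"),
        "json" => Some("json"),
        "xml" => Some("xml"),
        "gradle" => Some("groovy"),
        _ => None,
    }
}
```

Checking the basename before the extension is what lets `go.mod` resolve to `go-mod` and `Dockerfile.prod` to `dockerfile`, even though their extensions (`mod`, `prod`) carry no meaning of their own.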
077f92f41e build(p10-2): add serde_yaml dep to kebab-chunk for k8s-manifest-resource-v1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:57:06 +00:00
5ce7f60932 docs(p10-2): implementation plan (11 tasks A-K, subagent-driven)
Branch feat/p10-2-tier2-resource. Tasks: serde_yaml dep / lang.rs basenames /
media.rs source-of-truth consolidation / 3 chunkers (k8s + dockerfile +
manifest) + tier2_shared helper / ingest dispatch / smoke tests / frozen
design §3.5+§10.1 / docs sync / version bump 0.13.0→0.14.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:55:36 +00:00
47857b2622 docs(p10-2): task spec for Tier 2 resource-aware chunkers (k8s + Dockerfile + manifest)
Frozen contract for the p10-2 single PR: 3 chunker activation, k8s
identification via apiVersion+kind, Dockerfile/manifest basename matching,
code_lang_for_path source-of-truth consolidation, frozen design §3.5 +
§10.1 deltas, and version bump 0.13.0 → 0.14.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:43:34 +00:00
1e4cff879b Merge pull request 'feat(p10-1C-JK): Java + Kotlin AST chunkers - enable JVM-family code indexing' (#152) from feat/p10-1c-jk into main 2026-05-20 11:57:39 +00:00
2d7a566624 docs(p10-1c-jk): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + design §10.1; chore: bump version 0.12.0 → 0.13.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 11:38:40 +00:00
813bdd1a16 test(p10-1c-jk): code-java-ast-v1 + code-kotlin-ast-v1 chunker snapshots
Mirrors code_go_ast_snapshot pattern. In-memory CanonicalDocument (no
kebab-parse-code dep — boundary §6.3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:57:37 +00:00
ff1bedbef5 feat(p10-1c-jk): activate Kotlin in ingest_one_code_asset dispatch
Replaces Kotlin bail! arms with KotlinAstExtractor + CodeKotlinAstV1Chunker.
Adds kotlin_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=kotlin, symbol=com.foo.Foo.bar, code_lang=kotlin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:54:55 +00:00
30e03c7a12 feat(p10-1c-jk): code-kotlin-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-java-ast-v1 with language-agnostic body unchanged. Cross-
chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:52:24 +00:00
2ce6ae47c5 feat(p10-1c-jk): tree-sitter-kotlin-ng AST extractor (KotlinAstExtractor)
Uses tree-sitter-kotlin-ng (bare tree-sitter-kotlin is stuck on tree-sitter
0.21-0.23, incompatible with our 0.26). Mirrors JavaAstExtractor (JVM family,
source-side package extraction + class-nesting) with Kotlin grammar quirks:

- Root is `source_file`, not `program`.
- `package_header` child is `qualified_identifier` (its slice text is the
  dotted path); the bare `identifier` shape is also accepted as a fallback.
- `class_declaration` is the single node kind for `class` / `data class` /
  `sealed class` / `interface` / `enum class` — distinguished only by its
  `modifiers` child. Body is `class_body` for non-enum, `enum_class_body`
  for enum class; neither carries a `body` field name, so the extractor
  looks the body up by node kind rather than `child_by_field_name("body")`.
- `companion_object` is its own node kind (NOT object_declaration with a
  modifier); its `name` field is optional, so the extractor fills in the
  implicit Kotlin convention name `Companion`.
- `function_declaration` is allowed at top level (unlike Java), emitted as
  `<pkg>.<fn_name>`; the same node kind nested in `class_body` becomes
  `<pkg>.<...>.<Class>.<method>` via the same mod_path mechanism.
- `secondary_constructor` has no `name` field; symbol uses the enclosing
  class name (Java duplication convention: `<pkg>.<...>.<Class>.<Class>`).
- Enum bodies (`enum_class_body`) are NOT recursed — `enum_entry` is not
  emitted as a unit (matches the Java first-pass scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:49:57 +00:00
ebc4ef2eea feat(p10-1c-jk): activate Java in ingest_one_code_asset dispatch
Replaces Java bail! arms with JavaAstExtractor + CodeJavaAstV1Chunker. Adds
java_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=java, symbol=com.foo.Foo.bar, code_lang=java.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:44:05 +00:00
7bda1509b7 feat(p10-1c-jk): code-java-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-rust-ast-v1 / code-go-ast-v1 with language-agnostic body
unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:41:27 +00:00
61d48d67a3 feat(p10-1c-jk): tree-sitter-java AST extractor (JavaAstExtractor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:39:02 +00:00
f4c840b994 refactor(p10-1c-jk): add java + kotlin to dispatch allowlist (bail until Tasks F/I)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:33:27 +00:00
15244b7494 feat(p10-1c-jk): route .java/.kt/.kts to MediaType::Code
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:31:29 +00:00
a7f7ab9f93 build(p10-1c-jk): add tree-sitter-java + tree-sitter-kotlin-ng workspace deps
Bare tree-sitter-kotlin v0.3.8 requires tree-sitter >=0.21,<0.23 which
conflicts with the workspace's tree-sitter 0.26 (links = "tree-sitter"
is a singleton). tree-sitter-kotlin-ng v1.1.0 (from
tree-sitter-grammars/tree-sitter-kotlin) uses the tree-sitter-language
0.1 shim which is compatible with tree-sitter 0.26. Using
tree-sitter-kotlin-ng as the Kotlin grammar crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:30:03 +00:00
1b19e33a4f docs(p10-1c-jk): task spec + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:27:13 +00:00
9c9e391b15 Merge pull request 'feat(p10-1C-Go): tree-sitter-go AST extractor + chunker - enable Go code indexing' (#151) from feat/p10-1c-go into main 2026-05-20 10:16:09 +00:00
f95cd55484 docs(p10-1c-go): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + design §10.1; chore: bump version 0.11.1 → 0.12.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:02:21 +00:00
ab288135e9 test(p10-1c-go): code-go-ast-v1 chunker snapshot + full-suite gate
Mirrors code_python_ast_snapshot / code_ts_ast_snapshot patterns. In-memory
CanonicalDocument (no kebab-parse-code dep — boundary §6.3 respected).

verify:
- cargo test -p kebab-chunk --test code_go_ast_snapshot → 2/2
- cargo test --workspace --no-fail-fast -j 1 → 0 failures (all green)
- cargo clippy --workspace --all-targets -- -D warnings → clean
- SMOKE: chunk.ParseDoc symbol + code_lang_breakdown {"go": 1} verified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:54:17 +00:00
c19aa006d0 feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch
Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds
go_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=go, symbol=chunk.ParseDoc, code_lang=go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:13:47 +00:00
f1a4f67e12 feat(p10-1c-go): code-go-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-rust-ast-v1 / code-{python,ts,js}-ast-v1 with language-agnostic
body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:11:14 +00:00
6463c52827 feat(p10-1c-go): tree-sitter-go AST extractor (GoAstExtractor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:08:46 +00:00
2559d0d95a refactor(p10-1c-go): add go to ingest dispatch allowlist (bail until Task F)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:03:28 +00:00
4524830306 feat(p10-1c-go): route .go to MediaType::Code(go)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:01:29 +00:00
8cdd3903c7 build(p10-1c-go): add tree-sitter-go workspace dep
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:00:04 +00:00
8b89961ada docs(p10-1c-go): task spec + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:58:45 +00:00
eec90996aa chore: bump version 0.11.0 → 0.11.1
dogfood semantic cleanup (PR #150) lands: document-centric fetch_span +
assets.workspace_path 'last-registered' semantic explicitly documented.

Reason for the patch bump: no change to the external wire / CLI / config
surface. New internal trait method (get_asset) + caller refactor +
doc-comment refresh. Fixes a (rare) case where fetch_span could take the
wrong branch for twin files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:09:46 +00:00
ce1c778b4a Merge pull request 'fix(dogfood): document-centric fetch_span + assets.workspace_path semantic doc' (#150) from fix/dogfood-asset-flip-flop-cleanup into main 2026-05-20 08:08:55 +00:00
453ec15df4 fix(dogfood): document-centric fetch_span + remove get_asset_by_workspace_path
assets.workspace_path is INTENTIONALLY 'last-registered path' for twin
files (identical content at different paths share one asset row PK'd by
blake3 content hash). PR #146 made try_skip_unchanged document-centric;
PR #149 made reset --orphans-only document-centric; this PR removes the
last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span,
which used it to reject PDF/audio media — for twins this could read the
wrong asset's media_type and pick the wrong branch).

Replaced with the natural 2-step lookup: get_document_by_workspace_path
(PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id
is PRIMARY KEY so flip-flop-immune by construction).

Then removed get_asset_by_workspace_path trait method + SqliteStore impl
— 0 callers after the refactor.

UPSERT doc-comment refreshed in store.rs to make the 'last-registered'
semantics explicit so future readers don't try to 'fix' the flip-flop.

Dogfood follow-up (PR #142 1B + multi-root corpus).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:03:38 +00:00
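The two-step lookup from the commit above can be sketched with in-memory maps. Everything here is a stand-in: `Store`, `Document`, `Asset`, and `media_type_for_path` are simplified, hypothetical shapes, and only the method names `get_document_by_workspace_path` and `get_asset` plus the field `source_asset_id` come from the commit message.

```rust
use std::collections::HashMap;

/// Asset row, keyed by asset id (a content hash in the real store).
pub struct Asset {
    pub media_type: String,
}

/// Document row, keyed by workspace path; points at its source asset.
pub struct Document {
    pub source_asset_id: String,
}

/// In-memory stand-in for the store; the real one is SQLite-backed.
pub struct Store {
    pub documents_by_path: HashMap<String, Document>,
    pub assets: HashMap<String, Asset>,
}

impl Store {
    pub fn get_document_by_workspace_path(&self, path: &str) -> Option<&Document> {
        self.documents_by_path.get(path)
    }

    pub fn get_asset(&self, asset_id: &str) -> Option<&Asset> {
        self.assets.get(asset_id)
    }
}

/// Step 1: path -> document. Step 2: document.source_asset_id -> asset.
/// The asset's own "last-registered" workspace path is never consulted.
pub fn media_type_for_path<'a>(store: &'a Store, path: &str) -> Option<&'a str> {
    let doc = store.get_document_by_workspace_path(path)?;
    let asset = store.get_asset(&doc.source_asset_id)?;
    Some(asset.media_type.as_str())
}
```

Both twin paths resolve through their own document row to the shared asset, so the result does not depend on which path was registered last.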
63 changed files with 11330 additions and 83 deletions

80
Cargo.lock generated

@@ -4127,7 +4127,7 @@ dependencies = [
[[package]]
name = "kebab-app"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"base64 0.22.1",
@@ -4172,7 +4172,7 @@ dependencies = [
[[package]]
name = "kebab-chunk"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"blake3",
@@ -4181,13 +4181,14 @@ dependencies = [
"kebab-parse-md",
"serde_json",
"serde_json_canonicalizer",
"serde_yaml",
"time",
"tracing",
]
[[package]]
name = "kebab-cli"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"clap",
@@ -4208,7 +4209,7 @@ dependencies = [
[[package]]
name = "kebab-config"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"dirs 5.0.1",
@@ -4223,7 +4224,7 @@ dependencies = [
[[package]]
name = "kebab-core"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"blake3",
@@ -4237,7 +4238,7 @@ dependencies = [
[[package]]
name = "kebab-embed"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"blake3",
@@ -4251,7 +4252,7 @@ dependencies = [
[[package]]
name = "kebab-embed-local"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"fastembed",
@@ -4264,7 +4265,7 @@ dependencies = [
[[package]]
name = "kebab-eval"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"kebab-app",
@@ -4283,7 +4284,7 @@ dependencies = [
[[package]]
name = "kebab-llm"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"kebab-core",
@@ -4292,7 +4293,7 @@ dependencies = [
[[package]]
name = "kebab-llm-local"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"kebab-config",
@@ -4309,7 +4310,7 @@ dependencies = [
[[package]]
name = "kebab-mcp"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"kebab-app",
@@ -4327,7 +4328,7 @@ dependencies = [
[[package]]
name = "kebab-normalize"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"kebab-core",
@@ -4342,7 +4343,7 @@ dependencies = [
[[package]]
name = "kebab-parse-code"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"gix",
@@ -4352,7 +4353,10 @@ dependencies = [
"time",
"tracing",
"tree-sitter",
"tree-sitter-go",
"tree-sitter-java",
"tree-sitter-javascript",
"tree-sitter-kotlin-ng",
"tree-sitter-python",
"tree-sitter-rust",
"tree-sitter-typescript",
@@ -4360,7 +4364,7 @@ dependencies = [
[[package]]
name = "kebab-parse-image"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"ab_glyph",
"anyhow",
@@ -4384,7 +4388,7 @@ dependencies = [
[[package]]
name = "kebab-parse-md"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"kebab-core",
@@ -4401,7 +4405,7 @@ dependencies = [
[[package]]
name = "kebab-parse-pdf"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"blake3",
@@ -4414,7 +4418,7 @@ dependencies = [
[[package]]
name = "kebab-parse-types"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"kebab-core",
"serde",
@@ -4422,7 +4426,7 @@ dependencies = [
[[package]]
name = "kebab-rag"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"blake3",
@@ -4443,7 +4447,7 @@ dependencies = [
[[package]]
name = "kebab-search"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"globset",
@@ -4462,7 +4466,7 @@ dependencies = [
[[package]]
name = "kebab-source-fs"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"blake3",
@@ -4481,7 +4485,7 @@ dependencies = [
[[package]]
name = "kebab-store-sqlite"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"blake3",
@@ -4502,7 +4506,7 @@ dependencies = [
[[package]]
name = "kebab-store-vector"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"arrow",
@@ -4526,7 +4530,7 @@ dependencies = [
[[package]]
name = "kebab-tui"
version = "0.11.0"
version = "0.15.0"
dependencies = [
"anyhow",
"crossterm",
@@ -8527,6 +8531,26 @@ dependencies = [
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-go"
version = "0.25.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8560a4d2f835cc0d4d2c2e03cbd0dde2f6114b43bc491164238d333e28b16ea"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-java"
version = "0.23.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0aa6cbcdc8c679b214e616fd3300da67da0e492e066df01bcf5a5921a71e90d6"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-javascript"
version = "0.25.0"
@@ -8537,6 +8561,16 @@ dependencies = [
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-kotlin-ng"
version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e800ebbda938acfbf224f4d2c34947a31994b1295ee6e819b65226c7b51b4450"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-language"
version = "0.1.7"

View File

@@ -31,7 +31,7 @@ edition = "2024"
rust-version = "1.85"
license = "MIT OR Apache-2.0"
repository = "https://github.com/altair823/kebab"
version = "0.11.0"
version = "0.15.0"
[workspace.dependencies]
anyhow = "1"
@@ -94,6 +94,11 @@ tree-sitter-rust = "0.24"
tree-sitter-python = "0.25.0"
tree-sitter-typescript = "0.23.2"
tree-sitter-javascript = "0.25.0"
# Go grammar for code ingest (kebab-parse-code, p10-1C-Go).
tree-sitter-go = "0.25.0"
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
tree-sitter-java = "0.23.5"
tree-sitter-kotlin-ng = "1.1.0" # bare tree-sitter-kotlin requires ts <0.23; -ng uses tree-sitter-language 0.1 (ts 0.26 compat)
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
# behavior are unchanged — only DWARF debug info is reduced (line

View File

@@ -4,7 +4,7 @@
## One-line summary
P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) merged. `kebab ingest` handles markdown / image / PDF. `kebab search` / `kebab ask` return results across media types with page citations. `kebab tui` provides 4 panels (Library + Search + Ask + Inspect): `?` for ask, `/` for search, Enter in Library / `i` in Search for inspect, `g` in Search for editor jump. Next candidates: P9-5 (desktop tauri), or a system-dependency brainstorm for the on-hold P8 (audio).
P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) merged. `kebab ingest` handles markdown / image / PDF / source code (Rust / Python / TS / JS / Go / Java / Kotlin) / Tier 2 resource files (yaml/k8s / dockerfile / toml / json / xml / groovy / go-mod) + Tier 3 paragraph fallback (shell / non-k8s YAML / AST failure cases). `kebab search` / `kebab ask` return results across media types with page / code citations. `kebab tui` provides 4 panels (Library + Search + Ask + Inspect). P10-3 (Tier 3 paragraph fallback) is done; next candidates: P10-1D (C/C++), P9-5 (desktop tauri), or the on-hold P8 (audio).
## Phase roadmap
@@ -20,7 +20,7 @@ P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) merged.
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ Done (3/3 components, page-level chunker + ingest wiring) |
| **P8** | Audio transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ On hold (whisper-rs system dependency brainstorm needed) |
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 In progress (4/5 components: P9-1/2/3/4 done [Library / Search / Ask / Inspect], P9-5 desktop planned · dogfooding feedback **20/20 ✅**) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 In progress: 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, tree-sitter-rust, `code-rust-ast-v1`, v0.7.0), **1B 🟡 PR open** (Python `code-python-ast-v1` + TypeScript `code-ts-ast-v1` + JavaScript `code-js-ast-v1`; 3 languages ready for dogfooding, awaiting v0.8.0) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 In progress: 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, `code-rust-ast-v1`, v0.7.0), 1B ✅ (Python/TS/JS AST chunkers, since v0.8.0), **1C-Go ✅ (Go AST chunker, `code-go-ast-v1`, v0.12.0)**, **1C-JavaKotlin ✅ (Java + Kotlin AST chunkers, `code-java-ast-v1` / `code-kotlin-ast-v1`, v0.13.0)**, **2 ✅ (Tier 2 resource-aware: yaml/k8s + dockerfile + manifest, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1`, v0.14.0)**, **3 ✅ (Tier 3 paragraph fallback: code-text-paragraph-v1, v0.15.0)** |
P0~P5 are sequential. P6~P9 can proceed in parallel after P5.


@@ -42,7 +42,7 @@ cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin ke
# First run: creates the data directory + config.toml under the XDG path
kebab init
# Edit the config: set workspace.root, the model endpoint, etc. (supported formats: md / png / jpg / pdf / rs / py / ts / js)
# Edit the config: set workspace.root, the model endpoint, etc. (supported formats: md / png / jpg / pdf / rs / py / ts / js / go)
${EDITOR:-vi} ~/.config/kebab/config.toml
# Index (Markdown / images / PDF all at once)
@@ -70,7 +70,7 @@ kebab doctor
| Command | Behavior |
|------|------|
| `kebab init` | Creates the data directory + config.toml under the XDG path |
| `kebab ingest [<path>]` | Indexes Markdown / images / PDF / Rust source code (idempotent). On a TTY it shows a progress bar on stderr; on non-TTY (CI / pipe) it prints one line at a time to stderr; with `--json` it streams `ingest_progress.v1` lines to stdout, followed by a final `ingest_report.v1`. A single Ctrl-C finishes the current asset and then aborts (partial commits are preserved, re-runs are idempotent); a second Ctrl-C is a hard exit. If a Markdown title is missing from the frontmatter, it is filled automatically in the order first H1 → H2 → first 80 characters of the first paragraph → file name (parser_version `md-frontmatter-v2`); already-indexed docs pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest onward, parse/chunk/embed/vector upsert is skipped automatically for unchanged docs (same blake3 and same parser/chunker/embedder versions). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. **Supported formats** (the extractor is chosen automatically and cannot be set in config): Markdown (`.md`), images (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **source code** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1`; all are tree-sitter AST chunkers). Other extensions are skipped automatically: the reason goes into `IngestItem.warnings` (e.g. `"unsupported media type: .docx"`), counts are grouped in `IngestReport.skipped_by_extension`, and the CLI / TUI summary shows the breakdown. Code chunks carry `citation.kind = "code"`, `citation.lang = "<lang>"`, `symbol`, and a line range, and `code_lang` + `repo` (the directory name found by the `.git/` walk-up) are backfilled at the SearchHit top level. The `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--media code` filters allow per-language and code-only search (p10-1A-1 filter flags). Python symbols take a dotted module path prefix derived from the workspace path (e.g. `kebab_eval.metrics.compute_mrr`); TS/JS symbols take a slash-style module path prefix (e.g. `src/Foo.Foo.search`). |
| `kebab ingest [<path>]` | Indexes Markdown / images / PDF / Rust source code (idempotent). On a TTY it shows a progress bar on stderr; on non-TTY (CI / pipe) it prints one line at a time to stderr; with `--json` it streams `ingest_progress.v1` lines to stdout, followed by a final `ingest_report.v1`. A single Ctrl-C finishes the current asset and then aborts (partial commits are preserved, re-runs are idempotent); a second Ctrl-C is a hard exit. If a Markdown title is missing from the frontmatter, it is filled automatically in the order first H1 → H2 → first 80 characters of the first paragraph → file name (parser_version `md-frontmatter-v2`); already-indexed docs pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest onward, parse/chunk/embed/vector upsert is skipped automatically for unchanged docs (same blake3 and same parser/chunker/embedder versions). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. **Supported formats** (the extractor is chosen automatically and cannot be set in config): Markdown (`.md`), images (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **source code** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1`, `.go` → `code-go-ast-v1`, `.java` → `code-java-ast-v1`, `.kt`/`.kts` → `code-kotlin-ast-v1`; all are tree-sitter AST chunkers; **Tier 2 resource files**: `.yaml`/`.yml` → `k8s-manifest-resource-v1` (apiVersion+kind parsing), `Dockerfile`/`Dockerfile.*`/`*.dockerfile` → `dockerfile-file-v1` (whole file), `Cargo.toml`/`pyproject.toml`/`.toml`/`package.json`/`tsconfig.json`/`.json`/`pom.xml`/`.xml`/`build.gradle`/`.gradle`/`go.mod` → `manifest-file-v1` (whole file), i.e. yaml (k8s) / dockerfile / toml / json / xml / groovy / go-mod are supported); **Tier 3 paragraph fallback** (`.sh`/`.bash`/`.zsh` → `code-text-paragraph-v1`, blank-line paragraph split + 80-line/20-overlap line window. When Tier 1/2 yields 0 chunks or an Err, ingest falls back automatically, which picks up cases such as non-k8s YAML. symbol = None; lang keeps the original value.). Other extensions are skipped automatically: the reason goes into `IngestItem.warnings` (e.g. `"unsupported media type: .docx"`), counts are grouped in `IngestReport.skipped_by_extension`, and the CLI / TUI summary shows the breakdown. Code chunks carry `citation.kind = "code"`, `citation.lang = "<lang>"`, `symbol`, and a line range, and `code_lang` + `repo` (the directory name found by the `.git/` walk-up) are backfilled at the SearchHit top level. The `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--code-lang go` / `--code-lang java` / `--code-lang kotlin` / `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` / `--code-lang json` / `--code-lang xml` / `--code-lang groovy` / `--code-lang go-mod` / `--code-lang shell` / `--media code` filters allow per-language and code-only search (p10-1A-1 filter flags). Python symbols take a dotted module path prefix derived from the workspace path (e.g. `kebab_eval.metrics.compute_mrr`); TS/JS symbols take a slash-style module path prefix (e.g. `src/Foo.Foo.search`); Go symbols use the `package.Func` / `package.(*Receiver).Method` form; Java / Kotlin symbols use the `com.foo.Foo.bar` form (package + class + method/field). |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | Search. hybrid uses RRF fusion and includes citations. Within one process, repeating the same query (normalized with NFKC + trim + lowercase) hits an in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, for debugging. When an ingest commit happens, the `kv['corpus_revision']` bump makes every entry stale automatically. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)**: agent budget controls. `--json` output is the `search_response.v1` wrapper (`{hits, next_cursor, truncated}`), which is not compatible with the pre-fb-34 bare array. A mismatched cursor yields `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` is a repeatable flag (`--tag rust --tag async`) with OR matching, `--media` takes multiple `,`-separated values with OR matching, and the different flags combine with AND. `--trust-min` is one of `primary\|secondary\|generated` (that level and above are included). `--ingested-after` is RFC3339 UTC; a parse failure yields `error.v1.code = config_invalid` (exit 2). `--media md` is normalized as an alias for `markdown`. An unknown `--media` value always returns empty hits (not an error). **`--trace` (p9-fb-37)**: exposes the lexical / vector pre-fusion candidates + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) in `search_response.v1.trace`. Trace requests bypass the cache (always cold, even without `--no-cache`). **`--bulk` (p9-fb-42)**: runs N queries at once from stdin ndjson. With `--json`, stdout carries per-query ndjson (`bulk_search_item.v1`) and stderr a summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. When an agent runs sub-queries as a batch after query decomposition, this is a single round-trip; the App instance is reused, so cache / embedder cold-start costs are paid only once. A per-query failure is isolated in that item's `error` (error.v1) and the other queries continue. **code corpus filters (p10-1A-1):** `--repo` is repeatable (`--repo kebab --repo other`) with OR matching. `--code-lang` takes repeated or comma-separated values (`--code-lang rust,python`); unknown values return empty hits. `--media code` includes code chunks from all of Tier 1/2/3. At the time of 1A-1 there were no indexed code chunks, so the filter always returned empty results; it took effect once 1A-2 (Rust AST chunker) was merged. |
| `kebab list docs` | Lists indexed documents |
| `kebab inspect doc <id>` / `kebab inspect chunk <id>` | Shows the raw record |
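
The search row above says hybrid mode fuses results with RRF. As a rough illustration of reciprocal rank fusion in general, not kebab's actual retriever code, the sketch below scores each document id by summing `1 / (k + rank)` over every ranked list it appears in. The helper name `rrf_fuse` and the value `k = 60.0` used in the usage note are assumptions; the README does not state which constant kebab uses.

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion over several ranked lists of document ids.
/// Hypothetical helper: `rrf_fuse` and the `k` parameter are illustrative
/// names, not part of kebab's API.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, id) in ranking.iter().enumerate() {
            // Ranks are 1-based: the top hit contributes 1 / (k + 1).
            let rank = (idx + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties are broken by id so the order is stable.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[lexical_ids, vector_ids], 60.0)` with a lexical list and a vector list favors ids that rank well in both lists over ids that appear in only one.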
@@ -132,7 +132,7 @@ flowchart TB
subgraph Pipeline["Domain + pipeline"]
parse["parse-md / parse-pdf / parse-image / parse-code"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-rust-ast-v1, code-python-ast-v1, code-ts-ast-v1, code-js-ast-v1)"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-{rust,python,ts,js,go,java,kotlin}-ast-v1, k8s-manifest-resource-v1, dockerfile-file-v1, manifest-file-v1, code-text-paragraph-v1)"]
embedder["embedder (fastembed multilingual-e5-large)"]
retriever["retriever (lexical / vector / hybrid RRF)"]
rag["RAG pipeline"]

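The README hunks above describe `code-text-paragraph-v1` as a blank-line paragraph split with an 80-line / 20-overlap line window. The real chunker is not shown in this diff, so the following is only a sketch under one assumed reading: paragraphs are separated by blank lines, and a paragraph longer than the window is cut into overlapping windows. `TextChunk` and `paragraph_chunks` are hypothetical names.

```rust
/// A chunk expressed as a 1-based inclusive line range plus its text.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct TextChunk {
    line_start: usize,
    line_end: usize,
    text: String,
}

/// Split `src` into paragraphs on blank lines. A paragraph longer than
/// `window` lines is cut into windows that overlap by `overlap` lines.
/// ASSUMPTION: this is one plausible reading of the README, not kebab's code.
fn paragraph_chunks(src: &str, window: usize, overlap: usize) -> Vec<TextChunk> {
    assert!(window > overlap, "window must exceed overlap");
    let lines: Vec<&str> = src.lines().collect();
    let step = window - overlap;
    let mut chunks = Vec::new();
    let mut i = 0;
    while i < lines.len() {
        // Skip blank separator lines.
        if lines[i].trim().is_empty() {
            i += 1;
            continue;
        }
        // Find the end of this paragraph: lines[i..end] are all non-blank.
        let mut end = i;
        while end < lines.len() && !lines[end].trim().is_empty() {
            end += 1;
        }
        // Emit the paragraph whole, or as overlapping windows if it is too long.
        let mut start = i;
        loop {
            let stop = (start + window).min(end);
            chunks.push(TextChunk {
                line_start: start + 1, // convert the 0-based index to a 1-based line
                line_end: stop,        // exclusive 0-based end equals inclusive 1-based end
                text: lines[start..stop].join("\n"),
            });
            if stop == end {
                break;
            }
            start += step;
        }
        i = end;
    }
    chunks
}
```

With `window = 80` and `overlap = 20`, a 100-line paragraph yields two chunks covering lines 1 to 80 and 61 to 100, so neighbouring chunks share 20 lines of context.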

@@ -189,10 +189,12 @@ fn fetch_span(
// (markdown / note / paper / reference / inbox) is the *user-facing*
// category, not the rendering format — the actual byte-level format
// lives on the source `RawAsset.media_type`. Look it up via
// workspace_path (unique key per asset).
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset_by_workspace_path(
// doc.source_asset_id (PRIMARY KEY) so twin files (identical content
// at different paths) always read *this* document's own asset row,
// not whichever twin last wrote `assets.workspace_path`.
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset(
&app.sqlite,
&doc.workspace_path,
&doc.source_asset_id,
)? {
if matches!(
asset.media_type,

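The `fetch_span` change above switches the asset lookup from `workspace_path` to the primary key `source_asset_id`, citing twin files. The toy model below illustrates why a path-keyed lookup can miss. It rests on an assumption the diff does not confirm: that the asset id is content-derived, so twins share a single row whose `workspace_path` is overwritten by the last writer. `AssetRow` and `AssetTable` are hypothetical stand-ins, not kebab's SQLite schema.

```rust
use std::collections::HashMap;

/// One row of a toy `assets` table (hypothetical shape, not kebab's schema).
#[derive(Clone, Debug, PartialEq)]
struct AssetRow {
    workspace_path: String,
    media_type: String,
}

/// Toy asset store keyed by asset id. ASSUMPTION: the id is derived from
/// file content, so twin files share one id and therefore one row.
#[derive(Default)]
struct AssetTable {
    rows: HashMap<String, AssetRow>,
}

impl AssetTable {
    /// Upsert by primary key: a twin with the same id overwrites the path.
    fn upsert(&mut self, asset_id: &str, workspace_path: &str, media_type: &str) {
        self.rows.insert(
            asset_id.to_string(),
            AssetRow {
                workspace_path: workspace_path.to_string(),
                media_type: media_type.to_string(),
            },
        );
    }

    /// Primary-key lookup: unaffected by which twin wrote the path last.
    fn get(&self, asset_id: &str) -> Option<&AssetRow> {
        self.rows.get(asset_id)
    }

    /// Path lookup: only finds the twin that wrote `workspace_path` last.
    fn get_by_workspace_path(&self, path: &str) -> Option<&AssetRow> {
        self.rows.values().find(|row| row.workspace_path == path)
    }
}
```

After upserting the same id for `docs/a.pdf` and then `docs/b.pdf`, `get_by_workspace_path("docs/a.pdf")` returns `None` in this model, while `get` on the shared id still resolves the media type.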

@@ -39,7 +39,7 @@ use std::sync::Arc;
use anyhow::{Context, anyhow};
use serde::{Deserialize, Serialize};
use kebab_chunk::{CodeJsAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTsAstV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
use kebab_chunk::{CodeGoAstV1Chunker, CodeJavaAstV1Chunker, CodeJsAstV1Chunker, CodeKotlinAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTextParagraphV1Chunker, CodeTsAstV1Chunker, DockerfileFileV1Chunker, K8sManifestResourceV1Chunker, ManifestFileV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
use kebab_core::{
Answer, Block, CanonicalDocument, Chunk, ChunkId, ChunkPolicy, ChunkerVersion, Chunker,
DocFilter, DocSummary, DocumentId, DocumentStore, Embedder, EmbeddingInput,
@@ -50,7 +50,7 @@ use kebab_core::{
use kebab_llm_local::OllamaLanguageModel;
use kebab_normalize::build_canonical_document;
use kebab_parse_image::{ImageExtractor, OllamaVisionOcr, apply_caption, apply_ocr};
use kebab_parse_code::{JavascriptAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
use kebab_parse_code::{GoAstExtractor, JavaAstExtractor, JavascriptAstExtractor, KotlinAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
use kebab_parse_pdf::PdfTextExtractor;
use kebab_parse_md::{BodyHints, parse_blocks, parse_frontmatter};
use kebab_source_fs::FsSourceConnector;
@@ -948,9 +948,12 @@ fn ingest_one_asset(
force_reingest,
);
}
// p10-1A-2 / 1B: code ingest dispatch.
// p10-1A-2 / 1B: code ingest dispatch. p10-2: Tier 2 langs added. p10-3: shell added.
MediaType::Code(lang)
if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript") =>
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "shell") =>
{
return ingest_one_code_asset(
app,
@@ -1827,15 +1830,33 @@ fn ingest_one_code_asset(
"python" => ParserVersion(kebab_parse_code::PYTHON_PARSER_VERSION.to_string()),
"typescript" => ParserVersion(kebab_parse_code::TS_PARSER_VERSION.to_string()),
"javascript" => ParserVersion(kebab_parse_code::JS_PARSER_VERSION.to_string()),
"go" => ParserVersion(kebab_parse_code::GO_PARSER_VERSION.to_string()),
"java" => ParserVersion(kebab_parse_code::JAVA_PARSER_VERSION.to_string()),
"kotlin" => ParserVersion(kebab_parse_code::KOTLIN_PARSER_VERSION.to_string()),
// p10-2: Tier 2 has no parse step — sentinel "none-v1".
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
=> ParserVersion("none-v1".to_string()),
// p10-3: shell direct routes to Tier 3 (no parse step).
"shell" => ParserVersion("none-v1".to_string()),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
// p10-1b Task D/G/J/L: chunker_version per-lang.
let chunker_version = match code_lang {
let mut chunker_version = match code_lang {
"rust" => CodeRustAstV1Chunker.chunker_version(),
"python" => CodePythonAstV1Chunker.chunker_version(),
"typescript" => CodeTsAstV1Chunker.chunker_version(),
"javascript" => CodeJsAstV1Chunker.chunker_version(),
"go" => CodeGoAstV1Chunker.chunker_version(),
"java" => CodeJavaAstV1Chunker.chunker_version(),
"kotlin" => CodeKotlinAstV1Chunker.chunker_version(),
// p10-2 Tier 2:
"yaml" => K8sManifestResourceV1Chunker.chunker_version(),
"dockerfile" => DockerfileFileV1Chunker.chunker_version(),
"toml" | "json" | "xml" | "groovy" | "go-mod"
=> ManifestFileV1Chunker.chunker_version(),
// p10-3:
"shell" => CodeTextParagraphV1Chunker.chunker_version(),
other => anyhow::bail!("unreachable chunker_version: {other}"),
};
@@ -1861,37 +1882,145 @@ fn ingest_one_code_asset(
};
// p10-1b Task D/G/J/L: extractor per-lang.
let mut canonical = match code_lang {
// p10-3: capture Result so Tier 1 extractor errors can fall back to Tier 3.
let canonical_result: anyhow::Result<kebab_core::CanonicalDocument> = match code_lang {
"rust" => RustAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::RustAstExtractor::extract (code:rust)")?,
.context("kb-parse-code::RustAstExtractor::extract (code:rust)"),
"python" => PythonAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::PythonAstExtractor::extract (code:python)")?,
.context("kb-parse-code::PythonAstExtractor::extract (code:python)"),
"typescript" => TypescriptAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)")?,
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)"),
"javascript" => JavascriptAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)")?,
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)"),
"go" => GoAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::GoAstExtractor::extract (code:go)"),
"java" => JavaAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::JavaAstExtractor::extract (code:java)"),
"kotlin" => KotlinAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::KotlinAstExtractor::extract (code:kotlin)"),
// p10-2 Tier 2: no extractor — synthesize Document directly from raw bytes.
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" => {
synthesize_tier2_document(asset, &bytes, code_lang, &parser_version)
}
// p10-3: shell reuses the same synthesizer.
"shell" => synthesize_tier2_document(asset, &bytes, "shell", &parser_version),
other => anyhow::bail!("unreachable (extract): {other}"),
};
// p10-3: Tier 1 extractor failure → fall back to Tier 3 synthesized doc.
// Tier 2 (yaml/dockerfile/…) and shell errors are real (e.g. non-UTF-8) — propagate.
let mut canonical = match canonical_result {
Ok(d) => d,
Err(e) if code_lang == "shell"
|| matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod") =>
{
return Err(e).context("synthesize_tier2_document failed for tier 2/3 lang");
}
Err(e) => {
// Tier 1 extractor errored — fall back to Tier 3 synthesized doc.
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
error = %e,
"tier1 extract errored; falling back to tier 3 synthesized doc"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
let tier3_parser_version = ParserVersion("none-v1".to_string());
synthesize_tier2_document(asset, &bytes, code_lang, &tier3_parser_version)
.context("synthesize_tier2_document for tier 3 fallback after extract error")?
}
};
// p10-1b Task D/G/J/L: chunker per-lang.
let chunks = match code_lang {
"rust" => CodeRustAstV1Chunker
// p10-3: track whether the extract stage already fell back to Tier 3.
// Tier 2 langs already have "none-v1" parser_version normally, so exclude them
// from the extract_fell_back guard with the !matches! exclusion.
let extract_fell_back = canonical.parser_version.0 == "none-v1"
&& !matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" | "shell");
let chunks_result: anyhow::Result<Vec<Chunk>> = if extract_fell_back {
// Tier 1 lang whose extractor errored — go straight to Tier 3 chunker.
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)")?,
"python" => CodePythonAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)")?,
"typescript" => CodeTsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)")?,
"javascript" => CodeJsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)")?,
other => anyhow::bail!("unreachable (chunk): {other}"),
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 after extract fallback)")
} else {
match code_lang {
"rust" => CodeRustAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)"),
"python" => CodePythonAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)"),
"typescript" => CodeTsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)"),
"javascript" => CodeJsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)"),
"go" => CodeGoAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeGoAstV1Chunker::chunk (code:go)"),
"java" => CodeJavaAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJavaAstV1Chunker::chunk (code:java)"),
"kotlin" => CodeKotlinAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeKotlinAstV1Chunker::chunk (code:kotlin)"),
// p10-2 Tier 2:
"yaml" => K8sManifestResourceV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::K8sManifestResourceV1Chunker::chunk"),
"dockerfile" => DockerfileFileV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::DockerfileFileV1Chunker::chunk"),
"toml" | "json" | "xml" | "groovy" | "go-mod" => ManifestFileV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::ManifestFileV1Chunker::chunk"),
// p10-3:
"shell" => CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (code:shell)"),
other => anyhow::bail!("unreachable (chunk): {other}"),
}
};
// p10-3: Tier 1/2 0-chunk OR error → Tier 3 fallback retry.
// "shell" direct path is already Tier 3 — don't retry-double-up.
let chunks: Vec<Chunk> = match chunks_result {
Ok(v) if !v.is_empty() => v,
other if code_lang == "shell" => other?, // shell propagates directly
Ok(_empty) => {
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
"tier1/2 emitted 0 chunks; falling back to tier 3 (code-text-paragraph-v1)"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
canonical.parser_version = ParserVersion("none-v1".to_string());
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback)")?
}
Err(e) => {
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
error = %e,
"tier1/2 chunker errored; falling back to tier 3 (code-text-paragraph-v1)"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
canonical.parser_version = ParserVersion("none-v1".to_string());
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback after error)")?
}
};
// Stamp chunker + embedding versions so incremental skip detection has
@@ -1986,6 +2115,139 @@ fn ingest_one_code_asset(
})
}
/// p10-2: Build a minimal [`CanonicalDocument`] for Tier 2 code assets
/// (yaml / dockerfile / toml / json / xml / groovy / go-mod) that have
/// no AST extractor. Produces a single `Block::Code` whose source span
/// covers the entire file, mirroring the shape the Tier 1 extractors
/// produce for glue / top-level regions.
fn synthesize_tier2_document(
asset: &RawAsset,
bytes: &[u8],
code_lang: &str,
parser_version: &ParserVersion,
) -> anyhow::Result<kebab_core::CanonicalDocument> {
use anyhow::Context as _;
use kebab_core::{
BlockId, CodeBlock, CommonBlock, Lang, Metadata, Provenance, ProvenanceEvent,
ProvenanceKind, SourceSpan, SourceType, TrustLevel, id_for_block, id_for_doc,
};
let text = std::str::from_utf8(bytes)
.with_context(|| format!("tier2 doc not utf-8: {}", asset.workspace_path.0))?
.to_string();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, parser_version);
let n_lines = text.lines().count().max(1) as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: n_lines,
symbol: Some("<file>".to_string()),
lang: Some(code_lang.to_string()),
};
let block_id: BlockId = id_for_block(
&doc_id,
"code",
&[],
0,
&span,
);
let block = kebab_core::Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: vec![],
source_span: span,
},
lang: Some(code_lang.to_string()),
code: text,
});
let now = time::OffsetDateTime::now_utc();
let events = vec![
ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
},
ProvenanceEvent {
at: now,
agent: "kb-app".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; tier2_synthesized; lang={}",
parser_version.0, code_lang
)),
},
];
// Resolve absolute path for repo detection. FsSourceConnector always
// emits absolute paths in SourceUri::File (verified in connector.rs); Kb
// URIs were rejected earlier in ingest_one_code_asset (returns Skipped),
// so the fallback below is purely defensive. This does NOT mirror
// RustAstExtractor — that extractor joins ctx.workspace_root for relative
// paths, but Tier 2 trusts the connector invariant.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => p.clone(),
kebab_core::SourceUri::Kb(_) => std::path::PathBuf::new(),
};
let (repo, git_branch, git_commit) = match kebab_parse_code::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let title = {
let fname = asset.workspace_path.0
.rsplit('/')
.next()
.unwrap_or(&asset.workspace_path.0);
// strip extension
match fname.rfind('.') {
Some(i) => fname[..i].to_string(),
None => fname.to_string(),
}
};
let metadata = Metadata {
aliases: vec![],
tags: vec![],
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: serde_json::Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some(code_lang.to_string()),
};
tracing::debug!(
target: "kebab-app",
"synthesized tier2 doc_id={} workspace_path={} lang={}",
doc_id.0,
asset.workspace_path.0,
code_lang,
);
Ok(kebab_core::CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks: vec![block],
metadata,
provenance: Provenance { events },
parser_version: parser_version.clone(),
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
/// Pull the BCP-47 language hint from the canonical document. P6-1
/// stamps `Lang("und")` by default; image-pipeline OCR / caption
/// adapters special-case "und" so the hint is intentionally dropped

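The `ingest_one_code_asset` hunks above retry with the Tier 3 chunker when the Tier 1/2 chunker returns zero chunks or an error, except on the direct `shell` path, which is already Tier 3. The same decision ladder can be isolated as a small generic function. This is a simplified sketch, not the code in the diff: `Tier`, `chunk_with_fallback`, and the `already_fallback` flag are hypothetical names, and chunks are plain strings here.

```rust
/// Which path ended up producing the chunks (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Tier {
    Primary,
    Fallback,
}

/// Use the primary result when it is non-empty; otherwise retry with the
/// fallback chunker. When the primary path is itself the fallback chunker
/// (`already_fallback`), its result is propagated without a second attempt.
fn chunk_with_fallback<E: std::fmt::Display>(
    primary: Result<Vec<String>, E>,
    already_fallback: bool,
    fallback: impl FnOnce() -> Result<Vec<String>, E>,
) -> Result<(Tier, Vec<String>), E> {
    match primary {
        // Normal case: the primary chunker produced something.
        Ok(chunks) if !chunks.is_empty() => Ok((Tier::Primary, chunks)),
        // Empty or failed, but there is nothing further to fall back to.
        other if already_fallback => other.map(|chunks| (Tier::Primary, chunks)),
        // Primary succeeded with zero chunks: retry with the fallback.
        Ok(_) => fallback().map(|chunks| (Tier::Fallback, chunks)),
        // Primary errored: log the error and retry with the fallback.
        Err(e) => {
            eprintln!("primary chunker failed: {e}; retrying with fallback");
            fallback().map(|chunks| (Tier::Fallback, chunks))
        }
    }
}
```

The arm order matters: the non-empty check comes first, and the `already_fallback` arm sits before the two retry arms so a Tier 3 failure is reported as-is instead of being retried against the same chunker.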

@@ -390,6 +390,641 @@ fn javascript_file_ingests_and_searches_as_code_citation() {
);
}
/// p10-1c-go Task F: a `.go` file in a sub-directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="go"`,
/// `symbol="chunk.ParseDoc"`, and `line_start >= 1`.
/// The sub-directory (`chunk/`) ensures the Go package-prefix wiring
/// produces a non-empty module prefix so the fully-qualified symbol assertion
/// exercises that path end-to-end.
#[test]
fn go_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let pkg_dir = env.workspace_root.join("chunk");
std::fs::create_dir_all(&pkg_dir).unwrap();
std::fs::write(
pkg_dir.join("ast.go"),
"package chunk\n\nfunc ParseDoc(input string) string {\n return input\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0);
assert!(report.new >= 1);
let go_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("ast.go"))
.expect("ast.go item present");
assert_eq!(
go_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-go-v1"),
"parser_version must be code-go-v1"
);
assert_eq!(
go_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-go-ast-v1"),
"chunker_version must be code-go-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("ParseDoc"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
.expect("Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("go"), "citation.lang must be 'go'");
assert_eq!(
symbol.as_deref(),
Some("chunk.ParseDoc"),
"citation.symbol must be 'chunk.ParseDoc'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("go"),
"SearchHit.code_lang must be 'go'"
);
}
/// p10-1c-jk Task F: a `.java` file in a package directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="java"`,
/// `symbol="com.foo.Foo.bar"`, and `line_start >= 1`.
/// The sub-directory (`com/foo/`) ensures the Java package-prefix wiring
/// produces a non-empty module prefix so the fully-qualified symbol assertion
/// exercises that path end-to-end.
#[test]
fn java_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let pkg_dir = env.workspace_root.join("com").join("foo");
std::fs::create_dir_all(&pkg_dir).unwrap();
std::fs::write(
pkg_dir.join("Foo.java"),
"package com.foo;\n\npublic class Foo {\n public String bar() { return \"x\"; }\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0);
assert!(report.new >= 1);
let java_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("Foo.java"))
.expect("Foo.java item present");
assert_eq!(
java_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-java-v1"),
"parser_version must be code-java-v1"
);
assert_eq!(
java_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-java-ast-v1"),
"chunker_version must be code-java-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("bar"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
.expect("Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("java"), "citation.lang must be 'java'");
assert_eq!(
symbol.as_deref(),
Some("com.foo.Foo.bar"),
"citation.symbol must be 'com.foo.Foo.bar'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("java"),
"SearchHit.code_lang must be 'java'"
);
}
/// p10-1c-jk Task I: a `.kt` file in a package directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="kotlin"`,
/// `symbol="com.foo.Foo.bar"`, and `line_start >= 1`.
/// The sub-directory (`com/foo/`) ensures the Kotlin package-prefix wiring
/// produces a non-empty module prefix so the fully-qualified symbol assertion
/// exercises that path end-to-end.
#[test]
fn kotlin_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let pkg_dir = env.workspace_root.join("com").join("foo");
std::fs::create_dir_all(&pkg_dir).unwrap();
std::fs::write(
pkg_dir.join("Foo.kt"),
"package com.foo\n\nclass Foo {\n fun bar(): String = \"x\"\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0);
assert!(report.new >= 1);
let kt_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("Foo.kt"))
.expect("Foo.kt item present");
assert_eq!(
kt_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-kotlin-v1"),
"parser_version must be code-kotlin-v1"
);
assert_eq!(
kt_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-kotlin-ast-v1"),
"chunker_version must be code-kotlin-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("bar"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
.expect("Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("kotlin"), "citation.lang must be 'kotlin'");
assert_eq!(
symbol.as_deref(),
Some("com.foo.Foo.bar"),
"citation.symbol must be 'com.foo.Foo.bar'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("kotlin"),
"SearchHit.code_lang must be 'kotlin'"
);
}
/// p10-2 Task H: a `k8s/deploy.yaml` file with a Deployment resource is
/// ingested and the resulting `Citation::Code` hit must carry
/// `lang="yaml"`, `symbol="Deployment/prod/api"`, and `line_start >= 1`.
/// Exercises the k8s-manifest-resource-v1 chunker end-to-end.
#[test]
fn tier2_k8s_yaml_ingest_searchable() {
let env = TestEnv::lexical_only();
let k8s_dir = env.workspace_root.join("k8s");
std::fs::create_dir_all(&k8s_dir).unwrap();
std::fs::write(
k8s_dir.join("deploy.yaml"),
"apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: api\n namespace: prod\nspec:\n replicas: 1\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "yaml file ingested: {report:?}");
let yaml_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.yaml"))
.expect("deploy.yaml item present");
assert_eq!(
yaml_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1"
);
assert_eq!(
yaml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("k8s-manifest-resource-v1"),
"chunker_version must be k8s-manifest-resource-v1"
);
let query = kebab_core::SearchQuery {
text: "api".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["yaml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'api'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("yaml"), "citation.lang must be 'yaml'");
assert_eq!(
symbol.as_deref(),
Some("Deployment/prod/api"),
"citation.symbol must be 'Deployment/prod/api'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("yaml"),
"SearchHit.code_lang must be 'yaml'"
);
}
/// p10-2 Task H: a `Dockerfile` is ingested and the resulting
/// `Citation::Code` hit must carry `lang="dockerfile"`,
/// `symbol="<dockerfile>"`, and `line_start >= 1`.
/// Exercises the dockerfile-file-v1 chunker end-to-end.
#[test]
fn tier2_dockerfile_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("Dockerfile"),
"FROM rust:1.94\nRUN cargo install foo\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "Dockerfile ingested: {report:?}");
let df_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("Dockerfile"))
.expect("Dockerfile item present");
assert_eq!(
df_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1"
);
assert_eq!(
df_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("dockerfile-file-v1"),
"chunker_version must be dockerfile-file-v1"
);
let query = kebab_core::SearchQuery {
text: "cargo".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["dockerfile".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'cargo'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("dockerfile"),
"citation.lang must be 'dockerfile'"
);
assert_eq!(
symbol.as_deref(),
Some("<dockerfile>"),
"citation.symbol must be '<dockerfile>'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("dockerfile"),
"SearchHit.code_lang must be 'dockerfile'"
);
}
/// p10-2 Task H: a `Cargo.toml` manifest is ingested and the resulting
/// `Citation::Code` hit must carry `lang="toml"`, `symbol="<manifest>"`,
/// and `line_start >= 1`.
/// Exercises the manifest-file-v1 chunker end-to-end.
#[test]
fn tier2_cargo_toml_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("Cargo.toml"),
"[package]\nname = \"demo\"\nversion = \"0.1.0\"\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "Cargo.toml ingested: {report:?}");
let toml_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("Cargo.toml"))
.expect("Cargo.toml item present");
assert_eq!(
toml_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1"
);
assert_eq!(
toml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("manifest-file-v1"),
"chunker_version must be manifest-file-v1"
);
let query = kebab_core::SearchQuery {
text: "demo".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["toml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'demo'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("toml"),
"citation.lang must be 'toml'"
);
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"citation.symbol must be '<manifest>'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("toml"),
"SearchHit.code_lang must be 'toml'"
);
}
/// p10-3 Task E: a `.sh` file is ingested via the shell direct-Tier-3 path
/// and the resulting `Citation::Code` hit must carry `lang="shell"`,
/// `symbol=None`, `line_start >= 1`, and
/// `chunker_version = "code-text-paragraph-v1"`.
#[test]
fn tier3_shell_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("deploy.sh"),
"#!/usr/bin/env bash\nset -e\necho hello\n\nkebab ingest --json\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "shell file ingested: {report:?}");
let sh_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
.expect("deploy.sh item present");
assert_eq!(
sh_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1 for shell (Tier 3 direct)"
);
assert_eq!(
sh_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-text-paragraph-v1"),
"chunker_version must be code-text-paragraph-v1 for shell"
);
let query = kebab_core::SearchQuery {
text: "kebab".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["shell".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'kebab'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("shell"),
"citation.lang must be 'shell'"
);
assert_eq!(*symbol, None, "Tier 3 symbol must be None");
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("shell"),
"SearchHit.code_lang must be 'shell'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-text-paragraph-v1",
"shell chunks must be stamped with the Tier 3 chunker_version"
);
}
/// p10-3 Task E: a docker-compose-shaped YAML file (no `apiVersion`/`kind`)
/// is ingested; the k8s chunker returns `Ok(vec![])` and the Tier 3 fallback
/// wrapper retries with `CodeTextParagraphV1Chunker`. The resulting
/// `Citation::Code` hit must carry `lang="yaml"`, `symbol=None`,
/// `line_start >= 1`, and `chunker_version = "code-text-paragraph-v1"`.
#[test]
fn tier3_yaml_fallback_picks_up_non_k8s_yaml() {
let env = TestEnv::lexical_only();
// docker-compose-shaped YAML — version + services but no apiVersion/kind.
// The k8s chunker returns Ok(vec![]); Tier 3 fallback should pick this up.
std::fs::write(
env.workspace_root.join("docker-compose.yml"),
"version: '3'\nservices:\n api:\n image: nginx:latest\n ports:\n - 8080:80\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(
report.new >= 1,
"expected non-k8s yaml ingested via Tier 3, got {} new docs",
report.new
);
let yaml_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
.expect("docker-compose.yml item present");
assert_eq!(
yaml_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1 after Tier 3 fallback"
);
assert_eq!(
yaml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-text-paragraph-v1"),
"chunker_version must be code-text-paragraph-v1 after Tier 3 fallback"
);
let query = kebab_core::SearchQuery {
text: "nginx".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["yaml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'nginx'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("yaml"),
"citation.lang must be 'yaml'"
);
assert_eq!(*symbol, None, "Tier 3 fallback symbol must be None");
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("yaml"),
"SearchHit.code_lang must be 'yaml'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-text-paragraph-v1",
"non-k8s yaml fallback must be stamped code-text-paragraph-v1"
);
}
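
Review note: the fallback exercised by `tier3_yaml_fallback_picks_up_non_k8s_yaml` can be pictured with the minimal sketch below. The real wrapper's signature is not shown in this diff, so `chunk_with_fallback`, `ChunkResult`, `k8s_like`, and `paragraphs` are hypothetical stand-ins; only the two version labels and the "empty result or error triggers the retry" rule come from the tests and commit messages.

```rust
/// Hypothetical stand-in for a chunker result: chunk texts or an error message.
type ChunkResult = Result<Vec<String>, String>;

/// Run `primary`; if it errors or yields no chunks, retry with `fallback`.
/// Returns the label of the chunker that actually produced the chunks,
/// mirroring how the tests expect `chunker_version` to be stamped.
fn chunk_with_fallback(
    text: &str,
    primary_label: &'static str,
    primary: impl Fn(&str) -> ChunkResult,
    fallback_label: &'static str,
    fallback: impl Fn(&str) -> ChunkResult,
) -> Result<(&'static str, Vec<String>), String> {
    match primary(text) {
        Ok(chunks) if !chunks.is_empty() => Ok((primary_label, chunks)),
        // Ok(vec![]) or Err(_): fall through to the fallback chunker.
        _ => fallback(text).map(|chunks| (fallback_label, chunks)),
    }
}

/// Toy "k8s" chunker (assumption): only accepts documents with a `kind:` line.
fn k8s_like(text: &str) -> ChunkResult {
    if text.lines().any(|l| l.starts_with("kind:")) {
        Ok(vec![text.to_string()])
    } else {
        Ok(vec![])
    }
}

/// Toy paragraph chunker (assumption): split on blank lines, drop empty parts.
fn paragraphs(text: &str) -> ChunkResult {
    Ok(text
        .split("\n\n")
        .filter(|p| !p.trim().is_empty())
        .map(str::to_string)
        .collect())
}
```

Calling `chunk_with_fallback(compose_yaml, "k8s-manifest-resource-v1", k8s_like, "code-text-paragraph-v1", paragraphs)` on a docker-compose-shaped input returns the `code-text-paragraph-v1` label, which is the stamping the integration test asserts.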
/// Re-ingesting the same `.rs` file without changes must report
/// `Unchanged` (incremental-skip path exercised).
#[test]


@@ -0,0 +1,176 @@
//! Regression test for the twin-file fetch_span media-type lookup bug.
//!
//! Twin files (identical content at different workspace paths) share one
//! `assets` row whose PRIMARY KEY is the blake3 content hash. The old
//! `fetch_span` implementation called
//! `get_asset_by_workspace_path(&doc.workspace_path)` to check whether the
//! media type was PDF/audio (and therefore reject span fetch). For a twin
//! file that lookup could silently return the *other* twin's asset row if
//! `assets.workspace_path` had been overwritten on the most recent ingest of
//! the sibling — making the media-type branch decision incorrect.
//!
//! Fix: `fetch_span` now uses the 2-step lookup
//! `get_document_by_workspace_path` → `doc.source_asset_id` → `get_asset`
//! so the result is always anchored to the requesting document, not
//! whichever twin last updated `assets.workspace_path`.
//!
//! This test builds a twin-file scenario (two .md files at different paths
//! with identical content), ingests both, then calls `fetch_span` on each
//! twin's `doc_id` and asserts it succeeds. Before the fix, if the asset
//! row's workspace_path happened to point at the wrong twin the span could
//! return an incorrect `span_not_supported` for a non-PDF/audio file, or
//! conversely allow span on a PDF twin by accident. After the fix, the
//! lookup is always doc-specific.
mod common;
use common::TestEnv;
use kebab_app::ingest_with_config;
use kebab_core::{DocumentStore, FetchKind, FetchOpts, FetchQuery, IngestItemKind};
#[test]
fn twin_files_fetch_span_uses_correct_asset() {
let env = TestEnv::lexical_only();
// Write two markdown files with identical content at different paths.
let dir_a = env.workspace_root.join("src_a");
let dir_b = env.workspace_root.join("src_b");
std::fs::create_dir_all(&dir_a).unwrap();
std::fs::create_dir_all(&dir_b).unwrap();
// The content must span at least three lines so the span fetches below
// (lines 1-2 and 1-3) are non-trivial.
let content = "# Twin\n\nLine one.\n\nLine two.\n\nLine three.\n";
std::fs::write(dir_a.join("note.md"), content).unwrap();
std::fs::write(dir_b.join("note.md"), content).unwrap();
// Ingest all files (fixture workspace + our two new twins).
let report = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors; report={report:?}");
// Both twin paths must appear as New in the report.
let items = report.items.as_ref().expect("items must be present");
let twin_items: Vec<_> = items
.iter()
.filter(|i| {
i.doc_path.0.ends_with("src_a/note.md")
|| i.doc_path.0.ends_with("src_b/note.md")
})
.collect();
assert_eq!(
twin_items.len(),
2,
"exactly 2 twin items expected; items={items:?}"
);
for item in &twin_items {
assert_eq!(
item.kind,
IngestItemKind::New,
"each twin must be New; item={item:?}"
);
}
// Resolve doc_ids for both workspace paths.
// The ingest layer normalises workspace_path to the path relative to
// workspace_root (e.g. "src_a/note.md"), so we look up by that form.
let store = kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
store.run_migrations().unwrap();
// Find the twin items by matching on suffix so the test is robust to
// however the workspace root is represented. `items` is still in scope
// from the report check above.
let path_a_str = items
.iter()
.find(|i| i.doc_path.0.ends_with("src_a/note.md"))
.map(|i| i.doc_path.0.clone())
.expect("src_a/note.md must appear in ingest report");
let path_b_str = items
.iter()
.find(|i| i.doc_path.0.ends_with("src_b/note.md"))
.map(|i| i.doc_path.0.clone())
.expect("src_b/note.md must appear in ingest report");
let path_a = kebab_core::WorkspacePath(path_a_str);
let path_b = kebab_core::WorkspacePath(path_b_str);
let doc_a = store
.get_document_by_workspace_path(&path_a)
.expect("get_document_by_workspace_path path_a")
.expect("doc_a must exist after ingest");
let doc_b = store
.get_document_by_workspace_path(&path_b)
.expect("get_document_by_workspace_path path_b")
.expect("doc_b must exist after ingest");
// Both twins share one asset_id (same content hash).
assert_eq!(
doc_a.source_asset_id, doc_b.source_asset_id,
"twin files must share one asset_id"
);
// Open App and issue span fetch on each twin's doc_id.
let app = env.app();
let result_a = app
.fetch(
FetchQuery::Span {
doc_id: doc_a.doc_id.clone(),
line_start: 1,
line_end: 2,
},
FetchOpts::default(),
)
.expect("fetch_span on twin A must succeed for a markdown file");
assert_eq!(result_a.kind, FetchKind::Span);
assert!(
result_a.text.as_deref().is_some_and(|t| !t.is_empty()),
"span text for twin A must not be empty"
);
let result_b = app
.fetch(
FetchQuery::Span {
doc_id: doc_b.doc_id.clone(),
line_start: 1,
line_end: 2,
},
FetchOpts::default(),
)
.expect("fetch_span on twin B must succeed for a markdown file");
assert_eq!(result_b.kind, FetchKind::Span);
assert!(
result_b.text.as_deref().is_some_and(|t| !t.is_empty()),
"span text for twin B must not be empty"
);
// Ingest again to force the assets.workspace_path flip-flop, then
// re-check. Pre-fix this was the scenario that triggered the bug:
// after the second ingest the asset row's workspace_path could point
// at either twin, making one twin's span fetch behave incorrectly.
let report2 = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest must succeed");
assert_eq!(report2.errors, 0, "no ingest errors on second run; report={report2:?}");
// Re-open app after second ingest and verify span still works on both.
let app2 = env.app();
app2.fetch(
FetchQuery::Span {
doc_id: doc_a.doc_id.clone(),
line_start: 1,
line_end: 3,
},
FetchOpts::default(),
)
.expect("fetch_span on twin A after flip-flop must still succeed");
app2.fetch(
FetchQuery::Span {
doc_id: doc_b.doc_id.clone(),
line_start: 1,
line_end: 3,
},
FetchOpts::default(),
)
.expect("fetch_span on twin B after flip-flop must still succeed");
}
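
Review note: the failure mode described in the module comment can be reproduced with a small in-memory model. This is a sketch under stated assumptions, not the store's real schema or API: `Store`, `Asset`, `Document`, `ingest`, `asset_by_path`, and `asset_for_document` are hypothetical names, and the model shows only one way the path-based lookup goes wrong (a miss for the twin whose path was overwritten).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified asset row: one per content hash.
#[derive(Clone, Debug, PartialEq)]
struct Asset {
    media_type: String,
    /// Overwritten by whichever twin was ingested last.
    workspace_path: String,
}

/// Hypothetical, simplified document row: one per workspace path.
#[derive(Clone, Debug)]
struct Document {
    source_asset_id: String,
}

#[derive(Default)]
struct Store {
    assets: HashMap<String, Asset>,       // keyed by content hash
    documents: HashMap<String, Document>, // keyed by workspace path
}

impl Store {
    fn ingest(&mut self, path: &str, content_hash: &str, media_type: &str) {
        // Upsert: twins share one asset row, so its path flips to the latest twin.
        self.assets.insert(
            content_hash.to_string(),
            Asset {
                media_type: media_type.to_string(),
                workspace_path: path.to_string(),
            },
        );
        self.documents.insert(
            path.to_string(),
            Document { source_asset_id: content_hash.to_string() },
        );
    }

    /// Old-style lookup: find the asset through its single `workspace_path`.
    fn asset_by_path(&self, path: &str) -> Option<&Asset> {
        self.assets.values().find(|a| a.workspace_path == path)
    }

    /// Two-step lookup: document by path, then asset by the document's asset id.
    fn asset_for_document(&self, path: &str) -> Option<&Asset> {
        let doc = self.documents.get(path)?;
        self.assets.get(&doc.source_asset_id)
    }
}
```

After ingesting `src_a/note.md` and then `src_b/note.md` with the same hash, `asset_by_path("src_a/note.md")` finds nothing in this model, while `asset_for_document` resolves both twins to the shared row because it is anchored to the requesting document.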


@@ -13,6 +13,7 @@ serde_json_canonicalizer = "0.3"
blake3 = { workspace = true }
anyhow = { workspace = true }
tracing = { workspace = true }
serde_yaml = { workspace = true }
[dev-dependencies]
# kb-parse-md / kb-normalize are dev-only — used by the snapshot integration


@@ -0,0 +1,322 @@
//! `code-go-ast-v1` — maps a tree-sitter-derived Go AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-go-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeGoAstV1Chunker;
impl Chunker for CodeGoAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeGoAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeGoAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-go-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit into windows of at most `AST_CHUNK_MAX_LINES`
/// lines. A window that stops short of the unit's end is cut just after the
/// last blank line in its final fifth, if one exists; otherwise it is cut at
/// the line limit.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based and inclusive within the unit (caller adds the unit's absolute
/// `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.go".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-go-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("go".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("go".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("go".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_go_ast_v1() {
assert_eq!(CodeGoAstV1Chunker.chunker_version(),
ChunkerVersion("code-go-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "func parse() {\n\t// x\n}"),
("Foo.double", 5, 8, "func double() int {\n\t//\n\treturn 0\n}"),
]);
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-go-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tx{i} := {i}")).collect::<Vec<_>>().join("\n");
let code = format!("func big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "func parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeGoAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "func parse() {}\n")]);
let base: Vec<String> = CodeGoAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeGoAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeGoAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
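
Review note: the windowing done by `split_oversize` is easier to check with a small line limit. The sketch below restates the same logic with the limit as a parameter; `split_windows` and `max_lines` are hypothetical names introduced for illustration, and the function is a standalone approximation rather than a drop-in replacement.

```rust
/// Split `code` into windows of at most `max_lines` lines. A window that
/// stops short of the end is cut just after the last blank line found in its
/// final fifth, if any. Returns 0-based inclusive `(start, end, text)` triples.
fn split_windows(code: &str, max_lines: usize) -> Vec<(usize, usize, String)> {
    assert!(max_lines > 0, "max_lines must be positive");
    let lines: Vec<&str> = code.split('\n').collect();
    let total = lines.len();
    let mut out = Vec::new();
    let mut start = 0;
    while start < total {
        let mut end = (start + max_lines).min(total);
        if end < total {
            // Only the last fifth of the window is searched for a blank line.
            let floor = (start + max_lines * 4 / 5).min(end);
            if let Some(b) = (floor..end).rev().find(|&i| lines[i].trim().is_empty()) {
                end = b + 1;
            }
        }
        out.push((start, end - 1, lines[start..end].join("\n")));
        start = end;
    }
    out
}
```

With twelve lines, a blank line at index 8, and `max_lines = 10`, the first window ends at index 8 instead of 9 and the second covers indices 9 through 11. Joining the part texts with `\n` reproduces the input, so no line is dropped or duplicated.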


@@ -0,0 +1,322 @@
//! `code-java-ast-v1` — maps a tree-sitter-derived Java AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-java-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeJavaAstV1Chunker;
impl Chunker for CodeJavaAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeJavaAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeJavaAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-java-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit into windows of at most `AST_CHUNK_MAX_LINES`
/// lines. A window that stops short of the unit's end is cut just after the
/// last blank line in its final fifth, if one exists; otherwise it is cut at
/// the line limit.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based and inclusive within the unit (caller adds the unit's absolute
/// `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/Main.java".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-java-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("java".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("java".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("java".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_java_ast_v1() {
assert_eq!(CodeJavaAstV1Chunker.chunker_version(),
ChunkerVersion("code-java-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "void parse() {\n\t// x\n}"),
("Foo.double", 5, 8, "int double() {\n\t//\n\treturn 0;\n}"),
]);
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-java-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tint x{i} = {i};")).collect::<Vec<_>>().join("\n");
let code = format!("void big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "void parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeJavaAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "void parse() {}\n")]);
let base: Vec<String> = CodeJavaAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeJavaAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeJavaAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}


@@ -0,0 +1,322 @@
//! `code-kotlin-ast-v1` — maps a tree-sitter-derived Kotlin AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-kotlin-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeKotlinAstV1Chunker;
impl Chunker for CodeKotlinAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeKotlinAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeKotlinAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-kotlin-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit into parts of at most `AST_CHUNK_MAX_LINES`
/// lines. Each part ends at the last blank line found in the final fifth
/// of its window when there is one, and at the hard line limit otherwise.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based and inclusive within the unit (caller adds the unit's absolute
/// `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/Main.kt".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-kotlin-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("kotlin".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("kotlin".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("kotlin".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_kotlin_ast_v1() {
assert_eq!(CodeKotlinAstV1Chunker.chunker_version(),
ChunkerVersion("code-kotlin-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "fun parse() {\n\t// x\n}"),
("Foo.double", 5, 7, "fun double(): Int {\n\t//\n\treturn 0\n}"),
]);
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-kotlin-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tval x{i} = {i}")).collect::<Vec<_>>().join("\n");
let code = format!("fun big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "fun parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeKotlinAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "fun parse() {}\n")]);
let base: Vec<String> = CodeKotlinAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeKotlinAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeKotlinAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
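
The oversize fallback in `split_oversize` can be tried in isolation. The sketch below is a standalone, std-only reduction of it: `split_at_blank_lines` and its `max_lines` parameter are hypothetical names introduced here (the chunker pins the limit to `AST_CHUNK_MAX_LINES`), and it returns only the 0-based inclusive line offsets instead of building `Chunk`s.

```rust
/// Standalone sketch of the windowing done by `split_oversize`.
/// `max_lines` is an illustrative parameter; the chunker pins it to 200.
fn split_at_blank_lines(code: &str, max_lines: usize) -> Vec<(usize, usize)> {
    assert!(max_lines > 0, "window size must be positive");
    let lines: Vec<&str> = code.split('\n').collect();
    let total = lines.len();
    let mut out = Vec::new();
    let mut start = 0;
    while start < total {
        let mut end = (start + max_lines).min(total);
        // Only the last fifth of the window is searched for a blank line.
        let floor = start + max_lines * 4 / 5;
        if end < total {
            if let Some(b) = (floor.min(end)..end)
                .rev()
                .find(|&i| lines[i].trim().is_empty())
            {
                // The blank line closes the current part.
                end = b + 1;
            }
        }
        out.push((start, end - 1));
        start = end;
    }
    out
}

fn main() {
    // 12 lines, a single blank line at index 8, window of 10 lines.
    let code = "a\nb\nc\nd\ne\nf\ng\nh\n\ni\nj\nk";
    assert_eq!(split_at_blank_lines(code, 10), vec![(0, 8), (9, 11)]);

    // No blank line in the last fifth: fall back to the hard limit.
    let dense = "1\n2\n3\n4\n5\n6";
    assert_eq!(split_at_blank_lines(dense, 5), vec![(0, 4), (5, 5)]);
}
```

Because every part starts where the previous one ended, the parts never overlap, so adding the unit's absolute `line_start` to each offset yields disjoint line ranges.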


@@ -0,0 +1,170 @@
//! p10-3: Tier 3 paragraph + line-window fallback chunker.
//!
//! Splits code/text files on blank-line paragraph boundaries. Paragraphs
//! with more than 80 lines are further split into 80-line windows with a
//! 20-line overlap (stride 60). This plays the same role as the oversize
//! fallback in the Tier 1/2 chunkers, but uses overlapping windows and has
//! no AST structure, hence no symbol.
//!
//! Per spec §9.3: all emitted chunks carry `symbol: None`.
use crate::tier2_shared::{build_chunk_no_symbol, policy_hash};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "code-text-paragraph-v1";
/// Lines-per-window for the oversize fallback (Tier 3).
const FALLBACK_LINES_PER_CHUNK: usize = 80;
/// Overlap between consecutive windows.
const FALLBACK_LINES_OVERLAP: usize = 20;
// stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP = 60.
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeTextParagraphV1Chunker;
impl Chunker for CodeTextParagraphV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full source text.
let (text, lang_str) = match doc.blocks.first() {
Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
_ => return Ok(vec![]),
};
let mut chunks = Vec::new();
for para in split_paragraphs(text) {
push_paragraph(&mut chunks, doc, policy, &para, lang_str)?;
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"code-text-paragraph-v1 chunked",
);
Ok(chunks)
}
}
/// A contiguous run of non-blank lines from the source text.
struct Paragraph {
/// Lines joined with `\n` (no trailing newline).
text: String,
/// 1-indexed line number of the first line in the source file.
line_start: u32,
/// 1-indexed line number of the last line in the source file.
line_end: u32,
}
/// Split `text` into `Paragraph`s separated by blank (all-whitespace) lines.
///
/// Blank lines are treated as boundaries and are NOT included in any
/// paragraph's line range. Paragraphs that would consist entirely of blank
/// lines are skipped.
fn split_paragraphs(text: &str) -> Vec<Paragraph> {
let mut paragraphs = Vec::new();
let mut current: Vec<&str> = Vec::new();
let mut current_start: Option<u32> = None;
for (idx, line) in text.lines().enumerate() {
let line_no = (idx + 1) as u32;
let is_blank = line.trim().is_empty();
if is_blank {
if let Some(start) = current_start.take() {
let end = start + current.len() as u32 - 1;
paragraphs.push(Paragraph {
text: current.join("\n"),
line_start: start,
line_end: end,
});
current.clear();
}
} else {
if current_start.is_none() {
current_start = Some(line_no);
}
current.push(line);
}
}
// Flush any trailing paragraph not terminated by a blank line.
if let Some(start) = current_start {
let end = start + current.len() as u32 - 1;
paragraphs.push(Paragraph {
text: current.join("\n"),
line_start: start,
line_end: end,
});
}
paragraphs
}
/// Emit one or more chunks for a single paragraph.
///
/// Paragraphs with ≤ `FALLBACK_LINES_PER_CHUNK` lines become a single chunk.
/// Larger paragraphs are split into overlapping windows of
/// `FALLBACK_LINES_PER_CHUNK` lines with stride `FALLBACK_LINES_PER_CHUNK -
/// FALLBACK_LINES_OVERLAP`. The last window may be shorter. Window starts
/// are passed as `split_key` so `id_for_chunk` can produce distinct ids
/// across windows.
fn push_paragraph(
out: &mut Vec<Chunk>,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
para: &Paragraph,
lang: &str,
) -> Result<()> {
let n_lines = (para.line_end - para.line_start + 1) as usize;
if n_lines <= FALLBACK_LINES_PER_CHUNK {
// Use line_start as split_key so each paragraph gets a distinct
// chunk_id even when block_ids is empty (no symbol, no AST structure).
// Without this, all short paragraphs from the same doc share the same
// base_policy_hash and therefore the same id_for_chunk result.
out.push(build_chunk_no_symbol(
doc,
policy,
&para.text,
para.line_start,
para.line_end,
lang,
VERSION_LABEL,
Some(para.line_start),
));
return Ok(());
}
// Oversize: line-window split with overlap.
let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
let lines: Vec<&str> = para.text.lines().collect();
let mut i = 0usize;
loop {
let end = (i + FALLBACK_LINES_PER_CHUNK).min(lines.len());
let window_text = lines[i..end].join("\n");
let window_start = para.line_start + i as u32;
let window_end = para.line_start + (end as u32) - 1;
// Use window_start as split_key so chunk_ids are unique across windows.
out.push(build_chunk_no_symbol(
doc,
policy,
&window_text,
window_start,
window_end,
lang,
VERSION_LABEL,
Some(window_start),
));
if end == lines.len() {
break;
}
i += stride;
}
Ok(())
}
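
The two steps of the Tier 3 chunker (paragraph detection, then overlapping line windows for long paragraphs) can be sketched without any `kebab_core` types. The helper names `paragraph_ranges` and `window_ranges` are hypothetical and exist only for this illustration; they return line ranges rather than `Chunk`s, under the assumption that only the line arithmetic is of interest.

```rust
/// 1-indexed, inclusive line ranges of blank-line-separated paragraphs.
fn paragraph_ranges(text: &str) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    let mut start: Option<u32> = None;
    let mut last_line = 0u32;
    for (idx, line) in text.lines().enumerate() {
        let line_no = idx as u32 + 1;
        if line.trim().is_empty() {
            // A blank line closes the open paragraph, if any.
            if let Some(s) = start.take() {
                out.push((s, line_no - 1));
            }
        } else if start.is_none() {
            start = Some(line_no);
        }
        last_line = line_no;
    }
    // Flush a trailing paragraph that is not followed by a blank line.
    if let Some(s) = start {
        out.push((s, last_line));
    }
    out
}

/// 0-based, half-open windows over a paragraph of `n_lines` lines.
fn window_ranges(n_lines: usize, per_chunk: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(per_chunk > overlap, "stride must be positive");
    if n_lines <= per_chunk {
        return vec![(0, n_lines)];
    }
    let stride = per_chunk - overlap;
    let mut out = Vec::new();
    let mut i = 0;
    loop {
        let end = (i + per_chunk).min(n_lines);
        out.push((i, end));
        if end == n_lines {
            break;
        }
        i += stride;
    }
    out
}

fn main() {
    let script = "#!/bin/sh\nset -e\n\n\necho hi\n";
    assert_eq!(paragraph_ranges(script), vec![(1, 2), (5, 5)]);

    // 80-line windows with a 20-line overlap over a 200-line paragraph.
    assert_eq!(window_ranges(200, 80, 20), vec![(0, 80), (60, 140), (120, 200)]);
    // The last window may be shorter than the window size.
    assert_eq!(window_ranges(81, 80, 20), vec![(0, 80), (60, 81)]);
}
```

Each window start is distinct, which is why the chunker can use it as the `split_key` that separates chunk ids.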


@@ -0,0 +1,57 @@
//! p10-2: dockerfile whole-file chunker (Tier 2).
//!
//! Reads entire Dockerfile content and emits a single Chunk with symbol
//! "<dockerfile>", code_lang "dockerfile", line range 1..EOF.
//! Oversize >200 lines splits into line-windows sharing the symbol via
//! tier2_shared::push_chunks_with_oversize.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "dockerfile-file-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct DockerfileFileV1Chunker;
impl Chunker for DockerfileFileV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full Dockerfile text.
let text = match doc.blocks.first() {
Some(Block::Code(cb)) => cb.code.as_str(),
_ => return Ok(vec![]),
};
let total_lines = text.lines().count().max(1) as u32;
let mut chunks = Vec::new();
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
text,
1,
total_lines,
"<dockerfile>",
"dockerfile",
VERSION_LABEL,
)?;
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"dockerfile-file-v1 chunked",
);
Ok(chunks)
}
}
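
The `1..EOF` line range used by the whole-file chunkers comes from `text.lines().count().max(1)`. A small sketch of how that expression behaves on edge cases follows; `whole_file_line_range` is a hypothetical helper written for this illustration, not a function in the crate.

```rust
/// Line range reported for a whole-file chunk: always starts at 1 and
/// never ends before 1, even for an empty file.
fn whole_file_line_range(text: &str) -> (u32, u32) {
    let total = text.lines().count().max(1) as u32;
    (1, total)
}

fn main() {
    // A trailing newline does not add an extra line.
    assert_eq!(whole_file_line_range("FROM alpine\nRUN true\n"), (1, 2));
    assert_eq!(whole_file_line_range("FROM alpine"), (1, 1));
    // `"".lines()` yields nothing, so `max(1)` keeps the range well-formed.
    assert_eq!(whole_file_line_range(""), (1, 1));
}
```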


@@ -0,0 +1,169 @@
//! p10-2: k8s manifest resource-aware chunker.
//!
//! Splits a multi-document YAML file on `---` separator lines (a bare `---`,
//! optionally followed by whitespace, or `---` plus a space or tab and
//! trailing content), recognises documents that have both non-empty
//! `apiVersion` and `kind` string fields as k8s resources, and emits one
//! `Chunk` per resource (with oversize >200-line fallback). Non-k8s documents
//! are skipped; invalid YAML yields 0 chunks for the entire file.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "k8s-manifest-resource-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct K8sManifestResourceV1Chunker;
impl Chunker for K8sManifestResourceV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full YAML text.
let text = match doc.blocks.first() {
Some(Block::Code(cb)) => cb.code.as_str(),
_ => return Ok(vec![]),
};
let slices = split_yaml_documents(text);
let mut chunks: Vec<Chunk> = Vec::new();
for slice in slices {
// Invalid YAML in any document → return 0 chunks for the file.
let value: serde_yaml::Value = match serde_yaml::from_str(slice.text) {
Ok(v) => v,
Err(_) => return Ok(vec![]),
};
let Some(mapping) = value.as_mapping() else {
continue;
};
let api = mapping
.get("apiVersion")
.and_then(|v| v.as_str())
.unwrap_or("");
let kind = mapping
.get("kind")
.and_then(|v| v.as_str())
.unwrap_or("");
// Skip non-k8s documents.
if api.is_empty() || kind.is_empty() {
continue;
}
let metadata = mapping
.get("metadata")
.and_then(|v| v.as_mapping());
let name = metadata
.and_then(|m| m.get("name"))
.and_then(|v| v.as_str())
.unwrap_or("<unnamed>");
let namespace = metadata
.and_then(|m| m.get("namespace"))
.and_then(|v| v.as_str());
let symbol = match namespace {
Some(ns) if !ns.is_empty() => format!("{kind}/{ns}/{name}"),
_ => format!("{kind}/{name}"),
};
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
slice.text,
slice.line_start,
slice.line_end,
&symbol,
"yaml",
VERSION_LABEL,
)?;
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"k8s-manifest-resource-v1 chunked",
);
Ok(chunks)
}
}
struct YamlSlice<'a> {
text: &'a str,
line_start: u32,
line_end: u32,
}
/// Split raw YAML text into per-document slices on `---` separator lines.
/// Line numbers are 1-indexed.
fn split_yaml_documents(text: &str) -> Vec<YamlSlice<'_>> {
let lines: Vec<&str> = text.lines().collect();
// Collect indices of separator lines (0-based), then append a sentinel at
// the end so the last slice is always terminated.
let mut separators: Vec<usize> = lines
.iter()
.enumerate()
.filter_map(|(i, l)| {
let trimmed = l.trim_end();
if trimmed == "---"
|| trimmed.starts_with("--- ")
|| trimmed.starts_with("---\t")
{
Some(i)
} else {
None
}
})
.collect();
separators.push(lines.len());
let mut slices: Vec<YamlSlice<'_>> = Vec::new();
let mut doc_start_line: usize = 0; // 0-based index of current doc start
for sep_line in separators {
if sep_line > doc_start_line {
let start_byte = byte_offset_of_line(text, doc_start_line);
let end_byte = byte_offset_of_line(text, sep_line);
let slice_text = &text[start_byte..end_byte];
if !slice_text.trim().is_empty() {
slices.push(YamlSlice {
text: slice_text,
line_start: (doc_start_line + 1) as u32,
line_end: sep_line as u32,
});
}
}
doc_start_line = sep_line + 1;
}
slices
}
/// Return the byte offset of the start of `line_idx` (0-based line index).
fn byte_offset_of_line(text: &str, line_idx: usize) -> usize {
if line_idx == 0 {
return 0;
}
let mut count = 0usize;
for (i, c) in text.char_indices() {
if c == '\n' {
count += 1;
if count == line_idx {
return i + 1;
}
}
}
text.len()
}
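
The line bookkeeping in `split_yaml_documents` and the symbol format built in `chunk` can be checked with a std-only sketch. It deliberately leaves out YAML parsing (the chunker uses `serde_yaml` for that); `is_separator`, `document_ranges`, and `resource_symbol` are hypothetical names used only here.

```rust
/// Same separator test as the chunker: a bare `---`, or `---` followed by
/// a space or tab and further content.
fn is_separator(line: &str) -> bool {
    let t = line.trim_end();
    t == "---" || t.starts_with("--- ") || t.starts_with("---\t")
}

/// 1-indexed, inclusive line ranges of the non-empty YAML documents.
fn document_ranges(text: &str) -> Vec<(u32, u32)> {
    let lines: Vec<&str> = text.lines().collect();
    let mut out = Vec::new();
    let mut start = 0usize; // 0-based index of the current document start
    // Separator indices plus a sentinel one past the last line.
    for sep in (0..=lines.len()).filter(|&i| i == lines.len() || is_separator(lines[i])) {
        if sep > start && lines[start..sep].iter().any(|l| !l.trim().is_empty()) {
            out.push((start as u32 + 1, sep as u32));
        }
        start = sep + 1;
    }
    out
}

/// `kind/namespace/name`, or `kind/name` when no namespace is set.
fn resource_symbol(kind: &str, namespace: Option<&str>, name: Option<&str>) -> String {
    let name = name.unwrap_or("<unnamed>");
    match namespace {
        Some(ns) if !ns.is_empty() => format!("{kind}/{ns}/{name}"),
        _ => format!("{kind}/{name}"),
    }
}

fn main() {
    let yaml = "apiVersion: v1\nkind: ConfigMap\n---\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: web\n";
    assert_eq!(document_ranges(yaml), vec![(1, 2), (4, 7)]);
    // A leading separator does not produce an empty document.
    assert_eq!(document_ranges("---\na: 1\n"), vec![(2, 2)]);

    assert_eq!(resource_symbol("Deployment", Some("prod"), Some("web")), "Deployment/prod/web");
    assert_eq!(resource_symbol("ConfigMap", None, None), "ConfigMap/<unnamed>");
}
```

The separator line itself belongs to no document, which matches the chunker setting `line_end` to the 0-based separator index (that index equals the 1-based number of the line just before it).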


@@ -15,16 +15,31 @@
//! embedder, the retriever, the LLM, the RAG layer, or the UI layers.
//! It consumes `CanonicalDocument` purely through `kb-core` types.
mod code_go_ast_v1;
mod code_java_ast_v1;
mod code_js_ast_v1;
mod code_kotlin_ast_v1;
mod code_python_ast_v1;
mod code_rust_ast_v1;
mod code_ts_ast_v1;
mod md_heading_v1;
mod pdf_page_v1;
mod tier2_shared;
pub mod k8s_manifest_resource_v1;
pub mod dockerfile_file_v1;
pub mod manifest_file_v1;
pub mod code_text_paragraph_v1;
pub use code_go_ast_v1::CodeGoAstV1Chunker;
pub use code_java_ast_v1::CodeJavaAstV1Chunker;
pub use code_js_ast_v1::CodeJsAstV1Chunker;
pub use code_kotlin_ast_v1::CodeKotlinAstV1Chunker;
pub use code_python_ast_v1::CodePythonAstV1Chunker;
pub use code_rust_ast_v1::CodeRustAstV1Chunker;
pub use code_ts_ast_v1::CodeTsAstV1Chunker;
pub use md_heading_v1::MdHeadingV1Chunker;
pub use pdf_page_v1::PdfPageV1Chunker;
pub use k8s_manifest_resource_v1::K8sManifestResourceV1Chunker;
pub use dockerfile_file_v1::DockerfileFileV1Chunker;
pub use manifest_file_v1::ManifestFileV1Chunker;
pub use code_text_paragraph_v1::CodeTextParagraphV1Chunker;


@@ -0,0 +1,58 @@
//! p10-2: manifest whole-file chunker (Tier 2).
//!
//! Reads entire manifest file (Cargo.toml / package.json / pom.xml / go.mod /
//! build.gradle / pyproject.toml / tsconfig.json) and emits a single Chunk
//! with symbol "<manifest>", code_lang read from Block::Code.lang, line range
//! 1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
//! tier2_shared::push_chunks_with_oversize.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "manifest-file-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct ManifestFileV1Chunker;
impl Chunker for ManifestFileV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full manifest text.
let (text, lang) = match doc.blocks.first() {
Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
_ => return Ok(vec![]),
};
let total_lines = text.lines().count().max(1) as u32;
let mut chunks = Vec::new();
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
text,
1,
total_lines,
"<manifest>",
lang,
VERSION_LABEL,
)?;
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"manifest-file-v1 chunked",
);
Ok(chunks)
}
}


@@ -0,0 +1,184 @@
//! p10-2: Tier 2 chunker shared helpers (oversize fallback + Chunk build).
//!
//! Mirrors `code_rust_ast_v1`'s Chunk-construction pattern exactly so that
//! id / hashes / token-count / ChunkPolicy semantics stay identical across
//! Tier 1 (AST) and Tier 2 (resource-aware) chunkers.
use anyhow::Result;
use kebab_core::{
BlockId, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, DocumentId, SourceSpan,
id_for_chunk,
};
pub(crate) const AST_CHUNK_MAX_LINES: u32 = 200;
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
/// Compute the policy hash the same way `code_rust_ast_v1` does.
pub(crate) fn policy_hash(policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
/// Emit one chunk for `(text, line_start..=line_end, symbol, lang)`, splitting
/// into line-windows of at most `AST_CHUNK_MAX_LINES` if the slice is oversize.
/// Mirrors the oversize path in `code_rust_ast_v1`'s `chunk` impl.
#[allow(clippy::too_many_arguments)]
pub(crate) fn push_chunks_with_oversize(
out: &mut Vec<Chunk>,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
symbol: &str,
lang: &str,
chunker_version: &str,
) -> Result<()> {
// saturating_sub guards against an inverted range; the result is always >= 1.
let n_lines = line_end.saturating_sub(line_start) + 1;
let cv = ChunkerVersion(chunker_version.to_string());
let base_policy_hash = policy_hash(policy);
if n_lines <= AST_CHUNK_MAX_LINES {
out.push(build_chunk(
doc,
&cv,
&base_policy_hash,
text,
line_start,
line_end,
symbol,
lang,
None,
));
return Ok(());
}
let lines: Vec<&str> = text.lines().collect();
let total = lines.len();
let mut window_start = line_start;
let mut i = 0usize;
while i < total {
let take = (AST_CHUNK_MAX_LINES as usize).min(total - i);
let window_text = lines[i..i + take].join("\n");
let window_end = window_start + take as u32 - 1;
out.push(build_chunk(
doc,
&cv,
&base_policy_hash,
&window_text,
window_start,
window_end,
symbol,
lang,
Some(window_start),
));
i += take;
window_start = window_end + 1;
}
Ok(())
}
/// Build a single `Chunk`, mirroring `make_chunk` in `code_rust_ast_v1.rs`
/// exactly (same id recipe, same token estimate, same field set).
///
/// `split_key` is `Some(line_start_of_window)` for oversize splits, `None`
/// for normal single-chunk emission. Mirrors the `Some(part_ls)` / `None`
/// split_key pattern in 1A-2.
#[allow(clippy::too_many_arguments)]
pub(crate) fn build_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
base_policy_hash: &str,
text: &str,
line_start: u32,
line_end: u32,
symbol: &str,
lang: &str,
split_key: Option<u32>,
) -> Chunk {
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol.to_string()),
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, chunker_version, base_policy_hash, text, span, split_key)
}
/// Like `build_chunk` but emits `symbol: None`. Used by Tier 3 (per spec §9.3).
///
/// Accepts `policy: &ChunkPolicy` and `chunker_version: &str` (string slice)
/// so callers don't need to pre-compute the hash and version wrapper.
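/// Tier 3 passes `Some(line_start)` as `split_key` for every chunk (whole
/// paragraphs and oversize line-windows alike): `block_ids` is empty, so the
/// start line is what keeps chunk ids distinct within a document.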
#[allow(clippy::too_many_arguments)]
pub(crate) fn build_chunk_no_symbol(
doc: &CanonicalDocument,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
lang: &str,
chunker_version: &str,
split_key: Option<u32>,
) -> Chunk {
let cv = ChunkerVersion(chunker_version.to_string());
let base_policy_hash = policy_hash(policy);
let span = SourceSpan::Code {
line_start,
line_end,
symbol: None,
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, &cv, &base_policy_hash, text, span, split_key)
}
/// Core chunk-building logic shared by `build_chunk` and `build_chunk_no_symbol`.
///
/// Takes a pre-built `SourceSpan` so the only difference between the two
/// public helpers is whether `symbol` is `Some` or `None`. All id/hash/
/// token mechanics are identical.
fn build_chunk_from_span(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
base_policy_hash: &str,
text: &str,
span: SourceSpan,
split_key: Option<u32>,
) -> Chunk {
// id_hash mirrors code_rust_ast_v1's make_chunk logic:
// split_key Some(k) => "{base_policy_hash}#L{k}"
// split_key None => base_policy_hash
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
// block_ids: Tier 2/3 chunkers have no per-block structure (the whole file
// is one Block::Code), so we pass an empty slice. Chunk ids within a document
// are therefore distinguished only by `id_hash`, i.e. by `split_key`.
let block_ids: Vec<BlockId> = vec![];
let chunk_id = id_for_chunk(
&DocumentId(doc.doc_id.0.clone()),
chunker_version,
&block_ids,
&id_hash,
);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids,
text: text.to_string(),
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
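
The id and token mechanics shared by the Tier 2/3 helpers reduce to two small pure functions. The sketch below restates them without `kebab_core` or `blake3`; `id_hash` and `token_estimate` are hypothetical free functions, and the 16-character string stands in for a real truncated policy hash.

```rust
use std::collections::HashSet;

/// Bytes-per-token ratio used for the estimate (3 in the helpers above).
const BYTES_PER_TOKEN: usize = 3;

/// Hash input passed to the chunk-id function: the policy hash alone, or
/// the policy hash suffixed with `#L<line>` when a split key is given.
fn id_hash(base_policy_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        Some(k) => format!("{base_policy_hash}#L{k}"),
        None => base_policy_hash.to_string(),
    }
}

/// Byte-length-based token estimate, rounded up.
fn token_estimate(text: &str) -> usize {
    text.len().div_ceil(BYTES_PER_TOKEN)
}

fn main() {
    let base = "0123456789abcdef"; // placeholder for a truncated policy hash

    assert_eq!(id_hash(base, None), "0123456789abcdef");
    assert_eq!(id_hash(base, Some(201)), "0123456789abcdef#L201");

    // With empty block ids, only the split key separates the hash inputs.
    let keyed: HashSet<String> = [1, 201, 401].iter().map(|&k| id_hash(base, Some(k))).collect();
    assert_eq!(keyed.len(), 3);
    let unkeyed: HashSet<String> = (0..3).map(|_| id_hash(base, None)).collect();
    assert_eq!(unkeyed.len(), 1);

    assert_eq!(token_estimate("fn x() {}"), 3); // 9 bytes
    assert_eq!(token_estimate("abcd"), 2); // 4 bytes, rounded up
    assert_eq!(token_estimate(""), 0);
}
```

The collapsed `unkeyed` set illustrates why a chunker that emits several chunks per document with empty `block_ids` needs a per-chunk `split_key`.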


@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Go code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeGoAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("kebab_eval/metrics.go".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-go-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line function body to force split_oversize.
let big_body: String = {
let header = "func BigCompute(data []int) int {\n";
let body: String = (0..210u32)
.map(|i| format!("\tv{i} := 0\n\tif {i} < len(data) {{\n\t\tv{i} = data[{i}]\n\t}}\n"))
.collect();
let footer = "\treturn len(data)\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 1-5, ≤200)
// 1. free fn `ComputeMRR` (lines 7-12, ≤200)
// 2. struct `MetricsCollector` (lines 14-20, ≤200)
// 3. struct `BaseEvaluator` (lines 22-30, ≤200)
// 4. method `Run` (lines 32-38, ≤200)
// 5. method `Report` (lines 40-46, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)".to_string(),
),
(
"ComputeMRR",
7,
12,
"func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}".to_string(),
),
(
"BaseEvaluator",
22,
30,
"type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}".to_string(),
),
(
"MetricsCollector.Run",
32,
38,
"func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}".to_string(),
),
(
"MetricsCollector.Report",
40,
46,
"func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("go".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("go".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "metrics.go".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("go".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-go-ast-v1".into()),
}
}
#[test]
fn code_go_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.go.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-go-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_go_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeGoAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeGoAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
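The Go fixture's `BigCompute` unit is 843 lines long (1 header line, 210 × 4 body lines, 2 footer lines), and the baseline snapshot in this diff labels its first piece `BigCompute [part 1/5]` covering lines 48-247. The sketch below shows arithmetic that is consistent with those numbers. The constant name, the `split_parts` function, and the assumption that parts are consecutive and non-overlapping are illustrative guesses; this is not the crate's `split_oversize` implementation.

```rust
/// Hypothetical cap, inferred from the "≤200" / ">200 lines" fixture comments.
const MAX_LINES_PER_PART: u32 = 200;

/// Split a unit of `line_count` lines starting at `line_start` into
/// consecutive parts of at most `MAX_LINES_PER_PART` lines each.
/// Returns (label, line_start, line_end) with 1-based inclusive line numbers.
/// Assumption: parts do not overlap; the real chunker may differ.
fn split_parts(symbol: &str, line_start: u32, line_count: u32) -> Vec<(String, u32, u32)> {
    if line_count == 0 {
        return Vec::new();
    }
    let line_end = line_start + line_count - 1;
    if line_count <= MAX_LINES_PER_PART {
        // Small units pass through unchanged, keeping the plain symbol.
        return vec![(symbol.to_string(), line_start, line_end)];
    }
    // Ceiling division: number of parts needed to cover every line.
    let total = (line_count + MAX_LINES_PER_PART - 1) / MAX_LINES_PER_PART;
    (0..total)
        .map(|i| {
            let start = line_start + i * MAX_LINES_PER_PART;
            let end = (start + MAX_LINES_PER_PART - 1).min(line_end);
            (format!("{symbol} [part {}/{total}]", i + 1), start, end)
        })
        .collect()
}
```

Calling `split_parts("BigCompute", 48, 843)` under these assumptions yields five parts, the first labelled `BigCompute [part 1/5]` with lines 48-247, which matches the first oversize entry in the snapshot; the later parts are not confirmed by anything shown here.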

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Java code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeJavaAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("src/main/java/com/example/Metrics.java".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-java-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line class (one long method) to force split_oversize.
let big_body: String = {
let header = "public class BigCompute {\n public int compute(int[] data) {\n";
let body: String = (0..210u32)
.map(|i| format!(" int v{i} = {i} < data.length ? data[{i}] : 0;\n"))
.collect();
let footer = " return data.length;\n }\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 1-5, ≤200)
// 1. static method `computeMRR` (lines 7-12, ≤200)
// 2. class `MetricsCollector` (lines 14-20, ≤200)
// 3. class `BaseEvaluator` (lines 22-30, ≤200)
// 4. method `MetricsCollector.run` (lines 32-38, ≤200)
// 5. method `MetricsCollector.report` (lines 40-46, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import java.util.List;\nimport java.util.Map;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.stream.Collectors;".to_string(),
),
(
"computeMRR",
7,
12,
"public static double computeMRR(List<Double> scores) {\n if (scores.isEmpty()) {\n return 0.0;\n }\n return 1.0 / scores.size();\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"public class MetricsCollector {\n private List<Double> scores;\n private List<String> labels;\n private Map<String, Integer> counts;\n private Map<String, Double> totals;\n private List<String> tags;\n}".to_string(),
),
(
"BaseEvaluator",
22,
30,
"public class BaseEvaluator {\n private String name;\n\n public BaseEvaluator(String name) {\n this.name = name;\n }\n\n public void evaluate(List<String> data) throws Exception {\n String joined = String.join(\",\", data);\n }\n}".to_string(),
),
(
"MetricsCollector.run",
32,
38,
"public void run(List<Double> inputs) {\n for (Double inp : inputs) {\n scores.add(\n inp\n );\n }\n}".to_string(),
),
(
"MetricsCollector.report",
40,
46,
"public Map<String, Object> report() {\n Map<String, Object> result = new HashMap<>();\n result.put(\"mean\", 0.0);\n result.put(\"count\", scores.size());\n result.put(\"tags\", tags);\n return result;\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("java".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("java".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Metrics.java".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("java".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-java-ast-v1".into()),
}
}
#[test]
fn code_java_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.java.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-java-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_java_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeJavaAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeJavaAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Kotlin code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeKotlinAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("src/main/kotlin/com/example/Metrics.kt".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-kotlin-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line class (one long function) to force split_oversize.
let big_body: String = {
let header = "class BigCompute {\n fun compute(data: IntArray): Int {\n";
let body: String = (0..210u32)
.map(|i| format!(" val v{i} = if ({i} < data.size) data[{i}] else 0\n"))
.collect();
let footer = " return data.size\n }\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 1-5, ≤200)
// 1. top-level fn `computeMRR` (lines 7-12, ≤200)
// 2. data class `MetricsCollector` (lines 14-20, ≤200)
// 3. class `BaseEvaluator` (lines 22-30, ≤200)
// 4. method `MetricsCollector.run` (lines 32-38, ≤200)
// 5. method `MetricsCollector.report` (lines 40-46, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import kotlin.collections.List\nimport kotlin.collections.Map\nimport kotlin.collections.MutableList\nimport kotlin.collections.MutableMap\nimport kotlin.collections.mutableListOf".to_string(),
),
(
"computeMRR",
7,
12,
"fun computeMRR(scores: List<Double>): Double {\n if (scores.isEmpty()) {\n return 0.0\n }\n return 1.0 / scores.size\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"data class MetricsCollector(\n val scores: MutableList<Double> = mutableListOf(),\n val labels: MutableList<String> = mutableListOf(),\n val counts: MutableMap<String, Int> = mutableMapOf(),\n val totals: MutableMap<String, Double> = mutableMapOf(),\n val tags: MutableList<String> = mutableListOf(),\n)".to_string(),
),
(
"BaseEvaluator",
22,
30,
"open class BaseEvaluator(val name: String) {\n\n fun evaluate(data: List<String>) {\n val joined = data.joinToString(\",\")\n println(joined)\n }\n\n open fun describe(): String = name\n}".to_string(),
),
(
"MetricsCollector.run",
32,
38,
"fun MetricsCollector.run(inputs: List<Double>) {\n for (inp in inputs) {\n scores.add(\n inp\n )\n }\n}".to_string(),
),
(
"MetricsCollector.report",
40,
46,
"fun MetricsCollector.report(): Map<String, Any> {\n return mapOf(\n \"mean\" to 0.0,\n \"count\" to scores.size,\n \"tags\" to tags,\n )\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("kotlin".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("kotlin".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Metrics.kt".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("kotlin".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-kotlin-ast-v1".into()),
}
}
#[test]
fn code_kotlin_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.kt.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-kotlin-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_kotlin_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

@@ -0,0 +1,270 @@
//! Behavioural tests for `CodeTextParagraphV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing raw text into a single `Block::Code`, mirroring the pattern used
//! in `k8s_manifest_resource_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::CodeTextParagraphV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `text`
/// and the supplied `lang` label.
fn text_doc(lang: &str, text: &str) -> CanonicalDocument {
let wp = WorkspacePath("scripts/sample.sh".into());
let aid = AssetId("d".repeat(64));
let pv = ParserVersion("code-text-paragraph-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some(lang.into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some(lang.into()),
code: text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "sample.sh".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some(lang.into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-text-paragraph-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// `sample_shell.sh` has 4 paragraphs separated by 3 blank lines:
/// - paragraph 1: lines 1-2 (shebang + set -euo pipefail)
/// - paragraph 2: lines 4-7 (env setup block)
/// - paragraph 3: lines 9-11 (ingest block)
/// - paragraph 4: lines 13-15 (report block)
///
/// We assert:
/// - exactly 4 chunks (one per paragraph)
/// - all symbols are None (Tier 3 spec §9.3)
/// - all langs are "shell"
/// - line ranges are strictly ascending and do NOT include the blank lines
/// (lines 3, 8, 12 must not appear in any range)
#[test]
fn shell_multi_paragraph_splits_on_blank_lines() {
let fixture_path = fixtures_dir().join("sample_shell.sh");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
4,
"expected 4 chunks (one per paragraph), got {}: {chunks:#?}",
chunks.len()
);
// All symbols must be None (Tier 3 requirement).
for (i, chunk) in chunks.iter().enumerate() {
match &chunk.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.is_none(),
"chunk[{i}] symbol must be None for Tier 3 chunker, got {symbol:?}"
);
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// All langs must be "shell".
for (i, chunk) in chunks.iter().enumerate() {
match &chunk.source_spans[0] {
SourceSpan::Code { lang, .. } => {
assert_eq!(
lang.as_deref(),
Some("shell"),
"chunk[{i}] lang must be 'shell', got {lang:?}"
);
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// Line ranges must be strictly ascending with no overlap,
// and blank lines (3, 8, 12) must not be included in any range.
let expected_ranges: &[(u32, u32)] = &[(1, 2), (4, 7), (9, 11), (13, 15)];
let actual_ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
assert_eq!(
actual_ranges, expected_ranges,
"line ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
);
}
/// `sample_long_paragraph.txt` has exactly 200 lines, none of them blank, so
/// the entire file is one paragraph. 200 > 80 (FALLBACK_LINES_PER_CHUNK), so
/// the oversize window split fires with stride 60, giving 80-line windows
/// that overlap by 20 lines:
/// - window 1: lines 1-80
/// - window 2: lines 61-140
/// - window 3: lines 121-200
///
/// All chunk_ids must be distinct (the `#L{window_start}` split_key suffix
/// differentiates them).
#[test]
fn single_long_paragraph_line_window_split() {
let fixture_path = fixtures_dir().join("sample_long_paragraph.txt");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
assert_eq!(
text.lines().count(),
200,
"fixture must have exactly 200 lines"
);
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
3,
"expected 3 window chunks for 200-line paragraph, got {}: {chunks:#?}",
chunks.len()
);
let expected_ranges: &[(u32, u32)] = &[(1, 80), (61, 140), (121, 200)];
let actual_ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
assert_eq!(
actual_ranges, expected_ranges,
"window ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
);
// All chunk_ids must be distinct (#L{window_start} suffix differentiates them).
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"oversize window chunks must have distinct chunk_ids"
);
}
/// An empty source file (no non-blank lines) must yield zero chunks.
#[test]
fn empty_file_emits_zero_chunks() {
let doc = text_doc("shell", "");
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
0,
"empty file must yield 0 chunks, got {}: {chunks:#?}",
chunks.len()
);
}
/// The `lang` field on each emitted chunk must match the `lang` passed to
/// `text_doc`, regardless of content. `symbol` must be `None` (Tier 3 spec).
#[test]
fn lang_field_preserved_from_input_doc() {
let doc = text_doc("yaml", "key1: value1\nkey2: value2\n");
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert!(!chunks.is_empty(), "expected at least one chunk");
match &chunks[0].source_spans[0] {
SourceSpan::Code { lang, symbol, .. } => {
assert_eq!(
lang.as_deref(),
Some("yaml"),
"lang must be 'yaml', got {lang:?}"
);
assert!(
symbol.is_none(),
"symbol must be None for Tier 3 chunker, got {symbol:?}"
);
}
other => panic!("expected Code span, got {other:?}"),
}
}
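The tests above pin two behaviours: paragraphs split on blank lines, and a paragraph longer than 80 lines is cut into 80-line windows with stride 60. A minimal sketch that reproduces the asserted line ranges follows. `FALLBACK_LINES_PER_CHUNK` is the name used in the test doc comment; `WINDOW_STRIDE`, `paragraph_ranges`, `line_windows`, and `chunk_ranges` are hypothetical names, and the real chunker's handling of a short final window is not visible here, so treat this as an illustration only.

```rust
/// Window size named in the test doc comment (80 lines per chunk).
const FALLBACK_LINES_PER_CHUNK: u32 = 80;
/// Hypothetical name for the stride of 60 mentioned in the doc comment.
const WINDOW_STRIDE: u32 = 60;

/// 1-based inclusive line ranges of blank-line-separated paragraphs.
/// Blank lines themselves are never part of any range.
fn paragraph_ranges(text: &str) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut start: Option<u32> = None;
    let mut last = 0u32;
    for (idx, line) in text.lines().enumerate() {
        let n = idx as u32 + 1;
        last = n;
        if line.trim().is_empty() {
            // A blank line closes the paragraph in progress, if any.
            if let Some(s) = start.take() {
                ranges.push((s, n - 1));
            }
        } else if start.is_none() {
            start = Some(n);
        }
    }
    if let Some(s) = start {
        ranges.push((s, last));
    }
    ranges
}

/// Cover `first..=last` with overlapping windows of `window` lines,
/// advancing by `stride` lines; the final window is clamped to `last`.
fn line_windows(first: u32, last: u32, window: u32, stride: u32) -> Vec<(u32, u32)> {
    assert!(stride > 0 && stride <= window, "need 0 < stride <= window");
    let mut out = Vec::new();
    let mut start = first;
    loop {
        let end = (start + window - 1).min(last);
        out.push((start, end));
        if end == last {
            break;
        }
        start += stride;
    }
    out
}

/// Paragraph split first, then window split for oversize paragraphs.
fn chunk_ranges(text: &str) -> Vec<(u32, u32)> {
    paragraph_ranges(text)
        .into_iter()
        .flat_map(|(s, e)| {
            if e - s + 1 > FALLBACK_LINES_PER_CHUNK {
                line_windows(s, e, FALLBACK_LINES_PER_CHUNK, WINDOW_STRIDE)
            } else {
                vec![(s, e)]
            }
        })
        .collect()
}
```

For a 200-line input with no blank lines, `chunk_ranges` returns `(1, 80)`, `(61, 140)`, `(121, 200)`, the same ranges `single_long_paragraph_line_window_split` expects; an empty string returns no ranges, in line with `empty_file_emits_zero_chunks`.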

@@ -0,0 +1,134 @@
//! Behavioural tests for `DockerfileFileV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing the raw Dockerfile text into a single `Block::Code`, mirroring the
//! pattern used in `k8s_manifest_resource_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::DockerfileFileV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `dockerfile_text`.
fn dockerfile_doc(dockerfile_text: &str) -> CanonicalDocument {
let wp = WorkspacePath("build/Dockerfile".into());
let aid = AssetId("d".repeat(64));
let pv = ParserVersion("code-dockerfile-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = dockerfile_text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some("dockerfile".into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("dockerfile".into()),
code: dockerfile_text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Dockerfile".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("dockerfile".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("dockerfile-file-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// A simple 5-line Dockerfile fixture must emit exactly 1 chunk with the
/// correct symbol, lang, and line range.
#[test]
fn dockerfile_emits_single_chunk() {
let fixture_path = fixtures_dir().join("sample.dockerfile");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = dockerfile_doc(&text);
let chunks = DockerfileFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
// Inspect the Chunk's source_spans for symbol / lang / line range.
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(*line_end, 5, "line_end must be 5 (5-line fixture)");
assert_eq!(
symbol.as_deref(),
Some("<dockerfile>"),
"symbol must be '<dockerfile>'"
);
assert_eq!(lang.as_deref(), Some("dockerfile"), "lang must be 'dockerfile'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
// Verify chunker_version label.
assert_eq!(chunks[0].chunker_version.0, "dockerfile-file-v1");
}

@@ -0,0 +1,233 @@
[
{
"block_ids": [
"c182bf37e32c7fc1b868bd617f8eaf66"
],
"chunk_id": "43de518d946dc18ec040ae20d74e0cff",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 5,
"line_start": 1,
"symbol": "imports"
}
],
"text": "import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)",
"token_estimate": 12
},
{
"block_ids": [
"c9992cdcfdf3c2a7700a4abc4782a8a4"
],
"chunk_id": "af4c382a83f1e8cdea495d8b33c11abc",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 12,
"line_start": 7,
"symbol": "ComputeMRR"
}
],
"text": "func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}",
"token_estimate": 50
},
{
"block_ids": [
"5f18dc3e79fe946ba05d32c3bfc00684"
],
"chunk_id": "4be6d8f180bc19b8651877e5264852ac",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 20,
"line_start": 14,
"symbol": "MetricsCollector"
}
],
"text": "type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}",
"token_estimate": 45
},
{
"block_ids": [
"3009cc022ca832c323393e4f9bcdb388"
],
"chunk_id": "3ae182f4c6d304ee7f0aaf447142f948",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 30,
"line_start": 22,
"symbol": "BaseEvaluator"
}
],
"text": "type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}",
"token_estimate": 53
},
{
"block_ids": [
"e0e83d1d7f9327a1902ae9a8f67c1f1c"
],
"chunk_id": "b962f14980e756bb8ba514e2282756cd",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 38,
"line_start": 32,
"symbol": "MetricsCollector.Run"
}
],
"text": "func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}",
"token_estimate": 44
},
{
"block_ids": [
"0e6a572bc3fe2bd6d173fe614bd1b763"
],
"chunk_id": "441c695e990e7f49188068433e313e87",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 46,
"line_start": 40,
"symbol": "MetricsCollector.Report"
}
],
"text": "func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}",
"token_estimate": 53
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "7a942d871c588ec69426290561f05179",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 247,
"line_start": 48,
"symbol": "BigCompute [part 1/5]"
}
],
"text": "func BigCompute(data []int) int {\n\tv0 := 0\n\tif 0 < len(data) {\n\t\tv0 = data[0]\n\t}\n\tv1 := 0\n\tif 1 < len(data) {\n\t\tv1 = data[1]\n\t}\n\tv2 := 0\n\tif 2 < len(data) {\n\t\tv2 = data[2]\n\t}\n\tv3 := 0\n\tif 3 < len(data) {\n\t\tv3 = data[3]\n\t}\n\tv4 := 0\n\tif 4 < len(data) {\n\t\tv4 = data[4]\n\t}\n\tv5 := 0\n\tif 5 < len(data) {\n\t\tv5 = data[5]\n\t}\n\tv6 := 0\n\tif 6 < len(data) {\n\t\tv6 = data[6]\n\t}\n\tv7 := 0\n\tif 7 < len(data) {\n\t\tv7 = data[7]\n\t}\n\tv8 := 0\n\tif 8 < len(data) {\n\t\tv8 = data[8]\n\t}\n\tv9 := 0\n\tif 9 < len(data) {\n\t\tv9 = data[9]\n\t}\n\tv10 := 0\n\tif 10 < len(data) {\n\t\tv10 = data[10]\n\t}\n\tv11 := 0\n\tif 11 < len(data) {\n\t\tv11 = data[11]\n\t}\n\tv12 := 0\n\tif 12 < len(data) {\n\t\tv12 = data[12]\n\t}\n\tv13 := 0\n\tif 13 < len(data) {\n\t\tv13 = data[13]\n\t}\n\tv14 := 0\n\tif 14 < len(data) {\n\t\tv14 = data[14]\n\t}\n\tv15 := 0\n\tif 15 < len(data) {\n\t\tv15 = data[15]\n\t}\n\tv16 := 0\n\tif 16 < len(data) {\n\t\tv16 = data[16]\n\t}\n\tv17 := 0\n\tif 17 < len(data) {\n\t\tv17 = data[17]\n\t}\n\tv18 := 0\n\tif 18 < len(data) {\n\t\tv18 = data[18]\n\t}\n\tv19 := 0\n\tif 19 < len(data) {\n\t\tv19 = data[19]\n\t}\n\tv20 := 0\n\tif 20 < len(data) {\n\t\tv20 = data[20]\n\t}\n\tv21 := 0\n\tif 21 < len(data) {\n\t\tv21 = data[21]\n\t}\n\tv22 := 0\n\tif 22 < len(data) {\n\t\tv22 = data[22]\n\t}\n\tv23 := 0\n\tif 23 < len(data) {\n\t\tv23 = data[23]\n\t}\n\tv24 := 0\n\tif 24 < len(data) {\n\t\tv24 = data[24]\n\t}\n\tv25 := 0\n\tif 25 < len(data) {\n\t\tv25 = data[25]\n\t}\n\tv26 := 0\n\tif 26 < len(data) {\n\t\tv26 = data[26]\n\t}\n\tv27 := 0\n\tif 27 < len(data) {\n\t\tv27 = data[27]\n\t}\n\tv28 := 0\n\tif 28 < len(data) {\n\t\tv28 = data[28]\n\t}\n\tv29 := 0\n\tif 29 < len(data) {\n\t\tv29 = data[29]\n\t}\n\tv30 := 0\n\tif 30 < len(data) {\n\t\tv30 = data[30]\n\t}\n\tv31 := 0\n\tif 31 < len(data) {\n\t\tv31 = data[31]\n\t}\n\tv32 := 0\n\tif 32 < len(data) {\n\t\tv32 = data[32]\n\t}\n\tv33 := 
0\n\tif 33 < len(data) {\n\t\tv33 = data[33]\n\t}\n\tv34 := 0\n\tif 34 < len(data) {\n\t\tv34 = data[34]\n\t}\n\tv35 := 0\n\tif 35 < len(data) {\n\t\tv35 = data[35]\n\t}\n\tv36 := 0\n\tif 36 < len(data) {\n\t\tv36 = data[36]\n\t}\n\tv37 := 0\n\tif 37 < len(data) {\n\t\tv37 = data[37]\n\t}\n\tv38 := 0\n\tif 38 < len(data) {\n\t\tv38 = data[38]\n\t}\n\tv39 := 0\n\tif 39 < len(data) {\n\t\tv39 = data[39]\n\t}\n\tv40 := 0\n\tif 40 < len(data) {\n\t\tv40 = data[40]\n\t}\n\tv41 := 0\n\tif 41 < len(data) {\n\t\tv41 = data[41]\n\t}\n\tv42 := 0\n\tif 42 < len(data) {\n\t\tv42 = data[42]\n\t}\n\tv43 := 0\n\tif 43 < len(data) {\n\t\tv43 = data[43]\n\t}\n\tv44 := 0\n\tif 44 < len(data) {\n\t\tv44 = data[44]\n\t}\n\tv45 := 0\n\tif 45 < len(data) {\n\t\tv45 = data[45]\n\t}\n\tv46 := 0\n\tif 46 < len(data) {\n\t\tv46 = data[46]\n\t}\n\tv47 := 0\n\tif 47 < len(data) {\n\t\tv47 = data[47]\n\t}\n\tv48 := 0\n\tif 48 < len(data) {\n\t\tv48 = data[48]\n\t}\n\tv49 := 0\n\tif 49 < len(data) {\n\t\tv49 = data[49]",
"token_estimate": 847
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "3f44ba43c9415652e2705bb667776e76",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 447,
"line_start": 248,
"symbol": "BigCompute [part 2/5]"
}
],
"text": "\t}\n\tv50 := 0\n\tif 50 < len(data) {\n\t\tv50 = data[50]\n\t}\n\tv51 := 0\n\tif 51 < len(data) {\n\t\tv51 = data[51]\n\t}\n\tv52 := 0\n\tif 52 < len(data) {\n\t\tv52 = data[52]\n\t}\n\tv53 := 0\n\tif 53 < len(data) {\n\t\tv53 = data[53]\n\t}\n\tv54 := 0\n\tif 54 < len(data) {\n\t\tv54 = data[54]\n\t}\n\tv55 := 0\n\tif 55 < len(data) {\n\t\tv55 = data[55]\n\t}\n\tv56 := 0\n\tif 56 < len(data) {\n\t\tv56 = data[56]\n\t}\n\tv57 := 0\n\tif 57 < len(data) {\n\t\tv57 = data[57]\n\t}\n\tv58 := 0\n\tif 58 < len(data) {\n\t\tv58 = data[58]\n\t}\n\tv59 := 0\n\tif 59 < len(data) {\n\t\tv59 = data[59]\n\t}\n\tv60 := 0\n\tif 60 < len(data) {\n\t\tv60 = data[60]\n\t}\n\tv61 := 0\n\tif 61 < len(data) {\n\t\tv61 = data[61]\n\t}\n\tv62 := 0\n\tif 62 < len(data) {\n\t\tv62 = data[62]\n\t}\n\tv63 := 0\n\tif 63 < len(data) {\n\t\tv63 = data[63]\n\t}\n\tv64 := 0\n\tif 64 < len(data) {\n\t\tv64 = data[64]\n\t}\n\tv65 := 0\n\tif 65 < len(data) {\n\t\tv65 = data[65]\n\t}\n\tv66 := 0\n\tif 66 < len(data) {\n\t\tv66 = data[66]\n\t}\n\tv67 := 0\n\tif 67 < len(data) {\n\t\tv67 = data[67]\n\t}\n\tv68 := 0\n\tif 68 < len(data) {\n\t\tv68 = data[68]\n\t}\n\tv69 := 0\n\tif 69 < len(data) {\n\t\tv69 = data[69]\n\t}\n\tv70 := 0\n\tif 70 < len(data) {\n\t\tv70 = data[70]\n\t}\n\tv71 := 0\n\tif 71 < len(data) {\n\t\tv71 = data[71]\n\t}\n\tv72 := 0\n\tif 72 < len(data) {\n\t\tv72 = data[72]\n\t}\n\tv73 := 0\n\tif 73 < len(data) {\n\t\tv73 = data[73]\n\t}\n\tv74 := 0\n\tif 74 < len(data) {\n\t\tv74 = data[74]\n\t}\n\tv75 := 0\n\tif 75 < len(data) {\n\t\tv75 = data[75]\n\t}\n\tv76 := 0\n\tif 76 < len(data) {\n\t\tv76 = data[76]\n\t}\n\tv77 := 0\n\tif 77 < len(data) {\n\t\tv77 = data[77]\n\t}\n\tv78 := 0\n\tif 78 < len(data) {\n\t\tv78 = data[78]\n\t}\n\tv79 := 0\n\tif 79 < len(data) {\n\t\tv79 = data[79]\n\t}\n\tv80 := 0\n\tif 80 < len(data) {\n\t\tv80 = data[80]\n\t}\n\tv81 := 0\n\tif 81 < len(data) {\n\t\tv81 = data[81]\n\t}\n\tv82 := 0\n\tif 82 < len(data) {\n\t\tv82 = data[82]\n\t}\n\tv83 
:= 0\n\tif 83 < len(data) {\n\t\tv83 = data[83]\n\t}\n\tv84 := 0\n\tif 84 < len(data) {\n\t\tv84 = data[84]\n\t}\n\tv85 := 0\n\tif 85 < len(data) {\n\t\tv85 = data[85]\n\t}\n\tv86 := 0\n\tif 86 < len(data) {\n\t\tv86 = data[86]\n\t}\n\tv87 := 0\n\tif 87 < len(data) {\n\t\tv87 = data[87]\n\t}\n\tv88 := 0\n\tif 88 < len(data) {\n\t\tv88 = data[88]\n\t}\n\tv89 := 0\n\tif 89 < len(data) {\n\t\tv89 = data[89]\n\t}\n\tv90 := 0\n\tif 90 < len(data) {\n\t\tv90 = data[90]\n\t}\n\tv91 := 0\n\tif 91 < len(data) {\n\t\tv91 = data[91]\n\t}\n\tv92 := 0\n\tif 92 < len(data) {\n\t\tv92 = data[92]\n\t}\n\tv93 := 0\n\tif 93 < len(data) {\n\t\tv93 = data[93]\n\t}\n\tv94 := 0\n\tif 94 < len(data) {\n\t\tv94 = data[94]\n\t}\n\tv95 := 0\n\tif 95 < len(data) {\n\t\tv95 = data[95]\n\t}\n\tv96 := 0\n\tif 96 < len(data) {\n\t\tv96 = data[96]\n\t}\n\tv97 := 0\n\tif 97 < len(data) {\n\t\tv97 = data[97]\n\t}\n\tv98 := 0\n\tif 98 < len(data) {\n\t\tv98 = data[98]\n\t}\n\tv99 := 0\n\tif 99 < len(data) {\n\t\tv99 = data[99]",
"token_estimate": 850
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "e4763e10f059d97f40c2932761b56c3e",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 647,
"line_start": 448,
"symbol": "BigCompute [part 3/5]"
}
],
"text": "\t}\n\tv100 := 0\n\tif 100 < len(data) {\n\t\tv100 = data[100]\n\t}\n\tv101 := 0\n\tif 101 < len(data) {\n\t\tv101 = data[101]\n\t}\n\tv102 := 0\n\tif 102 < len(data) {\n\t\tv102 = data[102]\n\t}\n\tv103 := 0\n\tif 103 < len(data) {\n\t\tv103 = data[103]\n\t}\n\tv104 := 0\n\tif 104 < len(data) {\n\t\tv104 = data[104]\n\t}\n\tv105 := 0\n\tif 105 < len(data) {\n\t\tv105 = data[105]\n\t}\n\tv106 := 0\n\tif 106 < len(data) {\n\t\tv106 = data[106]\n\t}\n\tv107 := 0\n\tif 107 < len(data) {\n\t\tv107 = data[107]\n\t}\n\tv108 := 0\n\tif 108 < len(data) {\n\t\tv108 = data[108]\n\t}\n\tv109 := 0\n\tif 109 < len(data) {\n\t\tv109 = data[109]\n\t}\n\tv110 := 0\n\tif 110 < len(data) {\n\t\tv110 = data[110]\n\t}\n\tv111 := 0\n\tif 111 < len(data) {\n\t\tv111 = data[111]\n\t}\n\tv112 := 0\n\tif 112 < len(data) {\n\t\tv112 = data[112]\n\t}\n\tv113 := 0\n\tif 113 < len(data) {\n\t\tv113 = data[113]\n\t}\n\tv114 := 0\n\tif 114 < len(data) {\n\t\tv114 = data[114]\n\t}\n\tv115 := 0\n\tif 115 < len(data) {\n\t\tv115 = data[115]\n\t}\n\tv116 := 0\n\tif 116 < len(data) {\n\t\tv116 = data[116]\n\t}\n\tv117 := 0\n\tif 117 < len(data) {\n\t\tv117 = data[117]\n\t}\n\tv118 := 0\n\tif 118 < len(data) {\n\t\tv118 = data[118]\n\t}\n\tv119 := 0\n\tif 119 < len(data) {\n\t\tv119 = data[119]\n\t}\n\tv120 := 0\n\tif 120 < len(data) {\n\t\tv120 = data[120]\n\t}\n\tv121 := 0\n\tif 121 < len(data) {\n\t\tv121 = data[121]\n\t}\n\tv122 := 0\n\tif 122 < len(data) {\n\t\tv122 = data[122]\n\t}\n\tv123 := 0\n\tif 123 < len(data) {\n\t\tv123 = data[123]\n\t}\n\tv124 := 0\n\tif 124 < len(data) {\n\t\tv124 = data[124]\n\t}\n\tv125 := 0\n\tif 125 < len(data) {\n\t\tv125 = data[125]\n\t}\n\tv126 := 0\n\tif 126 < len(data) {\n\t\tv126 = data[126]\n\t}\n\tv127 := 0\n\tif 127 < len(data) {\n\t\tv127 = data[127]\n\t}\n\tv128 := 0\n\tif 128 < len(data) {\n\t\tv128 = data[128]\n\t}\n\tv129 := 0\n\tif 129 < len(data) {\n\t\tv129 = data[129]\n\t}\n\tv130 := 0\n\tif 130 < len(data) {\n\t\tv130 = 
data[130]\n\t}\n\tv131 := 0\n\tif 131 < len(data) {\n\t\tv131 = data[131]\n\t}\n\tv132 := 0\n\tif 132 < len(data) {\n\t\tv132 = data[132]\n\t}\n\tv133 := 0\n\tif 133 < len(data) {\n\t\tv133 = data[133]\n\t}\n\tv134 := 0\n\tif 134 < len(data) {\n\t\tv134 = data[134]\n\t}\n\tv135 := 0\n\tif 135 < len(data) {\n\t\tv135 = data[135]\n\t}\n\tv136 := 0\n\tif 136 < len(data) {\n\t\tv136 = data[136]\n\t}\n\tv137 := 0\n\tif 137 < len(data) {\n\t\tv137 = data[137]\n\t}\n\tv138 := 0\n\tif 138 < len(data) {\n\t\tv138 = data[138]\n\t}\n\tv139 := 0\n\tif 139 < len(data) {\n\t\tv139 = data[139]\n\t}\n\tv140 := 0\n\tif 140 < len(data) {\n\t\tv140 = data[140]\n\t}\n\tv141 := 0\n\tif 141 < len(data) {\n\t\tv141 = data[141]\n\t}\n\tv142 := 0\n\tif 142 < len(data) {\n\t\tv142 = data[142]\n\t}\n\tv143 := 0\n\tif 143 < len(data) {\n\t\tv143 = data[143]\n\t}\n\tv144 := 0\n\tif 144 < len(data) {\n\t\tv144 = data[144]\n\t}\n\tv145 := 0\n\tif 145 < len(data) {\n\t\tv145 = data[145]\n\t}\n\tv146 := 0\n\tif 146 < len(data) {\n\t\tv146 = data[146]\n\t}\n\tv147 := 0\n\tif 147 < len(data) {\n\t\tv147 = data[147]\n\t}\n\tv148 := 0\n\tif 148 < len(data) {\n\t\tv148 = data[148]\n\t}\n\tv149 := 0\n\tif 149 < len(data) {\n\t\tv149 = data[149]",
"token_estimate": 917
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "24176c911d0bacf9a29fa7f8251f5036",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 847,
"line_start": 648,
"symbol": "BigCompute [part 4/5]"
}
],
"text": "\t}\n\tv150 := 0\n\tif 150 < len(data) {\n\t\tv150 = data[150]\n\t}\n\tv151 := 0\n\tif 151 < len(data) {\n\t\tv151 = data[151]\n\t}\n\tv152 := 0\n\tif 152 < len(data) {\n\t\tv152 = data[152]\n\t}\n\tv153 := 0\n\tif 153 < len(data) {\n\t\tv153 = data[153]\n\t}\n\tv154 := 0\n\tif 154 < len(data) {\n\t\tv154 = data[154]\n\t}\n\tv155 := 0\n\tif 155 < len(data) {\n\t\tv155 = data[155]\n\t}\n\tv156 := 0\n\tif 156 < len(data) {\n\t\tv156 = data[156]\n\t}\n\tv157 := 0\n\tif 157 < len(data) {\n\t\tv157 = data[157]\n\t}\n\tv158 := 0\n\tif 158 < len(data) {\n\t\tv158 = data[158]\n\t}\n\tv159 := 0\n\tif 159 < len(data) {\n\t\tv159 = data[159]\n\t}\n\tv160 := 0\n\tif 160 < len(data) {\n\t\tv160 = data[160]\n\t}\n\tv161 := 0\n\tif 161 < len(data) {\n\t\tv161 = data[161]\n\t}\n\tv162 := 0\n\tif 162 < len(data) {\n\t\tv162 = data[162]\n\t}\n\tv163 := 0\n\tif 163 < len(data) {\n\t\tv163 = data[163]\n\t}\n\tv164 := 0\n\tif 164 < len(data) {\n\t\tv164 = data[164]\n\t}\n\tv165 := 0\n\tif 165 < len(data) {\n\t\tv165 = data[165]\n\t}\n\tv166 := 0\n\tif 166 < len(data) {\n\t\tv166 = data[166]\n\t}\n\tv167 := 0\n\tif 167 < len(data) {\n\t\tv167 = data[167]\n\t}\n\tv168 := 0\n\tif 168 < len(data) {\n\t\tv168 = data[168]\n\t}\n\tv169 := 0\n\tif 169 < len(data) {\n\t\tv169 = data[169]\n\t}\n\tv170 := 0\n\tif 170 < len(data) {\n\t\tv170 = data[170]\n\t}\n\tv171 := 0\n\tif 171 < len(data) {\n\t\tv171 = data[171]\n\t}\n\tv172 := 0\n\tif 172 < len(data) {\n\t\tv172 = data[172]\n\t}\n\tv173 := 0\n\tif 173 < len(data) {\n\t\tv173 = data[173]\n\t}\n\tv174 := 0\n\tif 174 < len(data) {\n\t\tv174 = data[174]\n\t}\n\tv175 := 0\n\tif 175 < len(data) {\n\t\tv175 = data[175]\n\t}\n\tv176 := 0\n\tif 176 < len(data) {\n\t\tv176 = data[176]\n\t}\n\tv177 := 0\n\tif 177 < len(data) {\n\t\tv177 = data[177]\n\t}\n\tv178 := 0\n\tif 178 < len(data) {\n\t\tv178 = data[178]\n\t}\n\tv179 := 0\n\tif 179 < len(data) {\n\t\tv179 = data[179]\n\t}\n\tv180 := 0\n\tif 180 < len(data) {\n\t\tv180 = 
data[180]\n\t}\n\tv181 := 0\n\tif 181 < len(data) {\n\t\tv181 = data[181]\n\t}\n\tv182 := 0\n\tif 182 < len(data) {\n\t\tv182 = data[182]\n\t}\n\tv183 := 0\n\tif 183 < len(data) {\n\t\tv183 = data[183]\n\t}\n\tv184 := 0\n\tif 184 < len(data) {\n\t\tv184 = data[184]\n\t}\n\tv185 := 0\n\tif 185 < len(data) {\n\t\tv185 = data[185]\n\t}\n\tv186 := 0\n\tif 186 < len(data) {\n\t\tv186 = data[186]\n\t}\n\tv187 := 0\n\tif 187 < len(data) {\n\t\tv187 = data[187]\n\t}\n\tv188 := 0\n\tif 188 < len(data) {\n\t\tv188 = data[188]\n\t}\n\tv189 := 0\n\tif 189 < len(data) {\n\t\tv189 = data[189]\n\t}\n\tv190 := 0\n\tif 190 < len(data) {\n\t\tv190 = data[190]\n\t}\n\tv191 := 0\n\tif 191 < len(data) {\n\t\tv191 = data[191]\n\t}\n\tv192 := 0\n\tif 192 < len(data) {\n\t\tv192 = data[192]\n\t}\n\tv193 := 0\n\tif 193 < len(data) {\n\t\tv193 = data[193]\n\t}\n\tv194 := 0\n\tif 194 < len(data) {\n\t\tv194 = data[194]\n\t}\n\tv195 := 0\n\tif 195 < len(data) {\n\t\tv195 = data[195]\n\t}\n\tv196 := 0\n\tif 196 < len(data) {\n\t\tv196 = data[196]\n\t}\n\tv197 := 0\n\tif 197 < len(data) {\n\t\tv197 = data[197]\n\t}\n\tv198 := 0\n\tif 198 < len(data) {\n\t\tv198 = data[198]\n\t}\n\tv199 := 0\n\tif 199 < len(data) {\n\t\tv199 = data[199]",
"token_estimate": 917
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "438127626378632c03780d10603de32c",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 890,
"line_start": 848,
"symbol": "BigCompute [part 5/5]"
}
],
"text": "\t}\n\tv200 := 0\n\tif 200 < len(data) {\n\t\tv200 = data[200]\n\t}\n\tv201 := 0\n\tif 201 < len(data) {\n\t\tv201 = data[201]\n\t}\n\tv202 := 0\n\tif 202 < len(data) {\n\t\tv202 = data[202]\n\t}\n\tv203 := 0\n\tif 203 < len(data) {\n\t\tv203 = data[203]\n\t}\n\tv204 := 0\n\tif 204 < len(data) {\n\t\tv204 = data[204]\n\t}\n\tv205 := 0\n\tif 205 < len(data) {\n\t\tv205 = data[205]\n\t}\n\tv206 := 0\n\tif 206 < len(data) {\n\t\tv206 = data[206]\n\t}\n\tv207 := 0\n\tif 207 < len(data) {\n\t\tv207 = data[207]\n\t}\n\tv208 := 0\n\tif 208 < len(data) {\n\t\tv208 = data[208]\n\t}\n\tv209 := 0\n\tif 209 < len(data) {\n\t\tv209 = data[209]\n\t}\n\treturn len(data)\n}",
"token_estimate": 191
}
]
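The five `BigCompute [part k/5]` spans in the snapshot above cover lines 48 to 890 in consecutive windows of 200 lines (48-247, 248-447, 448-647, 648-847, 848-890). A minimal sketch of that windowing follows, assuming a fixed cap of 200 lines per window. `line_windows` and `MAX_LINES` are hypothetical names used only for illustration; they are not taken from the chunker's code, which may compute the split differently.

```rust
/// Hypothetical cap, chosen to match the 200-line spans in the snapshot.
const MAX_LINES: u32 = 200;

/// Split the inclusive line range `start..=end` into consecutive windows of
/// at most `max` lines each. Illustrative helper, not the real chunker code.
fn line_windows(start: u32, end: u32, max: u32) -> Vec<(u32, u32)> {
    assert!(max > 0, "window size must be positive");
    let mut windows = Vec::new();
    let mut lo = start;
    while lo <= end {
        // Inclusive upper bound of this window, clamped to `end`.
        let hi = end.min(lo.saturating_add(max - 1));
        windows.push((lo, hi));
        if hi == u32::MAX {
            break; // avoid overflow when the range ends at u32::MAX
        }
        lo = hi + 1;
    }
    windows
}
```

Calling `line_windows(48, 890, MAX_LINES)` reproduces the five ranges recorded in `source_spans`. Each window starts one line after the previous one ends, which is the contiguity property the oversize tests assert.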

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,5 @@
FROM rust:1.94-slim AS builder
WORKDIR /app
COPY . .
RUN cargo build --release
CMD ["/app/target/release/kebab"]

View File

@@ -0,0 +1,7 @@
[package]
name = "demo"
version = "0.1.0"
edition = "2021"
[dependencies]
serde = "1"

View File

@@ -0,0 +1,5 @@
module example.com/demo
go 1.22
require github.com/spf13/cobra v1.8.0

View File

@@ -0,0 +1,34 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: example/api:1.2.3
---
apiVersion: v1
kind: Service
metadata:
  name: api-server
  namespace: prod
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
---
# Non-k8s document — apiVersion missing
kind: ClusterIP
foo: bar

View File

@@ -0,0 +1,200 @@
line 001
line 002
line 003
line 004
line 005
line 006
line 007
line 008
line 009
line 010
line 011
line 012
line 013
line 014
line 015
line 016
line 017
line 018
line 019
line 020
line 021
line 022
line 023
line 024
line 025
line 026
line 027
line 028
line 029
line 030
line 031
line 032
line 033
line 034
line 035
line 036
line 037
line 038
line 039
line 040
line 041
line 042
line 043
line 044
line 045
line 046
line 047
line 048
line 049
line 050
line 051
line 052
line 053
line 054
line 055
line 056
line 057
line 058
line 059
line 060
line 061
line 062
line 063
line 064
line 065
line 066
line 067
line 068
line 069
line 070
line 071
line 072
line 073
line 074
line 075
line 076
line 077
line 078
line 079
line 080
line 081
line 082
line 083
line 084
line 085
line 086
line 087
line 088
line 089
line 090
line 091
line 092
line 093
line 094
line 095
line 096
line 097
line 098
line 099
line 100
line 101
line 102
line 103
line 104
line 105
line 106
line 107
line 108
line 109
line 110
line 111
line 112
line 113
line 114
line 115
line 116
line 117
line 118
line 119
line 120
line 121
line 122
line 123
line 124
line 125
line 126
line 127
line 128
line 129
line 130
line 131
line 132
line 133
line 134
line 135
line 136
line 137
line 138
line 139
line 140
line 141
line 142
line 143
line 144
line 145
line 146
line 147
line 148
line 149
line 150
line 151
line 152
line 153
line 154
line 155
line 156
line 157
line 158
line 159
line 160
line 161
line 162
line 163
line 164
line 165
line 166
line 167
line 168
line 169
line 170
line 171
line 172
line 173
line 174
line 175
line 176
line 177
line 178
line 179
line 180
line 181
line 182
line 183
line 184
line 185
line 186
line 187
line 188
line 189
line 190
line 191
line 192
line 193
line 194
line 195
line 196
line 197
line 198
line 199
line 200

View File

@@ -0,0 +1,7 @@
{
"name": "demo",
"version": "0.1.0",
"dependencies": {
"react": "^18.0.0"
}
}

View File

@@ -0,0 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>
<groupId>com.demo</groupId>
<artifactId>demo</artifactId>
<version>0.1.0</version>
</project>

View File

@@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -euo pipefail

# First paragraph: env setup
export KEBAB_HOME="${KEBAB_HOME:-$HOME/.local/share/kebab}"
mkdir -p "$KEBAB_HOME"
cd "$KEBAB_HOME"

# Second paragraph: ingest
echo "ingesting workspace..."
kebab ingest --config /etc/kebab/config.toml

# Third paragraph: report
echo "done"
kebab schema --json | jq '.stats'
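The comments in the shell fixture above label three paragraphs. A minimal sketch of paragraph splitting follows, assuming paragraphs are delimited by blank lines and reported as 1-based inclusive line ranges. `paragraph_ranges` is a hypothetical helper written for illustration; the actual `code-text-paragraph-v1` chunker is not shown in this diff and may behave differently (for example, by merging small paragraphs or applying a line-window cap).

```rust
/// Split `text` into blank-line-separated paragraphs and return each
/// paragraph's 1-based inclusive line range. Illustrative only.
fn paragraph_ranges(text: &str) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut start: Option<u32> = None; // first line of the open paragraph
    let mut last = 0u32; // number of the most recent line seen
    for (idx, line) in text.lines().enumerate() {
        let n = idx as u32 + 1;
        if line.trim().is_empty() {
            // A blank line closes the paragraph in progress, if any.
            if let Some(s) = start.take() {
                ranges.push((s, n - 1));
            }
        } else if start.is_none() {
            start = Some(n);
        }
        last = n;
    }
    // Close a trailing paragraph that runs to the end of the text.
    if let Some(s) = start {
        ranges.push((s, last));
    }
    ranges
}
```

Runs of consecutive blank lines produce no empty paragraphs, and a text without blank lines comes back as a single range covering every line.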

View File

@@ -0,0 +1,277 @@
//! Behavioural tests for `K8sManifestResourceV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing the raw YAML text into a single `Block::Code`, mirroring the
//! pattern used in `code_rust_ast_snapshot.rs`.
use std::path::PathBuf;
use kebab_chunk::K8sManifestResourceV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `yaml_text`.
fn yaml_doc(yaml_text: &str) -> CanonicalDocument {
let wp = WorkspacePath("manifests/deploy.yaml".into());
let aid = AssetId("c".repeat(64));
let pv = ParserVersion("code-yaml-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = yaml_text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some("yaml".into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("yaml".into()),
code: yaml_text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "deploy.yaml".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("yaml".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("k8s-manifest-resource-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// Three YAML documents: 2 valid k8s resources + 1 non-k8s (no apiVersion).
/// The chunker must emit exactly 2 chunks with the correct symbols and lang.
#[test]
fn k8s_multi_doc_emits_one_chunk_per_resource() {
let fixture_path = fixtures_dir().join("sample_k8s.yaml");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = yaml_doc(&text);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
2,
"expected 2 k8s chunks, got {}: {chunks:#?}",
chunks.len()
);
let symbols: Vec<&str> = chunks
.iter()
.map(|c| {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
symbol.as_deref().expect("symbol must be Some for k8s chunks")
}
other => panic!("expected Code span, got {other:?}"),
}
})
.collect();
assert_eq!(
symbols,
vec!["Deployment/prod/api-server", "Service/prod/api-server"],
"symbols mismatch: {symbols:?}"
);
// Verify lang = "yaml" on every chunk.
for chunk in &chunks {
match &chunk.source_spans[0] {
SourceSpan::Code { lang, .. } => {
assert_eq!(lang.as_deref(), Some("yaml"), "lang must be 'yaml'");
}
other => panic!("expected Code span, got {other:?}"),
}
}
// Verify chunker_version label.
for chunk in &chunks {
assert_eq!(chunk.chunker_version.0, "k8s-manifest-resource-v1");
}
}
/// A YAML document with a structural error (an unclosed flow sequence)
/// must cause the chunker to return 0 chunks for the entire file.
#[test]
fn k8s_invalid_yaml_emits_zero_chunks() {
// serde_yaml 0.9 is lenient about duplicate keys (last wins), so use a
// genuine YAML structural error (unclosed flow sequence) to force a parse
// failure.
let actually_bad = "apiVersion: v1\nkind: Service\nfoo: [\nbar\n";
let doc = yaml_doc(actually_bad);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk should not error — return Ok(vec![]) for invalid yaml");
assert_eq!(
chunks.len(),
0,
"invalid YAML must yield 0 chunks, got {}: {chunks:#?}",
chunks.len()
);
}
/// A cluster-scoped resource (no `metadata.namespace`) must produce a symbol
/// of the form `<Kind>/<name>` (two components, no namespace segment).
#[test]
fn k8s_cluster_scoped_resource_symbol() {
let yaml = "\
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-admin
rules:
  - apiGroups: [\"*\"]
    resources: [\"*\"]
    verbs: [\"*\"]
";
let doc = yaml_doc(yaml);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk for cluster-scoped resource, got {}: {chunks:#?}",
chunks.len()
);
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(
symbol.as_deref(),
Some("ClusterRole/cluster-admin"),
"cluster-scoped symbol must be <Kind>/<name>"
);
assert_eq!(lang.as_deref(), Some("yaml"));
}
other => panic!("expected Code span, got {other:?}"),
}
}
/// 200+ line resource exercises `tier2_shared::push_chunks_with_oversize`'s
/// line-window split branch. All chunks must share the same symbol
/// (`<Kind>/<ns>/<name>`); their line ranges must form a contiguous
/// partition; chunk_ids must all differ (the `#L{k}` suffix on `id_for_chunk`
/// ensures uniqueness across windows). The risks section of spec p10-2
/// explicitly flags a "giant ConfigMap"; this test covers that path.
#[test]
fn k8s_oversize_splits_into_line_windows_sharing_symbol() {
// ConfigMap with 250 data keys → ~256 total lines, > AST_CHUNK_MAX_LINES (200).
let mut yaml = String::from(
"apiVersion: v1\nkind: ConfigMap\nmetadata:\n name: big\n namespace: prod\ndata:\n",
);
for i in 0..250 {
yaml.push_str(&format!(" key{i}: value{i}\n"));
}
let doc = yaml_doc(&yaml);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for oversize resource, got {}",
chunks.len()
);
// Every chunk must share the same symbol + lang.
let expected_symbol = "ConfigMap/prod/big";
for (i, c) in chunks.iter().enumerate() {
match &c.source_spans[0] {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(
symbol.as_deref(),
Some(expected_symbol),
"chunk[{i}] symbol must equal `{expected_symbol}`"
);
assert_eq!(lang.as_deref(), Some("yaml"));
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// chunk_ids must all be distinct (oversize fallback's #L{k} suffix).
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"oversize chunks must have distinct chunk_ids (the #L{{k}} suffix should disambiguate)"
);
// Line ranges must form a contiguous partition: chunk[i].line_end + 1 == chunk[i+1].line_start.
let ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code { line_start, line_end, .. } => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
for w in ranges.windows(2) {
let (_, prev_end) = w[0];
let (next_start, _) = w[1];
assert_eq!(
prev_end + 1,
next_start,
"line ranges must be contiguous: {} → {} (got gap or overlap)",
prev_end,
next_start
);
}
}
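The tests above assert two symbol shapes: `<Kind>/<ns>/<name>` for namespaced resources and `<Kind>/<name>` for cluster-scoped ones. A minimal sketch of that formatting rule follows. `resource_symbol` is a hypothetical function written for illustration and is not the chunker's actual implementation.

```rust
/// Build a resource symbol: `<Kind>/<ns>/<name>` when a namespace is present,
/// `<Kind>/<name>` for cluster-scoped resources. Illustrative only.
fn resource_symbol(kind: &str, namespace: Option<&str>, name: &str) -> String {
    match namespace {
        Some(ns) => format!("{kind}/{ns}/{name}"),
        None => format!("{kind}/{name}"),
    }
}
```

With the fixture values, `resource_symbol("Deployment", Some("prod"), "api-server")` yields `Deployment/prod/api-server`, and `resource_symbol("ClusterRole", None, "cluster-admin")` yields `ClusterRole/cluster-admin`.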

View File

@@ -0,0 +1,267 @@
//! Behavioural tests for `ManifestFileV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing the raw manifest text into a single `Block::Code`, mirroring the
//! pattern used in `dockerfile_file_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::ManifestFileV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing manifest text.
fn manifest_doc(lang: &str, manifest_text: &str) -> CanonicalDocument {
let wp = WorkspacePath(format!("build/{}", manifest_filename(lang)));
let aid = AssetId("m".repeat(64));
let pv = ParserVersion("code-manifest-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = manifest_text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some(lang.into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some(lang.into()),
code: manifest_text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: format!("Manifest ({})", lang),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some(lang.into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn manifest_filename(lang: &str) -> &'static str {
match lang {
"toml" => "Cargo.toml",
"json" => "package.json",
"xml" => "pom.xml",
"go-mod" => "go.mod",
_ => "manifest",
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("manifest-file-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// A Cargo.toml fixture must emit exactly 1 chunk with the correct symbol,
/// lang, and `line_start` (`line_end` is not asserted).
#[test]
fn cargo_toml_single_chunk_with_toml_lang() {
let fixture_path = fixtures_dir().join("sample_cargo.toml");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("toml", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end: _,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"symbol must be '<manifest>'"
);
assert_eq!(lang.as_deref(), Some("toml"), "lang must be 'toml'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
}
/// A package.json fixture must emit exactly 1 chunk with the correct symbol,
/// lang, and `line_start` (`line_end` is not asserted).
#[test]
fn package_json_single_chunk_with_json_lang() {
let fixture_path = fixtures_dir().join("sample_package.json");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("json", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end: _,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"symbol must be '<manifest>'"
);
assert_eq!(lang.as_deref(), Some("json"), "lang must be 'json'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
}
/// A pom.xml fixture must emit exactly 1 chunk with the correct symbol,
/// lang, and `line_start` (`line_end` is not asserted).
#[test]
fn pom_xml_single_chunk_with_xml_lang() {
let fixture_path = fixtures_dir().join("sample_pom.xml");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("xml", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end: _,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"symbol must be '<manifest>'"
);
assert_eq!(lang.as_deref(), Some("xml"), "lang must be 'xml'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
}
/// A go.mod fixture must emit exactly 1 chunk with the correct symbol,
/// lang, and `line_start` (`line_end` is not asserted).
#[test]
fn go_mod_single_chunk_with_go_mod_lang() {
let fixture_path = fixtures_dir().join("sample_go.mod");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("go-mod", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end: _,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"symbol must be '<manifest>'"
);
assert_eq!(lang.as_deref(), Some("go-mod"), "lang must be 'go-mod'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
}


@@ -8,7 +8,7 @@ use serde_json::Value;
use crate::asset::{RawAsset, WorkspacePath};
use crate::chunk::Chunk;
use crate::document::{Block, CanonicalDocument};
use crate::ids::{ChunkId, DocumentId};
use crate::ids::{AssetId, ChunkId, DocumentId};
use crate::jobs::{JobFilter, JobId, JobKind, JobRow, JobStatus};
use crate::media::MediaType;
use crate::search::{DocFilter, DocSummary, SearchFilters, SearchHit, SearchQuery};
@@ -161,10 +161,23 @@ pub trait DocumentStore {
fn get_document(&self, id: &DocumentId) -> anyhow::Result<Option<CanonicalDocument>>;
fn get_chunk(&self, id: &ChunkId) -> anyhow::Result<Option<Chunk>>;
fn list_documents(&self, filter: &DocFilter) -> anyhow::Result<Vec<DocSummary>>;
/// Look up an asset row by its `asset_id` (PRIMARY KEY = blake3
/// content hash). Twin-file safe: asset_id is PK so there is
/// exactly one row per unique content hash, regardless of how many
/// `documents` rows share it. Use this instead of
/// `get_asset_by_workspace_path` when you already have a
/// `CanonicalDocument` (which carries `source_asset_id`).
fn get_asset(&self, id: &AssetId) -> anyhow::Result<Option<RawAsset>>;
/// p9-fb-23: look up an asset row by its workspace path. Used by
/// the incremental-ingest skip path to compare the freshly
/// computed blake3 checksum against what's already in SQLite. The
/// schema enforces a unique workspace_path per asset.
///
/// NOTE: for twin files (identical content at different paths),
/// `assets.workspace_path` is "last-registered path" — it
/// flip-flops on every ingest. Prefer `get_asset` (by asset_id)
/// when you have a `CanonicalDocument.source_asset_id`.
fn get_asset_by_workspace_path(
&self,
path: &WorkspacePath,


@@ -19,6 +19,9 @@ tree-sitter-rust = { workspace = true }
tree-sitter-python = { workspace = true }
tree-sitter-typescript = { workspace = true }
tree-sitter-javascript = { workspace = true }
tree-sitter-go = { workspace = true }
tree-sitter-java = { workspace = true }
tree-sitter-kotlin-ng = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }


@@ -0,0 +1,451 @@
//! `kebab-parse-code::go` — tree-sitter Go AST extractor (P10-1C-Go Task D).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("go")`].
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
//! top-level AST semantic unit (free fn, method, each type spec) carrying
//! [`SourceSpan::Code`] with the unit's self-reference symbol path
//! (design §3.4 Go row). Glue declarations (`import` / `const` / `var`)
//! collapse into one grouped `<top-level>` (or `<module>`) unit.
//!
//! Unlike the Python/TS/JS extractors which path-derive their module
//! prefix from the workspace file path, Go's package identity comes from
//! the source itself (the leading `package` clause) — `extract_package`
//! reads it from the AST. If the `package_clause` is missing (invalid Go
//! in practice) the prefix falls back to `"<unknown>"`.
//!
//! Doc comments immediately preceding an item are folded into that
//! item's line range via `unit_start` (1B pattern). Go has no separate
//! attribute/decorator AST nodes.
//!
//! Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
pub const PARSER_VERSION: &str = "code-go-v1";
/// Go AST extractor. Per-unit blocks via tree-sitter-go 0.25
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct GoAstExtractor;
impl GoAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for GoAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for GoAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "go")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for GoAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec())
.map_err(|e| anyhow::anyhow!("kebab-parse-code: Go source is not valid UTF-8: {e}"))?;
let blocks = build_blocks(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
// Resolve the file's absolute path for repo detection. If the
// source URI carries a relative path, anchor it at the workspace
// root so the `.git/` walk-up starts from the right place.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("go".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted Go doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
/// p10-1C-Go: extract `package` declaration text from a tree-sitter-go
/// `source_file`. Returns `None` if no `package_clause` (invalid Go in
/// practice but defense-in-depth). Per design §3.4 Go row.
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
if child.kind() == "package_clause" {
let mut c2 = child.walk();
for sub in child.named_children(&mut c2) {
if sub.kind() == "package_identifier" {
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
}
}
}
}
None
}
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_go::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-go language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Go source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue groups are pushed with a sentinel symbol + is_real=false so a
// post-pass can decide `<module>` vs `<top-level>` (1B post-pass
// mirror).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
// (is_import 0/1, s, e). `is_import` flags `import_declaration` —
// used by the glue flush to pick `<module>` vs `<top-level>`
// provisional label.
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
n.child_by_field_name("name")
.map(|c| &src[c.start_byte()..c.end_byte()])
}
/// Walk preceding `comment` siblings to extend the unit's line range
/// upward, folding leading doc / line comments into the unit. Go has
/// no decorator/attribute nodes — doc comments are simply preceding
/// `comment` siblings (the 1B pattern).
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
if p.kind() == "comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
/// Extract the receiver type text for a `method_declaration`. The
/// returned slice INCLUDES the leading `*` for pointer receivers
/// (`(*Foo).Bar`) per design §3.4 Go row example. Returns `None` if
/// the receiver is malformed (defense in depth).
fn receiver_type_text<'a>(method_node: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
let recv = method_node.child_by_field_name("receiver")?;
let mut cw = recv.walk();
for p in recv.named_children(&mut cw) {
if p.kind() == "parameter_declaration" {
if let Some(ty) = p.child_by_field_name("type") {
return Some(&src[ty.start_byte()..ty.end_byte()]);
}
}
}
None
}
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"function_declaration" => {
if let Some(name) = node_name_text(&child, source) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(&mut glue, &mut units, &mod_prefix);
let sym = join_symbol(&mod_prefix, &[], name);
units.push((sym, s, e, true));
}
}
"method_declaration" => {
if let Some(name_node) = child.child_by_field_name("name") {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(&mut glue, &mut units, &mod_prefix);
let owner = receiver_type_text(&child, source).unwrap_or("<unknown>");
let method_name = &source[name_node.start_byte()..name_node.end_byte()];
let sym = format!("{mod_prefix}.({owner}).{method_name}");
units.push((sym, s, e, true));
}
}
"type_declaration" => {
// One unit per inner `type_spec`. When the declaration holds
// a single spec, that spec takes the declaration's
// upward-folded start `s`, so leading doc comments stay
// attached to it (1B pattern). In a grouped `type ( ... )`
// declaration every spec, including the first, uses its
// own start line instead.
let mut tcur = child.walk();
let specs: Vec<tree_sitter::Node> = child
.named_children(&mut tcur)
.filter(|c| c.kind() == "type_spec")
.collect();
let single = specs.len() == 1;
for spec in specs {
let name_node = match spec.child_by_field_name("name") {
Some(n) => n,
None => continue,
};
let spec_s = if single {
s
} else {
spec.start_position().row as u32 + 1
};
let spec_e = spec.end_position().row as u32 + 1;
glue.retain(|(_, gs, _)| *gs < spec_s);
flush_glue(&mut glue, &mut units, &mod_prefix);
let name = &source[name_node.start_byte()..name_node.end_byte()];
let sym = join_symbol(&mod_prefix, &[], name);
units.push((sym, spec_s, spec_e, true));
}
}
"import_declaration" => {
glue.push((1, s, e));
}
"const_declaration" | "var_declaration" => {
glue.push((0, s, e));
}
_ => {}
}
}
flush_glue(&mut glue, &mut units, &mod_prefix);
// `<module>` is correct only when the file produced no real unit.
// Otherwise the import/const/var-only group becomes `<top-level>`
// (same post-pass as 1B). Match on the suffix so the demotion stays
// mod-prefix-agnostic.
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if has_real_unit {
for (sym, _, _, is_real) in units.iter_mut() {
if !*is_real && sym.ends_with("<module>") {
let pre = &sym[..sym.len() - "<module>".len()];
*sym = format!("{pre}<top-level>");
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("go".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("go".to_string()),
code,
}));
}
Ok(blocks)
}
fn flush_glue(
glue: &mut Vec<(usize, u32, u32)>,
units: &mut Vec<(String, u32, u32, bool)>,
mod_prefix: &str,
) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
// Provisional label: `<module>` only if the group is exclusively
// imports (1A's `only_mod_decls` analog). The post-pass demotes any
// `<module>` to `<top-level>` if the file produced any real unit.
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
let label = if only_imports { "<module>" } else { "<top-level>" };
units.push((join_symbol(mod_prefix, &[], label), s, e, false));
glue.clear();
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(concat!(
env!("CARGO_MANIFEST_DIR"),
"/tests/fixtures/sample.go"
))
.unwrap();
// Reuse the cross-language test-support helper promoted in 1B.
let asset = crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.go", "go");
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
GoAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_go() {
let e = GoAstExtractor::new();
assert!(e.supports(&MediaType::Code("go".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn go_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("go"));
symbol.clone()
}
_ => None,
},
_ => None,
})
.collect();
syms.sort();
assert!(syms.iter().any(|s| s == "chunk.Free"), "got {syms:?}");
assert!(syms.iter().any(|s| s == "chunk.init"), "got {syms:?}");
assert!(
syms.iter().any(|s| s == "chunk.MdHeadingV1Chunker"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "chunk.(*MdHeadingV1Chunker).ChunkDoc"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "chunk.(MdHeadingV1Chunker).Name2"),
"got {syms:?}"
);
assert!(syms.iter().any(|s| s == "chunk.Stringer"), "got {syms:?}");
// import + const grouped into one glue unit (no isolated `<module>`).
assert!(
syms.iter().any(|s| s == "chunk.<top-level>"),
"got {syms:?}"
);
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 {
assert_eq!(extract_fixture().blocks, a.blocks);
}
}
}
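
As a rough illustration of the glue handling in the Go extractor above, the sketch below reproduces the flush-then-demote logic without tree-sitter. It is a simplified stand-in, not the extractor itself: `join`, `flush`, `demote`, and the `Unit` alias are hypothetical names, `join` only assumes the dot-joined form that the tests assert (for example `chunk.Free`), and the import flag is a `bool` here rather than the `usize` used in the diff.

```rust
/// One extracted unit: (symbol, line_start, line_end, is_real_semantic_unit).
type Unit = (String, u32, u32, bool);

/// Hypothetical stand-in for `join_symbol`: dot-joins prefix, path, and leaf.
/// The real helper lives in `crate::scaffold` and may treat edge cases differently.
fn join(prefix: &str, path: &[&str], leaf: &str) -> String {
    let mut parts = vec![prefix];
    parts.extend_from_slice(path);
    parts.push(leaf);
    parts.join(".")
}

/// Collapse all pending glue ranges (is_import, start, end) into one
/// provisional unit spanning the smallest start to the largest end.
fn flush(glue: &mut Vec<(bool, u32, u32)>, units: &mut Vec<Unit>, prefix: &str) {
    if glue.is_empty() {
        return;
    }
    let s = glue.iter().map(|g| g.1).min().unwrap();
    let e = glue.iter().map(|g| g.2).max().unwrap();
    // Provisional label: `<module>` only when every glue entry is an import.
    let label = if glue.iter().all(|g| g.0) {
        "<module>"
    } else {
        "<top-level>"
    };
    units.push((join(prefix, &[], label), s, e, false));
    glue.clear();
}

/// Post-pass: once the file has at least one real unit, every glue unit
/// labelled `<module>` is renamed to `<top-level>`.
fn demote(units: &mut [Unit]) {
    if !units.iter().any(|u| u.3) {
        return; // glue-only file: `<module>` stays as-is
    }
    for u in units.iter_mut() {
        if !u.3 {
            if let Some(pre) = u.0.strip_suffix("<module>") {
                u.0 = format!("{pre}<top-level>");
            }
        }
    }
}
```

Under these assumptions, an import block flushed ahead of a function first yields `chunk.<module>` and is renamed to `chunk.<top-level>` by the post-pass, which matches the `chunk.<top-level>` assertion in `go_units_match_design_3_4_symbols`.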


@@ -0,0 +1,543 @@
//! `kebab-parse-code::java` — tree-sitter Java AST extractor (P10-1C-JK Task D).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("java")`].
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
//! AST semantic unit (class / interface / enum / record /
//! annotation-type at any nesting level, plus methods + constructors
//! inside class / interface / record bodies), each carrying
//! [`SourceSpan::Code`] with the unit's dotted self-reference symbol
//! path (design §3.4 Java row). Glue declarations (`import`) collapse
//! into one grouped `<top-level>` (or `<module>`) unit.
//!
//! Like the Go extractor, Java's package identity comes from the
//! source itself (the `package_declaration` clause), not from the
//! workspace file path — `extract_package` reads it from the AST. If
//! the clause is missing the prefix falls back to `"<unknown>"`.
//!
//! Class/interface/record bodies are recursed (1B Python pattern):
//! the type name is pushed onto `mod_path` so methods and nested
//! types become `<pkg>.<Outer>.<Inner>.<method>`. Constructors use
//! the Java convention `<pkg>.<...>.<Class>.<ClassName>` (name
//! duplicated, per design §3.4). Enum bodies are not recursed in
//! the first cut, so enum constants are not emitted as units.
//!
//! Javadoc (`/** ... */` → `block_comment`) and line comments
//! immediately preceding an item are folded into that item's line
//! range via `unit_start` (1B pattern). Annotations are children of
//! the declaration node itself (inside `modifiers`), so they are
//! already part of the declaration's span — no separate unwrap arm.
//!
//! Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
pub const PARSER_VERSION: &str = "code-java-v1";
/// Java AST extractor. Per-unit blocks via tree-sitter-java 0.23
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct JavaAstExtractor;
impl JavaAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for JavaAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for JavaAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "java")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for JavaAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec())
.map_err(|e| anyhow::anyhow!("kebab-parse-code: Java source is not valid UTF-8: {e}"))?;
let blocks = build_blocks(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
// Resolve the file's absolute path for repo detection. If the
// source URI carries a relative path, anchor it at the workspace
// root so the `.git/` walk-up starts from the right place.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("java".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted Java doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
/// p10-1C-JK: extract `package` declaration text from a tree-sitter-java
/// `program`. Returns `None` if no `package_declaration` (default-package
/// Java file). The package_declaration's named children are either a
/// single `identifier` (single-segment package, rare) or a
/// `scoped_identifier` (dotted, common). Per design §3.4 Java row.
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
if child.kind() == "package_declaration" {
let mut c2 = child.walk();
for sub in child.named_children(&mut c2) {
if sub.kind() == "scoped_identifier" || sub.kind() == "identifier" {
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
}
}
}
}
None
}
/// Walk preceding `line_comment` / `block_comment` siblings to extend
/// the unit's line range upward, folding leading Javadoc / line
/// comments into the unit. Annotations live INSIDE `modifiers` on the
/// declaration node itself, so their lines are already inside
/// `n.start_position()` — no separate unwrap arm is needed for them.
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
let k = p.kind();
if k == "line_comment" || k == "block_comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
n.child_by_field_name("name")
.map(|c| &src[c.start_byte()..c.end_byte()])
}
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_java::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-java language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Java source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue groups are pushed with a sentinel symbol + is_real=false so a
// post-pass can decide `<module>` vs `<top-level>` (1B/1C-Go pattern).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
// (is_import 0/1, s, e). `is_import` flags `import_declaration` —
// used by the glue flush to pick `<module>` vs `<top-level>`
// provisional label.
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
walk_top(root, source, &mod_prefix, &mut units, &mut glue);
// `<module>` is correct only when the file produced no real unit.
// Otherwise the import-only group becomes `<top-level>` (same
// post-pass as 1B / 1C-Go).
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if has_real_unit {
for (sym, _, _, is_real) in units.iter_mut() {
if !*is_real && sym.ends_with("<module>") {
let pre = &sym[..sym.len() - "<module>".len()];
*sym = format!("{pre}<top-level>");
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("java".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("java".to_string()),
code,
}));
}
Ok(blocks)
}
/// Walk the file's top-level children — `program` named children:
/// `package_declaration` (handled by `extract_package`), `import_declaration`
/// (glue), and the five type declarations (`class` / `interface` /
/// `enum` / `record` / `annotation_type`). Type-declaration bodies
/// are recursed via [`walk_body`] with the type name pushed onto
/// `mod_path` (1B Python pattern). Enum bodies are NOT recursed
/// (first cut; see module-level doc).
fn walk_top(
node: tree_sitter::Node,
src: &str,
mod_prefix: &str,
units: &mut Vec<(String, u32, u32, bool)>,
glue: &mut Vec<(usize, u32, u32)>,
) {
let mod_path: &[String] = &[];
let mut cur = node.walk();
for child in node.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"class_declaration"
| "interface_declaration"
| "record_declaration" => {
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if let Some(body) = child.child_by_field_name("body") {
let np: Vec<String> = vec![name.to_string()];
walk_body(body, src, mod_prefix, &np, units);
}
}
}
"enum_declaration" => {
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
// Enum body NOT recursed in the first cut: enum constants are
// not emitted as units, and method declarations inside
// enum bodies (rare) live under `enum_body_declarations`,
// not `class_body`. Skipped per design §3.4 first-cut scope.
}
}
"annotation_type_declaration" => {
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
}
}
"import_declaration" => {
glue.push((1, s, e));
}
// package_declaration is handled by `extract_package`; no
// glue entry — it's structural metadata, not a unit.
_ => {}
}
}
flush_glue(glue, units, mod_prefix, mod_path);
}
/// Walk a `class_body` / `interface_body` (or record's `class_body`).
/// Emits one unit per method / constructor, and recurses into nested
/// type declarations. Field declarations are NOT emitted (would
/// explode unit count). `compact_constructor_declaration` (records)
/// is handled the same as `constructor_declaration`.
///
/// No `glue` parameter: Java does not have imports inside type
/// bodies — they only appear at file top level, handled by
/// [`walk_top`].
fn walk_body(
body: tree_sitter::Node,
src: &str,
mod_prefix: &str,
mod_path: &[String],
units: &mut Vec<(String, u32, u32, bool)>,
) {
let mut cur = body.walk();
for child in body.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"method_declaration"
| "constructor_declaration"
| "compact_constructor_declaration" => {
// Constructor: name field equals the class name. Per
// design §3.4 Java convention, symbol is
// `<pkg>.<mod_path>.<ClassName>` with the constructor
// name (== class name) as the trailing segment. This
// means the symbol duplicates the class name (e.g.
// `com.x.Foo.Foo`), which is the documented convention.
if let Some(name) = node_name_text(&child, src) {
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
}
}
"class_declaration"
| "interface_declaration"
| "record_declaration"
| "enum_declaration"
| "annotation_type_declaration" => {
// Nested type: emit a unit, then recurse into its body
// (skipped for enum + annotation_type per first-cut scope).
let name = match node_name_text(&child, src) {
Some(n) => n,
None => continue,
};
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if child.kind() != "enum_declaration"
&& child.kind() != "annotation_type_declaration"
{
if let Some(inner_body) = child.child_by_field_name("body") {
let mut np = mod_path.to_vec();
np.push(name.to_string());
walk_body(inner_body, src, mod_prefix, &np, units);
}
}
}
// field_declaration, static_initializer, block: NOT emitted.
_ => {}
}
}
}
fn flush_glue(
glue: &mut Vec<(usize, u32, u32)>,
units: &mut Vec<(String, u32, u32, bool)>,
mod_prefix: &str,
mod_path: &[String],
) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
// Provisional label: `<module>` only if the group is exclusively
// imports (1A's `only_mod_decls` analog). The post-pass demotes any
// `<module>` to `<top-level>` if the file produced any real unit.
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
let label = if only_imports { "<module>" } else { "<top-level>" };
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
glue.clear();
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(concat!(
env!("CARGO_MANIFEST_DIR"),
"/tests/fixtures/sample.java"
))
.unwrap();
let asset =
crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.java", "java");
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
JavaAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_java() {
let e = JavaAstExtractor::new();
assert!(e.supports(&MediaType::Code("java".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn java_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("java"));
symbol.clone()
}
_ => None,
},
_ => None,
})
.collect();
syms.sort();
// package extracted from source = com.kebab.chunk
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"),
"got {syms:?}"
);
// constructor — Java convention is class-name-as-method-name
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.MdHeadingV1Chunker"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"),
"got {syms:?}"
);
// static nested class
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.withName"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.build"),
"got {syms:?}"
);
// package-private interface + enum
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Stringer"),
"got {syms:?}"
);
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Mode"),
"got {syms:?}"
);
// import grouped as <top-level>
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"),
"got {syms:?}"
);
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 {
assert_eq!(extract_fixture().blocks, a.blocks);
}
}
}


@@ -0,0 +1,627 @@
//! `kebab-parse-code::kotlin` — tree-sitter Kotlin AST extractor (P10-1C-JK Task G).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("kotlin")`].
//! Mirrors the Java extractor (JVM family, source-side `package` extraction +
//! class-nesting) with Kotlin-specific adjustments:
//!
//! * Root is `source_file` (not `program`).
//! * `package_header` carries a single `qualified_identifier` child whose
//! slice text IS the dotted package path — never a bare `identifier`
//! sub-form for the package (the grammar always wraps a single segment
//! in `qualified_identifier` too).
//! * `class_declaration` covers `class`, `data class`, `sealed class`,
//! `enum class`, AND `interface` — Kotlin uses ONE node kind with a
//! `modifiers` child rather than separate `interface_declaration` /
//! `enum_declaration` nodes (verified via tree-sitter-kotlin-ng
//! `node-types.json`).
//! * The body child of `class_declaration` is either `class_body` (normal
//! classes / interfaces) OR `enum_class_body` (enum class). Neither
//! carries a `body` field name, so it is matched by kind, not by
//! `child_by_field_name("body")`.
//! * `companion_object` is a SEPARATE node kind (not `object_declaration`
//! with a modifier). Its `name` field is OPTIONAL — when omitted (the
//! common case `companion object { ... }`) the symbol uses the
//! implicit Kotlin convention name `Companion`.
//! * `object_declaration` (named singleton) carries a `name` field and a
//! `class_body` child.
//! * `function_declaration` may appear at top level (Kotlin top-level
//! function) AND inside `class_body` — same node kind, the
//! `mod_path` state distinguishes the two emit forms.
//!
//! Enum bodies (`enum_class_body`) are NOT recursed in the first cut:
//! `enum_entry` declarations are not emitted as units, matching the
//! Java extractor's enum policy (design §3.4 first-phase scope).
//!
//! Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
pub const PARSER_VERSION: &str = "code-kotlin-v1";
/// Kotlin AST extractor. Per-unit blocks via tree-sitter-kotlin-ng 1.1
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct KotlinAstExtractor;
impl KotlinAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for KotlinAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for KotlinAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "kotlin")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for KotlinAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
// Borrow the input as UTF-8 instead of copying it into a new `String`.
let source = std::str::from_utf8(bytes).map_err(|e| {
anyhow::anyhow!("kebab-parse-code: Kotlin source is not valid UTF-8: {e}")
})?;
let blocks = build_blocks(source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
// Resolve the file's absolute path for repo detection. If the
// source URI carries a relative path, anchor it at the workspace
// root so the `.git/` walk-up starts from the right place.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("kotlin".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted Kotlin doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
/// p10-1C-JK: extract the `package` declaration text from a
/// tree-sitter-kotlin-ng `source_file`. Returns `None` if there is no
/// `package_header` (default-package Kotlin file). The package_header's
/// single named child is a `qualified_identifier`; its slice text is the
/// dotted path. A bare `identifier` child is also accepted as a defensive
/// fallback. Per design §3.4 Kotlin row.
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
if child.kind() == "package_header" {
let mut c2 = child.walk();
for sub in child.named_children(&mut c2) {
let k = sub.kind();
if k == "qualified_identifier" || k == "identifier" {
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
}
}
}
}
None
}
/// Walk preceding `line_comment` / `block_comment` siblings to extend
/// the unit's line range upward, folding leading KDoc / line comments
/// into the unit. Modifiers / annotations live INSIDE the declaration
/// node itself, so their lines are already inside `n.start_position()`.
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
let k = p.kind();
if k == "line_comment" || k == "block_comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
n.child_by_field_name("name")
.map(|c| &src[c.start_byte()..c.end_byte()])
}
/// Find the first child of a node with one of the given kinds. Used to
/// locate `class_body` / `enum_class_body` on `class_declaration` since
/// the kotlin grammar attaches them without a `body` field name.
fn first_child_of_kinds<'a>(
n: &tree_sitter::Node<'a>,
kinds: &[&str],
) -> Option<tree_sitter::Node<'a>> {
let mut cur = n.walk();
n.named_children(&mut cur)
.find(|child| kinds.contains(&child.kind()))
}
/// `true` iff a `class_declaration` carries the `enum` class modifier.
/// Detected by walking `modifiers` → `class_modifier` and checking the
/// child text. The grammar exposes "enum" / "sealed" / "data" /
/// "annotation" / "inner" as named `class_modifier` children of
/// `modifiers`. We only need to know about "enum" to decide whether to
/// look for `class_body` or `enum_class_body` and whether to skip body
/// recursion.
fn class_decl_is_enum(n: &tree_sitter::Node, src: &str) -> bool {
let mut cur = n.walk();
for child in n.named_children(&mut cur) {
if child.kind() == "modifiers" {
let mut c2 = child.walk();
for sub in child.named_children(&mut c2) {
if sub.kind() == "class_modifier" {
let text = &src[sub.start_byte()..sub.end_byte()];
if text == "enum" {
return true;
}
}
}
}
}
false
}
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_kotlin_ng::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-kotlin-ng language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Kotlin source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue groups are pushed with a sentinel symbol + is_real=false so a
// post-pass can decide `<module>` vs `<top-level>` (JVM family pattern).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
// (is_import 0/1, s, e). `is_import` flags `import` — used by the
// glue flush to pick `<module>` vs `<top-level>` provisional label.
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
walk_top(root, source, &mod_prefix, &mut units, &mut glue);
// `<module>` is correct only when the file produced no real unit.
// Otherwise the import-only group becomes `<top-level>` (same
// post-pass as 1B / 1C-Go / Java).
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if has_real_unit {
for (sym, _, _, is_real) in units.iter_mut() {
if !*is_real && sym.ends_with("<module>") {
let pre = &sym[..sym.len() - "<module>".len()];
*sym = format!("{pre}<top-level>");
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("kotlin".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("kotlin".to_string()),
code,
}));
}
Ok(blocks)
}
/// Walk the file's top-level children (the named children of `source_file`):
/// `package_header` (handled by `extract_package`), `import` (glue),
/// `class_declaration` (class / interface / enum class), `object_declaration`,
/// and top-level `function_declaration`. Top-level `property_declaration`
/// and `type_alias` nodes currently fall through the `_` arm and are
/// skipped. Class / object bodies are recursed via [`walk_body`] with the
/// type name pushed onto `mod_path` (JVM family pattern). Enum bodies are
/// NOT recursed (first cut).
fn walk_top(
node: tree_sitter::Node,
src: &str,
mod_prefix: &str,
units: &mut Vec<(String, u32, u32, bool)>,
glue: &mut Vec<(usize, u32, u32)>,
) {
let mod_path: &[String] = &[];
let mut cur = node.walk();
for child in node.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"class_declaration" => {
// Covers class / data class / sealed class / interface /
// enum class — single grammar node, the modifiers child
// distinguishes them. The body is `class_body` for
// non-enum and `enum_class_body` for enum class; both
// attach without a `body` field name.
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
let is_enum = class_decl_is_enum(&child, src);
if !is_enum {
if let Some(body) = first_child_of_kinds(&child, &["class_body"]) {
let np: Vec<String> = vec![name.to_string()];
walk_body(body, src, mod_prefix, &np, units);
}
}
// enum_class_body NOT recursed: enum constants are
// not emitted as units (first-cut scope, matches Java).
}
}
"object_declaration" => {
// Singleton object — name field is required by the grammar.
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if let Some(body) = first_child_of_kinds(&child, &["class_body"]) {
let np: Vec<String> = vec![name.to_string()];
walk_body(body, src, mod_prefix, &np, units);
}
}
}
"function_declaration" => {
// Top-level Kotlin function (unlike Java).
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
}
}
"import" => {
glue.push((1, s, e));
}
// `property_declaration` (top-level val/var) and `type_alias`
// are not emitted as standalone units in the first cut. Note
// that this arm does not push them onto `glue` either, so
// their lines are not covered by any unit. `package_header`
// is handled by `extract_package` (structural metadata, not
// a unit).
_ => {}
}
}
flush_glue(glue, units, mod_prefix, mod_path);
}
/// Walk a `class_body` (or object's `class_body`). Emits one unit per
/// method / secondary constructor and recurses into nested type
/// declarations + companion objects. Property declarations are NOT
/// emitted (would explode unit count, parallel to Java field policy).
///
/// `companion_object` carries an optional `name` field — when omitted
/// (the common case `companion object { ... }`) the implicit Kotlin
/// convention name `Companion` is used.
///
/// No `glue` parameter: Kotlin imports are file-level only.
fn walk_body(
body: tree_sitter::Node,
src: &str,
mod_prefix: &str,
mod_path: &[String],
units: &mut Vec<(String, u32, u32, bool)>,
) {
let mut cur = body.walk();
for child in body.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"function_declaration" => {
if let Some(name) = node_name_text(&child, src) {
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
}
}
"secondary_constructor" => {
// Kotlin secondary constructor — no `name` field on the
// grammar node. Per design §3.4 (Java JVM convention) the
// symbol uses the enclosing class name as the trailing
// segment (matches the Java `<pkg>.<...>.<Class>.<Class>`
// duplication for constructors).
if let Some(class_name) = mod_path.last() {
let sym = join_symbol(mod_prefix, mod_path, class_name);
units.push((sym, s, e, true));
}
}
"companion_object" => {
// Companion's name field is OPTIONAL — fall back to the
// Kotlin implicit name `Companion`.
let name: &str = node_name_text(&child, src).unwrap_or("Companion");
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
let mut np = mod_path.to_vec();
np.push(name.to_string());
walk_body(inner_body, src, mod_prefix, &np, units);
}
}
"class_declaration" => {
let name = match node_name_text(&child, src) {
Some(n) => n,
None => continue,
};
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
let is_enum = class_decl_is_enum(&child, src);
if !is_enum {
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
let mut np = mod_path.to_vec();
np.push(name.to_string());
walk_body(inner_body, src, mod_prefix, &np, units);
}
}
}
"object_declaration" => {
let name = match node_name_text(&child, src) {
Some(n) => n,
None => continue,
};
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
let mut np = mod_path.to_vec();
np.push(name.to_string());
walk_body(inner_body, src, mod_prefix, &np, units);
}
}
// property_declaration, anonymous_initializer: NOT emitted.
_ => {}
}
}
}
fn flush_glue(
glue: &mut Vec<(usize, u32, u32)>,
units: &mut Vec<(String, u32, u32, bool)>,
mod_prefix: &str,
mod_path: &[String],
) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
// Provisional label: `<module>` only if the group is exclusively
// imports. The post-pass demotes any `<module>` to `<top-level>` if
// the file produced any real unit.
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
let label = if only_imports { "<module>" } else { "<top-level>" };
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
glue.clear();
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(concat!(
env!("CARGO_MANIFEST_DIR"),
"/tests/fixtures/sample.kt"
))
.unwrap();
let asset =
crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.kt", "kotlin");
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
KotlinAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_kotlin() {
let e = KotlinAstExtractor::new();
assert!(e.supports(&MediaType::Code("kotlin".into())));
assert!(!e.supports(&MediaType::Code("java".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn kotlin_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("kotlin"));
symbol.clone()
}
_ => None,
},
_ => None,
})
.collect();
syms.sort();
// package extracted from source = com.kebab.chunk
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"),
"got {syms:?}"
);
// Implicit companion object name = Companion (grammar leaves the
// name field unset; the extractor fills it in).
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Companion"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Companion.withName"),
"got {syms:?}"
);
// interface — also via class_declaration in the grammar
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Stringer"),
"got {syms:?}"
);
// enum class — also via class_declaration; body NOT recursed
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Mode"),
"got {syms:?}"
);
// Kotlin top-level fn — unlike Java
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.freeFunction"),
"got {syms:?}"
);
// Singleton object + its method
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Singleton"),
"got {syms:?}"
);
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Singleton.ping"),
"got {syms:?}"
);
// import grouped as <top-level>
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"),
"got {syms:?}"
);
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 {
assert_eq!(extract_fixture().blocks, a.blocks);
}
}
}
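The symbol naming and the `<module>` to `<top-level>` post-pass used by the Kotlin extractor can be sketched in isolation. This is a minimal sketch, not the crate's code: the `join_symbol` below is a hypothetical stand-in for `scaffold::join_symbol` (its real body is not shown in this diff) that assumes plain dot-joining, which is what the asserted symbols in the tests above suggest. `demote_module_label` is an illustrative name for the inline loop in `build_blocks`.

```rust
/// Hypothetical stand-in for `scaffold::join_symbol` (assumption: the real
/// helper dot-joins the package prefix, the enclosing type path, and the
/// leaf name).
fn join_symbol(prefix: &str, path: &[String], name: &str) -> String {
    let mut parts: Vec<&str> = Vec::with_capacity(path.len() + 2);
    if !prefix.is_empty() {
        parts.push(prefix);
    }
    parts.extend(path.iter().map(String::as_str));
    parts.push(name);
    parts.join(".")
}

/// Post-pass over `(symbol, line_start, line_end, is_real)` units: an
/// import-only glue group keeps its `<module>` label only when the file
/// produced no real unit; otherwise it is relabelled `<top-level>`.
fn demote_module_label(units: &mut [(String, u32, u32, bool)]) {
    let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
    if !has_real_unit {
        return;
    }
    for (sym, _, _, is_real) in units.iter_mut() {
        if !*is_real {
            if let Some(pre) = sym.strip_suffix("<module>") {
                *sym = format!("{pre}<top-level>");
            }
        }
    }
}
```

With an import group plus one real unit, the glue symbol `com.kebab.chunk.<module>` becomes `com.kebab.chunk.<top-level>`; a file containing only imports keeps `<module>`.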


@@ -10,18 +10,39 @@ use std::path::Path;
/// `None` if the extension / filename is not recognized.
///
/// Matching priority:
/// 1. exact filename match (e.g. `Dockerfile`, `Makefile`)
/// 2. lowercase extension match
/// 1. Tier 1 basename exact match (e.g. `Dockerfile`, `Makefile`)
/// 2. Tier 2 basename match (e.g. `Cargo.toml`, `package.json`, `build.gradle`)
/// 3. Tier 2 `Dockerfile.*` prefix variant
/// 4. Tier 1 + Tier 2 extension fallback (lowercase)
pub fn code_lang_for_path(path: &Path) -> Option<&'static str> {
if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
// Tier 1 basename exact match
match name {
"Dockerfile" => return Some("dockerfile"),
"Makefile" | "GNUmakefile" => return Some("make"),
_ => {}
}
// Tier 2 basename match (configuration / manifest files)
match name {
"Cargo.toml" | "pyproject.toml" => return Some("toml"),
"package.json" | "tsconfig.json" => return Some("json"),
"go.mod" => return Some("go-mod"),
"pom.xml" => return Some("xml"),
"build.gradle" => return Some("groovy"),
_ => {}
}
// Tier 2: `Dockerfile.*` prefix variant (e.g. `Dockerfile.dev`, `Dockerfile.prod`)
if name.starts_with("Dockerfile.") && name.len() > "Dockerfile.".len() {
return Some("dockerfile");
}
}
// Extension fallback (Tier 1 + Tier 2)
let ext = path.extension()?.to_str()?.to_ascii_lowercase();
match ext.as_str() {
// Tier 1 extensions
"rs" => Some("rust"),
"py" | "pyi" => Some("python"),
"ts" | "tsx" | "mts" | "cts" => Some("typescript"),
@@ -31,12 +52,15 @@ pub fn code_lang_for_path(path: &Path) -> Option<&'static str> {
"kt" | "kts" => Some("kotlin"),
"c" | "h" => Some("c"),
"cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"),
"sh" | "bash" | "zsh" => Some("shell"),
"mk" => Some("make"),
// Tier 2 extensions
"yaml" | "yml" => Some("yaml"),
"toml" => Some("toml"),
"json" => Some("json"),
"sh" | "bash" | "zsh" => Some("shell"),
"mk" => Some("make"),
"xml" => Some("xml"),
"dockerfile" => Some("dockerfile"),
"gradle" => Some("groovy"),
_ => None,
}
}
@@ -118,4 +142,28 @@ mod tests {
assert_eq!(module_path_for_tsjs("a/b/c.ts"), "a/b/c");
assert_eq!(module_path_for_tsjs("packages/x/src/Foo.ts"), "packages/x/src/Foo");
}
#[test]
fn tier2_basename_takes_precedence_over_extension() {
assert_eq!(code_lang_for_path(Path::new("Dockerfile")), Some("dockerfile"));
assert_eq!(code_lang_for_path(Path::new("foo/Dockerfile.dev")), Some("dockerfile"));
assert_eq!(code_lang_for_path(Path::new("myapp.dockerfile")), Some("dockerfile"));
assert_eq!(code_lang_for_path(Path::new("repo/Cargo.toml")), Some("toml"));
assert_eq!(code_lang_for_path(Path::new("pyproject.toml")), Some("toml"));
assert_eq!(code_lang_for_path(Path::new("repo/package.json")), Some("json"));
assert_eq!(code_lang_for_path(Path::new("tsconfig.json")), Some("json"));
assert_eq!(code_lang_for_path(Path::new("go.mod")), Some("go-mod"));
assert_eq!(code_lang_for_path(Path::new("pom.xml")), Some("xml"));
assert_eq!(code_lang_for_path(Path::new("build.gradle")), Some("groovy"));
}
#[test]
fn tier2_extension_fallback() {
assert_eq!(code_lang_for_path(Path::new("k8s/deploy.yaml")), Some("yaml"));
assert_eq!(code_lang_for_path(Path::new("k8s/deploy.yml")), Some("yaml"));
assert_eq!(code_lang_for_path(Path::new("foo/bar.toml")), Some("toml"));
assert_eq!(code_lang_for_path(Path::new("foo/bar.json")), Some("json"));
assert_eq!(code_lang_for_path(Path::new("foo/bar.xml")), Some("xml"));
assert_eq!(code_lang_for_path(Path::new("foo/bar.gradle")), Some("groovy"));
}
}
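The matching priority documented on `code_lang_for_path` (exact basename, then the `Dockerfile.*` prefix, then the lowercased extension) can be exercised with a reduced copy. The sketch below is illustrative only: `lang_for` is a hypothetical name and its language table is deliberately a small subset of the real one.

```rust
use std::path::Path;

/// Reduced sketch of the lookup order; not the full table from the crate.
fn lang_for(path: &Path) -> Option<&'static str> {
    if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
        // Steps 1 and 2: exact basename matches win over everything else.
        match name {
            "Dockerfile" => return Some("dockerfile"),
            "Makefile" | "GNUmakefile" => return Some("make"),
            "Cargo.toml" | "pyproject.toml" => return Some("toml"),
            "go.mod" => return Some("go-mod"),
            _ => {}
        }
        // Step 3: `Dockerfile.<suffix>` variants; a bare trailing dot does not count.
        if name.starts_with("Dockerfile.") && name.len() > "Dockerfile.".len() {
            return Some("dockerfile");
        }
    }
    // Step 4: lowercased extension fallback.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "rs" => Some("rust"),
        "toml" => Some("toml"),
        "yaml" | "yml" => Some("yaml"),
        "sh" | "bash" | "zsh" => Some("shell"),
        _ => None,
    }
}
```

The basename step matters for files such as `go.mod`, whose extension (`mod`) is not in the extension table, and for `Dockerfile.dev`, whose extension (`dev`) says nothing about the file type.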


@@ -13,7 +13,10 @@
//! `kebab-parse-*` crates per design §8: must NOT depend on store / embed
//! / llm / rag.
pub mod go;
pub mod java;
pub mod javascript;
pub mod kotlin;
pub mod lang;
pub mod python;
pub mod repo;
@@ -22,7 +25,10 @@ pub(crate) mod scaffold;
pub mod skip;
pub mod typescript;
pub use go::{PARSER_VERSION as GO_PARSER_VERSION, GoAstExtractor};
pub use java::{PARSER_VERSION as JAVA_PARSER_VERSION, JavaAstExtractor};
pub use javascript::{PARSER_VERSION as JS_PARSER_VERSION, JavascriptAstExtractor};
pub use kotlin::{PARSER_VERSION as KOTLIN_PARSER_VERSION, KotlinAstExtractor};
pub use lang::{code_lang_for_path, module_path_for_python, module_path_for_tsjs};
pub use python::{PARSER_VERSION as PYTHON_PARSER_VERSION, PythonAstExtractor};
pub use repo::{RepoMeta, detect_repo};


@@ -0,0 +1,34 @@
// sample.go
package chunk
import (
"fmt"
"strings"
)
const Version = "v1"
type MdHeadingV1Chunker struct {
Name string
}
// ChunkDoc returns a stub list of strings.
func (m *MdHeadingV1Chunker) ChunkDoc(input string) []string {
return []string{m.Name}
}
func (m MdHeadingV1Chunker) Name2() string {
return m.Name
}
type Stringer interface {
String() string
}
func Free(x int) int {
return x + 1
}
func init() {
fmt.Println(strings.ToUpper("init"))
}


@@ -0,0 +1,36 @@
// sample.java
package com.kebab.chunk;
import java.util.List;
import java.util.stream.Collectors;
/**
* Heading-aware Markdown chunker.
*/
public class MdHeadingV1Chunker {
private final String name;
public MdHeadingV1Chunker(String name) {
this.name = name;
}
public List<String> chunkDoc(String input) {
return List.of(name, input);
}
public String getName() {
return name;
}
public static class Builder {
private String name;
public Builder withName(String n) { this.name = n; return this; }
public MdHeadingV1Chunker build() { return new MdHeadingV1Chunker(name); }
}
}
interface Stringer {
String asString();
}
enum Mode { DEFAULT, FAST }


@@ -0,0 +1,29 @@
// sample.kt
package com.kebab.chunk
import java.util.List
/**
* Heading-aware Markdown chunker.
*/
class MdHeadingV1Chunker(val name: String) {
fun chunkDoc(input: String): List<String> = listOf(name, input)
fun getName(): String = name
companion object {
fun withName(n: String): MdHeadingV1Chunker = MdHeadingV1Chunker(n)
}
}
interface Stringer {
fun asString(): String
}
enum class Mode { DEFAULT, FAST }
fun freeFunction(x: Int): Int = x + 1
object Singleton {
fun ping(): String = "pong"
}


@@ -12,6 +12,12 @@ use kebab_core::{AudioType, ImageType, MediaType};
/// `MediaType::Image(_)` / `MediaType::Audio(_)`. Anything else (including
/// missing extension) → `MediaType::Other(ext)`.
pub(crate) fn media_type_for(path: &Path) -> MediaType {
// p10-2: code_lang_for_path is the single source of truth for code lang
// (design §3.5). Delegate before falling back to extension branches.
if let Some(lang) = kebab_parse_code::code_lang_for_path(path) {
return MediaType::Code(lang.to_string());
}
let ext = path
.extension()
.and_then(|s| s.to_str())
@@ -36,16 +42,6 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
"flac" => MediaType::Audio(AudioType::Flac),
"ogg" => MediaType::Audio(AudioType::Ogg),
// p10-1A-2: Rust is the only code lang activated in 1A. Other
// recognized code langs stay Other until their phase (1B+).
"rs" => MediaType::Code("rust".to_string()),
// p10-1B: Python / TS / JS AST chunkers active.
"py" | "pyi" => MediaType::Code("python".into()),
// .mts / .cts are TypeScript ESM / CommonJS variants — same grammar.
"ts" | "tsx" | "mts" | "cts" => MediaType::Code("typescript".into()),
"js" | "mjs" | "cjs" | "jsx" => MediaType::Code("javascript".into()),
// Empty string (no extension) and any other extension: bucket as
// Other and let downstream extractors decide if they support it.
_ => MediaType::Other(ext),
@@ -89,7 +85,8 @@ mod tests {
media_type_for(Path::new("crates/kebab-core/src/lib.rs")),
MediaType::Code("rust".to_string())
);
assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Other("toml".to_string()));
// Cargo.toml is a Tier 2 code manifest (p10-2), handled by code_lang_for_path
assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Code("toml".to_string()));
}
#[test]
@@ -119,6 +116,18 @@ mod tests {
assert_eq!(media_type_for(Path::new("docs/page.mdx")), MediaType::Markdown);
}
#[test]
fn go_files_map_to_media_code_go() {
assert_eq!(media_type_for(Path::new("a/b.go")), MediaType::Code("go".into()));
}
#[test]
fn java_kotlin_files_map_to_media_code() {
assert_eq!(media_type_for(Path::new("a/b.java")), MediaType::Code("java".into()));
assert_eq!(media_type_for(Path::new("a/b.kt")), MediaType::Code("kotlin".into()));
assert_eq!(media_type_for(Path::new("a/b.kts")), MediaType::Code("kotlin".into()));
}
#[test]
fn unknown_and_missing_extension() {
assert_eq!(
@@ -130,4 +139,14 @@ mod tests {
MediaType::Other(String::new())
);
}
#[test]
fn tier2_files_map_to_media_code() {
assert_eq!(media_type_for(Path::new("a/deploy.yaml")), MediaType::Code("yaml".into()));
assert_eq!(media_type_for(Path::new("a/Dockerfile")), MediaType::Code("dockerfile".into()));
assert_eq!(media_type_for(Path::new("a/Cargo.toml")), MediaType::Code("toml".into()));
assert_eq!(media_type_for(Path::new("a/pom.xml")), MediaType::Code("xml".into()));
assert_eq!(media_type_for(Path::new("a/build.gradle")), MediaType::Code("groovy".into()));
assert_eq!(media_type_for(Path::new("a/go.mod")), MediaType::Code("go-mod".into()));
}
}


@@ -264,6 +264,28 @@ impl kebab_core::DocumentStore for SqliteStore {
}))
}
fn get_asset(
&self,
id: &kebab_core::AssetId,
) -> Result<Option<kebab_core::RawAsset>> {
let conn = self.lock_conn();
let result = conn.query_row(
r#"SELECT
asset_id, source_uri, workspace_path, media_type,
byte_len, checksum, storage_kind, storage_path,
discovered_at
FROM assets
WHERE asset_id = ?"#,
rusqlite::params![id.0.as_str()],
asset_from_row,
);
match result {
Ok(asset) => Ok(Some(asset)),
Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None),
Err(e) => Err(e.into()),
}
}
fn get_asset_by_workspace_path(
&self,
path: &kebab_core::WorkspacePath,
@@ -632,7 +654,8 @@ fn rows_optional<T>(err: rusqlite::Error) -> rusqlite::Result<Option<T>> {
/// Reconstruct a [`kebab_core::RawAsset`] from one `assets` row.
/// Row mapper for `RawAsset`. Column names are self-documenting; the
/// SELECT in [`DocumentStore::get_asset_by_workspace_path`] must include
/// SELECTs in [`DocumentStore::get_asset`] and
/// [`DocumentStore::get_asset_by_workspace_path`] must both include
/// all nine columns by their schema names.
fn asset_from_row(row: &rusqlite::Row<'_>) -> rusqlite::Result<kebab_core::RawAsset> {
use std::path::PathBuf;


@@ -652,6 +652,20 @@ pub fn purge_deleted_workspace_path(
/// path (which has bytes + computed `storage_kind/path`) and the
/// `DocumentStore::put_asset` path (which only has the `RawAsset` and
/// reads `storage_kind/path` from `asset.stored`).
///
/// **`assets.workspace_path` is "last-registered path" semantics for
/// twin files** (two source files with identical content share one
/// `assets` row keyed on `asset_id = blake3(content)`). Each ingest
/// of either twin overwrites `workspace_path` with whichever path was
/// seen most recently — this is intentional and correct after PR #146
/// made `try_skip_unchanged` document-centric (uses
/// `get_document_by_workspace_path`, not `get_asset_by_workspace_path`)
/// and PR #149 made `reset --orphans-only` document-centric too.
/// Do NOT "fix" the flip-flop by adding a UNIQUE constraint on
/// `workspace_path` in the `assets` table — twin de-dup is load-bearing.
/// When you need media_type for a known document, use the 2-step lookup
/// `get_document_by_workspace_path` → `doc.source_asset_id` →
/// `get_asset(asset_id)` so the result is twin-safe.
pub(crate) fn upsert_asset_row(
conn: &Connection,
asset: &kebab_core::RawAsset,
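
The "last-registered path" behaviour and the twin-safe 2-step lookup described in the doc comment above can be illustrated with an in-memory model. Everything here is hypothetical (the `MemStore` type, its field names, and the `"code/shell"` media-type string are assumptions for illustration, not the SQLite schema or the `DocumentStore` API); it only mirrors the keying described in the comment: assets keyed by content hash, documents keyed by workspace path.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down asset row; the real table has more columns.
struct AssetRow {
    workspace_path: String, // "last-registered path" semantics
    media_type: String,
}

/// Hypothetical document row pointing back at its asset.
struct DocRow {
    source_asset_id: String,
}

#[derive(Default)]
struct MemStore {
    assets: HashMap<String, AssetRow>, // keyed by asset_id (content hash)
    docs: HashMap<String, DocRow>,     // keyed by workspace_path
}

impl MemStore {
    /// Upsert keyed on content: ingesting a twin overwrites `workspace_path`.
    fn ingest(&mut self, path: &str, content_hash: &str, media_type: &str) {
        self.assets.insert(
            content_hash.to_string(),
            AssetRow {
                workspace_path: path.to_string(),
                media_type: media_type.to_string(),
            },
        );
        self.docs.insert(
            path.to_string(),
            DocRow { source_asset_id: content_hash.to_string() },
        );
    }

    /// Path-keyed asset lookup: misses whichever twin was ingested earlier.
    fn asset_by_path(&self, path: &str) -> Option<&AssetRow> {
        self.assets.values().find(|a| a.workspace_path == path)
    }

    /// Twin-safe 2-step lookup: document by path, then asset by id.
    fn media_type_for_doc(&self, path: &str) -> Option<&str> {
        let doc = self.docs.get(path)?;
        self.assets
            .get(&doc.source_asset_id)
            .map(|a| a.media_type.as_str())
    }
}
```

After ingesting two files with identical content, the path-keyed asset lookup finds only the most recently ingested path, while the 2-step lookup resolves the media type for both.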


@@ -22,7 +22,7 @@ Cargo workspace, a modular monolith built on function calls. UI binary (`kebab-
| OCR | Ollama vision LM (default `gemma4:e4b`); the `OcrEngine` trait allows a future swap to Tesseract, Apple Vision, etc. (HOTFIXES P6-2) |
| Image caption | Ollama vision LM, runtime gate `image.caption.enabled` (default OFF) |
| PDF parser | `lopdf` per-page text; `chunker_version = "pdf-page-v1"` is hardcoded for PDF assets (HOTFIXES P7-3) |
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript`, **parser-side** (`kebab-parse-code`), not chunker-side (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`. `ast_chunk_max_lines = 200` is a fixed constant (HOTFIXES 2026-05-19: the Chunker trait does not expose per-medium config). |
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` / `tree-sitter-go` / `tree-sitter-java` / `tree-sitter-kotlin-ng`, **parser-side** (`kebab-parse-code`), not chunker-side (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`, Go = `code-go-ast-v1`, Java = `code-java-ast-v1`, Kotlin = `code-kotlin-ast-v1`. `ast_chunk_max_lines = 200` is a fixed constant (HOTFIXES 2026-05-19: the Chunker trait does not expose per-medium config). The Kotlin grammar is `tree-sitter-kotlin-ng`; the bare `tree-sitter-kotlin` crate is pinned to tree-sitter 0.21 to 0.23 and cannot be used. **Tier 2 (p10-2)**: YAML/k8s → `serde_yaml` + `k8s-manifest-resource-v1` (apiVersion+kind per resource), Dockerfile → `dockerfile-file-v1` (whole-file), Cargo.toml/go.mod/.json/.xml/.groovy → `manifest-file-v1` (whole-file). Tier 2 chunkers live in `kebab-chunk`; no tree-sitter grammar is needed (structure comes from the file type, not an AST). **Tier 3 (p10-3)**: shell scripts (`.sh`/`.bash`/`.zsh`) route directly to `code-text-paragraph-v1` (blank-line paragraph segmentation, with an 80-line / 20-overlap line window for oversize paragraphs). The same chunker also serves as the fallback when Tier 1/2 emit 0 chunks or return Err, so non-k8s YAML, invalid YAML, and AST extractor failures are all picked up. symbol = None; lang is preserved from the input doc. |
| 1B symbol path | workspace path → module path: Python = dotted prefix (`kebab_eval.metrics.compute_mrr`), TypeScript/JavaScript = slash-style prefix (`src/Foo.Foo.search`). Rust 1A-2 uses file-scope nesting only (no workspace prefix; the inconsistency is accepted, HOTFIXES 2026-05-20). |
| TUI | Ratatui + crossterm; P9-1 Library panel, P9-2/3/4 upcoming |
| Desktop | Tauri 2 + `pdfjs-dist` (native PDF render backend prohibited), P9-5 |
@@ -52,7 +52,7 @@ flowchart TB
ppdf["kebab-parse-pdf"]
pimg["kebab-parse-image"]
paud["kebab-parse-audio<br/>(P8 on hold)"]
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B)"]
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B + P10-1C-Go + P10-1C-JK + P10-2 + P10-3)"]
ptypes["kebab-parse-types"]
norm["kebab-normalize"]
chunk["kebab-chunk"]
@@ -127,7 +127,7 @@ flowchart TB
UI must not depend directly on store/llm/parse. Every user-facing entry point goes through the `kebab-app` facade only (frozen design §8). For `kebab-cli` to honor the `--config <path>` flag, Config is threaded explicitly through the `kebab_app::*_with_config(cfg, …)` companions; see the `--config` entry in [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) for the rationale.
External tree-sitter grammar crate dependencies of `kebab-parse-code`: `tree-sitter-rust` added in P10-1A-2; `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` added in P10-1B. All are isolated to `kebab-parse-code` (facade rule: UI crates and chunkers must not import them directly).
External tree-sitter grammar crate dependencies of `kebab-parse-code`: `tree-sitter-rust` added in P10-1A-2; `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` added in P10-1B; `tree-sitter-go` added in P10-1C-Go; `tree-sitter-java` / `tree-sitter-kotlin-ng` added in P10-1C-JK. All are isolated to `kebab-parse-code` (facade rule: UI crates and chunkers must not import them directly). Kotlin uses `tree-sitter-kotlin-ng` (bare `tree-sitter-kotlin` is stuck on tree-sitter 0.210.23 and cannot be used).
## Directory structure
@@ -165,7 +165,14 @@ kebab/
│ ├── kebab-source-fs/ # workspace walk + checksum (P1-1)
│ ├── kebab-parse-md/ # Markdown frontmatter + blocks (P1-2/3)
│ ├── kebab-normalize/ # ParsedBlock → CanonicalDocument (P1-4)
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-rust-ast-v1 + code-python-ast-v1 + code-ts-ast-v1 + code-js-ast-v1 chunker (P1-5, P7-2, P10-1A-2, P10-1B)
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-*-ast-v1 (Tier 1) + k8s-manifest-resource-v1 + dockerfile-file-v1 + manifest-file-v1 + tier2_shared (P10-2) + code-text-paragraph-v1 (P10-3) chunker (P1-5, P7-2, P10-1A-2, P10-1B, P10-1C-Go, P10-1C-JK, P10-2, P10-3)
│ │ └── src/
│ │ ├── code_*_ast_v1.rs # Tier 1 AST chunkers (rust/python/ts/js/go/java/kotlin)
│ │ ├── k8s_manifest_resource_v1.rs # Tier 2 (p10-2): YAML multi-doc, apiVersion+kind per resource
│ │ ├── dockerfile_file_v1.rs # Tier 2 (p10-2): whole-file Dockerfile
│ │ ├── manifest_file_v1.rs # Tier 2 (p10-2): whole-file Cargo.toml / go.mod / .json / .xml / .groovy
│ │ ├── code_text_paragraph_v1.rs # Tier 3 (p10-3): blank-line paragraph + 80/20 line-window fallback
│ │ └── tier2_shared.rs # Tier 2 (p10-2): shared oversize fallback + Chunk builder helpers
│ ├── kebab-store-sqlite/ # SQLite + FTS5 (V001/V002/V003) (P1-6, P2-1, P3-3)
│ ├── kebab-search/ # Lexical + Vector + Hybrid retriever (P2-2, P3-4)
│ ├── kebab-embed/ kebab-embed-local/ # Embedder trait + fastembed adapter (P3-1, P3-2)
@@ -175,7 +182,7 @@ kebab/
│ ├── kebab-eval/ # golden query runner + metrics (P5-1, P5-2)
│ ├── kebab-parse-image/ # ImageExtractor + Ollama OCR + caption (P6)
│ ├── kebab-parse-pdf/ # lopdf per-page text extractor (P7-1)
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B); chunker lives in kebab-chunk
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B), Go (P10-1C-Go), Java + Kotlin (P10-1C-JK — java.rs + kotlin.rs); chunker lives in kebab-chunk
│ ├── kebab-app/ # facade (P0 signatures + P3-5/P6-4/P7-3 bodies)
│ ├── kebab-tui/ # Ratatui shell + Library panel (P9-1)
│ ├── kebab-mcp/ # stdio MCP server — tools: schema, doctor, search, ask (P9-FB-30)


@@ -401,6 +401,153 @@ KB --json schema | jq '.stats.code_lang_breakdown'
- Expression-level functions such as `const foo = () => {...}` are captured as `<top-level>` glue (only declaration-level units are in scope for the first 1B pass). Details: `tasks/HOTFIXES.md` (2026-05-20).
- `.gitignore` is honored; built-in safety nets such as `node_modules/` / `__pycache__/` / `.venv/` are skipped automatically.
## P10-1C-Go Go code indexing
Same isolated KB setup as P10-1B. Put a `.go` file in the workspace and ingest it; the `code-go-ast-v1` chunker processes it as a package-level AST.
```bash
cat > /tmp/kebab-smoke/workspace/sample_code/hello.go <<'EOF'
package main
import "fmt"
func Hello(name string) string {
return fmt.Sprintf("Hello, %s!", name)
}
EOF
KB ingest
KB search --mode hybrid "Hello" --code-lang go --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# expected: symbol = "main.Hello", lang = "go"
```
## P10-2 Tier 2 resource file indexing
Same isolated KB setup as P10-1C-Go. Put Tier 2 resource files such as `.yaml` / `Dockerfile` / `.toml` in the workspace and ingest them; each is handled by the chunker that matches its file type.
```bash
# 1) Kubernetes manifest (YAML multi-doc)
cat > /tmp/kebab-smoke/workspace/deploy.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
  namespace: default
spec:
  selector:
    app: my-app
  ports:
    - port: 80
EOF
# 2) Dockerfile (whole file as a single chunk)
cat > /tmp/kebab-smoke/workspace/Dockerfile <<'EOF'
FROM rust:1.85 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/kebab /usr/local/bin/kebab
ENTRYPOINT ["kebab"]
EOF
# 3) Cargo.toml (manifest: whole file as a single chunk)
cp Cargo.toml /tmp/kebab-smoke/workspace/Cargo.toml
# 4) ingest
KB ingest
# 5) per-language search (check citation.symbol)
KB search --mode hybrid "Deployment" --code-lang yaml --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# expected: symbol = "Deployment/default/my-app" (kind/namespace/name), lang = "yaml"
KB search --mode hybrid "rust:1.85" --code-lang dockerfile --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# expected: symbol = "<dockerfile>", lang = "dockerfile"
KB search --mode hybrid "kebab-cli" --code-lang toml --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# expected: symbol = "<manifest>", lang = "toml"
# 6) check the Tier 2 language counts in schema stats
KB --json schema | jq '.stats.code_lang_breakdown'
# expected: {"yaml": N, "dockerfile": N, "toml": N, ...}
```
**Tier 2 citation.symbol conventions**:
- **YAML k8s resources**: `<kind>/<namespace>/<name>` (e.g. `Deployment/default/my-app`). Without a `namespace`, `<kind>/<name>`. Multi-doc YAML yields one chunk per resource, split on the `---` separator.
- **Dockerfile**: `<dockerfile>` (fixed symbol; the whole file is a single chunk).
- **TOML / JSON / XML / Groovy / go.mod**: `<manifest>` (fixed symbol; the whole file is a single chunk). If the file exceeds the oversize threshold in `tier2_shared`, it falls back to line-based chunks.
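The k8s symbol convention above can be sketched as two small pure functions. This is a minimal illustration, not the code in `k8s_manifest_resource_v1.rs`: the names `k8s_symbol` and `split_yaml_documents` are hypothetical, and the naive `---` line split stands in for the `serde_yaml` multi-document parsing that the real chunker is described as using.

```rust
/// Hypothetical helper: builds the Tier 2 citation symbol for one k8s resource.
/// `namespace` is `None` when the manifest omits `metadata.namespace`.
fn k8s_symbol(kind: &str, namespace: Option<&str>, name: &str) -> String {
    match namespace {
        // `<kind>/<namespace>/<name>`
        Some(ns) => format!("{kind}/{ns}/{name}"),
        // `<kind>/<name>` when no namespace is present
        None => format!("{kind}/{name}"),
    }
}

/// Hypothetical helper: splits a multi-doc YAML string on `---` separator lines.
/// Assumption: a plain line comparison is enough for illustration; it does not
/// handle `---` inside block scalars the way a real YAML parser would.
fn split_yaml_documents(src: &str) -> Vec<String> {
    let mut docs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in src.lines() {
        if line.trim_end() == "---" {
            if !current.is_empty() {
                docs.push(current.join("\n"));
                current.clear();
            }
        } else {
            current.push(line);
        }
    }
    if !current.is_empty() {
        docs.push(current.join("\n"));
    }
    docs
}
```

Each document returned by the split would map to one chunk, with the symbol built from its `kind`, optional `metadata.namespace`, and `metadata.name`.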
## P10-3 Tier 3 paragraph fallback
Same isolated KB setup as P10-2. `.sh` files go to Tier 3 directly; non-k8s YAML reaches it through the fallback.
```bash
# 1) shell script (direct Tier 3)
cat > /tmp/kebab-smoke/workspace/deploy.sh <<'EOF'
#!/usr/bin/env bash
set -e
echo "ingesting..."
kebab ingest
echo "done"
kebab schema --json | jq '.stats'
EOF
# 2) non-k8s YAML (Tier 2 emits 0 chunks → Tier 3 fallback)
cat > /tmp/kebab-smoke/workspace/docker-compose.yml <<'EOF'
version: '3'
services:
  api:
    image: nginx:latest
    ports:
      - 8080:80
EOF
# 3) ingest
KB ingest
# 4) per-language search (check citation.symbol = None)
KB search --mode hybrid "ingest" --code-lang shell --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
# expected: symbol = null, lang = "shell", chunker_version = "code-text-paragraph-v1"
KB search --mode hybrid "nginx" --code-lang yaml --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
# expected: symbol = null, lang = "yaml", chunker_version = "code-text-paragraph-v1"
# 5) check the shell count in schema stats
KB --json schema | jq '.stats.code_lang_breakdown'
# expected: {"shell": N, "yaml": M, ...}
```
**Tier 3 citation.symbol convention**: always `null`. No semantic unit is identified. `lang` preserves the original lang (shell → `"shell"`, yaml → `"yaml"`, etc.).
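The Tier 3 strategy (blank-line paragraph segmentation, then an 80-line window with 20 lines of overlap for oversize paragraphs) can be sketched on line ranges alone. This is an assumption-laden sketch rather than the contents of `code_text_paragraph_v1.rs`: the function names, the constant names, and the choice of 1-based inclusive ranges are all hypothetical, and the real chunker may treat edge cases differently.

```rust
// Window parameters taken from the "80-line / 20-overlap" description.
const WINDOW_LINES: usize = 80;
const OVERLAP_LINES: usize = 20;

/// Hypothetical helper: returns 1-based inclusive (line_start, line_end)
/// ranges, one per run of consecutive non-blank lines.
fn paragraph_spans(src: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    let mut last = 0usize;
    for (idx, line) in src.lines().enumerate() {
        let line_no = idx + 1;
        if line.trim().is_empty() {
            // A blank line closes the paragraph that is currently open, if any.
            if let Some(s) = start.take() {
                spans.push((s, last));
            }
        } else {
            if start.is_none() {
                start = Some(line_no);
            }
            last = line_no;
        }
    }
    if let Some(s) = start {
        spans.push((s, last));
    }
    spans
}

/// Hypothetical helper: splits a span longer than WINDOW_LINES into
/// overlapping windows; shorter spans pass through unchanged.
fn window_oversize(span: (usize, usize)) -> Vec<(usize, usize)> {
    let (start, end) = span;
    if end - start + 1 <= WINDOW_LINES {
        return vec![span];
    }
    let step = WINDOW_LINES - OVERLAP_LINES;
    let mut out = Vec::new();
    let mut s = start;
    loop {
        let e = (s + WINDOW_LINES - 1).min(end);
        out.push((s, e));
        if e == end {
            break;
        }
        s += step;
    }
    out
}

/// Paragraph segmentation followed by the oversize window.
fn tier3_spans(src: &str) -> Vec<(usize, usize)> {
    paragraph_spans(src).into_iter().flat_map(window_oversize).collect()
}
```

Under these assumptions a 200-line paragraph with no blank lines yields three windows, each sharing 20 lines with its neighbour, and every resulting chunk carries `symbol = None`.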
## Verification checklist
- `kebab doctor` honors the `--config` path and prints the `storage.data_dir` from that file (not the XDG default).
@@ -433,6 +580,10 @@ rm -rf /tmp/kebab-smoke # 통째로 정리
- (P7-3) If a PDF has N pages, `kebab ingest` commits N chunks (or more, when a long page becomes multi-chunk) in a single transaction. A 500-page book means 500+ chunks at once, so embedding throughput becomes the bottleneck. The first ingest of a large PDF in an embedding-enabled workspace can take minutes and grow the WAL; this works correctly until the P+ scale hardening task, but the cost is measurable.
- (P10-1A-2) A `.rs` file placed in the workspace is counted in the `new` counter of the `kebab ingest` result. Wiring is correct if `kebab search --mode hybrid "<function name>" --code-lang rust --json` returns `citation.kind = "code"`, `citation.lang = "rust"` (the SearchHit top-level `code_lang` matches), `citation.symbol` (function/type name), and `citation.line_start` / `citation.line_end`. Chunks are indexed if `kebab schema --json | jq .stats.code_lang_breakdown` shows `"rust": N`.
- (P10-1B) `.py` / `.ts` / `.tsx` / `.js` / `.mjs` / `.cjs` / `.jsx` files placed in the workspace are counted in the `new` counter of the `kebab ingest` result. Wiring is correct if `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` searches return results whose `citation.symbol` includes the module path prefix. Check that the language counts appear in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-1C-Go) A `.go` file placed in the workspace is processed by `kebab ingest` with `code-go-ast-v1`. Wiring is correct if a `--code-lang go` search returns `citation.symbol` in the form `<package>.<Func>` / `<package>.(*Receiver).<Method>`. Check that `"go": N` appears in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-1C-JK) `.java` files are processed with `code-java-ast-v1`, `.kt`/`.kts` files with `code-kotlin-ast-v1`. Wiring is correct if `--code-lang java` / `--code-lang kotlin` searches return `citation.symbol` in the form `com.foo.Foo.bar`. Check that `"java": N` / `"kotlin": N` appear in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-2) `.yaml`/`.yml` files produce one chunk per k8s resource via apiVersion+kind parsing (`k8s-manifest-resource-v1`). `Dockerfile`/`Dockerfile.*` is a single whole-file chunk (`dockerfile-file-v1`). `.toml`/`.json`/`.xml`/`.groovy`/`go.mod` are single whole-file chunks (`manifest-file-v1`). Wiring is correct if `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` searches return `citation.symbol` as `Deployment/default/my-app` / `<dockerfile>` / `<manifest>` respectively. Check that `"yaml": N` / `"dockerfile": N` / `"toml": N` appear in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-3) `.sh`/`.bash`/`.zsh` files go directly to Tier 3 (`code-text-paragraph-v1`). For non-k8s YAML (yaml without apiVersion+kind), the k8s chunker emits 0 chunks and the Tier 3 fallback picks the file up. Wiring is correct if `--code-lang shell` / `--code-lang yaml` searches return `citation.symbol = null` and `chunker_version = "code-text-paragraph-v1"`. Check that `"shell": N` appears in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P7-3 + follow-up) When a PDF with different bytes is ingested a second time at the same path, `purge_vector_orphans_for_workspace_path` first deletes the old chunk_ids from LanceDB, then `purge_orphan_at_workspace_path` sweeps the old doc / chunks / embedding_records from SQLite. The new bytes are indexed under a new `doc_id`. In `IngestReport` only that asset counts as `new+=1` (other assets are `updated`). Both stores stay consistent: searching for the old body no longer surfaces the old chunks.
### Embedding upgrade (fb-39b)


@@ -0,0 +1,540 @@
# p10-1C-Go Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task.
**Goal:** Activate Go code ingest end-to-end on top of 1A-2 (Rust) + 1B (Python/TS/JS) infrastructure. Add `tree-sitter-go` grammar + `GoAstExtractor` + `code-go-ast-v1` chunker + media routing + app dispatch arm.
**Architecture:** Mirror 1A-2 / 1B exactly. `kebab-parse-code/src/go.rs` walks tree-sitter-go parse tree; emits one `Block::Code` per top-level AST semantic unit with `SourceSpan::Code { symbol, lang: Some("go") }`. Symbol prefix = **source-extracted package name** (from `package_clause` AST node — design §3.4 Go row). `kebab-chunk/src/code_go_ast_v1.rs` is a near-duplicate of `code-rust-ast-v1`. App dispatch's `ingest_one_code_asset` (PR #142 generalized 4-arm match) gets a 5th arm.
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already in workspace), `tree-sitter-go` (NEW), 1A-2/1B infrastructure unchanged.
**Memory note:** Host has been OOM-killed previously. Use `cargo test -p <crate>` and `cargo check -p <crate>` only. ONE full-suite invocation reserved for Task G gate.
---
## Pre-flight
Branch `feat/p10-1c-go` already exists.
- [ ] **Disk hygiene**: `cargo clean` if previous artifacts are bloated. Skip if disk is comfortable (`df -h /`).
Reference files:
- 1A-2 Rust extractor: `crates/kebab-parse-code/src/rust.rs` — closest single-language scaffold template.
- 1B Python extractor (closest analog for "class-nesting recursion" — Go doesn't have classes but has package as the single prefix): `crates/kebab-parse-code/src/python.rs`.
- 1A-2 chunker scaffold: `crates/kebab-chunk/src/code_rust_ast_v1.rs`.
- 1B dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` (~L1645, 4-arm match).
- 1A-2 source-fs routing: `crates/kebab-source-fs/src/media.rs` `"rs" =>` arm.
---
## Task A: Workspace dep `tree-sitter-go`
**Files:**
- Modify: `Cargo.toml` (workspace `[workspace.dependencies]`, after `tree-sitter-javascript` line)
- Modify: `crates/kebab-parse-code/Cargo.toml`
- [ ] **Step 1**: `cargo add tree-sitter-go -p kebab-parse-code` to resolve version.
- [ ] **Step 2**: Lift the resolved version into `[workspace.dependencies]` after `tree-sitter-javascript`:
```toml
# Go grammar for code ingest (kebab-parse-code, p10-1C).
tree-sitter-go = "<resolved>"
```
Switch the crate's entry to `{ workspace = true }` matching existing tree-sitter-* style.
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
- [ ] **Step 4**: Commit:
```bash
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
git commit -m "build(p10-1c-go): add tree-sitter-go workspace dep
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task B: source-fs media routing `.go` → `MediaType::Code("go")`
**Files:**
- Modify: `crates/kebab-source-fs/src/media.rs` (add arm after the existing JS arm at ~L44)
- Test: same file's test module
- [ ] **Step 1 (failing test)** — add to existing tests near `py_ts_js_files_map_to_media_code`:
```rust
#[test]
fn go_files_map_to_media_code_go() {
assert_eq!(media_type_for(Path::new("a/b.go")), MediaType::Code("go".into()));
}
```
- [ ] **Step 2**: Run → FAIL.
- [ ] **Step 3**: Add the arm before the catch-all `_ => MediaType::Other(ext)`:
```rust
// p10-1C-Go: Go ingest activated.
"go" => MediaType::Code("go".into()),
```
- [ ] **Step 4**: Run → PASS. `cargo test -p kebab-source-fs` → no regression.
- [ ] **Step 5**: clippy clean, commit:
```bash
git add crates/kebab-source-fs/
git commit -m "feat(p10-1c-go): route .go to MediaType::Code(go)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task C: App dispatch allowlist + bail arm for "go"
**Files:**
- Modify: `crates/kebab-app/src/lib.rs` (dispatch match guard + 4 internal match arms in `ingest_one_code_asset`)
- [ ] **Step 1**: Find the `MediaType::Code(lang) if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript")` arm (~L953). Add `"go"` to the allowlist:
```rust
MediaType::Code(lang)
if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript" | "go") =>
{
```
- [ ] **Step 2**: In `ingest_one_code_asset`'s 4 `match code_lang` blocks (parser_version, chunker_version, extract, chunk), add a "go" arm that `bail!()`s for now (extractor + chunker land in Task D/E). Mirror the Python/TS/JS bail-then-activate pattern:
```rust
let parser_version = match code_lang {
// ... existing arms ...
"go" => anyhow::bail!("go ingest not yet wired (p10-1c-go Task F)"),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
// similar for chunker_version / extract / chunk matches
```
- [ ] **Step 3**: `cargo test -p kebab-app --lib` → existing 52 lib tests stay green. `cargo test -p kebab-app --test code_ingest_smoke` → 6 stay green (Rust path unaffected).
- [ ] **Step 4**: clippy clean, commit:
```bash
git add crates/kebab-app/
git commit -m "refactor(p10-1c-go): add go to ingest dispatch allowlist (bail until Task F)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task D: `GoAstExtractor` (`kebab-parse-code/src/go.rs`)
**Files:**
- Create: `crates/kebab-parse-code/src/go.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs` (`pub mod go;` + re-exports `GO_PARSER_VERSION`, `GoAstExtractor`)
- Create: `crates/kebab-parse-code/tests/fixtures/sample.go`
Scaffold mirrors `crates/kebab-parse-code/src/rust.rs` line-for-line for the `CanonicalDocument` skeleton (Extractor trait impl, `id_for_doc`, ProvenanceEvent, final `CanonicalDocument` literal). The novel parts:
### Constants
```rust
pub const PARSER_VERSION: &str = "code-go-v1";
pub struct GoAstExtractor;
// new() + Default
// supports: matches!(m, MediaType::Code(l) if l == "go")
// agent = "kb-parse-code"
// metadata.code_lang = Some("go")
// SourceType::Note (no SourceType::Code variant)
// repo/git_branch/git_commit via detect_repo
```
### Package extraction
Unlike 1B's path-based `module_path_for_python` / `_for_tsjs`, the Go package prefix comes from the **source code's `package` declaration** (design §3.4). tree-sitter-go's grammar:
- Root: `source_file`
- First named child is typically `package_clause` → contains `package_identifier` child whose text is the package name.
Helper (local to `go.rs`):
```rust
/// Returns the package name from a tree-sitter-go `source_file`, or
/// `None` if the file has no `package_clause` (invalid Go in practice,
/// but be defensive).
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
    let mut cur = root.walk();
    for child in root.named_children(&mut cur) {
        if child.kind() == "package_clause" {
            // `package_clause` has a `package_identifier` named child.
            let mut c2 = child.walk();
            for sub in child.named_children(&mut c2) {
                if sub.kind() == "package_identifier" {
                    return Some(src[sub.start_byte()..sub.end_byte()].to_string());
                }
            }
        }
    }
    None
}
```
### Semantic-unit rules
| node kind | unit | symbol |
|-----------|------|--------|
| `function_declaration` (name field) | 1 | `<pkg>.<fn_name>` |
| `method_declaration` | 1 | `<pkg>.(<TypeText>).<MethodName>` where `<TypeText>` includes a leading `*` if the receiver is `pointer_type`. Examples: `chunk.(*MdHeadingV1Chunker).ChunkDoc`, `chunk.(Foo).Bar`. |
| `type_declaration` (struct / interface / type alias) | 1 per inner `type_spec` | `<pkg>.<TypeName>` |
| `const_declaration`, `var_declaration`, `import_declaration` (single or block) | glue | `<pkg>.<top-level>`. If the file has zero real units and the glue is import-only, apply the same `<module>` post-pass as 1B Python; keep the name `<module>` (not `<package>`) per design §3.4, see the "module / namespace 만 있고 symbol 없는 경우" line (files that have only a module / namespace and no symbol). |
`unit_start` walks `comment` siblings (same as 1B). Go doesn't have separate attribute / decorator nodes.
Method receiver pointer detection:
```rust
// In the method_declaration arm:
let receiver = child.child_by_field_name("receiver"); // parameter_list
let receiver_type_text = receiver.and_then(|r| {
    let mut cw = r.walk();
    for p in r.named_children(&mut cw) {
        if p.kind() == "parameter_declaration" {
            // type field is either type_identifier (value) or pointer_type (ptr)
            if let Some(ty) = p.child_by_field_name("type") {
                let s = &src[ty.start_byte()..ty.end_byte()];
                return Some(s.to_string()); // includes leading "*" if pointer_type
            }
        }
    }
    None
});
// Format: "(*Foo)" or "(Foo)" — wrap in parens, preserve leading "*" if any.
let owner = receiver_type_text
    .map(|t| format!("({t})"))
    .unwrap_or_else(|| "()".to_string());
let method_name = name_text(&child, src);
// symbol = format!("{pkg}.{owner}.{method_name}")
```
Read tree-sitter-go's grammar.json or node-types.json (in the registry source) if any field name above differs in the resolved crate version.
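The symbol formatting from the table and the receiver snippet can be checked in isolation, without tree-sitter, by factoring it into pure functions. This is a sketch only: `go_function_symbol` and `go_method_symbol` are hypothetical names that do not appear in the plan, and the `"()"` owner for a missing receiver simply mirrors the `unwrap_or_else` fallback shown above.

```rust
/// Hypothetical helper: `<pkg>.<name>` for functions and type specs.
fn go_function_symbol(pkg: &str, name: &str) -> String {
    format!("{pkg}.{name}")
}

/// Hypothetical helper: `<pkg>.(<TypeText>).<MethodName>`.
/// `receiver_type` is the raw source text of the receiver's type, so a
/// pointer receiver arrives with its leading `*` already included.
fn go_method_symbol(pkg: &str, receiver_type: Option<&str>, method: &str) -> String {
    let owner = receiver_type
        .map(|t| format!("({t})"))
        // Defensive fallback when no receiver type text could be read.
        .unwrap_or_else(|| "()".to_string());
    format!("{pkg}.{owner}.{method}")
}
```

Keeping the formatting separate from the tree walk would let the expected symbols in the fixture test be asserted directly against string inputs.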
### Fixture `tests/fixtures/sample.go`:
```go
// sample.go
package chunk
import (
"fmt"
"strings"
)
const Version = "v1"
type MdHeadingV1Chunker struct {
Name string
}
// ChunkDoc returns a stub list of strings.
func (m *MdHeadingV1Chunker) ChunkDoc(input string) []string {
return []string{m.Name}
}
func (m MdHeadingV1Chunker) Name2() string {
return m.Name
}
type Stringer interface {
String() string
}
func Free(x int) int {
return x + 1
}
func init() {
fmt.Println(strings.ToUpper("init"))
}
```
### Test module
Mirror Python's test shape (use `crate::rust::tests_support::fixed_code_asset` from 1B):
```rust
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.go"),
).unwrap();
let asset = crate::rust::tests_support::fixed_code_asset(
"crates/x/src/sample.go", "go",
);
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
GoAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_go() {
let e = GoAstExtractor::new();
assert!(e.supports(&MediaType::Code("go".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn go_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc.blocks.iter().filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("go"));
symbol.clone()
}
_ => None,
},
_ => None,
}).collect();
syms.sort();
assert!(syms.iter().any(|s| s == "chunk.Free"), "got {syms:?}");
assert!(syms.iter().any(|s| s == "chunk.init"));
assert!(syms.iter().any(|s| s == "chunk.MdHeadingV1Chunker"));
assert!(syms.iter().any(|s| s == "chunk.(*MdHeadingV1Chunker).ChunkDoc"));
assert!(syms.iter().any(|s| s == "chunk.(MdHeadingV1Chunker).Name2"));
assert!(syms.iter().any(|s| s == "chunk.Stringer"));
assert!(syms.iter().any(|s| s == "chunk.<top-level>")); // import + const grouped
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 { assert_eq!(extract_fixture().blocks, a.blocks); }
}
}
```
### Step list
- [ ] Step 1: create fixture + test module.
- [ ] Step 2: run → FAIL (`GoAstExtractor` undefined).
- [ ] Step 3: implement `go.rs`. Scaffold mirrors `python.rs` (Extractor impl + extract scaffold + `build_blocks` returning blocks). `build_blocks` does: extract_package → walk root's named children → branch per node kind per the table above → emit `Block::Code` with `SourceSpan::Code { symbol, lang: Some("go") }`. Use the same `flush_glue` / glue grouping / `<top-level>` vs `<module>` post-pass as Python (rename to `<package>` if user prefers, but spec §3.4 says `<module>` so keep that name for cross-language consistency).
- [ ] Step 4: wire into `lib.rs`:
```rust
pub mod go;
pub use go::{PARSER_VERSION as GO_PARSER_VERSION, GoAstExtractor};
```
- [ ] Step 5: `cargo test -p kebab-parse-code` → all pass (Rust/Python/TS/JS + new Go). `cargo clippy -p kebab-parse-code --all-targets -- -D warnings` clean.
- [ ] Step 6: commit:
```bash
git add crates/kebab-parse-code/
git commit -m "feat(p10-1c-go): tree-sitter-go AST extractor (GoAstExtractor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task E: `code-go-ast-v1` chunker
**Files:**
- Create: `crates/kebab-chunk/src/code_go_ast_v1.rs`
- Modify: `crates/kebab-chunk/src/lib.rs`
Identical pattern to PR #142 Task I (TS) / Task L (JS) — near-duplicate of `code_rust_ast_v1.rs` with substitutions:
- `const VERSION_LABEL: &str = "code-go-ast-v1";`
- struct name `CodeGoAstV1Chunker`
- error message says `"CodeGoAstV1Chunker only handles..."`
- module doc-comment prose `Rust` → `Go`, `code-rust-ast-v1` → `code-go-ast-v1`
`split_oversize` / `make_chunk` / `AST_CHUNK_MAX_LINES = 200` / `BYTES_PER_TOKEN = 3` / `POLICY_HASH_HEX_LEN = 16` IDENTICAL (language-agnostic).
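For orientation, the language-agnostic pieces named above might look roughly like the following. This is a guess at the shape, not a copy of `code_rust_ast_v1.rs`: only the constant names and values come from the plan, while the function signatures, the non-overlapping split, and the ceiling-division token estimate are assumptions made for illustration.

```rust
// Constant names and values as listed in the plan.
const AST_CHUNK_MAX_LINES: usize = 200;
const BYTES_PER_TOKEN: usize = 3;

/// Assumed token estimate: byte length divided by BYTES_PER_TOKEN, rounded up.
fn estimate_tokens(text: &str) -> usize {
    (text.len() + BYTES_PER_TOKEN - 1) / BYTES_PER_TOKEN
}

/// Assumed oversize split: cuts a 1-based inclusive line range into
/// consecutive, non-overlapping pieces of at most AST_CHUNK_MAX_LINES lines.
/// A unit that already fits comes back as a single piece (the 1:1 case).
fn split_oversize(line_start: usize, line_end: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut s = line_start;
    while s <= line_end {
        let e = (s + AST_CHUNK_MAX_LINES - 1).min(line_end);
        out.push((s, e));
        s = e + 1;
    }
    out
}
```

Nothing here depends on the source language, which is consistent with the plan reusing these pieces unchanged for Go.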
Test module: copy from `code_ts_ast_v1.rs` and substitute names. KEEP cross-chunker `policy_hash_matches_md_heading_v1`.
Wire into `crates/kebab-chunk/src/lib.rs`:
```rust
mod code_go_ast_v1;
pub use code_go_ast_v1::CodeGoAstV1Chunker;
```
(Alphabetical placement.)
Verify + commit:
- `cargo test -p kebab-chunk code_go_ast` PASS (~6 tests)
- `cargo test -p kebab-chunk` full per-crate green
- `cargo clippy -p kebab-chunk --all-targets -- -D warnings` clean
```bash
git add crates/kebab-chunk/
git commit -m "feat(p10-1c-go): code-go-ast-v1 chunker (1:1 + oversize split)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task F: Activate Go in app dispatch
**Files:**
- Modify: `crates/kebab-app/src/lib.rs` (replace 4 "go" bail! arms with real calls)
- Modify: `crates/kebab-app/tests/code_ingest_smoke.rs` (add Go integration test)
Replace the 4 `"go" => anyhow::bail!(...)` arms in `ingest_one_code_asset` (added in Task C) with real:
```rust
"go" => ParserVersion(kebab_parse_code::GO_PARSER_VERSION.to_string()),
// ...
"go" => CodeGoAstV1Chunker.chunker_version(),
// ...
"go" => kebab_parse_code::GoAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::GoAstExtractor::extract (code:go)")?,
// ...
"go" => CodeGoAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeGoAstV1Chunker::chunk (code:go)")?,
```
Add imports at top of lib.rs:
- `kebab_chunk::CodeGoAstV1Chunker`
- `kebab_parse_code::GoAstExtractor`
Integration test (mirror PR #142's `python_file_ingests_and_searches_as_code_citation`):
```rust
#[test]
fn go_file_ingests_and_searches_as_code_citation() {
// ... TempDir + Config harness same as Python/TS test ...
let pkg_dir = env.workspace_root.join("chunk");
std::fs::create_dir_all(&pkg_dir).unwrap();
std::fs::write(
pkg_dir.join("ast.go"),
"package chunk\n\nfunc ParseDoc(input string) string {\n return input\n}\n",
).unwrap();
let report = kebab_app::ingest_with_config(/* ... */).unwrap();
assert!(report.new >= 1);
let go_item = report.items.as_ref().unwrap().iter()
.find(|i| i.doc_path.0.ends_with("ast.go")).expect("ast.go item");
assert_eq!(go_item.parser_version.as_ref().unwrap().0, "code-go-v1");
assert_eq!(go_item.chunker_version.as_ref().unwrap().0, "code-go-ast-v1");
let hits = kebab_app::search_with_config(/* search "ParseDoc" */).unwrap();
let h = hits.iter().find(|h| matches!(h.citation, kebab_core::Citation::Code { .. }))
.expect("Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code { lang, symbol, line_start, .. } => {
assert_eq!(lang.as_deref(), Some("go"));
assert_eq!(symbol.as_deref(), Some("chunk.ParseDoc"));
assert!(*line_start >= 1);
}
_ => unreachable!(),
}
assert_eq!(h.code_lang.as_deref(), Some("go"));
}
```
Verify:
- `cargo test -p kebab-app --test code_ingest_smoke` → 7/7 (6 existing + 1 new go)
- `cargo test -p kebab-app --lib` → 52/52 (no regression)
- clippy clean
```bash
git add crates/kebab-app/
git commit -m "feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task G: Snapshot + full-suite gate + manual SMOKE
**Files:**
- Create: `crates/kebab-chunk/tests/code_go_ast_snapshot.rs` + fixture + baseline (mirror `code_python_ast_snapshot.rs` from PR #142)
- [ ] **Step 1**: Add snapshot integration test. In-memory `CanonicalDocument` (no kebab-parse-code dep — boundary §6.3). Generate baseline: `UPDATE_SNAPSHOTS=1 cargo test -p kebab-chunk code_go_ast_snapshot` → re-run without env → PASS.
- [ ] **Step 2**: Full-suite gate (the ONE invocation allowed this PR):
```bash
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace --no-fail-fast -j 1
```
Both must be CLEAN/GREEN.
- [ ] **Step 3**: Manual SMOKE (optional but recommended — mirror PR #142 SMOKE):
```bash
cargo build --release # OR debug if RAM-tight
rm -rf /tmp/kebab-go-smoke && mkdir -p /tmp/kebab-go-smoke/ws/chunk
echo 'package chunk
func ParseDoc(input string) string { return input }
' > /tmp/kebab-go-smoke/ws/chunk/ast.go
# adapt isolated config from docs/SMOKE.md
./target/release/kebab --config /tmp/kebab-go-smoke/config.toml ingest --json | jq '.items[].parser_version' | sort -u
./target/release/kebab --config /tmp/kebab-go-smoke/config.toml search "ParseDoc" --code-lang go --json | jq '.hits[0]'
```
Expected: `code-go-v1` in parser_versions; Citation::Code with symbol `chunk.ParseDoc`.
- [ ] **Step 4**: Commit snapshot only (full-suite + SMOKE are gates, not commit content):
```bash
git add crates/kebab-chunk/tests/
git commit -m "test(p10-1c-go): code-go-ast-v1 chunker snapshot + full-suite gate
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task H: Docs + version bump
- README: in the supported-formats row, add Go (`.go`, `code-go-ast-v1`).
- HANDOFF: P10 phase row note 1C-Go merged (Go active). Java/Kotlin remain pending.
- ARCHITECTURE: directory tree note for kebab-parse-code includes `go.rs` (Java/Kotlin coming in next PR). Decisions table — no new row (1C-Go follows the 1A-2/1B convention).
- SMOKE: extend the P10 section with a 1-line note for Go (or compact Go example).
- tasks/INDEX + tasks/p10/INDEX: flip the row for 1C-Go to 🟡 (PR open) → ✅ on merge. The 1C row in p10/INDEX may need a split — `p10-1C-Go ⏳ → 🟡` and `p10-1C-JavaKotlin ⏳ unchanged` (since user split into 2 PRs).
- frozen design §10.1: add a one-liner — "p10-1C-Go 활성화 (Go)" (Java/Kotlin will get its own line in the next PR).
- `Cargo.toml`: workspace version `0.11.1 → 0.12.0` (minor: expands the dogfooding surface and activates a new chunker + extractor).
```bash
git add -A
git commit -m "docs(p10-1c-go): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + chore: bump version 0.11.1 → 0.12.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Finalize: PR + review loop + release
Per workflow memory (gitea-pr + review loop, no single-shot):
- [ ] `gitea-pr` → PR title `feat(p10-1C-Go): tree-sitter-go AST extractor + chunker — Go 코드 색인 활성화`
- [ ] Review loop until APPROVE → merge → main pull → branch cleanup → `cargo clean` → `gitea-release v0.12.0`.
---
## Self-Review (filled by plan author)
- **Spec coverage**: design §1C Go (extractor + chunker + activation) → Tasks D/E/F; §3.3 (`code-go-ast-v1`) → Task E; §3.4 symbol path → Task D (extract_package + method receiver pointer detection); §6.1 (`kebab-parse-code/src/go.rs`) → Task D; §6.2 (`kebab-chunk/src/code_go_ast_v1.rs`) → Task E; §6.3 dep graph (`tree-sitter-go` parser-side) → Task A; §9.1 Tier-1 + oversize fallback → Task E (1A-2 split_oversize reused identically).
- **No placeholders**: novel logic (`extract_package`, method receiver pointer detection, fixture, test assertions, dispatch arm additions) given concretely. Mechanical mirrors (chunker, integration test, snapshot test) pinned to exact existing files with substitutions.
- **Type consistency**: `GoAstExtractor` / `GO_PARSER_VERSION = "code-go-v1"` / `CodeGoAstV1Chunker` / `VERSION_LABEL = "code-go-ast-v1"` used consistently across Tasks A-H. `MediaType::Code("go")` in routing + dispatch. `Citation::Code` with `lang: Some("go")` in integration test.


@@ -0,0 +1,494 @@
# p10-1C-JavaKotlin Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task.
**Goal:** Activate Java + Kotlin code ingest end-to-end. Mirror 1C-Go (PR #151 / v0.12.0) for Java (single-language scaffold) and Kotlin (additional top-level fn variant). Both use source-side `package` extraction (design §3.4 JVM convention).
**Architecture:** Same shape as 1B (multi-language single PR). 2 new tree-sitter grammars + 2 extractors + 2 chunkers + media routing + app dispatch arms. 1C-Go pattern is the closest template for source-side `package` extraction.
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already), `tree-sitter-java` + `tree-sitter-kotlin` (NEW). 1A-2/1B/1C-Go infrastructure unchanged.
**Memory note:** Host has been OOM'd previously. Per-crate cargo only. ONE full-suite + clippy invocation in Task J.
---
## Pre-flight
Branch `feat/p10-1c-jk` already exists.
- [ ] **Disk hygiene**: `cargo clean` if heavy (last cleanup recovered 34 GB).
Reference files:
- 1C-Go extractor: `crates/kebab-parse-code/src/go.rs` — closest template for source-side package extraction.
- 1B Python extractor: `crates/kebab-parse-code/src/python.rs` — class-nesting recursion model (relevant for Java/Kotlin).
- 1A-2 chunker: `crates/kebab-chunk/src/code_rust_ast_v1.rs` — duplicate-with-substitution.
- 1B dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` 4-arm match (~L1645). 1C-Go already added `"go"`; this PR adds `"java"` + `"kotlin"`.
---
## Task A: Workspace deps (tree-sitter-java + tree-sitter-kotlin)
**Files:**
- Modify: `Cargo.toml` (workspace `[workspace.dependencies]`, after `tree-sitter-go` line)
- Modify: `crates/kebab-parse-code/Cargo.toml`
- [ ] **Step 1**: `cargo add tree-sitter-java tree-sitter-kotlin -p kebab-parse-code`. If `tree-sitter-kotlin` resolves to a fork name, verify the actively-maintained crate (e.g. check crates.io page / GitHub stars / last update). Likely `tree-sitter-kotlin` (without fork suffix) is the default.
- [ ] **Step 2**: Lift the two resolved versions into `[workspace.dependencies]` after `tree-sitter-go`:
```toml
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
tree-sitter-java = "<resolved>"
tree-sitter-kotlin = "<resolved>"
```
Switch crate's entries to `{ workspace = true }`.
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
- [ ] **Step 4**: Commit:
```bash
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
git commit -m "build(p10-1c-jk): add tree-sitter-java + tree-sitter-kotlin workspace deps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
If the kotlin crate has a different actual name (e.g. `tree-sitter-kotlin-ng` or fork suffix), document the choice in the commit body briefly.
---
## Task B: source-fs routing `.java` / `.kt` / `.kts`
**Files:**
- Modify: `crates/kebab-source-fs/src/media.rs` (add arm after the existing `.go` arm)
- Test: same file's test module
- [ ] **Step 1 (failing test)** — add near `go_files_map_to_media_code_go`:
```rust
#[test]
fn java_kotlin_files_map_to_media_code() {
    assert_eq!(media_type_for(Path::new("a/b.java")), MediaType::Code("java".into()));
    assert_eq!(media_type_for(Path::new("a/b.kt")), MediaType::Code("kotlin".into()));
    assert_eq!(media_type_for(Path::new("a/b.kts")), MediaType::Code("kotlin".into()));
}
```
- [ ] **Step 2**: Run → FAIL.
- [ ] **Step 3**: Add the arms before the `_ => MediaType::Other(ext)` fallback (after `"go" => ...`):
```rust
// p10-1C-JK: JVM family (Java + Kotlin) ingest activated.
"java" => MediaType::Code("java".into()),
"kt" | "kts" => MediaType::Code("kotlin".into()),
```
- [ ] **Step 4**: Run → PASS. `cargo test -p kebab-source-fs` → no regression.
- [ ] **Step 5**: clippy clean, commit.
```bash
cargo clippy -p kebab-source-fs --all-targets -- -D warnings
git add crates/kebab-source-fs/
git commit -m "feat(p10-1c-jk): route .java/.kt/.kts to MediaType::Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
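The routing added in Step 3 can be sketched in isolation as follows. This is a minimal illustration, not the real `media.rs`: the `MediaType` enum here is a hypothetical two-variant stand-in for the `kebab_core` type, and the extension match is assumed to be case-sensitive.

```rust
use std::path::Path;

/// Hypothetical stand-in for the real `MediaType`; only the two variants
/// this sketch needs are modelled.
#[derive(Debug, PartialEq, Eq)]
pub enum MediaType {
    Code(String),
    Other(String),
}

/// Maps a path to a media type by its file extension.
/// Assumption: extensions are matched as-is (no case folding).
pub fn media_type_for(path: &Path) -> MediaType {
    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    match ext {
        "go" => MediaType::Code("go".into()),
        // JVM family: both Kotlin extensions share one code_lang.
        "java" => MediaType::Code("java".into()),
        "kt" | "kts" => MediaType::Code("kotlin".into()),
        other => MediaType::Other(other.to_string()),
    }
}
```

Routing `.kt` and `.kts` to the same `"kotlin"` label means a single extractor and chunker pair serves both source files and scripts.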
---
## Task C: App dispatch + bail arms for "java" + "kotlin"
**Files:**
- Modify: `crates/kebab-app/src/lib.rs`
- [ ] **Step 1**: Find the dispatch arm guard (currently `matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript" | "go")`). Add `"java"` + `"kotlin"`:
```rust
MediaType::Code(lang)
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin") =>
```
- [ ] **Step 2**: In each of the 4 `match code_lang` blocks in `ingest_one_code_asset`, add `"java"` and `"kotlin"` arms that `bail!()` for now:
```rust
"java" => anyhow::bail!("java ingest not yet wired (p10-1c-jk Task F)"),
"kotlin" => anyhow::bail!("kotlin ingest not yet wired (p10-1c-jk Task I)"),
```
(in each of the 4 blocks before the `other =>` catch-all).
- [ ] **Step 3**: Verify per-crate:
- `cargo test -p kebab-app --lib` → 52 stay green
- `cargo test -p kebab-app --test code_ingest_smoke` → 7 stay green
- `cargo clippy -p kebab-app --all-targets -- -D warnings` clean
- [ ] **Step 4**: Commit:
```bash
git add crates/kebab-app/
git commit -m "refactor(p10-1c-jk): add java + kotlin to ingest dispatch allowlist (bail until Tasks F/I)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task D: `JavaAstExtractor`
**Files:**
- Create: `crates/kebab-parse-code/src/java.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs` (`pub mod java;` + re-exports `JAVA_PARSER_VERSION`, `JavaAstExtractor`)
- Create: `crates/kebab-parse-code/tests/fixtures/sample.java`
Scaffold mirrors `crates/kebab-parse-code/src/go.rs` (1C-Go) — single-language with source-side `package` extraction. Differences:
### Constants
```rust
pub const PARSER_VERSION: &str = "code-java-v1";
pub struct JavaAstExtractor;
// supports: matches!(m, MediaType::Code(l) if l == "java")
// code_lang = Some("java"), SourceType::Note, repo via detect_repo
```
### Package extraction (Java)
tree-sitter-java grammar:
- Root: `program`
- `package_declaration` (top-level child) → contains `scoped_identifier` (dotted) OR `identifier` (single-segment)
```rust
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
    let mut cur = root.walk();
    for child in root.named_children(&mut cur) {
        if child.kind() == "package_declaration" {
            // package_declaration has scoped_identifier OR identifier as first named child
            let mut c2 = child.walk();
            for sub in child.named_children(&mut c2) {
                if sub.kind() == "scoped_identifier" || sub.kind() == "identifier" {
                    return Some(src[sub.start_byte()..sub.end_byte()].to_string());
                }
            }
        }
    }
    None
}
```
(Verify these node kinds against tree-sitter-java's node-types.json and adjust if any of them differ.)
### AST mapping
| node kind | unit | symbol |
|-----------|------|--------|
| `class_declaration` (name field) | 1 + recurse body | `<pkg>.<ClassName>` |
| `interface_declaration` (name) | 1 + recurse body | `<pkg>.<InterfaceName>` |
| `enum_declaration` (name) | 1 | `<pkg>.<EnumName>` |
| `record_declaration` (name, Java 14+) | 1 | `<pkg>.<RecordName>` |
| `annotation_type_declaration` (name) | 1 | `<pkg>.<AnnotationName>` |
| Inside class body: `method_declaration` (name) | 1 | `<pkg>.<Class>.<method>` |
| Inside class body: `constructor_declaration` (name = class name) | 1 | `<pkg>.<Class>.<ClassName>` (matches Java convention) |
| Nested classes recurse with class name pushed onto mod_path | as above | `<pkg>.<Outer>.<Inner>` etc. |
| `import_declaration`, `package_declaration` | glue | `<pkg>.<top-level>` |
| `field_declaration` at top of class | NOT a unit in 1C-JK (would explode unit count for value-only fields) | n/a |
`unit_start` walks `comment` siblings; Java has `@interface` annotations but those are part of `annotation_type_declaration` itself, not separate sibling nodes.
`mod_path` = class nesting (like 1B Python). Empty at file top level.
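The symbol column of the table reduces to joining the package, the enclosing class names (`mod_path`), and the unit name with dots. A minimal sketch of that rule follows; `symbol_path` and `UNKNOWN_PACKAGE` are hypothetical names for illustration, and the `<unknown>` placeholder for a missing package is taken from the task spec rather than from this plan.

```rust
/// Placeholder used when the file has no `package` declaration
/// (assumption: mirrors the `<unknown>` fallback named in the task spec).
const UNKNOWN_PACKAGE: &str = "<unknown>";

/// Builds `<pkg>.<Outer>.<Inner>.<name>` from its parts.
/// `mod_path` holds the enclosing class names, outermost first, and is
/// empty for declarations at file top level.
pub fn symbol_path(package: Option<&str>, mod_path: &[&str], name: &str) -> String {
    let mut parts: Vec<&str> = Vec::with_capacity(mod_path.len() + 2);
    parts.push(package.unwrap_or(UNKNOWN_PACKAGE));
    parts.extend_from_slice(mod_path);
    parts.push(name);
    parts.join(".")
}
```

With this shape a constructor is simply a unit whose `name` equals the last element of `mod_path`, which yields the `<pkg>.<Class>.<ClassName>` form from the table.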
### Fixture `tests/fixtures/sample.java`:
```java
// sample.java
package com.kebab.chunk;

import java.util.List;
import java.util.stream.Collectors;

/**
 * Heading-aware Markdown chunker.
 */
public class MdHeadingV1Chunker {
    private final String name;

    public MdHeadingV1Chunker(String name) {
        this.name = name;
    }

    public List<String> chunkDoc(String input) {
        return List.of(name, input);
    }

    public String getName() {
        return name;
    }

    public static class Builder {
        private String name;
        public Builder withName(String n) { this.name = n; return this; }
        public MdHeadingV1Chunker build() { return new MdHeadingV1Chunker(name); }
    }
}

interface Stringer {
    String asString();
}

enum Mode { DEFAULT, FAST }
```
### Test module (inline `#[cfg(test)] mod tests`)
Mirror 1C-Go shape:
```rust
#[cfg(test)]
mod tests {
    use super::*;
    use kebab_core::{Block, MediaType, SourceSpan};

    fn extract_fixture() -> kebab_core::CanonicalDocument {
        let bytes = std::fs::read(
            concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.java"),
        ).unwrap();
        let asset = crate::rust::tests_support::fixed_code_asset(
            "crates/x/src/sample.java", "java",
        );
        let cfg = kebab_core::ExtractConfig::default();
        let root = std::path::PathBuf::from("/tmp");
        let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
        JavaAstExtractor::new().extract(&ctx, &bytes).unwrap()
    }

    #[test]
    fn extractor_supports_only_media_code_java() { /* ... */ }

    #[test]
    fn java_units_match_design_3_4_symbols() {
        let doc = extract_fixture();
        let mut syms: Vec<String> = doc.blocks.iter().filter_map(|b| match b {
            Block::Code(c) => match &c.common.source_span {
                SourceSpan::Code { symbol, lang, .. } => {
                    assert_eq!(lang.as_deref(), Some("java"));
                    symbol.clone()
                }
                _ => None,
            },
            _ => None,
        }).collect();
        syms.sort();
        // The package prefix comes from the source, not the workspace path: com.kebab.chunk
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"), "got {syms:?}");
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.MdHeadingV1Chunker")); // constructor
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"));
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"));
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder"));
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.withName"));
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.build"));
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.Stringer"));
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.Mode"));
        assert!(syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"));
    }

    #[test]
    fn deterministic_across_runs() {
        let a = extract_fixture();
        for _ in 0..50 { assert_eq!(extract_fixture().blocks, a.blocks); }
    }
}
```
### Wire into lib.rs
```rust
pub mod java;
pub use java::{PARSER_VERSION as JAVA_PARSER_VERSION, JavaAstExtractor};
```
### Verify + commit
- `cargo test -p kebab-parse-code` → all pass
- `cargo clippy -p kebab-parse-code --all-targets -- -D warnings` clean
- commit `feat(p10-1c-jk): tree-sitter-java AST extractor (JavaAstExtractor)`
---
## Task E: `code-java-ast-v1` chunker
Identical pattern to 1C-Go Task E. Duplicate `code_rust_ast_v1.rs` with substitutions:
- `VERSION_LABEL = "code-java-ast-v1"`, struct `CodeJavaAstV1Chunker`
- error message + module doc-comment prose
- Test module: parser_version `"code-java-v1"`, code_lang `"java"`
- Keep cross-chunker `policy_hash_matches_md_heading_v1`
Wire into `crates/kebab-chunk/src/lib.rs` (alphabetical). Verify + commit.
---
## Task F: Activate Java in app dispatch
Replace the `"java"` `bail!()` arms in `ingest_one_code_asset` with real calls (`JavaAstExtractor` + `CodeJavaAstV1Chunker`). Add integration test `java_file_ingests_and_searches_as_code_citation` (mirror 1C-Go test, fixture `pkg_dir/Foo.java` with `package com.foo;` and `public class Foo { public String bar() { ... } }`, assert symbol `com.foo.Foo.bar`).
Verify + commit.
---
## Task G: `KotlinAstExtractor`
**Files:**
- Create: `crates/kebab-parse-code/src/kotlin.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs`
- Create: `crates/kebab-parse-code/tests/fixtures/sample.kt`
Constants: `PARSER_VERSION = "code-kotlin-v1"`, `KotlinAstExtractor`, `code_lang = "kotlin"`.
### Package extraction (Kotlin)
tree-sitter-kotlin grammar:
- Root: `source_file`
- `package_header` (top-level) → contains `identifier` (dotted is single `identifier` node text; verify against node-types.json)
```rust
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
    let mut cur = root.walk();
    for child in root.named_children(&mut cur) {
        if child.kind() == "package_header" {
            let mut c2 = child.walk();
            for sub in child.named_children(&mut c2) {
                if sub.kind() == "identifier" {
                    return Some(src[sub.start_byte()..sub.end_byte()].to_string());
                }
            }
        }
    }
    None
}
```
(Verify against tree-sitter-kotlin's node-types.json — Kotlin grammar varies more than Java's.)
### AST mapping (Kotlin)
| node kind | unit | symbol |
|-----------|------|--------|
| `class_declaration` (name field) — covers `class`, `data class`, `sealed class`, `enum class`, `interface` (Kotlin's interface is a class_declaration variant) | 1 + recurse body | `<pkg>.<ClassName>` |
| `object_declaration` (name) — singleton | 1 + recurse | `<pkg>.<ObjectName>` |
| `function_declaration` (name) | 1 | `<pkg>.<fn_name>` (top-level) or `<pkg>.<Class>.<method>` (inside class) |
| Inside class body: `function_declaration` → method | 1 | `<pkg>.<Class>.<method>` |
| `property_declaration` at top-level (`val` / `var`) | glue | `<pkg>.<top-level>` (top-level properties are common in Kotlin, so keep them as glue rather than units) |
| `import_header`, `package_header` | glue | `<pkg>.<top-level>` |
(Detect class vs. interface via the modifier list. For the first pass of 1C, handle both in the `class_declaration` arm, since the symbol differs only by name. If tree-sitter-kotlin exposes the `interface` keyword through the modifier list and special handling turns out to be needed, record it in HOTFIXES.)
### Fixture `sample.kt`:
```kotlin
// sample.kt
package com.kebab.chunk

import java.util.List

/**
 * Heading-aware Markdown chunker.
 */
class MdHeadingV1Chunker(val name: String) {
    fun chunkDoc(input: String): List<String> = listOf(name, input)
    fun getName(): String = name

    companion object {
        fun withName(n: String): MdHeadingV1Chunker = MdHeadingV1Chunker(n)
    }
}

interface Stringer {
    fun asString(): String
}

enum class Mode { DEFAULT, FAST }

fun freeFunction(x: Int): Int = x + 1

object Singleton {
    fun ping(): String = "pong"
}
```
### Test module — assert symbols
```rust
// Symbols the test should assert. `EXPECTED_SYMBOLS` is an illustrative
// name; the plan does not mandate how the list is stored.
const EXPECTED_SYMBOLS: &[&str] = &[
    "com.kebab.chunk.MdHeadingV1Chunker",
    "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc",
    "com.kebab.chunk.MdHeadingV1Chunker.getName",
    "com.kebab.chunk.MdHeadingV1Chunker.Companion", // companion object (verify name)
    "com.kebab.chunk.MdHeadingV1Chunker.Companion.withName", // method on companion
    "com.kebab.chunk.Stringer",
    "com.kebab.chunk.Mode",
    "com.kebab.chunk.freeFunction", // top-level fn (Kotlin-specific!)
    "com.kebab.chunk.Singleton",
    "com.kebab.chunk.Singleton.ping",
    "com.kebab.chunk.<top-level>", // glue: package/import headers (and top-level properties, if any)
];
(Companion object: tree-sitter-kotlin may use `companion_object` or `object_declaration` with `companion` modifier — verify and adjust the symbol if `Companion` isn't the right name.)
### Wire into lib.rs
```rust
pub mod kotlin;
pub use kotlin::{PARSER_VERSION as KOTLIN_PARSER_VERSION, KotlinAstExtractor};
```
Verify + commit.
---
## Task H: `code-kotlin-ast-v1` chunker
Same pattern as Task E. Substitute kotlin labels. Verify + commit.
---
## Task I: Activate Kotlin in app dispatch
Replace `"kotlin"` bail arms with real calls. Add integration test `kotlin_file_ingests_and_searches_as_code_citation`. Verify + commit.
---
## Task J: Snapshots + full-suite + SMOKE
- Create 2 snapshot tests (`code_java_ast_snapshot.rs`, `code_kotlin_ast_snapshot.rs`) + baselines. Mirror 1C-Go Task G snapshot test.
- ONE workspace test + clippy invocation.
- Manual SMOKE: write a `.java` and `.kt` file in TempDir, ingest, search.
Verify + commit (snapshot only).
---
## Task K: Docs + version bump
- README + HANDOFF + ARCHITECTURE + SMOKE + 2 INDEX updates + design §10.1.
- `Cargo.toml` version `0.12.0 → 0.13.0` (minor, surface expansion).
Commit `docs(p10-1c-jk): ... + chore: bump 0.12.0 → 0.13.0`.
---
## Finalize
`gitea-pr` → review loop → merge → main pull → branch cleanup → `cargo clean` → `gitea-release v0.13.0`.
---
## Self-Review (filled by plan author)
- **Spec coverage**: design §1C Java + Kotlin → Tasks D-I; §3.4 symbol path → extractor (Java D, Kotlin G); §6.1/§6.2 module structure → Tasks D/E/G/H; §6.3 dep graph → Task A; §9.1 Tier-1 + oversize fallback → chunkers E/H.
- **No placeholders**: novel logic (Java `extract_package`, Kotlin `extract_package`, AST walk arm tables) given concretely. Chunkers (E, H) are explicit "duplicate code_rust_ast_v1.rs with substitution X/Y/Z".
- **Type consistency**: `JavaAstExtractor` / `JAVA_PARSER_VERSION` / `CodeJavaAstV1Chunker` + `KotlinAstExtractor` / `KOTLIN_PARSER_VERSION` / `CodeKotlinAstV1Chunker` used consistently. `MediaType::Code("java")` / `("kotlin")` in routing + dispatch.
- **Kotlin grammar risk**: noted — tree-sitter-kotlin's exact node kinds (`class_declaration` vs `object_declaration`, `companion_object` vs companion modifier, `package_header` vs `package_directive`) should be verified against the resolved crate's node-types.json. Pin contract via test fixture; HOTFIXES any deviation found during implementation.


@@ -1545,6 +1545,14 @@ transitional 형태) 의 source of truth.
**p10-1B activated (Python / TypeScript / JavaScript) (2026-05-20)**: AST chunkers activated for Python (`code-python-ast-v1`, `.py`), TypeScript (`code-ts-ast-v1`, `.ts`/`.tsx`), and JavaScript (`code-js-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx`). The symbol path uses the workspace path as a module-path prefix: Python = dotted (e.g. `kebab_eval.metrics.compute_mrr`), TypeScript/JavaScript = slash-style (e.g. `src/Foo.Foo.search`). The inconsistency with Rust 1A-2's file-scope-only symbols is accepted (HOTFIXES 2026-05-20). Expression-level functions (`const foo = () => {}`) are treated as glue (HOTFIXES 2026-05-20).
**p10-1C-Go activated (Go) (2026-05-20)**: Go (`code-go-ast-v1`, `.go`) AST chunker activated. Symbol format: `<package>.<Func>` / `<package>.(*Receiver).<Method>`.
**p10-1C-JavaKotlin activated (Java + Kotlin) (2026-05-20)**: Java (`code-java-ast-v1`, `.java`) + Kotlin (`code-kotlin-ast-v1`, `.kt`/`.kts`) AST chunkers activated. Symbol format: `com.foo.Foo.bar` (package + class + method/field). The Kotlin grammar uses `tree-sitter-kotlin-ng` (the bare `tree-sitter-kotlin` crate is stuck on tree-sitter 0.21 to 0.23 and cannot be used).
**p10-2 activated (Tier 2 chunker) (2026-05-20)**: three Tier 2 resource-aware chunkers activated: k8s-manifest-resource-v1 (`.yaml`/`.yml`), dockerfile-file-v1 (`Dockerfile`), manifest-file-v1 (config files such as `Cargo.toml`). Additional code_lang mappings: XML (`.xml`, `pom.xml`), Groovy (`build.gradle`, `.gradle`), Go module (`go.mod`).
**p10-3 activated (Tier 3 paragraph fallback) (2026-05-21)**: Tier 3 chunker `code-text-paragraph-v1` activated. Shell scripts (`.sh`/`.bash`/`.zsh`) are routed to it directly, and it is retried as an automatic fallback when Tier 1/2 returns 0 chunks or Err. Non-k8s YAML, invalid YAML, and AST failures are all picked up. lang preserves the input (shell → "shell", yaml → "yaml", etc.); symbol is always None.
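The Tier 1/2 → Tier 3 fallback rule in the p10-3 entry can be sketched as a small wrapper. This is an illustration under stated assumptions, not the project's code: `Chunk`, `chunk_with_fallback`, and the `String` error type are hypothetical, and the real wrapper presumably carries richer context.

```rust
/// Hypothetical minimal chunk record for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub text: String,
    pub chunker_version: &'static str,
}

/// Runs the primary (Tier 1/2) chunker and retries with the Tier 3
/// paragraph chunker when the primary yields zero chunks or an error.
pub fn chunk_with_fallback<P, F>(primary: P, fallback: F) -> Vec<Chunk>
where
    P: FnOnce() -> Result<Vec<Chunk>, String>,
    F: FnOnce() -> Vec<Chunk>,
{
    match primary() {
        // A non-empty success is kept as-is.
        Ok(chunks) if !chunks.is_empty() => chunks,
        // Ok(empty) and Err(_) both fall through to Tier 3.
        _ => fallback(),
    }
}
```

Treating `Ok(vec![])` the same as `Err` is what lets non-k8s YAML, which parses fine but produces no resource chunks, still get indexed.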
### 10.2 MCP server transport (fb-30)
`kebab mcp` is a stdio JSON-RPC server. Rust SDK = `rmcp 1.6`. Tool surface


@@ -237,6 +237,9 @@ pub struct Metadata {
- Dockerfile (`Dockerfile`, `*.dockerfile`) → `dockerfile`
- TOML (`.toml`) → `toml`
- JSON (`.json`) → `json`
- XML (`.xml`, `pom.xml`) → `xml`
- Groovy (`build.gradle`, `.gradle`) → `groovy`
- Go module (`go.mod`) → `go-mod`
- Shell (`.sh`, `.bash`, `.zsh`) → `shell`
- Make (`Makefile`, `*.mk`) → `make`
- Unsupported / Tier 3 fallback → null


@@ -142,10 +142,11 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
- [p10-1A-1 code ingest framework](p10/p10-1a-1-code-ingest-framework.md) — ✅ merged
- [p10-1A-2 Rust AST chunker](p10/p10-1a-2-rust-ast-chunker.md) — ✅ merged
- [p10-1B Python + TS/JS AST chunkers](p10/p10-1b-py-ts-js-ast-chunkers.md) — 🟡 PR open (code complete, awaiting merge)
- p10-1C Go + Java + Kotlin AST chunkers
- p10-1C-Go Go AST chunker — 🟡 PR open (v0.12.0, `code-go-ast-v1`)
- p10-1C-JavaKotlin Java + Kotlin AST chunkers — 🟢 PR open (v0.13.0, `code-java-ast-v1` / `code-kotlin-ast-v1`)
- p10-1D C + C++ AST chunkers — ⏳
- p10-2 Tier 2 resource-aware —
- p10-3 Tier 3 paragraph + line-window fallback —
- p10-2 Tier 2 resource-aware — ✅ merged (v0.14.0, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1`)
- p10-3 Tier 3 paragraph + line-window fallback — ✅ merged (v0.15.0, `code-text-paragraph-v1`)
## Post-merge hotfixes


@@ -5,9 +5,10 @@
| 1A-1 | code ingest framework (wire schema, parse-code crate skeleton, filter flags, skip policy, config section) | ✅ merged |
| 1A-2 | Rust AST chunker | ✅ merged |
| 1B | Python + TS/JS AST chunkers | 🟡 PR open (code complete, awaiting merge) |
| 1C | Go + Java + Kotlin AST chunkers | ⏳ |
| 1C-Go | Go AST chunker (`code-go-ast-v1`) | 🟡 PR open (v0.12.0) |
| 1C-JavaKotlin | Java + Kotlin AST chunkers (`code-java-ast-v1` / `code-kotlin-ast-v1`) | 🟢 PR open (v0.13.0) |
| 1D | C + C++ AST chunkers | ⏳ |
| 2 | Tier 2 resource-aware (k8s / Dockerfile / manifest) | |
| 3 | Tier 3 paragraph + line-window fallback | |
| 2 | Tier 2 resource-aware (k8s / Dockerfile / manifest) | ✅ merged (v0.14.0) |
| 3 | Tier 3 paragraph + line-window fallback | ✅ merged (v0.15.0) |
Design: [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md)


@@ -0,0 +1,54 @@
# p10-1C-Go — Go AST chunker
**Status:** 🟡 In progress
**Contract sections:** §3.3 (chunker_version `code-go-ast-v1`), §3.4 (symbol path: Go `package.Func` / `package.(*Receiver).Method`), §3.5 (code_lang `go`, ext `.go`), §6.1 (`kebab-parse-code/src/go.rs`), §6.2 (`kebab-chunk/src/code_go_ast_v1.rs`), §9.1 (Tier 1 AST per-language + oversize fallback).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1C (Go portion; Java + Kotlin follow in a later PR).
**Plan:** [2026-05-20-p10-1c-go-ast-chunker.md](../../docs/superpowers/plans/2026-05-20-p10-1c-go-ast-chunker.md).
## Goal
Activate the Go AST chunker on top of the 1A-2 / 1B infrastructure. By user decision, the three 1C languages (Go + Java + Kotlin) are split into 2 PRs: Go differs from Java/Kotlin (the JVM family) in its method receivers and package conventions, so it gets its own PR. Go projects can be dogfooded from the moment this PR merges.
## Frozen design decisions (finalized by this task)
- **The package prefix of the symbol path is extracted from the source's `package` declaration** (exactly as in design §3.4). This differs from 1B's workspace-path conversion: Go has a `package` declaration in the language itself, so that is the canonical source. It is extracted from the `package_clause`, the first named child of tree-sitter-go's `source_file` root. If it is missing (invalid Go in theory, rare in practice), use `<unknown>` or the fallback `<package>` (similar to the 1A `<module>` pattern).
- **Method receiver notation** (as in the design's example): `package.(*Receiver).Method` (pointer receiver), `package.(Receiver).Method` (value receiver). The type and whether it is a pointer are extracted from the `receiver` field of tree-sitter-go's `method_declaration`. Example: `func (m *MdHeadingV1Chunker) ChunkDoc(...)` → symbol `chunk.(*MdHeadingV1Chunker).ChunkDoc`.
- **Top-level unit kinds**:
  - `function_declaration` → 1 unit, symbol `package.Func`
  - `method_declaration` → 1 unit, symbol `package.(*Receiver).Method` / `package.(Receiver).Method`
  - `type_declaration` (struct / interface / type alias) → 1 unit each, symbol `package.TypeName`
  - `const_declaration`, `var_declaration`, `import_declaration` (block or single) → glue, grouped → `package.<top-level>` (1A/1B pattern)
- **Go generics**: type parameters in `func Foo[T any](...)` or `type Foo[T any] struct{}` are not included in the symbol (Go itself usually omits them from symbols). Just `package.Foo`.
- **Test detection**: Go's `func TestXxx(t *testing.T)` is *emitted as an ordinary fn*. Ranking effects such as a test-detection boost/penalty are out of scope for this task (per the deferred ranking-brainstorm memory).
- The frozen design itself is unchanged (the Go row of §3.4 already matches this decision). Add one line to §10.1 for the 1C-Go activation.
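The receiver notation above can be illustrated with two small formatting helpers. This is a sketch only: `go_func_symbol` and `go_method_symbol` are hypothetical names, and in the real extractor the `is_pointer` flag would come from inspecting the receiver type node rather than being passed in.

```rust
/// Symbol for a top-level Go function: `package.Func`.
pub fn go_func_symbol(package: &str, func: &str) -> String {
    format!("{package}.{func}")
}

/// Symbol for a Go method: `package.(*Receiver).Method` for a pointer
/// receiver, `package.(Receiver).Method` for a value receiver.
/// Type parameters are deliberately not part of any argument here, which
/// matches the decision to leave generics out of the symbol.
pub fn go_method_symbol(package: &str, receiver: &str, is_pointer: bool, method: &str) -> String {
    let star = if is_pointer { "*" } else { "" };
    format!("{package}.({star}{receiver}).{method}")
}
```

Keeping the pointer marker inside the parentheses follows Go's own method-expression syntax, so the symbol reads the way a Go developer would write it.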
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` passes (memory-conscious: mostly per-crate runs, with the full-suite gate run once right before the docs task).
- `cargo clippy --workspace --all-targets -- -D warnings` passes.
- Ingesting the Go fixture (`tests/fixtures/sample.go`) yields a stable chunk snapshot, and the `Citation::Code` symbol matches the §3.4 convention (`pkg.Func` / `pkg.(*Receiver).Method`).
- With a Go file placed in an isolated TempDir KB, `kebab search --code-lang go --json` returns `Citation::Code { lang: "go", symbol: "...", ... }`.
- `kebab schema --json | jq .stats.code_lang_breakdown` includes a `"go"` count.
- README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- One line added to frozen design §10.1.
- workspace `Cargo.toml` minor bump (0.11.1 → 0.12.0).
## Allowed dependencies
- Add `tree-sitter-go` to `kebab-parse-code` (workspace deps). Existing deps unchanged.
- New `kebab-chunk` module `code_go_ast_v1.rs`: kebab-core + serde_json_canonicalizer + blake3 + anyhow + tracing. It must never import tree-sitter.
- `kebab-app` and `kebab-source-fs` change, with no new crate deps.
## Forbidden dependencies
- `kebab-chunk` must not import `tree-sitter-go` directly.
- UI crates must not import `kebab-parse-code` directly.
- `kebab-parse-code` must not import store / embed / llm / rag directly.
## Risks / notes
- Whether tree-sitter-go's `package_clause` node is the first named child of the root may vary by grammar version; it is safer for the extractor to iterate over all named children of `source_file` and take the first `package_clause`.
- Pointer receivers in `method_declaration`: in the tree-sitter-go AST, a receiver type that is a `pointer_type` node means `*Receiver`, and a plain `type_identifier` means `Receiver`. The text must be extracted exactly.
- Generic type parameters (`[T any]`) are a separate child from the name field of method_declaration / function_declaration, so extracting only the name excludes the generic part automatically.
- This is a *different* model from the 1B Python/TS/JS pattern (helpers from lang.rs): this task's mod_prefix is extracted from the source-side AST, so no helper fn is needed.
- Post-merge deviations go into `tasks/HOTFIXES.md` as a dated log entry, plus a one-line cross-link under this spec's `Risks / notes`.


@@ -0,0 +1,69 @@
# p10-1C-JavaKotlin — Java + Kotlin AST chunkers
**Status:** 🟡 In progress
**Contract sections:** §3.3 (chunker_version `code-java-ast-v1` + `code-kotlin-ast-v1`), §3.4 (symbol path: Java/Kotlin `package.Class.method`), §3.5 (code_lang `java` + `kotlin`, ext `.java` / `.kt` / `.kts`), §6.1 (`kebab-parse-code/src/{java,kotlin}.rs`), §6.2 (`kebab-chunk/src/code_{java,kotlin}_ast_v1.rs`), §9.1 (Tier 1 AST per-language + oversize fallback).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1C (Java + Kotlin portion; Go was completed separately in PR #151 / v0.12.0).
**Plan:** [2026-05-20-p10-1c-jk-ast-chunker.md](../../docs/superpowers/plans/2026-05-20-p10-1c-jk-ast-chunker.md).
## Goal
Sibling PR of 1C-Go (PR #151 / v0.12.0). It bundles the JVM family (Java + Kotlin) of the same 1C phase. `.java` / `.kt` / `.kts` files can be dogfooded from the moment it merges.
## Frozen design decisions (finalized by this task)
- **The symbol prefix is extracted from the source's `package` declaration** (exactly as in design §3.4, same as the 1C-Go model). This differs from 1B's workspace-path conversion.
  - **Java**: tree-sitter-java's `package_declaration` → the text of the `scoped_identifier` or `identifier` inside it (e.g. `com.kebab.chunk`). If absent, `<unknown>`.
  - **Kotlin**: tree-sitter-kotlin's `package_header` → the `identifier` text. If absent (default package), `<unknown>`.
- **Symbol format** (design §3.4): `package.Class.method`. Example: `com.kebab.chunk.MdHeadingV1Chunker.chunkDoc`.
- **Java AST mapping**:
  - `class_declaration` (name) → 1 unit + recurse body
  - `interface_declaration` (name) → 1 unit + recurse
  - `enum_declaration` (name) → 1 unit
  - `record_declaration` (Java 14+) (name) → 1 unit
  - `annotation_type_declaration` → 1 unit
  - Inside class/interface/enum: `method_declaration` (name) → unit `package.Class.method` (class nesting like 1B Python)
  - `import_declaration`, `package_declaration` themselves → glue `<top-level>`
  - No top-level fn (Java has none)
- **Kotlin AST mapping**:
  - `class_declaration` (name) → 1 unit + recurse class_body. `data class` / `sealed class` / `enum class` use the same node.
  - `object_declaration` (name) → 1 unit + recurse class_body (singleton)
  - `function_declaration` (name), **possible at top level** → unit `package.fnName`. Inside a class, `package.Class.method`.
  - `property_declaration` at top-level → glue
  - `interface` (in tree-sitter-kotlin usually a `class_declaration` with an `interface` modifier, or a separate node) → 1 unit
  - `import_header`, `package_header` themselves → glue `<top-level>`
- **Glue grouping**: same as the 1B Python / 1C-Go pattern. Imports and everything else go into a single `<top-level>` (or `<module>` post-pass if the file has zero real units).
- **Tree-sitter Kotlin crate choice**: use the best-maintained tree-sitter-kotlin crate (`tree-sitter-kotlin` or a fork). Check for an active maintainer when resolving.
- The frozen design itself is unchanged; add one line to §10.1 for the 1C-JK activation.
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` passes.
- `cargo clippy --workspace --all-targets -- -D warnings` passes.
- Ingesting each Java/Kotlin fixture (`tests/fixtures/sample.java`, `tests/fixtures/sample.kt`) yields a stable chunk snapshot, and symbols match the §3.4 convention.
- With `.java` / `.kt` files placed in an isolated TempDir KB, `kebab search --code-lang java --json` / `--code-lang kotlin --json` return `Citation::Code`.
- `kebab schema --json | jq .stats.code_lang_breakdown` includes `"java"` + `"kotlin"` counts.
- README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- One line added to frozen design §10.1.
- workspace `Cargo.toml` minor bump (0.12.0 → 0.13.0).
## Allowed dependencies
- Add `tree-sitter-java` + `tree-sitter-kotlin` to `kebab-parse-code`. Existing deps unchanged.
- Two new `kebab-chunk` modules (`code_java_ast_v1.rs`, `code_kotlin_ast_v1.rs`) with a language-agnostic body. They must not import tree-sitter.
- `kebab-app`, `kebab-source-fs`: no new crate deps.
## Forbidden dependencies
- `kebab-chunk` must not import tree-sitter-java / tree-sitter-kotlin (boundary §6.3).
- UI crates must not import `kebab-parse-code` directly.
- `kebab-parse-code` must not import store / embed / llm / rag directly.
## Risks / notes
- tree-sitter-kotlin: the official or most actively maintained crate (`tree-sitter-kotlin` or a fork) must be chosen. Check its metadata when resolving.
- The Kotlin grammar may be updated less often than other tree-sitter-* grammars, so grammar field names may shift; pin the contract with the test fixture.
- Java record (Java 14+): expected to be the `record_declaration` node in tree-sitter-java (needs verification).
- Variant nodes such as Kotlin sealed class / data class / object declaration: the exact node kind names in tree-sitter-kotlin need verification (grammar.json / node-types.json).
- Inner classes inside a Java class: handled the same way as the Python pattern (recursion with the class name pushed).
- Kotlin top-level fn is a hybrid of 1B Python's top-level fn pattern and 1C-Go's package-prefix pattern: `package.fnName`.
- Post-merge deviations go into `tasks/HOTFIXES.md` as a dated log entry, plus a cross-link under this spec's `Risks / notes`.


@@ -0,0 +1,120 @@
# p10-2 — Tier 2 resource-aware chunkers (k8s + Dockerfile + manifest)
**Status:** 🟡 In progress
**Contract sections:** §3.3 (chunker_version `k8s-manifest-resource-v1` + `dockerfile-file-v1` + `manifest-file-v1`), §3.4 (citation symbol: `<kind>/<namespace>/<name>` / `<dockerfile>` / `<manifest>`), §3.5 (additional code_lang mappings `xml` / `groovy` / `go-mod`), §6.1 (`kebab-parse-code/src/lang.rs` update + cleanup of the inline duplication in `kebab-source-fs/src/media.rs`), §6.2 (`kebab-chunk/src/{k8s_manifest_resource_v1,dockerfile_file_v1,manifest_file_v1}.rs`), §9.2 (Tier 2 definition), §10.1 (one line in the deactivation log).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1.2 (Phase 2) + §9.2.
**Plan:** [2026-05-20-p10-2-tier2-resource-aware.md](../../docs/superpowers/plans/2026-05-20-p10-2-tier2-resource-aware.md).
## Goal
Activate the three Tier 2 resource-aware chunkers in a single PR on top of the p10-1A-2 / 1B / 1C infrastructure. This is file/document-level chunking rather than AST-based chunking, following the bundling pattern of 1B (Python+TS+JS). `.yaml` / `.yml` / `Dockerfile` / the 7 manifest types can be dogfooded from the moment it merges.
Non-k8s YAML (Helm values, CI yml, docker-compose, etc.) and invalid YAML are skipped in this phase; they are wired up automatically once p10-3's paragraph fallback merges.
## Frozen design decisions (finalized by this task)
### Common
- **The 3 chunkers are self-contained**. No Tier 2 extractor module is added to `kebab-parse-code`; only `code_lang_for_path` in lang.rs is updated. Because nothing here is AST-based, the cost of an extractor abstraction would outweigh what the code gains from it.
- **`code_lang_for_path` is the single source of truth** (design §3.5). The inline extension match in `kebab-source-fs/src/media.rs` is replaced by a call to this function (a small refactor that cleans up duplication accumulated since 1A-1).
- **parser_version** is uniformly `"none-v1"`, a sentinel stating that Tier 2 has no parse stage. Only the chunker_version cascade is meaningful.
- **oversize fallback** follows the same policy as the AST chunkers (line-window split once `AST_CHUNK_MAX_LINES = 200` is exceeded). This covers huge ConfigMaps / multi-stage Dockerfiles / aggregate POMs. Split chunks share the same symbol (only the line range differs).
- **Frozen design updates** (within this PR):
  - Add 3 rows to the §3.5 `code_lang` mapping table:
    - XML (`.xml`, `pom.xml`) → `xml`
    - Groovy (`build.gradle`, `.gradle`) → `groovy`
    - Go module (`go.mod`) → `go-mod`
  - Add one line to the §10.1 deactivation log: "p10-2 activated: 3 Tier 2 chunkers active."
### k8s-manifest-resource-v1
- **Trigger**: `MediaType::Code("yaml")` (= `.yaml` / `.yml`).
- **k8s detection**: a YAML document is accepted only if its top-level mapping has both `apiVersion: <string>` and `kind: <string>`. If either is missing or is not a string, that document is skipped (not the whole file; other documents are processed normally).
- **Multi-document split implementation**: the multi-document iterator of `serde_yaml::Deserializer::from_str` does not provide line offsets, so the original text is pre-split on lines matching the regex `^---\s*$` and each slice is deserialized separately. line_start/line_end are tracked during the pre-split. The empty slice after a trailing `---` is skipped.
- **Symbol**: `<kind>/<metadata.namespace>/<metadata.name>` (when a namespace is present), or `<kind>/<metadata.name>` (cluster-scoped), or `<kind>/<unnamed>` (name missing). Examples: `Deployment/prod/api-server`, `ClusterRole/cluster-admin`, `ConfigMap/<unnamed>`.
- **Chunk text**: the original text of the pre-split slice, verbatim (not the deserialized form; the original is preserved).
- **Citation**: `Citation::Code { path, line_start, line_end, symbol: Some(<above>), lang: Some("yaml") }`.
- **Failure modes**:
  - Invalid YAML (any document fails to deserialize) → the whole file emits 0 chunks + warning log `invalid yaml: {path}`. Picked up by the p10-3 paragraph fallback.
  - 0 accepted documents (all non-k8s) → the whole file emits 0 chunks. Same fallback.
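The pre-split and symbol rules above can be sketched roughly as follows. This is a minimal std-only illustration, not the actual `k8s_manifest_resource_v1` code: `YamlSlice`, `pre_split`, and `k8s_symbol` are hypothetical names, the deserialize step is left out, and dropping the namespace when the name is missing is an assumption read from the `<kind>/<unnamed>` example.

```rust
/// One pre-split YAML document slice with its 1-based, inclusive line range.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
struct YamlSlice {
    line_start: usize,
    line_end: usize,
    text: String,
}

/// True when the line matches `^---\s*$` (so `--- # comment` is NOT a separator).
fn is_separator(line: &str) -> bool {
    line.strip_prefix("---")
        .is_some_and(|rest| rest.trim().is_empty())
}

/// Pushes the buffered lines as one slice, skipping empty / whitespace-only
/// slices (e.g. the one after a trailing `---`).
fn flush(slices: &mut Vec<YamlSlice>, current: &mut Vec<&str>, start: usize) {
    if current.iter().any(|l| !l.trim().is_empty()) {
        slices.push(YamlSlice {
            line_start: start,
            line_end: start + current.len() - 1,
            text: current.join("\n"),
        });
    }
    current.clear();
}

/// Splits the original text on separator lines while tracking line numbers.
/// Separator lines belong to no slice.
fn pre_split(source: &str) -> Vec<YamlSlice> {
    let mut slices = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut start = 1usize;
    for (idx, line) in source.lines().enumerate() {
        if is_separator(line) {
            flush(&mut slices, &mut current, start);
            start = idx + 2; // the next slice begins on the line after the separator
        } else {
            current.push(line);
        }
    }
    flush(&mut slices, &mut current, start);
    slices
}

/// Builds the citation symbol from already-extracted metadata fields.
fn k8s_symbol(kind: &str, namespace: Option<&str>, name: Option<&str>) -> String {
    match (namespace, name) {
        (Some(ns), Some(n)) => format!("{kind}/{ns}/{n}"),
        (None, Some(n)) => format!("{kind}/{n}"),
        // Assumption: a missing name yields `<kind>/<unnamed>` regardless of namespace.
        (_, None) => format!("{kind}/<unnamed>"),
    }
}
```

In a real chunker each slice's `text` would then be deserialized to check `apiVersion` / `kind`; the sketch only shows how the line ranges survive the split.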
### dockerfile-file-v1
- **Trigger**: `MediaType::Code("dockerfile")`: the filename is exactly `Dockerfile`, or has the prefix `Dockerfile.` (e.g. `Dockerfile.dev`), or has the extension `.dockerfile` (e.g. `myapp.dockerfile`).
- **Algorithm**: whole file text → 1 chunk emitted.
- **Symbol**: always `<dockerfile>`.
- **Citation**: `Citation::Code { path, line_start: 1, line_end: <EOF>, symbol: Some("<dockerfile>"), lang: Some("dockerfile") }`.
### manifest-file-v1
- **Trigger**: the filename is one of the 7 listed in design §9.2:
| basename | code_lang |
|----------------|-----------|
| `Cargo.toml` | `toml` |
| `pyproject.toml` | `toml` |
| `package.json` | `json` |
| `tsconfig.json`| `json` |
| `go.mod` | `go-mod` |
| `pom.xml` | `xml` |
| `build.gradle` | `groovy` |
- **Excluded**: `build.gradle.kts` is picked up by the 1C-JK Kotlin AST chunker (code-kotlin-ast-v1), so it is not a target of this chunker.
- **Algorithm**: whole file text → 1 chunk emitted.
- **Symbol**: always `<manifest>` (all 7 types). The manifest type is distinguished by `code_lang`. For example, `--code-lang go-mod` matches only go.mod, and `--code-lang toml` matches Cargo.toml + pyproject.toml.
- **Citation**: `Citation::Code { path, line_start: 1, line_end: <EOF>, symbol: Some("<manifest>"), lang: Some(<mapping above>) }`.
### Routing (kebab-app::ingest_one_code_asset)
Tier 2 arms are added next to the existing 7-arm AST match:
```text
"rust" | "python" | "typescript" | "javascript"
| "go" | "java" | "kotlin" → existing AST chunkers (1A-2 / 1B / 1C)
"yaml" → k8s_manifest_resource_v1
"dockerfile" → dockerfile_file_v1
"toml" | "json" | "xml"
| "groovy" | "go-mod" → manifest_file_v1
_ → skip (placeholder for the p10-3 fallback)
```
Lookup order in `code_lang_for_path`: basename match first (`Cargo.toml` / `Dockerfile.*` / etc.), then extension fallback (`.yaml` / `.toml` / etc.).
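As an illustration of the basename-first, extension-fallback order, here is a minimal sketch. The function name `code_lang_for_path_sketch` and its signature are hypothetical (the real `code_lang_for_path` in `lang.rs` may look different), and it covers only the Tier 2 and shell mappings named in these specs; the AST languages are omitted.

```rust
use std::path::Path;

/// Hypothetical sketch of the lookup order: exact basenames and the
/// `Dockerfile.` prefix win over the file extension.
fn code_lang_for_path_sketch(path: &Path) -> Option<&'static str> {
    let basename = path.file_name()?.to_str()?;

    // 1. Basename match first.
    match basename {
        "Cargo.toml" | "pyproject.toml" => return Some("toml"),
        "package.json" | "tsconfig.json" => return Some("json"),
        "go.mod" => return Some("go-mod"),
        "pom.xml" => return Some("xml"),
        "build.gradle" => return Some("groovy"),
        "Dockerfile" => return Some("dockerfile"),
        _ => {}
    }
    if basename.starts_with("Dockerfile.") {
        return Some("dockerfile");
    }

    // 2. Extension fallback.
    match path.extension()?.to_str()? {
        "yaml" | "yml" => Some("yaml"),
        "dockerfile" => Some("dockerfile"),
        "toml" => Some("toml"),
        "xml" => Some("xml"),
        "gradle" => Some("groovy"),
        "sh" | "bash" | "zsh" => Some("shell"),
        // Everything else (including the AST languages left out here) is unmapped.
        _ => None,
    }
}
```

Matching the basename first is what lets `go.mod` resolve to `go-mod` even though its extension is `mod`, and what lets `Dockerfile.dev` resolve without a `dev` extension rule.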
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` passes (memory-conscious: mostly per-crate runs, with the full-suite gate run once right before the docs task).
- `cargo clippy --workspace --all-targets -- -D warnings` passes.
- Snapshot tests for each chunker are stable:
  - `crates/kebab-chunk/tests/fixtures/sample.yaml`: 2 k8s docs (Deployment + Service) + 1 non-k8s doc (apiVersion missing) → 2 chunks emitted, non-k8s doc skipped.
  - `crates/kebab-chunk/tests/fixtures/sample.dockerfile` → 1 chunk, symbol `<dockerfile>`.
  - `crates/kebab-chunk/tests/fixtures/sample.Cargo.toml` + `sample.package.json` + `sample.pom.xml` + `sample.go.mod` (4 types) → 1 chunk each, symbol `<manifest>`, with the mapped code_lang.
- Unit tests for basename-first matching + extension fallback in `code_lang_for_path`.
- With yaml + Dockerfile + Cargo.toml placed in an isolated TempDir KB, `kebab search --code-lang yaml --json` / `--code-lang dockerfile --json` / `--code-lang toml --json` each return `Citation::Code` (3 tests added to the existing `code_ingest_smoke.rs`, 12 tests in total).
- `kebab schema --json | jq .stats.code_lang_breakdown` shows counts for `yaml` / `dockerfile` / `toml` / `json` / `xml` / `groovy` / `go-mod` (only the ones in use appear).
- README + HANDOFF + docs/ARCHITECTURE + docs/SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- Frozen design: 3 mapping lines in §3.5 + one activation line in §10.1.
- workspace `Cargo.toml` minor bump (0.13.0 → 0.14.0), gitea-release v0.14.0.
## Allowed dependencies
- `kebab-chunk`: 3 new modules (`k8s_manifest_resource_v1.rs` / `dockerfile_file_v1.rs` / `manifest_file_v1.rs`) and the dep entry `serde_yaml = { workspace = true }` (already present in the workspace). Existing deps (kebab-core / serde_json_canonicalizer / blake3 / anyhow / tracing) are kept.
- `kebab-parse-code`: `lang.rs` update only. No extractor module added, no new crate dep.
- `kebab-source-fs/src/media.rs`: the inline match is replaced by a call to `code_lang_for_path`. Existing deps are kept (it already depends on kebab-parse-code).
- `kebab-app::ingest_one_code_asset`: match arms extended. No new crate dep.
## Forbidden dependencies
- `kebab-chunk` must not import store / embed / llm / rag / tree-sitter directly (boundary §6.3 is kept).
- `kebab-parse-code` must not import store / embed / llm / rag directly.
- UI crates (`kebab-cli` / `kebab-mcp` / `kebab-tui` / `kebab-desktop`) must not import `kebab-parse-code` / `kebab-chunk` directly; they go through the `kebab-app` facade only.
## Risks / notes
- **serde_yaml has no line offsets** → lines are tracked by splitting the original text on the regex `^---\s*$`. The empty slice after a trailing `---`, a first slice with no `---` prefix, and non-standard separators (e.g. `--- # comment`) are all verified with fixtures.
- **apiVersion / kind not a string** (e.g. `kind: 42`): accepted only after a string check via `serde_yaml::Value::as_str()`. Non-string values are treated as non-k8s.
- **cluster-scoped resources** (Namespace, ClusterRole, ClusterRoleBinding, …): a missing metadata.namespace is normal. symbol takes the form `<kind>/<name>`.
- **metadata.name missing**: abnormal, but must not panic. Falls back to `<kind>/<unnamed>` with a warning log.
- **Huge ConfigMaps / Helm-rendered manifests**: the `AST_CHUNK_MAX_LINES = 200` oversize fallback applies. Split chunks share the same symbol, so at search time they are either deduped or shown as two user-visible hits (same behavior as the 1A-2 oversize case).
- **YAML anchors / merge keys (`&`, `<<`, `*`)**: serde_yaml resolves them automatically. Under the preserve-original policy, chunk text stays as the original (pre-resolve) text, while parsing uses the resolved values.
- **doc-purpose files such as `Dockerfile.example`**: caught by the extension/prefix match. This may not match user intent, but it is out of scope for this phase (skipping is governed by the 1A-1 size/built-in/generated policy). After dogfooding, decide on a HOTFIX based on the false-positive rate.
- **`pom.xml` aggregate parent POM**: very large (hundreds to thousands of lines). Split by the oversize fallback. Verified once with a huge fixture.
- **`media.rs` cleanup**: the inline `match extension` duplication accumulated since 1A-1 is replaced by a call to `code_lang_for_path`. Existing unit test behavior is preserved (the tests only look at result values, so they should pass).
- **Post-merge deviations** go into the dated log in `tasks/HOTFIXES.md`, with a one-line cross-link in this spec's `Risks / notes`.


@@ -0,0 +1,116 @@
# p10-3 — Tier 3 paragraph + line-window fallback chunker
**Status:** 🟡 In progress
**Contract sections:** §3.3 (chunker_version `code-text-paragraph-v1`), §3.5 (code_lang routing: `shell` activated + "unsupported / Tier 3 fallback" clarified), §6.2 (`kebab-chunk/src/code_text_paragraph_v1.rs`), §6.3 (`tier2_shared::build_chunk` exposed as `pub(crate)`), §9.3 (Tier 3 definition), §10.1 (one activation log line).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1.3 (Phase 3) + §9.3.
**Plan:** [2026-05-20-p10-3-tier3-paragraph-fallback.md](../../docs/superpowers/plans/2026-05-20-p10-3-tier3-paragraph-fallback.md).
## Goal
Activate the Tier 3 paragraph fallback chunker on top of the p10-1A-2 / 1B / 1C / 1A-1 framework and the p10-2 Tier 2 infrastructure. Single PR. From merge onward:
- `.sh` / `.bash` / `.zsh` files are indexed paragraph by paragraph.
- 0-chunk results such as p10-2's non-k8s YAML / invalid YAML or a Tier 1 AST extractor failure fall back to Tier 3 automatically and get indexed, so files that used to be skipped become searchable.
## Frozen design decisions (finalized by this task)
### chunker (`code-text-paragraph-v1`)
- **Input**: `Document` with a single `Block::Code { text, lang, ... }`. Same shape as Tier 2's `synthesize_tier2_document`, so the fallback wrapper reuses the same doc.
- **VERSION_LABEL**: `"code-text-paragraph-v1"`.
- **Paragraph splitting**: iterate over `text.lines()`. A blank line (exactly empty or whitespace-only) is a paragraph boundary. Blank lines themselves belong to no paragraph (they are not included in any chunk's line range). Empty paragraphs (all whitespace) are skipped.
- **Paragraph size rule** (design §9.3 defaults as is, hardcoded):
  - paragraph line count ≤ 80 → 1 chunk emitted.
  - paragraph line count > 80 → line-window split with window size 80 / overlap 20 (stride 60), i.e. lines 1-80, 61-140, 121-200, … The last window runs to the end of the paragraph (EOF for a single-paragraph file) and is ≤ 80 lines.
  - `FALLBACK_LINES_PER_CHUNK = 80` and `FALLBACK_LINES_OVERLAP = 20` are both hardcoded constants (same pattern as 1A-2's `AST_CHUNK_MAX_LINES = 200`: not exposed in user config; exposure may be considered in a future HOTFIX).
- **Citation**: `SourceSpan::Code { line_start, line_end, symbol: None, lang: <input lang> }`. `symbol = None` always (Tier 3 does not identify semantic units). `lang` preserves the input Document's `Block::Code.lang` as is: shell → `"shell"`, k8s skip → `"yaml"`, Rust extractor failure → `"rust"`, and so on.
- **chunk_id collision avoidance**: when a single paragraph is line-window split, `window_start` is passed as the `split_key` of `id_for_chunk` (same as the Tier 2 `#L{k}` pattern).
- **Edge cases**:
  - Whole file consists of blank lines only → 0 chunks emitted (no fallback for the fallback). `tracing::warn!`.
  - Single paragraph with ≤ 80 lines → 1 chunk, line range 1..N.
  - Huge file with no blank lines (the whole file is one paragraph) → line-window split.
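The splitting and window rules above can be sketched as a function that returns only line ranges. This is an illustrative std-only sketch, not the actual `code_text_paragraph_v1.rs`: the constants are taken from the spec, while `paragraph_line_ranges` and `push_windows` are hypothetical names, and chunk construction via `build_chunk` is left out.

```rust
const FALLBACK_LINES_PER_CHUNK: usize = 80;
const FALLBACK_LINES_OVERLAP: usize = 20;

/// Pushes the 1-based inclusive windows covering the paragraph `start..=end`.
/// A paragraph of at most 80 lines yields exactly one window.
fn push_windows(ranges: &mut Vec<(usize, usize)>, start: usize, end: usize) {
    let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP; // 60
    let mut window_start = start;
    loop {
        let window_end = (window_start + FALLBACK_LINES_PER_CHUNK - 1).min(end);
        ranges.push((window_start, window_end));
        if window_end == end {
            break; // the last window stops at the end of the paragraph
        }
        window_start += stride;
    }
}

/// Returns the 1-based inclusive line range of every chunk that would be
/// emitted. Blank (empty or whitespace-only) lines are boundaries and are
/// never part of a range.
fn paragraph_line_ranges(text: &str) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut para_start: Option<usize> = None;
    let mut last_line = 0usize;

    for (idx, line) in text.lines().enumerate() {
        let line_no = idx + 1;
        last_line = line_no;
        if line.trim().is_empty() {
            // A boundary closes the paragraph in progress, if any.
            if let Some(start) = para_start.take() {
                push_windows(&mut ranges, start, line_no - 1);
            }
        } else if para_start.is_none() {
            para_start = Some(line_no);
        }
    }
    // Close a paragraph that runs to the end of the file.
    if let Some(start) = para_start {
        push_windows(&mut ranges, start, last_line);
    }
    ranges
}
```

Each returned `window_start` is also what the spec passes as the `split_key`, which is why two windows of the same paragraph cannot collide on chunk_id.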
### Routing / fallback wrapper
- **`code_lang_for_path`**: unchanged (the shell mapping has existed since 1A-1).
- **`ingest_one_code_asset` allowlist** (`crates/kebab-app/src/lib.rs:953`): add `"shell"`.
- **4-arm match (parser_version / chunker_version / extract / chunks)**: add a `"shell"` arm:
  - parser_version = `"none-v1"` (reuses the Tier 2 sentinel).
  - chunker_version = `CodeTextParagraphV1Chunker.chunker_version()`.
  - extract = `synthesize_tier2_document(asset, &bytes, "shell", &parser_version)?` (reused).
  - chunks = `CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)?`.
- **Fallback wrapper** (the key new logic): post-processing right after the chunks match:
  - If the result for a Tier 1/2 lang is `Err(_)` or `Ok(empty_vec)`, retry with Tier 3.
  - On retry:
    - Swap `chunker_version` to `code-text-paragraph-v1` (so downstream stamping is accurate).
    - Swap `canonical.parser_version` to `"none-v1"` (Tier 1's `RUST_PARSER_VERSION` etc. would be misleading).
    - Run `CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)`.
  - The failure reason is logged with `tracing::warn!("tier1/2 emitted 0 chunks or errored for {workspace_path} ({code_lang}); falling back to tier 3")`.
- **If Tier 3 itself yields 0 chunks or Err**, the file fails/skips as is (no fallback for the fallback).
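The control flow of the fallback wrapper can be sketched generically. This is a hedged illustration only: `Tier` and `chunk_with_tier3_fallback` are hypothetical names, the chunk and error types are left generic, and the real wrapper in `ingest_one_code_asset` also swaps `chunker_version` / `parser_version` and emits the `tracing::warn!` line, which the caller would do here when it sees `Tier::Tier3Fallback`.

```rust
/// Which chunker produced the final result (hypothetical marker type).
#[derive(Debug, PartialEq)]
enum Tier {
    Primary,
    Tier3Fallback,
}

/// Keeps a non-empty primary (Tier 1/2) result; on `Err(_)` or `Ok(empty)`
/// retries exactly once with Tier 3. The Tier 3 result is final, so there is
/// no fallback for the fallback and no possibility of a loop.
fn chunk_with_tier3_fallback<C, E>(
    primary: Result<Vec<C>, E>,
    tier3: impl FnOnce() -> Result<Vec<C>, E>,
) -> (Tier, Result<Vec<C>, E>) {
    match primary {
        Ok(chunks) if !chunks.is_empty() => (Tier::Primary, Ok(chunks)),
        // The primary error (if any) is dropped here; a real wrapper would log it.
        _ => (Tier::Tier3Fallback, tier3()),
    }
}
```

Taking Tier 3 as a closure means it only runs when needed, and returning the `Tier` marker gives the caller a single place to decide whether the version stamps need swapping.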
### Exposing `tier2_shared::build_chunk`
- Currently a module-private `fn build_chunk`. Tier 3 calls it to build identical Chunks (consistent hash / token / policy_hash), so only the visibility changes, to `pub(crate) fn build_chunk(...)`. The signature stays the same.
### Lang preservation policy
- The `Citation::Code.lang` of a Tier 3 chunk = the input Document's `Block::Code.lang`, unchanged. Spelled out as a table:
| Source | input lang | Tier 3 output lang |
|--------|-----------|----------|
| shell direct | `"shell"` | `"shell"` |
| k8s 0-chunk fallback | `"yaml"` | `"yaml"` |
| Rust AST failure fallback | `"rust"` | `"rust"` |
| manifest 0-chunk (theoretical, rarely occurs) | `"toml"` etc. | preserved |
- At search time, `--code-lang shell` / `--code-lang yaml` etc. also match fallback chunks, so the search filter behaves naturally.
### Non-scope
- **Wiring unsupported extensions**: `.txt` / `.log` / `.scala` / `.rb` etc. are out of scope for this PR. The mappings in `code_lang_for_path` are unchanged. The Tier 3 chunker itself is built now, and when a new lang is later added to `code_lang_for_path` it is picked up automatically (1A-2 pattern).
- **Config exposure**: `FALLBACK_LINES_PER_CHUNK` / `FALLBACK_LINES_OVERLAP` are hardcoded. No config.toml exposure.
### Frozen design updates
- One activation log line in §10.1 of `docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md`.
- One activation log line in §10 of `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`.
- The "unsupported / Tier 3 fallback → null" wording in §3.5 stays as is. It is accurate for this phase: Tier 3 chunks preserve the input lang, so "null" applies only once unsupported extensions are wired in.
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` passes (memory-conscious `-j 1`).
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- 4 new unit tests in `crates/kebab-chunk/tests/code_text_paragraph_v1.rs`:
  - `shell_multi_paragraph_splits_on_blank_lines`: 3-paragraph fixture → 3 chunks, symbol=None, lang=shell, contiguous (exclusive of blank lines).
  - `single_long_paragraph_line_window_split`: single paragraph of 200+ lines → window split, distinct chunk_ids, expected line ranges (1-80, 61-140, 121-200, …).
  - `empty_file_emits_zero_chunks`: empty text → `Ok(vec![])`.
  - `lang_field_preserved_from_input_doc`: input with lang=yaml → emitted chunk has lang=yaml.
- 2 new integration tests in `crates/kebab-app/tests/code_ingest_smoke.rs`:
  - `tier3_shell_ingest_searchable`: ingest a `.sh` file → search with `--code-lang shell` → `Citation::Code { symbol: None, lang: "shell" }`, `chunker_version: "code-text-paragraph-v1"`.
  - `tier3_yaml_fallback_picks_up_non_k8s_yaml`: ingest a yaml without apiVersion+kind → fallback triggers → `Citation::Code { symbol: None, lang: "yaml" }`, chunker_version `code-text-paragraph-v1`.
- Existing 12 smoke tests + 2 new = 14 tests in total. (Tier 1: 9 + Tier 2: 3 + Tier 3: 2.)
- `kebab schema --json | jq .stats.code_lang_breakdown` shows a `"shell"` count (after ingesting a .sh file). Non-k8s YAML also accumulates in the `"yaml"` count (Tier 2 and Tier 3 share the same lang).
- README + HANDOFF + docs/ARCHITECTURE + docs/SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- Frozen design: one activation log line each in §10.1 and §10.
- workspace `Cargo.toml` minor bump (0.14.0 → 0.15.0), gitea-release v0.15.0.
## Allowed dependencies
- `kebab-chunk`: new module `code_text_paragraph_v1.rs`, using kebab-core + anyhow + tracing. Calls `build_chunk` from tier2_shared (exposed with `pub(crate)` visibility). Does not use tree-sitter / serde_yaml.
- `kebab-app::ingest_one_code_asset`: 4-arm match + allowlist + fallback wrapper extended. No new crate dep.
- `kebab-parse-code`: unchanged (the shell mapping in lang.rs has existed since 1A-1).
- `kebab-source-fs`: unchanged (media.rs already delegates to `code_lang_for_path`).
## Forbidden dependencies
- `kebab-chunk` must not import store / embed / llm / rag / tree-sitter directly (boundary §6.3 is kept).
- UI crates (`kebab-cli` / `kebab-mcp` / `kebab-tui` / `kebab-desktop`) must not import `kebab-parse-code` / `kebab-chunk` directly; they go through the `kebab-app` facade only.
## Risks / notes
- **Preventing a fallback infinite loop**: if Tier 3 itself yields 0 chunks or Err, the file fails/skips as is; there is no fallback for the fallback. Stated explicitly in the spec.
- **`try_skip_unchanged` consistency when chunker_version is swapped**: after a fallback, the stored chunker_version = `code-text-paragraph-v1`. On the next ingest of the same file, the lookup matches on the same chunker_version (skip works as expected). If the Tier 1 chunker starts working later (e.g. a tree-sitter grammar fix), the cascade rule causes an incremental cache miss and the file is reprocessed automatically.
- **lang preservation vs. fallback semantics**: because fallback chunks keep the original lang, the search filter `--code-lang yaml` matches both Tier 2 and Tier 3 chunks. This is intended: a user searching "yaml files" sees every yaml result.
- **Meaning of the line-window overlap**: 80/20 (stride 60) is the design §9.3 default. The same algorithm applies to a huge paragraph such as single-line minified JSON, but since that is one line, no split happens (the threshold is 80 lines, not text length). For minified input, a single chunk ends up holding very long text; this is an inherent limitation of the paragraph splitting policy. To be reviewed in a future HOTFIX.
- **Blank-line handling**: lines matching `^\s*$` (whitespace-only) are paragraph boundaries. Edge cases such as tab-only lines and CR-only lines are verified with fixtures.
- **Shell line comments**: a `# comment` line in a shell script is an ordinary line. It does not affect paragraph splitting (it is not a blank line) and is kept in the chunk as is.
- **`canonical.parser_version` mutation in the fallback wrapper**: the Document's parser_version is swapped to `"none-v1"` on Tier 3 fallback. The CanonicalDocument must be bound as `mut`. It is already `let mut canonical = match ...`, so mutation is possible. To be verified at the plan stage.
- **Post-merge deviations** go into the dated log in `tasks/HOTFIXES.md`, cross-linked from this spec's `Risks / notes`.