Compare commits

93 Commits

Author SHA1 Message Date
a0c0dca321 fix(dogfood): k8s multi-resource YAML chunk_id collision (#158) 2026-05-21 23:57:49 +00:00
667495ae6a docs(dogfood): HOTFIXES entry for k8s multi-resource chunk_id collision
PR #158 code-reviewer recommendation. Records the dogfood-discovered
k8s multi-resource chunk_id collision + the deliberate decision NOT to
bump chunker_version (dogfood-only stage, single-resource k8s chunk_id
shift is benign churn). Cross-link added to p10-2 spec Risks/notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:57:34 +00:00
08d72a12e0 chore: bump version 0.16.0 → 0.16.1 (k8s multi-resource chunk_id fix)
Patch bump — bug fix only (P10 dogfood-discovered k8s multi-resource
chunk_id collision). New binary needed to resume dogfooding. No wire
schema change, no DB migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:54:33 +00:00
1969c8e3b5 fix(dogfood): k8s multi-resource YAML chunk_id collision
P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
  UNIQUE constraint failed: chunks.chunk_id

Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.

Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).

Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 23:49:37 +00:00
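The collision described in the commit above can be reproduced with a small sketch. The `chunk_id` function here is a hypothetical stand-in (the real derivation in tier2_shared is not shown in this log), and the constant `42` plays the role of base_policy_hash. It only illustrates why a `None` split key makes sibling resources from one document indistinguishable, and why `Some(resource.line_start)` separates them.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the chunk_id derivation: hashes the document
/// id, a base policy hash, and an optional split key.
fn chunk_id(doc_id: &str, base_policy_hash: u64, split_key: Option<usize>) -> u64 {
    let mut h = DefaultHasher::new();
    doc_id.hash(&mut h);
    base_policy_hash.hash(&mut h);
    split_key.hash(&mut h);
    h.finish()
}

/// One id per resource. `line_starts` holds the 1-based first line of each
/// resource in the same file. With `use_split_key == false` this mimics the
/// pre-fix behaviour (split key hardcoded to None).
fn ids_for_resources(doc_id: &str, line_starts: &[usize], use_split_key: bool) -> Vec<u64> {
    line_starts
        .iter()
        .map(|&line| {
            let key = if use_split_key { Some(line) } else { None };
            chunk_id(doc_id, 42, key) // 42 = assumed base_policy_hash
        })
        .collect()
}

/// Number of distinct ids in the slice.
fn distinct(ids: &[u64]) -> usize {
    ids.iter().collect::<HashSet<_>>().len()
}
```

For a two-resource file, the pre-fix mode yields a single distinct id (which a UNIQUE constraint would reject), while keying on the resource's first line yields two.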
c6207d196e chore(p10-1d-followup): reviewer nit cleanup — C extractor tests + HOTFIXES + cpp snapshot (#157) 2026-05-21 22:47:38 +00:00
840c6c40a6 test(p10-1d-followup): cpp snapshot exercises actual CppAstExtractor
Reviewer nit #3: the hand-built fixed_doc() only verified chunker 1:1
mapping. New tests invoke CppAstExtractor against tests/fixtures/sample.cpp
and snapshot the real extractor → chunker pipeline (14 blocks emitted
covering namespace::chunk::Class, ctor/dtor/operator/template/free-fn
convention, glue <top-level> blocks between units).

Adds kebab-parse-code as a dev-dep of kebab-chunk (same precedent as
kebab-parse-md). Both the existing hand-built test AND the new
extractor-driven tests are kept — the former for fast chunker-only
validation, the latter for end-to-end regression detection.

Added tests:
- code_cpp_ast_extractor_snapshot: asserts all 8 named symbol units are present
- code_cpp_ast_extractor_chunks_deterministic: chunker output is stable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:43:57 +00:00
b81574afa9 docs(p10-1d-followup): HOTFIXES entry — typedef-wrapped struct/enum in C falls into glue
PR #156 reviewer nit #2. Documents the tension between spec body
("struct_specifier (named, top-level) → 1 unit") and the actual behavior
for the C idiom `typedef struct { ... } Foo;` — the inner struct_specifier
is anonymous, so the extractor falls into glue. Follow-up: revisit, driven by
dogfooding, if this becomes a frequent pain point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:40:04 +00:00
6beff35a2f test(p10-1d-followup): add in-file unit tests to C AST extractor
Mirrors the cpp.rs 15-test pattern. Covers function_definition (incl.
pointer-return, static/extern/inline), struct_specifier / enum_specifier /
union_specifier (named), anonymous struct/enum/union → glue, typedef-wrapped
struct → glue (per spec risks note), preprocessor directives → glue, empty
file → <module> post-pass, preprocessor-only → <module>, mixed fn + glue →
<top-level> present, determinism (20 runs). 17 tests total.

Reviewer nit #1 (PR #156 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:39:36 +00:00
75a4207aa1 feat(p10-1d): C + C++ AST chunkers — P10 Tier 1 chunker family complete (#156) 2026-05-21 15:48:34 +00:00
86aa180ad7 chore: bump version 0.15.0 → 0.16.0 (p10-1d C + C++ AST chunkers)
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
major bump.

Also lands the missing p10-3 try_skip_unchanged fallback-aware fix (Option
B1 — 7th param) that PR #155 was supposed to ship but never made it to main
(implementer reported commit SHA 2a39513 that didn't exist in the merged
branch). Same commit extends tier3_fallback_cv to include c/cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 15:38:00 +00:00
802c573c07 docs(p10-1d): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++).

- README adds C/C++ to the ingest row + --code-lang c/cpp + Mermaid brace.
- HANDOFF flips p10-1D to done (v0.16.0), updates the one-line summary + next candidates.
- ARCHITECTURE adds C/C++ to the code-parser row, extends flowchart pcode
  node, adds chunker tree entries.
- SMOKE adds P10-1D walkthrough section + verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-1D to done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:35:59 +00:00
438870ee25 docs(p10-1d): activate C + C++ in frozen design §10
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:32:26 +00:00
192835e5bf test(p10-1d): integration smoke tests for C + C++
Verifies end-to-end ingest + search + Citation::Code shape:
- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol
  = function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1".
- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search →
  symbol starts with namespace::Class prefix, lang = "cpp",
  chunker_version = "code-cpp-ast-v1".

Brings code_ingest_smoke to 18 tests (Tier 1: 9 → 11, Tier 2: 3,
Tier 3: 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:31:35 +00:00
1034de25a2 fix(p10-3+p10-1d): land the missing try_skip_unchanged fallback-aware fix
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).

This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
   to try_skip_unchanged + the stored_is_tier3_fallback detection branch
   (skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
   that can fall back: rust / python / ts / js / go / java / kotlin /
   yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
   (p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
   regression tests (the originally-promised PR #155 regression coverage
   that also never made it to main).

Smoke tests: 14 + 2 = 16 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:19:17 +00:00
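The skip decision described in the commit above might look roughly like the sketch below. It is a reduction under stated assumptions: the `Versions` struct and `can_skip` function are hypothetical names, the content itself is assumed to be already known unchanged, and only the version comparison is modelled. The real try_skip_unchanged takes seven parameters that this log does not list.

```rust
/// Hypothetical bundle of the version labels stored with an ingested file.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Versions {
    parser: String,
    chunker: String,
    embedder: String,
}

/// Decide whether a reingest can be skipped.
///
/// `fallback_chunker_version` is `None` for call sites that can never fall
/// back (md / image / pdf in the commit message) and `Some(..)` for code
/// languages that may have been stored under the Tier 3 fallback chunker.
fn can_skip(
    stored: &Versions,
    current: &Versions,
    fallback_chunker_version: Option<&str>,
) -> bool {
    let stored_is_tier3_fallback =
        fallback_chunker_version == Some(stored.chunker.as_str());
    if stored_is_tier3_fallback {
        // The stored parser/chunker labels can never equal the Tier 1/2
        // dispatch values, so only the embedder check is kept.
        return stored.embedder == current.embedder;
    }
    stored == current
}
```

Without the extra parameter, a file stored under code-text-paragraph-v1 would always compare unequal to the caller's Tier 1/2 labels and be re-embedded on every run.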
d1560be80d feat(p10-1d): activate C + C++ in ingest_one_code_asset dispatch
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv with "c" + "cpp" arms. C uses CAstExtractor
+ CodeCAstV1Chunker; C++ uses CppAstExtractor + CodeCppAstV1Chunker. Both
langs are Tier 3-fallback-eligible (e.g. .h file with C++ syntax may fail
tree-sitter-c parse → Tier 3 paragraph fallback per p10-3 wrapper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:56:45 +00:00
b2a2902e38 feat(p10-1d): code-cpp-ast-v1 chunker + snapshot test
Identical chunker body to code-c-ast-v1 (per-language work happens in the
CppAstExtractor, Task C). Snapshot fixture covers nested namespace + class
+ ctor/dtor + method + operator overload + template fn + free fn + top-level
main, verifying namespace::Class::method symbol convention per design §3.4.

5 chunks emitted:
- <top-level> (includes, namespace opening)
- kebab::chunk::MdHeadingV1Chunker (class unit)
- kebab::identity (template function)
- kebab::global_helper (free function in namespace)
- main (top-level main function)

Template function symbols emit without <T> parameters per spec convention.
Namespace::Class::method pattern verified. All tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:46:12 +00:00
03cd41c48f feat(p10-1d): code-c-ast-v1 chunker + snapshot test
Mirrors code-go-ast-v1's chunker pattern. Snapshot test against
tests/fixtures/sample.c (function + typedef struct + typedef enum +
preprocessor) verifies symbol list + lang=c stamping.

Chunks produced (4 total):
- <top-level> glue: includes, defines, static vars, typedefs (lines 1-18)
- parse_record function (lines 20-23)
- print_record function (lines 25-27)
- main function (lines 29-33)

All chunks stamped with lang=c and chunker_version=code-c-ast-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:41:19 +00:00
926042049c feat(p10-1d): C++ AST extractor (tree-sitter-cpp)
Symbol = namespace::Class::method via recursive build_blocks. namespace_definition
pushes namespace name (anonymous → <anonymous>). nested_namespace_specifier
(outer::inner) flattens all segments and pushes them. class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition emits
method unit; symbol resolution unpacks declarator chain (pointer_declarator /
reference_declarator → function_declarator → identifier / field_identifier /
qualified_identifier / operator_name / destructor_name).

operator_cast (conversion operators, e.g. operator bool) handled as a direct
declarator kind on function_definition. template_declaration recurses with same
prefix (template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. linkage_specification (extern "C") recurses into body with same
prefix. Other top-level nodes → <top-level> glue.

All 15 unit tests pass; build and clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:37:58 +00:00
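The prefix-stack recursion behind the namespace::Class::method convention can be sketched on a toy tree. The `Node` enum below is an assumption made for illustration and is not the tree-sitter-cpp node API. It models only namespaces (including the flattened outer::inner form and the anonymous case), classes, functions, and templates that recurse with the same prefix.

```rust
/// Toy stand-in for the relevant C++ syntax nodes (hypothetical shape).
enum Node {
    Namespace { name: Option<String>, body: Vec<Node> },
    Class { name: String, body: Vec<Node> },
    Function { name: String },
    Template(Box<Node>),
}

/// Walk `nodes`, appending one fully qualified symbol per class and function.
/// `prefix` is the stack of enclosing namespace / class names.
fn build_symbols(nodes: &[Node], prefix: &mut Vec<String>, out: &mut Vec<String>) {
    for node in nodes {
        match node {
            Node::Namespace { name, body } => {
                // "outer::inner" is flattened into one segment per name;
                // an unnamed namespace contributes "<anonymous>".
                let segments: Vec<String> = match name {
                    Some(n) => n.split("::").map(str::to_string).collect(),
                    None => vec!["<anonymous>".to_string()],
                };
                let pushed = segments.len();
                prefix.extend(segments);
                build_symbols(body, prefix, out);
                prefix.truncate(prefix.len() - pushed);
            }
            Node::Class { name, body } => {
                prefix.push(name.clone());
                out.push(prefix.join("::")); // the class unit itself
                build_symbols(body, prefix, out);
                prefix.pop();
            }
            Node::Function { name } => {
                let mut parts = prefix.clone();
                parts.push(name.clone());
                out.push(parts.join("::"));
            }
            // Template parameters never enter the symbol: recurse unchanged.
            Node::Template(inner) => {
                build_symbols(std::slice::from_ref(&**inner), prefix, out)
            }
        }
    }
}
```

Restoring the prefix after each recursion (truncate / pop) is what keeps sibling declarations from inheriting each other's scope.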
e0a29225da feat(p10-1d): C AST extractor (tree-sitter-c)
Top-level units: function_definition (symbol = fn name from declarator's
innermost identifier), struct_specifier, enum_specifier, union_specifier
(each emits 1 unit with the named identifier as symbol). Preprocessor
directives + top-level declarations group into a <top-level> glue chunk.
Empty file or zero units → <module> post-pass.

C symbol = function name only — no namespace, no class nesting (design §3.4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:29:36 +00:00
b541567946 build(p10-1d): add tree-sitter-c + tree-sitter-cpp workspace deps
Standard crate names resolved cleanly: tree-sitter-c v0.24.2 and
tree-sitter-cpp v0.23.4 are both compatible with workspace tree-sitter 0.26.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:19:00 +00:00
a58d400abd docs(p10-1d): implementation plan (11 tasks A-K, subagent-driven)
Tasks: workspace deps / C extractor / C++ extractor / C chunker + snapshot /
C++ chunker + snapshot / ingest dispatch + tier3_fallback_cv extension /
2 smoke tests / frozen design §10 / docs sync / workspace test gate /
version bump 0.15.0 → 0.16.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:15:22 +00:00
8add684ffc docs(p10-1d): task spec for C + C++ AST chunkers
Frozen contract: single PR with code-c-ast-v1 + code-cpp-ast-v1. C symbol
= function name only (no nesting). C++ symbol = namespace::Class::method
(recursion). .h → C (design §3.5); C++ headers' parse failure picked up
by p10-3 Tier 3 fallback. tree-sitter-c + tree-sitter-cpp workspace deps,
version bump 0.15.0 → 0.16.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:12:11 +00:00
7a90df1485 feat(p10-3): Tier 3 paragraph + line-window fallback chunker: shell direct + Tier 1/2 0-chunk/Err automatically picked up (#155) 2026-05-21 12:27:18 +00:00
46f408dc0f chore: bump version 0.14.0 → 0.15.0 (p10-3 Tier 3 paragraph fallback)
Minor bump — additive new chunker_version "code-text-paragraph-v1" + new
routing lang "shell" + new Tier 1/2 → Tier 3 fallback wrapper behavior.
No DB migration, no wire schema major bump (Citation::Code.lang values
remain a free string field).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:05:53 +00:00
49e60fb314 docs(p10-3): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
- README adds Tier 3 to the ingest row (shell + fallback) and the Mermaid
  chunker enumeration; --code-lang shell admitted.
- HANDOFF flips p10-3 to done (v0.15.0) and updates the one-line summary + next
  candidates.
- ARCHITECTURE adds Tier 3 to the code-parser row, extends the flowchart
  pcode node, and lists code_text_paragraph_v1.rs in the chunker tree.
- SMOKE adds a P10-3 walkthrough (shell + non-k8s YAML fallback) and a
  verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-3 to done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:43:38 +00:00
6bc7a83d3c docs(p10-3): activate Tier 3 in frozen design §10.1
Add p10-3 activation log entry for Tier 3 paragraph fallback chunker
(code-text-paragraph-v1) with shell direct routing and fallback wrapper
for invalid YAML / AST failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:39:49 +00:00
df3c5b8caf test(p10-3): integration smoke tests for Tier 3 (shell + yaml fallback)
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
  Citation::Code { symbol: None, lang: "shell" }, chunker_version
  "code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
  (no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
  retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
  chunker_version "code-text-paragraph-v1".

Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.

Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:37:44 +00:00
5051ea7534 feat(p10-3): Tier 1/2 → Tier 3 fallback wrapper in ingest_one_code_asset
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.

Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly (extract_fell_back guard).

shell direct path + Tier 2 extract synthesize_tier2_document failures
are exempted from the fallback chain (they ARE Tier 3 already, or the
error is real).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:32:49 +00:00
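A reduced sketch of the retry rule in the commit above: an empty Ok or an Err from the primary chunker is retried against a paragraph chunker, and the version labels are swapped. `Outcome`, `chunk_with_fallback`, and `tier3_paragraphs` are hypothetical names, chunks are plain strings here, and the paragraph split is deliberately naive. The extract-failure path and the shell exemption are not modelled.

```rust
/// Hypothetical result of chunking one file, with the labels that get stamped.
#[derive(Debug, PartialEq)]
struct Outcome {
    chunks: Vec<String>,
    chunker_version: String,
    parser_version: String,
}

/// Naive blank-line paragraph split standing in for the Tier 3 chunker.
fn tier3_paragraphs(raw: &str) -> Vec<String> {
    raw.split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .map(str::to_string)
        .collect()
}

/// Run `primary`; on Ok(empty) or Err, retry with the Tier 3 fallback and
/// swap chunker_version / parser_version so later stamping stays consistent.
fn chunk_with_fallback<F>(
    primary: F,
    primary_chunker_version: &str,
    primary_parser_version: &str,
    raw: &str,
) -> Outcome
where
    F: FnOnce(&str) -> Result<Vec<String>, String>,
{
    match primary(raw) {
        Ok(chunks) if !chunks.is_empty() => Outcome {
            chunks,
            chunker_version: primary_chunker_version.to_string(),
            parser_version: primary_parser_version.to_string(),
        },
        // Ok(vec![]) (e.g. non-k8s YAML) or Err (e.g. parse failure).
        _ => Outcome {
            chunks: tier3_paragraphs(raw),
            chunker_version: "code-text-paragraph-v1".to_string(),
            parser_version: "none-v1".to_string(),
        },
    }
}
```

Swapping both labels together matters: a later skip check compares the stored labels, so a fallback result stamped with the primary labels would misreport how the chunks were produced.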
88d7fbc182 feat(p10-3): activate shell direct routing through Tier 3 chunker
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.

Tier 1/2 fallback wrapper lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:28:41 +00:00
0b7d8af759 feat(p10-3): code-text-paragraph-v1 chunker — paragraph + line-window fallback
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.

4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:22:48 +00:00
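The segmentation arithmetic in the commit above can be sketched as follows. The function names are hypothetical and the sketch returns only 1-based inclusive line ranges rather than real Chunk values. The constants follow the message (80-line windows, 20-line overlap, hence stride 60); how the real chunker handles a short trailing window is an assumption here.

```rust
const WINDOW: usize = 80;
const OVERLAP: usize = 20;

/// Blank-line paragraph segmentation. Whitespace-only lines are boundaries
/// and never fall inside any returned range. Ranges are 1-based, inclusive.
fn paragraphs(text: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    let mut last = 0;
    for (i, line) in text.lines().enumerate() {
        let n = i + 1;
        last = n;
        if line.trim().is_empty() {
            if let Some(s) = start.take() {
                out.push((s, n - 1));
            }
        } else if start.is_none() {
            start = Some(n);
        }
    }
    if let Some(s) = start {
        out.push((s, last));
    }
    out
}

/// Split one paragraph range into overlapping line windows when it is
/// longer than WINDOW lines; shorter paragraphs pass through unchanged.
fn windows(start: usize, end: usize) -> Vec<(usize, usize)> {
    if end - start + 1 <= WINDOW {
        return vec![(start, end)];
    }
    let stride = WINDOW - OVERLAP; // 60
    let mut out = Vec::new();
    let mut s = start;
    loop {
        let e = (s + WINDOW - 1).min(end);
        out.push((s, e));
        if e == end {
            break;
        }
        s += stride;
    }
    out
}

/// Paragraph split followed by the oversize window split.
fn chunk_ranges(text: &str) -> Vec<(usize, usize)> {
    paragraphs(text)
        .into_iter()
        .flat_map(|(s, e)| windows(s, e))
        .collect()
}
```

For a single 200-line paragraph this reproduces the ranges named in the unit test description: 1-80, 61-140, 121-200.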
9342b9543f refactor(p10-3): expose tier2_shared::build_chunk as pub(crate)
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:17:51 +00:00
a8aa03042f docs(p10-3): implementation plan (9 tasks A-I, subagent-driven)
Tasks: tier2_shared visibility upgrade / Tier 3 chunker + 4 unit tests /
shell direct routing / Tier 1/2 fallback wrapper / 2 smoke tests / frozen
design §10.1+§10 / docs sync (6 files) / workspace test gate / version
bump 0.14.0→0.15.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:16:55 +00:00
9d4a60aac5 docs(p10-3): task spec for Tier 3 paragraph + line-window fallback chunker
Frozen contract for p10-3 single PR: code-text-paragraph-v1 chunker
(blank-line paragraph split + 80-line/20-overlap line-window for oversize),
shell direct routing, Tier 1/2 fallback wrapper (0-chunk or Err → Tier 3
retry with chunker_version + parser_version swap), tier2_shared::build_chunk
pub(crate) exposure, frozen design §10.1 + §10 deltas, version bump
0.14.0 → 0.15.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:55:16 +00:00
8ce7a911ee chore(p10-2-followup): reviewer nit cleanup: Mermaid + comments + oversize test (#154) 2026-05-20 14:44:39 +00:00
75c1c7b911 test(p10-2-followup): cover tier2_shared oversize fallback with >200-line k8s ConfigMap
Spec p10-2 risks section calls out "huge ConfigMap" but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).

Reviewer nit #1 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:41:16 +00:00
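The "contiguous partition" assertion in the test above can be expressed as a small predicate. Both functions are hypothetical, and the window size is a parameter because this log does not state the Tier 2 window length. Only the property `prev.line_end + 1 == next.line_start` comes from the commit message.

```rust
/// Split the inclusive 1-based range [first, last] into back-to-back
/// windows of at most `window` lines (no overlap, no gaps).
fn split_contiguous(first: usize, last: usize, window: usize) -> Vec<(usize, usize)> {
    assert!(window > 0 && first <= last);
    let mut out = Vec::new();
    let mut s = first;
    while s <= last {
        let e = (s + window - 1).min(last);
        out.push((s, e));
        s = e + 1;
    }
    out
}

/// True when `ranges` covers [first, last] exactly, with every range
/// starting one line after the previous one ends.
fn is_contiguous_partition(ranges: &[(usize, usize)], first: usize, last: usize) -> bool {
    match (ranges.first(), ranges.last()) {
        (Some(f), Some(l)) if f.0 == first && l.1 == last => {}
        _ => return false,
    }
    ranges.windows(2).all(|w| w[0].1 + 1 == w[1].0)
}
```

This is a different shape from the Tier 3 paragraph chunker, whose oversize windows overlap by design and would fail this predicate.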
b5c12ecb6f docs(p10-2-followup): clarify synthesize_tier2_document path resolution comment
Earlier comment claimed the function "mirrors RustAstExtractor pattern" but
the two differ: RustAstExtractor joins ctx.workspace_root to handle relative
paths, while Tier 2 trusts FsSourceConnector's absolute-path invariant.
Rephrase to document the actual rationale + the Kb URI fallback.

Reviewer nit #3 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:39:02 +00:00
a1192ce3b2 docs(p10-2-followup): README Mermaid chunker_version list — Java/Kotlin + Tier 2
Adds code-java-ast-v1 / code-kotlin-ast-v1, missing since p10-1C-JK, plus
p10-2's k8s-manifest-resource-v1 / dockerfile-file-v1 / manifest-file-v1.
The code-* entries are grouped in braces to simplify the notation.

Reviewer nit #2 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:35:20 +00:00
17ee400fd5 feat(p10-2): Tier 2 resource-aware chunkers (k8s + Dockerfile + manifest): enables resource files beyond code indexing (#153) 2026-05-20 14:22:55 +00:00
217dddb4ba chore: bump version 0.13.0 → 0.14.0 (p10-2 Tier 2 resource-aware)
Minor bump — additive code_lang values (xml / groovy / go-mod) + 3 new
chunker_version labels (k8s-manifest-resource-v1 / dockerfile-file-v1 /
manifest-file-v1) + frozen design §3.5 / §10.1 deltas. No DB migration,
no wire schema major bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:14:38 +00:00
308666dbd5 docs(p10-2): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync + tasks/p10/INDEX
User-visible surface sync per the docs-split rule:
- README adds Tier 2 langs (yaml / dockerfile / toml / json / xml / groovy / go-mod) to the ingest support list and --code-lang options.
- HANDOFF flips p10-2 phase row to done (v0.14.0) and updates the next-task candidates.
- ARCHITECTURE extends crates/kebab-chunk/src/ tree with k8s_manifest_resource_v1.rs / dockerfile_file_v1.rs / manifest_file_v1.rs / tier2_shared.rs, plus a Tier 2 note on the code-parser row and flowchart node.
- SMOKE adds a Tier 2 smoke walkthrough (k8s yaml + Dockerfile + Cargo.toml ingest + --code-lang search) and a P10-2 entry in the verification checklist.
- tasks/INDEX + tasks/p10/INDEX flip p10-2 to done (v0.14.0).

Workspace test gate (-j 1) + clippy --workspace pass cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:10:13 +00:00
522ae7b8bc docs(p10-2): activate Tier 2 in code-ingest design §10.1 + §3.5 mappings
§3.5: add code_lang_for_path mappings xml / groovy / go-mod.
§10.1: add activation log entry for p10-2 (3 Tier 2 chunkers active).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:24:16 +00:00
166e1ddfaf test(p10-2): integration smoke tests for Tier 2 (k8s yaml + Dockerfile + Cargo.toml)
Three new tests in code_ingest_smoke.rs verifying isolated-TempDir ingest +
--code-lang filter + Citation::Code.lang / .symbol shape for each Tier 2
chunker. Brings the suite to 12 tests (Rust 3 + Python 1 + TS 1 + JS 1 +
Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:23:01 +00:00
226ce8b744 feat(p10-2): activate Tier 2 chunkers in ingest_one_code_asset dispatch
Adds yaml / dockerfile / toml / json / xml / groovy / go-mod arms to the
existing 7-arm AST match. parser_version unified to "none-v1" for Tier 2.
synthesize_tier2_document builds a minimal Document (single Block::Code
with raw file text) since Tier 2 has no parse step. allowlist in
ingest_one_asset extended to admit Tier 2 langs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:19:54 +00:00
22d4161728 feat(p10-2): manifest-file-v1 chunker (whole-file 1 chunk, symbol <manifest>)
Emits 1 Chunk per manifest file (Cargo.toml / pyproject.toml / package.json /
tsconfig.json / pom.xml / build.gradle / go.mod). Symbol unified to
"<manifest>"; manifest type distinguished by code_lang (toml / json / xml /
groovy / go-mod) read from Block::Code.lang. Oversize >200 lines splits via
tier2_shared::push_chunks_with_oversize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:11:46 +00:00
51004ac593 feat(p10-2): dockerfile-file-v1 chunker (whole-file 1 chunk, symbol <dockerfile>)
Reads entire Dockerfile / Dockerfile.* / *.dockerfile content and emits a
single Chunk with symbol "<dockerfile>", code_lang "dockerfile", line range
1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
tier2_shared::push_chunks_with_oversize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:09:13 +00:00
8996e73282 feat(p10-2): k8s-manifest-resource-v1 chunker + tier2_shared helper
Splits multi-document YAML by ^---\s*$, requires apiVersion + kind string
fields per document, emits 1 chunk per recognized k8s resource. Symbol =
<kind>/<namespace>/<name> or <kind>/<name> (cluster-scoped). Invalid YAML
returns 0 chunks (handled by p10-3 paragraph fallback). Oversize >200 lines
splits into line-windows sharing the same symbol.

tier2_shared module hosts the oversize fallback + Chunk-construction helper
mirroring code_rust_ast_v1's Chunk shape. Task E (dockerfile) and Task F
(manifest) will reuse it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:06:47 +00:00
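The document split and symbol convention in the commit above can be sketched without a YAML parser. Both functions are hypothetical. The real chunker additionally parses each document (serde_yaml is added as a dependency elsewhere in this log) to require apiVersion and kind, which this sketch skips. The separator test mirrors the `^---\s*$` pattern from the message.

```rust
/// Split multi-document YAML on separator lines matching ^---\s*$.
/// Returns (1-based first line, text) per non-empty document.
fn split_documents(text: &str) -> Vec<(usize, String)> {
    let mut docs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut start = 1;
    for (i, line) in text.lines().enumerate() {
        // "---" followed only by whitespace; the slice is safe because the
        // three matched bytes are ASCII.
        let is_separator = line.starts_with("---") && line[3..].trim().is_empty();
        if is_separator {
            if current.iter().any(|l| !l.trim().is_empty()) {
                docs.push((start, current.join("\n")));
            }
            current.clear();
            start = i + 2; // line after the separator, 1-based
        } else {
            current.push(line);
        }
    }
    if current.iter().any(|l| !l.trim().is_empty()) {
        docs.push((start, current.join("\n")));
    }
    docs
}

/// <kind>/<namespace>/<name>, or <kind>/<name> for cluster-scoped resources.
fn symbol(kind: &str, namespace: Option<&str>, name: &str) -> String {
    match namespace {
        Some(ns) => format!("{kind}/{ns}/{name}"),
        None => format!("{kind}/{name}"),
    }
}
```

The per-document first line returned here is the kind of value a chunker could later use as a split key to keep sibling resources distinct.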
22dba09857 refactor(p10-2): media.rs delegates code lang to code_lang_for_path
Replaces 1A-1 era inline match block with a single call to
kebab_parse_code::code_lang_for_path, per design §3.5 single-source-of-truth
rule. Adds Tier 2 routing test (yaml / dockerfile / toml / json / xml /
groovy / go-mod) and preserves all non-code extension branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:01:14 +00:00
aaa90b1754 feat(p10-2): extend code_lang_for_path with Tier 2 basenames + extensions
Adds basename-first matching for Dockerfile / Cargo.toml / pyproject.toml /
package.json / tsconfig.json / go.mod / pom.xml / build.gradle plus
Dockerfile.* prefix variant. Extension fallback adds .yaml/.yml/.dockerfile/
.toml/.json/.xml/.gradle → yaml/dockerfile/toml/json/xml/groovy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:59:11 +00:00
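Basename-first matching with an extension fallback might be sketched as below. `tier2_lang_for_path` is a hypothetical name, not the real code_lang_for_path, and the pairing of each basename with a lang is inferred from the lang lists in this log (for example go.mod to go-mod), so treat the table as an assumption.

```rust
use std::path::Path;

/// Resolve a Tier 2 code_lang for `path`: exact basenames first, then the
/// Dockerfile.* prefix variant, then the file extension.
fn tier2_lang_for_path(path: &Path) -> Option<&'static str> {
    let base = path.file_name()?.to_str()?;

    // Basename-first: well-known manifest and Dockerfile names win over
    // whatever the extension would suggest.
    let by_name = match base {
        "Dockerfile" => Some("dockerfile"),
        "Cargo.toml" | "pyproject.toml" => Some("toml"),
        "package.json" | "tsconfig.json" => Some("json"),
        "go.mod" => Some("go-mod"),
        "pom.xml" => Some("xml"),
        "build.gradle" => Some("groovy"),
        _ if base.starts_with("Dockerfile.") => Some("dockerfile"),
        _ => None,
    };
    if by_name.is_some() {
        return by_name;
    }

    // Extension fallback.
    match path.extension()?.to_str()? {
        "yaml" | "yml" => Some("yaml"),
        "dockerfile" => Some("dockerfile"),
        "toml" => Some("toml"),
        "json" => Some("json"),
        "xml" => Some("xml"),
        "gradle" => Some("groovy"),
        _ => None,
    }
}
```

Checking the basename first is what lets go.mod and Dockerfile.prod resolve correctly even though their extensions (mod, prod) carry no useful signal.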
077f92f41e build(p10-2): add serde_yaml dep to kebab-chunk for k8s-manifest-resource-v1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:57:06 +00:00
5ce7f60932 docs(p10-2): implementation plan (11 tasks A-K, subagent-driven)
Branch feat/p10-2-tier2-resource. Tasks: serde_yaml dep / lang.rs basenames /
media.rs source-of-truth consolidation / 3 chunkers (k8s + dockerfile +
manifest) + tier2_shared helper / ingest dispatch / smoke tests / frozen
design §3.5+§10.1 / docs sync / version bump 0.13.0→0.14.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:55:36 +00:00
47857b2622 docs(p10-2): task spec for Tier 2 resource-aware chunkers (k8s + Dockerfile + manifest)
Frozen contract for the p10-2 single PR: 3 chunker activation, k8s
identification via apiVersion+kind, Dockerfile/manifest basename matching,
code_lang_for_path source-of-truth consolidation, frozen design §3.5 +
§10.1 deltas, and version bump 0.13.0 → 0.14.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:43:34 +00:00
1e4cff879b Merge pull request 'feat(p10-1C-JK): Java + Kotlin AST chunkers: enables JVM family code indexing' (#152) from feat/p10-1c-jk into main 2026-05-20 11:57:39 +00:00
2d7a566624 docs(p10-1c-jk): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + design §10.1; chore: bump version 0.12.0 → 0.13.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 11:38:40 +00:00
813bdd1a16 test(p10-1c-jk): code-java-ast-v1 + code-kotlin-ast-v1 chunker snapshots
Mirrors code_go_ast_snapshot pattern. In-memory CanonicalDocument (no
kebab-parse-code dep — boundary §6.3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:57:37 +00:00
ff1bedbef5 feat(p10-1c-jk): activate Kotlin in ingest_one_code_asset dispatch
Replaces Kotlin bail! arms with KotlinAstExtractor + CodeKotlinAstV1Chunker.
Adds kotlin_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=kotlin, symbol=com.foo.Foo.bar, code_lang=kotlin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:54:55 +00:00
30e03c7a12 feat(p10-1c-jk): code-kotlin-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-java-ast-v1 with language-agnostic body unchanged. Cross-
chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:52:24 +00:00
2ce6ae47c5 feat(p10-1c-jk): tree-sitter-kotlin-ng AST extractor (KotlinAstExtractor)
Uses tree-sitter-kotlin-ng (bare tree-sitter-kotlin is stuck on tree-sitter
0.21-0.23, incompatible with our 0.26). Mirrors JavaAstExtractor (JVM family,
source-side package extraction + class-nesting) with Kotlin grammar quirks:

- Root is `source_file`, not `program`.
- `package_header` child is `qualified_identifier` (its slice text is the
  dotted path); the bare `identifier` shape is also accepted as a fallback.
- `class_declaration` is the single node kind for `class` / `data class` /
  `sealed class` / `interface` / `enum class` — distinguished only by its
  `modifiers` child. Body is `class_body` for non-enum, `enum_class_body`
  for enum class; neither carries a `body` field name, so the extractor
  looks the body up by node kind rather than `child_by_field_name("body")`.
- `companion_object` is its own node kind (NOT object_declaration with a
  modifier); its `name` field is optional, so the extractor fills in the
  implicit Kotlin convention name `Companion`.
- `function_declaration` is allowed at top level (unlike Java), emitted as
  `<pkg>.<fn_name>`; the same node kind nested in `class_body` becomes
  `<pkg>.<...>.<Class>.<method>` via the same mod_path mechanism.
- `secondary_constructor` has no `name` field; symbol uses the enclosing
  class name (Java duplication convention: `<pkg>.<...>.<Class>.<Class>`).
- Enum bodies (`enum_class_body`) are NOT recursed — `enum_entry` is not
  emitted as a unit (matches the Java first-pass scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:49:57 +00:00
ebc4ef2eea feat(p10-1c-jk): activate Java in ingest_one_code_asset dispatch
Replaces Java bail! arms with JavaAstExtractor + CodeJavaAstV1Chunker. Adds
java_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=java, symbol=com.foo.Foo.bar, code_lang=java.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:44:05 +00:00
7bda1509b7 feat(p10-1c-jk): code-java-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-rust-ast-v1 / code-go-ast-v1 with language-agnostic body
unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:41:27 +00:00
61d48d67a3 feat(p10-1c-jk): tree-sitter-java AST extractor (JavaAstExtractor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:39:02 +00:00
f4c840b994 refactor(p10-1c-jk): add java + kotlin to dispatch allowlist (bail until Tasks F/I)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:33:27 +00:00
15244b7494 feat(p10-1c-jk): route .java/.kt/.kts to MediaType::Code
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:31:29 +00:00
a7f7ab9f93 build(p10-1c-jk): add tree-sitter-java + tree-sitter-kotlin-ng workspace deps
Bare tree-sitter-kotlin v0.3.8 requires tree-sitter >=0.21,<0.23 which
conflicts with the workspace's tree-sitter 0.26 (links = "tree-sitter"
is a singleton). tree-sitter-kotlin-ng v1.1.0 (from
tree-sitter-grammars/tree-sitter-kotlin) uses the tree-sitter-language
0.1 shim which is compatible with tree-sitter 0.26. Using
tree-sitter-kotlin-ng as the Kotlin grammar crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:30:03 +00:00
1b19e33a4f docs(p10-1c-jk): task spec + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:27:13 +00:00
9c9e391b15 Merge pull request 'feat(p10-1C-Go): tree-sitter-go AST extractor + chunker: activate Go code indexing' (#151) from feat/p10-1c-go into main 2026-05-20 10:16:09 +00:00
f95cd55484 docs(p10-1c-go): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + design §10.1; chore: bump version 0.11.1 → 0.12.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:02:21 +00:00
ab288135e9 test(p10-1c-go): code-go-ast-v1 chunker snapshot + full-suite gate
Mirrors code_python_ast_snapshot / code_ts_ast_snapshot patterns. In-memory
CanonicalDocument (no kebab-parse-code dep — boundary §6.3 respected).

verify:
- cargo test -p kebab-chunk --test code_go_ast_snapshot → 2/2
- cargo test --workspace --no-fail-fast -j 1 → 0 failures (all green)
- cargo clippy --workspace --all-targets -- -D warnings → clean
- SMOKE: confirmed chunk.ParseDoc symbol + code_lang_breakdown {"go": 1}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:54:17 +00:00
c19aa006d0 feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch
Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds
go_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=go, symbol=chunk.ParseDoc, code_lang=go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:13:47 +00:00
f1a4f67e12 feat(p10-1c-go): code-go-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-rust-ast-v1 / code-{python,ts,js}-ast-v1 with language-agnostic
body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:11:14 +00:00
6463c52827 feat(p10-1c-go): tree-sitter-go AST extractor (GoAstExtractor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:08:46 +00:00
2559d0d95a refactor(p10-1c-go): add go to ingest dispatch allowlist (bail until Task F)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:03:28 +00:00
4524830306 feat(p10-1c-go): route .go to MediaType::Code(go)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:01:29 +00:00
8cdd3903c7 build(p10-1c-go): add tree-sitter-go workspace dep
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:00:04 +00:00
8b89961ada docs(p10-1c-go): task spec + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:58:45 +00:00
eec90996aa chore: bump version 0.11.0 → 0.11.1
dogfood semantic cleanup (PR #150) lands: document-centric fetch_span +
assets.workspace_path 'last-registered' semantic explicitly documented.

Reason for patch bump: no change to the external wire / CLI / config
surface. New internal trait method (get_asset) + caller refactor +
doc-comment refresh. Fixes a (rare) case where fetch_span could take the
wrong branch for twin files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:09:46 +00:00
ce1c778b4a Merge pull request 'fix(dogfood): document-centric fetch_span + assets.workspace_path semantic doc' (#150) from fix/dogfood-asset-flip-flop-cleanup into main 2026-05-20 08:08:55 +00:00
453ec15df4 fix(dogfood): document-centric fetch_span + remove get_asset_by_workspace_path
assets.workspace_path is INTENTIONALLY 'last-registered path' for twin
files (identical content at different paths share one asset row PK'd by
blake3 content hash). PR #146 made try_skip_unchanged document-centric;
PR #149 made reset --orphans-only document-centric; this PR removes the
last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span,
which used it to reject PDF/audio media — for twins this could read the
wrong asset's media_type and pick the wrong branch).

Replaced with the natural 2-step lookup: get_document_by_workspace_path
(PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id
is PRIMARY KEY so flip-flop-immune by construction).

Then removed get_asset_by_workspace_path trait method + SqliteStore impl
— 0 callers after the refactor.

UPSERT doc-comment refreshed in store.rs to make the 'last-registered'
semantics explicit so future readers don't try to 'fix' the flip-flop.

Dogfood follow-up (PR #142 1B + multi-root corpus).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:03:38 +00:00
1e6de9fe9f chore: bump version 0.10.0 → 0.11.0
dogfood follow-up (PR #149) lands: kebab reset --orphans-only explicit
complement to PR #148's conservative sweep.

Reason for minor bump: new CLI flag (--orphans-only) + new ResetScope variant +
additive ResetReport fields = surface expansion. Meets the design §10.4 trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:53:55 +00:00
9fa2a1ebac Merge pull request 'feat(dogfood): kebab reset --orphans-only — explicit complement to PR #148 sweep' (#149) from feat/dogfood-reset-orphans-only into main 2026-05-20 07:50:43 +00:00
749c6ae240 docs(dogfood): sync reset_report schema + README for --orphans-only (PR #149 review)
Round 1 review found 2 doc gaps:
- docs/wire-schema/v1/reset_report.schema.json: 'orphans_only' missing
  from scope enum; orphans_purged/purged_paths properties absent
- README: --orphans-only not listed in the reset prose

Schema additions are additive minor (default values keep back-compat).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:47:44 +00:00
5f2bd9e97e feat(dogfood): kebab reset --orphans-only — purge stored docs outside walker scope
PR #148 auto-purges only filesystem-missing files (conservative — leaves
on-disk-but-out-of-scope docs alone for data safety). This is the explicit
complement: when the user has narrowed include / widened exclude / removed
a sub-directory from the workspace and WANTS the stored docs reconciled,
they invoke 'kebab reset --orphans-only'.

Confirm prompt with orphan count + sample paths; --yes required in
non-TTY. SQLite purge via existing purge_deleted_workspace_path (PR #148)
+ vector store delete_by_chunk_ids when configured. No fs existence
check — orphans-only is the explicit 'I know what I'm doing' variant.

dogfood follow-up to PR #148 (file deletion auto-purge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:38:10 +00:00
1ce06c1e2d chore: bump version 0.9.0 → 0.10.0
dogfood-discovered file-deletion auto-purge (PR #148) lands. Reason for
minor bump: additive wire field IngestReport.purged_deleted_files + new CLI
summary surface (purged N) + new user-visible behavior (automatic cleanup on
ingest after rm a.md). design §10.4 dogfooding-ready surface expansion trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:12:58 +00:00
d26efe167f Merge pull request 'fix(dogfood): auto-purge stored docs for filesystem-deleted files' (#148) from fix/dogfood-file-deletion-auto-purge into main 2026-05-20 07:10:33 +00:00
d6d165df01 docs(dogfood): sync sweep_deleted_files algorithm doc with try_exists (PR #148 nit)
Round 2 review found the function-level doc-comment still referenced the
old fs::exists() (now replaced by try_exists().unwrap_or(true) in commit
2baa846). One-line clarification — describes the conservative-on-Err
semantics so future readers don't reintroduce the data-safety bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:10:27 +00:00
2baa846c6b fix(dogfood): conservative try_exists() in sweep_deleted_files (PR #148 review)
Round 1 review found a data-safety bug: fs::exists() returns false on
errors like EACCES / EPERM / NFS-hiccup / ownership-change, which would
trigger purge on a file that is in fact still on disk (just unreadable
this moment). Switched to try_exists().unwrap_or(true) so transient FS
errors are CONSERVATIVELY treated as 'file present' — never purge on
uncertain signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:04:03 +00:00
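
A minimal sketch of the conservative check described in this commit, assuming hypothetical helper names (`is_definitely_missing`, `may_purge`). `Path::exists` folds I/O errors into `false`, whereas `Path::try_exists` returns `io::Result<bool>`, so the error case can be mapped to "present" with `unwrap_or(true)`.

```rust
use std::io;
use std::path::Path;

/// Hypothetical helper: `probe` is the result of `Path::try_exists`,
/// passed in separately so the error branch can be exercised directly.
fn is_definitely_missing(probe: io::Result<bool>) -> bool {
    // Err (EACCES, EPERM, transient NFS failure, ...) counts as "present":
    // never purge on an uncertain signal.
    !probe.unwrap_or(true)
}

/// Hypothetical wrapper: a stored path may be purged only when the
/// filesystem positively reports that it does not exist.
fn may_purge(path: &Path) -> bool {
    is_definitely_missing(path.try_exists())
}
```

Only `Ok(false)` leads to a purge here; `Ok(true)` and every `Err` keep the stored document.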
27baec82ea fix(dogfood): auto-purge stored docs for filesystem-deleted files
Files deleted from disk (rm a.md) were leaving stale documents + chunks +
embeddings in the store, surfacing as ghost citations in search/ask.
Existing purge_orphan_at_workspace_path only handled content-changed
stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no
new asset_id.

Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths),
for each candidate check filesystem existence — only purge when the
file is TRULY missing. Scope-narrowing case (file on disk but outside
include glob) is explicitly NOT purged to protect users from accidental
data loss via config edits.

Adds:
- DocumentStore::all_workspace_paths trait method + SqliteStore impl
- purge_deleted_workspace_path in store-sqlite (returns chunk_ids for
  vector delete; deletes doc CASCADE + asset row + copied storage file)
- sweep_deleted_files in kebab-app::ingest path; called once per ingest
  before the per-asset loop
- IngestReport.purged_deleted_files counter (additive, serde default)
- CLI ingest summary mentions purge count when > 0
- 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge

dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod
+ lodash). Per user decision: only filesystem deletion auto-purges;
scope narrowing requires explicit kebab reset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:51:07 +00:00
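
The sweep above can be sketched as a set difference followed by an existence filter. This is an illustration under assumptions: `sweep_candidates` and the `exists_on_disk` callback are hypothetical stand-ins, not the signatures of `sweep_deleted_files` or the store traits.

```rust
use std::collections::BTreeSet;

/// Hypothetical sketch: candidates = stored_paths - scanned_paths;
/// a candidate is purged only when it is truly missing on disk.
fn sweep_candidates<F>(
    stored: &BTreeSet<String>,
    scanned: &BTreeSet<String>,
    exists_on_disk: F,
) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    stored
        .difference(scanned)
        // On disk but outside the walker scope: keep (scope narrowing is not a delete).
        .filter(|p| !exists_on_disk(p.as_str()))
        .cloned()
        .collect()
}
```

Passing the existence check as a closure keeps the sketch testable without touching the filesystem; a real caller would plug in a conservative filesystem probe.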
acf8cf3be2 chore: bump version 0.8.3 → 0.9.0
dogfood-discovered routing additions (PR #147) land:
- .mts / .cts → MediaType::Code(typescript)
- .mdx → MediaType::Markdown

Reason for minor bump: expands the user dogfooding surface; 28+ files that
were previously skipped are now indexed. design §10.4 dogfooding-ready surface expansion = minor trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:29:27 +00:00
ea5f7b22c8 Merge pull request 'feat(dogfood): route .mts/.cts → typescript + .mdx → markdown' (#147) from feat/dogfood-routing-cts-mts-mdx into main 2026-05-20 06:28:41 +00:00
5497c6e7b5 feat(dogfood): route .mts/.cts to typescript + .mdx to markdown
Dogfood (PR #142 1B + multi-root: kebab-docs + httpx + zod + lodash)
showed 28 files skipped by extension that are routable to existing
extractors:
- .mts (ESM TypeScript) / .cts (CommonJS TypeScript) — same grammar as
  .ts in tree-sitter-typescript 0.23 (LANGUAGE_TYPESCRIPT covers JSX-
  agnostic variants; LANGUAGE_TSX stays for .tsx only)
- .mdx (Markdown + JSX) — routed as MediaType::Markdown; the md parser
  folds JSX islands through as raw passthrough

Changes:
- crates/kebab-source-fs/src/media.rs: 'mts'|'cts' → Code(typescript),
  'mdx' → Markdown. +2 unit tests.
- crates/kebab-parse-code/src/lang.rs: code_lang_for_path matches mts/cts;
  module_path_for_tsjs strips .mts/.cts as well. Test cases extended.
- crates/kebab-parse-code/src/typescript.rs: doc comment on select_grammar
  refreshed to mention .mts/.cts.
- crates/kebab-parse-code/tests/lang.rs: 2 new assertions.

verify: kebab-source-fs 44 / kebab-parse-code lib 20 + lang 4 all pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:24:21 +00:00
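
A minimal sketch of the extension routing this commit describes. The `MediaType` enum and `route_extension` function below are simplified, hypothetical stand-ins for the real types in `kebab-source-fs`; only the mapping itself (`mts`/`cts` to TypeScript code, `mdx` to Markdown) comes from the commit message.

```rust
/// Simplified, hypothetical stand-in for the real media type enum.
#[derive(Debug, PartialEq, Eq)]
enum MediaType {
    Markdown,
    Code(&'static str),
    Unsupported,
}

/// Hypothetical router: maps a bare file extension (no dot) to a media type.
fn route_extension(ext: &str) -> MediaType {
    match ext {
        // .mdx is routed like plain Markdown.
        "md" | "mdx" => MediaType::Markdown,
        // .mts / .cts share the plain TypeScript route with .ts.
        "ts" | "mts" | "cts" => MediaType::Code("typescript"),
        // Everything else is skipped by extension in this sketch.
        _ => MediaType::Unsupported,
    }
}
```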
5a90940f1c chore: bump version 0.8.2 → 0.8.3
dogfood-discovered fix (PR #146) lands: idempotent re-ingest now correctly
returns Unchanged for twin files (identical content at different paths)
via document-centric try_skip_unchanged lookup.

Reason for patch bump: restores the advertised idempotency to normal behavior. No new wire / config / surface changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:20:34 +00:00
4389b887f0 Merge pull request 'fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency' (#146) from fix/dogfood-bug4-idempotent-twin-files into main 2026-05-20 06:16:28 +00:00
360f825f3a docs(dogfood): refresh try_skip_unchanged doc-comment to match new flow (PR #146 review)
Round 1 review found the function-level doc-comment still described the
old asset-side algorithm (item 2 asset-row checksum, item 3 id_for_doc
miss). Updated to the document-centric flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:35:17 +00:00
641b92af7d fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency
Identical-content files at different workspace paths share one assets row
(assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT
`ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made
twin files overwrite each other's workspace_path on every ingest, so
`get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or
None), breaking idempotent unchanged-detection for both files.

Fix: switch try_skip_unchanged to document-centric lookup. `documents.
workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)`
includes path, so each twin has its own stable document row. Compare
`doc.source_asset_id` with the new asset's checksum instead of going
through the assets table.

Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of
726 docs marked Updated on every idempotent re-ingest — all 27 are
twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md
same content, duplicate logo PDFs/JPGs).

After: re-ingest reports 0 new / 0 updated / 726 unchanged.

No schema migration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:27:21 +00:00
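
The twin-file failure and the document-centric fix can be reproduced with a toy in-memory model. Everything below is a hypothetical illustration: `Store` and its methods only mimic the two tables with hash maps and are not the SqliteStore API.

```rust
use std::collections::HashMap;

/// Toy stand-in for the two tables; all names are hypothetical.
#[derive(Default)]
struct Store {
    /// assets: asset_id (content hash, primary key) -> last-registered workspace_path
    assets: HashMap<String, String>,
    /// documents: workspace_path (unique) -> source_asset_id
    documents: HashMap<String, String>,
}

impl Store {
    fn ingest(&mut self, path: &str, content_hash: &str) {
        // Mimics the UPSERT on asset_id: a twin overwrites workspace_path.
        self.assets.insert(content_hash.to_string(), path.to_string());
        self.documents.insert(path.to_string(), content_hash.to_string());
    }

    /// Old asset-side check: look the asset up by workspace_path.
    fn unchanged_asset_side(&self, path: &str, content_hash: &str) -> bool {
        self.assets
            .iter()
            .any(|(id, p)| p.as_str() == path && id.as_str() == content_hash)
    }

    /// Document-centric check: compare the document's source_asset_id.
    fn unchanged_document_side(&self, path: &str, content_hash: &str) -> bool {
        self.documents
            .get(path)
            .map(|id| id.as_str() == content_hash)
            .unwrap_or(false)
    }
}
```

After two files with identical content are ingested, the asset-side check misses the first path because the second ingest overwrote `workspace_path`, while the document-centric check still recognizes both paths as unchanged.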
89 changed files with 16983 additions and 140 deletions

103
Cargo.lock generated

@@ -4127,7 +4127,7 @@ dependencies = [
[[package]]
name = "kebab-app"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"base64 0.22.1",
@@ -4172,22 +4172,24 @@ dependencies = [
[[package]]
name = "kebab-chunk"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
"kebab-core",
"kebab-normalize",
"kebab-parse-code",
"kebab-parse-md",
"serde_json",
"serde_json_canonicalizer",
"serde_yaml",
"time",
"tracing",
]
[[package]]
name = "kebab-cli"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"clap",
@@ -4208,7 +4210,7 @@ dependencies = [
[[package]]
name = "kebab-config"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"dirs 5.0.1",
@@ -4223,7 +4225,7 @@ dependencies = [
[[package]]
name = "kebab-core"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4237,7 +4239,7 @@ dependencies = [
[[package]]
name = "kebab-embed"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4251,7 +4253,7 @@ dependencies = [
[[package]]
name = "kebab-embed-local"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"fastembed",
@@ -4264,7 +4266,7 @@ dependencies = [
[[package]]
name = "kebab-eval"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-app",
@@ -4283,7 +4285,7 @@ dependencies = [
[[package]]
name = "kebab-llm"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4292,7 +4294,7 @@ dependencies = [
[[package]]
name = "kebab-llm-local"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-config",
@@ -4309,7 +4311,7 @@ dependencies = [
[[package]]
name = "kebab-mcp"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-app",
@@ -4327,7 +4329,7 @@ dependencies = [
[[package]]
name = "kebab-normalize"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4342,7 +4344,7 @@ dependencies = [
[[package]]
name = "kebab-parse-code"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"gix",
@@ -4352,7 +4354,12 @@ dependencies = [
"time",
"tracing",
"tree-sitter",
"tree-sitter-c",
"tree-sitter-cpp",
"tree-sitter-go",
"tree-sitter-java",
"tree-sitter-javascript",
"tree-sitter-kotlin-ng",
"tree-sitter-python",
"tree-sitter-rust",
"tree-sitter-typescript",
@@ -4360,7 +4367,7 @@ dependencies = [
[[package]]
name = "kebab-parse-image"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"ab_glyph",
"anyhow",
@@ -4384,7 +4391,7 @@ dependencies = [
[[package]]
name = "kebab-parse-md"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4401,7 +4408,7 @@ dependencies = [
[[package]]
name = "kebab-parse-pdf"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4414,7 +4421,7 @@ dependencies = [
[[package]]
name = "kebab-parse-types"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"kebab-core",
"serde",
@@ -4422,7 +4429,7 @@ dependencies = [
[[package]]
name = "kebab-rag"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4443,7 +4450,7 @@ dependencies = [
[[package]]
name = "kebab-search"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"globset",
@@ -4462,7 +4469,7 @@ dependencies = [
[[package]]
name = "kebab-source-fs"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4481,7 +4488,7 @@ dependencies = [
[[package]]
name = "kebab-store-sqlite"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"blake3",
@@ -4502,7 +4509,7 @@ dependencies = [
[[package]]
name = "kebab-store-vector"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"arrow",
@@ -4526,7 +4533,7 @@ dependencies = [
[[package]]
name = "kebab-tui"
version = "0.8.2"
version = "0.16.1"
dependencies = [
"anyhow",
"crossterm",
@@ -8527,6 +8534,46 @@ dependencies = [
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-c"
version = "0.24.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a9b2eb57a55fed6b00812912e730b7a275cf4fe98bfd6a5d76263d4438371728"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-cpp"
version = "0.23.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "df2196ea9d47b4ab4a31b9297eaa5a5d19a0b121dceb9f118f6790ad0ab94743"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-go"
version = "0.25.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8560a4d2f835cc0d4d2c2e03cbd0dde2f6114b43bc491164238d333e28b16ea"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-java"
version = "0.23.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0aa6cbcdc8c679b214e616fd3300da67da0e492e066df01bcf5a5921a71e90d6"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-javascript"
version = "0.25.0"
@@ -8537,6 +8584,16 @@ dependencies = [
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-kotlin-ng"
version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e800ebbda938acfbf224f4d2c34947a31994b1295ee6e819b65226c7b51b4450"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-language"
version = "0.1.7"


@@ -31,7 +31,7 @@ edition = "2024"
rust-version = "1.85"
license = "MIT OR Apache-2.0"
repository = "https://github.com/altair823/kebab"
version = "0.8.2"
version = "0.16.1"
[workspace.dependencies]
anyhow = "1"
@@ -94,6 +94,14 @@ tree-sitter-rust = "0.24"
tree-sitter-python = "0.25.0"
tree-sitter-typescript = "0.23.2"
tree-sitter-javascript = "0.25.0"
# Go grammar for code ingest (kebab-parse-code, p10-1C-Go).
tree-sitter-go = "0.25.0"
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
tree-sitter-java = "0.23.5"
tree-sitter-kotlin-ng = "1.1.0" # bare tree-sitter-kotlin requires ts <0.23; -ng uses tree-sitter-language 0.1 (ts 0.26 compat)
# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D).
tree-sitter-c = "0.24.2"
tree-sitter-cpp = "0.23.4"
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
# behavior are unchanged — only DWARF debug info is reduced (line


@@ -4,7 +4,7 @@
## One-line summary
P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) are merged. `kebab ingest` handles markdown / image / PDF. `kebab search` / `kebab ask` return results + page citations across media. `kebab tui` provides 4 panels (Library + Search + Ask + Inspect): the user presses `?` to ask, `/` to search, Enter in Library / `i` in Search to inspect, and `g` in Search to jump to the editor. Next candidates = P9-5 (desktop tauri) or a system-dep brainstorm for the on-hold P8 (audio).
P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) are merged. `kebab ingest` handles markdown / image / PDF / source code (Rust / Python / TS / JS / Go / Java / Kotlin) / Tier 2 resource files (yaml/k8s / dockerfile / toml / json / xml / groovy / go-mod) + Tier 3 paragraph fallback (shell / non-k8s YAML / AST-failure cases). `kebab search` / `kebab ask` return results + page / code citations across media. `kebab tui` provides 4 panels (Library + Search + Ask + Inspect). P10-3 (Tier 3 paragraph fallback) is complete. P10-1D (C + C++) is complete, which rounds out the Tier 1 chunker family; next candidates = P9-5 (desktop tauri) or the on-hold P8 (audio).
## Phase roadmap
@@ -20,7 +20,7 @@ P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) are merged.
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ Done (3/3 components, page-level chunker + ingest wiring) |
| **P8** | Audio transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ On hold (whisper-rs system-dep brainstorm needed) |
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 In progress (4/5 components: P9-1/2/3/4 done [Library / Search / Ask / Inspect], P9-5 desktop planned · dogfooding feedback **20/20 ✅**) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 In progress: 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, tree-sitter-rust, `code-rust-ast-v1`, v0.7.0), **1B 🟡 PR open** (Python `code-python-ast-v1` + TypeScript `code-ts-ast-v1` + JavaScript `code-js-ast-v1`; all 3 languages ready for dogfooding, awaiting v0.8.0) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 In progress: 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, `code-rust-ast-v1`, v0.7.0), 1B ✅ (Python/TS/JS AST chunkers, v0.8.0 and later), **1C-Go ✅ (Go AST chunker, `code-go-ast-v1`, v0.12.0)**, **1C-JavaKotlin ✅ (Java + Kotlin AST chunkers, `code-java-ast-v1` / `code-kotlin-ast-v1`, v0.13.0)**, **2 ✅ (Tier 2 resource-aware: yaml/k8s + dockerfile + manifest, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1`, v0.14.0)**, **3 ✅ (Tier 3 paragraph fallback: code-text-paragraph-v1, v0.15.0)**, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1, v0.16.0)** |
P0~P5 are sequential. P6~P9 can run in parallel after P5.


@@ -34,7 +34,7 @@ cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin ke
To update, run `git pull && cargo install --path crates/kebab-cli --locked --force` or, for the git-URL form, `cargo install --git ... --force`.
To uninstall, run `cargo uninstall kebab-cli`. This removes only the binary and leaves workspace data in place. To clean up the data as well, run `kebab reset --all --yes` (wipes all 4 XDG paths: config + data + cache + state; **irreversible**, run `kebab init` again before restarting). Partial wipes include `kebab reset --data-only` (keeps config) and `kebab reset --vector-only` (Lance + `embedding_records` only; the next ingest re-embeds).
To uninstall, run `cargo uninstall kebab-cli`. This removes only the binary and leaves workspace data in place. To clean up the data as well, run `kebab reset --all --yes` (wipes all 4 XDG paths: config + data + cache + state; **irreversible**, run `kebab init` again before restarting). Partial wipes include `kebab reset --data-only` (keeps config), `kebab reset --vector-only` (Lance + `embedding_records` only; the next ingest re-embeds), and **`kebab reset --orphans-only`** (cleans up only stored docs outside the current walker scope, as an explicit reconcile after narrowing `config.workspace.include` or moving a sub-directory; files on the filesystem are not touched).
## Quick start
@@ -42,7 +42,7 @@ cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin ke
# First run: create the data directory + config.toml under the XDG paths
kebab init
# Edit the config: set workspace.root, model endpoint, etc. (supported formats: md / png / jpg / pdf / rs / py / ts / js)
# Edit the config: set workspace.root, model endpoint, etc. (supported formats: md / png / jpg / pdf / rs / py / ts / js / go)
${EDITOR:-vi} ~/.config/kebab/config.toml
# Index (Markdown / images / PDF all at once)
@@ -70,7 +70,7 @@ kebab doctor
| Command | Behavior |
|------|------|
| `kebab init` | Creates the data directory + config.toml under the XDG paths |
| `kebab ingest [<path>]` | Indexes Markdown / images / PDF / Rust source code (idempotent). On a TTY it shows a progress bar on stderr; on non-TTY (CI / pipe) it prints one line at a time to stderr; with `--json` it streams `ingest_progress.v1` lines to stdout and ends with `ingest_report.v1`. A single Ctrl-C finishes the current asset and then aborts (partial commits are kept, so a re-run is idempotent); a second Ctrl-C is a hard exit. If a Markdown title is missing from the frontmatter, it is filled in automatically in the order first H1 → H2 → first 80 characters of the first paragraph → file name (parser_version `md-frontmatter-v2`); already-indexed docs also pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest on, parse/chunk/embed/vector upsert is skipped automatically for unchanged docs (blake3 + parser/chunker/embedder versions all identical). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. **Supported formats** (the extractor is chosen automatically and cannot be specified in config): Markdown (`.md`), images (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **source code** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1`; all tree-sitter AST chunkers). Other extensions are skipped automatically: the reason goes to `IngestItem.warnings` (e.g. `"unsupported media type: .docx"`), counts are grouped in `IngestReport.skipped_by_extension`, and the CLI / TUI summary shows the breakdown. Code chunks carry `citation.kind = "code"` with `citation.lang = "<lang>"` + `symbol` + line range, and `code_lang` + `repo` (the directory name found by the `.git/` walk-up) are backfilled at the SearchHit top level. The `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--media code` filters enable per-language and code-only search (p10-1A-1 filter flags). Python symbols get a dotted module-path prefix derived from the workspace path (e.g. `kebab_eval.metrics.compute_mrr`); TS/JS symbols get a slash-style module-path prefix (e.g. `src/Foo.Foo.search`). |
| `kebab ingest [<path>]` | Indexes Markdown / images / PDF / Rust source code (idempotent). On a TTY it shows a progress bar on stderr; on non-TTY (CI / pipe) it prints one line at a time to stderr; with `--json` it streams `ingest_progress.v1` lines to stdout and ends with `ingest_report.v1`. A single Ctrl-C finishes the current asset and then aborts (partial commits are kept, so a re-run is idempotent); a second Ctrl-C is a hard exit. If a Markdown title is missing from the frontmatter, it is filled in automatically in the order first H1 → H2 → first 80 characters of the first paragraph → file name (parser_version `md-frontmatter-v2`); already-indexed docs also pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest on, parse/chunk/embed/vector upsert is skipped automatically for unchanged docs (blake3 + parser/chunker/embedder versions all identical). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. **Supported formats** (the extractor is chosen automatically and cannot be specified in config): Markdown (`.md`), images (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **source code** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1`, `.go` → `code-go-ast-v1`, `.java` → `code-java-ast-v1`, `.kt`/`.kts` → `code-kotlin-ast-v1`, `.c`/`.h` → `code-c-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` → `code-cpp-ast-v1`; all tree-sitter AST chunkers; **Tier 2 resource files**: `.yaml`/`.yml` → `k8s-manifest-resource-v1` (parses apiVersion+kind), `Dockerfile`/`Dockerfile.*`/`*.dockerfile` → `dockerfile-file-v1` (whole file), `Cargo.toml`/`pyproject.toml`/`.toml`/`package.json`/`tsconfig.json`/`.json`/`pom.xml`/`.xml`/`build.gradle`/`.gradle`/`go.mod` → `manifest-file-v1` (whole file), covering yaml (k8s) / dockerfile / toml / json / xml / groovy / go-mod); **Tier 3 paragraph fallback** (`.sh`/`.bash`/`.zsh` → `code-text-paragraph-v1`, blank-line paragraph split + 80-line/20-overlap line window. Used automatically as a fallback when Tier 1/2 yields 0 chunks or Err, which picks up cases such as non-k8s YAML. symbol = None; lang is preserved from the original.). Other extensions are skipped automatically: the reason goes to `IngestItem.warnings` (e.g. `"unsupported media type: .docx"`), counts are grouped in `IngestReport.skipped_by_extension`, and the CLI / TUI summary shows the breakdown. Code chunks carry `citation.kind = "code"` with `citation.lang = "<lang>"` + `symbol` + line range, and `code_lang` + `repo` (the directory name found by the `.git/` walk-up) are backfilled at the SearchHit top level. The `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--code-lang go` / `--code-lang java` / `--code-lang kotlin` / `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` / `--code-lang json` / `--code-lang xml` / `--code-lang groovy` / `--code-lang go-mod` / `--code-lang shell` / `--code-lang c` / `--code-lang cpp` / `--media code` filters enable per-language and code-only search (p10-1A-1 filter flags). Python symbols get a dotted module-path prefix derived from the workspace path (e.g. `kebab_eval.metrics.compute_mrr`); TS/JS symbols get a slash-style module-path prefix (e.g. `src/Foo.Foo.search`); Go symbols use the form `package.Func` / `package.(*Receiver).Method`; Java / Kotlin symbols use the form `com.foo.Foo.bar` (package + class + method/field). |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | Search. hybrid uses RRF fusion and includes citations. Repeating the same query (normalized with NFKC + trim + lowercase) within one process hits the in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, for debugging. When an ingest commit happens, the `kv['corpus_revision']` bump automatically makes every entry stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)**: agent budget controls. `--json` output is the `search_response.v1` wrapper (`{hits, next_cursor, truncated}`), which is not compatible with the pre-fb-34 bare array. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` is a repeatable flag (`--tag rust --tag async`) with OR matching, `--media` takes `,`-separated multiple values with OR matching, and the remaining flags combine with AND. `--trust-min` is one of `primary\|secondary\|generated` (includes that level and above). `--ingested-after` is RFC3339 UTC; a parse failure yields `error.v1.code = config_invalid` (exit 2). `--media md` is normalized as an alias of `markdown`. An unknown `--media` value always yields empty hits (not an error). **`--trace` (p9-fb-37)**: exposes lexical / vector pre-fusion candidates + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) in `search_response.v1.trace`. Trace requests bypass the cache (always cold, even without `--no-cache`). **`--bulk` (p9-fb-42)**: runs N queries at once from stdin ndjson. With `--json`, stdout gets per-query ndjson (`bulk_search_item.v1`) and stderr gets a summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. When an agent runs sub-queries in a batch after query decomposition, this is a single round trip: the App instance is reused, so cache / embedder cold-start costs are paid only once. A per-query failure is isolated in the item's `error` (error.v1), and the other queries keep going. **code corpus filters (p10-1A-1):** `--repo` is repeatable (`--repo kebab --repo other`) with OR matching. `--code-lang` accepts repeated or comma-separated multiple values (`--code-lang rust,python`); unknown values yield empty hits. `--media code` includes all Tier 1/2/3 code chunks. As of 1A-1 there are no indexed code chunks, so the filters always return empty results; they become effective once 1A-2 (Rust AST chunker) is merged. |
| `kebab list docs` | Lists indexed documents |
| `kebab inspect doc <id>` / `kebab inspect chunk <id>` | Shows the raw record |
@@ -132,7 +132,7 @@ flowchart TB
subgraph Pipeline["Domain + pipeline"]
parse["parse-md / parse-pdf / parse-image / parse-code"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-rust-ast-v1, code-python-ast-v1, code-ts-ast-v1, code-js-ast-v1)"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-{rust,python,ts,js,go,java,kotlin,c,cpp}-ast-v1, k8s-manifest-resource-v1, dockerfile-file-v1, manifest-file-v1, code-text-paragraph-v1)"]
embedder["embedder (fastembed multilingual-e5-large)"]
retriever["retriever (lexical / vector / hybrid RRF)"]
rag["RAG pipeline"]


@@ -189,10 +189,12 @@ fn fetch_span(
// (markdown / note / paper / reference / inbox) is the *user-facing*
// category, not the rendering format — the actual byte-level format
// lives on the source `RawAsset.media_type`. Look it up via
// workspace_path (unique key per asset).
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset_by_workspace_path(
// doc.source_asset_id (PRIMARY KEY) so twin files (identical content
// at different paths) always read *this* document's own asset row,
// not whichever twin last wrote `assets.workspace_path`.
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset(
&app.sqlite,
&doc.workspace_path,
&doc.source_asset_id,
)? {
if matches!(
asset.media_type,


@@ -39,7 +39,7 @@ use std::sync::Arc;
use anyhow::{Context, anyhow};
use serde::{Deserialize, Serialize};
use kebab_chunk::{CodeJsAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTsAstV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
use kebab_chunk::{CodeCAstV1Chunker, CodeCppAstV1Chunker, CodeGoAstV1Chunker, CodeJavaAstV1Chunker, CodeJsAstV1Chunker, CodeKotlinAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTextParagraphV1Chunker, CodeTsAstV1Chunker, DockerfileFileV1Chunker, K8sManifestResourceV1Chunker, ManifestFileV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
use kebab_core::{
Answer, Block, CanonicalDocument, Chunk, ChunkId, ChunkPolicy, ChunkerVersion, Chunker,
DocFilter, DocSummary, DocumentId, DocumentStore, Embedder, EmbeddingInput,
@@ -50,7 +50,7 @@ use kebab_core::{
use kebab_llm_local::OllamaLanguageModel;
use kebab_normalize::build_canonical_document;
use kebab_parse_image::{ImageExtractor, OllamaVisionOcr, apply_caption, apply_ocr};
use kebab_parse_code::{JavascriptAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
use kebab_parse_code::{CAstExtractor, CppAstExtractor, GoAstExtractor, JavaAstExtractor, JavascriptAstExtractor, KotlinAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
use kebab_parse_pdf::PdfTextExtractor;
use kebab_parse_md::{BodyHints, parse_blocks, parse_frontmatter};
use kebab_source_fs::FsSourceConnector;
@@ -71,7 +71,7 @@ mod staleness;
pub use app::{App, SearchResponse};
pub use ingest_progress::{AggregateCounts, IngestEvent, render_skipped_breakdown};
pub use reset::{ResetReport, ResetScope};
pub use reset::{ResetReport, ResetScope, enumerate_orphans};
pub use error_wire::{ERROR_V1_ID, ErrorV1, StructuredError, classify};
pub use fetch::fetch_with_config;
#[doc(hidden)]
@@ -375,6 +375,28 @@ pub fn ingest_with_config_opts(
.map(|d| d.doc_id.0)
.collect();
// Dogfood: post-walker sweep to remove stored docs whose source
// file has been deleted from the filesystem. Must run BEFORE the
// per-asset loop so the loop's New/Updated labelling is based on
// the post-purge store state (the purged doc_ids won't be in
// `existing_doc_ids` above — they were already removed, OR the
// sweep here removes them before we start counting).
//
// Critical design invariant: only purge when the file is TRULY
// absent from disk. A file that is still on disk but outside the
// current walker scope (config narrowing / include-glob change) is
// NOT purged — we leave it in place to protect against accidental
// data loss via config edits.
let scanned_paths: std::collections::HashSet<kebab_core::WorkspacePath> = assets
.iter()
.map(|a| a.workspace_path.clone())
.collect();
let purged_deleted_files = sweep_deleted_files(
&app,
&scanned_paths,
vector_store.as_ref().map(|v| v.as_ref()),
)?;
let started_at = time::OffsetDateTime::now_utc();
let mut items: Vec<kebab_core::IngestItem> = Vec::new();
@@ -647,11 +669,11 @@ pub fn ingest_with_config_opts(
crate::ingest_progress::emit(progress, terminal_event);
// p9-fb-19: bump the persistent corpus_revision counter when a
// commit landed (any new / updated). This invalidates every
// commit landed (any new / updated / purged). This invalidates every
// entry in any in-process LRU search cache (in this process or
// a sibling) on the next lookup. No-op when nothing changed
// (skipped-only run) — the cache stays valid.
if new_count > 0 || updated_count > 0 {
if new_count > 0 || updated_count > 0 || purged_deleted_files > 0 {
match app.sqlite.bump_corpus_revision() {
Ok(rev) => tracing::debug!(
target: "kebab-app",
@@ -682,6 +704,7 @@ pub fn ingest_with_config_opts(
skipped_generated: fs_skips.skipped_generated,
skipped_size_exceeded: fs_skips.skipped_size_exceeded,
skip_examples: fs_skips.skip_examples,
purged_deleted_files,
items: if summary_only { None } else { Some(items) },
})
}
@@ -748,15 +771,18 @@ struct ImagePipeline<'a> {
/// hold (per design §9 cascade rule):
///
/// 1. `force_reingest == false` — caller hasn't asked to bypass skip.
/// 2. The freshly-scanned asset's blake3 checksum equals what the
/// existing `assets` row stores at the same `workspace_path`.
/// 3. The doc keyed on `(workspace_path, asset_id, current_parser_version)`
/// exists. If the parser_version changed, `id_for_doc` produces a
/// different `doc_id` so the lookup misses → no skip → re-process.
/// 4. The existing doc's stamped `last_chunker_version` AND
/// `last_embedding_version` match the values the caller is about
/// to use (`Some(v) == Some(v)` and `None == None` — see design
/// doc for the `None == None` rule when no embedder is configured).
/// 2. A document already exists at this `workspace_path`
/// (`get_document_by_workspace_path`). The lookup is document-side, not
/// asset-side, so twin files (identical content at different paths) each
/// hit their own stable doc row — `documents.workspace_path` is UNIQUE
/// while `assets` may dedupe content into a single row with a flip-flop
/// `workspace_path` column (dogfood bug #4, see `tasks/HOTFIXES.md`).
/// 3. The existing doc's `source_asset_id` equals the freshly-scanned
/// asset's blake3 checksum (content unchanged).
/// 4. The existing doc's `parser_version` matches the current extractor's
/// `parser_version` (extractor not upgraded). Combined with `chunker_version`
/// and `last_embedding_version` checks immediately below — full cascade
/// per design §9.
///
/// Returns `Ok(None)` (proceed with full re-process) when any check
/// fails or any DB read errors out — the skip path is opportunistic;
@@ -769,35 +795,24 @@ fn try_skip_unchanged(
current_chunker_version: &ChunkerVersion,
current_embedding_version: Option<&kebab_core::EmbeddingVersion>,
force_reingest: bool,
fallback_chunker_version: Option<&ChunkerVersion>, // p10-3 fix
) -> anyhow::Result<Option<kebab_core::IngestItem>> {
if force_reingest {
return Ok(None);
}
let existing_asset = match app
// Document-centric skip: look up the existing document row by
// workspace_path directly. This avoids the twin-file flip-flop
// that the old asset-side lookup suffers from — multiple files
// with identical content share one `assets` row whose
// `workspace_path` is overwritten on every UPSERT, so
// `get_asset_by_workspace_path(path1)` could return the OTHER
// twin's path (or None) after any ingest of the twin. The
// `documents` table has a UNIQUE index on `workspace_path` (V001),
// so each twin has its own stable row regardless of asset de-dup.
let existing_doc = match app
.sqlite
.get_asset_by_workspace_path(&asset.workspace_path)
.get_document_by_workspace_path(&asset.workspace_path)
{
Ok(Some(a)) => a,
Ok(None) => return Ok(None),
Err(e) => {
tracing::debug!(
target: "kebab-app",
path = %asset.workspace_path.0,
error = %e,
"skip-check: get_asset_by_workspace_path failed; falling through to re-process"
);
return Ok(None);
}
};
if existing_asset.checksum != asset.checksum {
return Ok(None);
}
let candidate_doc_id = kebab_core::id_for_doc(
&asset.workspace_path,
&asset.asset_id,
current_parser_version,
);
let existing_doc = match app.sqlite.get_document(&candidate_doc_id) {
Ok(Some(d)) => d,
Ok(None) => return Ok(None),
Err(e) => {
@@ -805,21 +820,81 @@ fn try_skip_unchanged(
target: "kebab-app",
path = %asset.workspace_path.0,
error = %e,
"skip-check: get_document failed; falling through to re-process"
"skip-check: get_document_by_workspace_path failed; falling through to re-process"
);
return Ok(None);
}
};
// 1. Content unchanged: the freshly-computed asset_id (blake3
// content hash) must match what this document was ingested from.
if existing_doc.source_asset_id != asset.asset_id {
return Ok(None);
}
// p10-3 fix: detect "stored doc was previously Tier 3 fallback".
// When a Tier 1/2 extractor emits empty chunks, the fallback wrapper
// retries with CodeTextParagraphV1Chunker and stores
// last_chunker_version = "code-text-paragraph-v1" + parser_version = "none-v1".
// On the next ingest the caller computes current_parser_version /
// current_chunker_version from the Tier 1/2 dispatch (e.g.
// "k8s-manifest-resource-v1"), which can never match the stored
// fallback values, causing spurious re-ingests. Detect this case
// and bypass the parser/chunker equality checks — only the embedder
// version still must match.
let stored_is_tier3_fallback = fallback_chunker_version.is_some_and(|fbv| {
existing_doc.last_chunker_version.as_ref() == Some(fbv)
&& existing_doc.parser_version.0 == "none-v1"
});
if stored_is_tier3_fallback {
// Embedder version still must match.
let embedder_match = existing_doc.last_embedding_version.as_ref()
== current_embedding_version;
if !embedder_match {
return Ok(None);
}
let candidate_doc_id = existing_doc.doc_id.clone();
tracing::debug!(
target: "kebab-app::ingest",
path = %asset.workspace_path.0,
doc_id = %candidate_doc_id.0,
"skip-unchanged: tier 3 fallback state detected; bypassing parser/chunker equality"
);
return Ok(Some(kebab_core::IngestItem {
kind: kebab_core::IngestItemKind::Unchanged,
doc_id: Some(candidate_doc_id),
doc_path: asset.workspace_path.clone(),
asset_id: Some(asset.asset_id.clone()),
byte_len: Some(asset.byte_len),
block_count: u32::try_from(existing_doc.blocks.len()).ok(),
chunk_count: None,
parser_version: Some(existing_doc.parser_version.clone()),
chunker_version: existing_doc.last_chunker_version.clone(),
warnings: Vec::new(),
error: None,
}));
}
// 2. Parser unchanged: parser_version is baked into id_for_doc so
// a version bump yields a different doc_id and the row above
// would have been missing. Checking here explicitly keeps the
// logic self-documenting and guards against future id_for_doc
// changes.
if existing_doc.parser_version != *current_parser_version {
return Ok(None);
}
// 3. Chunker unchanged.
let chunker_match = existing_doc.last_chunker_version.as_ref()
== Some(current_chunker_version);
if !chunker_match {
return Ok(None);
}
// 4. Embedder unchanged.
let embedder_match = existing_doc.last_embedding_version.as_ref()
== current_embedding_version;
if !embedder_match {
return Ok(None);
}
let candidate_doc_id = existing_doc.doc_id.clone();
tracing::debug!(
target: "kebab-app::ingest",
path = %asset.workspace_path.0,
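The skip cascade in this hunk (content hash, then the Tier 3 fallback bypass, then parser, chunker and embedder equality) can be condensed into one pure predicate. The sketch below is a simplified model, not the actual function: `StoredDoc`, `Current` and `can_skip` are hypothetical names, versions are plain strings, and the store lookup and its error handling are left out.

```rust
/// Hypothetical, simplified view of the stored document row.
struct StoredDoc {
    source_asset_id: String,
    parser_version: String,
    last_chunker_version: Option<String>,
    last_embedding_version: Option<String>,
}

/// Hypothetical bundle of the values the current run would use.
struct Current<'a> {
    asset_id: &'a str,
    parser_version: &'a str,
    chunker_version: &'a str,
    embedding_version: Option<&'a str>,
    fallback_chunker_version: Option<&'a str>,
}

/// True when the asset may be reported as unchanged without re-processing.
fn can_skip(stored: &StoredDoc, cur: &Current<'_>, force_reingest: bool) -> bool {
    if force_reingest {
        return false;
    }
    // Content unchanged: the stored doc was built from the same content hash.
    if stored.source_asset_id != cur.asset_id {
        return false;
    }
    // `None == None` also counts as a match (no embedder configured).
    let embedder_match = stored.last_embedding_version.as_deref() == cur.embedding_version;
    // A doc stored via the Tier 3 fallback can never equal the Tier 1/2
    // versions, so only the embedder version is compared in that case.
    let stored_is_fallback = cur.fallback_chunker_version.is_some_and(|fbv| {
        stored.last_chunker_version.as_deref() == Some(fbv) && stored.parser_version == "none-v1"
    });
    if stored_is_fallback {
        return embedder_match;
    }
    stored.parser_version == cur.parser_version
        && stored.last_chunker_version.as_deref() == Some(cur.chunker_version)
        && embedder_match
}
```

Keeping the decision in a pure function like this makes each branch easy to unit test without a database.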
@@ -918,9 +993,12 @@ fn ingest_one_asset(
force_reingest,
);
}
// p10-1A-2 / 1B: code ingest dispatch.
// p10-1A-2 / 1B: code ingest dispatch. p10-2: Tier 2 langs added. p10-3: shell added. p10-1D: c/cpp added.
MediaType::Code(lang)
if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript") =>
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "shell" | "c" | "cpp") =>
{
return ingest_one_code_asset(
app,
@@ -984,6 +1062,7 @@ fn ingest_one_asset(
&MdHeadingV1Chunker.chunker_version(),
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
None,
)? {
return Ok(item);
}
@@ -1178,6 +1257,7 @@ fn ingest_one_image_asset(
&MdHeadingV1Chunker.chunker_version(),
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
None,
)? {
return Ok(item);
}
@@ -1446,6 +1526,120 @@ fn purge_vector_orphans_for_workspace_path(
Ok(())
}
/// Dogfood: post-walker sweep that purges stored documents whose source
/// file has been physically deleted from the filesystem.
///
/// Algorithm:
/// 1. Query `documents` for every `workspace_path` currently stored.
/// 2. Compute `orphan_candidates = stored_paths - scanned_paths`.
/// 3. For each candidate: resolve to an absolute path and call
/// `Path::try_exists().unwrap_or(true)` — transient FS errors
/// (EACCES, NFS hiccup, ownership change) conservatively count as
/// "still present" so we never purge on uncertain signal. If the
/// file still exists on disk it was merely out-of-scope this run
/// (config narrowing / include-glob change) — leave it alone. Only
/// files that are truly absent trigger a purge.
/// 4. For absent files: call `purge_deleted_workspace_path` (SQLite
/// cascade delete + optional copied-asset file removal) and, if a
/// vector store is present, delete the associated vectors.
///
/// Returns the number of documents purged.
///
/// Non-fatal design: individual purge failures are logged and counted
/// as errors on the per-file level but do NOT abort the sweep — a
/// partial failure is preferable to blocking the rest of ingest. The
/// return value only counts successful purges.
fn sweep_deleted_files(
app: &App,
scanned_paths: &std::collections::HashSet<kebab_core::WorkspacePath>,
vector_store: Option<&kebab_store_vector::LanceVectorStore>,
) -> anyhow::Result<u32> {
use kebab_core::DocumentStore as _;
let stored_paths = app
.sqlite
.all_workspace_paths()
.context("sweep_deleted_files: all_workspace_paths")?;
if stored_paths.is_empty() {
return Ok(0);
}
let workspace_root = app.config.resolve_workspace_root();
let mut purged: u32 = 0;
for stored_path in stored_paths {
if scanned_paths.contains(&stored_path) {
continue; // still in scope — skip
}
// Resolve to an absolute path and check existence on disk.
// Use `try_exists` + `unwrap_or(true)` so transient FS errors
// (EACCES on a path we lack read on, NFS hiccups, ownership
// change) are CONSERVATIVELY treated as "file still present" —
// never purge on uncertain signal (data-safety: PR #148 review).
// `exists()` would return false on Err and trigger a wrongful
// purge. Files whose path cannot be joined (theoretically
// impossible for non-empty workspace_path strings, but
// defense-in-depth) are likewise treated as still present.
let abs = workspace_root.join(&stored_path.0);
if abs.try_exists().unwrap_or(true) {
// File is on disk but not in this scan's scope (config
// narrowing). DO NOT purge — critical design constraint.
tracing::debug!(
target: "kebab-app",
path = %stored_path.0,
"sweep_deleted_files: file on disk but out of scope — leaving in store"
);
continue;
}
// File is truly absent → purge.
let chunk_ids = match kebab_store_sqlite::purge_deleted_workspace_path(
&app.sqlite,
&stored_path,
) {
Ok(ids) => ids,
Err(e) => {
tracing::warn!(
target: "kebab-app",
path = %stored_path.0,
error = %e,
"sweep_deleted_files: purge failed; skipping this path"
);
continue;
}
};
// Purge associated vectors (best-effort; partial failure
// acceptable — orphan vectors get cleaned by `kebab reset
// --vector-only` if they accumulate).
if let Some(vec) = vector_store {
if !chunk_ids.is_empty() {
use kebab_core::VectorStore as _;
if let Err(e) = vec.delete_by_chunk_ids(&chunk_ids) {
tracing::warn!(
target: "kebab-app",
path = %stored_path.0,
count = chunk_ids.len(),
error = %e,
"sweep_deleted_files: vector delete failed; SQLite side already cleaned"
);
}
}
}
tracing::info!(
target: "kebab-app",
path = %stored_path.0,
"sweep_deleted_files: purged document for deleted file"
);
purged = purged.saturating_add(1);
}
Ok(purged)
}
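The selection rule used by `sweep_deleted_files` (stored minus scanned, then purge only when the file is confirmed absent) can be isolated from the store and vector plumbing. The following is a minimal sketch under stated assumptions: paths are plain `String`s rather than `WorkspacePath`, and `purge_candidates` is a hypothetical helper that only selects paths and deletes nothing.

```rust
use std::collections::HashSet;
use std::path::Path;

/// Returns the stored paths that qualify for purging: not seen by the
/// current scan AND confirmed absent on disk.
fn purge_candidates(root: &Path, stored: &[String], scanned: &HashSet<String>) -> Vec<String> {
    stored
        .iter()
        // Still in scope for this scan: never a candidate.
        .filter(|p| !scanned.contains(*p))
        // `try_exists` returns Err when existence cannot be determined
        // (for example a permission problem); `unwrap_or(true)` treats
        // that as "still present", so nothing is purged on doubt.
        .filter(|p| !root.join(p.as_str()).try_exists().unwrap_or(true))
        .cloned()
        .collect()
}
```

A file that is on disk but outside the scan survives both filters' intent: it fails the second filter and is left alone, which matches the data-safety constraint stated in the comments above.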
/// P7-3: process one `MediaType::Pdf` asset end-to-end.
///
/// - Reads bytes from disk.
@@ -1510,6 +1704,7 @@ fn ingest_one_pdf_asset(
&PdfPageV1Chunker.chunker_version(),
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
None,
)? {
return Ok(item);
}
@@ -1683,18 +1878,54 @@ fn ingest_one_code_asset(
"python" => ParserVersion(kebab_parse_code::PYTHON_PARSER_VERSION.to_string()),
"typescript" => ParserVersion(kebab_parse_code::TS_PARSER_VERSION.to_string()),
"javascript" => ParserVersion(kebab_parse_code::JS_PARSER_VERSION.to_string()),
"go" => ParserVersion(kebab_parse_code::GO_PARSER_VERSION.to_string()),
"java" => ParserVersion(kebab_parse_code::JAVA_PARSER_VERSION.to_string()),
"kotlin" => ParserVersion(kebab_parse_code::KOTLIN_PARSER_VERSION.to_string()),
// p10-2: Tier 2 has no parse step — sentinel "none-v1".
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
=> ParserVersion("none-v1".to_string()),
// p10-3: shell direct routes to Tier 3 (no parse step).
"shell" => ParserVersion("none-v1".to_string()),
// p10-1D: C + C++ AST extractors.
"c" => ParserVersion(kebab_parse_code::C_PARSER_VERSION.to_string()),
"cpp" => ParserVersion(kebab_parse_code::CPP_PARSER_VERSION.to_string()),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
// p10-1b Task D/G/J/L: chunker_version per-lang.
let chunker_version = match code_lang {
let mut chunker_version = match code_lang {
"rust" => CodeRustAstV1Chunker.chunker_version(),
"python" => CodePythonAstV1Chunker.chunker_version(),
"typescript" => CodeTsAstV1Chunker.chunker_version(),
"javascript" => CodeJsAstV1Chunker.chunker_version(),
"go" => CodeGoAstV1Chunker.chunker_version(),
"java" => CodeJavaAstV1Chunker.chunker_version(),
"kotlin" => CodeKotlinAstV1Chunker.chunker_version(),
// p10-2 Tier 2:
"yaml" => K8sManifestResourceV1Chunker.chunker_version(),
"dockerfile" => DockerfileFileV1Chunker.chunker_version(),
"toml" | "json" | "xml" | "groovy" | "go-mod"
=> ManifestFileV1Chunker.chunker_version(),
// p10-3:
"shell" => CodeTextParagraphV1Chunker.chunker_version(),
// p10-1D: C + C++ AST chunkers.
"c" => CodeCAstV1Chunker.chunker_version(),
"cpp" => CodeCppAstV1Chunker.chunker_version(),
other => anyhow::bail!("unreachable chunker_version: {other}"),
};
// p10-3 fix: if this lang can fall back to Tier 3, compute the fallback
// chunker_version so try_skip_unchanged can detect the stored-as-Tier-3
// state and skip parser/chunker equality checks.
let tier3_fallback_cv: Option<ChunkerVersion> = match code_lang {
"rust" | "python" | "typescript" | "javascript"
| "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "c" | "cpp" // p10-1D
=> Some(CodeTextParagraphV1Chunker.chunker_version()),
_ => None,
};
if let Some(item) = try_skip_unchanged(
app,
asset,
@@ -1702,6 +1933,7 @@ fn ingest_one_code_asset(
&chunker_version,
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
tier3_fallback_cv.as_ref(),
)? {
return Ok(item);
}
@@ -1717,37 +1949,159 @@ fn ingest_one_code_asset(
};
// p10-1b Task D/G/J/L: extractor per-lang.
let mut canonical = match code_lang {
// p10-3: capture Result so Tier 1 extractor errors can fall back to Tier 3.
let canonical_result: anyhow::Result<kebab_core::CanonicalDocument> = match code_lang {
"rust" => RustAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::RustAstExtractor::extract (code:rust)")?,
.context("kb-parse-code::RustAstExtractor::extract (code:rust)"),
"python" => PythonAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::PythonAstExtractor::extract (code:python)")?,
.context("kb-parse-code::PythonAstExtractor::extract (code:python)"),
"typescript" => TypescriptAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)")?,
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)"),
"javascript" => JavascriptAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)")?,
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)"),
"go" => GoAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::GoAstExtractor::extract (code:go)"),
"java" => JavaAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::JavaAstExtractor::extract (code:java)"),
"kotlin" => KotlinAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::KotlinAstExtractor::extract (code:kotlin)"),
// p10-2 Tier 2: no extractor — synthesize Document directly from raw bytes.
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" => {
synthesize_tier2_document(asset, &bytes, code_lang, &parser_version)
}
// p10-3: shell reuses the same synthesizer.
"shell" => synthesize_tier2_document(asset, &bytes, "shell", &parser_version),
// p10-1D: C + C++ AST extractors.
"c" => CAstExtractor::new()
.extract(&ctx, &bytes)
.context("kebab-parse-code::CAstExtractor::extract (code:c)"),
"cpp" => CppAstExtractor::new()
.extract(&ctx, &bytes)
.context("kebab-parse-code::CppAstExtractor::extract (code:cpp)"),
other => anyhow::bail!("unreachable (extract): {other}"),
};
// p10-3: Tier 1 extractor failure → fall back to Tier 3 synthesized doc.
// Tier 2 (yaml/dockerfile/…) and shell errors are real (e.g. non-UTF-8) — propagate.
let mut canonical = match canonical_result {
Ok(d) => d,
Err(e) if code_lang == "shell"
|| matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod") =>
{
return Err(e).context("synthesize_tier2_document failed for tier 2/3 lang");
}
Err(e) => {
// Tier 1 extractor errored — fall back to Tier 3 synthesized doc.
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
error = %e,
"tier1 extract errored; falling back to tier 3 synthesized doc"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
let tier3_parser_version = ParserVersion("none-v1".to_string());
synthesize_tier2_document(asset, &bytes, code_lang, &tier3_parser_version)
.context("synthesize_tier2_document for tier 3 fallback after extract error")?
}
};
// p10-1b Task D/G/J/L: chunker per-lang.
let chunks = match code_lang {
"rust" => CodeRustAstV1Chunker
// p10-3: track whether the extract stage already fell back to Tier 3.
// Tier 2 langs already have "none-v1" parser_version normally, so exclude them
// from the extract_fell_back guard with the !matches! exclusion.
let extract_fell_back = canonical.parser_version.0 == "none-v1"
&& !matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" | "shell");
let chunks_result: anyhow::Result<Vec<Chunk>> = if extract_fell_back {
// Tier 1 lang whose extractor errored — go straight to Tier 3 chunker.
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)")?,
"python" => CodePythonAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)")?,
"typescript" => CodeTsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)")?,
"javascript" => CodeJsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)")?,
other => anyhow::bail!("unreachable (chunk): {other}"),
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 after extract fallback)")
} else {
match code_lang {
"rust" => CodeRustAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)"),
"python" => CodePythonAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)"),
"typescript" => CodeTsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)"),
"javascript" => CodeJsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)"),
"go" => CodeGoAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeGoAstV1Chunker::chunk (code:go)"),
"java" => CodeJavaAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJavaAstV1Chunker::chunk (code:java)"),
"kotlin" => CodeKotlinAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeKotlinAstV1Chunker::chunk (code:kotlin)"),
// p10-2 Tier 2:
"yaml" => K8sManifestResourceV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::K8sManifestResourceV1Chunker::chunk"),
"dockerfile" => DockerfileFileV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::DockerfileFileV1Chunker::chunk"),
"toml" | "json" | "xml" | "groovy" | "go-mod" => ManifestFileV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::ManifestFileV1Chunker::chunk"),
// p10-3:
"shell" => CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (code:shell)"),
// p10-1D: C + C++ AST chunkers.
"c" => CodeCAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kebab-chunk::CodeCAstV1Chunker::chunk (code:c)"),
"cpp" => CodeCppAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kebab-chunk::CodeCppAstV1Chunker::chunk (code:cpp)"),
other => anyhow::bail!("unreachable (chunk): {other}"),
}
};
// p10-3: Tier 1/2 0-chunk OR error → Tier 3 fallback retry.
// "shell" direct path is already Tier 3 — don't retry-double-up.
let chunks: Vec<Chunk> = match chunks_result {
Ok(v) if !v.is_empty() => v,
other if code_lang == "shell" => other?, // shell propagates directly
Ok(_empty) => {
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
"tier1/2 emitted 0 chunks; falling back to tier 3 (code-text-paragraph-v1)"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
canonical.parser_version = ParserVersion("none-v1".to_string());
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback)")?
}
Err(e) => {
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
error = %e,
"tier1/2 chunker errored; falling back to tier 3 (code-text-paragraph-v1)"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
canonical.parser_version = ParserVersion("none-v1".to_string());
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback after error)")?
}
};
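The retry rule above (a Tier 1/2 chunker that errors or returns zero chunks is replaced by the Tier 3 paragraph chunker) follows a generic "primary, then fallback" shape. The sketch below shows that shape only. `chunk_with_fallback` and `paragraphs` are hypothetical, chunks are plain strings, and the blank-line split is an assumed stand-in for whatever `CodeTextParagraphV1Chunker` actually does.

```rust
/// Run `primary`; if it errors or yields nothing, retry with `fallback`.
/// The returned flag says whether the fallback was used, so a caller
/// could stamp the fallback chunker version on the document.
fn chunk_with_fallback<P, F>(
    text: &str,
    primary: P,
    fallback: F,
) -> Result<(Vec<String>, bool), String>
where
    P: Fn(&str) -> Result<Vec<String>, String>,
    F: Fn(&str) -> Result<Vec<String>, String>,
{
    match primary(text) {
        Ok(chunks) if !chunks.is_empty() => Ok((chunks, false)),
        // Empty output and errors are handled alike: both trigger the retry.
        Ok(_) | Err(_) => fallback(text).map(|chunks| (chunks, true)),
    }
}

/// Hypothetical stand-in for a paragraph chunker: split on blank lines.
fn paragraphs(text: &str) -> Result<Vec<String>, String> {
    Ok(text
        .split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .map(String::from)
        .collect())
}
```

Unlike this sketch, the real code exempts `shell` from the retry because its direct path is already Tier 3, and it logs the primary error before falling back.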
// Stamp chunker + embedding versions so incremental skip detection has
@@ -1842,6 +2196,139 @@ fn ingest_one_code_asset(
})
}
/// p10-2: Build a minimal [`CanonicalDocument`] for Tier 2 code assets
/// (yaml / dockerfile / toml / json / xml / groovy / go-mod) that have
/// no AST extractor. Produces a single `Block::Code` whose source span
/// covers the entire file, mirroring the shape the Tier 1 extractors
/// produce for glue / top-level regions.
fn synthesize_tier2_document(
asset: &RawAsset,
bytes: &[u8],
code_lang: &str,
parser_version: &ParserVersion,
) -> anyhow::Result<kebab_core::CanonicalDocument> {
use anyhow::Context as _;
use kebab_core::{
BlockId, CodeBlock, CommonBlock, Lang, Metadata, Provenance, ProvenanceEvent,
ProvenanceKind, SourceSpan, SourceType, TrustLevel, id_for_block, id_for_doc,
};
let text = std::str::from_utf8(bytes)
.with_context(|| format!("tier2 doc not utf-8: {}", asset.workspace_path.0))?
.to_string();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, parser_version);
let n_lines = text.lines().count().max(1) as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: n_lines,
symbol: Some("<file>".to_string()),
lang: Some(code_lang.to_string()),
};
let block_id: BlockId = id_for_block(
&doc_id,
"code",
&[],
0,
&span,
);
let block = kebab_core::Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: vec![],
source_span: span,
},
lang: Some(code_lang.to_string()),
code: text,
});
let now = time::OffsetDateTime::now_utc();
let events = vec![
ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
},
ProvenanceEvent {
at: now,
agent: "kb-app".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; tier2_synthesized; lang={}",
parser_version.0, code_lang
)),
},
];
// Resolve absolute path for repo detection. FsSourceConnector always
// emits absolute paths in SourceUri::File (verified in connector.rs); Kb
// URIs were rejected earlier in ingest_one_code_asset (returns Skipped),
// so the fallback below is purely defensive. This does NOT mirror
// RustAstExtractor — that extractor joins ctx.workspace_root for relative
// paths, but Tier 2 trusts the connector invariant.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => p.clone(),
kebab_core::SourceUri::Kb(_) => std::path::PathBuf::new(),
};
let (repo, git_branch, git_commit) = match kebab_parse_code::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let title = {
let fname = asset.workspace_path.0
.rsplit('/')
.next()
.unwrap_or(&asset.workspace_path.0);
// strip extension
match fname.rfind('.') {
Some(i) => fname[..i].to_string(),
None => fname.to_string(),
}
};
let metadata = Metadata {
aliases: vec![],
tags: vec![],
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: serde_json::Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some(code_lang.to_string()),
};
tracing::debug!(
target: "kebab-app",
"synthesized tier2 doc_id={} workspace_path={} lang={}",
doc_id.0,
asset.workspace_path.0,
code_lang,
);
Ok(kebab_core::CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks: vec![block],
metadata,
provenance: Provenance { events },
parser_version: parser_version.clone(),
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
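Two small derivations inside `synthesize_tier2_document` are easy to check in isolation: the title (file name minus its last extension) and the whole-file line span. The sketch below copies that logic into standalone helpers; the names `title_from_path` and `whole_file_span` are hypothetical and exist only for illustration.

```rust
/// File name without its last extension, taken from a '/'-separated path.
fn title_from_path(workspace_path: &str) -> String {
    let fname = workspace_path
        .rsplit('/')
        .next()
        .unwrap_or(workspace_path);
    match fname.rfind('.') {
        // Everything before the LAST dot; "a.tar.gz" keeps "a.tar".
        Some(i) => fname[..i].to_string(),
        None => fname.to_string(),
    }
}

/// 1-based inclusive line span covering the whole text. Empty text still
/// gets a one-line span because of the `max(1)` clamp.
fn whole_file_span(text: &str) -> (u32, u32) {
    (1, text.lines().count().max(1) as u32)
}
```

One consequence worth noting: with this rule a dotfile such as `.dockerignore` yields an empty title, because the only dot sits at index 0. Whether that is intended is not stated in the source.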
/// Pull the BCP-47 language hint from the canonical document. P6-1
/// stamps `Lang("und")` by default; image-pipeline OCR / caption
/// adapters special-case "und" so the hint is intentionally dropped


@@ -9,13 +9,19 @@
//!
//! `--vector-only` additionally truncates `embedding_records` in SQLite
//! so the next `kebab ingest` re-embeds cleanly without orphan rows.
//!
//! `--orphans-only` purges stored docs that are outside the current walker
//! scope (config narrowing / removed sub-directory). No filesystem paths are
//! removed — this is purely a store-level reconciliation.
use std::collections::HashSet;
use std::path::PathBuf;
use anyhow::{Context, Result};
use serde::{Deserialize, Serialize};
use kebab_config::{Config, expand_path};
use kebab_core::WorkspacePath;
/// What the user asked to remove. Mutually exclusive — picked by the CLI
/// from a clap `ArgGroup`.
@@ -32,6 +38,13 @@ pub enum ResetScope {
VectorOnly,
/// Wipe only the config dir.
ConfigOnly,
/// Purge stored docs that are outside the current walker scope (no
/// filesystem paths are removed). Filesystem existence is NOT checked —
/// anything the current walker would not visit is considered an orphan.
/// The explicit complement to the conservative `sweep_deleted_files`
/// that runs during ingest (which leaves on-disk-but-out-of-scope docs
/// alone for data safety).
OrphansOnly,
}
/// Result of a successful wipe — emitted as `reset_report.v1` by the
@@ -41,6 +54,16 @@ pub struct ResetReport {
pub scope: ResetScope,
pub removed_paths: Vec<PathBuf>,
pub embedding_rows_truncated: u64,
/// Number of stored docs purged because they are outside the current
/// walker scope. Non-zero only when `scope == OrphansOnly`.
/// `#[serde(default)]` preserves back-compat with older callers that
/// do not include this field.
#[serde(default)]
pub orphans_purged: u32,
/// Paths of the orphaned docs that were purged. Sorted for deterministic
/// output. Non-empty only when `scope == OrphansOnly`.
#[serde(default)]
pub purged_paths: Vec<WorkspacePath>,
}
/// Compute the absolute on-disk paths a given scope will wipe, given a
@@ -67,6 +90,10 @@ pub fn enumerate_paths(scope: ResetScope, cfg: &Config) -> Vec<PathBuf> {
vec![vector_dir]
}
ResetScope::ConfigOnly => vec![cfg_dir],
// OrphansOnly operates purely at the store level — no filesystem paths
// are removed. Return empty so `estimate_size_bytes` stays zero and
// the existing confirm UI path for directory wipes is skipped.
ResetScope::OrphansOnly => vec![],
}
}
@@ -96,16 +123,82 @@ pub fn estimate_size_bytes(paths: &[PathBuf]) -> u64 {
paths.iter().map(|p| walk(p)).sum()
}
/// Compute the workspace paths stored in SQLite that are NOT visited by
/// the current walker scope (i.e. they are "orphans" — on disk but
/// outside the configured include/exclude rules, or from a sub-directory
/// that has since been removed from the workspace).
///
/// Does NOT check filesystem existence — `OrphansOnly` is the explicit
/// "I know what I'm doing" variant; callers that want the conservative
/// fs-aware sweep should use `sweep_deleted_files` inside ingest.
///
/// Returns the list sorted for deterministic output. Called twice by the
/// CLI path (once for the confirm UI preview, once inside `execute`);
/// the double scan is acceptable for a rare destructive operation.
pub fn enumerate_orphans(cfg: &Config) -> Result<Vec<WorkspacePath>> {
use kebab_core::DocumentStore as _;
use kebab_source_fs::FsSourceConnector;
use kebab_core::SourceScope;
let store = kebab_store_sqlite::SqliteStore::open(cfg)
.context("enumerate_orphans: open SqliteStore")?;
let stored = store
.all_workspace_paths()
.context("enumerate_orphans: all_workspace_paths")?;
if stored.is_empty() {
return Ok(Vec::new());
}
// Build the same SourceScope the CLI's ingest path uses: root from
// config, exclude list from config, no include override (full scope).
let root = cfg.resolve_workspace_root();
let scope = SourceScope {
root: root.clone(),
exclude: cfg.workspace.exclude.clone(),
..Default::default()
};
let connector = FsSourceConnector::new(cfg)
.context("enumerate_orphans: build FsSourceConnector")?;
let (assets, _skips) = connector
.scan_with_skips(&scope)
.context("enumerate_orphans: scan workspace")?;
let scanned: HashSet<WorkspacePath> = assets
.into_iter()
.map(|a| a.workspace_path)
.collect();
let mut orphans: Vec<WorkspacePath> = stored
.into_iter()
.filter(|p| !scanned.contains(p))
.collect();
orphans.sort_by(|a, b| a.0.cmp(&b.0));
Ok(orphans)
}
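// Illustrative sketch, not part of this diff: the orphan computation above
// reduces to a sorted set difference between the stored paths and the paths
// the walker scanned. `String` stands in for `WorkspacePath`, and
// `orphaned_paths` is a hypothetical name used only for this example.

```rust
use std::collections::HashSet;

/// Return every stored path the scan did not visit, sorted so that
/// repeated runs produce the same output.
fn orphaned_paths(stored: Vec<String>, scanned: &HashSet<String>) -> Vec<String> {
    let mut orphans: Vec<String> = stored
        .into_iter()
        // A stored path is an orphan when the current scan never saw it.
        .filter(|p| !scanned.contains(p))
        .collect();
    orphans.sort();
    orphans
}
```

// Note that filesystem existence plays no part here: a file that is still on
// disk but excluded from the scan counts as an orphan, as the doc comment on
// `enumerate_orphans` states.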
/// Wipe every path from `enumerate_paths(scope, cfg)`. For
/// `ResetScope::VectorOnly`, also truncates the SQLite
/// `embedding_records` table so the store doesn't point at the Lance
/// rows we just removed off-disk.
///
/// For `ResetScope::OrphansOnly`, no filesystem directories are removed.
/// Instead the store is reconciled: stored docs outside the current walker
/// scope are purged from SQLite (+ vector store when configured). The
/// caller is expected to have already shown the confirm UI using
/// `enumerate_orphans`.
///
/// Idempotent: a missing path is treated as already-removed (success).
/// Returns a `ResetReport` listing exactly what was removed (paths that
/// existed before the call) so `--json` callers see the truth, not the
/// request.
pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
if matches!(scope, ResetScope::OrphansOnly) {
return execute_orphans_only(cfg);
}
let paths = enumerate_paths(scope, cfg);
let mut removed = Vec::new();
@@ -128,9 +221,100 @@ pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
scope,
removed_paths: removed,
embedding_rows_truncated,
orphans_purged: 0,
purged_paths: Vec::new(),
})
}
/// Execute the `OrphansOnly` variant: reconcile stored docs against the
/// current walker scope without touching any filesystem directory.
fn execute_orphans_only(cfg: &Config) -> Result<ResetReport> {
let orphans = enumerate_orphans(cfg)
.context("execute_orphans_only: enumerate orphans")?;
if orphans.is_empty() {
return Ok(ResetReport {
scope: ResetScope::OrphansOnly,
removed_paths: Vec::new(),
embedding_rows_truncated: 0,
orphans_purged: 0,
purged_paths: Vec::new(),
});
}
let store = std::sync::Arc::new(
kebab_store_sqlite::SqliteStore::open(cfg)
.context("execute_orphans_only: open SqliteStore")?,
);
// Open the vector store if configured. Mirror the guard the ingest path
// uses: construct it only when the provider is not "none" and dims > 0.
let vector_store: Option<kebab_store_vector::LanceVectorStore> =
open_vector_store_if_configured(cfg, store.clone())?;
let mut purged_paths: Vec<WorkspacePath> = Vec::new();
for path in &orphans {
let chunk_ids = kebab_store_sqlite::purge_deleted_workspace_path(&store, path)
.with_context(|| format!("execute_orphans_only: purge {}", path.0))?;
if let Some(ref vs) = vector_store {
if !chunk_ids.is_empty() {
use kebab_core::VectorStore as _;
if let Err(e) = vs.delete_by_chunk_ids(&chunk_ids) {
tracing::warn!(
target: "kebab-app",
path = %path.0,
count = chunk_ids.len(),
error = %e,
"reset --orphans-only: vector delete failed; SQLite side already cleaned"
);
}
}
}
tracing::info!(
target: "kebab-app",
path = %path.0,
"reset --orphans-only: purged orphan document"
);
purged_paths.push(path.clone());
}
let orphans_purged = u32::try_from(purged_paths.len()).unwrap_or(u32::MAX);
Ok(ResetReport {
scope: ResetScope::OrphansOnly,
removed_paths: Vec::new(),
embedding_rows_truncated: 0,
orphans_purged,
purged_paths,
})
}
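/// Open the Lance vector store if the configured embedding provider is
/// active (non-"none", dimensions > 0). Returns `None` for lexical-only
/// configs, and also when the store cannot be opened (a warning is logged
/// and the vector delete is skipped). Mirrors the guard in `App::vector`.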
fn open_vector_store_if_configured(
cfg: &Config,
store: std::sync::Arc<kebab_store_sqlite::SqliteStore>,
) -> Result<Option<kebab_store_vector::LanceVectorStore>> {
if cfg.models.embedding.provider == "none" || cfg.models.embedding.dimensions == 0 {
return Ok(None);
}
match kebab_store_vector::LanceVectorStore::new(cfg, store) {
Ok(vs) => Ok(Some(vs)),
Err(e) => {
tracing::warn!(
target: "kebab-app",
error = %e,
"reset --orphans-only: could not open vector store; skipping vector delete"
);
Ok(None)
}
}
}
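// Illustrative sketch, not part of this diff: the guard in
// `open_vector_store_if_configured` can be read as a single predicate.
// `vector_store_enabled` is a hypothetical helper, and the `u32` type for
// `dimensions` is an assumption; the real config field type is not shown here.

```rust
/// True only when an embedding provider is configured AND it reports a
/// non-zero dimension count. Either condition failing means lexical-only.
fn vector_store_enabled(provider: &str, dimensions: u32) -> bool {
    provider != "none" && dimensions > 0
}
```

// This is the negation of the early-return condition above
// (`provider == "none" || dimensions == 0`), so both forms agree.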
/// Open the SQLite store at the configured path and run
/// `truncate_embedding_records`. Returns the count of truncated rows
/// (the helper itself reports `DELETE` rowcount). If the SQLite file
@@ -200,4 +384,14 @@ mod tests {
let bytes = estimate_size_bytes(&[dir.path().to_path_buf()]);
assert_eq!(bytes, 5 + 6);
}
#[test]
fn enumerate_orphans_only_returns_empty_paths() {
let cfg = Config::defaults();
let paths = enumerate_paths(ResetScope::OrphansOnly, &cfg);
assert!(
paths.is_empty(),
"OrphansOnly must return empty vec from enumerate_paths"
);
}
}


@@ -390,6 +390,641 @@ fn javascript_file_ingests_and_searches_as_code_citation() {
);
}
/// p10-1c-go Task F: a `.go` file in a sub-directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="go"`,
/// `symbol="chunk.ParseDoc"`, and `line_start >= 1`.
/// The sub-directory (`chunk/`) ensures the Go package-prefix wiring
/// produces a non-empty module prefix so the fully-qualified symbol assertion
/// exercises that path end-to-end.
#[test]
fn go_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let pkg_dir = env.workspace_root.join("chunk");
std::fs::create_dir_all(&pkg_dir).unwrap();
std::fs::write(
pkg_dir.join("ast.go"),
"package chunk\n\nfunc ParseDoc(input string) string {\n return input\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0);
assert!(report.new >= 1);
let go_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("ast.go"))
.expect("ast.go item present");
assert_eq!(
go_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-go-v1"),
"parser_version must be code-go-v1"
);
assert_eq!(
go_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-go-ast-v1"),
"chunker_version must be code-go-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("ParseDoc"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
.expect("Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("go"), "citation.lang must be 'go'");
assert_eq!(
symbol.as_deref(),
Some("chunk.ParseDoc"),
"citation.symbol must be 'chunk.ParseDoc'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("go"),
"SearchHit.code_lang must be 'go'"
);
}
/// p10-1c-jk Task F: a `.java` file in a package directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="java"`,
/// `symbol="com.foo.Foo.bar"`, and `line_start >= 1`.
/// The sub-directory (`com/foo/`) ensures the Java package-prefix wiring
/// produces a non-empty module prefix so the fully-qualified symbol assertion
/// exercises that path end-to-end.
#[test]
fn java_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let pkg_dir = env.workspace_root.join("com").join("foo");
std::fs::create_dir_all(&pkg_dir).unwrap();
std::fs::write(
pkg_dir.join("Foo.java"),
"package com.foo;\n\npublic class Foo {\n public String bar() { return \"x\"; }\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0);
assert!(report.new >= 1);
let java_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("Foo.java"))
.expect("Foo.java item present");
assert_eq!(
java_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-java-v1"),
"parser_version must be code-java-v1"
);
assert_eq!(
java_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-java-ast-v1"),
"chunker_version must be code-java-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("bar"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
.expect("Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("java"), "citation.lang must be 'java'");
assert_eq!(
symbol.as_deref(),
Some("com.foo.Foo.bar"),
"citation.symbol must be 'com.foo.Foo.bar'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("java"),
"SearchHit.code_lang must be 'java'"
);
}
/// p10-1c-jk Task I: a `.kt` file in a package directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="kotlin"`,
/// `symbol="com.foo.Foo.bar"`, and `line_start >= 1`.
/// The sub-directory (`com/foo/`) ensures the Kotlin package-prefix wiring
/// produces a non-empty module prefix so the fully-qualified symbol assertion
/// exercises that path end-to-end.
#[test]
fn kotlin_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let pkg_dir = env.workspace_root.join("com").join("foo");
std::fs::create_dir_all(&pkg_dir).unwrap();
std::fs::write(
pkg_dir.join("Foo.kt"),
"package com.foo\n\nclass Foo {\n fun bar(): String = \"x\"\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0);
assert!(report.new >= 1);
let kt_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("Foo.kt"))
.expect("Foo.kt item present");
assert_eq!(
kt_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-kotlin-v1"),
"parser_version must be code-kotlin-v1"
);
assert_eq!(
kt_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-kotlin-ast-v1"),
"chunker_version must be code-kotlin-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("bar"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
.expect("Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("kotlin"), "citation.lang must be 'kotlin'");
assert_eq!(
symbol.as_deref(),
Some("com.foo.Foo.bar"),
"citation.symbol must be 'com.foo.Foo.bar'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("kotlin"),
"SearchHit.code_lang must be 'kotlin'"
);
}
/// p10-2 Task H: a `k8s/deploy.yaml` file with a Deployment resource is
/// ingested and the resulting `Citation::Code` hit must carry
/// `lang="yaml"`, `symbol="Deployment/prod/api"`, and `line_start >= 1`.
/// Exercises the k8s-manifest-resource-v1 chunker end-to-end.
#[test]
fn tier2_k8s_yaml_ingest_searchable() {
let env = TestEnv::lexical_only();
let k8s_dir = env.workspace_root.join("k8s");
std::fs::create_dir_all(&k8s_dir).unwrap();
std::fs::write(
k8s_dir.join("deploy.yaml"),
"apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: api\n namespace: prod\nspec:\n replicas: 1\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "yaml file ingested: {report:?}");
let yaml_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.yaml"))
.expect("deploy.yaml item present");
assert_eq!(
yaml_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1"
);
assert_eq!(
yaml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("k8s-manifest-resource-v1"),
"chunker_version must be k8s-manifest-resource-v1"
);
let query = kebab_core::SearchQuery {
text: "api".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["yaml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'api'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("yaml"), "citation.lang must be 'yaml'");
assert_eq!(
symbol.as_deref(),
Some("Deployment/prod/api"),
"citation.symbol must be 'Deployment/prod/api'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("yaml"),
"SearchHit.code_lang must be 'yaml'"
);
}
/// p10-2 Task H: a `Dockerfile` is ingested and the resulting
/// `Citation::Code` hit must carry `lang="dockerfile"`,
/// `symbol="<dockerfile>"`, and `line_start >= 1`.
/// Exercises the dockerfile-file-v1 chunker end-to-end.
#[test]
fn tier2_dockerfile_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("Dockerfile"),
"FROM rust:1.94\nRUN cargo install foo\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "Dockerfile ingested: {report:?}");
let df_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("Dockerfile"))
.expect("Dockerfile item present");
assert_eq!(
df_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1"
);
assert_eq!(
df_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("dockerfile-file-v1"),
"chunker_version must be dockerfile-file-v1"
);
let query = kebab_core::SearchQuery {
text: "cargo".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["dockerfile".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'cargo'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("dockerfile"),
"citation.lang must be 'dockerfile'"
);
assert_eq!(
symbol.as_deref(),
Some("<dockerfile>"),
"citation.symbol must be '<dockerfile>'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("dockerfile"),
"SearchHit.code_lang must be 'dockerfile'"
);
}
/// p10-2 Task H: a `Cargo.toml` manifest is ingested and the resulting
/// `Citation::Code` hit must carry `lang="toml"`, `symbol="<manifest>"`,
/// and `line_start >= 1`.
/// Exercises the manifest-file-v1 chunker end-to-end.
#[test]
fn tier2_cargo_toml_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("Cargo.toml"),
"[package]\nname = \"demo\"\nversion = \"0.1.0\"\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "Cargo.toml ingested: {report:?}");
let toml_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("Cargo.toml"))
.expect("Cargo.toml item present");
assert_eq!(
toml_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1"
);
assert_eq!(
toml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("manifest-file-v1"),
"chunker_version must be manifest-file-v1"
);
let query = kebab_core::SearchQuery {
text: "demo".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["toml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'demo'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("toml"),
"citation.lang must be 'toml'"
);
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"citation.symbol must be '<manifest>'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("toml"),
"SearchHit.code_lang must be 'toml'"
);
}
/// p10-3 Task E: a `.sh` file is ingested via the shell direct-Tier-3 path
/// and the resulting `Citation::Code` hit must carry `lang="shell"`,
/// `symbol=None`, `line_start >= 1`, and
/// `chunker_version = "code-text-paragraph-v1"`.
#[test]
fn tier3_shell_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("deploy.sh"),
"#!/usr/bin/env bash\nset -e\necho hello\n\nkebab ingest --json\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "shell file ingested: {report:?}");
let sh_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
.expect("deploy.sh item present");
assert_eq!(
sh_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1 for shell (Tier 3 direct)"
);
assert_eq!(
sh_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-text-paragraph-v1"),
"chunker_version must be code-text-paragraph-v1 for shell"
);
let query = kebab_core::SearchQuery {
text: "kebab".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["shell".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'kebab'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("shell"),
"citation.lang must be 'shell'"
);
assert_eq!(*symbol, None, "Tier 3 symbol must be None");
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("shell"),
"SearchHit.code_lang must be 'shell'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-text-paragraph-v1",
"shell chunks must be stamped with the Tier 3 chunker_version"
);
}
/// p10-3 Task E: a docker-compose-shaped YAML file (no `apiVersion`/`kind`)
/// is ingested; the k8s chunker returns `Ok(vec![])` and the Tier 3 fallback
/// wrapper retries with `CodeTextParagraphV1Chunker`. The resulting
/// `Citation::Code` hit must carry `lang="yaml"`, `symbol=None`,
/// `line_start >= 1`, and `chunker_version = "code-text-paragraph-v1"`.
#[test]
fn tier3_yaml_fallback_picks_up_non_k8s_yaml() {
let env = TestEnv::lexical_only();
// docker-compose-shaped YAML — version + services but no apiVersion/kind.
// The k8s chunker returns Ok(vec![]); Tier 3 fallback should pick this up.
std::fs::write(
env.workspace_root.join("docker-compose.yml"),
"version: '3'\nservices:\n api:\n image: nginx:latest\n ports:\n - 8080:80\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(
report.new >= 1,
"expected non-k8s yaml ingested via Tier 3, got {} new docs",
report.new
);
let yaml_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
.expect("docker-compose.yml item present");
assert_eq!(
yaml_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1 after Tier 3 fallback"
);
assert_eq!(
yaml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-text-paragraph-v1"),
"chunker_version must be code-text-paragraph-v1 after Tier 3 fallback"
);
let query = kebab_core::SearchQuery {
text: "nginx".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["yaml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'nginx'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("yaml"),
"citation.lang must be 'yaml'"
);
assert_eq!(*symbol, None, "Tier 3 fallback symbol must be None");
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("yaml"),
"SearchHit.code_lang must be 'yaml'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-text-paragraph-v1",
"non-k8s yaml fallback must be stamped code-text-paragraph-v1"
);
}
/// Re-ingesting the same `.rs` file without changes must report
/// `Unchanged` (incremental-skip path exercised).
#[test]
@@ -429,3 +1064,328 @@ fn rust_file_re_ingest_is_unchanged() {
);
assert_eq!(item2.doc_id, item1.doc_id);
}
/// p10-3 fix regression: a docker-compose YAML that falls back to Tier 3
/// (k8s chunker returns empty, CodeTextParagraphV1Chunker retries) must
/// report Unchanged on the second ingest rather than re-processing.
/// Before the fix, try_skip_unchanged returned None because the stored
/// last_chunker_version ("code-text-paragraph-v1" / parser_version
/// "none-v1") never matched the caller's dispatch values.
#[test]
fn tier3_yaml_fallback_reingest_is_unchanged() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("docker-compose.yml"),
"version: '3'\nservices:\n api:\n image: nginx:latest\n",
)
.unwrap();
let report1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest");
let item1 = report1
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
.expect("docker-compose.yml in first report");
assert!(
matches!(item1.kind, IngestItemKind::New),
"first ingest must be New, got {:?}", item1.kind
);
assert_eq!(
item1.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-text-paragraph-v1"),
"first ingest must use Tier 3 fallback chunker"
);
let report2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest");
let item2 = report2
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
.expect("docker-compose.yml in second report");
assert!(
matches!(item2.kind, IngestItemKind::Unchanged),
"second ingest must be Unchanged, got {:?}", item2.kind
);
}
/// p10-1d Task G: a `.c` file with a single top-level function is ingested
/// and the resulting `Citation::Code` hit must carry `lang="c"`,
/// `symbol="parse_record"` (function name only — no nesting in C), and
/// `chunker_version = "code-c-ast-v1"`.
#[test]
fn tier1_c_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("parser.c"),
"#include <stdio.h>\n\nint parse_record(const char *line) {\n if (line == NULL) return -1;\n return 0;\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "c file ingested: {report:?}");
let c_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("parser.c"))
.expect("parser.c item present");
assert_eq!(
c_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-c-v1"),
"parser_version must be code-c-v1"
);
assert_eq!(
c_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-c-ast-v1"),
"chunker_version must be code-c-ast-v1"
);
let query = kebab_core::SearchQuery {
text: "parse_record".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["c".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'parse_record'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("c"), "citation.lang must be 'c'");
assert_eq!(
symbol.as_deref(),
Some("parse_record"),
"C symbol must be function name only (no nesting)"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("c"),
"SearchHit.code_lang must be 'c'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-c-ast-v1",
"C chunks must be stamped with code-c-ast-v1"
);
}
/// p10-1d Task G: a `.cpp` file with nested namespace + class is ingested
/// and the resulting `Citation::Code` hit must carry `lang="cpp"`, a
/// `symbol` that starts with `"kebab::chunk::Foo"` (namespace::Class or
/// namespace::Class::method), and `chunker_version = "code-cpp-ast-v1"`.
#[test]
fn tier1_cpp_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("chunker.cpp"),
"namespace kebab {\nnamespace chunk {\nclass Foo {\npublic:\n void bar() { /* impl */ }\n};\n}\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "cpp file ingested: {report:?}");
let cpp_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("chunker.cpp"))
.expect("chunker.cpp item present");
assert_eq!(
cpp_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-cpp-v1"),
"parser_version must be code-cpp-v1"
);
assert_eq!(
cpp_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-cpp-ast-v1"),
"chunker_version must be code-cpp-ast-v1"
);
let query = kebab_core::SearchQuery {
text: "bar".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["cpp".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'bar'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("cpp"), "citation.lang must be 'cpp'");
// Symbol could be "kebab::chunk::Foo" (class) or "kebab::chunk::Foo::bar"
// (method) depending on which chunk ranks first.
assert!(
symbol.as_deref().is_some_and(|s| s.starts_with("kebab::chunk::Foo")),
"C++ symbol must start with namespace::Class prefix, got {:?}", symbol
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("cpp"),
"SearchHit.code_lang must be 'cpp'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-cpp-ast-v1",
"C++ chunks must be stamped with code-cpp-ast-v1"
);
}
/// P10 dogfood regression: a k8s YAML with 2 documents (Deployment + Service
/// separated by `---`) must ingest without a UNIQUE constraint violation.
/// Before the fix, push_chunks_with_oversize emitted split_key=None for each
/// resource, giving every resource chunk the same id_hash → identical chunk_id
/// → SQLite UNIQUE constraint failure on the second resource.
#[test]
fn tier2_k8s_multi_resource_yaml_ingests_without_collision() {
let env = TestEnv::lexical_only();
let k8s_dir = env.workspace_root.join("k8s");
std::fs::create_dir_all(&k8s_dir).unwrap();
std::fs::write(
k8s_dir.join("k8s-multi.yaml"),
"apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: api\n namespace: prod\nspec:\n replicas: 2\n---\napiVersion: v1\nkind: Service\nmetadata:\n name: api\n namespace: prod\nspec:\n selector:\n app: api\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
// Before the fix, this item carried an error with a UNIQUE constraint message.
let item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("k8s-multi.yaml"))
.expect("k8s-multi.yaml in report");
assert!(
item.error.is_none(),
"multi-resource k8s yaml must ingest without error, got: {:?}",
item.error
);
assert!(
matches!(item.kind, IngestItemKind::New),
"expected New, got {:?}",
item.kind
);
// Both resources must be searchable (≥2 hits: Deployment/prod/api + Service/prod/api).
let query = kebab_core::SearchQuery {
text: "api".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["yaml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
assert!(
hits.len() >= 2,
"expected ≥2 hits (Deployment + Service), got {}",
hits.len()
);
}
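// Illustrative sketch, not part of this diff: why a per-resource split key
// removes the collision described above. `chunk_id` here is a hypothetical
// stand-in that joins strings; the real derivation hashes its inputs and is
// not shown in this section. The `u32` key type is also an assumption.

```rust
/// Derive an id from a per-document base hash and an optional split key.
/// With `None`, every chunk from the same document gets the same id.
fn chunk_id(base_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        // A distinct key (e.g. the resource's starting line) yields a distinct id.
        Some(key) => format!("{}:{}", base_hash, key),
        // No key: the id is just the base hash, so siblings collide.
        None => base_hash.to_string(),
    }
}
```

// Single-chunk files can keep passing `None`: with one chunk per file there
// is no sibling to collide with, and the id stays stable.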
/// p10-3 fix regression: a shell file (direct Tier 3, not a fallback)
/// must also report Unchanged on re-ingest. Shell goes straight to
/// CodeTextParagraphV1Chunker, so `stored_is_tier3_fallback` is false
/// (parser_version is "none-v1" and the chunker matches the current
/// dispatch); the normal equality path alone must produce the skip.
#[test]
fn tier3_shell_reingest_is_unchanged() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("deploy.sh"),
"#!/usr/bin/env bash\nset -e\necho hello\n",
)
.unwrap();
let report1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest");
let item1 = report1
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
.expect("deploy.sh in first report");
assert!(
matches!(item1.kind, IngestItemKind::New),
"first ingest must be New, got {:?}", item1.kind
);
let report2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest");
let item2 = report2
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
.expect("deploy.sh in second report");
assert!(
matches!(item2.kind, IngestItemKind::Unchanged),
"shell reingest must be Unchanged, got {:?}", item2.kind
);
}
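
The k8s regression test above relies on each resource in a multi-document manifest getting its own `chunk_id`. A minimal sketch of why a per-resource split key avoids the collision, assuming a toy id function: `toy_chunk_id` is a hypothetical stand-in for `id_for_chunk` built on std's `DefaultHasher`, it ignores block ids, and the doc id, version label, policy hash and line numbers are illustrative values only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for `id_for_chunk`: hashes doc id, chunker
/// version and id_hash with std's `DefaultHasher` (the real code differs).
fn toy_chunk_id(doc_id: &str, chunker_version: &str, id_hash: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (doc_id, chunker_version, id_hash).hash(&mut h);
    h.finish()
}

/// Same convention as `make_chunk`: `None` keeps the bare policy hash,
/// `Some(line)` appends `#L<line>`.
fn id_hash(base_policy_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        Some(k) => format!("{base_policy_hash}#L{k}"),
        None => base_policy_hash.to_string(),
    }
}

/// Count the distinct ids produced for resources starting at `line_starts`.
fn distinct_ids(line_starts: &[u32], use_split_key: bool) -> usize {
    line_starts
        .iter()
        .map(|&ls| {
            let key = if use_split_key { Some(ls) } else { None };
            toy_chunk_id("doc-1", "k8s-toy-v1", &id_hash("abcd1234", key))
        })
        .collect::<HashSet<u64>>()
        .len()
}

fn main() {
    // Two resources in one file, at illustrative start lines 1 and 9.
    println!("without split key: {}", distinct_ids(&[1, 9], false));
    println!("with split key:    {}", distinct_ids(&[1, 9], true));
}
```

With the split key disabled, both resources hash the same inputs and collapse to one id, which is the shape of the UNIQUE constraint failure described in the commit message.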


@@ -0,0 +1,178 @@
//! Dogfood: auto-purge stored docs for filesystem-deleted files.
//!
//! Two tests:
//!
//! 1. `file_deletion_auto_purge` — ingest 2 files, delete one, re-ingest.
//! The re-ingest must report `purged_deleted_files = 1`, the deleted
//! file must no longer appear in `list_docs`, and lexical search for
//! its unique content must return no hits.
//!
//! 2. `include_scope_narrowing_does_not_purge` — ingest 2 files under a
//! wide glob, narrow the walker scope to only one file, re-ingest.
//! The narrowed ingest must NOT purge the out-of-scope file because
//! the file is still on disk (just excluded from this run). Protects
//! users against accidental data loss via config edits.
mod common;
use common::TestEnv;
use kebab_app::ingest_with_config_opts;
use kebab_app::IngestOpts;
use kebab_core::{DocFilter, DocumentStore, SearchMode, SearchQuery, SourceScope};
/// Helper: open the store via `TestEnv` and run `list_documents`.
fn list_doc_paths(env: &TestEnv) -> Vec<String> {
use kebab_store_sqlite::SqliteStore;
let store = SqliteStore::open(&env.config).unwrap();
store.run_migrations().unwrap();
store
.list_documents(&DocFilter::default())
.unwrap()
.into_iter()
.map(|d| d.doc_path.0)
.collect()
}
#[test]
fn file_deletion_auto_purge() {
let env = TestEnv::lexical_only();
// Write two .rs files into the workspace.
let a_path = env.workspace_root.join("a.rs");
let b_path = env.workspace_root.join("b.rs");
std::fs::write(&a_path, "// file a\nfn alpha() {}\n").unwrap();
std::fs::write(&b_path, "// file b\nfn bravo() {}\n").unwrap();
// First ingest — both must be New.
let first = ingest_with_config_opts(
env.config.clone(),
env.scope(),
false,
IngestOpts::default(),
)
.expect("first ingest must succeed");
// The fixture workspace may contain other files, so assert a lower bound only.
let first_new = first.new;
assert!(first_new >= 2, "expected at least 2 new docs: {first:?}");
assert_eq!(
first.purged_deleted_files, 0,
"no purges on first ingest: {first:?}"
);
assert_eq!(first.errors, 0, "no errors on first ingest: {first:?}");
// Delete one file from the filesystem.
std::fs::remove_file(&b_path).expect("remove b.rs");
// Second ingest — scanned count drops by 1; b.rs should be purged.
let second = ingest_with_config_opts(
env.config.clone(),
env.scope(),
false,
IngestOpts::default(),
)
.expect("second ingest must succeed");
assert_eq!(
second.purged_deleted_files, 1,
"exactly 1 file should be purged: {second:?}"
);
assert_eq!(second.new, 0, "no new docs after deletion: {second:?}");
assert_eq!(second.updated, 0, "no updated docs: {second:?}");
assert_eq!(second.errors, 0, "no errors: {second:?}");
// b.rs must no longer appear in list_docs.
let doc_paths = list_doc_paths(&env);
let b_ws_path = "b.rs";
assert!(
!doc_paths.iter().any(|p| p == b_ws_path),
"b.rs must be gone from list_docs; got: {doc_paths:?}"
);
// a.rs must still be present.
let a_ws_path = "a.rs";
assert!(
doc_paths.iter().any(|p| p == a_ws_path),
"a.rs must still be in list_docs; got: {doc_paths:?}"
);
// Lexical search for b.rs's unique content returns no hits.
let app = env.app();
let query = SearchQuery {
text: "bravo".to_string(),
mode: SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(query).expect("search must not error");
assert!(
hits.is_empty(),
"search for deleted file's content must return no hits; got: {hits:?}"
);
}
#[test]
fn include_scope_narrowing_does_not_purge() {
let env = TestEnv::lexical_only();
// Write two .rs files.
let a_path = env.workspace_root.join("a_narrow.rs");
let b_path = env.workspace_root.join("b_narrow.rs");
std::fs::write(&a_path, "// narrow a\nfn alpha_narrow() {}\n").unwrap();
std::fs::write(&b_path, "// narrow b\nfn bravo_narrow() {}\n").unwrap();
// Wide scope: first ingest — both must be New.
let wide_scope = SourceScope {
root: env.workspace_root.clone(),
include: vec!["**/*.rs".to_string()],
exclude: env.config.workspace.exclude.clone(),
};
let first = ingest_with_config_opts(
env.config.clone(),
wide_scope,
false,
IngestOpts::default(),
)
.expect("first ingest (wide) must succeed");
assert!(
first.new >= 2,
"expected at least 2 new docs: {first:?}"
);
assert_eq!(
first.purged_deleted_files, 0,
"no purges on first ingest: {first:?}"
);
// Narrow scope: only a_narrow.rs in include — b_narrow.rs is still
// on disk but excluded from the walker scope.
let narrow_scope = SourceScope {
root: env.workspace_root.clone(),
include: vec!["a_narrow.rs".to_string()],
exclude: env.config.workspace.exclude.clone(),
};
let second = ingest_with_config_opts(
env.config.clone(),
narrow_scope,
false,
IngestOpts::default(),
)
.expect("second ingest (narrow) must succeed");
// CRITICAL: b_narrow.rs is still on disk — must NOT be purged.
assert_eq!(
second.purged_deleted_files, 0,
"scope-narrowing must NOT purge on-disk files; got: {second:?}"
);
assert_eq!(second.errors, 0, "no errors: {second:?}");
// b_narrow.rs must still exist in the store.
let doc_paths = list_doc_paths(&env);
let b_ws_path = "b_narrow.rs";
assert!(
doc_paths.iter().any(|p| p == b_ws_path),
"b_narrow.rs must still be in list_docs after scope narrowing; got: {doc_paths:?}"
);
// And the file must still be on disk.
assert!(
b_path.exists(),
"b_narrow.rs must still be on disk (we didn't delete it)"
);
}
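
The two tests above pin down one rule: a stored doc is auto-purged only when it was not scanned in this run and its file is gone from disk. A minimal sketch of that decision, assuming a hypothetical helper `paths_to_purge` (not the crate's API) with the disk check injected as a closure so it runs without a real filesystem:

```rust
use std::collections::BTreeSet;

/// Return the stored paths that should be purged: those the walker did not
/// see in this run AND that no longer exist on disk. `exists_on_disk` is
/// injected so the sketch needs no real files.
fn paths_to_purge<F>(stored: &[&str], scanned: &BTreeSet<&str>, exists_on_disk: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    stored
        .iter()
        .copied()
        .filter(|p| !scanned.contains(p) && !exists_on_disk(*p))
        .map(|p| p.to_string())
        .collect()
}

fn main() {
    let stored = ["a.rs", "b.rs", "c.rs"];
    // The walker only saw a.rs in this run.
    let scanned = BTreeSet::from(["a.rs"]);
    // b.rs was deleted; c.rs is still on disk but out of scope.
    let on_disk = |p: &str| p != "b.rs";
    let purge = paths_to_purge(&stored, &scanned, on_disk);
    println!("{purge:?}");
}
```

Here `c.rs` plays the role of `b_narrow.rs` in the second test: it is out of scope but still on disk, so it is kept.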


@@ -0,0 +1,141 @@
//! Integration test for `kebab reset --orphans-only`.
//!
//! Verifies that stored docs outside the current walker scope are purged
//! from the store without removing any files from the filesystem.
//!
//! Test outline:
//! 1. Ingest 3 .rs files (a.rs, b.rs, c.rs) — all New.
//! 2. Narrow the config by adding b.rs and c.rs to `workspace.exclude`;
//! both files are still on disk but outside the walker scope.
//! 3. Run `execute(ResetScope::OrphansOnly, &cfg)` — report must show
//! `orphans_purged == 2` and `purged_paths` contains b.rs + c.rs.
//! 4. `list docs` must show only a.rs.
//! 5. b.rs and c.rs must still exist on disk (no filesystem removal).
//! 6. Second reset → `orphans_purged == 0` (idempotent).
mod common;
use common::TestEnv;
use kebab_app::IngestOpts;
use kebab_app::reset::{ResetScope, execute};
use kebab_core::{DocFilter, DocumentStore, SourceScope};
/// Open the SqliteStore and list all `workspace_path` values.
fn list_doc_paths(env: &TestEnv) -> Vec<String> {
use kebab_store_sqlite::SqliteStore;
let store = SqliteStore::open(&env.config).unwrap();
store.run_migrations().unwrap();
store
.list_documents(&DocFilter::default())
.unwrap()
.into_iter()
.map(|d| d.doc_path.0)
.collect()
}
#[test]
fn reset_orphans_only_purges_out_of_scope_docs() {
let env = TestEnv::lexical_only();
// Write three .rs files into the workspace.
let a_path = env.workspace_root.join("a.rs");
let b_path = env.workspace_root.join("b.rs");
let c_path = env.workspace_root.join("c.rs");
std::fs::write(&a_path, "// file a\nfn alpha() {}\n").unwrap();
std::fs::write(&b_path, "// file b\nfn bravo() {}\n").unwrap();
std::fs::write(&c_path, "// file c\nfn charlie() {}\n").unwrap();
// Ingest all three with a wide scope.
let wide_scope = SourceScope {
root: env.workspace_root.clone(),
include: vec!["**/*.rs".to_string()],
exclude: env.config.workspace.exclude.clone(),
};
let first = kebab_app::ingest_with_config_opts(
env.config.clone(),
wide_scope,
false,
IngestOpts::default(),
)
.expect("first ingest must succeed");
// The fixture workspace may contain other .rs files — just assert we
// got at least 3 new docs (our a.rs, b.rs, c.rs).
assert!(first.new >= 3, "expected at least 3 new docs: {first:?}");
assert_eq!(first.errors, 0, "no errors on first ingest");
// Narrow the walker scope so that only a.rs stays in scope; b.rs + c.rs
// are still on disk. `enumerate_orphans` builds its scope from the
// config's `workspace.root`, which already points at the workspace root.
//
// NOTE: `kebab_config::WorkspaceCfg` does not have an `include` field
// (it was removed in p9-fb-25), so we narrow via the walker exclude
// list instead: exclude b.rs and c.rs explicitly.
let mut narrow_cfg = env.config.clone();
narrow_cfg.workspace.exclude = vec!["b.rs".to_string(), "c.rs".to_string()];
// Run orphans-only reset.
let report = execute(ResetScope::OrphansOnly, &narrow_cfg)
.expect("orphans-only reset must succeed");
assert_eq!(
report.orphans_purged, 2,
"expected 2 orphans purged (b.rs + c.rs): {report:?}"
);
let mut purged: Vec<String> = report
.purged_paths
.iter()
.map(|p| p.0.clone())
.collect();
purged.sort();
assert_eq!(
purged,
vec!["b.rs".to_string(), "c.rs".to_string()],
"purged_paths must list b.rs and c.rs in sorted order: {purged:?}"
);
// list docs must show only a.rs (and any pre-existing fixture files
// that are not excluded by the narrow config).
let doc_paths = list_doc_paths(&env);
// The narrow_cfg excludes b.rs + c.rs — they must no longer be in store.
assert!(
!doc_paths.iter().any(|p| p == "b.rs"),
"b.rs must be gone from store after orphans-only reset; got: {doc_paths:?}"
);
assert!(
!doc_paths.iter().any(|p| p == "c.rs"),
"c.rs must be gone from store after orphans-only reset; got: {doc_paths:?}"
);
assert!(
doc_paths.iter().any(|p| p == "a.rs"),
"a.rs must still be in store; got: {doc_paths:?}"
);
// Both b.rs and c.rs must still exist on the filesystem — no file
// removal is performed by orphans-only.
assert!(
b_path.exists(),
"b.rs must still be on disk after orphans-only reset"
);
assert!(
c_path.exists(),
"c.rs must still be on disk after orphans-only reset"
);
// Second reset must be idempotent: nothing left to purge.
let second = execute(ResetScope::OrphansOnly, &narrow_cfg)
.expect("second orphans-only reset must succeed");
assert_eq!(
second.orphans_purged, 0,
"second reset must be idempotent (orphans_purged == 0): {second:?}"
);
assert!(
second.purged_paths.is_empty(),
"second reset purged_paths must be empty: {:?}",
second.purged_paths
);
}
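
The contract exercised above (purge out-of-scope docs from the store, leave the filesystem alone, second run is a no-op) can be sketched with a toy in-memory store. This is an illustration under stated assumptions: `ToyStore` and `purge_orphans` are hypothetical names, and exact-match exclusion stands in for the real glob-based walker scope.

```rust
use std::collections::BTreeSet;

/// Toy store: just the set of stored workspace paths.
struct ToyStore {
    docs: BTreeSet<String>,
}

/// Remove every stored path that the current scope excludes and return the
/// removed paths in sorted order. Only the store is modified; no file is
/// touched.
fn purge_orphans(store: &mut ToyStore, exclude: &[&str]) -> Vec<String> {
    let orphans: Vec<String> = store
        .docs
        .iter()
        .filter(|p| exclude.contains(&p.as_str()))
        .cloned()
        .collect();
    for p in &orphans {
        store.docs.remove(p);
    }
    orphans
}

fn main() {
    let mut store = ToyStore {
        docs: ["a.rs", "b.rs", "c.rs"].iter().map(|p| p.to_string()).collect(),
    };
    let first = purge_orphans(&mut store, &["b.rs", "c.rs"]);
    let second = purge_orphans(&mut store, &["b.rs", "c.rs"]);
    println!("first: {first:?}, second: {second:?}");
}
```

Because the orphan set is computed from what is currently stored, a repeated call finds nothing left to remove, which is the idempotency the test asserts.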


@@ -0,0 +1,176 @@
//! Regression test for the twin-file fetch_span media-type lookup bug.
//!
//! Twin files (identical content at different workspace paths) share one
//! `assets` row whose PRIMARY KEY is the blake3 content hash. The old
//! `fetch_span` implementation called
//! `get_asset_by_workspace_path(&doc.workspace_path)` to check whether the
//! media type was PDF/audio (and therefore reject span fetch). For a twin
//! file that lookup could silently return the *other* twin's asset row if
//! `assets.workspace_path` had been overwritten on the most recent ingest of
//! the sibling — making the media-type branch decision incorrect.
//!
//! Fix: `fetch_span` now uses the 2-step lookup
//! `get_document_by_workspace_path` → `doc.source_asset_id` → `get_asset`
//! so the result is always anchored to the requesting document, not
//! whichever twin last updated `assets.workspace_path`.
//!
//! This test builds a twin-file scenario (two .md files at different paths
//! with identical content), ingests both, then calls `fetch_span` on each
//! twin's `doc_id` and asserts it succeeds. Before the fix, if the asset
//! row's workspace_path happened to point at the wrong twin the span could
//! return an incorrect `span_not_supported` for a non-PDF/audio file, or
//! conversely allow span on a PDF twin by accident. After the fix, the
//! lookup is always doc-specific.
mod common;
use common::TestEnv;
use kebab_app::ingest_with_config;
use kebab_core::{DocumentStore, FetchKind, FetchOpts, FetchQuery, IngestItemKind};
#[test]
fn twin_files_fetch_span_uses_correct_asset() {
let env = TestEnv::lexical_only();
// Write two markdown files with identical content at different paths.
let dir_a = env.workspace_root.join("src_a");
let dir_b = env.workspace_root.join("src_b");
std::fs::create_dir_all(&dir_a).unwrap();
std::fs::create_dir_all(&dir_b).unwrap();
// The content must span at least 3 lines, since the span fetches below request up to line 3.
let content = "# Twin\n\nLine one.\n\nLine two.\n\nLine three.\n";
std::fs::write(dir_a.join("note.md"), content).unwrap();
std::fs::write(dir_b.join("note.md"), content).unwrap();
// Ingest all files (fixture workspace + our two new twins).
let report = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors; report={report:?}");
// Both twin paths must appear as New in the report.
let items = report.items.as_ref().expect("items must be present");
let twin_items: Vec<_> = items
.iter()
.filter(|i| {
i.doc_path.0.ends_with("src_a/note.md")
|| i.doc_path.0.ends_with("src_b/note.md")
})
.collect();
assert_eq!(
twin_items.len(),
2,
"exactly 2 twin items expected; items={items:?}"
);
for item in &twin_items {
assert_eq!(
item.kind,
IngestItemKind::New,
"each twin must be New; item={item:?}"
);
}
// Resolve doc_ids for both workspace paths.
// The ingest layer normalises workspace_path to the path relative to
// workspace_root (e.g. "src_a/note.md"), so we look up by that form.
let store = kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
store.run_migrations().unwrap();
// Find the twin items by matching on suffix so the test is robust to
// however the workspace root is represented (`items` is still in scope
// from the New-kind check above).
let path_a_str = items
.iter()
.find(|i| i.doc_path.0.ends_with("src_a/note.md"))
.map(|i| i.doc_path.0.clone())
.expect("src_a/note.md must appear in ingest report");
let path_b_str = items
.iter()
.find(|i| i.doc_path.0.ends_with("src_b/note.md"))
.map(|i| i.doc_path.0.clone())
.expect("src_b/note.md must appear in ingest report");
let path_a = kebab_core::WorkspacePath(path_a_str);
let path_b = kebab_core::WorkspacePath(path_b_str);
let doc_a = store
.get_document_by_workspace_path(&path_a)
.expect("get_document_by_workspace_path path_a")
.expect("doc_a must exist after ingest");
let doc_b = store
.get_document_by_workspace_path(&path_b)
.expect("get_document_by_workspace_path path_b")
.expect("doc_b must exist after ingest");
// Both twins share one asset_id (same content hash).
assert_eq!(
doc_a.source_asset_id, doc_b.source_asset_id,
"twin files must share one asset_id"
);
// Open App and issue span fetch on each twin's doc_id.
let app = env.app();
let result_a = app
.fetch(
FetchQuery::Span {
doc_id: doc_a.doc_id.clone(),
line_start: 1,
line_end: 2,
},
FetchOpts::default(),
)
.expect("fetch_span on twin A must succeed for a markdown file");
assert_eq!(result_a.kind, FetchKind::Span);
assert!(
result_a.text.as_deref().is_some_and(|t| !t.is_empty()),
"span text for twin A must not be empty"
);
let result_b = app
.fetch(
FetchQuery::Span {
doc_id: doc_b.doc_id.clone(),
line_start: 1,
line_end: 2,
},
FetchOpts::default(),
)
.expect("fetch_span on twin B must succeed for a markdown file");
assert_eq!(result_b.kind, FetchKind::Span);
assert!(
result_b.text.as_deref().is_some_and(|t| !t.is_empty()),
"span text for twin B must not be empty"
);
// Ingest again to force the `assets.workspace_path` flip-flop, then
// re-check. Pre-fix this was the scenario that triggered the bug:
// after the second ingest the asset row's workspace_path could point
// at either twin, making one twin's span fetch behave incorrectly.
let report2 = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest must succeed");
assert_eq!(report2.errors, 0, "no ingest errors on second run; report={report2:?}");
// Re-open app after second ingest and verify span still works on both.
let app2 = env.app();
app2.fetch(
FetchQuery::Span {
doc_id: doc_a.doc_id.clone(),
line_start: 1,
line_end: 3,
},
FetchOpts::default(),
)
.expect("fetch_span on twin A after flip-flop must still succeed");
app2.fetch(
FetchQuery::Span {
doc_id: doc_b.doc_id.clone(),
line_start: 1,
line_end: 3,
},
FetchOpts::default(),
)
.expect("fetch_span on twin B after flip-flop must still succeed");
}
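
The difference between the path-based asset lookup and the two-step lookup described in the module docs can be modelled with two maps. A minimal sketch, assuming toy types: `ToyStore`, `asset_by_path` and `asset_for_doc` are hypothetical names, and the media type string and the `h1` hash are placeholders.

```rust
use std::collections::HashMap;

struct Asset {
    media_type: &'static str,
    /// Single path column: the twin ingested last overwrites it.
    workspace_path: &'static str,
}

struct Doc {
    source_asset_id: &'static str,
}

struct ToyStore {
    /// asset_id (content hash) -> asset row
    assets: HashMap<&'static str, Asset>,
    /// workspace_path -> document row (unique per path)
    documents: HashMap<&'static str, Doc>,
}

impl ToyStore {
    /// Path-based lookup on assets: misses whichever twin was overwritten.
    fn asset_by_path(&self, path: &str) -> Option<&Asset> {
        self.assets.values().find(|a| a.workspace_path == path)
    }

    /// Two-step lookup: document row first, then the asset it points at.
    fn asset_for_doc(&self, path: &str) -> Option<&Asset> {
        let doc = self.documents.get(path)?;
        self.assets.get(doc.source_asset_id)
    }
}

fn twin_store() -> ToyStore {
    let mut assets = HashMap::new();
    // Both twins hash to "h1"; src_b was ingested last, so its path won.
    assets.insert("h1", Asset { media_type: "text/markdown", workspace_path: "src_b/note.md" });
    let mut documents = HashMap::new();
    documents.insert("src_a/note.md", Doc { source_asset_id: "h1" });
    documents.insert("src_b/note.md", Doc { source_asset_id: "h1" });
    ToyStore { assets, documents }
}

fn main() {
    let store = twin_store();
    println!("by path:  {}", store.asset_by_path("src_a/note.md").is_some());
    println!("two-step: {}", store.asset_for_doc("src_a/note.md").is_some());
}
```

In this model the path-based lookup simply misses twin A, while the two-step lookup resolves both twins to the shared asset regardless of which path the asset row currently holds.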


@@ -0,0 +1,90 @@
//! Regression test for the twin-file idempotency bug.
//!
//! Identical-content files at different workspace paths share one
//! `assets` row (`asset_id` = blake3 content hash, PRIMARY KEY). The
//! old UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path =
//! excluded.workspace_path` made each twin overwrite the other's path
//! on every ingest, so `get_asset_by_workspace_path(path1)` returned
//! None (or the wrong twin) → re-process every time.
//!
//! Fix: `try_skip_unchanged` now uses `get_document_by_workspace_path`
//! instead. `documents.workspace_path` is UNIQUE (V001) so each twin
//! has its own stable document row.
//!
//! Assertion contract:
//! 1st ingest → 2 New (one per twin)
//! 2nd ingest → 0 New, 0 Updated, 2 Unchanged
mod common;
use common::TestEnv;
use kebab_app::ingest_with_config;
use kebab_core::IngestItemKind;
#[test]
fn twin_files_second_ingest_is_unchanged() {
let env = TestEnv::lexical_only();
// Write two files with identical content at different paths.
let pkg_a = env.workspace_root.join("pkg_a");
let pkg_b = env.workspace_root.join("pkg_b");
std::fs::create_dir_all(&pkg_a).unwrap();
std::fs::create_dir_all(&pkg_b).unwrap();
let content = b"# shared\nThis content is identical in both files.\n";
std::fs::write(pkg_a.join("__init__.py"), content).unwrap();
std::fs::write(pkg_b.join("__init__.py"), content).unwrap();
// First ingest — both files must be New.
let first = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest must succeed");
assert_eq!(first.errors, 0, "first ingest: no errors; report={first:?}");
let items = first.items.as_ref().expect("items must be present");
let twin_items: Vec<_> = items
.iter()
.filter(|i| {
i.doc_path.0.ends_with("__init__.py")
})
.collect();
assert_eq!(
twin_items.len(),
2,
"first ingest: expected exactly 2 __init__.py items; items={items:?}"
);
for item in &twin_items {
assert_eq!(
item.kind,
IngestItemKind::New,
"first ingest: each twin must be New; item={item:?}"
);
}
// Second ingest — same files, same content → both must be Unchanged.
let second = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest must succeed");
assert_eq!(second.errors, 0, "second ingest: no errors; report={second:?}");
assert_eq!(second.new, 0, "second ingest: no new docs; report={second:?}");
assert_eq!(
second.updated, 0,
"second ingest: no updated docs (twin-file bug would set this to 2); report={second:?}"
);
let second_items = second.items.as_ref().expect("items must be present");
let twin_items2: Vec<_> = second_items
.iter()
.filter(|i| i.doc_path.0.ends_with("__init__.py"))
.collect();
assert_eq!(
twin_items2.len(),
2,
"second ingest: expected exactly 2 __init__.py items; items={second_items:?}"
);
for item in &twin_items2 {
assert_eq!(
item.kind,
IngestItemKind::Unchanged,
"second ingest: each twin must be Unchanged; item={item:?}"
);
}
}
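
The re-process loop described in the module docs can be reproduced with a small simulation. This is a sketch under stated assumptions rather than the store's real schema: `ToyStore`, `upsert` and `ingest_pass` are hypothetical, and the content hash `h` is a placeholder.

```rust
use std::collections::HashMap;

/// Toy model of the two tables. `assets` maps asset_id -> workspace_path
/// (one row per content hash); `documents` maps workspace_path -> asset_id.
#[derive(Default)]
struct ToyStore {
    assets: HashMap<String, String>,
    documents: HashMap<String, String>,
}

impl ToyStore {
    /// Models an upsert where a conflict on asset_id overwrites the path.
    fn upsert(&mut self, path: &str, asset_id: &str) {
        self.assets.insert(asset_id.to_string(), path.to_string());
        self.documents.insert(path.to_string(), asset_id.to_string());
    }
}

/// Run one ingest pass and count files treated as unchanged.
/// `by_document` selects the per-document lookup instead of the
/// path-on-assets lookup.
fn ingest_pass(store: &mut ToyStore, files: &[(&str, &str)], by_document: bool) -> usize {
    let mut unchanged = 0;
    for &(path, asset_id) in files {
        let known = if by_document {
            store.documents.get(path).map(String::as_str) == Some(asset_id)
        } else {
            store
                .assets
                .iter()
                .any(|(id, p)| p.as_str() == path && id.as_str() == asset_id)
        };
        if known {
            unchanged += 1;
        } else {
            store.upsert(path, asset_id);
        }
    }
    unchanged
}

fn main() {
    let twins = [("pkg_a/__init__.py", "h"), ("pkg_b/__init__.py", "h")];
    for by_document in [false, true] {
        let mut store = ToyStore::default();
        ingest_pass(&mut store, &twins, by_document);
        let second = ingest_pass(&mut store, &twins, by_document);
        println!("by_document={by_document}: unchanged on 2nd pass = {second}");
    }
}
```

With the path-on-assets lookup each twin overwrites the other's path during the pass, so neither is ever recognised as unchanged; keyed by document row, both are.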


@@ -13,14 +13,16 @@ serde_json_canonicalizer = "0.3"
blake3 = { workspace = true }
anyhow = { workspace = true }
tracing = { workspace = true }
serde_yaml = { workspace = true }
[dev-dependencies]
# kb-parse-md / kb-normalize are dev-only — used by the snapshot integration
# test to build a CanonicalDocument from a fixture Markdown file. Forbidden as
# regular deps per design §8 (chunker consumes CanonicalDocument from kb-core
# only); `cargo tree -p kb-chunk --depth 1` (default scope, excludes dev-deps)
# confirms this.
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-normalize = { path = "../kebab-normalize" }
serde_json = { workspace = true }
time = { workspace = true }
# kb-parse-md / kb-normalize / kb-parse-code are dev-only — used by the
# snapshot integration tests to build a CanonicalDocument from fixture files.
# Forbidden as regular deps per design §8 (chunker consumes CanonicalDocument
# from kb-core only); `cargo tree -p kb-chunk --depth 1` (default scope,
# excludes dev-deps) confirms this.
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-parse-code = { path = "../kebab-parse-code" }
kebab-normalize = { path = "../kebab-normalize" }
serde_json = { workspace = true }
time = { workspace = true }


@@ -0,0 +1,322 @@
//! `code-c-ast-v1` — maps a tree-sitter-derived C AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-c-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeCAstV1Chunker;
impl Chunker for CodeCAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeCAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeCAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-c-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.c".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-c-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("c".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("c".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("c".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_c_ast_v1() {
assert_eq!(CodeCAstV1Chunker.chunker_version(),
ChunkerVersion("code-c-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "int parse() {\n\t// x\n}"),
("print", 5, 8, "void print() {\n\t//\n\treturn;\n}"),
]);
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-c-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
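// Join with "\n" (no trailing newline) so the unit is exactly 502 lines,
// matching the 1..=502 span declared below.
let body = (0..500).map(|i| format!("\tx{i} = {i};")).collect::<Vec<_>>().join("\n");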
let code = format!("int big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "int parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeCAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
let base: Vec<String> = CodeCAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeCAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeCAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
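
To make the blank-line preference in `split_oversize` easier to see, here is a small worked sketch that follows the same greedy strategy but takes the line cap as a parameter, so a 15-line input can exercise it. `split_lines` and `sample` are hypothetical helpers for illustration; they return offsets only and are not part of the chunker.

```rust
/// Same greedy strategy as `split_oversize`, with the cap passed in.
/// Returns 0-based inclusive `(start, end)` line offsets.
fn split_lines(code: &str, max_lines: usize) -> Vec<(usize, usize)> {
    assert!(max_lines > 0, "max_lines must be positive");
    let lines: Vec<&str> = code.split('\n').collect();
    let total = lines.len();
    let mut out = Vec::new();
    let mut start = 0;
    while start < total {
        let mut end = (start + max_lines).min(total);
        // Only the last fifth of the window is searched for a blank line.
        let floor = start + max_lines * 4 / 5;
        if end < total {
            if let Some(b) = (floor.min(end)..end)
                .rev()
                .find(|&i| lines[i].trim().is_empty())
            {
                end = b + 1; // cut right after the blank line
            }
        }
        out.push((start, end - 1));
        start = end;
    }
    out
}

/// Build 15 lines `l0..l14`, optionally replacing one with a blank line.
fn sample(blank_at: Option<usize>) -> String {
    (0..15)
        .map(|i| if Some(i) == blank_at { String::new() } else { format!("l{i}") })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // A blank line at offset 8 falls inside the searched tail (8..10) of the
    // first window, so the first part ends there instead of at offset 9.
    println!("{:?}", split_lines(&sample(Some(8)), 10));
    println!("{:?}", split_lines(&sample(None), 10));
}
```

The parts are contiguous and cover every line, so adding the unit's absolute `line_start` to each offset yields non-overlapping spans, and the distinct start lines are what keep the per-part split keys unique.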


@@ -0,0 +1,322 @@
//! `code-cpp-ast-v1` — maps a tree-sitter-derived C++ AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-cpp-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeCppAstV1Chunker;
impl Chunker for CodeCppAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeCppAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeCppAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-cpp-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
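/// Split an oversize unit into windows of at most `AST_CHUNK_MAX_LINES`
/// lines. When a window stops short of the end of the unit, the cut backs
/// up to just after the last blank line in the final fifth of the window,
/// if there is one; otherwise the cut falls at the line limit.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based and inclusive within the unit (caller adds the unit's absolute
/// `line_start`).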
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.cpp".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-cpp-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("cpp".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("cpp".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("cpp".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_cpp_ast_v1() {
assert_eq!(CodeCppAstV1Chunker.chunker_version(),
ChunkerVersion("code-cpp-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "int parse() {\n\t// x\n}"),
("print", 5, 8, "void print() {\n\t//\n\treturn;\n}"),
]);
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-cpp-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tx{i} = {i};")).collect::<Vec<_>>().join("\n");
let code = format!("int big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "int parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeCppAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
let base: Vec<String> = CodeCppAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeCppAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeCppAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
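
Note on the oversize fallback in the file above: the following is a minimal, standalone sketch of the window-and-back-off split, not the chunker's own code. It assumes a hypothetical `split_at_blank_lines` helper that takes the line limit as a `max_lines` parameter (the chunker hardcodes `AST_CHUNK_MAX_LINES`) and uses `usize` offsets instead of `u32`, so the behaviour can be tried on a small input.

```rust
/// Sketch of the oversize split with the line limit passed in as a parameter
/// (`max_lines` is an assumption of this example; the chunker uses a constant).
/// Returns inclusive, 0-based `(start, end)` line offsets plus the part text.
fn split_at_blank_lines(code: &str, max_lines: usize) -> Vec<(usize, usize, String)> {
    assert!(max_lines > 0, "max_lines must be positive");
    let lines: Vec<&str> = code.split('\n').collect();
    let total = lines.len();
    let mut out = Vec::new();
    let mut start = 0;
    while start < total {
        // Hard upper bound for this window.
        let mut end = (start + max_lines).min(total);
        // Only the last fifth of the window is searched for a blank line.
        let floor = start + max_lines * 4 / 5;
        if end < total {
            if let Some(b) = (floor.min(end)..end)
                .rev()
                .find(|&i| lines[i].trim().is_empty())
            {
                // Cut just after the blank line so it stays with this part.
                end = b + 1;
            }
        }
        out.push((start, end - 1, lines[start..end].join("\n")));
        start = end;
    }
    out
}
```

With `max_lines = 10` and a 15-line input whose only blank line sits at offset 8, the first part covers offsets 0 to 8 and the second covers 9 to 14. When the last fifth of a window holds no blank line, the cut falls at the line limit.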

@@ -0,0 +1,322 @@
//! `code-go-ast-v1` — maps a tree-sitter-derived Go AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-go-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeGoAstV1Chunker;
impl Chunker for CodeGoAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeGoAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeGoAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-go-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
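/// Split an oversize unit into windows of at most `AST_CHUNK_MAX_LINES`
/// lines. When a window stops short of the end of the unit, the cut backs
/// up to just after the last blank line in the final fifth of the window,
/// if there is one; otherwise the cut falls at the line limit.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based and inclusive within the unit (caller adds the unit's absolute
/// `line_start`).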
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.go".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-go-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("go".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("go".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("go".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_go_ast_v1() {
assert_eq!(CodeGoAstV1Chunker.chunker_version(),
ChunkerVersion("code-go-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "func parse() {\n\t// x\n}"),
("Foo.double", 5, 8, "func double() int {\n\t//\n\treturn 0\n}"),
]);
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-go-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tx{i} := {i}")).collect::<Vec<_>>().join("\n");
let code = format!("func big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "func parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeGoAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "func parse() {}\n")]);
let base: Vec<String> = CodeGoAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeGoAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeGoAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
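
Note on chunk-id uniqueness for split parts in the file above: all parts of one oversize unit share the same `doc_id`, `chunker_version` and `block_ids`, so the only input to `id_for_chunk` that can tell them apart is the `id_hash` string. The sketch below isolates that step. The function name `id_hash_for` is an assumption of this example; the match arms mirror the ones in `make_chunk`.

```rust
/// Builds the string that is fed into chunk-id derivation. `base_policy_hash`
/// and `split_key` mirror the parameters of `make_chunk`; the function name is
/// an assumption of this sketch.
fn id_hash_for(base_policy_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        // Split parts: suffix the absolute start line so sibling parts differ.
        Some(k) => format!("{base_policy_hash}#L{k}"),
        // Whole unit: the bare policy hash is used unchanged.
        None => base_policy_hash.to_string(),
    }
}
```

Because each part starts on a later line than the one before it, the `#L<line>` suffixes never repeat within a unit. Unsplit units pass `None` and rely on each unit carrying its own block id, which is how the test fixture builds them.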

@@ -0,0 +1,322 @@
//! `code-java-ast-v1` — maps a tree-sitter-derived Java AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-java-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeJavaAstV1Chunker;
impl Chunker for CodeJavaAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeJavaAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeJavaAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-java-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
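/// Split an oversize unit into windows of at most `AST_CHUNK_MAX_LINES`
/// lines. When a window stops short of the end of the unit, the cut backs
/// up to just after the last blank line in the final fifth of the window,
/// if there is one; otherwise the cut falls at the line limit.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based and inclusive within the unit (caller adds the unit's absolute
/// `line_start`).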
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/Main.java".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-java-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("java".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("java".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("java".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_java_ast_v1() {
assert_eq!(CodeJavaAstV1Chunker.chunker_version(),
ChunkerVersion("code-java-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "void parse() {\n\t// x\n}"),
("Foo.twice", 5, 8, "int twice() {\n\t//\n\treturn 0;\n}"),
]);
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-java-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tint x{i} = {i};")).collect::<Vec<_>>().join("\n");
let code = format!("void big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "void parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeJavaAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "void parse() {}\n")]);
let base: Vec<String> = CodeJavaAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeJavaAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeJavaAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
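
Note on part spans and labels in the file above: the oversize branch turns each 0-based offset pair from `split_oversize` into an absolute line range and a `<symbol> [part i/N]` label. The sketch below shows that mapping in isolation. The function name `label_parts` and its tuple layout are assumptions of this example, not part of the chunker.

```rust
/// Maps 0-based, inclusive part offsets within a unit to absolute line ranges
/// and `<symbol> [part i/N]` labels. The function name and tuple layout are
/// assumptions of this sketch.
fn label_parts(
    symbol: Option<&str>,
    unit_line_start: u32,
    offsets: &[(u32, u32)],
) -> Vec<(u32, u32, Option<String>)> {
    let n = offsets.len();
    offsets
        .iter()
        .enumerate()
        .map(|(i, &(off_start, off_end))| {
            // Parts are numbered from 1; a unit without a symbol stays unlabeled.
            let label = symbol.map(|s| format!("{s} [part {}/{n}]", i + 1));
            (unit_line_start + off_start, unit_line_start + off_end, label)
        })
        .collect()
}
```

For a unit named `big` that starts at line 10 and is split at offsets `(0, 199)` and `(200, 301)`, this yields lines 10 to 209 labeled `big [part 1/2]` and lines 210 to 311 labeled `big [part 2/2]`.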

@@ -0,0 +1,322 @@
//! `code-kotlin-ast-v1` — maps a tree-sitter-derived Kotlin AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-kotlin-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeKotlinAstV1Chunker;
impl Chunker for CodeKotlinAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeKotlinAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeKotlinAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-kotlin-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
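/// Split an oversize unit into windows of at most `AST_CHUNK_MAX_LINES`
/// lines. When a window stops short of the end of the unit, the cut backs
/// up to just after the last blank line in the final fifth of the window,
/// if there is one; otherwise the cut falls at the line limit.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based and inclusive within the unit (caller adds the unit's absolute
/// `line_start`).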
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/Main.kt".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-kotlin-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("kotlin".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("kotlin".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("kotlin".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_kotlin_ast_v1() {
assert_eq!(CodeKotlinAstV1Chunker.chunker_version(),
ChunkerVersion("code-kotlin-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "fun parse() {\n\t// x\n}"),
("Foo.double", 5, 7, "fun double(): Int {\n\t//\n\treturn 0\n}"),
]);
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-kotlin-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tval x{i} = {i}")).collect::<Vec<_>>().join("\n");
let code = format!("fun big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "fun parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeKotlinAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "fun parse() {}\n")]);
let base: Vec<String> = CodeKotlinAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeKotlinAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeKotlinAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
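
The blank-line-aware split used for oversize units above can be sketched in isolation. This is a minimal sketch, not the chunker's code: the function name `split_at_blank_lines` and the `max_lines` parameter are hypothetical (the original uses the fixed `AST_CHUNK_MAX_LINES` constant), and it assumes `max_lines >= 1`.

```rust
/// Split `code` into parts of at most `max_lines` lines, preferring to cut
/// just after a blank line found in the last fifth of each window.
/// Returns 0-based, inclusive `(start, end, text)` line offsets.
/// `max_lines` is a hypothetical stand-in for `AST_CHUNK_MAX_LINES`.
fn split_at_blank_lines(code: &str, max_lines: u32) -> Vec<(u32, u32, String)> {
    assert!(max_lines >= 1, "max_lines must be at least 1");
    let lines: Vec<&str> = code.split('\n').collect();
    let total = lines.len() as u32;
    let mut out = Vec::new();
    let mut start = 0u32;
    while start < total {
        // Hard limit for this window.
        let mut end = (start + max_lines).min(total);
        // Only look for a blank line in the last fifth of the window.
        let floor = start + max_lines * 4 / 5;
        if end < total {
            if let Some(b) = (floor.min(end)..end)
                .rev()
                .find(|&i| lines[i as usize].trim().is_empty())
            {
                // Cut just after the blank line (it stays in this part).
                end = b + 1;
            }
        }
        let text = lines[start as usize..end as usize].join("\n");
        out.push((start, end - 1, text));
        start = end;
    }
    out
}
```

With `max_lines = 10`, a 12-line input whose ninth line is blank is cut after that blank line rather than at line 10, so the second part starts at offset 9. The caller then adds the unit's absolute `line_start` to each offset, and the resulting part start is what feeds `split_key`.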

@@ -0,0 +1,170 @@
//! p10-3: Tier 3 paragraph + line-window fallback chunker.
//!
//! Splits code/text files on blank-line paragraph boundaries. Paragraphs
//! with more than 80 lines are further split into 80-line windows with a
//! 20-line overlap (stride 60). This plays the same role as the oversize
//! fallback in the Tier 1/2 chunkers (non-overlapping parts of at most
//! `AST_CHUNK_MAX_LINES` lines), but there is no AST structure here, hence
//! no symbol.
//!
//! Per spec §9.3: all emitted chunks carry `symbol: None`.
use crate::tier2_shared::{build_chunk_no_symbol, policy_hash};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "code-text-paragraph-v1";
/// Lines-per-window for the oversize fallback (Tier 3).
const FALLBACK_LINES_PER_CHUNK: usize = 80;
/// Overlap between consecutive windows.
const FALLBACK_LINES_OVERLAP: usize = 20;
// stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP = 60.
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeTextParagraphV1Chunker;
impl Chunker for CodeTextParagraphV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full source text.
let (text, lang_str) = match doc.blocks.first() {
Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
_ => return Ok(vec![]),
};
let mut chunks = Vec::new();
for para in split_paragraphs(text) {
push_paragraph(&mut chunks, doc, policy, &para, lang_str)?;
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"code-text-paragraph-v1 chunked",
);
Ok(chunks)
}
}
/// A contiguous run of non-blank lines from the source text.
struct Paragraph {
/// Lines joined with `\n` (no trailing newline).
text: String,
/// 1-indexed line number of the first line in the source file.
line_start: u32,
/// 1-indexed line number of the last line in the source file.
line_end: u32,
}
/// Split `text` into `Paragraph`s separated by blank (all-whitespace) lines.
///
/// Blank lines are treated as boundaries and are NOT included in any
/// paragraph's line range. Paragraphs that would consist entirely of blank
/// lines are skipped.
fn split_paragraphs(text: &str) -> Vec<Paragraph> {
let mut paragraphs = Vec::new();
let mut current: Vec<&str> = Vec::new();
let mut current_start: Option<u32> = None;
for (idx, line) in text.lines().enumerate() {
let line_no = (idx + 1) as u32;
let is_blank = line.trim().is_empty();
if is_blank {
if let Some(start) = current_start.take() {
let end = start + current.len() as u32 - 1;
paragraphs.push(Paragraph {
text: current.join("\n"),
line_start: start,
line_end: end,
});
current.clear();
}
} else {
if current_start.is_none() {
current_start = Some(line_no);
}
current.push(line);
}
}
// Flush any trailing paragraph not terminated by a blank line.
if let Some(start) = current_start {
let end = start + current.len() as u32 - 1;
paragraphs.push(Paragraph {
text: current.join("\n"),
line_start: start,
line_end: end,
});
}
paragraphs
}
/// Emit one or more chunks for a single paragraph.
///
/// Paragraphs with ≤ `FALLBACK_LINES_PER_CHUNK` lines become a single chunk.
/// Larger paragraphs are split into overlapping windows of
/// `FALLBACK_LINES_PER_CHUNK` lines with stride `FALLBACK_LINES_PER_CHUNK -
/// FALLBACK_LINES_OVERLAP`. The last window may be shorter. Window starts
/// are passed as `split_key` so `id_for_chunk` can produce distinct ids
/// across windows.
fn push_paragraph(
out: &mut Vec<Chunk>,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
para: &Paragraph,
lang: &str,
) -> Result<()> {
let n_lines = (para.line_end - para.line_start + 1) as usize;
if n_lines <= FALLBACK_LINES_PER_CHUNK {
// Use line_start as split_key so each paragraph gets a distinct
// chunk_id even when block_ids is empty (no symbol, no AST structure).
// Without this, all short paragraphs from the same doc share the same
// base_policy_hash and therefore the same id_for_chunk result.
out.push(build_chunk_no_symbol(
doc,
policy,
&para.text,
para.line_start,
para.line_end,
lang,
VERSION_LABEL,
Some(para.line_start),
));
return Ok(());
}
// Oversize: line-window split with overlap.
let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
let lines: Vec<&str> = para.text.lines().collect();
let mut i = 0usize;
loop {
let end = (i + FALLBACK_LINES_PER_CHUNK).min(lines.len());
let window_text = lines[i..end].join("\n");
let window_start = para.line_start + i as u32;
let window_end = para.line_start + (end as u32) - 1;
// Use window_start as split_key so chunk_ids are unique across windows.
out.push(build_chunk_no_symbol(
doc,
policy,
&window_text,
window_start,
window_end,
lang,
VERSION_LABEL,
Some(window_start),
));
if end == lines.len() {
break;
}
i += stride;
}
Ok(())
}
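
The overlapping-window arithmetic in `push_paragraph` can be checked on its own. The sketch below is an illustration under assumptions: `line_windows` is a hypothetical helper (not part of the crate) that returns 0-based, half-open line ranges, and it requires `window > overlap` so the stride is positive.

```rust
/// Return 0-based, half-open `(start, end)` line ranges covering `n_lines`
/// lines with windows of `window` lines and `overlap` shared lines between
/// neighbours. Hypothetical helper mirroring the loop in `push_paragraph`.
fn line_windows(n_lines: usize, window: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(window > overlap, "stride must be positive");
    let stride = window - overlap;
    let mut out = Vec::new();
    let mut i = 0usize;
    loop {
        // The last window may be shorter than `window`.
        let end = (i + window).min(n_lines);
        out.push((i, end));
        if end == n_lines {
            break;
        }
        i += stride;
    }
    out
}
```

For a 200-line paragraph with the constants above (window 80, overlap 20) this yields windows starting at offsets 0, 60 and 120. Each window start, once shifted by the paragraph's `line_start`, is distinct, which is why it works as a `split_key`.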

@@ -0,0 +1,58 @@
//! p10-2: dockerfile whole-file chunker (Tier 2).
//!
//! Reads entire Dockerfile content and emits a single Chunk with symbol
//! "<dockerfile>", code_lang "dockerfile", line range 1..EOF.
//! Oversize >200 lines splits into line-windows sharing the symbol via
//! tier2_shared::push_chunks_with_oversize.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "dockerfile-file-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct DockerfileFileV1Chunker;
impl Chunker for DockerfileFileV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full Dockerfile text.
let text = match doc.blocks.first() {
Some(Block::Code(cb)) => cb.code.as_str(),
_ => return Ok(vec![]),
};
let total_lines = text.lines().count().max(1) as u32;
let mut chunks = Vec::new();
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
text,
1,
total_lines,
"<dockerfile>",
"dockerfile",
VERSION_LABEL,
None,
)?;
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"dockerfile-file-v1 chunked",
);
Ok(chunks)
}
}
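
The whole-file line range passed to `push_chunks_with_oversize` relies on how `str::lines` counts lines. A minimal sketch, assuming a hypothetical helper named `whole_file_range` that is not in the crate:

```rust
/// 1-based, inclusive line range covering the whole file.
/// `str::lines` does not count a trailing newline as an extra line, and an
/// empty file yields zero lines, so `.max(1)` keeps the range non-empty.
fn whole_file_range(text: &str) -> (u32, u32) {
    let total_lines = text.lines().count().max(1) as u32;
    (1, total_lines)
}
```

An empty Dockerfile therefore still maps to the range 1..=1, and a file with or without a final newline gets the same range.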

@@ -0,0 +1,170 @@
//! p10-2: k8s manifest resource-aware chunker.
//!
//! Splits a multi-document YAML file on `^---\s*$` boundaries, recognises
//! documents that have both `apiVersion` and `kind` string fields as k8s
//! resources, and emits one `Chunk` per resource (with oversize >200-line
//! fallback). Non-k8s documents are skipped; invalid YAML yields 0 chunks
//! for the entire file.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "k8s-manifest-resource-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct K8sManifestResourceV1Chunker;
impl Chunker for K8sManifestResourceV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full YAML text.
let text = match doc.blocks.first() {
Some(Block::Code(cb)) => cb.code.as_str(),
_ => return Ok(vec![]),
};
let slices = split_yaml_documents(text);
let mut chunks: Vec<Chunk> = Vec::new();
for slice in slices {
// Invalid YAML in any document → return 0 chunks for the file.
let value: serde_yaml::Value = match serde_yaml::from_str(slice.text) {
Ok(v) => v,
Err(_) => return Ok(vec![]),
};
let Some(mapping) = value.as_mapping() else {
continue;
};
let api = mapping
.get("apiVersion")
.and_then(|v| v.as_str())
.unwrap_or("");
let kind = mapping
.get("kind")
.and_then(|v| v.as_str())
.unwrap_or("");
// Skip non-k8s documents.
if api.is_empty() || kind.is_empty() {
continue;
}
let metadata = mapping
.get("metadata")
.and_then(|v| v.as_mapping());
let name = metadata
.and_then(|m| m.get("name"))
.and_then(|v| v.as_str())
.unwrap_or("<unnamed>");
let namespace = metadata
.and_then(|m| m.get("namespace"))
.and_then(|v| v.as_str());
let symbol = match namespace {
Some(ns) if !ns.is_empty() => format!("{kind}/{ns}/{name}"),
_ => format!("{kind}/{name}"),
};
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
slice.text,
slice.line_start,
slice.line_end,
&symbol,
"yaml",
VERSION_LABEL,
Some(slice.line_start),
)?;
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"k8s-manifest-resource-v1 chunked",
);
Ok(chunks)
}
}
struct YamlSlice<'a> {
text: &'a str,
line_start: u32,
line_end: u32,
}
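/// Split raw YAML text into per-document slices on `---` separator lines.
/// A separator is a line that is exactly `---` (trailing whitespace allowed)
/// or that starts with `---` followed by a space or tab (e.g. `--- # note`).
/// The separator line itself, including anything after the marker, is not
/// part of any slice. Line numbers are 1-indexed and inclusive.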
fn split_yaml_documents(text: &str) -> Vec<YamlSlice<'_>> {
let lines: Vec<&str> = text.lines().collect();
// Collect indices of separator lines (0-based), then append a sentinel at
// the end so the last slice is always terminated.
let mut separators: Vec<usize> = lines
.iter()
.enumerate()
.filter_map(|(i, l)| {
let trimmed = l.trim_end();
if trimmed == "---"
|| trimmed.starts_with("--- ")
|| trimmed.starts_with("---\t")
{
Some(i)
} else {
None
}
})
.collect();
separators.push(lines.len());
let mut slices: Vec<YamlSlice<'_>> = Vec::new();
let mut doc_start_line: usize = 0; // 0-based index of current doc start
for sep_line in separators {
if sep_line > doc_start_line {
let start_byte = byte_offset_of_line(text, doc_start_line);
let end_byte = byte_offset_of_line(text, sep_line);
let slice_text = &text[start_byte..end_byte];
if !slice_text.trim().is_empty() {
slices.push(YamlSlice {
text: slice_text,
line_start: (doc_start_line + 1) as u32,
line_end: sep_line as u32,
});
}
}
doc_start_line = sep_line + 1;
}
slices
}
/// Return the byte offset of the start of `line_idx` (0-based line index).
fn byte_offset_of_line(text: &str, line_idx: usize) -> usize {
if line_idx == 0 {
return 0;
}
let mut count = 0usize;
for (i, c) in text.char_indices() {
if c == '\n' {
count += 1;
if count == line_idx {
return i + 1;
}
}
}
text.len()
}
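
The document-splitting step can be illustrated without `serde_yaml`. The following is a simplified sketch, not the chunker's implementation: `is_separator` and `split_docs` are hypothetical names, and the sketch returns owned strings built from joined lines instead of borrowing byte slices as `split_yaml_documents` does.

```rust
/// True for a bare `---` line or `---` followed by a space or tab.
fn is_separator(line: &str) -> bool {
    let t = line.trim_end();
    t == "---" || t.starts_with("--- ") || t.starts_with("---\t")
}

/// Split multi-document YAML into `(line_start, line_end, text)` triples with
/// 1-based, inclusive line numbers. Separator lines belong to no document,
/// and documents that are entirely blank are skipped.
fn split_docs(text: &str) -> Vec<(u32, u32, String)> {
    let lines: Vec<&str> = text.lines().collect();
    let mut out = Vec::new();
    let mut start = 0usize; // 0-based index of the current document's first line
    for idx in 0..=lines.len() {
        // The end of input acts as a final separator.
        let at_boundary = idx == lines.len() || is_separator(lines[idx]);
        if !at_boundary {
            continue;
        }
        if idx > start {
            let body = lines[start..idx].join("\n");
            if !body.trim().is_empty() {
                out.push(((start + 1) as u32, idx as u32, body));
            }
        }
        start = idx + 1;
    }
    out
}
```

For a file holding a Service and a Deployment separated by `---`, the two documents start on different lines (1 and 4 in the test input), and that `line_start` is the value the k8s chunker passes as `base_split_key`.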

@@ -15,16 +15,35 @@
//! embedder, the retriever, the LLM, the RAG layer, or the UI layers.
//! It consumes `CanonicalDocument` purely through `kebab-core` types.
mod code_c_ast_v1;
mod code_cpp_ast_v1;
mod code_go_ast_v1;
mod code_java_ast_v1;
mod code_js_ast_v1;
mod code_kotlin_ast_v1;
mod code_python_ast_v1;
mod code_rust_ast_v1;
mod code_ts_ast_v1;
mod md_heading_v1;
mod pdf_page_v1;
mod tier2_shared;
pub mod k8s_manifest_resource_v1;
pub mod dockerfile_file_v1;
pub mod manifest_file_v1;
pub mod code_text_paragraph_v1;
pub use code_c_ast_v1::CodeCAstV1Chunker;
pub use code_cpp_ast_v1::CodeCppAstV1Chunker;
pub use code_go_ast_v1::CodeGoAstV1Chunker;
pub use code_java_ast_v1::CodeJavaAstV1Chunker;
pub use code_js_ast_v1::CodeJsAstV1Chunker;
pub use code_kotlin_ast_v1::CodeKotlinAstV1Chunker;
pub use code_python_ast_v1::CodePythonAstV1Chunker;
pub use code_rust_ast_v1::CodeRustAstV1Chunker;
pub use code_ts_ast_v1::CodeTsAstV1Chunker;
pub use md_heading_v1::MdHeadingV1Chunker;
pub use pdf_page_v1::PdfPageV1Chunker;
pub use k8s_manifest_resource_v1::K8sManifestResourceV1Chunker;
pub use dockerfile_file_v1::DockerfileFileV1Chunker;
pub use manifest_file_v1::ManifestFileV1Chunker;
pub use code_text_paragraph_v1::CodeTextParagraphV1Chunker;

@@ -0,0 +1,59 @@
//! p10-2: manifest whole-file chunker (Tier 2).
//!
//! Reads entire manifest file (Cargo.toml / package.json / pom.xml / go.mod /
//! build.gradle / pyproject.toml / tsconfig.json) and emits a single Chunk
//! with symbol "<manifest>", code_lang read from Block::Code.lang, line range
//! 1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
//! tier2_shared::push_chunks_with_oversize.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "manifest-file-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct ManifestFileV1Chunker;
impl Chunker for ManifestFileV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full manifest text.
let (text, lang) = match doc.blocks.first() {
Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
_ => return Ok(vec![]),
};
let total_lines = text.lines().count().max(1) as u32;
let mut chunks = Vec::new();
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
text,
1,
total_lines,
"<manifest>",
lang,
VERSION_LABEL,
None,
)?;
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"manifest-file-v1 chunked",
);
Ok(chunks)
}
}

@@ -0,0 +1,192 @@
//! p10-2: Tier 2 chunker shared helpers (oversize fallback + Chunk build).
//!
//! Mirrors `code_rust_ast_v1`'s Chunk-construction pattern exactly so that
//! id / hashes / token-count / ChunkPolicy semantics stay identical across
//! Tier 1 (AST) and Tier 2 (resource-aware) chunkers.
use anyhow::Result;
use kebab_core::{
BlockId, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, DocumentId, SourceSpan,
id_for_chunk,
};
pub(crate) const AST_CHUNK_MAX_LINES: u32 = 200;
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
/// Compute the policy hash the same way `code_rust_ast_v1` does.
pub(crate) fn policy_hash(policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
/// Emit one chunk for `(text, line_start..=line_end, symbol, lang)`, splitting
/// into line-windows of at most `AST_CHUNK_MAX_LINES` if the slice is oversize.
/// Mirrors the oversize path in `code_rust_ast_v1`'s `chunk` impl.
///
/// `base_split_key` is used as the `split_key` for the non-oversize single-chunk
/// case. Callers that emit multiple chunks from the same document (e.g.
/// `K8sManifestResourceV1Chunker` — one call per k8s resource) MUST pass
/// `Some(line_start)` so that each call produces a distinct `chunk_id`.
/// Single-chunk callers (dockerfile-file-v1, manifest-file-v1) pass `None` to
/// keep chunk_ids stable (no sibling can collide when there's only one chunk).
#[allow(clippy::too_many_arguments)]
pub(crate) fn push_chunks_with_oversize(
out: &mut Vec<Chunk>,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
symbol: &str,
lang: &str,
chunker_version: &str,
base_split_key: Option<u32>,
) -> Result<()> {
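// `saturating_sub` guards against `line_end < line_start`, which would
// otherwise underflow (and panic in debug builds) before any clamp applies.
let n_lines = line_end.saturating_sub(line_start).saturating_add(1);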
let cv = ChunkerVersion(chunker_version.to_string());
let base_policy_hash = policy_hash(policy);
if n_lines <= AST_CHUNK_MAX_LINES {
out.push(build_chunk(
doc,
&cv,
&base_policy_hash,
text,
line_start,
line_end,
symbol,
lang,
base_split_key,
));
return Ok(());
}
let lines: Vec<&str> = text.lines().collect();
let total = lines.len();
let mut window_start = line_start;
let mut i = 0usize;
while i < total {
let take = (AST_CHUNK_MAX_LINES as usize).min(total - i);
let window_text = lines[i..i + take].join("\n");
let window_end = window_start + take as u32 - 1;
out.push(build_chunk(
doc,
&cv,
&base_policy_hash,
&window_text,
window_start,
window_end,
symbol,
lang,
Some(window_start),
));
i += take;
window_start = window_end + 1;
}
Ok(())
}
/// Build a single `Chunk`, mirroring `make_chunk` in `code_rust_ast_v1.rs`
/// exactly (same id recipe, same token estimate, same field set).
///
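/// `split_key` is `Some(line_start_of_window)` for oversize splits. For a
/// non-oversize chunk it is the caller's `base_split_key`: `None` when the
/// document yields a single chunk, or `Some(line_start)` when several sibling
/// chunks come from one document (e.g. one per k8s resource). Mirrors the
/// `Some(part_ls)` / `None` split_key pattern in 1A-2.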
#[allow(clippy::too_many_arguments)]
pub(crate) fn build_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
base_policy_hash: &str,
text: &str,
line_start: u32,
line_end: u32,
symbol: &str,
lang: &str,
split_key: Option<u32>,
) -> Chunk {
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol.to_string()),
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, chunker_version, base_policy_hash, text, span, split_key)
}
/// Like `build_chunk` but emits `symbol: None`. Used by Tier 3 (per spec §9.3).
///
/// Accepts `policy: &ChunkPolicy` and `chunker_version: &str` (string slice)
/// so callers don't need to pre-compute the hash and version wrapper.
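/// `split_key` is `Some(window_start)` for oversize line-window splits. Tier 3
/// also passes `Some(paragraph_line_start)` for ordinary paragraphs so that
/// sibling paragraphs from one document get distinct `chunk_id`s.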
#[allow(clippy::too_many_arguments)]
pub(crate) fn build_chunk_no_symbol(
doc: &CanonicalDocument,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
lang: &str,
chunker_version: &str,
split_key: Option<u32>,
) -> Chunk {
let cv = ChunkerVersion(chunker_version.to_string());
let base_policy_hash = policy_hash(policy);
let span = SourceSpan::Code {
line_start,
line_end,
symbol: None,
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, &cv, &base_policy_hash, text, span, split_key)
}
/// Core chunk-building logic shared by `build_chunk` and `build_chunk_no_symbol`.
///
/// Takes a pre-built `SourceSpan` so the only difference between the two
/// public helpers is whether `symbol` is `Some` or `None`. All id/hash/
/// token mechanics are identical.
fn build_chunk_from_span(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
base_policy_hash: &str,
text: &str,
span: SourceSpan,
split_key: Option<u32>,
) -> Chunk {
// id_hash mirrors code_rust_ast_v1's make_chunk logic:
// split_key Some(k) => "{base_policy_hash}#L{k}"
// split_key None => base_policy_hash
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
// block_ids: Tier 2/3 chunkers have no per-block structure (the whole file
// is one Block::Code), so we pass an empty slice. As a result, chunks from
// the same document differ only in `id_hash`, and callers that emit more
// than one chunk per document must supply a distinct `split_key`.
let block_ids: Vec<BlockId> = vec![];
let chunk_id = id_for_chunk(
&DocumentId(doc.doc_id.0.clone()),
chunker_version,
&block_ids,
&id_hash,
);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids,
text: text.to_string(),
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
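
The collision that motivated `base_split_key` follows directly from the `id_hash` recipe. The sketch below reproduces that recipe with the standard library only. `toy_chunk_id` is a hypothetical stand-in for `kebab_core::id_for_chunk` (whose real hashing is not shown in this diff); it only assumes that the id is a deterministic function of its inputs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Same recipe as `build_chunk_from_span`: append `#L{k}` only when a
/// split key is supplied.
fn id_hash(base_policy_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        Some(k) => format!("{base_policy_hash}#L{k}"),
        None => base_policy_hash.to_string(),
    }
}

/// Hypothetical stand-in for `id_for_chunk`: any deterministic function of
/// (doc id, chunker version, id_hash). Block ids are omitted because the
/// Tier 2/3 helpers always pass an empty slice.
fn toy_chunk_id(doc_id: &str, chunker_version: &str, id_hash: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (doc_id, chunker_version, id_hash).hash(&mut h);
    h.finish()
}
```

Two resources from one manifest share the doc id, the chunker version and the base policy hash. With `split_key = None` their `id_hash` strings are identical, so any deterministic id function returns the same `chunk_id` and the `UNIQUE` constraint fails. Passing each resource's `line_start` makes the `id_hash` strings differ.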

@@ -0,0 +1,196 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative C code `CanonicalDocument`.
//!
//! This is an integration test that intentionally does not use
//! `kebab-parse-code` (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_go_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeCAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("projects/record.c".into());
let aid = AssetId("c".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-c-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Representative units (each ≤200 lines):
// 0. <top-level> glue (lines 1-18): includes + defines (lines 1-4),
//    status_t enum typedef (lines 6-10), record_t struct typedef
//    (lines 12-16), static counter decl (line 18)
// 1. parse_record fn (lines 20-23)
// 2. print_record fn (lines 25-27)
// 3. main fn (lines 29-33)
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"<top-level>",
1,
18,
"#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;".to_string(),
),
(
"parse_record",
20,
23,
"int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}".to_string(),
),
(
"print_record",
25,
27,
"void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}".to_string(),
),
(
"main",
29,
33,
"int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}".to_string(),
),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("c".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("c".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "record.c".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("c".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-c-ast-v1".into()),
}
}
#[test]
fn code_c_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.c.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-c-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_c_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeCAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeCAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
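
The snapshot test above has four possible outcomes depending on whether a baseline exists, whether it matches, and whether `UPDATE_SNAPSHOTS` is set. The decision table can be sketched as a pure function. This is an illustration only: `SnapshotOutcome` and `check_snapshot` are hypothetical names, and the sketch compares strings, whereas the test compares parsed `serde_json::Value`s and performs the file writes itself.

```rust
/// Possible results of comparing fresh output against a stored baseline.
#[derive(Debug, PartialEq)]
enum SnapshotOutcome {
    /// Baseline exists and equals the fresh output.
    Match,
    /// Baseline was missing or different, and updating was requested.
    Rebaked,
    /// Baseline is missing and updating was not requested (test panics).
    MissingBaseline,
    /// Baseline differs and updating was not requested (test panics).
    Drift,
}

/// Decide what the snapshot test should do. `baseline` is `None` when the
/// baseline file could not be read; `update` models `UPDATE_SNAPSHOTS` being set.
fn check_snapshot(baseline: Option<&str>, actual: &str, update: bool) -> SnapshotOutcome {
    match baseline {
        None if update => SnapshotOutcome::Rebaked,
        None => SnapshotOutcome::MissingBaseline,
        Some(b) if b == actual => SnapshotOutcome::Match,
        Some(_) if update => SnapshotOutcome::Rebaked,
        Some(_) => SnapshotOutcome::Drift,
    }
}
```

Note that a matching baseline is never rewritten, even when updating is requested, which keeps re-bake runs from touching unchanged fixture files.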

@@ -0,0 +1,325 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative C++ code `CanonicalDocument`.
//!
//! Two complementary tests:
//! 1. `code_cpp_ast_chunks_snapshot` — hand-built `fixed_doc()` validates the
//! chunker's 1:1 mapping (design §6.3 / §8 boundary: no parse-code dep needed).
//! 2. `code_cpp_ast_extractor_snapshot` — invokes `CppAstExtractor` against the
//! real `tests/fixtures/sample.cpp` fixture, validating the extractor → chunker
//! end-to-end pipeline. `kebab-parse-code` is a dev-dep (same pattern as
//! `kebab-parse-md` in Markdown snapshot tests).
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeCppAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use kebab_parse_code::CppAstExtractor;
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("projects/record.cpp".into());
let aid = AssetId("c".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-cpp-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Representative units (C++ specific):
// 0. includes + namespace opening (lines 1-4, ≤200)
// 1. class definition (lines 6-20, ≤200)
// 2. template function (lines 22-25, ≤200)
// 3. namespace closing + free fn (lines 27-29, ≤200)
// 4. main fn (lines 31-34, ≤200)
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"<top-level>",
1,
4,
"#include <string>\n#include <vector>\n\nnamespace kebab {".to_string(),
),
(
"kebab::chunk::MdHeadingV1Chunker",
6,
20,
"class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};".to_string(),
),
(
"kebab::identity",
22,
25,
"template <typename T>\nT identity(T value) {\n return value;\n}".to_string(),
),
(
"kebab::global_helper",
27,
29,
"void global_helper() {\n // free function in kebab namespace\n}".to_string(),
),
(
"main",
31,
34,
"int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}".to_string(),
),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("cpp".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("cpp".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "record.cpp".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("cpp".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-cpp-ast-v1".into()),
}
}
// ---------------------------------------------------------------------------
// Helper: run the real CppAstExtractor against tests/fixtures/sample.cpp
// ---------------------------------------------------------------------------
fn extract_cpp_fixture() -> CanonicalDocument {
// `AssetId`, `WorkspacePath`, and `PathBuf` are already imported at module level.
use kebab_core::{
AssetStorage, Checksum, ExtractConfig, ExtractContext, Extractor, RawAsset, SourceUri,
};
let bytes = std::fs::read(fixtures_dir().join("sample.cpp")).expect("read sample.cpp fixture");
let src = String::from_utf8(bytes).expect("fixture is valid UTF-8");
let wp = WorkspacePath("tests/fixtures/sample.cpp".to_string());
let asset = RawAsset {
asset_id: AssetId("e".repeat(64)),
source_uri: SourceUri::File(PathBuf::from("tests/fixtures/sample.cpp")),
workspace_path: wp,
media_type: kebab_core::MediaType::Code("cpp".to_string()),
byte_len: src.len() as u64,
checksum: Checksum("f".repeat(64)),
discovered_at: time::OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from("tests/fixtures/sample.cpp"),
sha: Checksum("f".repeat(64)),
},
};
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CppAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
}
// ---------------------------------------------------------------------------
// Test 1 (hand-built): chunker-only 1:1 mapping validation
// ---------------------------------------------------------------------------
#[test]
fn code_cpp_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.cpp.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-cpp-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_cpp_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeCppAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeCppAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
// ---------------------------------------------------------------------------
// Test 2 (real extractor): end-to-end extractor → chunker pipeline
// ---------------------------------------------------------------------------
/// Validates that the real `CppAstExtractor` processes `sample.cpp` and
/// emits the expected set of named units. The chunker stage is exercised
/// separately by `code_cpp_ast_extractor_chunks_deterministic`.
///
/// `sample.cpp` contains:
/// - `#include` directives + nested namespace `kebab::chunk` → glue + struct unit
/// - `class MdHeadingV1Chunker` with methods (ctor, dtor, chunk_doc, operator())
/// - `template <typename T> T identity(T value)` (template fn)
/// - `void kebab::global_helper()` (free fn in namespace)
/// - `int main()` (global free fn)
#[test]
fn code_cpp_ast_extractor_snapshot() {
let doc = extract_cpp_fixture();
// Verify the extractor emits all expected named units.
let block_syms: Vec<Option<String>> = doc.blocks.iter().filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => Some(symbol.clone()),
_ => None,
},
_ => None,
}).collect();
// Must include namespace-qualified class and its methods
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker")),
"class unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker")),
"ctor unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker")),
"dtor unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::chunk_doc")),
"chunk_doc unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::operator()")),
"operator() unit missing: {block_syms:?}"
);
// Template function (inside kebab::chunk namespace in the fixture)
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::identity")),
"identity template fn unit missing: {block_syms:?}"
);
// Free function in outer namespace
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::global_helper")),
"global_helper unit missing: {block_syms:?}"
);
// Global main
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("main")),
"main unit missing: {block_syms:?}"
);
}
/// End-to-end chunker output from real extractor is deterministic.
#[test]
fn code_cpp_ast_extractor_chunks_deterministic() {
let doc1 = extract_cpp_fixture();
let doc2 = extract_cpp_fixture();
assert_eq!(doc1.blocks, doc2.blocks, "extractor output non-deterministic");
let policy = fixed_policy();
let chunks1 = CodeCppAstV1Chunker.chunk(&doc1, &policy).unwrap();
let chunks2 = CodeCppAstV1Chunker.chunk(&doc2, &policy).unwrap();
assert_eq!(
chunks1.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
chunks2.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
"chunker output non-deterministic"
);
}
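
The read / compare / re-bake logic in `code_cpp_ast_chunks_snapshot` recurs in every snapshot test in this change. Below is a minimal sketch of how it could be factored into one helper. It is an illustration only: `SnapshotOutcome` and `check_snapshot` are hypothetical names that do not exist in the crate. The sketch compares raw text rather than parsed `serde_json::Value`s so that it needs only the standard library, and unlike the test above it re-bakes a missing baseline only on `NotFound`, not on any read error.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Result of comparing freshly rendered output against a stored baseline.
/// Hypothetical type, introduced for this sketch only.
#[derive(Debug, PartialEq, Eq)]
enum SnapshotOutcome {
    /// Baseline exists and equals the actual output.
    Matched,
    /// Baseline was created or overwritten because `update` was set.
    Written,
    /// Baseline exists, differs, and `update` was not set.
    Drift,
}

/// Compare `actual` with the file at `baseline`.
/// `update` plays the role of the `UPDATE_SNAPSHOTS` switch.
fn check_snapshot(baseline: &Path, actual: &str, update: bool) -> io::Result<SnapshotOutcome> {
    match fs::read_to_string(baseline) {
        Ok(expected) if expected == actual => Ok(SnapshotOutcome::Matched),
        Ok(_) if update => {
            fs::write(baseline, actual)?;
            Ok(SnapshotOutcome::Written)
        }
        Ok(_) => Ok(SnapshotOutcome::Drift),
        // Assumption: only a missing file is treated as "no baseline yet".
        Err(e) if update && e.kind() == io::ErrorKind::NotFound => {
            if let Some(parent) = baseline.parent() {
                fs::create_dir_all(parent)?;
            }
            fs::write(baseline, actual)?;
            Ok(SnapshotOutcome::Written)
        }
        Err(e) => Err(e),
    }
}
```

A caller would presumably derive `update` from `std::env::var("UPDATE_SNAPSHOTS").is_ok()` and panic with the expected / actual dump on `Drift`, as the tests above do.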

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Go code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeGoAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("kebab_eval/metrics.go".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-go-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line function body to force split_oversize.
let big_body: String = {
let header = "func BigCompute(data []int) int {\n";
let body: String = (0..210u32)
.map(|i| format!("\tv{i} := 0\n\tif {i} < len(data) {{\n\t\tv{i} = data[{i}]\n\t}}\n"))
.collect();
let footer = "\treturn len(data)\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 1-5, ≤200)
// 1. free fn `ComputeMRR` (lines 7-12, ≤200)
// 2. struct `MetricsCollector` (lines 14-20, ≤200)
// 3. struct `BaseEvaluator` (lines 22-30, ≤200)
// 4. method `Run` (lines 32-38, ≤200)
// 5. method `Report` (lines 40-46, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)".to_string(),
),
(
"ComputeMRR",
7,
12,
"func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}".to_string(),
),
(
"BaseEvaluator",
22,
30,
"type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}".to_string(),
),
(
"MetricsCollector.Run",
32,
38,
"func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}".to_string(),
),
(
"MetricsCollector.Report",
40,
46,
"func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("go".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("go".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "metrics.go".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("go".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-go-ast-v1".into()),
}
}
#[test]
fn code_go_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.go.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-go-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_go_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeGoAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeGoAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
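
The `big_line_end = 48 + big_line_count - 1` arithmetic in `fixed_doc` turns a start line and a line count into an inclusive end line. The sketch below works that out for the Go `BigCompute` body built above. `inclusive_line_end` and `go_big_body` are hypothetical helpers written for this note, and the `max(1)` guard for empty text is an addition that the original expression does not have.

```rust
/// Inclusive end line of a block that starts at `line_start` and contains `text`.
/// Hypothetical helper; the `max(1)` guard avoids underflow on empty text.
fn inclusive_line_end(line_start: u32, text: &str) -> u32 {
    let line_count = text.lines().count() as u32;
    line_start + line_count.max(1) - 1
}

/// Rebuilds the oversize Go function body exactly as `fixed_doc` does.
fn go_big_body() -> String {
    let header = "func BigCompute(data []int) int {\n";
    let body: String = (0..210u32)
        .map(|i| format!("\tv{i} := 0\n\tif {i} < len(data) {{\n\t\tv{i} = data[{i}]\n\t}}\n"))
        .collect();
    let footer = "\treturn len(data)\n}";
    // 1 header line + 210 * 4 body lines + 2 footer lines = 843 lines,
    // so a block starting at line 48 spans lines 48..=890.
    format!("{header}{body}{footer}")
}
```

With 843 lines the unit is well past the 200-line threshold named in the comments, which is what makes it a useful trigger for `split_oversize`.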

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Java code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeJavaAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("src/main/java/com/example/Metrics.java".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-java-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line method body to force split_oversize.
let big_body: String = {
let header = "public class BigCompute {\n public int compute(int[] data) {\n";
let body: String = (0..210u32)
.map(|i| format!(" int v{i} = {i} < data.length ? data[{i}] : 0;\n"))
.collect();
let footer = " return data.length;\n }\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 1-5, ≤200)
// 1. static method `computeMRR` (lines 7-12, ≤200)
// 2. class `MetricsCollector` (lines 14-20, ≤200)
// 3. class `BaseEvaluator` (lines 22-30, ≤200)
// 4. method `MetricsCollector.run` (lines 32-38, ≤200)
// 5. method `MetricsCollector.report` (lines 40-46, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import java.util.List;\nimport java.util.Map;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.stream.Collectors;".to_string(),
),
(
"computeMRR",
7,
12,
"public static double computeMRR(List<Double> scores) {\n if (scores.isEmpty()) {\n return 0.0;\n }\n return 1.0 / scores.size();\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"public class MetricsCollector {\n private List<Double> scores;\n private List<String> labels;\n private Map<String, Integer> counts;\n private Map<String, Double> totals;\n private List<String> tags;\n}".to_string(),
),
(
"BaseEvaluator",
22,
30,
"public class BaseEvaluator {\n private String name;\n\n public BaseEvaluator(String name) {\n this.name = name;\n }\n\n public void evaluate(List<String> data) throws Exception {\n String joined = String.join(\",\", data);\n }\n}".to_string(),
),
(
"MetricsCollector.run",
32,
38,
"public void run(List<Double> inputs) {\n for (Double inp : inputs) {\n scores.add(\n inp\n );\n }\n}".to_string(),
),
(
"MetricsCollector.report",
40,
46,
"public Map<String, Object> report() {\n Map<String, Object> result = new HashMap<>();\n result.put(\"mean\", 0.0);\n result.put(\"count\", scores.size());\n result.put(\"tags\", tags);\n return result;\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("java".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("java".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Metrics.java".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("java".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-java-ast-v1".into()),
}
}
#[test]
fn code_java_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.java.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-java-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_java_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeJavaAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeJavaAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
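
The comment "Pin parser_version so doc_id / block_ids are reproducible" relies on IDs being a pure function of their inputs. The real `id_for_doc` lives in `kebab_core` and is not shown in this diff, so the sketch below substitutes a hand-rolled FNV-1a hash purely to illustrate that property. `fnv1a64`, `sketch_doc_id`, and the field layout are assumptions, not the crate's actual scheme.

```rust
/// 64-bit FNV-1a over several string fields, each terminated by a 0 byte
/// so that ("ab", "c") and ("a", "bc") hash differently.
/// Stand-in for whatever hash the real ID derivation uses.
fn fnv1a64(parts: &[&str]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for part in parts {
        for byte in part.bytes().chain(std::iter::once(0u8)) {
            hash ^= u64::from(byte);
            hash = hash.wrapping_mul(PRIME);
        }
    }
    hash
}

/// Hypothetical document ID: depends only on the three pinned inputs,
/// never on wall-clock time or process state.
fn sketch_doc_id(workspace_path: &str, asset_id: &str, parser_version: &str) -> String {
    format!("{:016x}", fnv1a64(&[workspace_path, asset_id, parser_version]))
}
```

Under this model, changing the pinned `ParserVersion` string changes every derived ID, which is why the fixtures fix it alongside the workspace path and asset ID.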

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Kotlin code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeKotlinAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("src/main/kotlin/com/example/Metrics.kt".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-kotlin-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line function body to force split_oversize.
let big_body: String = {
let header = "class BigCompute {\n fun compute(data: IntArray): Int {\n";
let body: String = (0..210u32)
.map(|i| format!(" val v{i} = if ({i} < data.size) data[{i}] else 0\n"))
.collect();
let footer = " return data.size\n }\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 1-5, ≤200)
// 1. top-level fn `computeMRR` (lines 7-12, ≤200)
// 2. data class `MetricsCollector` (lines 14-20, ≤200)
// 3. class `BaseEvaluator` (lines 22-30, ≤200)
// 4. method `MetricsCollector.run` (lines 32-38, ≤200)
// 5. method `MetricsCollector.report` (lines 40-46, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import kotlin.collections.List\nimport kotlin.collections.Map\nimport kotlin.collections.MutableList\nimport kotlin.collections.MutableMap\nimport kotlin.collections.mutableListOf".to_string(),
),
(
"computeMRR",
7,
12,
"fun computeMRR(scores: List<Double>): Double {\n if (scores.isEmpty()) {\n return 0.0\n }\n return 1.0 / scores.size\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"data class MetricsCollector(\n val scores: MutableList<Double> = mutableListOf(),\n val labels: MutableList<String> = mutableListOf(),\n val counts: MutableMap<String, Int> = mutableMapOf(),\n val totals: MutableMap<String, Double> = mutableMapOf(),\n val tags: MutableList<String> = mutableListOf(),\n)".to_string(),
),
(
"BaseEvaluator",
22,
30,
"open class BaseEvaluator(val name: String) {\n\n fun evaluate(data: List<String>) {\n val joined = data.joinToString(\",\")\n println(joined)\n }\n\n open fun describe(): String = name\n}".to_string(),
),
(
"MetricsCollector.run",
32,
38,
"fun MetricsCollector.run(inputs: List<Double>) {\n for (inp in inputs) {\n scores.add(\n inp\n )\n }\n}".to_string(),
),
(
"MetricsCollector.report",
40,
46,
"fun MetricsCollector.report(): Map<String, Any> {\n return mapOf(\n \"mean\" to 0.0,\n \"count\" to scores.size,\n \"tags\" to tags,\n )\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("kotlin".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("kotlin".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Metrics.kt".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("kotlin".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-kotlin-ast-v1".into()),
}
}
#[test]
fn code_kotlin_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.kt.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-kotlin-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_kotlin_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
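
Each `*_chunks_are_deterministic` test in this change repeats the same loop: compute a baseline list of `chunk_id`s, re-run five times, and compare. A generic form could look like the sketch below. `first_divergence` is a hypothetical helper, not part of `kebab_chunk` or `kebab_core`.

```rust
/// Runs `run` once to obtain a baseline, then `repeats` more times.
/// Returns the 1-based index of the first repeat whose output differs
/// from the baseline, or `None` if every repeat matched.
fn first_divergence<T, F>(mut run: F, repeats: usize) -> Option<usize>
where
    T: PartialEq,
    F: FnMut() -> T,
{
    let baseline = run();
    (1..=repeats).find(|_| run() != baseline)
}
```

A test could then assert `first_divergence(|| chunk_ids(), 5).is_none()`, where `chunk_ids` stands for the chunk-and-collect pipeline used above. Returning the index rather than a bare `bool` makes a flaky failure slightly easier to report.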

@@ -0,0 +1,270 @@
//! Behavioural tests for `CodeTextParagraphV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing raw text into a single `Block::Code`, mirroring the pattern used
//! in `k8s_manifest_resource_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::CodeTextParagraphV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `text`
/// and the supplied `lang` label.
fn text_doc(lang: &str, text: &str) -> CanonicalDocument {
let wp = WorkspacePath("scripts/sample.sh".into());
let aid = AssetId("d".repeat(64));
let pv = ParserVersion("code-text-paragraph-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some(lang.into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some(lang.into()),
code: text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "sample.sh".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some(lang.into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-text-paragraph-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// `sample_shell.sh` has 4 paragraphs separated by 3 blank lines:
/// - paragraph 1: lines 1-2 (shebang + set -euo pipefail)
/// - paragraph 2: lines 4-7 (env setup block)
/// - paragraph 3: lines 9-11 (ingest block)
/// - paragraph 4: lines 13-15 (report block)
///
/// We assert:
/// - exactly 4 chunks (one per paragraph)
/// - all symbols are None (Tier 3 spec §9.3)
/// - all langs are "shell"
/// - line ranges are strictly ascending and do NOT include the blank lines
/// (lines 3, 8, 12 must not appear in any range)
#[test]
fn shell_multi_paragraph_splits_on_blank_lines() {
let fixture_path = fixtures_dir().join("sample_shell.sh");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
4,
"expected 4 chunks (one per paragraph), got {}: {chunks:#?}",
chunks.len()
);
// All symbols must be None (Tier 3 requirement).
for (i, chunk) in chunks.iter().enumerate() {
match &chunk.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.is_none(),
"chunk[{i}] symbol must be None for Tier 3 chunker, got {symbol:?}"
);
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// All langs must be "shell".
for (i, chunk) in chunks.iter().enumerate() {
match &chunk.source_spans[0] {
SourceSpan::Code { lang, .. } => {
assert_eq!(
lang.as_deref(),
Some("shell"),
"chunk[{i}] lang must be 'shell', got {lang:?}"
);
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// Line ranges must be strictly ascending with no overlap,
// and blank lines (3, 8, 12) must not be included in any range.
let expected_ranges: &[(u32, u32)] = &[(1, 2), (4, 7), (9, 11), (13, 15)];
let actual_ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
assert_eq!(
actual_ranges, expected_ranges,
"line ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
);
}
/// `sample_long_paragraph.txt` has exactly 200 non-blank lines and no blank
/// lines, so the entire file is one paragraph. 200 > 80 (FALLBACK_LINES_PER_CHUNK),
/// so the oversize window split fires with stride 60:
/// - window 1: lines 1-80
/// - window 2: lines 61-140
/// - window 3: lines 121-200
///
/// All chunk_ids must be distinct (the #L{window_start} split_key suffix).
#[test]
fn single_long_paragraph_line_window_split() {
let fixture_path = fixtures_dir().join("sample_long_paragraph.txt");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
assert_eq!(
text.lines().count(),
200,
"fixture must have exactly 200 lines"
);
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
3,
"expected 3 window chunks for 200-line paragraph, got {}: {chunks:#?}",
chunks.len()
);
let expected_ranges: &[(u32, u32)] = &[(1, 80), (61, 140), (121, 200)];
let actual_ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
assert_eq!(
actual_ranges, expected_ranges,
"window ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
);
// All chunk_ids must be distinct (#L{window_start} suffix differentiates them).
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"oversize window chunks must have distinct chunk_ids"
);
}
/// An empty source file (no non-blank lines) must yield zero chunks.
#[test]
fn empty_file_emits_zero_chunks() {
let doc = text_doc("shell", "");
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
0,
"empty file must yield 0 chunks, got {}: {chunks:#?}",
chunks.len()
);
}
/// The `lang` field on each emitted chunk must match the `lang` passed to
/// `text_doc`, regardless of content. `symbol` must be `None` (Tier 3 spec).
#[test]
fn lang_field_preserved_from_input_doc() {
let doc = text_doc("yaml", "key1: value1\nkey2: value2\n");
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert!(!chunks.is_empty(), "expected at least one chunk");
match &chunks[0].source_spans[0] {
SourceSpan::Code { lang, symbol, .. } => {
assert_eq!(
lang.as_deref(),
Some("yaml"),
"lang must be 'yaml', got {lang:?}"
);
assert!(
symbol.is_none(),
"symbol must be None for Tier 3 chunker, got {symbol:?}"
);
}
other => panic!("expected Code span, got {other:?}"),
}
}
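
The line ranges asserted in the tests above can be reproduced with a small standalone sketch. It assumes that a paragraph is a maximal run of non-blank lines and that an oversize paragraph is cut into 80-line windows advanced by a stride of 60, as the doc comments describe. The names `paragraph_ranges`, `window_ranges`, `WINDOW_LINES`, and `STRIDE_LINES` are hypothetical and are not taken from the crate; the real chunker may be structured differently.

```rust
/// Hypothetical constants mirroring the values quoted in the test comments.
const WINDOW_LINES: u32 = 80;
const STRIDE_LINES: u32 = 60;

/// Returns 1-based inclusive (start, end) ranges, one per run of non-blank lines.
fn paragraph_ranges(text: &str) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut start: Option<u32> = None;
    let mut last = 0u32;
    for (idx, line) in text.lines().enumerate() {
        let n = idx as u32 + 1;
        if line.trim().is_empty() {
            // A blank line closes the paragraph that is currently open, if any.
            if let Some(s) = start.take() {
                ranges.push((s, n - 1));
            }
        } else if start.is_none() {
            start = Some(n);
        }
        last = n;
    }
    // Flush a paragraph that runs to the end of the file.
    if let Some(s) = start {
        ranges.push((s, last));
    }
    ranges
}

/// Cuts one paragraph (start <= end) into overlapping windows.
/// A paragraph that fits in a single window is returned unchanged.
fn window_ranges(start: u32, end: u32) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    let mut s = start;
    loop {
        let e = (s + WINDOW_LINES - 1).min(end);
        out.push((s, e));
        if e == end {
            break;
        }
        s += STRIDE_LINES;
    }
    out
}
```

With these definitions, `window_ranges(1, 200)` gives `(1, 80)`, `(61, 140)`, `(121, 200)`: consecutive windows share 20 lines because the stride is 20 lines shorter than the window.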

@@ -0,0 +1,134 @@
//! Behavioural tests for `DockerfileFileV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing the raw Dockerfile text into a single `Block::Code`, mirroring the
//! pattern used in `k8s_manifest_resource_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::DockerfileFileV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `dockerfile_text`.
fn dockerfile_doc(dockerfile_text: &str) -> CanonicalDocument {
let wp = WorkspacePath("build/Dockerfile".into());
let aid = AssetId("d".repeat(64));
let pv = ParserVersion("code-dockerfile-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = dockerfile_text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some("dockerfile".into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("dockerfile".into()),
code: dockerfile_text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Dockerfile".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("dockerfile".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("dockerfile-file-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// A simple 5-line Dockerfile fixture must emit exactly 1 chunk with the
/// correct symbol, lang, and line range.
#[test]
fn dockerfile_emits_single_chunk() {
let fixture_path = fixtures_dir().join("sample.dockerfile");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = dockerfile_doc(&text);
let chunks = DockerfileFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
// Inspect the Chunk's source_spans for symbol / lang / line range.
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(*line_end, 5, "line_end must be 5 (5-line fixture)");
assert_eq!(
symbol.as_deref(),
Some("<dockerfile>"),
"symbol must be '<dockerfile>'"
);
assert_eq!(lang.as_deref(), Some("dockerfile"), "lang must be 'dockerfile'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
// Verify chunker_version label.
assert_eq!(chunks[0].chunker_version.0, "dockerfile-file-v1");
}
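
The `dockerfile_doc` helper above clamps `line_end` with `.max(1)`. A minimal sketch of why that clamp matters, assuming the span should always be a valid 1-based inclusive range: `str::lines()` yields no items for an empty string, so without the clamp the end line would be 0, which is smaller than the start line. `whole_file_range` is a hypothetical name used only for illustration.

```rust
/// Hypothetical helper: 1-based inclusive line range covering a whole file.
/// Empty input still maps to (1, 1) so that start <= end always holds.
fn whole_file_range(text: &str) -> (u32, u32) {
    let line_count = text.lines().count() as u32;
    (1, line_count.max(1))
}
```

A trailing newline does not add a line here, because `str::lines()` does not yield a final empty item after the last terminator.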

@@ -0,0 +1,86 @@
[
{
"block_ids": [
"8149e12ca002489acb4a0f74c97a061a"
],
"chunk_id": "ec3cf06ae56c8e9796bbc9196438b7c5",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 18,
"line_start": 1,
"symbol": "<top-level>"
}
],
"text": "#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;",
"token_estimate": 78
},
{
"block_ids": [
"1baaa89f21a47b2f32d6396a24a85454"
],
"chunk_id": "c2d7a81c898106733ef2e703774a6a4a",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 23,
"line_start": 20,
"symbol": "parse_record"
}
],
"text": "int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}",
"token_estimate": 41
},
{
"block_ids": [
"8d0e14cbcc6d1e92d7878ab796ea68b8"
],
"chunk_id": "0e4d7b131ab64eba03b51903b5d8f96d",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 27,
"line_start": 25,
"symbol": "print_record"
}
],
"text": "void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}",
"token_estimate": 35
},
{
"block_ids": [
"9c2ede84423871b615d48c38fefb1853"
],
"chunk_id": "e076f8edb2ff141d7e99b4106bb95157",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 33,
"line_start": 29,
"symbol": "main"
}
],
"text": "int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}",
"token_estimate": 38
}
]

@@ -0,0 +1,107 @@
[
{
"block_ids": [
"53292605459065d170cd36c118e20546"
],
"chunk_id": "50a5b324300d9082eac4ce2a422810e1",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 4,
"line_start": 1,
"symbol": "<top-level>"
}
],
"text": "#include <string>\n#include <vector>\n\nnamespace kebab {",
"token_estimate": 18
},
{
"block_ids": [
"f349acad94c9fa4cf9ad1c0a93e83610"
],
"chunk_id": "0e6bc7c522665af8a4b0f66afb9d29c8",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 20,
"line_start": 6,
"symbol": "kebab::chunk::MdHeadingV1Chunker"
}
],
"text": "class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};",
"token_estimate": 95
},
{
"block_ids": [
"8b9811387717d0bd4abf84abcc35b8b1"
],
"chunk_id": "d9326d252905b665b2adb9a416c20451",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 25,
"line_start": 22,
"symbol": "kebab::identity"
}
],
"text": "template <typename T>\nT identity(T value) {\n return value;\n}",
"token_estimate": 21
},
{
"block_ids": [
"1754cb6b971f6a4cb292f144a4f0570b"
],
"chunk_id": "56ee5f991de4a413c016da8dc4acfc35",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 29,
"line_start": 27,
"symbol": "kebab::global_helper"
}
],
"text": "void global_helper() {\n // free function in kebab namespace\n}",
"token_estimate": 22
},
{
"block_ids": [
"14b5f3393d6d25f822f5b70763d24acd"
],
"chunk_id": "c0d7c043cdd575c530db3909b54cc906",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 34,
"line_start": 31,
"symbol": "main"
}
],
"text": "int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}",
"token_estimate": 23
}
]

@@ -0,0 +1,233 @@
[
{
"block_ids": [
"c182bf37e32c7fc1b868bd617f8eaf66"
],
"chunk_id": "43de518d946dc18ec040ae20d74e0cff",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 5,
"line_start": 1,
"symbol": "imports"
}
],
"text": "import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)",
"token_estimate": 12
},
{
"block_ids": [
"c9992cdcfdf3c2a7700a4abc4782a8a4"
],
"chunk_id": "af4c382a83f1e8cdea495d8b33c11abc",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 12,
"line_start": 7,
"symbol": "ComputeMRR"
}
],
"text": "func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}",
"token_estimate": 50
},
{
"block_ids": [
"5f18dc3e79fe946ba05d32c3bfc00684"
],
"chunk_id": "4be6d8f180bc19b8651877e5264852ac",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 20,
"line_start": 14,
"symbol": "MetricsCollector"
}
],
"text": "type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}",
"token_estimate": 45
},
{
"block_ids": [
"3009cc022ca832c323393e4f9bcdb388"
],
"chunk_id": "3ae182f4c6d304ee7f0aaf447142f948",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 30,
"line_start": 22,
"symbol": "BaseEvaluator"
}
],
"text": "type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}",
"token_estimate": 53
},
{
"block_ids": [
"e0e83d1d7f9327a1902ae9a8f67c1f1c"
],
"chunk_id": "b962f14980e756bb8ba514e2282756cd",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 38,
"line_start": 32,
"symbol": "MetricsCollector.Run"
}
],
"text": "func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}",
"token_estimate": 44
},
{
"block_ids": [
"0e6a572bc3fe2bd6d173fe614bd1b763"
],
"chunk_id": "441c695e990e7f49188068433e313e87",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 46,
"line_start": 40,
"symbol": "MetricsCollector.Report"
}
],
"text": "func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}",
"token_estimate": 53
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "7a942d871c588ec69426290561f05179",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 247,
"line_start": 48,
"symbol": "BigCompute [part 1/5]"
}
],
"text": "func BigCompute(data []int) int {\n\tv0 := 0\n\tif 0 < len(data) {\n\t\tv0 = data[0]\n\t}\n\tv1 := 0\n\tif 1 < len(data) {\n\t\tv1 = data[1]\n\t}\n\tv2 := 0\n\tif 2 < len(data) {\n\t\tv2 = data[2]\n\t}\n\tv3 := 0\n\tif 3 < len(data) {\n\t\tv3 = data[3]\n\t}\n\tv4 := 0\n\tif 4 < len(data) {\n\t\tv4 = data[4]\n\t}\n\tv5 := 0\n\tif 5 < len(data) {\n\t\tv5 = data[5]\n\t}\n\tv6 := 0\n\tif 6 < len(data) {\n\t\tv6 = data[6]\n\t}\n\tv7 := 0\n\tif 7 < len(data) {\n\t\tv7 = data[7]\n\t}\n\tv8 := 0\n\tif 8 < len(data) {\n\t\tv8 = data[8]\n\t}\n\tv9 := 0\n\tif 9 < len(data) {\n\t\tv9 = data[9]\n\t}\n\tv10 := 0\n\tif 10 < len(data) {\n\t\tv10 = data[10]\n\t}\n\tv11 := 0\n\tif 11 < len(data) {\n\t\tv11 = data[11]\n\t}\n\tv12 := 0\n\tif 12 < len(data) {\n\t\tv12 = data[12]\n\t}\n\tv13 := 0\n\tif 13 < len(data) {\n\t\tv13 = data[13]\n\t}\n\tv14 := 0\n\tif 14 < len(data) {\n\t\tv14 = data[14]\n\t}\n\tv15 := 0\n\tif 15 < len(data) {\n\t\tv15 = data[15]\n\t}\n\tv16 := 0\n\tif 16 < len(data) {\n\t\tv16 = data[16]\n\t}\n\tv17 := 0\n\tif 17 < len(data) {\n\t\tv17 = data[17]\n\t}\n\tv18 := 0\n\tif 18 < len(data) {\n\t\tv18 = data[18]\n\t}\n\tv19 := 0\n\tif 19 < len(data) {\n\t\tv19 = data[19]\n\t}\n\tv20 := 0\n\tif 20 < len(data) {\n\t\tv20 = data[20]\n\t}\n\tv21 := 0\n\tif 21 < len(data) {\n\t\tv21 = data[21]\n\t}\n\tv22 := 0\n\tif 22 < len(data) {\n\t\tv22 = data[22]\n\t}\n\tv23 := 0\n\tif 23 < len(data) {\n\t\tv23 = data[23]\n\t}\n\tv24 := 0\n\tif 24 < len(data) {\n\t\tv24 = data[24]\n\t}\n\tv25 := 0\n\tif 25 < len(data) {\n\t\tv25 = data[25]\n\t}\n\tv26 := 0\n\tif 26 < len(data) {\n\t\tv26 = data[26]\n\t}\n\tv27 := 0\n\tif 27 < len(data) {\n\t\tv27 = data[27]\n\t}\n\tv28 := 0\n\tif 28 < len(data) {\n\t\tv28 = data[28]\n\t}\n\tv29 := 0\n\tif 29 < len(data) {\n\t\tv29 = data[29]\n\t}\n\tv30 := 0\n\tif 30 < len(data) {\n\t\tv30 = data[30]\n\t}\n\tv31 := 0\n\tif 31 < len(data) {\n\t\tv31 = data[31]\n\t}\n\tv32 := 0\n\tif 32 < len(data) {\n\t\tv32 = data[32]\n\t}\n\tv33 := 
0\n\tif 33 < len(data) {\n\t\tv33 = data[33]\n\t}\n\tv34 := 0\n\tif 34 < len(data) {\n\t\tv34 = data[34]\n\t}\n\tv35 := 0\n\tif 35 < len(data) {\n\t\tv35 = data[35]\n\t}\n\tv36 := 0\n\tif 36 < len(data) {\n\t\tv36 = data[36]\n\t}\n\tv37 := 0\n\tif 37 < len(data) {\n\t\tv37 = data[37]\n\t}\n\tv38 := 0\n\tif 38 < len(data) {\n\t\tv38 = data[38]\n\t}\n\tv39 := 0\n\tif 39 < len(data) {\n\t\tv39 = data[39]\n\t}\n\tv40 := 0\n\tif 40 < len(data) {\n\t\tv40 = data[40]\n\t}\n\tv41 := 0\n\tif 41 < len(data) {\n\t\tv41 = data[41]\n\t}\n\tv42 := 0\n\tif 42 < len(data) {\n\t\tv42 = data[42]\n\t}\n\tv43 := 0\n\tif 43 < len(data) {\n\t\tv43 = data[43]\n\t}\n\tv44 := 0\n\tif 44 < len(data) {\n\t\tv44 = data[44]\n\t}\n\tv45 := 0\n\tif 45 < len(data) {\n\t\tv45 = data[45]\n\t}\n\tv46 := 0\n\tif 46 < len(data) {\n\t\tv46 = data[46]\n\t}\n\tv47 := 0\n\tif 47 < len(data) {\n\t\tv47 = data[47]\n\t}\n\tv48 := 0\n\tif 48 < len(data) {\n\t\tv48 = data[48]\n\t}\n\tv49 := 0\n\tif 49 < len(data) {\n\t\tv49 = data[49]",
"token_estimate": 847
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "3f44ba43c9415652e2705bb667776e76",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 447,
"line_start": 248,
"symbol": "BigCompute [part 2/5]"
}
],
"text": "\t}\n\tv50 := 0\n\tif 50 < len(data) {\n\t\tv50 = data[50]\n\t}\n\tv51 := 0\n\tif 51 < len(data) {\n\t\tv51 = data[51]\n\t}\n\tv52 := 0\n\tif 52 < len(data) {\n\t\tv52 = data[52]\n\t}\n\tv53 := 0\n\tif 53 < len(data) {\n\t\tv53 = data[53]\n\t}\n\tv54 := 0\n\tif 54 < len(data) {\n\t\tv54 = data[54]\n\t}\n\tv55 := 0\n\tif 55 < len(data) {\n\t\tv55 = data[55]\n\t}\n\tv56 := 0\n\tif 56 < len(data) {\n\t\tv56 = data[56]\n\t}\n\tv57 := 0\n\tif 57 < len(data) {\n\t\tv57 = data[57]\n\t}\n\tv58 := 0\n\tif 58 < len(data) {\n\t\tv58 = data[58]\n\t}\n\tv59 := 0\n\tif 59 < len(data) {\n\t\tv59 = data[59]\n\t}\n\tv60 := 0\n\tif 60 < len(data) {\n\t\tv60 = data[60]\n\t}\n\tv61 := 0\n\tif 61 < len(data) {\n\t\tv61 = data[61]\n\t}\n\tv62 := 0\n\tif 62 < len(data) {\n\t\tv62 = data[62]\n\t}\n\tv63 := 0\n\tif 63 < len(data) {\n\t\tv63 = data[63]\n\t}\n\tv64 := 0\n\tif 64 < len(data) {\n\t\tv64 = data[64]\n\t}\n\tv65 := 0\n\tif 65 < len(data) {\n\t\tv65 = data[65]\n\t}\n\tv66 := 0\n\tif 66 < len(data) {\n\t\tv66 = data[66]\n\t}\n\tv67 := 0\n\tif 67 < len(data) {\n\t\tv67 = data[67]\n\t}\n\tv68 := 0\n\tif 68 < len(data) {\n\t\tv68 = data[68]\n\t}\n\tv69 := 0\n\tif 69 < len(data) {\n\t\tv69 = data[69]\n\t}\n\tv70 := 0\n\tif 70 < len(data) {\n\t\tv70 = data[70]\n\t}\n\tv71 := 0\n\tif 71 < len(data) {\n\t\tv71 = data[71]\n\t}\n\tv72 := 0\n\tif 72 < len(data) {\n\t\tv72 = data[72]\n\t}\n\tv73 := 0\n\tif 73 < len(data) {\n\t\tv73 = data[73]\n\t}\n\tv74 := 0\n\tif 74 < len(data) {\n\t\tv74 = data[74]\n\t}\n\tv75 := 0\n\tif 75 < len(data) {\n\t\tv75 = data[75]\n\t}\n\tv76 := 0\n\tif 76 < len(data) {\n\t\tv76 = data[76]\n\t}\n\tv77 := 0\n\tif 77 < len(data) {\n\t\tv77 = data[77]\n\t}\n\tv78 := 0\n\tif 78 < len(data) {\n\t\tv78 = data[78]\n\t}\n\tv79 := 0\n\tif 79 < len(data) {\n\t\tv79 = data[79]\n\t}\n\tv80 := 0\n\tif 80 < len(data) {\n\t\tv80 = data[80]\n\t}\n\tv81 := 0\n\tif 81 < len(data) {\n\t\tv81 = data[81]\n\t}\n\tv82 := 0\n\tif 82 < len(data) {\n\t\tv82 = data[82]\n\t}\n\tv83 
:= 0\n\tif 83 < len(data) {\n\t\tv83 = data[83]\n\t}\n\tv84 := 0\n\tif 84 < len(data) {\n\t\tv84 = data[84]\n\t}\n\tv85 := 0\n\tif 85 < len(data) {\n\t\tv85 = data[85]\n\t}\n\tv86 := 0\n\tif 86 < len(data) {\n\t\tv86 = data[86]\n\t}\n\tv87 := 0\n\tif 87 < len(data) {\n\t\tv87 = data[87]\n\t}\n\tv88 := 0\n\tif 88 < len(data) {\n\t\tv88 = data[88]\n\t}\n\tv89 := 0\n\tif 89 < len(data) {\n\t\tv89 = data[89]\n\t}\n\tv90 := 0\n\tif 90 < len(data) {\n\t\tv90 = data[90]\n\t}\n\tv91 := 0\n\tif 91 < len(data) {\n\t\tv91 = data[91]\n\t}\n\tv92 := 0\n\tif 92 < len(data) {\n\t\tv92 = data[92]\n\t}\n\tv93 := 0\n\tif 93 < len(data) {\n\t\tv93 = data[93]\n\t}\n\tv94 := 0\n\tif 94 < len(data) {\n\t\tv94 = data[94]\n\t}\n\tv95 := 0\n\tif 95 < len(data) {\n\t\tv95 = data[95]\n\t}\n\tv96 := 0\n\tif 96 < len(data) {\n\t\tv96 = data[96]\n\t}\n\tv97 := 0\n\tif 97 < len(data) {\n\t\tv97 = data[97]\n\t}\n\tv98 := 0\n\tif 98 < len(data) {\n\t\tv98 = data[98]\n\t}\n\tv99 := 0\n\tif 99 < len(data) {\n\t\tv99 = data[99]",
"token_estimate": 850
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "e4763e10f059d97f40c2932761b56c3e",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 647,
"line_start": 448,
"symbol": "BigCompute [part 3/5]"
}
],
"text": "\t}\n\tv100 := 0\n\tif 100 < len(data) {\n\t\tv100 = data[100]\n\t}\n\tv101 := 0\n\tif 101 < len(data) {\n\t\tv101 = data[101]\n\t}\n\tv102 := 0\n\tif 102 < len(data) {\n\t\tv102 = data[102]\n\t}\n\tv103 := 0\n\tif 103 < len(data) {\n\t\tv103 = data[103]\n\t}\n\tv104 := 0\n\tif 104 < len(data) {\n\t\tv104 = data[104]\n\t}\n\tv105 := 0\n\tif 105 < len(data) {\n\t\tv105 = data[105]\n\t}\n\tv106 := 0\n\tif 106 < len(data) {\n\t\tv106 = data[106]\n\t}\n\tv107 := 0\n\tif 107 < len(data) {\n\t\tv107 = data[107]\n\t}\n\tv108 := 0\n\tif 108 < len(data) {\n\t\tv108 = data[108]\n\t}\n\tv109 := 0\n\tif 109 < len(data) {\n\t\tv109 = data[109]\n\t}\n\tv110 := 0\n\tif 110 < len(data) {\n\t\tv110 = data[110]\n\t}\n\tv111 := 0\n\tif 111 < len(data) {\n\t\tv111 = data[111]\n\t}\n\tv112 := 0\n\tif 112 < len(data) {\n\t\tv112 = data[112]\n\t}\n\tv113 := 0\n\tif 113 < len(data) {\n\t\tv113 = data[113]\n\t}\n\tv114 := 0\n\tif 114 < len(data) {\n\t\tv114 = data[114]\n\t}\n\tv115 := 0\n\tif 115 < len(data) {\n\t\tv115 = data[115]\n\t}\n\tv116 := 0\n\tif 116 < len(data) {\n\t\tv116 = data[116]\n\t}\n\tv117 := 0\n\tif 117 < len(data) {\n\t\tv117 = data[117]\n\t}\n\tv118 := 0\n\tif 118 < len(data) {\n\t\tv118 = data[118]\n\t}\n\tv119 := 0\n\tif 119 < len(data) {\n\t\tv119 = data[119]\n\t}\n\tv120 := 0\n\tif 120 < len(data) {\n\t\tv120 = data[120]\n\t}\n\tv121 := 0\n\tif 121 < len(data) {\n\t\tv121 = data[121]\n\t}\n\tv122 := 0\n\tif 122 < len(data) {\n\t\tv122 = data[122]\n\t}\n\tv123 := 0\n\tif 123 < len(data) {\n\t\tv123 = data[123]\n\t}\n\tv124 := 0\n\tif 124 < len(data) {\n\t\tv124 = data[124]\n\t}\n\tv125 := 0\n\tif 125 < len(data) {\n\t\tv125 = data[125]\n\t}\n\tv126 := 0\n\tif 126 < len(data) {\n\t\tv126 = data[126]\n\t}\n\tv127 := 0\n\tif 127 < len(data) {\n\t\tv127 = data[127]\n\t}\n\tv128 := 0\n\tif 128 < len(data) {\n\t\tv128 = data[128]\n\t}\n\tv129 := 0\n\tif 129 < len(data) {\n\t\tv129 = data[129]\n\t}\n\tv130 := 0\n\tif 130 < len(data) {\n\t\tv130 = 
data[130]\n\t}\n\tv131 := 0\n\tif 131 < len(data) {\n\t\tv131 = data[131]\n\t}\n\tv132 := 0\n\tif 132 < len(data) {\n\t\tv132 = data[132]\n\t}\n\tv133 := 0\n\tif 133 < len(data) {\n\t\tv133 = data[133]\n\t}\n\tv134 := 0\n\tif 134 < len(data) {\n\t\tv134 = data[134]\n\t}\n\tv135 := 0\n\tif 135 < len(data) {\n\t\tv135 = data[135]\n\t}\n\tv136 := 0\n\tif 136 < len(data) {\n\t\tv136 = data[136]\n\t}\n\tv137 := 0\n\tif 137 < len(data) {\n\t\tv137 = data[137]\n\t}\n\tv138 := 0\n\tif 138 < len(data) {\n\t\tv138 = data[138]\n\t}\n\tv139 := 0\n\tif 139 < len(data) {\n\t\tv139 = data[139]\n\t}\n\tv140 := 0\n\tif 140 < len(data) {\n\t\tv140 = data[140]\n\t}\n\tv141 := 0\n\tif 141 < len(data) {\n\t\tv141 = data[141]\n\t}\n\tv142 := 0\n\tif 142 < len(data) {\n\t\tv142 = data[142]\n\t}\n\tv143 := 0\n\tif 143 < len(data) {\n\t\tv143 = data[143]\n\t}\n\tv144 := 0\n\tif 144 < len(data) {\n\t\tv144 = data[144]\n\t}\n\tv145 := 0\n\tif 145 < len(data) {\n\t\tv145 = data[145]\n\t}\n\tv146 := 0\n\tif 146 < len(data) {\n\t\tv146 = data[146]\n\t}\n\tv147 := 0\n\tif 147 < len(data) {\n\t\tv147 = data[147]\n\t}\n\tv148 := 0\n\tif 148 < len(data) {\n\t\tv148 = data[148]\n\t}\n\tv149 := 0\n\tif 149 < len(data) {\n\t\tv149 = data[149]",
"token_estimate": 917
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "24176c911d0bacf9a29fa7f8251f5036",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 847,
"line_start": 648,
"symbol": "BigCompute [part 4/5]"
}
],
"text": "\t}\n\tv150 := 0\n\tif 150 < len(data) {\n\t\tv150 = data[150]\n\t}\n\tv151 := 0\n\tif 151 < len(data) {\n\t\tv151 = data[151]\n\t}\n\tv152 := 0\n\tif 152 < len(data) {\n\t\tv152 = data[152]\n\t}\n\tv153 := 0\n\tif 153 < len(data) {\n\t\tv153 = data[153]\n\t}\n\tv154 := 0\n\tif 154 < len(data) {\n\t\tv154 = data[154]\n\t}\n\tv155 := 0\n\tif 155 < len(data) {\n\t\tv155 = data[155]\n\t}\n\tv156 := 0\n\tif 156 < len(data) {\n\t\tv156 = data[156]\n\t}\n\tv157 := 0\n\tif 157 < len(data) {\n\t\tv157 = data[157]\n\t}\n\tv158 := 0\n\tif 158 < len(data) {\n\t\tv158 = data[158]\n\t}\n\tv159 := 0\n\tif 159 < len(data) {\n\t\tv159 = data[159]\n\t}\n\tv160 := 0\n\tif 160 < len(data) {\n\t\tv160 = data[160]\n\t}\n\tv161 := 0\n\tif 161 < len(data) {\n\t\tv161 = data[161]\n\t}\n\tv162 := 0\n\tif 162 < len(data) {\n\t\tv162 = data[162]\n\t}\n\tv163 := 0\n\tif 163 < len(data) {\n\t\tv163 = data[163]\n\t}\n\tv164 := 0\n\tif 164 < len(data) {\n\t\tv164 = data[164]\n\t}\n\tv165 := 0\n\tif 165 < len(data) {\n\t\tv165 = data[165]\n\t}\n\tv166 := 0\n\tif 166 < len(data) {\n\t\tv166 = data[166]\n\t}\n\tv167 := 0\n\tif 167 < len(data) {\n\t\tv167 = data[167]\n\t}\n\tv168 := 0\n\tif 168 < len(data) {\n\t\tv168 = data[168]\n\t}\n\tv169 := 0\n\tif 169 < len(data) {\n\t\tv169 = data[169]\n\t}\n\tv170 := 0\n\tif 170 < len(data) {\n\t\tv170 = data[170]\n\t}\n\tv171 := 0\n\tif 171 < len(data) {\n\t\tv171 = data[171]\n\t}\n\tv172 := 0\n\tif 172 < len(data) {\n\t\tv172 = data[172]\n\t}\n\tv173 := 0\n\tif 173 < len(data) {\n\t\tv173 = data[173]\n\t}\n\tv174 := 0\n\tif 174 < len(data) {\n\t\tv174 = data[174]\n\t}\n\tv175 := 0\n\tif 175 < len(data) {\n\t\tv175 = data[175]\n\t}\n\tv176 := 0\n\tif 176 < len(data) {\n\t\tv176 = data[176]\n\t}\n\tv177 := 0\n\tif 177 < len(data) {\n\t\tv177 = data[177]\n\t}\n\tv178 := 0\n\tif 178 < len(data) {\n\t\tv178 = data[178]\n\t}\n\tv179 := 0\n\tif 179 < len(data) {\n\t\tv179 = data[179]\n\t}\n\tv180 := 0\n\tif 180 < len(data) {\n\t\tv180 = 
data[180]\n\t}\n\tv181 := 0\n\tif 181 < len(data) {\n\t\tv181 = data[181]\n\t}\n\tv182 := 0\n\tif 182 < len(data) {\n\t\tv182 = data[182]\n\t}\n\tv183 := 0\n\tif 183 < len(data) {\n\t\tv183 = data[183]\n\t}\n\tv184 := 0\n\tif 184 < len(data) {\n\t\tv184 = data[184]\n\t}\n\tv185 := 0\n\tif 185 < len(data) {\n\t\tv185 = data[185]\n\t}\n\tv186 := 0\n\tif 186 < len(data) {\n\t\tv186 = data[186]\n\t}\n\tv187 := 0\n\tif 187 < len(data) {\n\t\tv187 = data[187]\n\t}\n\tv188 := 0\n\tif 188 < len(data) {\n\t\tv188 = data[188]\n\t}\n\tv189 := 0\n\tif 189 < len(data) {\n\t\tv189 = data[189]\n\t}\n\tv190 := 0\n\tif 190 < len(data) {\n\t\tv190 = data[190]\n\t}\n\tv191 := 0\n\tif 191 < len(data) {\n\t\tv191 = data[191]\n\t}\n\tv192 := 0\n\tif 192 < len(data) {\n\t\tv192 = data[192]\n\t}\n\tv193 := 0\n\tif 193 < len(data) {\n\t\tv193 = data[193]\n\t}\n\tv194 := 0\n\tif 194 < len(data) {\n\t\tv194 = data[194]\n\t}\n\tv195 := 0\n\tif 195 < len(data) {\n\t\tv195 = data[195]\n\t}\n\tv196 := 0\n\tif 196 < len(data) {\n\t\tv196 = data[196]\n\t}\n\tv197 := 0\n\tif 197 < len(data) {\n\t\tv197 = data[197]\n\t}\n\tv198 := 0\n\tif 198 < len(data) {\n\t\tv198 = data[198]\n\t}\n\tv199 := 0\n\tif 199 < len(data) {\n\t\tv199 = data[199]",
"token_estimate": 917
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "438127626378632c03780d10603de32c",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 890,
"line_start": 848,
"symbol": "BigCompute [part 5/5]"
}
],
"text": "\t}\n\tv200 := 0\n\tif 200 < len(data) {\n\t\tv200 = data[200]\n\t}\n\tv201 := 0\n\tif 201 < len(data) {\n\t\tv201 = data[201]\n\t}\n\tv202 := 0\n\tif 202 < len(data) {\n\t\tv202 = data[202]\n\t}\n\tv203 := 0\n\tif 203 < len(data) {\n\t\tv203 = data[203]\n\t}\n\tv204 := 0\n\tif 204 < len(data) {\n\t\tv204 = data[204]\n\t}\n\tv205 := 0\n\tif 205 < len(data) {\n\t\tv205 = data[205]\n\t}\n\tv206 := 0\n\tif 206 < len(data) {\n\t\tv206 = data[206]\n\t}\n\tv207 := 0\n\tif 207 < len(data) {\n\t\tv207 = data[207]\n\t}\n\tv208 := 0\n\tif 208 < len(data) {\n\t\tv208 = data[208]\n\t}\n\tv209 := 0\n\tif 209 < len(data) {\n\t\tv209 = data[209]\n\t}\n\treturn len(data)\n}",
"token_estimate": 191
}
]
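
The `BigCompute [part i/5]` entries in the snapshot above cover lines 48-247, 248-447, 448-647, 648-847, and 848-890, that is, consecutive non-overlapping parts of at most 200 lines. The sketch below reproduces those ranges and labels under the assumption that the split is purely line-based with a 200-line limit. That limit is inferred from the snapshot, and the actual chunker may instead split on a token budget. `split_symbol` and `MAX_PART_LINES` are hypothetical names.

```rust
/// Hypothetical limit inferred from the snapshot: every non-final part spans 200 lines.
const MAX_PART_LINES: u32 = 200;

/// Splits the inclusive range [start, end] (start <= end) into consecutive,
/// non-overlapping parts and labels each one "<symbol> [part i/n]".
/// A symbol that fits within the limit keeps its plain name.
fn split_symbol(symbol: &str, start: u32, end: u32) -> Vec<(String, u32, u32)> {
    let total = end - start + 1;
    if total <= MAX_PART_LINES {
        return vec![(symbol.to_string(), start, end)];
    }
    // Ceiling division: number of parts needed to cover `total` lines.
    let parts = (total + MAX_PART_LINES - 1) / MAX_PART_LINES;
    (0..parts)
        .map(|i| {
            let s = start + i * MAX_PART_LINES;
            let e = (s + MAX_PART_LINES - 1).min(end);
            (format!("{symbol} [part {}/{}]", i + 1, parts), s, e)
        })
        .collect()
}
```

For `split_symbol("BigCompute", 48, 890)` the range holds 843 lines, so five parts are produced and the last one is the 43-line remainder.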

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@@ -0,0 +1,33 @@
#include <stdio.h>
#include <stdlib.h>

#define MAX_BUF 4096

typedef enum {
    OK = 0,
    ERR_PARSE,
    ERR_IO,
} status_t;

typedef struct {
    int id;
    char name[64];
    status_t status;
} record_t;

static int counter = 0;

int parse_record(const char *line, record_t *out) {
    if (line == NULL || out == NULL) return ERR_PARSE;
    return OK;
}

void print_record(const record_t *r) {
    printf("[%d] %s (status=%d)\n", r->id, r->name, r->status);
}

int main(void) {
    record_t r = { .id = 1, .name = "foo", .status = OK };
    print_record(&r);
    return 0;
}

@@ -0,0 +1,40 @@
#include <string>
#include <vector>
namespace kebab {
namespace chunk {
class MdHeadingV1Chunker {
public:
MdHeadingV1Chunker() = default;
~MdHeadingV1Chunker() = default;
std::string chunk_doc(const std::string& doc) {
return doc;
}
int operator()(int x) const {
return x * 2;
}
private:
int counter_ = 0;
};
template <typename T>
T identity(T value) {
return value;
}
} // namespace chunk
void global_helper() {
// free function in kebab namespace
}
} // namespace kebab
int main() {
kebab::chunk::MdHeadingV1Chunker c;
return 0;
}

@@ -0,0 +1,5 @@
FROM rust:1.94-slim AS builder
WORKDIR /app
COPY . .
RUN cargo build --release
CMD ["/app/target/release/kebab"]

@@ -0,0 +1,7 @@
[package]
name = "demo"
version = "0.1.0"
edition = "2021"

[dependencies]
serde = "1"

@@ -0,0 +1,5 @@
module example.com/demo

go 1.22

require github.com/spf13/cobra v1.8.0

@@ -0,0 +1,34 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: example/api:1.2.3
---
apiVersion: v1
kind: Service
metadata:
  name: api-server
  namespace: prod
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
---
# Non-k8s document — apiVersion missing
kind: ClusterIP
foo: bar

View File

@@ -0,0 +1,200 @@
line 001
line 002
line 003
line 004
line 005
line 006
line 007
line 008
line 009
line 010
line 011
line 012
line 013
line 014
line 015
line 016
line 017
line 018
line 019
line 020
line 021
line 022
line 023
line 024
line 025
line 026
line 027
line 028
line 029
line 030
line 031
line 032
line 033
line 034
line 035
line 036
line 037
line 038
line 039
line 040
line 041
line 042
line 043
line 044
line 045
line 046
line 047
line 048
line 049
line 050
line 051
line 052
line 053
line 054
line 055
line 056
line 057
line 058
line 059
line 060
line 061
line 062
line 063
line 064
line 065
line 066
line 067
line 068
line 069
line 070
line 071
line 072
line 073
line 074
line 075
line 076
line 077
line 078
line 079
line 080
line 081
line 082
line 083
line 084
line 085
line 086
line 087
line 088
line 089
line 090
line 091
line 092
line 093
line 094
line 095
line 096
line 097
line 098
line 099
line 100
line 101
line 102
line 103
line 104
line 105
line 106
line 107
line 108
line 109
line 110
line 111
line 112
line 113
line 114
line 115
line 116
line 117
line 118
line 119
line 120
line 121
line 122
line 123
line 124
line 125
line 126
line 127
line 128
line 129
line 130
line 131
line 132
line 133
line 134
line 135
line 136
line 137
line 138
line 139
line 140
line 141
line 142
line 143
line 144
line 145
line 146
line 147
line 148
line 149
line 150
line 151
line 152
line 153
line 154
line 155
line 156
line 157
line 158
line 159
line 160
line 161
line 162
line 163
line 164
line 165
line 166
line 167
line 168
line 169
line 170
line 171
line 172
line 173
line 174
line 175
line 176
line 177
line 178
line 179
line 180
line 181
line 182
line 183
line 184
line 185
line 186
line 187
line 188
line 189
line 190
line 191
line 192
line 193
line 194
line 195
line 196
line 197
line 198
line 199
line 200

View File

@@ -0,0 +1,7 @@
{
"name": "demo",
"version": "0.1.0",
"dependencies": {
"react": "^18.0.0"
}
}

View File

@@ -0,0 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>
<groupId>com.demo</groupId>
<artifactId>demo</artifactId>
<version>0.1.0</version>
</project>

View File

@@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -euo pipefail
# First paragraph: env setup
export KEBAB_HOME="${KEBAB_HOME:-$HOME/.local/share/kebab}"
mkdir -p "$KEBAB_HOME"
cd "$KEBAB_HOME"
# Second paragraph: ingest
echo "ingesting workspace..."
kebab ingest --config /etc/kebab/config.toml
# Third paragraph: report
echo "done"
kebab schema --json | jq '.stats'

View File

@@ -0,0 +1,288 @@
//! Behavioural tests for `K8sManifestResourceV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing the raw YAML text into a single `Block::Code`, mirroring the
//! pattern used in `code_rust_ast_snapshot.rs`.
use std::path::PathBuf;
use kebab_chunk::K8sManifestResourceV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `yaml_text`.
fn yaml_doc(yaml_text: &str) -> CanonicalDocument {
let wp = WorkspacePath("manifests/deploy.yaml".into());
let aid = AssetId("c".repeat(64));
let pv = ParserVersion("code-yaml-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = yaml_text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some("yaml".into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("yaml".into()),
code: yaml_text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "deploy.yaml".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("yaml".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("k8s-manifest-resource-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// Three YAML documents: 2 valid k8s resources + 1 non-k8s (no apiVersion).
/// The chunker must emit exactly 2 chunks with the correct symbols and lang.
#[test]
fn k8s_multi_doc_emits_one_chunk_per_resource() {
let fixture_path = fixtures_dir().join("sample_k8s.yaml");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = yaml_doc(&text);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
2,
"expected 2 k8s chunks, got {}: {chunks:#?}",
chunks.len()
);
let symbols: Vec<&str> = chunks
.iter()
.map(|c| {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
symbol.as_deref().expect("symbol must be Some for k8s chunks")
}
other => panic!("expected Code span, got {other:?}"),
}
})
.collect();
assert_eq!(
symbols,
vec!["Deployment/prod/api-server", "Service/prod/api-server"],
"symbols mismatch: {symbols:?}"
);
// Verify lang = "yaml" on every chunk.
for chunk in &chunks {
match &chunk.source_spans[0] {
SourceSpan::Code { lang, .. } => {
assert_eq!(lang.as_deref(), Some("yaml"), "lang must be 'yaml'");
}
other => panic!("expected Code span, got {other:?}"),
}
}
// Verify chunker_version label.
for chunk in &chunks {
assert_eq!(chunk.chunker_version.0, "k8s-manifest-resource-v1");
}
// Every chunk from a multi-resource file must have a distinct chunk_id.
// Without the fix, all non-oversize resources get split_key=None which
// collapses to the same id_hash (= base_policy_hash) → UNIQUE constraint
// violation on the second resource.
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"every k8s resource chunk must have a distinct chunk_id (multi-resource collision regression)"
);
}
/// A YAML document with a structural error (an unclosed flow sequence)
/// must cause the chunker to return 0 chunks for the entire file.
#[test]
fn k8s_invalid_yaml_emits_zero_chunks() {
// serde_yaml 0.9 is lenient about duplicate keys (last wins), so use a
// genuine YAML structural error (unclosed flow sequence) to force a parse
// failure.
let actually_bad = "apiVersion: v1\nkind: Service\nfoo: [\nbar\n";
let doc = yaml_doc(actually_bad);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk should not error — return Ok(vec![]) for invalid yaml");
assert_eq!(
chunks.len(),
0,
"invalid YAML must yield 0 chunks, got {}: {chunks:#?}",
chunks.len()
);
}
/// A cluster-scoped resource (no `metadata.namespace`) must produce a symbol
/// of the form `<Kind>/<name>` (two components, no namespace segment).
#[test]
fn k8s_cluster_scoped_resource_symbol() {
let yaml = "\
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-admin
rules:
  - apiGroups: [\"*\"]
    resources: [\"*\"]
    verbs: [\"*\"]
";
let doc = yaml_doc(yaml);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk for cluster-scoped resource, got {}: {chunks:#?}",
chunks.len()
);
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(
symbol.as_deref(),
Some("ClusterRole/cluster-admin"),
"cluster-scoped symbol must be <Kind>/<name>"
);
assert_eq!(lang.as_deref(), Some("yaml"));
}
other => panic!("expected Code span, got {other:?}"),
}
}
/// 200+ line resource exercises `tier2_shared::push_chunks_with_oversize`'s
/// line-window split branch. All chunks must share the same symbol
/// (`<Kind>/<ns>/<name>`); their line ranges must form a contiguous
/// partition; chunk_ids must all differ (the `#L{k}` suffix on `id_for_chunk`
/// ensures uniqueness across windows). The p10-2 spec's risks section explicitly
/// flags "huge ConfigMap" resources; this test covers that path.
#[test]
fn k8s_oversize_splits_into_line_windows_sharing_symbol() {
// ConfigMap with 250 data keys → ~256 total lines, > AST_CHUNK_MAX_LINES (200).
let mut yaml = String::from(
"apiVersion: v1\nkind: ConfigMap\nmetadata:\n name: big\n namespace: prod\ndata:\n",
);
for i in 0..250 {
yaml.push_str(&format!(" key{i}: value{i}\n"));
}
let doc = yaml_doc(&yaml);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for oversize resource, got {}",
chunks.len()
);
// Every chunk must share the same symbol + lang.
let expected_symbol = "ConfigMap/prod/big";
for (i, c) in chunks.iter().enumerate() {
match &c.source_spans[0] {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(
symbol.as_deref(),
Some(expected_symbol),
"chunk[{i}] symbol must equal `{expected_symbol}`"
);
assert_eq!(lang.as_deref(), Some("yaml"));
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// chunk_ids must all be distinct (oversize fallback's #L{k} suffix).
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"oversize chunks must have distinct chunk_ids (the #L{{k}} suffix should disambiguate)"
);
// Line ranges must form a contiguous partition: chunk[i].line_end + 1 == chunk[i+1].line_start.
let ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code { line_start, line_end, .. } => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
for w in ranges.windows(2) {
let (_, prev_end) = w[0];
let (next_start, _) = w[1];
assert_eq!(
prev_end + 1,
next_start,
"line ranges must be contiguous: {} → {} (got gap or overlap)",
prev_end,
next_start
);
}
}
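
The collision these tests guard against can be illustrated with a minimal sketch. It assumes the chunk id is a hash over a per-document base plus an optional split key; `sketch_chunk_id` and its inputs are hypothetical stand-ins, not the real `id_for_chunk`, and the hash function is only the standard library's default hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the chunk id derivation. Only the role of the
/// optional split key is modelled; the real inputs and hash differ.
fn sketch_chunk_id(doc_id: &str, policy_hash: &str, split_key: Option<u32>) -> String {
    let mut hasher = DefaultHasher::new();
    doc_id.hash(&mut hasher);
    policy_hash.hash(&mut hasher);
    // With `None`, nothing resource-specific enters the hash, so every
    // resource taken from the same document maps to the same id.
    if let Some(k) = split_key {
        format!("#L{k}").hash(&mut hasher);
    }
    format!("{:016x}", hasher.finish())
}
```

Calling the sketch twice with `None` for the same document yields one id, which is the shape of the `UNIQUE constraint failed: chunks.chunk_id` failure. Passing each resource's starting line (for example `Some(1)` and `Some(20)`, illustrative values) yields distinct ids.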

View File

@@ -0,0 +1,267 @@
//! Behavioural tests for `ManifestFileV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing the raw manifest text into a single `Block::Code`, mirroring the
//! pattern used in `dockerfile_file_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::ManifestFileV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing manifest text.
fn manifest_doc(lang: &str, manifest_text: &str) -> CanonicalDocument {
let wp = WorkspacePath(format!("build/{}", manifest_filename(lang)));
let aid = AssetId("m".repeat(64));
let pv = ParserVersion("code-manifest-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = manifest_text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some(lang.into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some(lang.into()),
code: manifest_text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: format!("Manifest ({})", lang),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some(lang.into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn manifest_filename(lang: &str) -> &'static str {
match lang {
"toml" => "Cargo.toml",
"json" => "package.json",
"xml" => "pom.xml",
"go-mod" => "go.mod",
_ => "manifest",
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("manifest-file-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// A Cargo.toml fixture must emit exactly 1 chunk with the correct symbol and
/// lang, starting at line 1 (`line_end` is not asserted).
#[test]
fn cargo_toml_single_chunk_with_toml_lang() {
let fixture_path = fixtures_dir().join("sample_cargo.toml");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("toml", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end: _,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"symbol must be '<manifest>'"
);
assert_eq!(lang.as_deref(), Some("toml"), "lang must be 'toml'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
}
/// A package.json fixture must emit exactly 1 chunk with the correct symbol and
/// lang, starting at line 1 (`line_end` is not asserted).
#[test]
fn package_json_single_chunk_with_json_lang() {
let fixture_path = fixtures_dir().join("sample_package.json");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("json", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end: _,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"symbol must be '<manifest>'"
);
assert_eq!(lang.as_deref(), Some("json"), "lang must be 'json'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
}
/// A pom.xml fixture must emit exactly 1 chunk with the correct symbol and
/// lang, starting at line 1 (`line_end` is not asserted).
#[test]
fn pom_xml_single_chunk_with_xml_lang() {
let fixture_path = fixtures_dir().join("sample_pom.xml");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("xml", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end: _,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"symbol must be '<manifest>'"
);
assert_eq!(lang.as_deref(), Some("xml"), "lang must be 'xml'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
}
/// A go.mod fixture must emit exactly 1 chunk with the correct symbol and
/// lang, starting at line 1 (`line_end` is not asserted).
#[test]
fn go_mod_single_chunk_with_go_mod_lang() {
let fixture_path = fixtures_dir().join("sample_go.mod");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("go-mod", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end: _,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(
symbol.as_deref(),
Some("<manifest>"),
"symbol must be '<manifest>'"
);
assert_eq!(lang.as_deref(), Some("go-mod"), "lang must be 'go-mod'");
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
}

View File

@@ -275,6 +275,14 @@ enum Cmd {
#[arg(long, group = "reset_scope")]
config_only: bool,
/// Purge stored docs that are outside the current walker scope
/// (config narrowing / removed sub-directory). No filesystem paths
/// are removed — this is purely a store-level reconciliation.
/// Filesystem existence is NOT checked; anything the current walker
/// would not visit is considered an orphan and removed from the store.
#[arg(long, group = "reset_scope")]
orphans_only: bool,
/// Skip the interactive confirm. Required in non-interactive
/// contexts (CI, pipes).
#[arg(long)]
@@ -595,14 +603,20 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
println!("{}", serde_json::to_string(&wire::wire_ingest(&report))?);
} else {
let skipped_breakdown = kebab_app::render_skipped_breakdown(&report.skipped_by_extension);
let purged_suffix = if report.purged_deleted_files > 0 {
format!(" purged {}", report.purged_deleted_files)
} else {
String::new()
};
println!(
"scanned {} new {} updated {} skipped {}{} errors {} ({} ms)",
"scanned {} new {} updated {} skipped {}{} errors {}{} ({} ms)",
report.scanned,
report.new,
report.updated,
report.skipped,
skipped_breakdown,
report.errors,
purged_suffix,
report.duration_ms
);
}
@@ -1088,6 +1102,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
data_only: _,
vector_only,
config_only,
orphans_only,
yes,
} => {
use kebab_app::ResetScope;
@@ -1101,11 +1116,50 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
ResetScope::VectorOnly
} else if *config_only {
ResetScope::ConfigOnly
} else if *orphans_only {
ResetScope::OrphansOnly
} else {
ResetScope::DataOnly
};
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
if matches!(scope, ResetScope::OrphansOnly) {
// OrphansOnly: confirm UI shows orphan count + sample paths
// rather than on-disk directory sizes.
let orphan_paths = kebab_app::enumerate_orphans(&cfg)?;
if !*yes {
use std::io::IsTerminal;
if !std::io::stdin().is_terminal() {
anyhow::bail!(
"reset --orphans-only is destructive and stdin is non-interactive — pass --yes to proceed"
);
}
if !confirm_orphans_only(&orphan_paths)? {
if !cli.quiet {
eprintln!("aborted.");
}
return Ok(());
}
}
let report = kebab_app::reset::execute(scope, &cfg)?;
if cli.json {
println!("{}", serde_json::to_string(&wire::wire_reset(&report))?);
} else {
if report.orphans_purged > 0 {
println!("orphans purged: {}", report.orphans_purged);
for p in &report.purged_paths {
println!(" - {}", p.0);
}
} else {
println!("no orphaned docs found — store is already in sync with walker scope");
}
}
return Ok(());
}
let paths = kebab_app::reset::enumerate_paths(scope, &cfg);
let bytes = kebab_app::reset::estimate_size_bytes(&paths);
@@ -1444,6 +1498,46 @@ fn confirm_destructive(
Ok(matches!(s.as_str(), "y" | "yes"))
}
/// Confirm prompt for `--orphans-only`: shows the orphan count + a
/// sample of up to 5 paths so the user knows what will be purged before
/// committing. No filesystem paths are removed — only store records.
fn confirm_orphans_only(
orphan_paths: &[kebab_core::WorkspacePath],
) -> anyhow::Result<bool> {
use std::io::Write;
let n = orphan_paths.len();
let mut out = std::io::stderr().lock();
if n == 0 {
writeln!(out, "no orphaned docs found — nothing to purge.")?;
out.flush()?;
// Nothing to do; treat as confirmed so the caller can emit the
// "no orphans" report without prompting.
return Ok(true);
}
let sample: Vec<&str> = orphan_paths
.iter()
.take(5)
.map(|p| p.0.as_str())
.collect();
let sample_str = sample.join(", ");
let ellipsis = if n > 5 { ", …" } else { "" };
writeln!(
out,
"Purge {n} stored doc(s) outside the current walker scope? (no filesystem paths removed)"
)?;
writeln!(out, " sample: {sample_str}{ellipsis}")?;
write!(out, "[y/N] ")?;
out.flush()?;
let mut line = String::new();
std::io::stdin().read_line(&mut line)?;
let s = line.trim().to_ascii_lowercase();
Ok(matches!(s.as_str(), "y" | "yes"))
}
/// p9-fb-35: human-friendly plain output for `kebab fetch`.
fn render_fetch_plain(r: &kebab_core::FetchResult) {
println!("# {} ({})", r.doc_path.0, format_kind(r.kind));
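
The prompt logic in `confirm_orphans_only` can be separated from the terminal I/O, which makes it easier to check in isolation. The following is a minimal sketch under that assumption; `sample_line` and `is_confirmed` are hypothetical helper names that do not appear in the diff.

```rust
/// Sketch of the sample line: at most `max` paths, followed by an ellipsis
/// marker when more paths exist than are shown.
fn sample_line(paths: &[&str], max: usize) -> String {
    let shown: Vec<&str> = paths.iter().take(max).copied().collect();
    let ellipsis = if paths.len() > max { ", …" } else { "" };
    format!("{}{}", shown.join(", "), ellipsis)
}

/// Only an explicit "y" or "yes" confirms; case and surrounding whitespace
/// are ignored, and an empty answer counts as "no".
fn is_confirmed(answer: &str) -> bool {
    matches!(answer.trim().to_ascii_lowercase().as_str(), "y" | "yes")
}
```

With six orphan paths and `max = 5`, the sketch lists the first five and appends the ellipsis marker. Pressing Enter without typing anything is treated as a refusal, matching the `[y/N]` default.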

View File

@@ -260,6 +260,7 @@ mod tests {
skipped_generated: 0,
skipped_size_exceeded: 0,
skip_examples: SkipExamples::default(),
purged_deleted_files: 0,
items: None,
};
let v = wire_ingest(&r);
@@ -364,6 +365,8 @@ mod tests {
scope: kebab_app::ResetScope::DataOnly,
removed_paths: vec![std::path::PathBuf::from("/tmp/x")],
embedding_rows_truncated: 0,
orphans_purged: 0,
purged_paths: vec![],
};
let v = wire_reset(&r);
assert_eq!(schema_of(&v), Some("reset_report.v1"));

View File

@@ -47,6 +47,12 @@ pub struct IngestReport {
/// p10-1A-1: sample file paths per skip category (≤ 5 each).
#[serde(default)]
pub skip_examples: SkipExamples,
/// Dogfood: docs whose on-disk file was deleted since the last ingest
/// and were therefore removed from the store. Additive field — older
/// wire consumers that pre-date this field read it as 0 via
/// `#[serde(default)]`.
#[serde(default)]
pub purged_deleted_files: u32,
/// `None` ↔ wire `items: null` (`--summary-only`).
pub items: Option<Vec<IngestItem>>,
}
@@ -136,6 +142,7 @@ mod tests {
builtin_blacklist: vec!["node_modules/x.js".into()],
gitignore: vec![],
},
purged_deleted_files: 0,
items: None,
};
let v = serde_json::to_value(&r).unwrap();

View File

@@ -8,7 +8,7 @@ use serde_json::Value;
use crate::asset::{RawAsset, WorkspacePath};
use crate::chunk::Chunk;
use crate::document::{Block, CanonicalDocument};
use crate::ids::{ChunkId, DocumentId};
use crate::ids::{AssetId, ChunkId, DocumentId};
use crate::jobs::{JobFilter, JobId, JobKind, JobRow, JobStatus};
use crate::media::MediaType;
use crate::search::{DocFilter, DocSummary, SearchFilters, SearchHit, SearchQuery};
@@ -161,14 +161,51 @@ pub trait DocumentStore {
fn get_document(&self, id: &DocumentId) -> anyhow::Result<Option<CanonicalDocument>>;
fn get_chunk(&self, id: &ChunkId) -> anyhow::Result<Option<Chunk>>;
fn list_documents(&self, filter: &DocFilter) -> anyhow::Result<Vec<DocSummary>>;
/// Look up an asset row by its `asset_id` (PRIMARY KEY = blake3
/// content hash). Twin-file safe: asset_id is PK so there is
/// exactly one row per unique content hash, regardless of how many
/// `documents` rows share it. Use this instead of
/// `get_asset_by_workspace_path` when you already have a
/// `CanonicalDocument` (which carries `source_asset_id`).
fn get_asset(&self, id: &AssetId) -> anyhow::Result<Option<RawAsset>>;
/// p9-fb-23: look up an asset row by its workspace path. Used by
/// the incremental-ingest skip path to compare the freshly
/// computed blake3 checksum against what's already in SQLite. The
/// schema enforces a unique workspace_path per asset.
///
/// NOTE: for twin files (identical content at different paths),
/// `assets.workspace_path` is "last-registered path" — it
/// flip-flops on every ingest. Prefer `get_asset` (by asset_id)
/// when you have a `CanonicalDocument.source_asset_id`.
fn get_asset_by_workspace_path(
&self,
path: &WorkspacePath,
) -> anyhow::Result<Option<RawAsset>>;
/// Look up a document row by its workspace path. Used by the
/// document-centric skip path in `try_skip_unchanged` to avoid the
/// twin-file flip-flop that the asset-side lookup suffers from
/// (multiple files with identical content share one `assets` row
/// whose `workspace_path` is overwritten on every UPSERT, so
/// `get_asset_by_workspace_path` returns the wrong twin's path).
///
/// `documents.workspace_path` is UNIQUE (V001), so each twin has
/// its own stable document row regardless of the asset de-dup.
fn get_document_by_workspace_path(
&self,
path: &WorkspacePath,
) -> anyhow::Result<Option<CanonicalDocument>>;
/// Return every `workspace_path` stored in the `documents` table.
///
/// Used by the post-walker sweep in `kebab-app::ingest` to detect
/// documents whose source file has been deleted from the filesystem.
/// The set difference `(stored - scanned)` yields orphan candidates;
/// each candidate is then existence-checked on disk so that
/// out-of-scope files (config narrowing) are NOT purged — only truly
/// absent files trigger the purge.
fn all_workspace_paths(&self) -> anyhow::Result<Vec<WorkspacePath>>;
}
pub trait VectorStore {
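
The sweep described in the `all_workspace_paths` doc comment can be sketched as a set difference followed by an existence check. This is an illustration only: `deleted_candidates` is a hypothetical name, paths are plain strings rather than `WorkspacePath`, and the filesystem check is injected as a closure so the sketch does not touch the disk.

```rust
use std::collections::HashSet;

/// Returns stored paths that the walker did not visit AND that no longer
/// exist on disk. Paths that still exist but fall outside the walker scope
/// are kept, mirroring the "config narrowing is NOT purged" rule.
fn deleted_candidates<F>(stored: &[String], scanned: &HashSet<String>, exists: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    stored
        .iter()
        .filter(|p| !scanned.contains(p.as_str())) // stored - scanned
        .filter(|p| !exists(p.as_str())) // only truly absent files
        .cloned()
        .collect()
}
```

By contrast, the `--orphans-only` reset described in the CLI help skips the existence check, so under the same sketch it would correspond to the first filter alone.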

View File

@@ -19,6 +19,11 @@ tree-sitter-rust = { workspace = true }
tree-sitter-python = { workspace = true }
tree-sitter-typescript = { workspace = true }
tree-sitter-javascript = { workspace = true }
tree-sitter-go = { workspace = true }
tree-sitter-java = { workspace = true }
tree-sitter-kotlin-ng = { workspace = true }
tree-sitter-c = { workspace = true }
tree-sitter-cpp = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }

View File

@@ -0,0 +1,592 @@
//! `kebab-parse-code::c` — tree-sitter C AST extractor (P10-1D Task B).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("c")`].
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
//! top-level AST semantic unit:
//!
//! - `function_definition` → 1 unit, symbol = function name (extracted
//! from the declarator's innermost `identifier`, handles pointer-returning
//! functions where the declarator is wrapped in `pointer_declarator`).
//! - `struct_specifier` (named) → 1 unit, symbol = struct name.
//! - `enum_specifier` (named) → 1 unit, symbol = enum name.
//! - `union_specifier` (named) → 1 unit, symbol = union name.
//!
//! Everything else (`declaration`, `preproc_*`, `type_definition`,
//! `linkage_specification`, etc.) collapses into a single `<top-level>`
//! glue chunk. If the file produces zero units **and** zero glue, the
//! `<module>` post-pass emits one unit covering the whole file (1A-2
//! pattern).
//!
//! C symbols are bare names (function, struct, enum, or union name) with no
//! namespace and no class nesting (design §3.4 C row). Versioning follows
//! design §3.4 / §9.1 / §9.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, strip_extension};
pub const PARSER_VERSION: &str = "code-c-v1";
/// C AST extractor. Per-unit blocks via tree-sitter-c 0.24.2
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct CAstExtractor;
impl CAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for CAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for CAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "c")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for CAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec())
.map_err(|e| anyhow::anyhow!("kebab-parse-code: C source is not valid UTF-8: {e}"))?;
let blocks = build_blocks(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("c".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted C doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
/// Walk down the declarator chain of a `function_definition` to find
/// the innermost `identifier` — the function name.
///
/// The tree for `int *foo(int x) { ... }` looks like:
/// ```text
/// function_definition
/// type: primitive_type "int"
/// declarator: pointer_declarator
/// declarator: function_declarator
/// declarator: identifier "foo"
/// parameters: parameter_list
/// body: compound_statement
/// ```
/// We walk `declarator` fields recursively until we reach an `identifier`
/// or run out of nodes. Returns `None` if no identifier is found
/// (malformed / unsupported declarator shape).
fn extract_fn_name<'a>(decl_node: tree_sitter::Node, src: &'a str) -> Option<&'a str> {
let mut cur = decl_node;
loop {
match cur.kind() {
"identifier" => return Some(&src[cur.start_byte()..cur.end_byte()]),
// pointer_declarator, function_declarator, array_declarator,
// attributed_declarator, parenthesized_declarator —
// all carry a `declarator` field pointing deeper.
_ => {
if let Some(inner) = cur.child_by_field_name("declarator") {
cur = inner;
} else {
// No further `declarator` field; give up.
return None;
}
}
}
}
}
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_c::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-c language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse C source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue is accumulated as (start, end) pairs and flushed into one
// "<top-level>" block (or "<module>" if no real unit exists).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
let mut glue: Vec<(u32, u32)> = Vec::new();
/// Walk preceding `comment` siblings to extend the unit's line range
/// upward, folding doc / line comments into the unit (1B pattern).
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
if p.kind() == "comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"function_definition" => {
if let Some(decl) = child.child_by_field_name("declarator") {
if let Some(name) = extract_fn_name(decl, source) {
flush_glue(&mut glue, &mut units);
units.push((name.to_string(), s, e, true));
} else {
// Could not extract name — treat as glue.
glue.push((s, e));
}
} else {
glue.push((s, e));
}
}
"struct_specifier" | "enum_specifier" | "union_specifier" => {
if let Some(name_node) = child.child_by_field_name("name") {
let name = &source[name_node.start_byte()..name_node.end_byte()];
flush_glue(&mut glue, &mut units);
units.push((name.to_string(), s, e, true));
} else {
// Anonymous struct/enum/union — glue.
glue.push((s, e));
}
}
// Everything else: preprocessor directives, declarations
// (typedef / global var / fn prototype), type_definition,
// linkage_specification, etc. — all collapse into glue.
_ => {
glue.push((s, e));
}
}
}
flush_glue(&mut glue, &mut units);
// Post-pass: if the file has no real semantic unit (only glue, or
// completely empty), rename the single glue unit to "<module>" and
// emit it. If there are zero units AND zero glue, synthesise a
// single "<module>" unit covering the whole file.
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if units.is_empty() {
// Completely empty or whitespace-only file (comment-only files produce glue).
let total = lines.len() as u32;
units.push((
"<module>".to_string(),
1,
total.max(1),
false,
));
}
// If there is only glue (no real unit) the single pushed "<top-level>"
// label should be "<module>" — rename it now.
if !has_real_unit {
for (sym, _, _, _) in units.iter_mut() {
if sym == "<top-level>" {
*sym = "<module>".to_string();
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("c".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("c".to_string()),
code,
}));
}
Ok(blocks)
}
fn flush_glue(glue: &mut Vec<(u32, u32)>, units: &mut Vec<(String, u32, u32, bool)>) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, b)| *b).max().unwrap();
units.push(("<top-level>".to_string(), s, e, false));
glue.clear();
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
pub(crate) mod tests_support {
use kebab_core::*;
use std::path::PathBuf;
use time::OffsetDateTime;
pub fn fixed_code_asset(workspace_path: &str, lang: &str) -> RawAsset {
RawAsset {
asset_id: AssetId("a".repeat(64)),
source_uri: SourceUri::File(PathBuf::from(workspace_path)),
workspace_path: WorkspacePath(workspace_path.to_string()),
media_type: MediaType::Code(lang.to_string()),
byte_len: 0,
checksum: Checksum("b".repeat(64)),
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from(workspace_path),
sha: Checksum("b".repeat(64)),
},
}
}
pub fn extract_c(src: &str, path: &str) -> kebab_core::CanonicalDocument {
use super::CAstExtractor;
use kebab_core::Extractor;
let asset = fixed_code_asset(path, "c");
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
}
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn syms(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
doc.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => symbol.clone(),
_ => None,
},
_ => None,
})
.collect()
}
#[test]
fn extractor_supports_only_media_code_c() {
let e = CAstExtractor::new();
assert!(e.supports(&MediaType::Code("c".into())));
assert!(!e.supports(&MediaType::Code("cpp".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn c_extractor_simple_function() {
let src = "int add(int a, int b) { return a + b; }\n";
let doc = tests_support::extract_c(src, "x/math.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "add"), "got {s:?}");
}
#[test]
fn c_extractor_pointer_return_function() {
let src = "int *find(int *arr, int n) { return arr; }\n";
let doc = tests_support::extract_c(src, "x/find.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "find"), "ptr-return fn missing: {s:?}");
}
#[test]
fn c_extractor_static_function() {
let src = "static void helper(void) {}\n";
let doc = tests_support::extract_c(src, "x/helper.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "helper"), "static fn missing: {s:?}");
}
#[test]
fn c_extractor_extern_function() {
let src = "extern int compute(int x);\n";
// extern prototype is a declaration → glue
let doc = tests_support::extract_c(src, "x/compute.c");
let s = syms(&doc);
// declaration (prototype) falls into glue → "<module>"
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for extern proto: {s:?}"
);
}
#[test]
fn c_extractor_inline_function() {
let src = "inline int square(int x) { return x * x; }\n";
let doc = tests_support::extract_c(src, "x/square.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "square"), "inline fn missing: {s:?}");
}
#[test]
fn c_extractor_named_struct() {
let src = "struct Point { int x; int y; };\n";
let doc = tests_support::extract_c(src, "x/point.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Point"), "struct missing: {s:?}");
}
#[test]
fn c_extractor_named_enum() {
let src = "enum Color { RED, GREEN, BLUE };\n";
let doc = tests_support::extract_c(src, "x/color.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Color"), "enum missing: {s:?}");
}
#[test]
fn c_extractor_named_union() {
let src = "union Data { int i; float f; };\n";
let doc = tests_support::extract_c(src, "x/data.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Data"), "union missing: {s:?}");
}
#[test]
fn c_extractor_anonymous_struct_falls_into_glue() {
// Anonymous struct (no name field) → glue → "<module>" (only glue, no real unit)
let src = "struct { int x; int y; } origin;\n";
let doc = tests_support::extract_c(src, "x/anon.c");
let s = syms(&doc);
// anonymous struct is a declaration containing anonymous struct_specifier → glue
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for anon struct: {s:?}"
);
// Must NOT emit a unit named after the declared variable.
assert!(
!s.iter().any(|x| x == "origin"),
"unexpected 'origin' unit: {s:?}"
);
}
#[test]
fn c_extractor_typedef_struct_falls_into_glue() {
// typedef struct { ... } Foo; — inner struct_specifier is anonymous,
// outer node is type_definition → glue. See HOTFIXES.md 2026-05-21.
let src = "typedef struct { int x; int y; } Point;\n";
let doc = tests_support::extract_c(src, "x/typedef.c");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for typedef struct: {s:?}"
);
// The typedef alias should NOT surface as a Code symbol
assert!(
!s.iter().any(|x| x == "Point"),
"unexpected 'Point' unit for typedef struct: {s:?}"
);
}
#[test]
fn c_extractor_preprocessor_directives_are_glue() {
let src = "#include <stdio.h>\n#define MAX 100\n#ifdef DEBUG\n#endif\n";
let doc = tests_support::extract_c(src, "x/macros.c");
let s = syms(&doc);
// Only preprocessor → no real unit → "<module>"
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for preproc-only file: {s:?}"
);
assert_eq!(s.len(), 1, "expected exactly 1 block: {s:?}");
}
#[test]
fn c_extractor_multiple_functions_correct_count() {
let src = "int foo(void) { return 1; }\nint bar(void) { return 2; }\nint baz(void) { return 3; }\n";
let doc = tests_support::extract_c(src, "x/multi.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "foo"), "foo missing: {s:?}");
assert!(s.iter().any(|x| x == "bar"), "bar missing: {s:?}");
assert!(s.iter().any(|x| x == "baz"), "baz missing: {s:?}");
assert_eq!(s.len(), 3, "expected 3 units: {s:?}");
}
#[test]
fn c_extractor_empty_file_produces_module() {
let src = "";
let doc = tests_support::extract_c(src, "x/empty.c");
let s = syms(&doc);
assert_eq!(s, vec!["<module>"], "expected <module>: got {s:?}");
}
#[test]
fn c_extractor_preprocessor_only_produces_module() {
let src = "#include <stdlib.h>\n#define VERSION \"1.0\"\n";
let doc = tests_support::extract_c(src, "x/header.c");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for preproc-only file: {s:?}"
);
}
#[test]
fn c_extractor_mixed_functions_and_glue() {
let src = r#"#include <stdio.h>
int compute(int x) {
return x * 2;
}
extern int lookup(int key);
void print_result(int v) {
printf("%d\n", v);
}
"#;
let doc = tests_support::extract_c(src, "x/mixed.c");
let s = syms(&doc);
// Two real functions plus "<top-level>" glue (the #include and the extern prototype)
assert!(s.iter().any(|x| x == "compute"), "compute missing: {s:?}");
assert!(s.iter().any(|x| x == "print_result"), "print_result missing: {s:?}");
assert!(
s.iter().any(|x| x == "<top-level>"),
"<top-level> glue missing: {s:?}"
);
}
#[test]
fn c_extractor_deterministic_across_runs() {
let src = r#"
struct Node { int val; };
int sum(int a, int b) { return a + b; }
void noop(void) {}
"#;
let a = tests_support::extract_c(src, "x/det.c");
for _ in 0..20 {
assert_eq!(
tests_support::extract_c(src, "x/det.c").blocks,
a.blocks
);
}
}
}
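
The glue handling in the C extractor above can be hard to picture from the tree-sitter walk alone. The following is a minimal, self-contained sketch of the same idea, assuming a simplified input: `Item` and `assemble_units` are hypothetical names introduced only for illustration (they are not part of `kebab-parse-code`), and each `Item` stands in for one top-level tree-sitter node that has already been classified as a named unit or as glue.

```rust
/// Hypothetical stand-in for one top-level node: `Some(name)` marks a named
/// semantic unit (function, struct, ...), `None` marks glue.
pub struct Item<'a> {
    pub name: Option<&'a str>,
    pub line_start: u32,
    pub line_end: u32,
}

/// Merge each consecutive run of glue into one "<top-level>" unit, keep named
/// items as their own units, then apply the "<module>" post-pass.
pub fn assemble_units(items: &[Item<'_>], total_lines: u32) -> Vec<(String, u32, u32)> {
    let mut units: Vec<(String, u32, u32)> = Vec::new();
    let mut glue: Option<(u32, u32)> = None;
    let mut has_real_unit = false;

    for item in items {
        match item.name {
            Some(name) => {
                // A real unit ends the current glue run, so flush it first.
                if let Some((s, e)) = glue.take() {
                    units.push(("<top-level>".to_string(), s, e));
                }
                units.push((name.to_string(), item.line_start, item.line_end));
                has_real_unit = true;
            }
            None => {
                // Widen the pending glue range to cover this item.
                glue = Some(match glue {
                    Some((s, e)) => (s.min(item.line_start), e.max(item.line_end)),
                    None => (item.line_start, item.line_end),
                });
            }
        }
    }
    // Flush whatever glue trails the last real unit.
    if let Some((s, e)) = glue.take() {
        units.push(("<top-level>".to_string(), s, e));
    }

    if units.is_empty() {
        // Nothing at all: one "<module>" unit spanning the whole file.
        units.push(("<module>".to_string(), 1, total_lines.max(1)));
    } else if !has_real_unit {
        // Glue only: the glue unit is relabelled "<module>".
        for unit in units.iter_mut() {
            if unit.0 == "<top-level>" {
                unit.0 = "<module>".to_string();
            }
        }
    }
    units
}
```

Under these assumptions, a file shaped like the `c_extractor_mixed_functions_and_glue` fixture (include, function, prototype, function) yields four units with two `<top-level>` entries, while a file with only preprocessor lines yields a single `<module>` unit.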


@@ -0,0 +1,883 @@
//! `kebab-parse-code::cpp` — tree-sitter C++ AST extractor (P10-1D Task C).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("cpp")`].
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
//! top-level AST semantic unit, each carrying [`SourceSpan::Code`] with
//! the unit's `::` separated symbol path (design §3.4 C++ row).
//!
//! ## Symbol formation
//!
//! Symbol = `namespace::Class::method` via recursive `build_blocks`:
//!
//! - `namespace_definition` (named) → push namespace name, recurse into body.
//! - Anonymous namespace (`namespace { ... }`) → push `<anonymous>`, recurse.
//! - `nested_namespace_specifier` (`outer::inner`) → push all segments, recurse.
//! - `class_specifier` / `struct_specifier` (named) → emit class unit + recurse
//! into body with class name pushed.
//! - `function_definition` → emit method/function unit. Symbol is built from
//! the prefix chain + the extracted declarator name component.
//! - Out-of-class method def (`void Foo::bar() {}`): the declarator's inner
//! node is a `qualified_identifier`; its scope chain is appended to the
//! current prefix to form the full symbol.
//! - `template_declaration` → recurse into named children with same prefix;
//! the inner function/class body is matched by its own arm. Template params
//! are NOT included in the symbol.
//! - `enum_specifier` (named) → emit type unit.
//! - `concept_definition` (C++20) → emit type unit.
//! - `linkage_specification` (extern "C") → recurse into body with same prefix.
//!
//! ## Constructor / destructor / operator overload
//!
//! - Constructor: `function_declarator > identifier` matching the class name.
//! Symbol = `Class::Class` (name duplicated, same convention as Java).
//! - Destructor: `function_declarator > destructor_name`. Symbol = `Class::~Class`.
//! - Operator overload: `function_declarator > operator_name`. Symbol = `Class::operator+`.
//! - Conversion operator: `function_definition.declarator` is `operator_cast`.
//! Symbol = `Class::operator <type>` (e.g. `Class::operator bool`).
//!
//! ## Glue
//!
//! Everything not in the unit list is glue (preproc, declarations, using,
//! typedef, etc.). Each consecutive run of glue between units collapses into
//! one `<top-level>` chunk. If the file has no real unit, the `<module>`
//! post-pass renames the glue chunk to `<module>`; if there is also no glue,
//! it emits one `<module>` unit covering the whole file.
//!
//! Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, strip_extension};
pub const PARSER_VERSION: &str = "code-cpp-v1";
/// C++ AST extractor. Per-unit blocks via tree-sitter-cpp 0.23.4
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct CppAstExtractor;
impl CppAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for CppAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for CppAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "cpp")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for CppAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
anyhow::anyhow!("kebab-parse-code: C++ source is not valid UTF-8: {e}")
})?;
let blocks = build_blocks_top(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("cpp".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted C++ doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
// ---------------------------------------------------------------------------
// Core block-building logic
// ---------------------------------------------------------------------------
/// Top-level entry: parse source, walk the `translation_unit` root, assemble
/// units + glue, apply the `<module>` post-pass, and emit `Block::Code`s.
fn build_blocks_top(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_cpp::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-cpp language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse C++ source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue is accumulated as (start, end) pairs and flushed into one
// "<top-level>" block (or "<module>" if no real unit exists).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
let mut glue: Vec<(u32, u32)> = Vec::new();
build_blocks(root, source, &[], &mut units, &mut glue);
flush_glue(&mut glue, &mut units);
// Post-pass: if the file has no real semantic unit (only glue, or
// completely empty), rename the single glue unit to "<module>".
// If there are zero units AND zero glue, synthesize a single
// "<module>" unit covering the whole file.
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if units.is_empty() {
let total = lines.len() as u32;
units.push(("<module>".to_string(), 1, total.max(1), false));
}
if !has_real_unit {
for (sym, _, _, _) in units.iter_mut() {
if sym == "<top-level>" {
*sym = "<module>".to_string();
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("cpp".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("cpp".to_string()),
code,
}));
}
Ok(blocks)
}
/// Walk preceding `comment` siblings to extend the unit's line range upward,
/// folding leading doc / line comments into the unit (1B pattern).
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
if p.kind() == "comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
fn flush_glue(glue: &mut Vec<(u32, u32)>, units: &mut Vec<(String, u32, u32, bool)>) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, b)| *b).max().unwrap();
units.push(("<top-level>".to_string(), s, e, false));
glue.clear();
}
/// Walk a scope node (translation_unit, declaration_list, field_declaration_list)
/// emitting unit + glue blocks. `prefix` is the current namespace/class chain
/// (e.g. `["kebab", "Chunk", "Foo"]`).
///
/// On return, pending glue in `glue` is NOT flushed. The caller flushes at
/// the scope boundary: `build_blocks_top` for the translation unit, and the
/// namespace/class arms right after recursing into a body, so that glue
/// doesn't leak across scopes.
fn build_blocks(
node: tree_sitter::Node,
source: &str,
prefix: &[String],
units: &mut Vec<(String, u32, u32, bool)>,
glue: &mut Vec<(u32, u32)>,
) {
let mut cur = node.walk();
for child in node.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"namespace_definition" => {
// Flush pending glue before starting this namespace block.
flush_glue(glue, units);
let name_node = child.child_by_field_name("name");
let body = child
.child_by_field_name("body")
.unwrap_or(child);
match name_node {
None => {
// Anonymous namespace: push "<anonymous>", recurse.
let mut new_prefix = prefix.to_vec();
new_prefix.push("<anonymous>".to_string());
build_blocks(body, source, &new_prefix, units, glue);
flush_glue(glue, units);
}
Some(nn) => match nn.kind() {
"namespace_identifier" => {
let name = &source[nn.start_byte()..nn.end_byte()];
let mut new_prefix = prefix.to_vec();
new_prefix.push(name.to_string());
build_blocks(body, source, &new_prefix, units, glue);
flush_glue(glue, units);
}
"nested_namespace_specifier" => {
// e.g. `namespace outer::inner { ... }`
// All named children are namespace_identifier nodes.
let mut new_prefix = prefix.to_vec();
let mut nc = nn.walk();
for seg in nn.named_children(&mut nc) {
new_prefix.push(source[seg.start_byte()..seg.end_byte()].to_string());
}
build_blocks(body, source, &new_prefix, units, glue);
flush_glue(glue, units);
}
_ => {
// Unknown name kind — treat entire namespace as glue.
glue.push((s, e));
}
},
}
}
"class_specifier" | "struct_specifier" => {
let name_node = child.child_by_field_name("name");
let Some(nn) = name_node else {
// Anonymous class/struct — glue.
glue.push((s, e));
continue;
};
// For `type_identifier` this is the plain name; for `template_type` or
// `qualified_identifier` it is the full text (includes template args).
let name = &source[nn.start_byte()..nn.end_byte()];
flush_glue(glue, units);
let sym = build_symbol(prefix, &[name]);
units.push((sym, s, e, true));
if let Some(body) = child.child_by_field_name("body") {
let mut new_prefix = prefix.to_vec();
new_prefix.push(name.to_string());
build_blocks(body, source, &new_prefix, units, glue);
flush_glue(glue, units);
}
}
"function_definition" => {
let decl = child.child_by_field_name("declarator");
let Some(decl_node) = decl else {
glue.push((s, e));
continue;
};
match extract_fn_symbol(decl_node, source, prefix) {
Some(sym) => {
flush_glue(glue, units);
units.push((sym, s, e, true));
}
None => {
glue.push((s, e));
}
}
}
"template_declaration" => {
// Unwrap: recurse into named children with same prefix.
// The inner function/class/concept is matched by its own arm.
// template_parameter_list is a named child that matches none of
// the unit arms, so it falls through to glue.
build_blocks(child, source, prefix, units, glue);
// Do NOT flush glue here — template body may be part of a glue group.
}
"enum_specifier" => {
if let Some(nn) = child.child_by_field_name("name") {
let name = &source[nn.start_byte()..nn.end_byte()];
flush_glue(glue, units);
let sym = build_symbol(prefix, &[name]);
units.push((sym, s, e, true));
} else {
// Anonymous enum — glue.
glue.push((s, e));
}
}
"concept_definition" => {
// C++20. Has required "name" field (identifier).
if let Some(nn) = child.child_by_field_name("name") {
let name = &source[nn.start_byte()..nn.end_byte()];
flush_glue(glue, units);
let sym = build_symbol(prefix, &[name]);
units.push((sym, s, e, true));
} else {
glue.push((s, e));
}
}
"linkage_specification" => {
// extern "C" { ... }: the wrapper itself is neither a unit nor glue.
// Recurse into the body with the same prefix so inner definitions are
// extracted and inner non-unit children are pushed as glue individually.
let body = child.child_by_field_name("body").unwrap_or(child);
build_blocks(body, source, prefix, units, glue);
}
// Everything else: preproc, declarations, using, typedef, etc.
_ => {
glue.push((s, e));
}
}
}
}
/// Join prefix + extras into a `::` separated symbol.
fn build_symbol(prefix: &[String], extras: &[&str]) -> String {
let mut parts: Vec<&str> = prefix.iter().map(String::as_str).collect();
parts.extend_from_slice(extras);
parts.join("::")
}
/// Extract the symbol for a `function_definition` given its top-level
/// `declarator` node. Returns `None` if the name cannot be determined.
///
/// The declarator chain may be:
/// - `function_declarator` (plain fn or method)
/// - `pointer_declarator` wrapping `function_declarator` (fn returning pointer)
/// - `reference_declarator` wrapping `function_declarator` (fn returning ref)
/// - `operator_cast` (conversion operator — e.g. `operator bool`)
///
/// The inner `function_declarator.declarator` is one of:
/// - `identifier` → free fn or constructor, symbol = `prefix::name`
/// - `field_identifier` → method in class body, symbol = `prefix::name`
/// - `destructor_name` → `~Foo`, symbol = `prefix::~Foo`
/// - `operator_name` → `operator+` etc., symbol = `prefix::operator+`
/// - `qualified_identifier` → out-of-class def `Foo::bar` or `ns::Foo::bar`;
/// the scope chain is extracted and appended to prefix.
///
/// For `qualified_identifier`, the scope hierarchy (which may itself be a
/// `qualified_identifier`) is flattened into a list of segments. These
/// segments are appended to the current prefix; they do not replace it.
/// Example: `void ns::Foo::bar() {}` at top level
/// with prefix=[] → segments=[ns, Foo, bar] → symbol = `ns::Foo::bar`.
fn extract_fn_symbol(
decl_node: tree_sitter::Node,
source: &str,
prefix: &[String],
) -> Option<String> {
// Walk down pointer/reference wrapper layers to reach the
// function_declarator (or operator_cast at definition level).
let fn_decl = unwrap_to_fn_declarator(decl_node, source)?;
match fn_decl.kind() {
"operator_cast" => {
// e.g. `operator bool() const` — the function_definition.declarator
// IS the operator_cast (no function_declarator wrapper).
// Symbol = `prefix::operator <type>`.
let type_node = fn_decl.child_by_field_name("type")?;
let type_text = &source[type_node.start_byte()..type_node.end_byte()];
Some(build_symbol(prefix, &[&format!("operator {type_text}")]))
}
"function_declarator" => {
let inner = fn_decl.child_by_field_name("declarator")?;
extract_name_node(inner, source, prefix)
}
_ => None,
}
}
/// Walk pointer_declarator / reference_declarator chains down to the
/// first `function_declarator` or `operator_cast` node.
///
/// Returns `None` if no such node is found (e.g. a function definition
/// whose declarator is malformed or unknown).
fn unwrap_to_fn_declarator<'a>(
mut node: tree_sitter::Node<'a>,
_source: &str,
) -> Option<tree_sitter::Node<'a>> {
loop {
match node.kind() {
"function_declarator" | "operator_cast" => return Some(node),
"pointer_declarator" => {
node = node.child_by_field_name("declarator")?;
}
"reference_declarator" | "rvalue_reference_declarator" => {
// reference_declarator has no `declarator` field; its inner
// declarator is a plain (non-field) child, so take the first named child.
let mut walker = node.walk();
node = node.named_children(&mut walker).next()?;
}
_ => return None,
}
}
}
/// Given the innermost name node of a function_declarator, produce the symbol.
fn extract_name_node(
inner: tree_sitter::Node,
source: &str,
prefix: &[String],
) -> Option<String> {
match inner.kind() {
"identifier" | "field_identifier" => {
let name = &source[inner.start_byte()..inner.end_byte()];
Some(build_symbol(prefix, &[name]))
}
"destructor_name" => {
// destructor_name text includes the `~` prefix (e.g. "~Foo").
let full = &source[inner.start_byte()..inner.end_byte()];
Some(build_symbol(prefix, &[full]))
}
"operator_name" => {
// Full text e.g. "operator+", "operator->", "operator()".
let full = &source[inner.start_byte()..inner.end_byte()];
Some(build_symbol(prefix, &[full]))
}
"template_function" | "template_method" => {
// Template function like `foo<int>()`. Use the `name` field
// (the identifier / field_identifier before `<`).
let name_node = inner.child_by_field_name("name")?;
let name = &source[name_node.start_byte()..name_node.end_byte()];
Some(build_symbol(prefix, &[name]))
}
"qualified_identifier" => {
// Out-of-class method definition. Flatten the nested
// qualified_identifier chain into ordered segments.
// Example: `ns::Foo::method`
// qualified_identifier {
// scope: namespace_identifier "ns"
// name: qualified_identifier {
// scope: namespace_identifier "Foo"
// name: identifier "method"
// }
// }
// → ["ns", "Foo", "method"]
//
// These segments are appended to the current prefix, so an
// out-of-class def `void Foo::bar() {}` inside a
// namespace body with prefix=["ns"] produces `ns::Foo::bar`.
let mut segments: Vec<String> = Vec::new();
flatten_qualified_id(inner, source, &mut segments);
if segments.is_empty() {
return None;
}
// Build: prefix + all segments (scope chain + leaf).
let mut all: Vec<&str> = prefix.iter().map(String::as_str).collect();
for seg in &segments {
all.push(seg.as_str());
}
Some(all.join("::"))
}
_ => None,
}
}
/// Recursively flatten a `qualified_identifier` node into ordered string
/// segments. For `ns::Foo::method` this produces `["ns", "Foo", "method"]`.
fn flatten_qualified_id(node: tree_sitter::Node, source: &str, out: &mut Vec<String>) {
// A qualified_identifier has:
// scope: namespace_identifier | (None for global-scope `::foo`)
// name: identifier | field_identifier | destructor_name |
// operator_name | qualified_identifier | template_function |
// template_method | ...
let scope_node = node.child_by_field_name("scope");
let name_node = node.child_by_field_name("name");
if let Some(s) = scope_node {
out.push(source[s.start_byte()..s.end_byte()].to_string());
}
match name_node {
Some(n) if n.kind() == "qualified_identifier" => {
// Recurse: more nesting.
flatten_qualified_id(n, source, out);
}
Some(n) => {
// Leaf name — push its text.
out.push(source[n.start_byte()..n.end_byte()].to_string());
}
None => {}
}
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
pub(crate) mod tests_support {
use kebab_core::*;
use std::path::PathBuf;
use time::OffsetDateTime;
pub fn fixed_code_asset(workspace_path: &str, lang: &str) -> RawAsset {
RawAsset {
asset_id: AssetId("a".repeat(64)),
source_uri: SourceUri::File(PathBuf::from(workspace_path)),
workspace_path: WorkspacePath(workspace_path.to_string()),
media_type: MediaType::Code(lang.to_string()),
byte_len: 0,
checksum: Checksum("b".repeat(64)),
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from(workspace_path),
sha: Checksum("b".repeat(64)),
},
}
}
pub fn extract_cpp(src: &str, path: &str) -> kebab_core::CanonicalDocument {
use super::CppAstExtractor;
use kebab_core::Extractor;
let asset = fixed_code_asset(path, "cpp");
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CppAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
}
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn syms(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
let mut s: Vec<String> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => symbol.clone(),
_ => None,
},
_ => None,
})
.collect();
s.sort();
s
}
#[test]
fn extractor_supports_only_media_code_cpp() {
let e = CppAstExtractor::new();
assert!(e.supports(&MediaType::Code("cpp".into())));
assert!(!e.supports(&MediaType::Code("c".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn free_function() {
let src = "void foo() {}\n";
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "foo"), "got {s:?}");
}
#[test]
fn namespace_and_class() {
let src = r#"
namespace ns {
class Foo {
public:
void method() {}
Foo() {}
~Foo() {}
int operator+(const Foo& o) { return 0; }
};
}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "ns::Foo"), "ns::Foo missing: {s:?}");
assert!(s.iter().any(|x| x == "ns::Foo::method"), "method missing: {s:?}");
assert!(s.iter().any(|x| x == "ns::Foo::Foo"), "ctor missing: {s:?}");
assert!(s.iter().any(|x| x == "ns::Foo::~Foo"), "dtor missing: {s:?}");
assert!(s.iter().any(|x| x == "ns::Foo::operator+"), "op+ missing: {s:?}");
}
#[test]
fn anonymous_namespace() {
let src = r#"
namespace {
void hidden_fn() {}
}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "<anonymous>::hidden_fn"),
"anon fn missing: {s:?}"
);
}
#[test]
fn nested_namespace_specifier() {
let src = r#"
namespace outer::inner {
void fn_in_nested() {}
}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "outer::inner::fn_in_nested"),
"nested ns fn missing: {s:?}"
);
}
#[test]
fn out_of_class_method_def() {
let src = r#"
void ns::Foo::method() { }
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "ns::Foo::method"),
"out-of-class method missing: {s:?}"
);
}
#[test]
fn template_declaration() {
let src = r#"
template<typename T>
class Bar {
void tmpl_method() {}
};
template<typename T>
void tmpl_free_fn(T x) {}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Bar"), "Bar class missing: {s:?}");
assert!(
s.iter().any(|x| x == "Bar::tmpl_method"),
"Bar::tmpl_method missing: {s:?}"
);
assert!(
s.iter().any(|x| x == "tmpl_free_fn"),
"tmpl_free_fn missing: {s:?}"
);
}
#[test]
fn enum_and_concept() {
let src = r#"
enum class Color { Red, Green };
template<typename T>
concept Printable = requires(T t) { t.print(); };
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Color"), "Color missing: {s:?}");
assert!(s.iter().any(|x| x == "Printable"), "Printable missing: {s:?}");
}
#[test]
fn extern_c_block() {
let src = r#"
extern "C" {
void c_fn1() {}
void c_fn2() {}
}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "c_fn1"), "c_fn1 missing: {s:?}");
assert!(s.iter().any(|x| x == "c_fn2"), "c_fn2 missing: {s:?}");
}
#[test]
fn conversion_operator() {
let src = r#"
class Foo {
operator bool() const { return true; }
};
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "Foo::operator bool"),
"conversion op missing: {s:?}"
);
}
#[test]
fn empty_file_produces_module() {
let src = "";
let doc = tests_support::extract_cpp(src, "x/empty.cpp");
let s = syms(&doc);
assert_eq!(s, vec!["<module>"], "expected <module>: got {s:?}");
}
#[test]
fn glue_only_produces_module() {
let src = "#include <vector>\nusing namespace std;\n";
let doc = tests_support::extract_cpp(src, "x/glue.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "<module>"), "expected <module>: got {s:?}");
}
#[test]
fn ptr_returning_function() {
let src = "int* ptr_fn(int x) { return &x; }\n";
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "ptr_fn"), "ptr_fn missing: {s:?}");
}
#[test]
fn ref_returning_operator() {
let src = r#"
class Foo {
Foo& operator=(const Foo& o) { return *this; }
};
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "Foo::operator="),
"operator= missing: {s:?}"
);
}
#[test]
fn deterministic_across_runs() {
let src = r#"
namespace ns {
class Foo {
void method() {}
};
}
void free_fn() {}
"#;
let a = tests_support::extract_cpp(src, "x/foo.cpp");
for _ in 0..20 {
assert_eq!(tests_support::extract_cpp(src, "x/foo.cpp").blocks, a.blocks);
}
}
}
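
The recursion in `flatten_qualified_id` at the top of this file can be pictured without tree-sitter. The following is a minimal sketch, not the extractor's code: the `Name` enum and `flatten_to_symbol` are hypothetical stand-ins for a `qualified_identifier` node and its caller, and joining the parts with `::` is an assumption based on the symbols the tests above expect (for example `ns::Foo::method`).

```rust
/// Hypothetical stand-in for a tree-sitter C++ name node.
enum Name {
    /// A leaf such as `identifier` or `destructor_name`, holding its text.
    Leaf(String),
    /// A `qualified_identifier`: optional `scope` field plus a `name`
    /// field that may itself be qualified.
    Qualified {
        scope: Option<String>,
        name: Option<Box<Name>>,
    },
}

/// Push each scope segment, then recurse into the name until a leaf is hit.
fn flatten(node: &Name, out: &mut Vec<String>) {
    match node {
        Name::Leaf(text) => out.push(text.clone()),
        Name::Qualified { scope, name } => {
            // A global-scope `::foo` has no scope segment to push.
            if let Some(s) = scope {
                out.push(s.clone());
            }
            if let Some(n) = name {
                flatten(n, out);
            }
        }
    }
}

/// Assumption: the caller joins the flattened parts with `::`.
fn flatten_to_symbol(node: &Name) -> String {
    let mut parts = Vec::new();
    flatten(node, &mut parts);
    parts.join("::")
}
```

Under these assumptions, a node shaped like `ns::Foo::method` flattens to the three parts `ns`, `Foo`, `method`, and a global-scope `::foo` flattens to the single part `foo`.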


@@ -0,0 +1,451 @@
//! `kebab-parse-code::go` — tree-sitter Go AST extractor (P10-1C-Go Task D).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("go")`].
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
//! top-level AST semantic unit (free fn, method, each type spec) carrying
//! [`SourceSpan::Code`] with the unit's self-reference symbol path
//! (design §3.4 Go row). Glue declarations (`import` / `const` / `var`)
//! collapse into one grouped `<top-level>` (or `<module>`) unit.
//!
//! Unlike the Python/TS/JS extractors which path-derive their module
//! prefix from the workspace file path, Go's package identity comes from
//! the source itself (the leading `package` clause) — `extract_package`
//! reads it from the AST. If the `package_clause` is missing (invalid Go
//! in practice) the prefix falls back to `"<unknown>"`.
//!
//! Doc comments immediately preceding an item are folded into that
//! item's line range via `unit_start` (1B pattern). Go has no separate
//! attribute/decorator AST nodes.
//!
//! Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
pub const PARSER_VERSION: &str = "code-go-v1";
/// Go AST extractor. Per-unit blocks via tree-sitter-go 0.25
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct GoAstExtractor;
impl GoAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for GoAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for GoAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "go")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for GoAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec())
.map_err(|e| anyhow::anyhow!("kebab-parse-code: Go source is not valid UTF-8: {e}"))?;
let blocks = build_blocks(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
// Resolve the file's absolute path for repo detection. If the
// source URI carries a relative path, anchor it at the workspace
// root so the `.git/` walk-up starts from the right place.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("go".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted Go doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
/// p10-1C-Go: extract `package` declaration text from a tree-sitter-go
/// `source_file`. Returns `None` if no `package_clause` (invalid Go in
/// practice but defense-in-depth). Per design §3.4 Go row.
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
if child.kind() == "package_clause" {
let mut c2 = child.walk();
for sub in child.named_children(&mut c2) {
if sub.kind() == "package_identifier" {
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
}
}
}
}
None
}
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_go::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-go language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Go source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue groups are pushed with a sentinel symbol + is_real=false so a
// post-pass can decide `<module>` vs `<top-level>` (1B post-pass
// mirror).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
// (is_import 0/1, s, e). `is_import` flags `import_declaration` —
// used by the glue flush to pick `<module>` vs `<top-level>`
// provisional label.
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
n.child_by_field_name("name")
.map(|c| &src[c.start_byte()..c.end_byte()])
}
/// Walk preceding `comment` siblings to extend the unit's line range
/// upward, folding leading doc / line comments into the unit. Go has
/// no decorator/attribute nodes — doc comments are simply preceding
/// `comment` siblings (the 1B pattern).
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
if p.kind() == "comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
/// Extract the receiver type text for a `method_declaration`. The
/// returned slice INCLUDES the leading `*` for pointer receivers
/// (`(*Foo).Bar`) per design §3.4 Go row example. Returns `None` if
/// the receiver is malformed (defense in depth).
fn receiver_type_text<'a>(method_node: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
let recv = method_node.child_by_field_name("receiver")?;
let mut cw = recv.walk();
for p in recv.named_children(&mut cw) {
if p.kind() == "parameter_declaration" {
if let Some(ty) = p.child_by_field_name("type") {
return Some(&src[ty.start_byte()..ty.end_byte()]);
}
}
}
None
}
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"function_declaration" => {
if let Some(name) = node_name_text(&child, source) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(&mut glue, &mut units, &mod_prefix);
let sym = join_symbol(&mod_prefix, &[], name);
units.push((sym, s, e, true));
}
}
"method_declaration" => {
if let Some(name_node) = child.child_by_field_name("name") {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(&mut glue, &mut units, &mod_prefix);
let owner = receiver_type_text(&child, source).unwrap_or("<unknown>");
let method_name = &source[name_node.start_byte()..name_node.end_byte()];
let sym = format!("{mod_prefix}.({owner}).{method_name}");
units.push((sym, s, e, true));
}
}
"type_declaration" => {
// One unit per inner `type_spec`. A declaration with a single
// spec keeps the outer, upward-folded `s`, so leading doc
// comments stay attached to it. In a grouped `type ( ... )`
// declaration every spec (including the first) starts at its
// own first line, so no leading comments are folded in
// (1B pattern).
let mut tcur = child.walk();
let specs: Vec<tree_sitter::Node> = child
.named_children(&mut tcur)
.filter(|c| c.kind() == "type_spec")
.collect();
let single = specs.len() == 1;
for spec in specs {
let name_node = match spec.child_by_field_name("name") {
Some(n) => n,
None => continue,
};
let spec_s = if single {
s
} else {
spec.start_position().row as u32 + 1
};
let spec_e = spec.end_position().row as u32 + 1;
glue.retain(|(_, gs, _)| *gs < spec_s);
flush_glue(&mut glue, &mut units, &mod_prefix);
let name = &source[name_node.start_byte()..name_node.end_byte()];
let sym = join_symbol(&mod_prefix, &[], name);
units.push((sym, spec_s, spec_e, true));
}
}
"import_declaration" => {
glue.push((1, s, e));
}
"const_declaration" | "var_declaration" => {
glue.push((0, s, e));
}
_ => {}
}
}
flush_glue(&mut glue, &mut units, &mod_prefix);
// `<module>` is correct only when the file produced no real unit.
// Otherwise the import/const/var-only group becomes `<top-level>`
// (same post-pass as 1B). Match on the suffix so the demotion stays
// mod-prefix-agnostic.
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if has_real_unit {
for (sym, _, _, is_real) in units.iter_mut() {
if !*is_real && sym.ends_with("<module>") {
let pre = &sym[..sym.len() - "<module>".len()];
*sym = format!("{pre}<top-level>");
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("go".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("go".to_string()),
code,
}));
}
Ok(blocks)
}
fn flush_glue(
glue: &mut Vec<(usize, u32, u32)>,
units: &mut Vec<(String, u32, u32, bool)>,
mod_prefix: &str,
) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
// Provisional label: `<module>` only if the group is exclusively
// imports (1A's `only_mod_decls` analog). The post-pass demotes any
// `<module>` to `<top-level>` if the file produced any real unit.
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
let label = if only_imports { "<module>" } else { "<top-level>" };
units.push((join_symbol(mod_prefix, &[], label), s, e, false));
glue.clear();
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(concat!(
env!("CARGO_MANIFEST_DIR"),
"/tests/fixtures/sample.go"
))
.unwrap();
// Reuse the cross-language test-support helper promoted in 1B.
let asset = crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.go", "go");
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
GoAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_go() {
let e = GoAstExtractor::new();
assert!(e.supports(&MediaType::Code("go".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn go_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("go"));
symbol.clone()
}
_ => None,
},
_ => None,
})
.collect();
syms.sort();
assert!(syms.iter().any(|s| s == "chunk.Free"), "got {syms:?}");
assert!(syms.iter().any(|s| s == "chunk.init"), "got {syms:?}");
assert!(
syms.iter().any(|s| s == "chunk.MdHeadingV1Chunker"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "chunk.(*MdHeadingV1Chunker).ChunkDoc"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "chunk.(MdHeadingV1Chunker).Name2"),
"got {syms:?}"
);
assert!(syms.iter().any(|s| s == "chunk.Stringer"), "got {syms:?}");
// import + const grouped into one glue unit (no isolated `<module>`).
assert!(
syms.iter().any(|s| s == "chunk.<top-level>"),
"got {syms:?}"
);
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 {
assert_eq!(extract_fixture().blocks, a.blocks);
}
}
}
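
The glue labelling in this file happens in two steps: `flush_glue` assigns a provisional label, and a post-pass demotes `<module>` to `<top-level>` once the file has any real unit. The sketch below isolates that logic under stated assumptions: `demote_module_labels` is a hypothetical name for the inline post-pass, the glue flag is a `bool` rather than `0/1`, and `format!("{prefix}.{label}")` stands in for `join_symbol`, whose exact behaviour is not shown here.

```rust
/// (symbol, line_start, line_end, is_real_semantic_unit), as in the extractor.
type Unit = (String, u32, u32, bool);

/// Collapse pending glue ranges into one provisional unit.
/// `glue` entries are (is_import, line_start, line_end).
fn flush_glue(glue: &mut Vec<(bool, u32, u32)>, units: &mut Vec<Unit>, prefix: &str) {
    let (Some(s), Some(e)) = (
        glue.iter().map(|g| g.1).min(),
        glue.iter().map(|g| g.2).max(),
    ) else {
        return; // nothing pending
    };
    // Provisional label: `<module>` only for an import-only group.
    let only_imports = glue.iter().all(|g| g.0);
    let label = if only_imports { "<module>" } else { "<top-level>" };
    // Assumption: stand-in for `join_symbol(prefix, &[], label)`.
    units.push((format!("{prefix}.{label}"), s, e, false));
    glue.clear();
}

/// Post-pass: `<module>` survives only when the file has no real unit.
fn demote_module_labels(units: &mut [Unit]) {
    if !units.iter().any(|u| u.3) {
        return;
    }
    for (sym, _, _, is_real) in units.iter_mut() {
        if *is_real {
            continue;
        }
        // Match on the suffix so the rewrite is independent of the prefix.
        if let Some(pre) = sym.strip_suffix("<module>") {
            let demoted = format!("{pre}<top-level>");
            *sym = demoted;
        }
    }
}
```

With this shape, an import-only file keeps a single `<module>` unit, while the same import group in a file that also declares a function ends up labelled `<top-level>`.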


@@ -0,0 +1,543 @@
//! `kebab-parse-code::java` — tree-sitter Java AST extractor (P10-1C-JK Task D).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("java")`].
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
//! AST semantic unit (class / interface / enum / record /
//! annotation-type at any nesting level, plus methods + constructors
//! inside class / interface / record bodies), each carrying
//! [`SourceSpan::Code`] with the unit's dotted self-reference symbol
//! path (design §3.4 Java row). Glue declarations (`import`) collapse
//! into one grouped `<top-level>` (or `<module>`) unit.
//!
//! Like the Go extractor, Java's package identity comes from the
//! source itself (the `package_declaration` clause), not from the
//! workspace file path — `extract_package` reads it from the AST. If
//! the clause is missing the prefix falls back to `"<unknown>"`.
//!
//! Class/interface/record bodies are recursed (1B Python pattern):
//! the type name is pushed onto `mod_path` so methods and nested
//! types become `<pkg>.<Outer>.<Inner>.<method>`. Constructors use
//! the Java convention `<pkg>.<...>.<Class>.<ClassName>` (name
//! duplicated, per design §3.4). Enum and annotation-type bodies are not
//! recursed in this first cut, so enum constants are not emitted as units.
//!
//! Javadoc (`/** ... */` → `block_comment`) and line comments
//! immediately preceding an item are folded into that item's line
//! range via `unit_start` (1B pattern). Annotations are children of
//! the declaration node itself (inside `modifiers`), so they are
//! already part of the declaration's span — no separate unwrap arm.
//!
//! Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
pub const PARSER_VERSION: &str = "code-java-v1";
/// Java AST extractor. Per-unit blocks via tree-sitter-java 0.23
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct JavaAstExtractor;
impl JavaAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for JavaAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for JavaAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "java")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for JavaAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec())
.map_err(|e| anyhow::anyhow!("kebab-parse-code: Java source is not valid UTF-8: {e}"))?;
let blocks = build_blocks(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
// Resolve the file's absolute path for repo detection. If the
// source URI carries a relative path, anchor it at the workspace
// root so the `.git/` walk-up starts from the right place.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("java".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted Java doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
/// p10-1C-JK: extract `package` declaration text from a tree-sitter-java
/// `program`. Returns `None` if no `package_declaration` (default-package
/// Java file). The package_declaration's named children are either a
/// single `identifier` (single-segment package, rare) or a
/// `scoped_identifier` (dotted, common). Per design §3.4 Java row.
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
if child.kind() == "package_declaration" {
let mut c2 = child.walk();
for sub in child.named_children(&mut c2) {
if sub.kind() == "scoped_identifier" || sub.kind() == "identifier" {
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
}
}
}
}
None
}
/// Walk preceding `line_comment` / `block_comment` siblings to extend
/// the unit's line range upward, folding leading Javadoc / line
/// comments into the unit. Annotations live INSIDE `modifiers` on the
/// declaration node itself, so their lines are already inside
/// `n.start_position()` — no separate unwrap arm is needed for them.
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
let k = p.kind();
if k == "line_comment" || k == "block_comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
n.child_by_field_name("name")
.map(|c| &src[c.start_byte()..c.end_byte()])
}
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_java::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-java language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Java source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue groups are pushed with a sentinel symbol + is_real=false so a
// post-pass can decide `<module>` vs `<top-level>` (1B/1C-Go pattern).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
// (is_import 0/1, s, e). `is_import` flags `import_declaration` —
// used by the glue flush to pick `<module>` vs `<top-level>`
// provisional label.
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
walk_top(root, source, &mod_prefix, &mut units, &mut glue);
// `<module>` is correct only when the file produced no real unit.
// Otherwise the import-only group becomes `<top-level>` (same
// post-pass as 1B / 1C-Go).
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if has_real_unit {
for (sym, _, _, is_real) in units.iter_mut() {
if !*is_real && sym.ends_with("<module>") {
let pre = &sym[..sym.len() - "<module>".len()];
*sym = format!("{pre}<top-level>");
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("java".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("java".to_string()),
code,
}));
}
Ok(blocks)
}
/// Walk the file's top-level children — `program` named children:
/// `package_declaration` (handled by `extract_package`), `import_declaration`
/// (glue), and the five type declarations (`class` / `interface` /
/// `enum` / `record` / `annotation_type`). Type-declaration bodies
/// are recursed via [`walk_body`] with the type name pushed onto
/// `mod_path` (1B Python pattern). Enum and annotation-type bodies are NOT
/// recursed (first cut; see module-level doc).
fn walk_top(
node: tree_sitter::Node,
src: &str,
mod_prefix: &str,
units: &mut Vec<(String, u32, u32, bool)>,
glue: &mut Vec<(usize, u32, u32)>,
) {
let mod_path: &[String] = &[];
let mut cur = node.walk();
for child in node.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"class_declaration"
| "interface_declaration"
| "record_declaration" => {
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if let Some(body) = child.child_by_field_name("body") {
let np: Vec<String> = vec![name.to_string()];
walk_body(body, src, mod_prefix, &np, units);
}
}
}
"enum_declaration" => {
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
// Enum body NOT recursed in the first cut: enum constants are
// not emitted as units, and method declarations inside
// enum bodies (rare) live under `enum_body_declarations`,
// not `class_body`. Skipped per the design §3.4 first-cut scope.
}
}
"annotation_type_declaration" => {
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
}
}
"import_declaration" => {
glue.push((1, s, e));
}
// package_declaration is handled by `extract_package`; no
// glue entry — it's structural metadata, not a unit.
_ => {}
}
}
flush_glue(glue, units, mod_prefix, mod_path);
}
/// Walk a `class_body` / `interface_body` (or record's `class_body`).
/// Emits one unit per method / constructor, and recurses into nested
/// type declarations. Field declarations are NOT emitted (would
/// explode unit count). `compact_constructor_declaration` (records)
/// is handled the same as `constructor_declaration`.
///
/// No `glue` parameter: Java does not have imports inside type
/// bodies — they only appear at file top level, handled by
/// [`walk_top`].
fn walk_body(
body: tree_sitter::Node,
src: &str,
mod_prefix: &str,
mod_path: &[String],
units: &mut Vec<(String, u32, u32, bool)>,
) {
let mut cur = body.walk();
for child in body.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"method_declaration"
| "constructor_declaration"
| "compact_constructor_declaration" => {
// Constructor: name field equals the class name. Per
// design §3.4 Java convention, symbol is
// `<pkg>.<mod_path>.<ClassName>` with the constructor
// name (== class name) as the trailing segment. This
// means the symbol duplicates the class name (e.g.
// `com.x.Foo.Foo`), which is the documented convention.
if let Some(name) = node_name_text(&child, src) {
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
}
}
"class_declaration"
| "interface_declaration"
| "record_declaration"
| "enum_declaration"
| "annotation_type_declaration" => {
// Nested type: emit unit, then recurse into its body
// (skipped for enum + annotation_type per first-cut scope).
let name = match node_name_text(&child, src) {
Some(n) => n,
None => continue,
};
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if child.kind() != "enum_declaration"
&& child.kind() != "annotation_type_declaration"
{
if let Some(inner_body) = child.child_by_field_name("body") {
let mut np = mod_path.to_vec();
np.push(name.to_string());
walk_body(inner_body, src, mod_prefix, &np, units);
}
}
}
// field_declaration, static_initializer, block: NOT emitted.
_ => {}
}
}
}
fn flush_glue(
glue: &mut Vec<(usize, u32, u32)>,
units: &mut Vec<(String, u32, u32, bool)>,
mod_prefix: &str,
mod_path: &[String],
) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
// Provisional label: `<module>` only if the group is exclusively
// imports (1A's `only_mod_decls` analog). The post-pass demotes any
// `<module>` to `<top-level>` if the file produced any real unit.
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
let label = if only_imports { "<module>" } else { "<top-level>" };
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
glue.clear();
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(concat!(
env!("CARGO_MANIFEST_DIR"),
"/tests/fixtures/sample.java"
))
.unwrap();
let asset =
crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.java", "java");
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
JavaAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_java() {
let e = JavaAstExtractor::new();
assert!(e.supports(&MediaType::Code("java".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn java_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("java"));
symbol.clone()
}
_ => None,
},
_ => None,
})
.collect();
syms.sort();
// package extracted from source = com.kebab.chunk
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"),
"got {syms:?}"
);
// constructor — Java convention is class-name-as-method-name
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.MdHeadingV1Chunker"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"),
"got {syms:?}"
);
// static nested class
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.withName"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.build"),
"got {syms:?}"
);
// package-private interface + enum
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Stringer"),
"got {syms:?}"
);
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Mode"),
"got {syms:?}"
);
// import grouped as <top-level>
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"),
"got {syms:?}"
);
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 {
assert_eq!(extract_fixture().blocks, a.blocks);
}
}
}
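
The constructor convention exercised by the test above (the class name repeated as the trailing segment) can be illustrated with a small sketch. `join_symbol_sketch` and `constructor_symbol` are hypothetical names; the real `scaffold::join_symbol` is not shown in this diff, so the dot-joining and the skipping of empty segments here are assumptions inferred from the asserted symbols.

```rust
/// Hypothetical stand-in for `scaffold::join_symbol` (NOT the real code):
/// joins the package prefix, the nesting path and the leaf name with dots.
/// Skipping empty segments is an assumption made for this sketch.
fn join_symbol_sketch(mod_prefix: &str, mod_path: &[String], name: &str) -> String {
    std::iter::once(mod_prefix)
        .chain(mod_path.iter().map(String::as_str))
        .chain(std::iter::once(name))
        .filter(|seg| !seg.is_empty())
        .collect::<Vec<_>>()
        .join(".")
}

/// Symbol for a constructor: the enclosing class name (last element of
/// `mod_path`) is repeated as the trailing segment, e.g. `com.x.Foo.Foo`.
/// Returns `None` when there is no enclosing class.
fn constructor_symbol(mod_prefix: &str, mod_path: &[String]) -> Option<String> {
    let class_name = mod_path.last()?;
    Some(join_symbol_sketch(mod_prefix, mod_path, class_name))
}
```

With a prefix of `com.kebab.chunk` and a nesting path of `[MdHeadingV1Chunker]`, the sketch yields the same `MdHeadingV1Chunker.MdHeadingV1Chunker` tail that the test asserts.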

@@ -0,0 +1,627 @@
//! `kebab-parse-code::kotlin` — tree-sitter Kotlin AST extractor (P10-1C-JK Task G).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("kotlin")`].
//! Mirrors the Java extractor (JVM family, source-side `package` extraction +
//! class-nesting) with Kotlin-specific adjustments:
//!
//! * Root is `source_file` (not `program`).
//! * `package_header` carries a single `qualified_identifier` child whose
//! slice text IS the dotted package path — never a bare `identifier`
//! sub-form for the package (the grammar always wraps a single segment
//! in `qualified_identifier` too).
//! * `class_declaration` covers `class`, `data class`, `sealed class`,
//! `enum class`, AND `interface` — Kotlin uses ONE node kind with a
//! `modifiers` child rather than separate `interface_declaration` /
//! `enum_declaration` nodes (verified via tree-sitter-kotlin-ng
//! `node-types.json`).
//! * The body child of `class_declaration` is either `class_body` (normal
//! classes / interfaces) OR `enum_class_body` (enum class). Neither
//! carries a `body` field name, so it is matched by kind, not by
//! `child_by_field_name("body")`.
//! * `companion_object` is a SEPARATE node kind (not `object_declaration`
//! with a modifier). Its `name` field is OPTIONAL — when omitted (the
//! common case `companion object { ... }`) the symbol uses the
//! implicit Kotlin convention name `Companion`.
//! * `object_declaration` (named singleton) carries a `name` field and a
//! `class_body` child.
//! * `function_declaration` may appear at top level (Kotlin top-level
//! function) AND inside `class_body` — same node kind, the
//! `mod_path` state distinguishes the two emit forms.
//!
//! Enum bodies (`enum_class_body`) are NOT recursed in the first cut:
//! `enum_entry` declarations are not emitted as units, matching the
//! Java extractor's enum policy (design §3.4 phase-1 scope).
//!
//! Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
pub const PARSER_VERSION: &str = "code-kotlin-v1";
/// Kotlin AST extractor. Per-unit blocks via tree-sitter-kotlin-ng 1.1
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct KotlinAstExtractor;
impl KotlinAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for KotlinAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for KotlinAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "kotlin")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for KotlinAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
anyhow::anyhow!("kebab-parse-code: Kotlin source is not valid UTF-8: {e}")
})?;
let blocks = build_blocks(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
// Resolve the file's absolute path for repo detection. If the
// source URI carries a relative path, anchor it at the workspace
// root so the `.git/` walk-up starts from the right place.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("kotlin".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted Kotlin doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
/// p10-1C-JK: extract `package` declaration text from a tree-sitter-kotlin
/// `source_file`. Returns `None` if no `package_header` (default-package
/// Kotlin file). The package_header's single named child is a
/// `qualified_identifier`; its slice text is the dotted path. A bare
/// `identifier` child is also accepted as a defensive fallback. Per design
/// §3.4 Kotlin row.
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
if child.kind() == "package_header" {
let mut c2 = child.walk();
for sub in child.named_children(&mut c2) {
let k = sub.kind();
if k == "qualified_identifier" || k == "identifier" {
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
}
}
}
}
None
}
/// Walk preceding `line_comment` / `block_comment` siblings to extend
/// the unit's line range upward, folding leading KDoc / line comments
/// into the unit. Modifiers / annotations live INSIDE the declaration
/// node itself, so their lines are already covered by the node's own
/// range, which begins at `n.start_position()`.
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
let k = p.kind();
if k == "line_comment" || k == "block_comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
n.child_by_field_name("name")
.map(|c| &src[c.start_byte()..c.end_byte()])
}
/// Find the first child of a node with one of the given kinds. Used to
/// locate `class_body` / `enum_class_body` on `class_declaration` since
/// the kotlin grammar attaches them without a `body` field name.
fn first_child_of_kinds<'a>(
n: &tree_sitter::Node<'a>,
kinds: &[&str],
) -> Option<tree_sitter::Node<'a>> {
let mut cur = n.walk();
n.named_children(&mut cur)
.find(|child| kinds.contains(&child.kind()))
}
/// `true` iff a `class_declaration` carries the `enum` class modifier.
/// Detected by walking `modifiers` → `class_modifier` and checking the
/// child text. The grammar exposes "enum" / "sealed" / "data" /
/// "annotation" / "inner" as named `class_modifier` children of
/// `modifiers`. We only need to know about "enum" to decide whether to
/// look for `class_body` or `enum_class_body` and whether to skip body
/// recursion.
fn class_decl_is_enum(n: &tree_sitter::Node, src: &str) -> bool {
let mut cur = n.walk();
for child in n.named_children(&mut cur) {
if child.kind() == "modifiers" {
let mut c2 = child.walk();
for sub in child.named_children(&mut c2) {
if sub.kind() == "class_modifier" {
let text = &src[sub.start_byte()..sub.end_byte()];
if text == "enum" {
return true;
}
}
}
}
}
false
}
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_kotlin_ng::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-kotlin-ng language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Kotlin source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue groups are pushed with a sentinel symbol + is_real=false so a
// post-pass can decide `<module>` vs `<top-level>` (JVM family pattern).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
// (is_import 0/1, s, e). `is_import` flags `import` — used by the
// glue flush to pick `<module>` vs `<top-level>` provisional label.
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
walk_top(root, source, &mod_prefix, &mut units, &mut glue);
// `<module>` is correct only when the file produced no real unit.
// Otherwise the import-only group becomes `<top-level>` (same
// post-pass as 1B / 1C-Go / Java).
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if has_real_unit {
for (sym, _, _, is_real) in units.iter_mut() {
if !*is_real && sym.ends_with("<module>") {
let pre = &sym[..sym.len() - "<module>".len()];
*sym = format!("{pre}<top-level>");
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("kotlin".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("kotlin".to_string()),
code,
}));
}
Ok(blocks)
}
/// Walk the file's top-level children (`source_file` named children):
/// `package_header` (handled by `extract_package`), `import` (glue),
/// `class_declaration` (class / interface / enum class), `object_declaration`,
/// `function_declaration` (top-level). `property_declaration` (top-level)
/// and `type_alias` fall through the match and are currently skipped.
/// Class / object bodies are recursed via [`walk_body`] with the type name
/// pushed onto `mod_path` (JVM family pattern). Enum bodies are NOT
/// recursed (first cut).
fn walk_top(
node: tree_sitter::Node,
src: &str,
mod_prefix: &str,
units: &mut Vec<(String, u32, u32, bool)>,
glue: &mut Vec<(usize, u32, u32)>,
) {
let mod_path: &[String] = &[];
let mut cur = node.walk();
for child in node.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"class_declaration" => {
// Covers class / data class / sealed class / interface /
// enum class — single grammar node, the modifiers child
// distinguishes them. The body is `class_body` for
// non-enum and `enum_class_body` for enum class; both
// attach without a `body` field name.
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
let is_enum = class_decl_is_enum(&child, src);
if !is_enum {
if let Some(body) = first_child_of_kinds(&child, &["class_body"]) {
let np: Vec<String> = vec![name.to_string()];
walk_body(body, src, mod_prefix, &np, units);
}
}
// enum_class_body NOT recursed: enum constants are
// not emitted as units (phase-1 scope, matches Java).
}
}
"object_declaration" => {
// Singleton object — name field is required by the grammar.
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if let Some(body) = first_child_of_kinds(&child, &["class_body"]) {
let np: Vec<String> = vec![name.to_string()];
walk_body(body, src, mod_prefix, &np, units);
}
}
}
"function_declaration" => {
// Top-level Kotlin function (unlike Java).
if let Some(name) = node_name_text(&child, src) {
glue.retain(|(_, gs, _)| *gs < s);
flush_glue(glue, units, mod_prefix, mod_path);
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
}
}
"import" => {
glue.push((1, s, e));
}
// `property_declaration` (top-level val/var) and `type_alias`
// are not emitted as standalone units in the first cut. This
// arm pushes nothing onto `glue`, so they are not folded into
// the import group either. `package_header` is handled by
// `extract_package` (structural metadata, not a unit).
_ => {}
}
}
flush_glue(glue, units, mod_prefix, mod_path);
}
/// Walk a `class_body` (or object's `class_body`). Emits one unit per
/// method / secondary constructor and recurses into nested type
/// declarations + companion objects. Property declarations are NOT
/// emitted (would explode unit count, parallel to Java field policy).
///
/// `companion_object` carries an optional `name` field — when omitted
/// (the common case `companion object { ... }`) the implicit Kotlin
/// convention name `Companion` is used.
///
/// No `glue` parameter: Kotlin imports are file-level only.
fn walk_body(
body: tree_sitter::Node,
src: &str,
mod_prefix: &str,
mod_path: &[String],
units: &mut Vec<(String, u32, u32, bool)>,
) {
let mut cur = body.walk();
for child in body.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"function_declaration" => {
if let Some(name) = node_name_text(&child, src) {
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
}
}
"secondary_constructor" => {
// Kotlin secondary constructor — no `name` field on the
// grammar node. Per design §3.4 (Java JVM convention) the
// symbol uses the enclosing class name as the trailing
// segment (matches the Java `<pkg>.<...>.<Class>.<Class>`
// duplication for constructors).
if let Some(class_name) = mod_path.last() {
let sym = join_symbol(mod_prefix, mod_path, class_name);
units.push((sym, s, e, true));
}
}
"companion_object" => {
// Companion's name field is OPTIONAL — fall back to the
// Kotlin implicit name `Companion`.
let name: &str = node_name_text(&child, src).unwrap_or("Companion");
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
let mut np = mod_path.to_vec();
np.push(name.to_string());
walk_body(inner_body, src, mod_prefix, &np, units);
}
}
"class_declaration" => {
let name = match node_name_text(&child, src) {
Some(n) => n,
None => continue,
};
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
let is_enum = class_decl_is_enum(&child, src);
if !is_enum {
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
let mut np = mod_path.to_vec();
np.push(name.to_string());
walk_body(inner_body, src, mod_prefix, &np, units);
}
}
}
"object_declaration" => {
let name = match node_name_text(&child, src) {
Some(n) => n,
None => continue,
};
let sym = join_symbol(mod_prefix, mod_path, name);
units.push((sym, s, e, true));
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
let mut np = mod_path.to_vec();
np.push(name.to_string());
walk_body(inner_body, src, mod_prefix, &np, units);
}
}
// property_declaration, anonymous_initializer: NOT emitted.
_ => {}
}
}
}
fn flush_glue(
glue: &mut Vec<(usize, u32, u32)>,
units: &mut Vec<(String, u32, u32, bool)>,
mod_prefix: &str,
mod_path: &[String],
) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
// Provisional label: `<module>` only if the group is exclusively
// imports. The post-pass demotes any `<module>` to `<top-level>` if
// the file produced any real unit.
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
let label = if only_imports { "<module>" } else { "<top-level>" };
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
glue.clear();
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(concat!(
env!("CARGO_MANIFEST_DIR"),
"/tests/fixtures/sample.kt"
))
.unwrap();
let asset =
crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.kt", "kotlin");
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
KotlinAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_kotlin() {
let e = KotlinAstExtractor::new();
assert!(e.supports(&MediaType::Code("kotlin".into())));
assert!(!e.supports(&MediaType::Code("java".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn kotlin_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("kotlin"));
symbol.clone()
}
_ => None,
},
_ => None,
})
.collect();
syms.sort();
// package extracted from source = com.kebab.chunk
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"),
"got {syms:?}"
);
// Implicit companion object name = Companion (grammar leaves the
// name field unset; the extractor fills it in).
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Companion"),
"got {syms:?}"
);
assert!(
syms.iter()
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Companion.withName"),
"got {syms:?}"
);
// interface — also via class_declaration in the grammar
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Stringer"),
"got {syms:?}"
);
// enum class — also via class_declaration; body NOT recursed
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Mode"),
"got {syms:?}"
);
// Kotlin top-level fn — unlike Java
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.freeFunction"),
"got {syms:?}"
);
// Singleton object + its method
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Singleton"),
"got {syms:?}"
);
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.Singleton.ping"),
"got {syms:?}"
);
// import grouped as <top-level>
assert!(
syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"),
"got {syms:?}"
);
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 {
assert_eq!(extract_fixture().blocks, a.blocks);
}
}
}
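
The `<module>` / `<top-level>` post-pass shared by the JVM extractors can be sketched in isolation. This is a simplified restatement of the loop in `build_blocks`, not a replacement for it; `Unit` and `demote_module_labels` are names introduced here for illustration only.

```rust
/// (symbol, line_start, line_end, is_real_semantic_unit), same tuple shape
/// as the extractor's `units` vector.
type Unit = (String, u32, u32, bool);

/// Rewrites a provisional `<module>` glue label to `<top-level>` when the
/// file produced at least one real unit. An import-only file keeps
/// `<module>`.
fn demote_module_labels(units: &mut [Unit]) {
    let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
    if !has_real_unit {
        return;
    }
    for (sym, _, _, is_real) in units.iter_mut() {
        if *is_real {
            continue;
        }
        // Build the new label first, then assign, so the borrow of `sym`
        // taken by `strip_suffix` has ended before the write.
        let renamed = sym
            .strip_suffix("<module>")
            .map(|prefix| format!("{prefix}<top-level>"));
        if let Some(new_sym) = renamed {
            *sym = new_sym;
        }
    }
}
```

Running the pass after the whole file is walked keeps `flush_glue` local: it only needs to know whether its own group is import-only, not what the rest of the file contains.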

@@ -10,33 +10,57 @@ use std::path::Path;
/// `None` if the extension / filename is not recognized.
///
/// Matching priority:
-/// 1. exact filename match (e.g. `Dockerfile`, `Makefile`)
-/// 2. lowercase extension match
+/// 1. Tier 1 basename exact match (e.g. `Dockerfile`, `Makefile`)
+/// 2. Tier 2 basename match (e.g. `Cargo.toml`, `package.json`, `build.gradle`)
+/// 3. Tier 2 `Dockerfile.*` prefix variant
+/// 4. Tier 1 + Tier 2 extension fallback (lowercase)
pub fn code_lang_for_path(path: &Path) -> Option<&'static str> {
if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
// Tier 1 basename exact match
match name {
"Dockerfile" => return Some("dockerfile"),
"Makefile" | "GNUmakefile" => return Some("make"),
_ => {}
}
// Tier 2 basename match (configuration / manifest files)
match name {
"Cargo.toml" | "pyproject.toml" => return Some("toml"),
"package.json" | "tsconfig.json" => return Some("json"),
"go.mod" => return Some("go-mod"),
"pom.xml" => return Some("xml"),
"build.gradle" => return Some("groovy"),
_ => {}
}
// Tier 2: `Dockerfile.*` prefix variant (e.g. `Dockerfile.dev`, `Dockerfile.prod`)
if name.starts_with("Dockerfile.") && name.len() > "Dockerfile.".len() {
return Some("dockerfile");
}
}
// Extension fallback (Tier 1 + Tier 2)
let ext = path.extension()?.to_str()?.to_ascii_lowercase();
match ext.as_str() {
// Tier 1 extensions
"rs" => Some("rust"),
"py" | "pyi" => Some("python"),
-"ts" | "tsx" => Some("typescript"),
+"ts" | "tsx" | "mts" | "cts" => Some("typescript"),
"js" | "mjs" | "cjs" | "jsx" => Some("javascript"),
"go" => Some("go"),
"java" => Some("java"),
"kt" | "kts" => Some("kotlin"),
"c" | "h" => Some("c"),
"cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"),
"sh" | "bash" | "zsh" => Some("shell"),
"mk" => Some("make"),
// Tier 2 extensions
"yaml" | "yml" => Some("yaml"),
"toml" => Some("toml"),
"json" => Some("json"),
-"sh" | "bash" | "zsh" => Some("shell"),
-"mk" => Some("make"),
"xml" => Some("xml"),
"dockerfile" => Some("dockerfile"),
"gradle" => Some("groovy"),
_ => None,
}
}
@@ -82,7 +106,7 @@ pub fn module_path_for_python(workspace_path: &str) -> String {
/// (no slash replacement, no source-root strip). See plan §Task C.
pub fn module_path_for_tsjs(workspace_path: &str) -> String {
let p = workspace_path;
-for ext in [".tsx", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
+for ext in [".tsx", ".mts", ".cts", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
if let Some(stripped) = p.strip_suffix(ext) {
return stripped.to_string();
}
@@ -110,7 +134,7 @@ mod tests {
#[test]
fn module_path_for_tsjs_keeps_slashes_and_strips_ext() {
-for ext in ["ts", "tsx", "js", "jsx", "mjs", "cjs"] {
+for ext in ["ts", "tsx", "mts", "cts", "js", "jsx", "mjs", "cjs"] {
let p = format!("src/search/retriever/Retriever.{ext}");
assert_eq!(module_path_for_tsjs(&p), "src/search/retriever/Retriever");
}
@@ -118,4 +142,28 @@ mod tests {
assert_eq!(module_path_for_tsjs("a/b/c.ts"), "a/b/c");
assert_eq!(module_path_for_tsjs("packages/x/src/Foo.ts"), "packages/x/src/Foo");
}
#[test]
fn tier2_basename_takes_precedence_over_extension() {
assert_eq!(code_lang_for_path(Path::new("Dockerfile")), Some("dockerfile"));
assert_eq!(code_lang_for_path(Path::new("foo/Dockerfile.dev")), Some("dockerfile"));
assert_eq!(code_lang_for_path(Path::new("myapp.dockerfile")), Some("dockerfile"));
assert_eq!(code_lang_for_path(Path::new("repo/Cargo.toml")), Some("toml"));
assert_eq!(code_lang_for_path(Path::new("pyproject.toml")), Some("toml"));
assert_eq!(code_lang_for_path(Path::new("repo/package.json")), Some("json"));
assert_eq!(code_lang_for_path(Path::new("tsconfig.json")), Some("json"));
assert_eq!(code_lang_for_path(Path::new("go.mod")), Some("go-mod"));
assert_eq!(code_lang_for_path(Path::new("pom.xml")), Some("xml"));
assert_eq!(code_lang_for_path(Path::new("build.gradle")), Some("groovy"));
}
#[test]
fn tier2_extension_fallback() {
assert_eq!(code_lang_for_path(Path::new("k8s/deploy.yaml")), Some("yaml"));
assert_eq!(code_lang_for_path(Path::new("k8s/deploy.yml")), Some("yaml"));
assert_eq!(code_lang_for_path(Path::new("foo/bar.toml")), Some("toml"));
assert_eq!(code_lang_for_path(Path::new("foo/bar.json")), Some("json"));
assert_eq!(code_lang_for_path(Path::new("foo/bar.xml")), Some("xml"));
assert_eq!(code_lang_for_path(Path::new("foo/bar.gradle")), Some("groovy"));
}
}
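
One detail behind the `module_path_for_tsjs` change: each candidate suffix includes its leading dot, so `.ts` never matches a path ending in `.mts` or `.cts`, and the new extensions have to be listed explicitly. The sketch below shows this with a hypothetical helper, `strip_known_ext`. Returning the path unchanged when nothing matches is an assumption, since the tail of the real function lies outside this hunk.

```rust
/// Hypothetical helper: strips the first suffix in `exts` that `path` ends
/// with. Returning `path` unchanged on no match is an assumption.
fn strip_known_ext(path: &str, exts: &[&str]) -> String {
    for &ext in exts {
        if let Some(stripped) = path.strip_suffix(ext) {
            return stripped.to_string();
        }
    }
    path.to_string()
}

/// Suffix list before the change (copied from the diff).
const OLD_EXTS: [&str; 6] = [".tsx", ".ts", ".jsx", ".mjs", ".cjs", ".js"];

/// Suffix list after the change (copied from the diff).
const NEW_EXTS: [&str; 8] = [".tsx", ".mts", ".cts", ".ts", ".jsx", ".mjs", ".cjs", ".js"];
```

With `OLD_EXTS`, a path such as `src/Retriever.mts` falls through every candidate and keeps its extension; with `NEW_EXTS` it is reduced to `src/Retriever`.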

@@ -13,7 +13,12 @@
//! `kebab-parse-*` crates per design §8: must NOT depend on store / embed
//! / llm / rag.
pub mod c;
pub mod cpp;
pub mod go;
pub mod java;
pub mod javascript;
pub mod kotlin;
pub mod lang;
pub mod python;
pub mod repo;
@@ -22,7 +27,12 @@ pub(crate) mod scaffold;
pub mod skip;
pub mod typescript;
pub use c::{PARSER_VERSION as C_PARSER_VERSION, CAstExtractor};
pub use cpp::{PARSER_VERSION as CPP_PARSER_VERSION, CppAstExtractor};
pub use go::{PARSER_VERSION as GO_PARSER_VERSION, GoAstExtractor};
pub use java::{PARSER_VERSION as JAVA_PARSER_VERSION, JavaAstExtractor};
pub use javascript::{PARSER_VERSION as JS_PARSER_VERSION, JavascriptAstExtractor};
pub use kotlin::{PARSER_VERSION as KOTLIN_PARSER_VERSION, KotlinAstExtractor};
pub use lang::{code_lang_for_path, module_path_for_python, module_path_for_tsjs};
pub use python::{PARSER_VERSION as PYTHON_PARSER_VERSION, PythonAstExtractor};
pub use repo::{RepoMeta, detect_repo};

@@ -173,8 +173,9 @@ impl Extractor for TypescriptAstExtractor {
}
/// Select the tree-sitter grammar based on the workspace path's
-/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.d.ts`,
-/// missing extension) → TypeScript grammar.
+/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.mts`,
+/// `.cts`, `.d.ts`, missing extension) → TypeScript grammar (the JSX-
+/// agnostic variants all share one grammar in tree-sitter-typescript 0.23).
fn select_grammar(workspace_path: &str) -> tree_sitter::Language {
if workspace_path.ends_with(".tsx") {
tree_sitter_typescript::LANGUAGE_TSX.into()
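
The selection rule described in the doc comment reduces to a single suffix check. A dependency-free sketch follows, using a hypothetical `Grammar` enum in place of `tree_sitter::Language`; `select_grammar_sketch` is likewise an illustrative name, not the function in this hunk.

```rust
/// Hypothetical stand-in for the two tree-sitter-typescript grammars.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Grammar {
    TypeScript,
    Tsx,
}

/// `.tsx` selects the TSX grammar; every other path (including `.mts`,
/// `.cts`, `.d.ts`, or no extension at all) selects plain TypeScript.
fn select_grammar_sketch(workspace_path: &str) -> Grammar {
    if workspace_path.ends_with(".tsx") {
        Grammar::Tsx
    } else {
        Grammar::TypeScript
    }
}
```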

@@ -0,0 +1,34 @@
// sample.go
package chunk
import (
"fmt"
"strings"
)
const Version = "v1"
type MdHeadingV1Chunker struct {
Name string
}
// ChunkDoc returns a stub list of strings.
func (m *MdHeadingV1Chunker) ChunkDoc(input string) []string {
return []string{m.Name}
}
func (m MdHeadingV1Chunker) Name2() string {
return m.Name
}
type Stringer interface {
String() string
}
func Free(x int) int {
return x + 1
}
func init() {
fmt.Println(strings.ToUpper("init"))
}

@@ -0,0 +1,36 @@
// sample.java
package com.kebab.chunk;
import java.util.List;
import java.util.stream.Collectors;
/**
* Heading-aware Markdown chunker.
*/
public class MdHeadingV1Chunker {
private final String name;
public MdHeadingV1Chunker(String name) {
this.name = name;
}
public List<String> chunkDoc(String input) {
return List.of(name, input);
}
public String getName() {
return name;
}
public static class Builder {
private String name;
public Builder withName(String n) { this.name = n; return this; }
public MdHeadingV1Chunker build() { return new MdHeadingV1Chunker(name); }
}
}
interface Stringer {
String asString();
}
enum Mode { DEFAULT, FAST }

@@ -0,0 +1,29 @@
// sample.kt
package com.kebab.chunk
import java.util.List
/**
* Heading-aware Markdown chunker.
*/
class MdHeadingV1Chunker(val name: String) {
fun chunkDoc(input: String): List<String> = listOf(name, input)
fun getName(): String = name
companion object {
fun withName(n: String): MdHeadingV1Chunker = MdHeadingV1Chunker(n)
}
}
interface Stringer {
fun asString(): String
}
enum class Mode { DEFAULT, FAST }
fun freeFunction(x: Int): Int = x + 1
object Singleton {
fun ping(): String = "pong"
}
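
In the fixture above, the KDoc block sits directly before `class MdHeadingV1Chunker`; the extractor's `unit_start` folds such leading comments into the unit's line range. The walk can be sketched without tree-sitter by modelling siblings as a slice. `Sibling` and `unit_start_sketch` are hypothetical, and any row numbers used with them are made up for illustration.

```rust
/// Hypothetical model of a syntax-tree sibling: its node kind and its
/// zero-based start row.
struct Sibling {
    kind: &'static str,
    start_row: u32,
}

/// Returns the 1-based start line of `siblings[idx]`, extended upward over
/// any run of comment siblings that immediately precede it.
fn unit_start_sketch(siblings: &[Sibling], idx: usize) -> u32 {
    let mut start = siblings[idx].start_row + 1;
    // Walk backwards from the node; stop at the first non-comment sibling.
    for prev in siblings[..idx].iter().rev() {
        if prev.kind == "line_comment" || prev.kind == "block_comment" {
            start = prev.start_row + 1;
        } else {
            break;
        }
    }
    start
}
```

Like the loop it mirrors, the sketch does not check for blank lines between a comment and the declaration; any directly preceding comment sibling is folded in.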

@@ -9,6 +9,8 @@ fn known_extensions_map_to_canonical_identifiers() {
("foo.pyi", Some("python")),
("foo.ts", Some("typescript")),
("foo.tsx", Some("typescript")),
("foo.mts", Some("typescript")), // ESM TS — same grammar
("foo.cts", Some("typescript")), // CommonJS TS — same grammar
("foo.js", Some("javascript")),
("foo.mjs", Some("javascript")),
("foo.cjs", Some("javascript")),

@@ -12,6 +12,12 @@ use kebab_core::{AudioType, ImageType, MediaType};
/// `MediaType::Image(_)` / `MediaType::Audio(_)`. Anything else (including
/// missing extension) → `MediaType::Other(ext)`.
pub(crate) fn media_type_for(path: &Path) -> MediaType {
// p10-2: code_lang_for_path is the single source of truth for code lang
// (design §3.5). Delegate before falling back to extension branches.
if let Some(lang) = kebab_parse_code::code_lang_for_path(path) {
return MediaType::Code(lang.to_string());
}
let ext = path
.extension()
.and_then(|s| s.to_str())
@@ -19,7 +25,9 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
.unwrap_or_default();
match ext.as_str() {
-"md" => MediaType::Markdown,
+// Markdown + MDX (markdown + JSX, treated as plain markdown; the
+// JSX islands are folded into raw passthrough by the md parser).
+"md" | "mdx" => MediaType::Markdown,
"pdf" => MediaType::Pdf,
"png" => MediaType::Image(ImageType::Png),
@@ -34,15 +42,6 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
"flac" => MediaType::Audio(AudioType::Flac),
"ogg" => MediaType::Audio(AudioType::Ogg),
-// p10-1A-2: Rust is the only code lang activated in 1A. Other
-// recognized code langs stay Other until their phase (1B+).
-"rs" => MediaType::Code("rust".to_string()),
-// p10-1B: Python / TS / JS AST chunkers active.
-"py" | "pyi" => MediaType::Code("python".into()),
-"ts" | "tsx" => MediaType::Code("typescript".into()),
-"js" | "mjs" | "cjs" | "jsx" => MediaType::Code("javascript".into()),
// Empty string (no extension) and any other extension: bucket as
// Other and let downstream extractors decide if they support it.
_ => MediaType::Other(ext),
@@ -86,7 +85,8 @@ mod tests {
media_type_for(Path::new("crates/kebab-core/src/lib.rs")),
MediaType::Code("rust".to_string())
);
-assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Other("toml".to_string()));
+// Cargo.toml is a Tier 2 code manifest (p10-2), handled by code_lang_for_path
+assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Code("toml".to_string()));
}
#[test]
@@ -102,6 +102,32 @@ mod tests {
assert_eq!(media_type_for(Path::new("a/b.rs")), MediaType::Code("rust".into()));
}
#[test]
fn ts_variants_mts_cts() {
// .mts / .cts are TypeScript ESM / CommonJS — same grammar as .ts.
assert_eq!(media_type_for(Path::new("a/b.mts")), MediaType::Code("typescript".into()));
assert_eq!(media_type_for(Path::new("a/b.cts")), MediaType::Code("typescript".into()));
}
#[test]
fn mdx_routes_to_markdown() {
// MDX is markdown with JSX islands; the md parser folds the JSX
// through as raw passthrough.
assert_eq!(media_type_for(Path::new("docs/page.mdx")), MediaType::Markdown);
}
#[test]
fn go_files_map_to_media_code_go() {
assert_eq!(media_type_for(Path::new("a/b.go")), MediaType::Code("go".into()));
}
#[test]
fn java_kotlin_files_map_to_media_code() {
assert_eq!(media_type_for(Path::new("a/b.java")), MediaType::Code("java".into()));
assert_eq!(media_type_for(Path::new("a/b.kt")), MediaType::Code("kotlin".into()));
assert_eq!(media_type_for(Path::new("a/b.kts")), MediaType::Code("kotlin".into()));
}
#[test]
fn unknown_and_missing_extension() {
assert_eq!(
@@ -113,4 +139,14 @@ mod tests {
MediaType::Other(String::new())
);
}
#[test]
fn tier2_files_map_to_media_code() {
assert_eq!(media_type_for(Path::new("a/deploy.yaml")), MediaType::Code("yaml".into()));
assert_eq!(media_type_for(Path::new("a/Dockerfile")), MediaType::Code("dockerfile".into()));
assert_eq!(media_type_for(Path::new("a/Cargo.toml")), MediaType::Code("toml".into()));
assert_eq!(media_type_for(Path::new("a/pom.xml")), MediaType::Code("xml".into()));
assert_eq!(media_type_for(Path::new("a/build.gradle")), MediaType::Code("groovy".into()));
assert_eq!(media_type_for(Path::new("a/go.mod")), MediaType::Code("go-mod".into()));
}
}


@@ -56,5 +56,6 @@
"skipped_kebabignore": 0,
"skipped_size_exceeded": 0,
"unchanged": 0,
"purged_deleted_files": 0,
"updated": 1
}


@@ -264,6 +264,28 @@ impl kebab_core::DocumentStore for SqliteStore {
}))
}
fn get_asset(
&self,
id: &kebab_core::AssetId,
) -> Result<Option<kebab_core::RawAsset>> {
let conn = self.lock_conn();
let result = conn.query_row(
r#"SELECT
asset_id, source_uri, workspace_path, media_type,
byte_len, checksum, storage_kind, storage_path,
discovered_at
FROM assets
WHERE asset_id = ?"#,
rusqlite::params![id.0.as_str()],
asset_from_row,
);
match result {
Ok(asset) => Ok(Some(asset)),
Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None),
Err(e) => Err(e.into()),
}
}
fn get_asset_by_workspace_path(
&self,
path: &kebab_core::WorkspacePath,
@@ -286,6 +308,88 @@ impl kebab_core::DocumentStore for SqliteStore {
}
}
fn get_document_by_workspace_path(
&self,
path: &kebab_core::WorkspacePath,
) -> Result<Option<kebab_core::CanonicalDocument>> {
let conn = self.lock_conn();
let row: Option<DocumentRow> = conn
.query_row(
"SELECT
doc_id, asset_id, workspace_path, title, lang,
source_type, trust_level, parser_version,
doc_version, schema_version, metadata_json,
provenance_json, created_at, updated_at,
last_chunker_version, last_embedding_version
FROM documents WHERE workspace_path = ?",
params![path.0],
document_row_from_sql,
)
.map(Some)
.or_else(rows_optional)
.map_err(StoreError::from)?;
let Some(row) = row else { return Ok(None) };
let doc_id = kebab_core::DocumentId(row.doc_id.clone());
let mut blocks_stmt = conn
.prepare(
"SELECT payload_json FROM blocks
WHERE doc_id = ? ORDER BY ordinal ASC",
)
.map_err(StoreError::from)?;
let block_rows = blocks_stmt
.query_map(params![row.doc_id], |r| {
let payload_json: String = r.get(0)?;
Ok(payload_json)
})
.map_err(StoreError::from)?;
let mut blocks: Vec<kebab_core::Block> = Vec::new();
for block_row in block_rows {
let payload_json = block_row.map_err(StoreError::from)?;
let block: kebab_core::Block = serde_json::from_str(&payload_json)
.context("deserialize block payload_json")?;
blocks.push(block);
}
let metadata: kebab_core::Metadata = serde_json::from_str(&row.metadata_json)
.context("deserialize metadata_json")?;
let provenance: kebab_core::Provenance =
serde_json::from_str(&row.provenance_json)
.context("deserialize provenance_json")?;
Ok(Some(kebab_core::CanonicalDocument {
doc_id,
source_asset_id: kebab_core::AssetId(row.asset_id),
workspace_path: kebab_core::WorkspacePath(row.workspace_path),
title: row.title.unwrap_or_default(),
lang: kebab_core::Lang(row.lang.unwrap_or_default()),
blocks,
metadata,
provenance,
parser_version: kebab_core::ParserVersion(row.parser_version),
schema_version: row.schema_version as u32,
doc_version: row.doc_version as u32,
last_chunker_version: row.last_chunker_version.map(kebab_core::ChunkerVersion),
last_embedding_version: row.last_embedding_version.map(kebab_core::EmbeddingVersion),
}))
}
fn all_workspace_paths(&self) -> Result<Vec<kebab_core::WorkspacePath>> {
let conn = self.lock_conn();
let mut stmt = conn
.prepare("SELECT workspace_path FROM documents")
.map_err(StoreError::from)?;
let rows = stmt
.query_map([], |r| r.get::<_, String>(0))
.map_err(StoreError::from)?;
let mut out = Vec::new();
for row in rows {
let path = row.map_err(StoreError::from)?;
out.push(kebab_core::WorkspacePath(path));
}
Ok(out)
}
fn list_documents(
&self,
filter: &kebab_core::DocFilter,
@@ -550,7 +654,8 @@ fn rows_optional<T>(err: rusqlite::Error) -> rusqlite::Result<Option<T>> {
/// Reconstruct a [`kebab_core::RawAsset`] from one `assets` row.
/// Row mapper for `RawAsset`. Column names are self-documenting; the
/// SELECT in [`DocumentStore::get_asset_by_workspace_path`] must include
/// SELECTs in [`DocumentStore::get_asset`] and
/// [`DocumentStore::get_asset_by_workspace_path`] must both include
/// all nine columns by their schema names.
fn asset_from_row(row: &rusqlite::Row<'_>) -> rusqlite::Result<kebab_core::RawAsset> {
use std::path::PathBuf;


@@ -35,4 +35,4 @@ pub use error::StoreError;
pub use eval::{EvalQueryResultRecord, EvalRunRecord, EvalRunRow};
pub use fts::rebuild_chunks_fts;
pub use jobs::IngestRunRow;
pub use store::{CountSummary, NotIndexed, SqliteStore};
pub use store::{CountSummary, NotIndexed, SqliteStore, purge_deleted_workspace_path};


@@ -540,10 +540,132 @@ pub(crate) fn purge_orphan_at_workspace_path(
Ok(())
}
/// Purge all stored data for a document whose on-disk file has been
/// deleted (as opposed to content-changed, which is handled by
/// `purge_orphan_at_workspace_path`).
///
/// Returns the `chunk_id`s that were associated with the document so
/// the caller can issue a matching `VectorStore::delete_by_chunk_ids`
/// on the LanceDB side.
///
/// Deletion order:
/// 1. Collect chunk_ids (before cascade removes them).
/// 2. DELETE the `documents` row → CASCADE clears `blocks`, `chunks`,
/// `embedding_records`.
/// 3. DELETE the `assets` row **only if no other document still
/// references it** (twin-file protection — `assets` can be shared
/// across identical-content files via the blake3 PK).
/// 4. If the asset was `storage_kind = 'copied'`, best-effort delete
/// the on-disk byte file at `storage_path`.
///
/// Returns `Ok(vec![])` when no document exists at `workspace_path`
/// (idempotent — caller doesn't need to pre-check).
pub fn purge_deleted_workspace_path(
store: &SqliteStore,
workspace_path: &kebab_core::WorkspacePath,
) -> anyhow::Result<Vec<kebab_core::ChunkId>> {
let conn = store.lock_conn();
// Look up the document + its asset_id.
let doc_row: Option<(String, String)> = conn
.query_row(
"SELECT doc_id, asset_id FROM documents WHERE workspace_path = ?",
rusqlite::params![workspace_path.0],
|r| Ok((r.get(0)?, r.get(1)?)),
)
.optional()
.map_err(StoreError::from)?;
let Some((doc_id, asset_id)) = doc_row else {
return Ok(Vec::new());
};
// 1. Collect chunk_ids before CASCADE removes them.
let mut stmt = conn
.prepare("SELECT chunk_id FROM chunks WHERE doc_id = ?")
.map_err(StoreError::from)?;
let rows = stmt
.query_map(rusqlite::params![doc_id], |r| r.get::<_, String>(0))
.map_err(StoreError::from)?;
let chunk_ids: Vec<kebab_core::ChunkId> = rows
.map(|r| r.map(kebab_core::ChunkId))
.collect::<rusqlite::Result<Vec<_>>>()
.map_err(StoreError::from)?;
drop(stmt);
// 2. DELETE the document row (CASCADE clears blocks / chunks /
// embedding_records via the FK constraints in V001).
conn.execute(
"DELETE FROM documents WHERE doc_id = ?",
rusqlite::params![doc_id],
)
.map_err(StoreError::from)?;
// 3. Delete the asset row only when no other document still
// references it (twin-file safety: two files with identical
// bytes share a single asset row via the blake3 PK).
let remaining_refs: i64 = conn
.query_row(
"SELECT COUNT(*) FROM documents WHERE asset_id = ?",
rusqlite::params![asset_id],
|r| r.get(0),
)
.map_err(StoreError::from)?;
if remaining_refs == 0 {
// 4. Capture storage details before deleting the row.
let asset_storage: Option<(String, String)> = conn
.query_row(
"SELECT storage_kind, storage_path FROM assets WHERE asset_id = ?",
rusqlite::params![asset_id],
|r| Ok((r.get(0)?, r.get(1)?)),
)
.optional()
.map_err(StoreError::from)?;
conn.execute(
"DELETE FROM assets WHERE asset_id = ?",
rusqlite::params![asset_id],
)
.map_err(StoreError::from)?;
// 5. Best-effort: remove the on-disk copied asset file.
if let Some((storage_kind, storage_path)) = asset_storage {
if storage_kind == "copied" {
let _ = std::fs::remove_file(&storage_path);
}
}
}
tracing::debug!(
target: "kebab-store-sqlite",
workspace_path = %workspace_path.0,
doc_id = %doc_id,
chunk_count = chunk_ids.len(),
"purged deleted-file document from store"
);
Ok(chunk_ids)
}
/// UPSERT a row into `assets`. Used by both the `put_asset_with_bytes`
/// path (which has bytes + computed `storage_kind/path`) and the
/// `DocumentStore::put_asset` path (which only has the `RawAsset` and
/// reads `storage_kind/path` from `asset.stored`).
///
/// **`assets.workspace_path` is "last-registered path" semantics for
/// twin files** (two source files with identical content share one
/// `assets` row keyed on `asset_id = blake3(content)`). Each ingest
/// of either twin overwrites `workspace_path` with whichever path was
/// seen most recently — this is intentional and correct after PR #146
/// made `try_skip_unchanged` document-centric (uses
/// `get_document_by_workspace_path`, not `get_asset_by_workspace_path`)
/// and PR #149 made `reset --orphans-only` document-centric too.
/// Do NOT "fix" the flip-flop by adding a UNIQUE constraint on
/// `workspace_path` in the `assets` table — twin de-dup is load-bearing.
/// When you need media_type for a known document, use the 2-step lookup
/// `get_document_by_workspace_path` → `doc.source_asset_id` →
/// `get_asset(asset_id)` so the result is twin-safe.
pub(crate) fn upsert_asset_row(
conn: &Connection,
asset: &kebab_core::RawAsset,


@@ -41,6 +41,7 @@ fn fixture_report() -> IngestReport {
skipped_generated: 0,
skipped_size_exceeded: 0,
skip_examples: kebab_core::SkipExamples::default(),
purged_deleted_files: 0,
items: Some(vec![
IngestItem {
kind: IngestItemKind::New,


@@ -22,7 +22,7 @@ Cargo workspace, a modular monolith built on function calls. UI binary (`kebab-
| OCR | Ollama vision LM (default `gemma4:e4b`); the `OcrEngine` trait allows a future swap to Tesseract / Apple Vision, etc. (HOTFIXES P6-2) |
| Image caption | Ollama vision LM, runtime gate `image.caption.enabled` (default OFF) |
| PDF parser | `lopdf` per-page text; `chunker_version = "pdf-page-v1"` is hardcoded for PDF assets (HOTFIXES P7-3) |
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript`; **parser-side** (`kebab-parse-code`), not chunker-side (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`. `ast_chunk_max_lines = 200` is a fixed constant (HOTFIXES 2026-05-19: the Chunker trait does not expose per-medium config). |
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` / `tree-sitter-go` / `tree-sitter-java` / `tree-sitter-kotlin-ng`; **parser-side** (`kebab-parse-code`), not chunker-side (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`, Go = `code-go-ast-v1`, Java = `code-java-ast-v1`, Kotlin = `code-kotlin-ast-v1`. `ast_chunk_max_lines = 200` is a fixed constant (HOTFIXES 2026-05-19: the Chunker trait does not expose per-medium config). The Kotlin grammar uses `tree-sitter-kotlin-ng`; the bare `tree-sitter-kotlin` crate is pinned to tree-sitter 0.210.23 and cannot be used. **Tier 2 (p10-2)**: YAML/k8s → `serde_yaml` + `k8s-manifest-resource-v1` (apiVersion+kind per resource), Dockerfile → `dockerfile-file-v1` (whole-file), Cargo.toml/go.mod/.json/.xml/.groovy → `manifest-file-v1` (whole-file). Tier 2 chunkers live in `kebab-chunk`; no tree-sitter grammar needed (structure from file type, not AST). **Tier 3 (p10-3)**: shell scripts (`.sh`/`.bash`/`.zsh`) direct → `code-text-paragraph-v1` (blank-line paragraph segmentation + 80-line / 20-overlap line-window for oversize). Same chunker also serves as fallback when Tier 1/2 emit 0 chunks or Err; non-k8s YAML / invalid YAML / AST extractor failures are all picked up. symbol = None; lang preserved from input doc. **Tier 1 family complete (p10-1D)**: C (`tree-sitter-c`, `code-c-ast-v1`, `.c`/`.h`) + C++ (`tree-sitter-cpp`, `code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`). C symbol = function name only; C++ symbol = `namespace::Class::method` (recursive nesting). When a `.h` file contains C++ syntax, the tree-sitter-c parse fails → Tier 3 fallback. |
| 1B symbol path | workspace path → module path: Python = dotted prefix (`kebab_eval.metrics.compute_mrr`), TypeScript/JavaScript = slash-style prefix (`src/Foo.Foo.search`). Rust 1A-2 uses file-scope nesting only (no workspace prefix; the inconsistency is accepted, see HOTFIXES 2026-05-20). |
| TUI | Ratatui + crossterm: P9-1 Library panel; P9-2/3/4 planned |
| Desktop | Tauri 2 + `pdfjs-dist` (native PDF render backend prohibited), P9-5 |
@@ -52,7 +52,7 @@ flowchart TB
ppdf["kebab-parse-pdf"]
pimg["kebab-parse-image"]
paud["kebab-parse-audio<br/>(P8 on hold)"]
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B)"]
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B + P10-1C-Go + P10-1C-JK + P10-2 + P10-3 + P10-1D)"]
ptypes["kebab-parse-types"]
norm["kebab-normalize"]
chunk["kebab-chunk"]
@@ -127,7 +127,7 @@ flowchart TB
The UI must not depend directly on store/llm/parse. Every user-facing entry point goes through the `kebab-app` facade only (frozen design §8). For `kebab-cli` to honor the `--config <path>` flag, the pattern is to thread the Config explicitly through the `kebab_app::*_with_config(cfg, …)` companions; for the detailed rationale, see the `--config` entry in [tasks/HOTFIXES.md](../tasks/HOTFIXES.md).
External tree-sitter grammar crate dependencies of `kebab-parse-code`: `tree-sitter-rust` was added in P10-1A-2, and `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` in P10-1B. All are isolated to `kebab-parse-code` (facade rule: UI crates and chunkers must not import them directly).
External tree-sitter grammar crate dependencies of `kebab-parse-code`: `tree-sitter-rust` was added in P10-1A-2, `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` in P10-1B, `tree-sitter-go` in P10-1C-Go, `tree-sitter-java` / `tree-sitter-kotlin-ng` in P10-1C-JK, and `tree-sitter-c` / `tree-sitter-cpp` in P10-1D. All are isolated to `kebab-parse-code` (facade rule: UI crates and chunkers must not import them directly). Kotlin uses `tree-sitter-kotlin-ng` (the bare `tree-sitter-kotlin` crate is pinned to tree-sitter 0.210.23 and cannot be used).
## Directory structure
@@ -165,7 +165,16 @@ kebab/
│ ├── kebab-source-fs/ # workspace walk + checksum (P1-1)
│ ├── kebab-parse-md/ # Markdown frontmatter + blocks (P1-2/3)
│ ├── kebab-normalize/ # ParsedBlock → CanonicalDocument (P1-4)
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-rust-ast-v1 + code-python-ast-v1 + code-ts-ast-v1 + code-js-ast-v1 chunker (P1-5, P7-2, P10-1A-2, P10-1B)
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-*-ast-v1 (Tier 1) + k8s-manifest-resource-v1 + dockerfile-file-v1 + manifest-file-v1 + tier2_shared (P10-2) + code-text-paragraph-v1 (P10-3) chunker (P1-5, P7-2, P10-1A-2, P10-1B, P10-1C-Go, P10-1C-JK, P10-2, P10-3, P10-1D)
│ │ └── src/
│ │ ├── code_*_ast_v1.rs # Tier 1 AST chunkers (rust/python/ts/js/go/java/kotlin/c/cpp)
│ │ ├── code_c_ast_v1.rs # Tier 1 (p10-1D): C top-level fn / struct / enum / union
│ │ ├── code_cpp_ast_v1.rs # Tier 1 (p10-1D): C++ namespace::Class::method (recursive nesting)
│ │ ├── k8s_manifest_resource_v1.rs # Tier 2 (p10-2): YAML multi-doc, apiVersion+kind per resource
│ │ ├── dockerfile_file_v1.rs # Tier 2 (p10-2): whole-file Dockerfile
│ │ ├── manifest_file_v1.rs # Tier 2 (p10-2): whole-file Cargo.toml / go.mod / .json / .xml / .groovy
│ │ ├── code_text_paragraph_v1.rs # Tier 3 (p10-3): blank-line paragraph + 80/20 line-window fallback
│ │ └── tier2_shared.rs # Tier 2 (p10-2): shared oversize fallback + Chunk builder helpers
│ ├── kebab-store-sqlite/ # SQLite + FTS5 (V001/V002/V003) (P1-6, P2-1, P3-3)
│ ├── kebab-search/ # Lexical + Vector + Hybrid retriever (P2-2, P3-4)
│ ├── kebab-embed/ kebab-embed-local/ # Embedder trait + fastembed adapter (P3-1, P3-2)
@@ -175,7 +184,7 @@ kebab/
│ ├── kebab-eval/ # golden query runner + metrics (P5-1, P5-2)
│ ├── kebab-parse-image/ # ImageExtractor + Ollama OCR + caption (P6)
│ ├── kebab-parse-pdf/ # lopdf per-page text extractor (P7-1)
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B); chunker lives in kebab-chunk
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B), Go (P10-1C-Go), Java + Kotlin (P10-1C-JK — java.rs + kotlin.rs), C + C++ (P10-1D — c.rs + cpp.rs); chunker lives in kebab-chunk
│ ├── kebab-app/ # facade (P0 signatures + P3-5/P6-4/P7-3 implementation)
│ ├── kebab-tui/ # Ratatui shell + Library panel (P9-1)
│ ├── kebab-mcp/ # stdio MCP server — tools: schema, doctor, search, ask (P9-FB-30)


@@ -401,6 +401,201 @@ KB --json schema | jq '.stats.code_lang_breakdown'
- Expression-level functions such as `const foo = () => {...}` are captured as `<top-level>` glue (only declaration-level units are in scope for the first pass of 1B). Details: `tasks/HOTFIXES.md` (2026-05-20).
- `.gitignore` is honored; `node_modules/`, `__pycache__/`, `.venv/`, and similar paths are skipped automatically by the built-in safety net.
## P10-1C-Go Go code indexing
Uses the same isolated KB setup as P10-1B. Place a `.go` file in the workspace and run ingest; the `code-go-ast-v1` chunker processes it as a package-scoped AST.
```bash
cat > /tmp/kebab-smoke/workspace/sample_code/hello.go <<'EOF'
package main
import "fmt"
func Hello(name string) string {
return fmt.Sprintf("Hello, %s!", name)
}
EOF
KB ingest
KB search --mode hybrid "Hello" --code-lang go --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# Expected: symbol = "main.Hello", lang = "go"
```
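The expected `main.Hello` symbol follows the `<package>.<Func>` shape, and methods use `<package>.(*Receiver).<Method>`. A minimal sketch of that formatting is shown here. `Receiver` and `go_symbol` are hypothetical names chosen for illustration and are not taken from the kebab source.

```rust
/// Hypothetical description of a Go method receiver.
struct Receiver<'a> {
    /// Receiver type name without any leading `*`.
    type_name: &'a str,
    /// True when the receiver is a pointer type, e.g. `(c *Chunker)`.
    is_pointer: bool,
}

/// Format a Go citation symbol (illustrative helper, an assumption).
/// Functions become `<pkg>.<name>`; methods become `<pkg>.(<Type>).<name>`,
/// with a leading `*` inside the parentheses for pointer receivers.
fn go_symbol(pkg: &str, receiver: Option<&Receiver<'_>>, name: &str) -> String {
    match receiver {
        None => format!("{pkg}.{name}"),
        Some(r) => {
            let star = if r.is_pointer { "*" } else { "" };
            format!("{pkg}.({star}{}).{name}", r.type_name)
        }
    }
}
```

Calling `go_symbol("main", None, "Hello")` yields `main.Hello`, which matches the smoke-test expectation.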
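## P10-2 Tier 2 resource file indexing
Uses the same isolated KB setup as P10-1C-Go. Place Tier 2 resource files such as `.yaml`, `Dockerfile`, and `.toml` in the workspace and run ingest; each is processed by the chunker that matches its file type.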
```bash
# 1) Kubernetes manifest (YAML multi-doc)
cat > /tmp/kebab-smoke/workspace/deploy.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
  namespace: default
spec:
  selector:
    app: my-app
  ports:
    - port: 80
EOF
# 2) Dockerfile (whole file as a single chunk)
cat > /tmp/kebab-smoke/workspace/Dockerfile <<'EOF'
FROM rust:1.85 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/kebab /usr/local/bin/kebab
ENTRYPOINT ["kebab"]
EOF
# 3) Cargo.toml (manifest: whole file as a single chunk)
cp Cargo.toml /tmp/kebab-smoke/workspace/Cargo.toml
# 4) ingest
KB ingest
# 5) Search per language (check citation.symbol)
KB search --mode hybrid "Deployment" --code-lang yaml --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# Expected: symbol = "Deployment/default/my-app" (kind/namespace/name), lang = "yaml"
KB search --mode hybrid "rust:1.85" --code-lang dockerfile --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# Expected: symbol = "<dockerfile>", lang = "dockerfile"
KB search --mode hybrid "kebab-cli" --code-lang toml --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# Expected: symbol = "<manifest>", lang = "toml"
# 6) Check the Tier 2 language counts in schema stats
KB --json schema | jq '.stats.code_lang_breakdown'
# Expected: {"yaml": N, "dockerfile": N, "toml": N, ...}
```
**Tier 2 citation.symbol convention**:
- **YAML k8s resources**: `<kind>/<namespace>/<name>` (e.g. `Deployment/default/my-app`). Without a `namespace`, `<kind>/<name>`. Multi-doc YAML is split on the `---` separator into one chunk per resource.
- **Dockerfile**: `<dockerfile>` (fixed symbol; the whole file is a single chunk).
- **TOML / JSON / XML / Groovy / go.mod**: `<manifest>` (fixed symbol; the whole file is a single chunk). However, when a file exceeds the oversize threshold in `tier2_shared`, it falls back to line-based chunks.
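The k8s symbol convention in the list above can be sketched as a small formatting helper. This is an illustration under stated assumptions: `k8s_symbol` is a hypothetical name, and treating an empty namespace string the same as a missing one is an assumption, not documented behavior.

```rust
/// Build a Tier 2 citation symbol for one k8s resource.
/// Hypothetical helper for illustration; not taken from the kebab source.
fn k8s_symbol(kind: &str, namespace: Option<&str>, name: &str) -> String {
    match namespace {
        // `<kind>/<namespace>/<name>` when a namespace is present.
        // Assumption: an empty namespace is treated like a missing one.
        Some(ns) if !ns.is_empty() => format!("{kind}/{ns}/{name}"),
        // `<kind>/<name>` when the resource has no namespace.
        _ => format!("{kind}/{name}"),
    }
}
```

For the `deploy.yaml` example, `k8s_symbol("Deployment", Some("default"), "my-app")` produces `Deployment/default/my-app`.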
## P10-3 Tier 3 paragraph fallback
Uses the same isolated KB setup as P10-2. `.sh` files take the direct path; non-k8s YAML arrives via fallback.
```bash
# 1) shell script (direct Tier 3)
cat > /tmp/kebab-smoke/workspace/deploy.sh <<'EOF'
#!/usr/bin/env bash
set -e
echo "ingesting..."
kebab ingest
echo "done"
kebab schema --json | jq '.stats'
EOF
# 2) Non-k8s YAML (Tier 2 emits 0 chunks → Tier 3 fallback)
cat > /tmp/kebab-smoke/workspace/docker-compose.yml <<'EOF'
version: '3'
services:
  api:
    image: nginx:latest
    ports:
      - 8080:80
EOF
# 3) ingest
KB ingest
# 4) Search per language (confirm citation.symbol = None)
KB search --mode hybrid "ingest" --code-lang shell --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
# Expected: symbol = null, lang = "shell", chunker_version = "code-text-paragraph-v1"
KB search --mode hybrid "nginx" --code-lang yaml --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
# Expected: symbol = null, lang = "yaml", chunker_version = "code-text-paragraph-v1"
# 5) Check the shell count in schema stats
KB --json schema | jq '.stats.code_lang_breakdown'
# Expected: {"shell": N, "yaml": M, ...}
```
**Tier 3 citation.symbol convention**: always `null`. No semantic unit is identified. `lang` preserves the original lang (shell → `"shell"`, yaml → `"yaml"`, and so on).
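The `code-text-paragraph-v1` behavior (blank-line paragraph segmentation, with an 80-line window and 20-line overlap for oversize paragraphs) might be sketched roughly as follows. This is not the actual chunker: the function names, the 1-based inclusive span representation, and the exact window stepping are assumptions made for illustration.

```rust
/// Assumed constants mirroring the documented 80-line / 20-overlap window.
const WINDOW_LINES: usize = 80;
const OVERLAP_LINES: usize = 20;

/// Split text into 1-based inclusive `(line_start, line_end)` spans,
/// one per blank-line-delimited paragraph (hypothetical helper).
fn paragraph_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    let mut last = 0;
    for (idx, line) in text.lines().enumerate() {
        let line_no = idx + 1;
        last = line_no;
        if line.trim().is_empty() {
            // A blank line closes the paragraph in progress, if any.
            if let Some(s) = start.take() {
                push_windowed(&mut spans, s, line_no - 1);
            }
        } else if start.is_none() {
            start = Some(line_no);
        }
    }
    // Close a paragraph that runs to the end of the text.
    if let Some(s) = start {
        push_windowed(&mut spans, s, last);
    }
    spans
}

/// Emit one span for a small paragraph, or overlapping windows when the
/// paragraph is longer than WINDOW_LINES.
fn push_windowed(out: &mut Vec<(usize, usize)>, start: usize, end: usize) {
    if end - start + 1 <= WINDOW_LINES {
        out.push((start, end));
        return;
    }
    let step = WINDOW_LINES - OVERLAP_LINES;
    let mut s = start;
    loop {
        let e = (s + WINDOW_LINES - 1).min(end);
        out.push((s, e));
        if e == end {
            break;
        }
        s += step;
    }
}
```

Under these assumptions, a single 100-line paragraph yields two windows, lines 1 to 80 and lines 61 to 100, so consecutive windows share 20 lines.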
## P10-1D C + C++ AST chunkers
Uses the same isolated KB setup as P10-3. `.c` and `.cpp` files are each processed by their own AST chunker.
```bash
# 1) C file: top-level function symbol
cat > /tmp/kebab-smoke/workspace/parser.c <<'EOF'
#include <stdio.h>
int parse_record(const char *line) {
if (line == NULL) return -1;
return 0;
}
EOF
# 2) C++ file: namespace::Class::method symbol
cat > /tmp/kebab-smoke/workspace/chunker.cpp <<'EOF'
namespace kebab {
namespace chunk {
class Foo {
public:
void bar() { /* impl */ }
};
} // namespace chunk
} // namespace kebab
EOF
# 3) ingest
KB ingest
# 4) Search per language (check citation.symbol)
KB search --mode hybrid "parse_record" --code-lang c --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# Expected: symbol = "parse_record" (function name only), lang = "c"
KB search --mode hybrid "bar" --code-lang cpp --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# Expected: symbol = "kebab::chunk::Foo" or "kebab::chunk::Foo::bar" (namespace::Class[::method]), lang = "cpp"
# 5) Check the C/C++ counts in schema stats
KB --json schema | jq '.stats.code_lang_breakdown'
# Expected: {"c": N, "cpp": M, ...}
```
**Tier 1 (p10-1D) citation.symbol convention**: C uses the function name only (no nesting, as in `parse_record`). C++ uses `namespace::Class::method` (recursive namespace + class nesting). When a `.h` file contains C++ syntax (namespace / template / class), the tree-sitter-c parse fails and the file is picked up automatically by the p10-3 Tier 3 fallback (`code-text-paragraph-v1`).
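The recursive `namespace::Class::method` nesting can be illustrated with a toy scope tree. The `Scope` enum and `collect_symbols` are hypothetical stand-ins for the tree-sitter walk, and emitting both the class symbol and each method symbol is an assumption; the real chunker may emit only one of them.

```rust
/// Hypothetical, simplified stand-in for a parsed C++ scope tree.
enum Scope {
    Namespace { name: String, children: Vec<Scope> },
    Class { name: String, methods: Vec<String> },
}

/// Walk the tree and collect `::`-joined symbols, carrying the enclosing
/// namespace and class names down as a prefix.
fn collect_symbols(scope: &Scope, prefix: &[String], out: &mut Vec<String>) {
    match scope {
        Scope::Namespace { name, children } => {
            let mut next = prefix.to_vec();
            next.push(name.clone());
            // Recurse so nested namespaces extend the prefix.
            for child in children {
                collect_symbols(child, &next, out);
            }
        }
        Scope::Class { name, methods } => {
            let mut next = prefix.to_vec();
            next.push(name.clone());
            let class_symbol = next.join("::");
            // Assumption: the class and each method get their own symbol.
            out.push(class_symbol.clone());
            for method in methods {
                out.push(format!("{class_symbol}::{method}"));
            }
        }
    }
}
```

For the `chunker.cpp` example, walking `kebab { chunk { Foo { bar } } }` with an empty prefix collects `kebab::chunk::Foo` and `kebab::chunk::Foo::bar`.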
## Verification checklist
- `kebab doctor` honors the `--config` path and prints the `storage.data_dir` from that file (not the XDG default).
@@ -433,6 +628,11 @@ rm -rf /tmp/kebab-smoke # clean up everything
- (P7-3) For a PDF with N pages, `kebab ingest` commits N chunks (or more, since long pages become multi-chunk) in a single transaction. A 500-page book → 500+ chunks at once → embedding throughput becomes the bottleneck. The first ingest of a large PDF in an embedding-enabled workspace can take minutes and grow the WAL; this is normal behavior until the P+ scale hardening task, but the cost is measurable.
- (P10-1A-2) A `.rs` file placed in the workspace is counted under `new` in the `kebab ingest` result. The wiring is correct if `kebab search --mode hybrid "<function name>" --code-lang rust --json` returns `citation.kind = "code"`, `citation.lang = "rust"` (the SearchHit top-level `code_lang` matches), `citation.symbol` (function/type name), and `citation.line_start` / `citation.line_end`. If `kebab schema --json | jq .stats.code_lang_breakdown` shows `"rust": N`, the chunks are indexed.
- (P10-1B) `.py` / `.ts` / `.tsx` / `.js` / `.mjs` / `.cjs` / `.jsx` files placed in the workspace are counted under `new` in the `kebab ingest` result. The wiring is correct if `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` searches return results whose `citation.symbol` includes the module path prefix. Confirm that the corresponding language counts appear in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-1C-Go) `.go` files placed in the workspace are processed by `kebab ingest` with `code-go-ast-v1`. The wiring is correct if `--code-lang go` searches return `citation.symbol` values of the form `<package>.<Func>` / `<package>.(*Receiver).<Method>`. Confirm that `"go": N` appears in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-1C-JK) `.java` files are processed by `code-java-ast-v1`, and `.kt`/`.kts` files by `code-kotlin-ast-v1`. The wiring is correct if `--code-lang java` / `--code-lang kotlin` searches return `citation.symbol` values of the form `com.foo.Foo.bar`. Confirm that `"java": N` / `"kotlin": N` appear in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-2) `.yaml`/`.yml` files yield one chunk per k8s resource via apiVersion+kind parsing (`k8s-manifest-resource-v1`). `Dockerfile`/`Dockerfile.*` is a single whole-file chunk (`dockerfile-file-v1`). `.toml`/`.json`/`.xml`/`.groovy`/`go.mod` are single whole-file chunks (`manifest-file-v1`). The wiring is correct if `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` searches return `citation.symbol` values of the form `Deployment/default/my-app` / `<dockerfile>` / `<manifest>` respectively. Confirm that `"yaml": N` / `"dockerfile": N` / `"toml": N` appear in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-3) `.sh`/`.bash`/`.zsh` files go directly to Tier 3 (`code-text-paragraph-v1`). Non-k8s YAML (yaml without apiVersion+kind) yields 0 chunks from the k8s chunker and is picked up by the Tier 3 fallback. The wiring is correct if `--code-lang shell` / `--code-lang yaml` searches return results with `citation.symbol = null` and `chunker_version = "code-text-paragraph-v1"`. Confirm that `"shell": N` appears in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-1D) `.c` / `.h` files use `code-c-ast-v1` (function-name-only symbol). `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` use `code-cpp-ast-v1` (`namespace::Class::method` symbol). Confirm that `--code-lang c` / `--code-lang cpp` searches work and that `"c": N` / `"cpp": M` appear in `kebab schema --json | jq .stats.code_lang_breakdown`. A `.h` file with C++ content (namespace, etc.) is picked up automatically by the Tier 3 (`code-text-paragraph-v1`) fallback.
- (P7-3 + follow-up) When a PDF with different bytes is ingested a second time at the same path, `purge_vector_orphans_for_workspace_path` first deletes the old chunk_ids from LanceDB, then `purge_orphan_at_workspace_path` sweeps the old doc / chunks / embedding_records from SQLite. The new bytes are indexed under a new `doc_id`. In `IngestReport`, only that asset counts as `new+=1` (other assets are `updated`). Both stores stay consistent: searching for the old content no longer surfaces the old chunks.
### Embedding upgrade (fb-39b)


@@ -0,0 +1,540 @@
# p10-1C-Go Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task.
**Goal:** Activate Go code ingest end-to-end on top of 1A-2 (Rust) + 1B (Python/TS/JS) infrastructure. Add `tree-sitter-go` grammar + `GoAstExtractor` + `code-go-ast-v1` chunker + media routing + app dispatch arm.
**Architecture:** Mirror 1A-2 / 1B exactly. `kebab-parse-code/src/go.rs` walks tree-sitter-go parse tree; emits one `Block::Code` per top-level AST semantic unit with `SourceSpan::Code { symbol, lang: Some("go") }`. Symbol prefix = **source-extracted package name** (from `package_clause` AST node — design §3.4 Go row). `kebab-chunk/src/code_go_ast_v1.rs` is a near-duplicate of `code-rust-ast-v1`. App dispatch's `ingest_one_code_asset` (PR #142 generalized 4-arm match) gets a 5th arm.
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already in workspace), `tree-sitter-go` (NEW), 1A-2/1B infrastructure unchanged.
**Memory note:** Host has been OOM-killed previously. Use `cargo test -p <crate>` and `cargo check -p <crate>` only. ONE full-suite invocation reserved for Task G gate.
---
## Pre-flight
Branch `feat/p10-1c-go` already exists.
- [ ] **Disk hygiene**: `cargo clean` if previous artifacts are bloated. Skip if disk is comfortable (`df -h /`).
Reference files:
- 1A-2 Rust extractor: `crates/kebab-parse-code/src/rust.rs` — closest single-language scaffold template.
- 1B Python extractor (closest analog for "class-nesting recursion" — Go doesn't have classes but has package as the single prefix): `crates/kebab-parse-code/src/python.rs`.
- 1A-2 chunker scaffold: `crates/kebab-chunk/src/code_rust_ast_v1.rs`.
- 1B dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` (~L1645, 4-arm match).
- 1A-2 source-fs routing: `crates/kebab-source-fs/src/media.rs` `"rs" =>` arm.
---
## Task A: Workspace dep `tree-sitter-go`
**Files:**
- Modify: `Cargo.toml` (workspace `[workspace.dependencies]`, after `tree-sitter-javascript` line)
- Modify: `crates/kebab-parse-code/Cargo.toml`
- [ ] **Step 1**: `cargo add tree-sitter-go -p kebab-parse-code` to resolve version.
- [ ] **Step 2**: Lift the resolved version into `[workspace.dependencies]` after `tree-sitter-javascript`:
```toml
# Go grammar for code ingest (kebab-parse-code, p10-1C).
tree-sitter-go = "<resolved>"
```
Switch the crate's entry to `{ workspace = true }` matching existing tree-sitter-* style.
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
- [ ] **Step 4**: Commit:
```bash
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
git commit -m "build(p10-1c-go): add tree-sitter-go workspace dep
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task B: source-fs media routing `.go` → `MediaType::Code("go")`
**Files:**
- Modify: `crates/kebab-source-fs/src/media.rs` (add arm after the existing JS arm at ~L44)
- Test: same file's test module
- [ ] **Step 1 (failing test)** — add to existing tests near `py_ts_js_files_map_to_media_code`:
```rust
#[test]
fn go_files_map_to_media_code_go() {
assert_eq!(media_type_for(Path::new("a/b.go")), MediaType::Code("go".into()));
}
```
- [ ] **Step 2**: Run → FAIL.
- [ ] **Step 3**: Add the arm before the catch-all `_ => MediaType::Other(ext)`:
```rust
// p10-1C-Go: Go ingest activated.
"go" => MediaType::Code("go".into()),
```
- [ ] **Step 4**: Run → PASS. `cargo test -p kebab-source-fs` → no regression.
- [ ] **Step 5**: clippy clean, commit:
```bash
git add crates/kebab-source-fs/
git commit -m "feat(p10-1c-go): route .go to MediaType::Code(go)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task C: App dispatch allowlist + bail arm for "go"
**Files:**
- Modify: `crates/kebab-app/src/lib.rs` (dispatch match guard + 4 internal match arms in `ingest_one_code_asset`)
- [ ] **Step 1**: Find the `MediaType::Code(lang) if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript")` arm (~L953). Add `"go"` to the allowlist:
```rust
MediaType::Code(lang)
if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript" | "go") =>
{
```
- [ ] **Step 2**: In `ingest_one_code_asset`'s 4 `match code_lang` blocks (parser_version, chunker_version, extract, chunk), add a "go" arm that `bail!()`s for now (extractor + chunker land in Task D/E). Mirror the Python/TS/JS bail-then-activate pattern:
```rust
let parser_version = match code_lang {
// ... existing arms ...
"go" => anyhow::bail!("go ingest not yet wired (p10-1c-go Task F)"),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
// similar for chunker_version / extract / chunk matches
```
- [ ] **Step 3**: `cargo test -p kebab-app --lib` → existing 52 lib tests stay green. `cargo test -p kebab-app --test code_ingest_smoke` → 6 stay green (Rust path unaffected).
- [ ] **Step 4**: clippy clean, commit:
```bash
git add crates/kebab-app/
git commit -m "refactor(p10-1c-go): add go to ingest dispatch allowlist (bail until Task F)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task D: `GoAstExtractor` (`kebab-parse-code/src/go.rs`)
**Files:**
- Create: `crates/kebab-parse-code/src/go.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs` (`pub mod go;` + re-exports `GO_PARSER_VERSION`, `GoAstExtractor`)
- Create: `crates/kebab-parse-code/tests/fixtures/sample.go`
Scaffold mirrors `crates/kebab-parse-code/src/rust.rs` line-for-line for the `CanonicalDocument` skeleton (Extractor trait impl, `id_for_doc`, ProvenanceEvent, final `CanonicalDocument` literal). The novel parts:
### Constants
```rust
pub const PARSER_VERSION: &str = "code-go-v1";
pub struct GoAstExtractor;
// new() + Default
// supports: matches!(m, MediaType::Code(l) if l == "go")
// agent = "kb-parse-code"
// metadata.code_lang = Some("go")
// SourceType::Note (no SourceType::Code variant)
// repo/git_branch/git_commit via detect_repo
```
### Package extraction
Unlike 1B's path-based `module_path_for_python` / `_for_tsjs`, the Go package prefix comes from the **source code's `package` declaration** (design §3.4). tree-sitter-go's grammar:
- Root: `source_file`
- First named child is typically `package_clause` → contains `package_identifier` child whose text is the package name.
Helper (local to `go.rs`):
```rust
/// Returns the package name from a tree-sitter-go `source_file`, or
/// `None` if the file has no `package_clause` (invalid Go in practice,
/// but be defensive).
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
    let mut cur = root.walk();
    for child in root.named_children(&mut cur) {
        if child.kind() == "package_clause" {
            // `package_clause` has a `package_identifier` named child.
            let mut c2 = child.walk();
            for sub in child.named_children(&mut c2) {
                if sub.kind() == "package_identifier" {
                    return Some(src[sub.start_byte()..sub.end_byte()].to_string());
                }
            }
        }
    }
    None
}
```
### Semantic-unit rules
| node kind | unit | symbol |
|-----------|------|--------|
| `function_declaration` (name field) | 1 | `<pkg>.<fn_name>` |
| `method_declaration` | 1 | `<pkg>.(<TypeText>).<MethodName>` where `<TypeText>` includes a leading `*` if the receiver is `pointer_type`. Examples: `chunk.(*MdHeadingV1Chunker).ChunkDoc`, `chunk.(Foo).Bar`. |
| `type_declaration` (struct / interface / type alias) | 1 per inner `type_spec` | `<pkg>.<TypeName>` |
| `const_declaration`, `var_declaration`, `import_declaration` (single or block) | glue | `<pkg>.<top-level>`, or `<pkg>.<module>` if the file has zero real units and the glue is import-only. This is the same `<module>` post-pass as 1B Python; keep the name `<module>` rather than `<package>`, per the design §3.4 line on the "module / namespace only, no symbol" case. |
`unit_start` walks `comment` siblings (same as 1B). Go doesn't have separate attribute / decorator nodes.
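The glue-naming rule in the last table row can be sketched as a small pure function. This is an illustration only: `glue_symbol_sketch` and its two inputs are hypothetical, and the real post-pass works on collected blocks rather than on precomputed flags.

```rust
/// Picks the symbol for a glue block (imports / consts / vars).
/// `real_unit_count` is the number of function / method / type units in the file;
/// `glue_is_import_only` is true when the glue contains nothing but imports.
fn glue_symbol_sketch(pkg: &str, real_unit_count: usize, glue_is_import_only: bool) -> String {
    if real_unit_count == 0 && glue_is_import_only {
        // File has no real units and only imports: use the `<module>` marker.
        format!("{pkg}.<module>")
    } else {
        format!("{pkg}.<top-level>")
    }
}
```

For the `sample.go` fixture, which has real units plus an import block and a `const`, this yields `chunk.<top-level>`.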
Method receiver pointer detection:
```rust
// In the method_declaration arm:
let receiver = child.child_by_field_name("receiver"); // parameter_list
let receiver_type_text = receiver.and_then(|r| {
    let mut cw = r.walk();
    for p in r.named_children(&mut cw) {
        if p.kind() == "parameter_declaration" {
            // type field is either type_identifier (value) or pointer_type (ptr)
            if let Some(ty) = p.child_by_field_name("type") {
                let s = &src[ty.start_byte()..ty.end_byte()];
                return Some(s.to_string()); // includes leading "*" if pointer_type
            }
        }
    }
    None
});
// Format: "(*Foo)" or "(Foo)"; wrap in parens, preserve leading "*" if any.
let owner = receiver_type_text
    .map(|t| format!("({t})"))
    .unwrap_or_else(|| "()".to_string());
let method_name = name_text(&child, src);
// symbol = format!("{pkg}.{owner}.{method_name}")
```
If any field name above differs in the resolved crate version, check tree-sitter-go's `grammar.json` or `node-types.json` (in the registry source) and adjust.
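The symbol formatting for methods can be checked without tree-sitter. The sketch below isolates the string assembly only; `go_method_symbol_sketch` is a hypothetical name, and the receiver text is assumed to have already been sliced out of the source as in the snippet above.

```rust
/// Builds a design §3.4 style symbol for a Go method.
/// `receiver_type` is the raw receiver type text, e.g. "*Foo" or "Foo".
fn go_method_symbol_sketch(pkg: &str, receiver_type: Option<&str>, method: &str) -> String {
    // Wrap the receiver in parens; a missing receiver degrades to "()".
    let owner = match receiver_type {
        Some(t) => format!("({t})"),
        None => "()".to_string(),
    };
    format!("{pkg}.{owner}.{method}")
}
```

Because the receiver text is copied verbatim, pointer and value receivers on the same type produce distinct symbols.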
### Fixture `tests/fixtures/sample.go`:
```go
// sample.go
package chunk
import (
"fmt"
"strings"
)
const Version = "v1"
type MdHeadingV1Chunker struct {
Name string
}
// ChunkDoc returns a stub list of strings.
func (m *MdHeadingV1Chunker) ChunkDoc(input string) []string {
return []string{m.Name}
}
func (m MdHeadingV1Chunker) Name2() string {
return m.Name
}
type Stringer interface {
String() string
}
func Free(x int) int {
return x + 1
}
func init() {
fmt.Println(strings.ToUpper("init"))
}
```
### Test module
Mirror Python's test shape (use `crate::rust::tests_support::fixed_code_asset` from 1B):
```rust
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.go"),
).unwrap();
let asset = crate::rust::tests_support::fixed_code_asset(
"crates/x/src/sample.go", "go",
);
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
GoAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_go() {
let e = GoAstExtractor::new();
assert!(e.supports(&MediaType::Code("go".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn go_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc.blocks.iter().filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("go"));
symbol.clone()
}
_ => None,
},
_ => None,
}).collect();
syms.sort();
assert!(syms.iter().any(|s| s == "chunk.Free"), "got {syms:?}");
assert!(syms.iter().any(|s| s == "chunk.init"));
assert!(syms.iter().any(|s| s == "chunk.MdHeadingV1Chunker"));
assert!(syms.iter().any(|s| s == "chunk.(*MdHeadingV1Chunker).ChunkDoc"));
assert!(syms.iter().any(|s| s == "chunk.(MdHeadingV1Chunker).Name2"));
assert!(syms.iter().any(|s| s == "chunk.Stringer"));
assert!(syms.iter().any(|s| s == "chunk.<top-level>")); // import + const grouped
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 { assert_eq!(extract_fixture().blocks, a.blocks); }
}
}
```
### Step list
- [ ] Step 1: create fixture + test module.
- [ ] Step 2: run → FAIL (`GoAstExtractor` undefined).
- [ ] Step 3: implement `go.rs`. Scaffold mirrors `python.rs` (Extractor impl + extract scaffold + `build_blocks` returning blocks). `build_blocks` does: extract_package → walk root's named children → branch per node kind per the table above → emit `Block::Code` with `SourceSpan::Code { symbol, lang: Some("go") }`. Use the same `flush_glue` / glue grouping / `<top-level>` vs `<module>` post-pass as Python (rename to `<package>` if user prefers, but spec §3.4 says `<module>` so keep that name for cross-language consistency).
- [ ] Step 4: wire into `lib.rs`:
```rust
pub mod go;
pub use go::{PARSER_VERSION as GO_PARSER_VERSION, GoAstExtractor};
```
- [ ] Step 5: `cargo test -p kebab-parse-code` → all pass (Rust/Python/TS/JS + new Go). `cargo clippy -p kebab-parse-code --all-targets -- -D warnings` clean.
- [ ] Step 6: commit:
```bash
git add crates/kebab-parse-code/
git commit -m "feat(p10-1c-go): tree-sitter-go AST extractor (GoAstExtractor)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task E: `code-go-ast-v1` chunker
**Files:**
- Create: `crates/kebab-chunk/src/code_go_ast_v1.rs`
- Modify: `crates/kebab-chunk/src/lib.rs`
Identical pattern to PR #142 Task I (TS) / Task L (JS) — near-duplicate of `code_rust_ast_v1.rs` with substitutions:
- `const VERSION_LABEL: &str = "code-go-ast-v1";`
- struct name `CodeGoAstV1Chunker`
- error message says `"CodeGoAstV1Chunker only handles..."`
- module doc-comment prose `Rust` → `Go`, `code-rust-ast-v1` → `code-go-ast-v1`
`split_oversize` / `make_chunk` / `AST_CHUNK_MAX_LINES = 200` / `BYTES_PER_TOKEN = 3` / `POLICY_HASH_HEX_LEN = 16` IDENTICAL (language-agnostic).
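To make the oversize fallback concrete, here is a minimal sketch of splitting an over-long unit into fixed line windows. It is an assumption about the general shape only: the real `split_oversize` in `code_rust_ast_v1.rs` may choose split points differently (for example by token budget), and `split_lines_sketch` is a hypothetical name.

```rust
/// Line cap taken from the plan text.
const AST_CHUNK_MAX_LINES: u32 = 200;

/// Splits an inclusive, 1-based line range into windows of at most `max_lines`.
fn split_lines_sketch(line_start: u32, line_end: u32, max_lines: u32) -> Vec<(u32, u32)> {
    assert!(max_lines > 0 && line_start <= line_end);
    let mut out = Vec::new();
    let mut start = line_start;
    loop {
        // Clamp each window to the end of the unit.
        let end = line_end.min(start.saturating_add(max_lines - 1));
        out.push((start, end));
        if end == line_end {
            break;
        }
        start = end + 1;
    }
    out
}
```

A unit within the cap comes back as a single window, which matches the "1:1 + oversize split" wording in the commit message.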
Test module: copy from `code_ts_ast_v1.rs` and substitute names. KEEP cross-chunker `policy_hash_matches_md_heading_v1`.
Wire into `crates/kebab-chunk/src/lib.rs`:
```rust
mod code_go_ast_v1;
pub use code_go_ast_v1::CodeGoAstV1Chunker;
```
(Alphabetical placement.)
Verify + commit:
- `cargo test -p kebab-chunk code_go_ast` PASS (~6 tests)
- `cargo test -p kebab-chunk` full per-crate green
- `cargo clippy -p kebab-chunk --all-targets -- -D warnings` clean
```bash
git add crates/kebab-chunk/
git commit -m "feat(p10-1c-go): code-go-ast-v1 chunker (1:1 + oversize split)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task F: Activate Go in app dispatch
**Files:**
- Modify: `crates/kebab-app/src/lib.rs` (replace 4 "go" bail! arms with real calls)
- Modify: `crates/kebab-app/tests/code_ingest_smoke.rs` (add Go integration test)
Replace the 4 `"go" => anyhow::bail!(...)` arms in `ingest_one_code_asset` (added in Task C) with real:
```rust
"go" => ParserVersion(kebab_parse_code::GO_PARSER_VERSION.to_string()),
// ...
"go" => CodeGoAstV1Chunker.chunker_version(),
// ...
"go" => kebab_parse_code::GoAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::GoAstExtractor::extract (code:go)")?,
// ...
"go" => CodeGoAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeGoAstV1Chunker::chunk (code:go)")?,
```
Add imports at top of lib.rs:
- `kebab_chunk::CodeGoAstV1Chunker`
- `kebab_parse_code::GoAstExtractor`
Integration test (mirror PR #142's `python_file_ingests_and_searches_as_code_citation`):
```rust
#[test]
fn go_file_ingests_and_searches_as_code_citation() {
// ... TempDir + Config harness same as Python/TS test ...
let pkg_dir = env.workspace_root.join("chunk");
std::fs::create_dir_all(&pkg_dir).unwrap();
std::fs::write(
pkg_dir.join("ast.go"),
"package chunk\n\nfunc ParseDoc(input string) string {\n return input\n}\n",
).unwrap();
let report = kebab_app::ingest_with_config(/* ... */).unwrap();
assert!(report.new >= 1);
let go_item = report.items.as_ref().unwrap().iter()
.find(|i| i.doc_path.0.ends_with("ast.go")).expect("ast.go item");
assert_eq!(go_item.parser_version.as_ref().unwrap().0, "code-go-v1");
assert_eq!(go_item.chunker_version.as_ref().unwrap().0, "code-go-ast-v1");
let hits = kebab_app::search_with_config(/* search "ParseDoc" */).unwrap();
let h = hits.iter().find(|h| matches!(h.citation, kebab_core::Citation::Code { .. }))
.expect("Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code { lang, symbol, line_start, .. } => {
assert_eq!(lang.as_deref(), Some("go"));
assert_eq!(symbol.as_deref(), Some("chunk.ParseDoc"));
assert!(*line_start >= 1);
}
_ => unreachable!(),
}
assert_eq!(h.code_lang.as_deref(), Some("go"));
}
```
Verify:
- `cargo test -p kebab-app --test code_ingest_smoke` → 7/7 (6 existing + 1 new go)
- `cargo test -p kebab-app --lib` → 52/52 (no regression)
- clippy clean
```bash
git add crates/kebab-app/
git commit -m "feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task G: Snapshot + full-suite gate + manual SMOKE
**Files:**
- Create: `crates/kebab-chunk/tests/code_go_ast_snapshot.rs` + fixture + baseline (mirror `code_python_ast_snapshot.rs` from PR #142)
- [ ] **Step 1**: Add snapshot integration test. In-memory `CanonicalDocument` (no kebab-parse-code dep — boundary §6.3). Generate baseline: `UPDATE_SNAPSHOTS=1 cargo test -p kebab-chunk code_go_ast_snapshot` → re-run without env → PASS.
- [ ] **Step 2**: Full-suite gate (the ONE invocation allowed this PR):
```bash
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace --no-fail-fast -j 1
```
Both must be CLEAN/GREEN.
- [ ] **Step 3**: Manual SMOKE (optional but recommended — mirror PR #142 SMOKE):
```bash
cargo build --release # OR debug if RAM-tight
rm -rf /tmp/kebab-go-smoke && mkdir -p /tmp/kebab-go-smoke/ws/chunk
echo 'package chunk
func ParseDoc(input string) string { return input }
' > /tmp/kebab-go-smoke/ws/chunk/ast.go
# adapt isolated config from docs/SMOKE.md
./target/release/kebab --config /tmp/kebab-go-smoke/config.toml ingest --json | jq '.items[].parser_version' | sort -u
./target/release/kebab --config /tmp/kebab-go-smoke/config.toml search "ParseDoc" --code-lang go --json | jq '.hits[0]'
```
Expected: `code-go-v1` in parser_versions; Citation::Code with symbol `chunk.ParseDoc`.
- [ ] **Step 4**: Commit snapshot only (full-suite + SMOKE are gates, not commit content):
```bash
git add crates/kebab-chunk/tests/
git commit -m "test(p10-1c-go): code-go-ast-v1 chunker snapshot + full-suite gate

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
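The baseline workflow in Step 1 (generate with `UPDATE_SNAPSHOTS=1`, then re-run without it) can be sketched as a small helper. This is a std-only illustration; `check_snapshot_sketch` and `update_requested` are hypothetical names, and the existing snapshot tests in `kebab-chunk` may be structured differently.

```rust
use std::fs;
use std::path::Path;

/// Compares `actual` with the baseline at `path`.
/// When `update` is true, (re)writes the baseline instead of comparing.
fn check_snapshot_sketch(path: &Path, actual: &str, update: bool) -> Result<(), String> {
    if update {
        if let Some(dir) = path.parent() {
            fs::create_dir_all(dir).map_err(|e| e.to_string())?;
        }
        fs::write(path, actual).map_err(|e| e.to_string())?;
        return Ok(());
    }
    let expected = fs::read_to_string(path)
        .map_err(|e| format!("missing baseline {}: {e}", path.display()))?;
    if expected == actual {
        Ok(())
    } else {
        Err(format!("snapshot mismatch for {}", path.display()))
    }
}

/// Reads the update switch the way the plan's command line sets it.
#[allow(dead_code)]
fn update_requested() -> bool {
    std::env::var("UPDATE_SNAPSHOTS").map(|v| v == "1").unwrap_or(false)
}
```

Passing the switch as a parameter keeps the comparison testable without mutating process environment variables.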
---
## Task H: Docs + version bump
- README: in the supported-formats row (지원 형식), add Go (`.go`, `code-go-ast-v1`).
- HANDOFF: P10 phase row note 1C-Go merged (Go active). Java/Kotlin remain pending.
- ARCHITECTURE: directory tree note for kebab-parse-code includes `go.rs` (Java/Kotlin coming in next PR). Decisions table — no new row (1C-Go follows the 1A-2/1B convention).
- SMOKE: extend the P10 section with a 1-line note for Go (or compact Go example).
- tasks/INDEX + tasks/p10/INDEX: flip the row for 1C-Go to 🟡 (PR open) → ✅ on merge. The 1C row in p10/INDEX may need a split — `p10-1C-Go ⏳ → 🟡` and `p10-1C-JavaKotlin ⏳ unchanged` (since user split into 2 PRs).
- frozen design §10.1: add a one-liner, "p10-1C-Go 활성화 (Go)" (that is, "p10-1C-Go activated (Go)"). Java/Kotlin will get its own line in the next PR.
- `Cargo.toml`: workspace version `0.11.1 → 0.12.0` (minor: expands the dogfooding surface and activates a new chunker + extractor).
```bash
git add -A
git commit -m "docs(p10-1c-go): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + chore: bump version 0.11.1 → 0.12.0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Finalize: PR + review loop + release
Per workflow memory (gitea-pr + review loop, no single-shot):
- [ ] `gitea-pr` → PR title `feat(p10-1C-Go): tree-sitter-go AST extractor + chunker — Go 코드 색인 활성화`
- [ ] Review loop until APPROVE → merge → main pull → branch cleanup → `cargo clean` → `gitea-release v0.12.0`.
---
## Self-Review (filled by plan author)
- **Spec coverage**: design §1C Go (extractor + chunker + activation) → Tasks D/E/F; §3.3 (`code-go-ast-v1`) → Task E; §3.4 symbol path → Task D (extract_package + method receiver pointer detection); §6.1 (`kebab-parse-code/src/go.rs`) → Task D; §6.2 (`kebab-chunk/src/code_go_ast_v1.rs`) → Task E; §6.3 dep graph (`tree-sitter-go` parser-side) → Task A; §9.1 Tier-1 + oversize fallback → Task E (1A-2 split_oversize reused identically).
- **No placeholders**: novel logic (`extract_package`, method receiver pointer detection, fixture, test assertions, dispatch arm additions) given concretely. Mechanical mirrors (chunker, integration test, snapshot test) pinned to exact existing files with substitutions.
- **Type consistency**: `GoAstExtractor` / `GO_PARSER_VERSION = "code-go-v1"` / `CodeGoAstV1Chunker` / `VERSION_LABEL = "code-go-ast-v1"` used consistently across Tasks A-H. `MediaType::Code("go")` in routing + dispatch. `Citation::Code` with `lang: Some("go")` in integration test.


@@ -0,0 +1,494 @@
# p10-1C-JavaKotlin Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task.
**Goal:** Activate Java + Kotlin code ingest end-to-end. Mirror 1C-Go (PR #151 / v0.12.0) for Java (single-language scaffold) and Kotlin (additional top-level fn variant). Both use source-side `package` extraction (design §3.4 JVM convention).
**Architecture:** Same shape as 1B (multi-language single PR). 2 new tree-sitter grammars + 2 extractors + 2 chunkers + media routing + app dispatch arms. 1C-Go pattern is the closest template for source-side `package` extraction.
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already), `tree-sitter-java` + `tree-sitter-kotlin` (NEW). 1A-2/1B/1C-Go infrastructure unchanged.
**Memory note:** Host has been OOM'd previously. Per-crate cargo only. ONE full-suite + clippy invocation in Task J.
---
## Pre-flight
Branch `feat/p10-1c-jk` already exists.
- [ ] **Disk hygiene**: `cargo clean` if heavy (last cleanup recovered 34 GB).
Reference files:
- 1C-Go extractor: `crates/kebab-parse-code/src/go.rs` — closest template for source-side package extraction.
- 1B Python extractor: `crates/kebab-parse-code/src/python.rs` — class-nesting recursion model (relevant for Java/Kotlin).
- 1A-2 chunker: `crates/kebab-chunk/src/code_rust_ast_v1.rs` — duplicate-with-substitution.
- 1B dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` 4-arm match (~L1645). 1C-Go already added `"go"`; this PR adds `"java"` + `"kotlin"`.
---
## Task A: Workspace deps (tree-sitter-java + tree-sitter-kotlin)
**Files:**
- Modify: `Cargo.toml` (workspace `[workspace.dependencies]`, after `tree-sitter-go` line)
- Modify: `crates/kebab-parse-code/Cargo.toml`
- [ ] **Step 1**: `cargo add tree-sitter-java tree-sitter-kotlin -p kebab-parse-code`. If `tree-sitter-kotlin` resolves to a fork name, verify the actively-maintained crate (e.g. check crates.io page / GitHub stars / last update). Likely `tree-sitter-kotlin` (without fork suffix) is the default.
- [ ] **Step 2**: Lift the two resolved versions into `[workspace.dependencies]` after `tree-sitter-go`:
```toml
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
tree-sitter-java = "<resolved>"
tree-sitter-kotlin = "<resolved>"
```
Switch crate's entries to `{ workspace = true }`.
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
- [ ] **Step 4**: Commit:
```bash
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
git commit -m "build(p10-1c-jk): add tree-sitter-java + tree-sitter-kotlin workspace deps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
If the kotlin crate has a different actual name (e.g. `tree-sitter-kotlin-ng` or fork suffix), document the choice in the commit body briefly.
---
## Task B: source-fs routing `.java` / `.kt` / `.kts`
**Files:**
- Modify: `crates/kebab-source-fs/src/media.rs` (add arm after the existing `.go` arm)
- Test: same file's test module
- [ ] **Step 1 (failing test)** — add near `go_files_map_to_media_code_go`:
```rust
#[test]
fn java_kotlin_files_map_to_media_code() {
assert_eq!(media_type_for(Path::new("a/b.java")), MediaType::Code("java".into()));
assert_eq!(media_type_for(Path::new("a/b.kt")), MediaType::Code("kotlin".into()));
assert_eq!(media_type_for(Path::new("a/b.kts")), MediaType::Code("kotlin".into()));
}
```
- [ ] **Step 2**: Run → FAIL.
- [ ] **Step 3**: Add the arms before the `_ => MediaType::Other(ext)` fallback (after `"go" => ...`):
```rust
// p10-1C-JK: JVM family (Java + Kotlin) ingest activated.
"java" => MediaType::Code("java".into()),
"kt" | "kts" => MediaType::Code("kotlin".into()),
```
- [ ] **Step 4**: Run → PASS. `cargo test -p kebab-source-fs` → no regression.
- [ ] **Step 5**: clippy clean, commit.
```bash
cargo clippy -p kebab-source-fs --all-targets -- -D warnings
git add crates/kebab-source-fs/
git commit -m "feat(p10-1c-jk): route .java/.kt/.kts to MediaType::Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
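The routing arms from Step 3 amount to an extension-to-language table. The sketch below shows that mapping with plain `std` types; `lang_for_ext_sketch` and `lang_for_path_sketch` are hypothetical names, and returning `Option<&str>` instead of `MediaType` is a simplification for illustration. Matching is case-sensitive here, which is an assumption about the real `media_type_for`.

```rust
use std::path::Path;

/// Maps a bare file extension to a code language label.
fn lang_for_ext_sketch(ext: &str) -> Option<&'static str> {
    match ext {
        "go" => Some("go"),
        "java" => Some("java"),
        // Both Kotlin sources and Kotlin scripts route to the same language.
        "kt" | "kts" => Some("kotlin"),
        _ => None,
    }
}

/// Same mapping, starting from a path.
fn lang_for_path_sketch(path: &Path) -> Option<&'static str> {
    path.extension()
        .and_then(|e| e.to_str())
        .and_then(lang_for_ext_sketch)
}
```

Files without an extension fall through to `None`, which corresponds to the `MediaType::Other` fallback in the real match.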
---
## Task C: App dispatch + bail arms for "java" + "kotlin"
**Files:**
- Modify: `crates/kebab-app/src/lib.rs`
- [ ] **Step 1**: Find the dispatch arm guard (currently `matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript" | "go")`). Add `"java"` + `"kotlin"`:
```rust
MediaType::Code(lang)
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin") =>
```
- [ ] **Step 2**: In `ingest_one_code_asset` the 4 `match code_lang` blocks add `"java"` and `"kotlin"` arms that `bail!()` for now:
```rust
"java" => anyhow::bail!("java ingest not yet wired (p10-1c-jk Task F)"),
"kotlin" => anyhow::bail!("kotlin ingest not yet wired (p10-1c-jk Task I)"),
```
(in each of the 4 blocks before the `other =>` catch-all).
- [ ] **Step 3**: Verify per-crate:
- `cargo test -p kebab-app --lib` → 52 stay green
- `cargo test -p kebab-app --test code_ingest_smoke` → 7 stay green
- `cargo clippy -p kebab-app --all-targets -- -D warnings` clean
- [ ] **Step 4**: Commit:
```bash
git add crates/kebab-app/
git commit -m "refactor(p10-1c-jk): add java + kotlin to ingest dispatch allowlist (bail until Tasks F/I)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
```
---
## Task D: `JavaAstExtractor`
**Files:**
- Create: `crates/kebab-parse-code/src/java.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs` (`pub mod java;` + re-exports `JAVA_PARSER_VERSION`, `JavaAstExtractor`)
- Create: `crates/kebab-parse-code/tests/fixtures/sample.java`
Scaffold mirrors `crates/kebab-parse-code/src/go.rs` (1C-Go) — single-language with source-side `package` extraction. Differences:
### Constants
```rust
pub const PARSER_VERSION: &str = "code-java-v1";
pub struct JavaAstExtractor;
// supports: matches!(m, MediaType::Code(l) if l == "java")
// code_lang = Some("java"), SourceType::Note, repo via detect_repo
```
### Package extraction (Java)
tree-sitter-java grammar:
- Root: `program`
- `package_declaration` (top-level child) → contains `scoped_identifier` (dotted) OR `identifier` (single-segment)
```rust
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
    let mut cur = root.walk();
    for child in root.named_children(&mut cur) {
        if child.kind() == "package_declaration" {
            // package_declaration has scoped_identifier OR identifier as first named child
            let mut c2 = child.walk();
            for sub in child.named_children(&mut c2) {
                if sub.kind() == "scoped_identifier" || sub.kind() == "identifier" {
                    return Some(src[sub.start_byte()..sub.end_byte()].to_string());
                }
            }
        }
    }
    None
}
```
(Verify the node kinds above against tree-sitter-java's node-types.json for the resolved crate version, and adjust if any differ.)
### AST mapping
| node kind | unit | symbol |
|-----------|------|--------|
| `class_declaration` (name field) | 1 + recurse body | `<pkg>.<ClassName>` |
| `interface_declaration` (name) | 1 + recurse body | `<pkg>.<InterfaceName>` |
| `enum_declaration` (name) | 1 | `<pkg>.<EnumName>` |
| `record_declaration` (name, Java 14+) | 1 | `<pkg>.<RecordName>` |
| `annotation_type_declaration` (name) | 1 | `<pkg>.<AnnotationName>` |
| Inside class body: `method_declaration` (name) | 1 | `<pkg>.<Class>.<method>` |
| Inside class body: `constructor_declaration` (name = class name) | 1 | `<pkg>.<Class>.<ClassName>` (matches Java convention) |
| Nested classes recurse with class name pushed onto mod_path | as above | `<pkg>.<Outer>.<Inner>` etc. |
| `import_declaration`, `package_declaration` | glue | `<pkg>.<top-level>` |
| `field_declaration` at top of class | NOT a unit in 1C-JK (would explode unit count for value-only fields) | n/a |
`unit_start` walks `comment` siblings; Java has `@interface` annotations but those are part of `annotation_type_declaration` itself, not separate sibling nodes.
`mod_path` = class nesting (like 1B Python). Empty at file top level.
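The way `mod_path` feeds into the symbol can be sketched as a simple join. This is an illustration under stated assumptions: `java_symbol_sketch` is a hypothetical name, and the behavior for a file with no `package` declaration (no prefix at all) is a guess that the plan does not specify.

```rust
/// Joins package, enclosing class names, and the unit name with dots.
/// `mod_path` is the stack of enclosing class names, outermost first.
fn java_symbol_sketch(pkg: Option<&str>, mod_path: &[&str], name: &str) -> String {
    let mut parts: Vec<&str> = Vec::new();
    if let Some(p) = pkg {
        parts.push(p);
    }
    parts.extend_from_slice(mod_path);
    parts.push(name);
    parts.join(".")
}
```

A constructor is just the case where `name` repeats the innermost class name, which gives `<pkg>.<Class>.<ClassName>` as in the table.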
### Fixture `tests/fixtures/sample.java`:
```java
// sample.java
package com.kebab.chunk;
import java.util.List;
import java.util.stream.Collectors;
/**
* Heading-aware Markdown chunker.
*/
public class MdHeadingV1Chunker {
private final String name;
public MdHeadingV1Chunker(String name) {
this.name = name;
}
public List<String> chunkDoc(String input) {
return List.of(name, input);
}
public String getName() {
return name;
}
public static class Builder {
private String name;
public Builder withName(String n) { this.name = n; return this; }
public MdHeadingV1Chunker build() { return new MdHeadingV1Chunker(name); }
}
}
interface Stringer {
String asString();
}
enum Mode { DEFAULT, FAST }
```
### Test module (inline `#[cfg(test)] mod tests`)
Mirror 1C-Go shape:
```rust
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.java"),
).unwrap();
let asset = crate::rust::tests_support::fixed_code_asset(
"crates/x/src/sample.java", "java",
);
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
JavaAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_java() { /* ... */ }
#[test]
fn java_units_match_design_3_4_symbols() {
let doc = extract_fixture();
let mut syms: Vec<String> = doc.blocks.iter().filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(lang.as_deref(), Some("java"));
symbol.clone()
}
_ => None,
},
_ => None,
}).collect();
syms.sort();
// The package prefix comes from the source's `package` declaration (com.kebab.chunk), not from the workspace path.
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"), "got {syms:?}");
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.MdHeadingV1Chunker")); // constructor
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"));
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"));
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder"));
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.withName"));
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.build"));
assert!(syms.iter().any(|s| s == "com.kebab.chunk.Stringer"));
assert!(syms.iter().any(|s| s == "com.kebab.chunk.Mode"));
assert!(syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"));
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 { assert_eq!(extract_fixture().blocks, a.blocks); }
}
}
```
### Wire into lib.rs
```rust
pub mod java;
pub use java::{PARSER_VERSION as JAVA_PARSER_VERSION, JavaAstExtractor};
```
### Verify + commit
- `cargo test -p kebab-parse-code` → all pass
- `cargo clippy -p kebab-parse-code --all-targets -- -D warnings` clean
- commit `feat(p10-1c-jk): tree-sitter-java AST extractor (JavaAstExtractor)`
---
## Task E: `code-java-ast-v1` chunker
Identical pattern to 1C-Go Task E. Duplicate `code_rust_ast_v1.rs` with substitutions:
- `VERSION_LABEL = "code-java-ast-v1"`, struct `CodeJavaAstV1Chunker`
- error message + module doc-comment prose
- Test module: parser_version `"code-java-v1"`, code_lang `"java"`
- Keep cross-chunker `policy_hash_matches_md_heading_v1`
Wire into `crates/kebab-chunk/src/lib.rs` (alphabetical). Verify + commit.
---
## Task F: Activate Java in app dispatch
Replace the `"java"` `bail!()` arms in `ingest_one_code_asset` with real calls (`JavaAstExtractor` + `CodeJavaAstV1Chunker`). Add integration test `java_file_ingests_and_searches_as_code_citation` (mirror 1C-Go test, fixture `pkg_dir/Foo.java` with `package com.foo;` and `public class Foo { public String bar() { ... } }`, assert symbol `com.foo.Foo.bar`).
Verify + commit.
---
## Task G: `KotlinAstExtractor`
**Files:**
- Create: `crates/kebab-parse-code/src/kotlin.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs`
- Create: `crates/kebab-parse-code/tests/fixtures/sample.kt`
Constants: `PARSER_VERSION = "code-kotlin-v1"`, `KotlinAstExtractor`, `code_lang = "kotlin"`.
### Package extraction (Kotlin)
tree-sitter-kotlin grammar:
- Root: `source_file`
- `package_header` (top-level) → contains `identifier` (a dotted name is expected to be the text of a single `identifier` node; verify against node-types.json)
```rust
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
    let mut cur = root.walk();
    for child in root.named_children(&mut cur) {
        if child.kind() == "package_header" {
            let mut c2 = child.walk();
            for sub in child.named_children(&mut c2) {
                if sub.kind() == "identifier" {
                    return Some(src[sub.start_byte()..sub.end_byte()].to_string());
                }
            }
        }
    }
    None
}
```
(Verify against tree-sitter-kotlin's node-types.json — Kotlin grammar varies more than Java's.)
### AST mapping (Kotlin)
| node kind | unit | symbol |
|-----------|------|--------|
| `class_declaration` (name field) — covers `class`, `data class`, `sealed class`, `enum class`, `interface` (Kotlin's interface is a class_declaration variant) | 1 + recurse body | `<pkg>.<ClassName>` |
| `object_declaration` (name) — singleton | 1 + recurse | `<pkg>.<ObjectName>` |
| `function_declaration` (name) | 1 | `<pkg>.<fn_name>` (top-level) or `<pkg>.<Class>.<method>` (inside class) |
| Inside class body: `function_declaration` → method | 1 | `<pkg>.<Class>.<method>` |
| `property_declaration` at top-level (`val` / `var`) | glue | `<top-level>` (Kotlin top-level properties are common — keep as glue not unit) |
| `import_header`, `package_header` | glue | `<top-level>` |
(Detect class-vs-interface via modifier; for the first 1C pass, treat both as the `class_declaration` arm, since the symbol differs only by name. If tree-sitter-kotlin exposes the `interface` keyword via the modifier list and special handling turns out to be needed, record it in HOTFIXES.)
### Fixture `sample.kt`:
```kotlin
// sample.kt
package com.kebab.chunk
import java.util.List
/**
* Heading-aware Markdown chunker.
*/
class MdHeadingV1Chunker(val name: String) {
fun chunkDoc(input: String): List<String> = listOf(name, input)
fun getName(): String = name
companion object {
fun withName(n: String): MdHeadingV1Chunker = MdHeadingV1Chunker(n)
}
}
interface Stringer {
fun asString(): String
}
enum class Mode { DEFAULT, FAST }
fun freeFunction(x: Int): Int = x + 1
object Singleton {
fun ping(): String = "pong"
}
```
### Test module — assert symbols
```rust
// Asserted symbols; `syms` is collected the same way as in the Java test.
let expected = [
    "com.kebab.chunk.MdHeadingV1Chunker",
    "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc",
    "com.kebab.chunk.MdHeadingV1Chunker.getName",
    "com.kebab.chunk.MdHeadingV1Chunker.Companion", // companion object (verify name)
    "com.kebab.chunk.MdHeadingV1Chunker.Companion.withName", // method on companion
    "com.kebab.chunk.Stringer",
    "com.kebab.chunk.Mode",
    "com.kebab.chunk.freeFunction", // top-level fn (Kotlin-specific!)
    "com.kebab.chunk.Singleton",
    "com.kebab.chunk.Singleton.ping",
    "com.kebab.chunk.<top-level>", // package + import glue
];
for want in expected {
    assert!(syms.iter().any(|s| s == want), "missing {want}; got {syms:?}");
}
```
(Companion object: tree-sitter-kotlin may use `companion_object` or `object_declaration` with `companion` modifier — verify and adjust the symbol if `Companion` isn't the right name.)
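In Kotlin itself, a companion object declared without a name is referred to as `Companion`, which is why the asserted symbols use that segment. A minimal sketch of the naming fallback follows; `companion_segment_sketch` and `kotlin_symbol_sketch` are hypothetical helpers, and how the grammar exposes an explicit companion name still has to be verified as noted above.

```rust
/// Returns the path segment for a companion object:
/// the explicit name if one was declared, otherwise Kotlin's default `Companion`.
fn companion_segment_sketch(explicit_name: Option<&str>) -> &str {
    explicit_name.unwrap_or("Companion")
}

/// Joins package, enclosing declarations, and the unit name with dots.
fn kotlin_symbol_sketch(pkg: &str, mod_path: &[&str], name: &str) -> String {
    let mut parts = vec![pkg];
    parts.extend_from_slice(mod_path);
    parts.push(name);
    parts.join(".")
}
```

With an empty `mod_path` the same join covers top-level functions such as `freeFunction`.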
### Wire into lib.rs
```rust
pub mod kotlin;
pub use kotlin::{PARSER_VERSION as KOTLIN_PARSER_VERSION, KotlinAstExtractor};
```
Verify + commit.
---
## Task H: `code-kotlin-ast-v1` chunker
Same pattern as Task E. Substitute kotlin labels. Verify + commit.
---
## Task I: Activate Kotlin in app dispatch
Replace `"kotlin"` bail arms with real calls. Add integration test `kotlin_file_ingests_and_searches_as_code_citation`. Verify + commit.
---
## Task J: Snapshots + full-suite + SMOKE
- Create 2 snapshot tests (`code_java_ast_snapshot.rs`, `code_kotlin_ast_snapshot.rs`) + baselines. Mirror 1C-Go Task G snapshot test.
- ONE workspace test + clippy invocation.
- Manual SMOKE: write a `.java` and `.kt` file in TempDir, ingest, search.
Verify + commit (snapshot only).
---
## Task K: Docs + version bump
- README + HANDOFF + ARCHITECTURE + SMOKE + 2 INDEX updates + design §10.1.
- `Cargo.toml` version `0.12.0 → 0.13.0` (minor, surface expansion).
Commit `docs(p10-1c-jk): ... + chore: bump 0.12.0 → 0.13.0`.
---
## Finalize
`gitea-pr` → review loop → merge → main pull → branch cleanup → `cargo clean` → `gitea-release v0.13.0`.
---
## Self-Review (filled by plan author)
- **Spec coverage**: design §1C Java + Kotlin → Tasks D-I; §3.4 symbol path → extractor (Java D, Kotlin G); §6.1/§6.2 module structure → Tasks D/E/G/H; §6.3 dep graph → Task A; §9.1 Tier-1 + oversize fallback → chunkers E/H.
- **No placeholders**: novel logic (Java `extract_package`, Kotlin `extract_package`, AST walk arm tables) given concretely. Chunkers (E, H) are explicit "duplicate code_rust_ast_v1.rs with substitution X/Y/Z".
- **Type consistency**: `JavaAstExtractor` / `JAVA_PARSER_VERSION` / `CodeJavaAstV1Chunker` + `KotlinAstExtractor` / `KOTLIN_PARSER_VERSION` / `CodeKotlinAstV1Chunker` used consistently. `MediaType::Code("java")` / `("kotlin")` in routing + dispatch.
- **Kotlin grammar risk**: noted — tree-sitter-kotlin's exact node kinds (`class_declaration` vs `object_declaration`, `companion_object` vs companion modifier, `package_header` vs `package_directive`) should be verified against the resolved crate's node-types.json. Pin contract via test fixture; HOTFIXES any deviation found during implementation.

@@ -0,0 +1,930 @@
# p10-1D C + C++ AST Chunkers Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Activate C + C++ code ingest end-to-end. P10 Tier 1 chunker family final entry.
**Architecture:** Same shape as 1B (multi-language single PR) and 1C-JK (JVM family). 2 new tree-sitter grammars + 2 extractors + 2 chunkers + media routing (delegated via `code_lang_for_path`, no change) + app dispatch arms. C symbol = function name only; C++ symbol = `namespace::Class::method` via recursive class/namespace nesting (Java/Kotlin + Python hybrid).
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already), `tree-sitter-c` + `tree-sitter-cpp` (NEW). 1A-2/1B/1C/p10-2/p10-3 infrastructure unchanged.
**Memory note:** The host has hit OOM before (it had to be rebooted). Per-crate cargo only. ONE full-suite + clippy invocation in Task J. NO `cargo test --workspace` outside that gate.
---
## Pre-flight
Branch `feat/p10-1d-c-cpp` already exists (spec commit `8add684`).
- [ ] **Disk hygiene**: Check `df -h /`. If usage is above 80%, run `cargo clean`.
Reference files:
- 1C-JK extractor: `crates/kebab-parse-code/src/{java,kotlin}.rs` — closest template for source-side identifier prefix (package vs namespace).
- 1B Python extractor: `crates/kebab-parse-code/src/python.rs` — class-nesting recursion model (relevant for C++ class nesting).
- 1A-2 chunker: `crates/kebab-chunk/src/code_rust_ast_v1.rs` — duplicate-with-substitution pattern.
- 1B/1C/p10-2/p10-3 dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` (~L1796–2116). Current allowlist + 4-arm match.
- spec: `tasks/p10/p10-1d-c-cpp-ast-chunker.md`.
---
## Task A: Workspace deps (tree-sitter-c + tree-sitter-cpp)
**Files:**
- Modify: `Cargo.toml` (`[workspace.dependencies]`, after `tree-sitter-kotlin-ng`)
- Modify: `crates/kebab-parse-code/Cargo.toml`
- [ ] **Step 1**: `cargo add tree-sitter-c tree-sitter-cpp -p kebab-parse-code`. If either crate's actively-maintained name differs (e.g. `tree-sitter-cpp` vs `tree-sitter-cpp-ng`), verify on crates.io. The `tree-sitter-c` 0.24 / `tree-sitter-cpp` 0.23 line is the most common; verify compatibility with workspace `tree-sitter = "0.26"` (likely already supported via the `tree-sitter-language` shim).
- [ ] **Step 2**: Lift the two resolved versions into `[workspace.dependencies]` (after `tree-sitter-kotlin-ng`):
```toml
# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D).
tree-sitter-c = "<resolved>"
tree-sitter-cpp = "<resolved>"
```
Switch crate's `Cargo.toml` entries to `{ workspace = true }`.
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
- [ ] **Step 4**: Commit:
```bash
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
git commit -m "$(cat <<'EOF'
build(p10-1d): add tree-sitter-c + tree-sitter-cpp workspace deps
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
If a crate's resolved name has a non-obvious fork suffix (e.g. `tree-sitter-cpp-ng`), document it in the commit body.
---
## Task B: C AST extractor (`kebab-parse-code/src/c.rs`)
**Files:**
- Create: `crates/kebab-parse-code/src/c.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs` (pub mod + `C_PARSER_VERSION` const)
- [ ] **Step 1**: Create `crates/kebab-parse-code/src/c.rs`. Mirror `crates/kebab-parse-code/src/go.rs` (closest template — single-language, no namespace/package nesting, top-level units). Replace tree-sitter-go with tree-sitter-c:
```rust
//! p10-1D: C AST extractor.
use crate::traits::{Extractor, ExtractContext};
use anyhow::{Context, Result};
use kebab_core::{Block, BlockId, CanonicalDocument, CodeBlock, CommonBlock, /* .. */ SourceSpan, id_for_block, id_for_doc};
use tree_sitter::Parser;
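pub const C_PARSER_VERSION: &str = concat!("tree-sitter-c-", env!("CARGO_PKG_VERSION"));
// NOTE: CARGO_PKG_VERSION expands to the version of the crate being compiled
// (kebab-parse-code), not the tree-sitter-c grammar version. Hardcoding the
// resolved tree-sitter-c version is likely better for stability.
// Look at how go.rs / rust.rs / etc. set their PARSER_VERSION.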
pub struct CAstExtractor {
parser: Parser,
}
impl CAstExtractor {
pub fn new() -> Self {
let mut parser = Parser::new();
parser.set_language(&tree_sitter_c::LANGUAGE.into()).expect("load tree-sitter-c");
Self { parser }
}
}
impl Extractor for CAstExtractor {
fn extract(&mut self, ctx: &ExtractContext, bytes: &[u8]) -> Result<CanonicalDocument> {
// ... mirror go.rs:
// 1. parse the tree
// 2. iterate source_file's named_children
// 3. for each top-level node:
// - function_definition → emit unit (symbol = fn name)
// - struct_specifier (named) → emit unit (symbol = struct name)
// - enum_specifier (named) → emit unit (symbol = enum name)
// - union_specifier (named) → emit unit (symbol = union name)
// - declaration → glue
// - preproc_include / preproc_def / preproc_function_def / preproc_ifdef → glue
// - else → glue
// 4. <top-level> glue chunk if any glue accumulated
// 5. <module> post-pass if 0 units
// ...
todo!("mirror go.rs structure with C-specific node-kind names")
}
}
```
**ACTION**: Read `crates/kebab-parse-code/src/go.rs` in full first. It's the closest template — single-language, no namespace prefix to thread through (C is even simpler than Go since there's no `package`). Port the structure: parse → iterate top-level → match on node-kind → emit units or accumulate glue.
Node-kind name reference (tree-sitter-c): `function_definition`, `struct_specifier`, `enum_specifier`, `union_specifier`, `declaration`, `preproc_*`. Confirm by checking the crate's `node-types.json` if uncertain.
**Function name extraction**: `function_definition` has a `declarator` field. The innermost `identifier` of that declarator is the function name. Mirror how go.rs extracts function names — it uses tree-sitter field traversal.
- [ ] **Step 2**: Register the module in `crates/kebab-parse-code/src/lib.rs`:
```rust
pub mod c;
pub use c::{CAstExtractor, C_PARSER_VERSION};
```
- [ ] **Step 3**: Build:
```bash
cargo build -p kebab-parse-code 2>&1 | tail -5
```
Expected: clean.
- [ ] **Step 4**: Commit (no test yet — Task D adds the snapshot test):
```bash
git add crates/kebab-parse-code/src/c.rs crates/kebab-parse-code/src/lib.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): C AST extractor (tree-sitter-c)
Top-level units: function_definition (symbol = fn name), struct_specifier,
enum_specifier, union_specifier (each emits 1 unit with the symbol being
the named identifier). Preprocessor directives + top-level declarations
group into a <top-level> glue chunk. Empty file or zero units → <module>
post-pass.
C symbol = function name only — no namespace, no class nesting (design §3.4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
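The two rules from Step 1 (classify top-level node kinds, and take the innermost identifier of the `declarator` as the function name) can be modeled without tree-sitter. The sketch below is only an illustration: `Declarator`, `TopLevel`, `innermost_identifier`, and `classify` are hypothetical names, and the `Pointer(Function(Ident))` nesting is an assumption about how the grammar shapes `char *find_name(int id)` that should be confirmed against `node-types.json`.

```rust
/// Toy stand-in for a tree-sitter-c declarator subtree (hypothetical model,
/// not the real tree-sitter API).
#[derive(Debug)]
enum Declarator {
    /// `*inner`
    Pointer(Box<Declarator>),
    /// `inner(params...)`
    Function(Box<Declarator>),
    /// Leaf identifier.
    Ident(String),
}

/// Descend through wrapper declarators until the leaf identifier is reached.
fn innermost_identifier(d: &Declarator) -> &str {
    match d {
        Declarator::Pointer(inner) | Declarator::Function(inner) => innermost_identifier(inner),
        Declarator::Ident(name) => name.as_str(),
    }
}

#[derive(Debug, PartialEq)]
enum TopLevel {
    Unit,
    Glue,
}

/// Classification table from Step 1: function definitions and *named* type
/// specifiers become units; everything else is accumulated as glue.
fn classify(kind: &str, has_name: bool) -> TopLevel {
    match kind {
        "function_definition" => TopLevel::Unit,
        "struct_specifier" | "enum_specifier" | "union_specifier" if has_name => TopLevel::Unit,
        _ => TopLevel::Glue,
    }
}

fn main() {
    // Assumed shape for `char *find_name(int id)`: Pointer(Function(Ident)).
    let d = Declarator::Pointer(Box::new(Declarator::Function(Box::new(
        Declarator::Ident("find_name".to_string()),
    ))));
    println!("{}", innermost_identifier(&d)); // find_name
    println!("{:?}", classify("preproc_include", false)); // Glue
    println!("{:?}", classify("struct_specifier", true)); // Unit
}
```

In the real extractor the same descent would presumably follow the `declarator` field of each wrapper node rather than a Rust enum.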
---
## Task C: C++ AST extractor (`kebab-parse-code/src/cpp.rs`)
**Files:**
- Create: `crates/kebab-parse-code/src/cpp.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs`
- [ ] **Step 1**: Create `crates/kebab-parse-code/src/cpp.rs`. The closest template is `crates/kebab-parse-code/src/java.rs` (1C-JK) — it handles package prefix + class nesting via recursion. C++ adds namespace nesting (multiple levels possible).
Pseudocode:
```rust
//! p10-1D: C++ AST extractor.
use crate::traits::{Extractor, ExtractContext};
use anyhow::{Context, Result};
use kebab_core::{/* ... */};
use tree_sitter::{Node, Parser};
pub const CPP_PARSER_VERSION: &str = "tree-sitter-cpp-<resolved>";
pub struct CppAstExtractor { parser: Parser }
impl CppAstExtractor {
pub fn new() -> Self {
let mut parser = Parser::new();
parser.set_language(&tree_sitter_cpp::LANGUAGE.into()).expect("load tree-sitter-cpp");
Self { parser }
}
fn visit(&self, node: Node, source: &[u8], prefix: &[&str], units: &mut Vec<(String, Node)>, glue: &mut Vec<Node>) {
// prefix is the namespace/class chain so far (e.g. ["kebab", "chunk", "MdHeadingV1Chunker"]).
for child in node.named_children(&mut node.walk()) {
match child.kind() {
"namespace_definition" => {
let name = child.child_by_field_name("name")
.and_then(|n| n.utf8_text(source).ok())
.unwrap_or("<anonymous>");
let mut new_prefix = prefix.to_vec();
new_prefix.push(name);
let body = child.child_by_field_name("body").unwrap_or(child);
self.visit(body, source, &new_prefix, units, glue);
}
"class_specifier" | "struct_specifier" if child.child_by_field_name("name").is_some() => {
let name = child.child_by_field_name("name")
.and_then(|n| n.utf8_text(source).ok())
.unwrap_or("<anonymous>");
// Emit the class itself as a unit.
let symbol = build_symbol(prefix, &[], name); // e.g. "kebab::chunk::Foo"
units.push((symbol, child));
// Recurse for nested classes / methods.
let mut new_prefix = prefix.to_vec();
new_prefix.push(name);
let body = child.child_by_field_name("body").unwrap_or(child);
self.visit(body, source, &new_prefix, units, glue);
}
"function_definition" => {
// declarator may be qualified_identifier (out-of-class def) or plain identifier.
let symbol = extract_fn_symbol(child, source, prefix);
units.push((symbol, child));
// Do NOT recurse into function body — inner classes/lambdas left to a future revision.
}
"template_declaration" => {
// Recurse: unwrap to inner declarator (function_definition or class_specifier)
// and treat it as if it were directly there. Template params NOT in symbol.
self.visit(child, source, prefix, units, glue);
}
"enum_specifier" if child.child_by_field_name("name").is_some() => {
let name = child.child_by_field_name("name").and_then(|n| n.utf8_text(source).ok()).unwrap_or("<anonymous>");
let symbol = build_symbol(prefix, &[], name);
units.push((symbol, child));
}
"concept_definition" => {
let name = /* extract */;
let symbol = build_symbol(prefix, &[], &name);
units.push((symbol, child));
}
_ => glue.push(child),
}
}
}
}
fn build_symbol(prefix: &[&str], extras: &[&str], leaf: &str) -> String {
// Join with ::
let mut parts: Vec<&str> = prefix.iter().copied().collect();
parts.extend_from_slice(extras);
parts.push(leaf);
parts.join("::")
}
fn extract_fn_symbol(node: Node, source: &[u8], prefix: &[&str]) -> String {
// function_definition.declarator may be a function_declarator wrapping a
// qualified_identifier (out-of-class def like `void Foo::bar(){}`) or a
// plain identifier (free fn or in-namespace fn).
// Need to walk down to the leaf identifier and any qualifier chain.
// For qualified_identifier "Foo::bar::baz", break into ["Foo", "bar"] qualifier + "baz" leaf.
// ...
todo!("walk declarator → qualified_identifier → assemble symbol with prefix")
}
// Extractor impl: parse, visit(root, ...), emit chunks-of-blocks per (symbol, node) pair + <top-level> glue + <module> fallback.
```
This is the most intricate extractor in p10-1D. **Action**: read `crates/kebab-parse-code/src/java.rs` for the recursion pattern, then `crates/kebab-parse-code/src/python.rs` for the class-nesting pattern, and combine. tree-sitter-cpp's node-types.json (or a quick `tree-sitter parse` against a sample file) confirms exact node-kind names.
- [ ] **Step 2**: Register in `crates/kebab-parse-code/src/lib.rs`:
```rust
pub mod cpp;
pub use cpp::{CppAstExtractor, CPP_PARSER_VERSION};
```
- [ ] **Step 3**: Build:
```bash
cargo build -p kebab-parse-code 2>&1 | tail -5
```
Expected: clean.
- [ ] **Step 4**: Commit:
```bash
git add crates/kebab-parse-code/src/cpp.rs crates/kebab-parse-code/src/lib.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): C++ AST extractor (tree-sitter-cpp)
Symbol = namespace::Class::method via recursive visit. namespace_definition
pushes namespace name (anonymous → <anonymous>). class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition
emits method unit (symbol may include qualified_identifier prefix for
out-of-class definitions). template_declaration unwraps to inner declarator
(template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. extern "C" block content + using/include/define → glue.
Constructor / destructor symbols use Class::Class / Class::~Class
convention. Operator overloads keep operator+ form.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
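The symbol assembly that `extract_fn_symbol` leaves as a `todo!` can be sketched on plain strings. This is a simplified, assumption-laden version: `split_qualified` and `fn_symbol` are hypothetical helpers that split the declarator *text* on `::`, whereas the real extractor would walk `qualified_identifier` nodes. Text splitting breaks on template arguments that themselves contain `::`.

```rust
/// Join the namespace/class prefix, optional qualifier segments, and the
/// leaf name with `::` (same shape as `build_symbol` in the plan).
fn build_symbol(prefix: &[&str], extras: &[&str], leaf: &str) -> String {
    let mut parts: Vec<&str> = prefix.to_vec();
    parts.extend_from_slice(extras);
    parts.push(leaf);
    parts.join("::")
}

/// Split declarator text such as `Foo::bar` into (qualifier, leaf).
/// Empty segments (from a leading `::`) are dropped.
/// Hypothetical helper: not robust for names like `A<B::C>::f`.
fn split_qualified(name: &str) -> (Vec<&str>, &str) {
    let mut segments: Vec<&str> = name.split("::").filter(|s| !s.is_empty()).collect();
    let leaf = segments.pop().unwrap_or(name);
    (segments, leaf)
}

/// Assemble the full symbol for a function definition found under `prefix`.
fn fn_symbol(prefix: &[&str], declarator_name: &str) -> String {
    let (qualifier, leaf) = split_qualified(declarator_name);
    build_symbol(prefix, &qualifier, leaf)
}

fn main() {
    // Out-of-class definition inside `namespace kebab { ... }`.
    println!("{}", fn_symbol(&["kebab"], "Foo::bar")); // kebab::Foo::bar
    // Free function at file scope.
    println!("{}", fn_symbol(&[], "main")); // main
    // Destructor declared inside the class body.
    println!("{}", fn_symbol(&["kebab", "Foo"], "~Foo")); // kebab::Foo::~Foo
}
```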
---
## Task D: C chunker + snapshot test
**Files:**
- Create: `crates/kebab-chunk/src/code_c_ast_v1.rs`
- Create: `crates/kebab-chunk/tests/fixtures/sample.c`
- Create: `crates/kebab-chunk/tests/code_c_ast_snapshot.rs`
- Modify: `crates/kebab-chunk/src/lib.rs`
- [ ] **Step 1**: Create `crates/kebab-chunk/src/code_c_ast_v1.rs`. **Mirror `crates/kebab-chunk/src/code_go_ast_v1.rs`** (closest 1-extractor pattern, no nesting):
```rust
//! p10-1D: C AST chunker.
use crate::tier2_shared::build_chunk;
use crate::{Chunker, ChunkPolicy};
use anyhow::Result;
use kebab_core::{Block, Chunk, Document};
pub const VERSION_LABEL: &str = "code-c-ast-v1";
pub struct CodeCAstV1Chunker;
impl Chunker for CodeCAstV1Chunker {
fn chunker_version(&self) -> &'static str { VERSION_LABEL }
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
crate::tier2_shared::policy_hash(policy)
}
fn chunk(&self, doc: &Document, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Mirror code_go_ast_v1.rs's body — iterate doc.blocks, each Block::Code
// contributes 1 chunk via build_chunk. Apply oversize fallback per block
// via tier2_shared::push_chunks_with_oversize.
// ...
todo!("mirror code_go_ast_v1.rs verbatim, substituting VERSION_LABEL")
}
}
```
Read `code_go_ast_v1.rs` and port verbatim — the language-agnostic body iterates `doc.blocks` and emits chunks. Only the `VERSION_LABEL` and (potentially) symbol formatting helper change.
- [ ] **Step 2**: Create `tests/fixtures/sample.c` (~30 lines, includes top-level fn, struct, enum, preprocessor):
```c
#include <stdio.h>
#include <stdlib.h>
#define MAX_BUF 4096
typedef enum {
OK = 0,
ERR_PARSE,
ERR_IO,
} status_t;
typedef struct {
int id;
char name[64];
status_t status;
} record_t;
static int counter = 0;
int parse_record(const char *line, record_t *out) {
if (line == NULL || out == NULL) return ERR_PARSE;
return OK;
}
void print_record(const record_t *r) {
printf("[%d] %s (status=%d)\n", r->id, r->name, r->status);
}
int main(void) {
record_t r = { .id = 1, .name = "foo", .status = OK };
print_record(&r);
return 0;
}
```
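Expected snapshot: 3 function units (`parse_record`, `print_record`, `main`) + 1 enum unit (`status_t`) + 1 struct unit (`record_t`) + 1 `<top-level>` glue (preproc + global var). Total ~6 chunks. Caveat: `typedef enum { ... } status_t;` and `typedef struct { ... } record_t;` wrap an anonymous specifier inside a typedef, so with only the node-kind arms listed in Task B they may land in glue rather than becoming units. Confirm on the first run, then either handle the typedef wrapper in the extractor or switch the fixture to named `enum` / `struct` declarations.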
- [ ] **Step 3**: Create `tests/code_c_ast_snapshot.rs` mirroring `tests/code_go_ast_snapshot.rs`. Assertions:
```rust
// Pseudocode:
// 1. Load fixture sample.c
// 2. Run CAstExtractor → Document
// 3. Run CodeCAstV1Chunker.chunk(&doc, &policy)
// 4. Assert chunks.len() == expected (6).
// 5. Assert symbols (from chunks[i].source_spans[0]::SourceSpan::Code.symbol) match expected list:
// ["status_t", "record_t", "parse_record", "print_record", "main", "<top-level>"]
// (order matches AST traversal order — verify by running once.)
// 6. Assert all chunks have lang = Some("c").
```
- [ ] **Step 4**: Register module in `crates/kebab-chunk/src/lib.rs`:
```rust
pub mod code_c_ast_v1;
pub use code_c_ast_v1::CodeCAstV1Chunker;
```
- [ ] **Step 5**: Run test:
```bash
cargo test -p kebab-chunk --test code_c_ast_snapshot -- --nocapture 2>&1 | tail -25
```
Expected: PASS. If chunk count or symbol order differs from expectation, INSPECT the actual output and update the test's expected list to match (run once to learn, codify on second run).
- [ ] **Step 6**: Clippy + commit:
```bash
cargo clippy -p kebab-chunk --all-targets -- -D warnings
git add crates/kebab-chunk/src/code_c_ast_v1.rs \
crates/kebab-chunk/src/lib.rs \
crates/kebab-chunk/tests/fixtures/sample.c \
crates/kebab-chunk/tests/code_c_ast_snapshot.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): code-c-ast-v1 chunker + snapshot test
Mirrors code-go-ast-v1's chunker pattern (1 chunk per AST unit + <top-level>
glue + oversize fallback). Snapshot test against tests/fixtures/sample.c
(function + struct + enum + preprocessor) verifies symbol order + lang=c
stamping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task E: C++ chunker + snapshot test
**Files:**
- Create: `crates/kebab-chunk/src/code_cpp_ast_v1.rs`
- Create: `crates/kebab-chunk/tests/fixtures/sample.cpp`
- Create: `crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs`
- Modify: `crates/kebab-chunk/src/lib.rs`
- [ ] **Step 1**: Create `code_cpp_ast_v1.rs`. **Mirror `code_c_ast_v1.rs`** verbatim, only VERSION_LABEL differs:
```rust
pub const VERSION_LABEL: &str = "code-cpp-ast-v1";
pub struct CodeCppAstV1Chunker;
impl Chunker for CodeCppAstV1Chunker {
fn chunker_version(&self) -> &'static str { VERSION_LABEL }
// ... identical body — both languages use the same Block::Code → Chunk emission ...
}
```
The actual symbol-formatting work happens in the EXTRACTOR (Task C). The chunker's job is to iterate blocks the extractor produced and emit Chunks. Both C and C++ chunkers are essentially identical bodies.
- [ ] **Step 2**: Create `tests/fixtures/sample.cpp` (~50 lines, includes namespace + nested class + method + free fn + template):
```cpp
#include <string>
#include <vector>
namespace kebab {
namespace chunk {
class MdHeadingV1Chunker {
public:
MdHeadingV1Chunker() = default;
~MdHeadingV1Chunker() = default;
std::string chunk_doc(const std::string& doc) {
return doc;
}
int operator()(int x) const {
return x * 2;
}
private:
int counter_ = 0;
};
template <typename T>
T identity(T value) {
return value;
}
} // namespace chunk
void global_helper() {
// free function in kebab namespace
}
} // namespace kebab
int main() {
kebab::chunk::MdHeadingV1Chunker c;
return 0;
}
```
Expected snapshot symbols (verify on first run, then codify):
- `kebab::chunk::MdHeadingV1Chunker` (class unit)
- `kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker` (constructor)
- `kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker` (destructor)
- `kebab::chunk::MdHeadingV1Chunker::chunk_doc`
- `kebab::chunk::MdHeadingV1Chunker::operator()`
- `kebab::chunk::identity` (template fn)
- `kebab::global_helper`
- `main` (free fn, no namespace)
- `<top-level>` (include + using)
~9 chunks total.
- [ ] **Step 3**: Create `tests/code_cpp_ast_snapshot.rs` mirroring `code_c_ast_snapshot.rs`. Assert symbol list matches expected (run once to learn the actual order, codify).
- [ ] **Step 4**: Register module in `lib.rs`:
```rust
pub mod code_cpp_ast_v1;
pub use code_cpp_ast_v1::CodeCppAstV1Chunker;
```
- [ ] **Step 5**: Run test:
```bash
cargo test -p kebab-chunk --test code_cpp_ast_snapshot -- --nocapture 2>&1 | tail -30
```
Expected: PASS.
- [ ] **Step 6**: Clippy + commit:
```bash
cargo clippy -p kebab-chunk --all-targets -- -D warnings
git add crates/kebab-chunk/src/code_cpp_ast_v1.rs \
crates/kebab-chunk/src/lib.rs \
crates/kebab-chunk/tests/fixtures/sample.cpp \
crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): code-cpp-ast-v1 chunker + snapshot test
Identical chunker body to code-c-ast-v1; per-language work happens in the
CppAstExtractor (Task C). Snapshot fixture covers nested namespace +
class + ctor/dtor + method + operator overload + template fn + free fn +
top-level main, verifying namespace::Class::method symbol convention per
design §3.4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
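To sanity-check the expected symbol list from Step 2 before the real extractor exists, the prefix-stack recursion can be run over a hand-built miniature of `sample.cpp`. Everything here is a hypothetical model (`Item`, `visit`, `sample_cpp`); it mirrors the namespace/class/function arms of Task C but does not touch tree-sitter, and it omits the `<top-level>` glue chunk.

```rust
/// Hypothetical miniature AST; the real code walks tree-sitter nodes.
enum Item {
    Namespace(&'static str, Vec<Item>),
    Class(&'static str, Vec<Item>),
    Function(&'static str),
}

/// Join prefix segments and a leaf name with `::`.
fn symbol(prefix: &[&str], leaf: &str) -> String {
    let mut parts: Vec<&str> = prefix.to_vec();
    parts.push(leaf);
    parts.join("::")
}

/// Depth-first walk: namespaces only extend the prefix, classes emit a unit
/// and extend the prefix, functions emit a unit.
fn visit(items: &[Item], prefix: &mut Vec<&'static str>, out: &mut Vec<String>) {
    for item in items {
        match item {
            Item::Namespace(name, body) => {
                prefix.push(*name);
                visit(body, prefix, out);
                prefix.pop();
            }
            Item::Class(name, body) => {
                out.push(symbol(prefix.as_slice(), *name));
                prefix.push(*name);
                visit(body, prefix, out);
                prefix.pop();
            }
            Item::Function(name) => out.push(symbol(prefix.as_slice(), *name)),
        }
    }
}

/// Hand-written outline of tests/fixtures/sample.cpp.
fn sample_cpp() -> Vec<Item> {
    vec![
        Item::Namespace("kebab", vec![
            Item::Namespace("chunk", vec![
                Item::Class("MdHeadingV1Chunker", vec![
                    Item::Function("MdHeadingV1Chunker"),
                    Item::Function("~MdHeadingV1Chunker"),
                    Item::Function("chunk_doc"),
                    Item::Function("operator()"),
                ]),
                Item::Function("identity"),
            ]),
            Item::Function("global_helper"),
        ]),
        Item::Function("main"),
    ]
}

fn main() {
    let mut out = Vec::new();
    visit(&sample_cpp(), &mut Vec::new(), &mut out);
    for s in &out {
        println!("{s}");
    }
}
```

The output order follows source order, which is what the snapshot test is expected to codify once the real traversal order is confirmed.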
---
## Task F: ingest_one_code_asset dispatch + tier3 fallback list extension
**Files:**
- Modify: `crates/kebab-app/src/lib.rs`
- [ ] **Step 1**: Top-of-file `use kebab_chunk::{...}` extend with `CodeCAstV1Chunker` + `CodeCppAstV1Chunker`:
```rust
use kebab_chunk::{
/* existing items */,
CodeCAstV1Chunker,
CodeCppAstV1Chunker,
};
```
- [ ] **Step 2**: Allowlist (around line 953) extend:
```rust
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "shell"
| "c" | "cpp")
```
- [ ] **Step 3**: `parser_version` match — add C/C++ arms (Tier 1, so they DO get a real parser version):
```rust
let parser_version = match code_lang {
// ... existing 7 Tier 1 + Tier 2 + shell arms ...
"c" => ParserVersion(kebab_parse_code::C_PARSER_VERSION.to_string()),
"cpp" => ParserVersion(kebab_parse_code::CPP_PARSER_VERSION.to_string()),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
```
- [ ] **Step 4**: `chunker_version` match — add C/C++ arms:
```rust
let chunker_version = match code_lang {
// ... existing arms ...
"c" => CodeCAstV1Chunker.chunker_version(),
"cpp" => CodeCppAstV1Chunker.chunker_version(),
other => anyhow::bail!("unreachable chunker_version: {other}"),
};
```
- [ ] **Step 5**: `canonical_result` extract match — add C/C++ arms:
```rust
let canonical_result: anyhow::Result<CanonicalDocument> = match code_lang {
"rust" => RustAstExtractor::new().extract(&ctx, &bytes).context("..."),
// ... existing ...
"c" => CAstExtractor::new().extract(&ctx, &bytes)
.context("kb-parse-code::CAstExtractor::extract (code:c)"),
"cpp" => CppAstExtractor::new().extract(&ctx, &bytes)
.context("kb-parse-code::CppAstExtractor::extract (code:cpp)"),
// ... Tier 2 + shell ...
other => anyhow::bail!("unreachable (extract): {other}"),
};
```
(Add `use kebab_parse_code::{CAstExtractor, CppAstExtractor};` at the top if not already wildcard-imported.)
- [ ] **Step 6**: `chunks_result` match — add C/C++ arms:
```rust
let chunks_result: anyhow::Result<Vec<Chunk>> = if extract_fell_back {
// ... existing ...
} else {
match code_lang {
"rust" => CodeRustAstV1Chunker.chunk(&canonical, chunk_policy).context("..."),
// ... existing ...
"c" => CodeCAstV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeCAstV1Chunker::chunk (code:c)"),
"cpp" => CodeCppAstV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeCppAstV1Chunker::chunk (code:cpp)"),
// ... existing ...
other => anyhow::bail!("unreachable (chunk): {other}"),
}
};
```
- [ ] **Step 7**: `tier3_fallback_cv` (p10-3 Critical fix) — C/C++ are fallback-eligible (extract may fail on `.h` C++ headers or malformed code):
```rust
let tier3_fallback_cv = match code_lang {
"rust" | "python" | "typescript" | "javascript"
| "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "c" | "cpp" // p10-1d:
=> Some(CodeTextParagraphV1Chunker.chunker_version()),
_ => None,
};
```
(The exact location of this match is in `ingest_one_code_asset` between ~lines 1921-1927 per the p10-3 critical fix.)
- [ ] **Step 8**: Build:
```bash
cargo build -p kebab-app 2>&1 | tail -5
```
Expected: clean.
- [ ] **Step 9**: Per-crate test (no regression):
```bash
cargo test -p kebab-app --lib -- --nocapture 2>&1 | tail -10
```
Expected: 52 PASS (existing baseline).
- [ ] **Step 10**: Clippy + commit:
```bash
cargo clippy -p kebab-app --all-targets -- -D warnings
git add crates/kebab-app/src/lib.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): activate C + C++ in ingest_one_code_asset dispatch
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv list with "c" + "cpp" arms. C uses
CAstExtractor + CodeCAstV1Chunker; C++ uses CppAstExtractor +
CodeCppAstV1Chunker. Both langs are Tier 3-fallback-eligible (e.g. .h
file with C++ syntax may fail tree-sitter-c parse → Tier 3 paragraph
fallback).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
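The interplay between the Tier 1 arms and the `tier3_fallback_cv` list from Step 7 can be summarized as a small decision function. This is a sketch under assumptions, not the body of `ingest_one_code_asset`: `tier1_chunker_version` and `effective_chunker_version` are hypothetical names, only the two languages added by this plan are modeled on the Tier 1 side, and the fallback label is taken from the verification matrix.

```rust
/// Tier 1 chunker labels added by this plan (other languages omitted).
fn tier1_chunker_version(lang: &str) -> Option<&'static str> {
    match lang {
        "c" => Some("code-c-ast-v1"),
        "cpp" => Some("code-cpp-ast-v1"),
        _ => None,
    }
}

/// Languages eligible for the Tier 3 paragraph fallback (list from Step 7).
fn tier3_fallback_cv(lang: &str) -> Option<&'static str> {
    match lang {
        "rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
        | "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
        | "c" | "cpp" => Some("code-text-paragraph-v1"),
        _ => None,
    }
}

/// Pick the chunker label that would be stamped on the resulting chunks:
/// the AST chunker when extraction succeeded, the fallback otherwise.
fn effective_chunker_version(lang: &str, extract_ok: bool) -> Option<&'static str> {
    if extract_ok {
        tier1_chunker_version(lang)
    } else {
        tier3_fallback_cv(lang)
    }
}

fn main() {
    // A `.h` file with C++ syntax routed to "c": extraction fails, fallback applies.
    println!("{:?}", effective_chunker_version("c", false));
    println!("{:?}", effective_chunker_version("cpp", true));
    // "shell" is not in the fallback list.
    println!("{:?}", effective_chunker_version("shell", false));
}
```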
---
## Task G: code_ingest_smoke integration tests (C + C++)
**Files:**
- Modify: `crates/kebab-app/tests/code_ingest_smoke.rs`
- [ ] **Step 1**: Append 2 tests at the end of the file (mirror the existing tier1 tests `c_ast_v1_*` if present; if not, mirror `rust_ast_v1_*` or `go_ast_v1_*`):
```rust
#[test]
fn tier1_c_ingest_searchable() {
let env = TestEnv::lexical_only();
let workspace = env.workspace_root();
std::fs::write(
workspace.join("parser.c"),
"#include <stdio.h>\n\nint parse_record(const char *line) {\n if (line == NULL) return -1;\n return 0;\n}\n",
)
.unwrap();
let report = env.ingest().expect("ingest");
assert!(report.new_docs >= 1, "expected at least 1 new doc");
let hits = env.search_code_lang("c", "parse_record").expect("search");
assert!(!hits.is_empty(), "expected at least 1 c hit");
match &hits[0].citation {
Citation::Code { symbol, lang, .. } => {
assert_eq!(symbol.as_deref(), Some("parse_record"), "C symbol must be function name only");
assert_eq!(lang.as_deref(), Some("c"));
}
other => panic!("expected Citation::Code, got {other:?}"),
}
assert_eq!(
hits[0].chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-c-ast-v1"),
);
}
#[test]
fn tier1_cpp_ingest_searchable() {
let env = TestEnv::lexical_only();
let workspace = env.workspace_root();
std::fs::write(
workspace.join("chunker.cpp"),
"namespace kebab {\nnamespace chunk {\nclass Foo {\npublic:\n void bar() { /* impl */ }\n};\n}\n}\n",
)
.unwrap();
let report = env.ingest().expect("ingest");
assert!(report.new_docs >= 1);
let hits = env.search_code_lang("cpp", "bar").expect("search");
assert!(!hits.is_empty(), "expected at least 1 cpp hit");
match &hits[0].citation {
Citation::Code { symbol, lang, .. } => {
// Symbol could be "kebab::chunk::Foo::bar" or "kebab::chunk::Foo" depending on which chunk hits first.
assert!(
symbol.as_deref().is_some_and(|s| s.starts_with("kebab::chunk::Foo")),
"C++ symbol must start with namespace::Class prefix, got {symbol:?}"
);
assert_eq!(lang.as_deref(), Some("cpp"));
}
other => panic!("expected Citation::Code, got {other:?}"),
}
assert_eq!(
hits[0].chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-cpp-ast-v1"),
);
}
```
- [ ] **Step 2**: Run tests:
```bash
cargo test -p kebab-app --test code_ingest_smoke tier1_c_ingest tier1_cpp_ingest -- --nocapture 2>&1 | tail -30
```
Expected: 2 PASS.
- [ ] **Step 3**: Full smoke regression:
```bash
cargo test -p kebab-app --test code_ingest_smoke -- --nocapture 2>&1 | tail -30
```
Expected: 18 PASS (16 existing + 2 new).
- [ ] **Step 4**: Clippy + commit:
```bash
cargo clippy -p kebab-app --tests -- -D warnings
git add crates/kebab-app/tests/code_ingest_smoke.rs
git commit -m "$(cat <<'EOF'
test(p10-1d): integration smoke tests for C + C++
Verifies end-to-end ingest + search + Citation::Code shape:
- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol
= function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1".
- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search →
symbol starts with namespace::Class prefix, lang = "cpp",
chunker_version = "code-cpp-ast-v1".
Brings code_ingest_smoke to 18 tests (Rust 3 + Python 1 + TS 1 + JS 1 +
Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1 + shell 1 +
yaml-fallback 1 + 2 reingest-unchanged regression + c 1 + cpp 1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task H: frozen design §10 activation log
**Files:**
- Modify: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`
- [ ] **Step 1**: Find §10 activation log. Add p10-1D entry right after the p10-3 entry:
```
**p10-1D activation (C + C++) (2026-05-21)**: Tier 1 chunker family complete. Activates the C (`code-c-ast-v1`, `.c`/`.h`) and C++ (`code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`) AST chunkers. C symbol = function name only; C++ symbol = `namespace::Class::method` (recursive namespace + class nesting). When a `.h` file contains C++ syntax, the tree-sitter-c parse fails and the file is picked up automatically by the p10-3 Tier 3 fallback.
```
- [ ] **Step 2**: Commit:
```bash
git add docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
git commit -m "$(cat <<'EOF'
docs(p10-1d): activate C + C++ in frozen design §10
P10 Tier 1 chunker family complete.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
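The extension mapping recorded in the activation entry can be written down as a lookup. The function below is a hypothetical stand-in (`c_family_lang_for_path` is not the project's `code_lang_for_path`), covers only the C family, and assumes lowercase extensions.

```rust
use std::path::Path;

/// Map a file path to "c" or "cpp" by extension (C family only).
/// Hypothetical helper; matching is case-sensitive, so `.C` / `.H` are not handled.
fn c_family_lang_for_path(path: &Path) -> Option<&'static str> {
    let ext = path.extension()?.to_str()?;
    match ext {
        "c" | "h" => Some("c"),
        "cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"),
        _ => None,
    }
}

fn main() {
    for name in ["parser.c", "record.h", "src/chunker.cpp", "include/chunker.hxx", "Makefile"] {
        println!("{name} -> {:?}", c_family_lang_for_path(Path::new(name)));
    }
}
```

Because `.h` routes to `c`, a header that actually contains C++ relies on the Tier 3 fallback described in the entry rather than on this mapping.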
---
## Task I: README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX
**Files:**
- Modify: `README.md` (Mermaid + ingest row), `HANDOFF.md`, `docs/ARCHITECTURE.md`, `docs/SMOKE.md`, `tasks/INDEX.md`, `tasks/p10/INDEX.md`
- [ ] **Step 1 — README.md**: Update the `kebab ingest` row's supported-langs list to include `.c` / `.h` → `code-c-ast-v1` and `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` → `code-cpp-ast-v1`. Add `--code-lang c` / `--code-lang cpp` to the enumeration. Update the Mermaid `chunker[...]` node to include `code-c-ast-v1, code-cpp-ast-v1` in the brace.
- [ ] **Step 2 — HANDOFF.md**: Append `, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1 — v0.16.0)**` to the P10 row. Update the one-line summary (`한 줄 요약`) to include C/C++. Update the next candidates (`다음 후보`): drop p10-1D; remaining: P9-5 desktop / P8 audio.
- [ ] **Step 3 — docs/ARCHITECTURE.md**: code parser table row: append C + C++ row mention. Flowchart `pcode` node: append `+ P10-1D`. Directory tree chunkers list: add `code_c_ast_v1.rs` + `code_cpp_ast_v1.rs`.
- [ ] **Step 4 — docs/SMOKE.md**: Add a "## P10-1D C + C++ AST chunker" section after the P10-3 section. Walkthrough with sample.c + sample.cpp ingest + `--code-lang c` / `--code-lang cpp` search assertions. Append verification checklist entry.
- [ ] **Step 5 — tasks/INDEX.md + tasks/p10/INDEX.md**: Flip p10-1D row ⏳ → ✅ (v0.16.0).
- [ ] **Step 6**: Commit:
```bash
git add README.md HANDOFF.md docs/ARCHITECTURE.md docs/SMOKE.md tasks/INDEX.md tasks/p10/INDEX.md
git commit -m "$(cat <<'EOF'
docs(p10-1d): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++). Tier 2 (k8s + dockerfile + manifest) and Tier 3
(paragraph fallback) already active. p10-1D activation + ✅ flip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task J: workspace test gate + clippy
- [ ] **Step 1**: Disk check (`df -h /`) + optional `cargo clean`.
- [ ] **Step 2**: `cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -80`. Expected: all PASS.
- [ ] **Step 3**: `cargo clippy --workspace --all-targets -- -D warnings 2>&1 | tail -30`. Expected: clean.
---
## Task K: version bump + gitea PR + release
**Files:**
- Modify: `Cargo.toml`
- [ ] **Step 1**: Workspace `version = "0.15.0"` → `"0.16.0"`.
- [ ] **Step 2**: `cargo build -p kebab-cli` to refresh Cargo.lock.
- [ ] **Step 3**: Commit:
```bash
git add Cargo.toml Cargo.lock
git commit -m "$(cat <<'EOF'
chore: bump version 0.15.0 → 0.16.0 (p10-1d C + C++ AST chunkers)
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
major bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
- [ ] **Step 4**: Push branch + open gitea PR via REST API. Title: `feat(p10-1d): C + C++ AST chunkers — P10 Tier 1 chunker family complete`.
- [ ] **Step 5**: Wait for code-reviewer APPROVE → merge via gitea REST API → cut `gitea-release v0.16.0`.
---
## Verification matrix
| Check | Command | Expected |
|------|------|------|
| C symbol | `kebab search --code-lang c --json` | `Citation::Code.symbol = "<fn_name>"` |
| C++ symbol | `kebab search --code-lang cpp --json` | `Citation::Code.symbol = "namespace::Class::method"` |
| .h fallback | `.h` with C++ syntax → ingest | Tier 3 fallback: `chunker_version = "code-text-paragraph-v1"`, lang = c |
| code_lang_breakdown | `kebab schema --json` | `"c": N`, `"cpp": M` |
---
## Risks reminder (watch for these during implementation)
- **tree-sitter grammar version resolution**: use grammars compatible with tree-sitter 0.26; default to the latest crates.io version.
- **tree-sitter-cpp node-kind names**: verify by parsing the fixture that the spec's assumptions (`namespace_definition`, `class_specifier`, `function_definition`, `template_declaration`, `concept_definition`, etc.) match the actual grammar.
- **Prefix recovery for out-of-class method definitions**: the declarator of `void Foo::bar()` is `function_declarator > qualified_identifier > namespace_identifier "Foo" + identifier "bar"`. The spec's `extract_fn_symbol` must walk this chain exactly.
- **Operator overload**: tree-sitter-cpp represents it as `operator_name` or as a `field_identifier` of the form "operator+". Verify with the fixture.
- **Post-merge deviations** go into `tasks/HOTFIXES.md` as dated log entries.


@@ -1545,6 +1545,16 @@ transitional form)'s source of truth.
**p10-1B activation (Python / TypeScript / JavaScript) (2026-05-20)**: The Python (`code-python-ast-v1`, `.py`), TypeScript (`code-ts-ast-v1`, `.ts`/`.tsx`), and JavaScript (`code-js-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx`) AST chunkers are active. The symbol path derives a module-path prefix from the workspace path: Python uses dotted form (e.g. `kebab_eval.metrics.compute_mrr`), TypeScript/JavaScript use slash-style (e.g. `src/Foo.Foo.search`). The inconsistency with Rust 1A-2's file-scope-only symbols is accepted (HOTFIXES 2026-05-20). Expression-level functions (`const foo = () => {}`) are treated as glue (HOTFIXES 2026-05-20).
**p10-1C-Go activation (Go) (2026-05-20)**: The Go (`code-go-ast-v1`, `.go`) AST chunker is active. Symbols take the form `<package>.<Func>` / `<package>.(*Receiver).<Method>`.
**p10-1C-JavaKotlin activation (Java + Kotlin) (2026-05-20)**: The Java (`code-java-ast-v1`, `.java`) and Kotlin (`code-kotlin-ast-v1`, `.kt`/`.kts`) AST chunkers are active. Symbols take the form `com.foo.Foo.bar` (package + class + method/field). The Kotlin grammar uses `tree-sitter-kotlin-ng` (the bare `tree-sitter-kotlin` crate is pinned to tree-sitter 0.21~0.23 and cannot be used).
**p10-2 activation (Tier 2 chunkers) (2026-05-20)**: Three Tier 2 resource-aware chunkers are active: k8s-manifest-resource-v1 (`.yaml`/`.yml`), dockerfile-file-v1 (`Dockerfile`), manifest-file-v1 (config files such as `Cargo.toml`). Additional code_lang mappings: XML (`.xml`, `pom.xml`), Groovy (`build.gradle`, `.gradle`), Go module (`go.mod`).
**p10-3 activation (Tier 3 paragraph fallback) (2026-05-21)**: The Tier 3 chunker `code-text-paragraph-v1` is active. Shell scripts (`.sh`/`.bash`/`.zsh`) are routed to it directly, and it retries as an automatic fallback whenever Tier 1/2 yields 0 chunks or an Err. Non-k8s YAML, invalid YAML, and AST failure cases are all picked up. lang preserves the input (shell → "shell", yaml → "yaml", etc.); symbol is always None.
**p10-1D activation (C + C++) (2026-05-21)**: The P10 Tier 1 chunker family is complete: the C (`code-c-ast-v1`, `.c`/`.h`) and C++ (`code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`) AST chunkers are active. C symbol = function name only (no nesting); C++ symbol = `namespace::Class::method` (recursive namespace + class nesting). When a `.h` file contains C++ syntax, the tree-sitter-c parse fails and the p10-3 Tier 3 fallback picks the file up automatically.
### 10.2 MCP server transport (fb-30)
`kebab mcp` is a stdio JSON-RPC server. Rust SDK = `rmcp 1.6`. Tool surface


@@ -237,6 +237,9 @@ pub struct Metadata {
- Dockerfile (`Dockerfile`, `*.dockerfile`) → `dockerfile`
- TOML (`.toml`) → `toml`
- JSON (`.json`) → `json`
- XML (`.xml`, `pom.xml`) → `xml`
- Groovy (`build.gradle`, `.gradle`) → `groovy`
- Go module (`go.mod`) → `go-mod`
- Shell (`.sh`, `.bash`, `.zsh`) → `shell`
- Make (`Makefile`, `*.mk`) → `make`
- Unsupported / Tier 3 fallback → null


@@ -14,12 +14,18 @@
"schema_version": { "const": "reset_report.v1" },
"scope": {
"type": "string",
"enum": ["all", "data_only", "vector_only", "config_only"]
"enum": ["all", "data_only", "vector_only", "config_only", "orphans_only"]
},
"removed_paths": {
"type": "array",
"items": { "type": "string" }
},
"embedding_rows_truncated": { "type": "integer", "minimum": 0 }
"embedding_rows_truncated": { "type": "integer", "minimum": 0 },
"orphans_purged": { "type": "integer", "minimum": 0, "default": 0 },
"purged_paths": {
"type": "array",
"items": { "type": "string" },
"default": []
}
}
}


@@ -14,6 +14,34 @@ historical contract that was implemented; this file accumulates the
deltas so phase 5+ readers can find the live behavior without diffing
git history.
## 2026-05-21 — p10-2: k8s multi-resource YAML chunk_id collision
**Origin**: P10 comprehensive dogfooding (`/tmp/kebab-p10-dogfood/`, 16 files). A YAML file containing 2+ k8s documents (Deployment + Service, separated by `---`) failed to ingest.
**Symptom**: `DocumentStore::put_chunks (code): UNIQUE constraint failed: chunks.chunk_id`. The document row is created but with 0 chunks, so the file is unsearchable. The p10-2 integration test `tier2_k8s_yaml_ingest_searchable` used only a single-Deployment fixture, so the bug went undetected.
**Root cause**: The non-oversize branch of `tier2_shared::push_chunks_with_oversize` hardcoded `split_key = None`. `K8sManifestResourceV1Chunker` calls it once per resource, so every resource in the same document shares `doc_id` + `chunker_version` + `base_policy_hash`, and with `split_key = None` they all get the same `id_hash` and therefore the same `chunk_id`. p10-3's `code_text_paragraph_v1` had the same bug, fixed in `df3c5b8`, but that fix covered the path that calls `build_chunk_no_symbol` directly; the `push_chunks_with_oversize` path was never fixed.
**Fix** (PR #158, v0.16.1): `push_chunks_with_oversize` gains a `base_split_key: Option<u32>` parameter. The k8s chunker passes `Some(resource.line_start)`, giving each resource a distinct chunk_id. dockerfile / manifest pass `None` (1 chunk per file, no collision, chunk_id unchanged).
**Deviation note**: The chunk_id of single-resource k8s YAML also changes, from `None` to `Some(1)` (`id_hash`: `base_policy_hash` → `base_policy_hash#L1`). `chunker_version` (`k8s-manifest-resource-v1`) is deliberately not bumped: p10-2 merged in v0.14.0 (about a week ago) and is still at the dogfood stage, so no prod KB exists. A KB that indexed single-resource k8s files between v0.14.0 and v0.16.0 may be left with orphaned old chunks on re-ingest (not a UNIQUE collision, since the ids differ), but `kebab reset` or the re-ingest sweep cleans them up. At a dogfood-only stage this is a lighter choice than a chunker_version bump (a full re-process).
Cross-link: `tasks/p10/p10-2-tier2-resource-aware.md` Risks/notes section.
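To make the collision concrete, here is a minimal std-only sketch. The `chunk_id` and `resource_ids` functions, the literal ids, and the use of `DefaultHasher` are assumptions for illustration; the real hashing scheme is not shown in this entry. Only the inputs (`doc_id`, `chunker_version`, `base_policy_hash`, `split_key`) and the `#L<n>` suffix shape come from the note above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the chunk id derivation (not the project's code).
fn chunk_id(
    doc_id: &str,
    chunker_version: &str,
    base_policy_hash: &str,
    split_key: Option<u32>,
) -> u64 {
    // Mirror the `base_policy_hash` vs `base_policy_hash#L<n>` shape.
    let id_hash = match split_key {
        Some(line) => format!("{base_policy_hash}#L{line}"),
        None => base_policy_hash.to_string(),
    };
    let mut hasher = DefaultHasher::new();
    (doc_id, chunker_version, id_hash.as_str()).hash(&mut hasher);
    hasher.finish()
}

/// One id per resource. With `keyed = false` every resource passes
/// `split_key = None`, which reproduces the pre-fix behavior.
fn resource_ids(resource_line_starts: &[u32], keyed: bool) -> Vec<u64> {
    resource_line_starts
        .iter()
        .map(|&line| {
            let key = if keyed { Some(line) } else { None };
            chunk_id("doc-1", "k8s-manifest-resource-v1", "policy-abc", key)
        })
        .collect()
}
```

Calling `resource_ids(&[1, 20], false)` yields two identical ids (the UNIQUE violation), while `resource_ids(&[1, 20], true)` yields distinct ones because each resource's starting line enters the hash input.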
## 2026-05-21 — p10-1D: typedef-wrapped struct/enum in C falls into glue
**Origin**: PR #156 (p10-1d) code-reviewer review. Verified during dogfood.
**Symptom**: `typedef struct { ... } Foo;` in a `.c` file does NOT emit a struct-level unit. tree-sitter-c classifies the construct as a top-level `type_definition` with an *anonymous* inner `struct_specifier` (no `name` field), so the extractor's `struct_specifier` arm doesn't fire — the whole declaration falls into `<top-level>` glue. The named typedef alias `Foo` is therefore not searchable as a symbol.
**Status**: Consistent with spec p10-1d-c-cpp-ast-chunker.md's Risks/notes ("Anonymous union / struct … anonymous → glue"), but the spec's main body line 22 ("struct_specifier (named, top-level) → 1 unit") suggests this idiom WOULD emit. Tension noted, not yet fixed.
**Workaround**: search the struct by its field/function names, or use `--code-lang c` to broaden scope. Typedef-aliased struct names won't surface as `Citation::Code.symbol`.
**Next step**: dogfood real C code for a week+; if this turns out to be a frequent pain point (kernel-style code, libuv, etc.), revisit the extractor to detect `type_definition` → inner `struct_specifier` and emit a synthetic unit named after the typedef alias.
Cross-link: `tasks/p10/p10-1d-c-cpp-ast-chunker.md` Risks/notes section.
## 2026-05-20 — p10-1B: Rust 1A-2 symbol path is file-scope-only; 1B+ uses workspace path → module prefix
**What changed**: The symbols produced by P10-1A-2's Rust `code-rust-ast-v1` chunker use only file-scope mod-path nesting (e.g. `Foo::double`). From P10-1B onward, Python / TypeScript / JavaScript symbols include a module-path prefix derived from the workspace path (e.g. `kebab_eval.metrics.compute_mrr`, `src/Foo.Foo.search`).


@@ -142,10 +142,11 @@ P0~P5 run serially. P6~P9 can run in parallel after P5.
- [p10-1A-1 code ingest framework](p10/p10-1a-1-code-ingest-framework.md): ✅ merged
- [p10-1A-2 Rust AST chunker](p10/p10-1a-2-rust-ast-chunker.md): ✅ merged
- [p10-1B Python + TS/JS AST chunkers](p10/p10-1b-py-ts-js-ast-chunkers.md): 🟡 PR open (code complete, awaiting merge)
- p10-1C Go + Java + Kotlin AST chunkers
- p10-1D C + C++ AST chunkers: ⏳
- p10-2 Tier 2 resource-aware: ⏳
- p10-3 Tier 3 paragraph + line-window fallback: ⏳
- p10-1C-Go Go AST chunker: 🟡 PR open (v0.12.0, `code-go-ast-v1`)
- p10-1C-JavaKotlin Java + Kotlin AST chunkers: 🟢 PR open (v0.13.0, `code-java-ast-v1` / `code-kotlin-ast-v1`)
- p10-1D C + C++ AST chunkers: ✅ merged (v0.16.0, `code-c-ast-v1` + `code-cpp-ast-v1`)
- p10-2 Tier 2 resource-aware: ✅ merged (v0.14.0, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1`)
- p10-3 Tier 3 paragraph + line-window fallback: ✅ merged (v0.15.0, `code-text-paragraph-v1`)
## Post-merge hotfixes


@@ -5,9 +5,10 @@
| 1A-1 | code ingest framework (wire schema, parse-code crate skeleton, filter flags, skip policy, config section) | ✅ merged |
| 1A-2 | Rust AST chunker | ✅ merged |
| 1B | Python + TS/JS AST chunkers | 🟡 PR open (code complete, awaiting merge) |
| 1C | Go + Java + Kotlin AST chunkers | ⏳ |
| 1D | C + C++ AST chunkers | ⏳ |
| 2 | Tier 2 resource-aware (k8s / Dockerfile / manifest) | ⏳ |
| 3 | Tier 3 paragraph + line-window fallback | ⏳ |
| 1C-Go | Go AST chunker (`code-go-ast-v1`) | 🟡 PR open (v0.12.0) |
| 1C-JavaKotlin | Java + Kotlin AST chunkers (`code-java-ast-v1` / `code-kotlin-ast-v1`) | 🟢 PR open (v0.13.0) |
| 1D | C + C++ AST chunkers | ✅ merged (v0.16.0) |
| 2 | Tier 2 resource-aware (k8s / Dockerfile / manifest) | ✅ merged (v0.14.0) |
| 3 | Tier 3 paragraph + line-window fallback | ✅ merged (v0.15.0) |
Design: [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md)


@@ -0,0 +1,54 @@
# p10-1C-Go — Go AST chunker
**Status:** 🟡 In progress
**Contract sections:** §3.3 (chunker_version `code-go-ast-v1`), §3.4 (symbol path — Go `package.Func` / `package.(*Receiver).Method`), §3.5 (code_lang `go`, ext `.go`), §6.1 (`kebab-parse-code/src/go.rs`), §6.2 (`kebab-chunk/src/code_go_ast_v1.rs`), §9.1 (Tier 1 AST per-language + oversize fallback).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1C (Go portion; Java + Kotlin follow in a later PR).
**Plan:** [2026-05-20-p10-1c-go-ast-chunker.md](../../docs/superpowers/plans/2026-05-20-p10-1c-go-ast-chunker.md).
## Goal
Activate the Go AST chunker on top of the 1A-2 / 1B infrastructure. By user decision, the three 1C languages (Go + Java + Kotlin) are split across 2 PRs: Go differs from Java/Kotlin (the JVM family) in its method receivers and package conventions, so it gets its own PR. Go projects can be dogfooded as soon as this PR merges.
## Frozen design decisions (finalized by this task)
- **The symbol path's package prefix is extracted from the source's `package` declaration** (as in design §3.4). This differs from 1B's workspace-path conversion: Go has a `package` declaration in the language itself, so that is the canonical source. It is extracted from the `package_clause`, the first named child of tree-sitter-go's `source_file` root. If it is missing (invalid Go in theory, rare in practice), use `<unknown>` or the fallback `<package>` (similar to 1A's `<module>` pattern).
- **Method receiver representation** (as in the design example): `package.(*Receiver).Method` (pointer receiver), `package.(Receiver).Method` (value receiver). The type and pointer-ness are extracted from the `receiver` field of tree-sitter-go's `method_declaration`. Example: `func (m *MdHeadingV1Chunker) ChunkDoc(...)` → symbol `chunk.(*MdHeadingV1Chunker).ChunkDoc`.
- **Top-level unit kinds**:
- `function_declaration` → 1 unit, symbol `package.Func`
- `method_declaration` → 1 unit, symbol `package.(*Receiver).Method` / `package.(Receiver).Method`
- `type_declaration` (struct / interface / type alias) → 1 unit each, symbol `package.TypeName`
- `const_declaration`, `var_declaration`, `import_declaration` (block or single) → glue, grouped → `package.<top-level>` (1A/1B pattern)
- **Go generics**: type parameters in `func Foo[T any](...)` or `type Foo[T any] struct{}` are not included in the symbol (Go itself usually omits them from symbols). Just `package.Foo`.
- **Test detection**: Go's `func TestXxx(t *testing.T)` is *emitted as a regular fn*. Ranking effects such as a test-detection boost/penalty are out of scope for this task (per the memo deferring the ranking brainstorm).
- The frozen design itself is unchanged (the Go row in §3.4 already matches this decision). One line for the 1C-Go activation is added to §10.1.
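The symbol convention above can be sketched as a small formatting function. This is an illustration only: the `Receiver` enum and `go_symbol` are hypothetical names, the inputs are assumed to have already been pulled out of the tree-sitter-go nodes, and `<unknown>` is chosen here as the missing-package placeholder (the spec also allows `<package>`).

```rust
/// How a Go method receiver was declared (hypothetical helper type).
enum Receiver<'a> {
    /// `func (m *T) Method()`
    Pointer(&'a str),
    /// `func (m T) Method()`
    Value(&'a str),
}

/// Builds `package.Func`, `package.(*Receiver).Method`, or
/// `package.(Receiver).Method` from already-extracted parts.
fn go_symbol(package: Option<&str>, receiver: Option<Receiver<'_>>, name: &str) -> String {
    // Assumption: fall back to `<unknown>` when no package clause was found.
    let pkg = package.unwrap_or("<unknown>");
    match receiver {
        Some(Receiver::Pointer(ty)) => format!("{pkg}.(*{ty}).{name}"),
        Some(Receiver::Value(ty)) => format!("{pkg}.({ty}).{name}"),
        None => format!("{pkg}.{name}"),
    }
}
```

Because only the `name` field is passed in, generic type parameters never reach the symbol, which matches the generics decision above.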
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` passes (memory-conscious: mostly per-crate runs; the full-suite gate runs once, right before the docs task).
- `cargo clippy --workspace --all-targets -- -D warnings` passes.
- Go fixture (`tests/fixtures/sample.go`) ingest → stable chunk snapshot, and the `Citation::Code` symbol matches the §3.4 convention (`pkg.Func` / `pkg.(*Receiver).Method`).
- With a Go file placed in an isolated TempDir KB, `kebab search --code-lang go --json` returns `Citation::Code { lang: "go", symbol: "...", ... }`.
- `kebab schema --json | jq .stats.code_lang_breakdown` shows a `"go"` count.
- README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- One line added to frozen design §10.1.
- workspace `Cargo.toml` minor bump (0.11.1 → 0.12.0).
## Allowed dependencies
- Add `tree-sitter-go` to `kebab-parse-code` (workspace deps). Existing deps unchanged.
- New `kebab-chunk` module `code_go_ast_v1.rs`: kebab-core + serde_json_canonicalizer + blake3 + anyhow + tracing. Importing tree-sitter is strictly forbidden.
- `kebab-app`, `kebab-source-fs` change, with no new crate deps.
## Forbidden dependencies
- `kebab-chunk` must not import `tree-sitter-go` directly.
- UI crates must not import `kebab-parse-code` directly.
- `kebab-parse-code` must not import store / embed / llm / rag directly.
## Risks / notes
- Whether tree-sitter-go's `package_clause` node is the first named child of the root may vary by grammar version; it is safer for the extractor to iterate over all named_children of `source_file` and take the first `package_clause`.
- Pointer-ness of a `method_declaration` receiver: in the tree-sitter-go AST, a receiver type that is a `pointer_type` node gives `*Receiver`; a plain `type_identifier` gives `Receiver`. Exact text extraction is required.
- Generic type parameters (`[T any]`) are a child separate from the name field of method_declaration / function_declaration; extracting only the name excludes the generic part automatically.
- This is a *different* model from the 1B Python/TS/JS pattern (helpers from lang.rs): this task's mod_prefix is extracted from the source-side AST, so no helper fn is needed.
- Post-merge deviations go in `tasks/HOTFIXES.md` as a dated log, plus a one-line cross-link in this spec's `Risks / notes`.


@@ -0,0 +1,69 @@
# p10-1C-JavaKotlin — Java + Kotlin AST chunkers
**Status:** 🟡 In progress
**Contract sections:** §3.3 (chunker_version `code-java-ast-v1` + `code-kotlin-ast-v1`), §3.4 (symbol path — Java/Kotlin `package.Class.method`), §3.5 (code_lang `java` + `kotlin`, ext `.java` / `.kt` / `.kts`), §6.1 (`kebab-parse-code/src/{java,kotlin}.rs`), §6.2 (`kebab-chunk/src/code_{java,kotlin}_ast_v1.rs`), §9.1 (Tier 1 AST per-language + oversize fallback).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1C (Java + Kotlin portion; Go was completed in a separate PR, PR #151 / v0.12.0).
**Plan:** [2026-05-20-p10-1c-jk-ast-chunker.md](../../docs/superpowers/plans/2026-05-20-p10-1c-jk-ast-chunker.md).
## Goal
Sibling PR of 1C-Go (PR #151 / v0.12.0). Bundles the JVM family (Java + Kotlin) of the same 1C phase. `.java` / `.kt` / `.kts` files can be dogfooded as soon as it merges.
## Frozen design decisions (finalized by this task)
- **The symbol prefix is extracted from the source's `package` declaration** (as in design §3.4, same model as 1C-Go). This differs from 1B's workspace-path conversion.
- **Java**: tree-sitter-java's `package_declaration` → the text of the inner `scoped_identifier` or `identifier` (e.g. `com.kebab.chunk`). If absent, `<unknown>`.
- **Kotlin**: tree-sitter-kotlin's `package_header` → the `identifier` text. If absent (default package), `<unknown>`.
- **Symbol format** (design §3.4): `package.Class.method`. Example: `com.kebab.chunk.MdHeadingV1Chunker.chunkDoc`.
- **Java AST mapping**:
- `class_declaration` (name) → 1 unit + recurse body
- `interface_declaration` (name) → 1 unit + recurse
- `enum_declaration` (name) → 1 unit
- `record_declaration` (Java 14+) (name) → 1 unit
- `annotation_type_declaration` → 1 unit
- Inside class/interface/enum: `method_declaration` (name) → unit `package.Class.method` (class nesting like 1B Python)
- `import_declaration`, `package_declaration` themselves → glue `<top-level>`
- No top-level fn (Java has none)
- **Kotlin AST mapping**:
- `class_declaration` (name) → 1 unit + recurse class_body. `data class` / `sealed class` / `enum class` use the same node.
- `object_declaration` (name) → 1 unit + recurse class_body (singleton)
- `function_declaration` (name), **allowed at top level** → unit `package.fnName`. Inside a class: `package.Class.method`.
- `property_declaration` at top-level → glue
- `interface` (in tree-sitter-kotlin usually a `class_declaration` with an `interface` modifier, or a separate node) → 1 unit
- `import_header`, `package_header` themselves → glue `<top-level>`
- **Glue grouping**: same as the 1B Python / 1C-Go pattern: imports + everything else → a single `<top-level>` (or `<module>` post-pass if file has zero real units).
- **Tree-sitter Kotlin crate choice**: use the best-maintained tree-sitter-kotlin crate (`tree-sitter-kotlin` or a fork). Check for an active maintainer when resolving.
- The frozen design itself is unchanged; one line for the 1C-JK activation goes in §10.1.
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` passes.
- `cargo clippy --workspace --all-targets -- -D warnings` passes.
- One fixture each for Java and Kotlin (`tests/fixtures/sample.java`, `tests/fixtures/sample.kt`): ingest → stable chunk snapshot, and symbols match the §3.4 convention.
- With `.java` / `.kt` files placed in an isolated TempDir KB, `kebab search --code-lang java --json` / `--code-lang kotlin --json` return `Citation::Code`.
- `kebab schema --json | jq .stats.code_lang_breakdown` shows `"java"` + `"kotlin"` counts.
- README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- One line in frozen design §10.1.
- workspace `Cargo.toml` minor bump (0.12.0 → 0.13.0).
## Allowed dependencies
- Add `tree-sitter-java` + `tree-sitter-kotlin` to `kebab-parse-code`. Existing deps unchanged.
- Two new `kebab-chunk` modules (`code_java_ast_v1.rs`, `code_kotlin_ast_v1.rs`) with a language-agnostic body. Importing tree-sitter is forbidden.
- `kebab-app`, `kebab-source-fs`: no new crate deps.
## Forbidden dependencies
- `kebab-chunk` must not import tree-sitter-java / tree-sitter-kotlin (boundary §6.3).
- UI crates must not import `kebab-parse-code` directly.
- `kebab-parse-code` must not import store / embed / llm / rag directly.
## Risks / notes
- tree-sitter-kotlin: need to choose the official or most actively maintained crate (`tree-sitter-kotlin` or a fork). Check the metadata when resolving.
- The Kotlin grammar may be updated less often than other tree-sitter-* grammars, so grammar field names may shift; pin the contract with test fixtures.
- Java record (Java 14+): a `record_declaration` node in tree-sitter-java (to be confirmed).
- Variant nodes such as Kotlin sealed class / data class / object declaration: the exact node kind names in tree-sitter-kotlin need to be confirmed (grammar.json / node-types.json).
- Inner classes inside a Java class: handled the same as the Python pattern (recursion with class name pushed).
- Kotlin top-level fns are a hybrid of 1B Python's top-level fn pattern and 1C-Go's package-prefix pattern: `package.fnName`.
- Post-merge deviations go in `tasks/HOTFIXES.md` as a dated log, with a cross-link in this spec's `Risks / notes`.


@@ -0,0 +1,120 @@
# p10-1D — C + C++ AST chunkers
**Status:** 🟡 In progress
**Contract sections:** §3.3 (chunker_version `code-c-ast-v1` + `code-cpp-ast-v1`), §3.4 (symbol path — C `func_name`, C++ `namespace::Class::method`), §3.5 (code_lang `c` + `cpp`, ext `.c`/`.h` / `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`), §6.1 (`kebab-parse-code/src/{c,cpp}.rs`), §6.2 (`kebab-chunk/src/code_{c,cpp}_ast_v1.rs`), §9.1 (Tier 1 AST per-language + oversize fallback), §10 (activation log).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1D (C + C++ portion).
**Plan:** [2026-05-21-p10-1d-c-cpp-ast-chunker.md](../../docs/superpowers/plans/2026-05-21-p10-1d-c-cpp-ast-chunker.md).
## Goal
On top of the p10-1A-2 / 1B / 1C / p10-2 / p10-3 infrastructure, activate the two C + C++ AST chunkers in a single PR. This is the last of P10's Tier 1 chunker family. `.c` / `.h` / `.cpp` / `.cc` / `.cxx` / `.hpp` / `.hh` / `.hxx` files can be dogfooded as soon as it merges.
`.h` maps to C, as the design specifies. In a C++ project, tree-sitter-c may fail to parse C++ syntax such as namespace / template in a `.h` file. On failure, the p10-3 Tier 3 fallback picks the file up automatically (already wired).
## Frozen design decisions (finalized by this task)
### C extractor (`code-c-ast-v1`)
- **Symbol** = function name only, as in design §3.4: no nesting, no namespace. Example: `parse_blocks`.
- **Top-level units**:
- `function_definition` (named) → 1 unit, symbol = function name
- `struct_specifier` (named, top-level) → 1 unit, symbol = struct name
- `enum_specifier` (named, top-level) → 1 unit, symbol = enum name
- `union_specifier` (named, top-level) → 1 unit, symbol = union name
- `declaration` (top-level: typedef / global var / fn prototype) → glue `<top-level>`
- `preproc_include` / `preproc_def` / `preproc_function_def` / `preproc_ifdef` and other preprocessor nodes → glue `<top-level>`
- **Static / extern / inline fn**: handled the same as a regular fn (the storage class qualifier is ignored; the symbol is only the fn name from the declarator).
- **Nested declarations inside an inner struct / enum** (possible in C too): 1B Python class-nesting is not applied. Inner types are uncommon in C and the outer is typically a typedef wrapper, so only top-level units are emitted.
- **Empty file or 0 units** → `<module>` post-pass (1A-2 pattern).
### C++ extractor (`code-cpp-ast-v1`)
- **Symbol** = `namespace::Class::method` (as in design §3.4). Without a namespace: `Class::method` or `func_name`. Example: `kebab::chunk::MdHeadingV1Chunker::chunk_doc`.
- **Top-level units + recursion**:
- `namespace_definition` (named) → recurse with namespace name pushed (Python class-nesting + Java/Kotlin package-prefix hybrid).
- **Anonymous namespace** (`namespace { ... }`) → push `<anonymous>` as the namespace name (consistent with Python's `<unnamed>` pattern).
- `class_specifier` / `struct_specifier` (top-level or in namespace or nested in class, named) → recurse with class name pushed.
- `function_definition` (top-level or in namespace or in class) → 1 unit, symbol per nesting (`namespace::Class::method` / `namespace::func` / `Class::method` / `func_name`).
- `template_declaration` → recurse / emit according to the inner declarator type (function template → method emit, class template → class recurse). Template type params (`<T>`, `<typename T>`) are not included in the symbol (same as the Go generics handling).
- `enum_specifier` (named) → 1 unit, symbol per nesting.
- `concept_definition` (C++20) → 1 unit, symbol per nesting (treat as type-level definition).
- `using_declaration` / `using_directive` / `preproc_include` / `preproc_def` etc. → glue `<top-level>`.
- Definitions inside an `extern "C"` block → handled as regular fns (the block itself is glue).
- **Method out-of-class definition** (defined outside the namespace in `Class::method` form): restore the prefix by following the `qualified_identifier` of tree-sitter-cpp's `function_declarator`, extracting it from the declarator's `Class::method` itself.
- **Operator overload** (`operator+`, `operator()`, etc.): symbol = `Class::operator+`, kept as-is.
- **Constructor / destructor**: symbol = `Class::Class` / `Class::~Class` (convention).
- **Empty file or 0 units** → `<module>` post-pass.
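The C++ nesting rule above amounts to keeping a stack of scope names during recursion and joining it with the leaf name. A minimal sketch, with hypothetical helper names (`cpp_symbol`, `namespace_scope`) and with the tree-sitter walk itself left out:

```rust
/// Joins the current nesting stack (namespaces and classes, outermost first)
/// with the leaf name using `::`. Hypothetical helper, not the project's code.
fn cpp_symbol(scopes: &[&str], leaf: &str) -> String {
    let mut parts: Vec<&str> = scopes.to_vec();
    parts.push(leaf);
    parts.join("::")
}

/// Scope name to push for a `namespace_definition`; an unnamed namespace
/// uses the `<anonymous>` placeholder from the decision above.
fn namespace_scope(name: Option<&str>) -> &str {
    name.unwrap_or("<anonymous>")
}
```

For an out-of-class definition the leaf would already carry its qualifier (for example `Foo::bar` taken from the declarator), so `cpp_symbol(&["kebab"], "Foo::bar")` gives `kebab::Foo::bar` without any extra case.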
### Common
- **`<top-level>` glue grouping**: everything outside semantic units (preprocessor + global vars + using declarations, etc.) → 1 glue chunk per file.
- **Oversize fallback**: same as 1A-2's `AST_CHUNK_MAX_LINES = 200`.
- **`.h` fallback guarantee**: if the C parser fails, p10-3's Tier 3 fallback wrapper (already wired) picks the file up → `Citation::Code { symbol: None, lang: "c" }` + `code-text-paragraph-v1`.
### Module layout
```
crates/kebab-parse-code/src/
├── c.rs [new] - C AST extractor (PARSER_VERSION `tree-sitter-c-<ver>`)
├── cpp.rs [new] - C++ AST extractor (PARSER_VERSION `tree-sitter-cpp-<ver>`)
└── lib.rs [edit] - pub use + expose the C_PARSER_VERSION / CPP_PARSER_VERSION constants
crates/kebab-chunk/src/
├── code_c_ast_v1.rs [new] - VERSION_LABEL `code-c-ast-v1`. 1A-2 pattern (canonical Document → Vec<Chunk>).
├── code_cpp_ast_v1.rs [new] - VERSION_LABEL `code-cpp-ast-v1`. Same pattern.
└── lib.rs [edit] - 2 pub use entries
crates/kebab-source-fs/src/media.rs [no edit needed] - the code_lang_for_path delegation pattern stays as-is (single source of truth since Task C of p10-2).
crates/kebab-parse-code/src/lang.rs [no edit needed] - mappings for `.c`/`.h`/`.cpp` etc. have existed since 1A-1.
crates/kebab-app/src/lib.rs [edit] - add "c" + "cpp" to the allowlist + 4-arm match in ingest_one_code_asset. Also add both to the tier3 fallback list.
crates/kebab-chunk/tests/ [new]
├── fixtures/sample.c - C fixture (top-level fn + struct)
├── fixtures/sample.cpp - C++ fixture (namespace + class + method)
├── code_c_ast_snapshot.rs - C snapshot test
└── code_cpp_ast_snapshot.rs - C++ snapshot test
crates/kebab-app/tests/code_ingest_smoke.rs [edit] - 2 new integration tests (c + cpp). 16 + 2 = 18.
Cargo.toml workspace.dependencies [edit] - tree-sitter-c + tree-sitter-cpp.
crates/kebab-parse-code/Cargo.toml [edit] - new entries for the 2 deps above.
```
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` PASS (memory-conscious `-j 1`).
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- C fixture (`tests/fixtures/sample.c`) + C++ fixture (`tests/fixtures/sample.cpp`) ingest → stable chunk snapshots. Every chunk in the C snapshot is `Citation::Code { lang: "c", symbol: Some(<fn|struct|enum name>), ... }`. The C++ snapshot's chunks include namespace + class nesting (`kebab::chunk::Foo::bar`).
- With `.c` / `.cpp` files placed in an isolated TempDir KB, `kebab search --code-lang c --json` / `--code-lang cpp --json` each return `Citation::Code`. Integration tests `tier1_c_ingest_searchable` + `tier1_cpp_ingest_searchable` (existing 16 + 2 = 18).
- `kebab schema --json | jq .stats.code_lang_breakdown` shows `"c"` + `"cpp"` counts (after ingesting .c/.cpp files).
- README + HANDOFF + docs/ARCHITECTURE + docs/SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- One line in the frozen design 2026-04-27 §10 activation log.
- workspace `Cargo.toml` minor bump (0.15.0 → 0.16.0), gitea-release v0.16.0.
## Allowed dependencies
- Add `tree-sitter-c` + `tree-sitter-cpp` workspace deps to `kebab-parse-code`. Existing deps unchanged.
- Two new `kebab-chunk` modules (`code_c_ast_v1.rs`, `code_cpp_ast_v1.rs`) with a language-agnostic body; importing tree-sitter is forbidden. Reuse the existing `tier2_shared::build_chunk` (pub(crate)).
- `kebab-app`, `kebab-source-fs`: no new crate deps.
## Forbidden dependencies
- `kebab-chunk` must not import tree-sitter-c / tree-sitter-cpp directly (boundary §6.3).
- `kebab-parse-code` must not import store / embed / llm / rag directly.
- UI crates (`kebab-cli` / `kebab-mcp` / `kebab-tui`) must not import `kebab-parse-code` / `kebab-chunk` directly; they go through the `kebab-app` facade only.
## Risks / notes
- **tree-sitter-c / tree-sitter-cpp compatibility**: must be compatible with tree-sitter 0.26 (the current workspace version). Resolving may require a fork that uses the `tree-sitter-language` shim (the tree-sitter-kotlin-ng pattern from 1C-JK); prefer the most actively maintained crate on crates.io. If that fails, consider a separate fork.
- **`.h` parse failure**: when the C parser meets a C++ header (`namespace`, `template`, `class`), it produces a partial parse with error nodes. 1A-2's extractor pattern ignores error nodes and continues with the recoverable parse, so the emitted result may be *incomplete*. If the chunk count drops to 0, the p10-3 Tier 3 fallback picks the file up automatically (already wired). On a partial emit, only part of the file is indexed and the Tier 3 fallback does not run. Review in HOTFIXES after dogfooding.
- **Method out-of-class definition** (`Class::method` form): restore the prefix when the declarator of tree-sitter-cpp's `function_definition` is a `qualified_identifier`. Verify with a fixture.
- **Template specialization** (`template<> class Foo<int>`): extract only the name of the `class_specifier` inside tree-sitter-cpp's `template_declaration`; only `Foo` goes into the symbol, without `<int>`. Consistent with the design's rule of ignoring generics.
- **fns inside an `extern "C"` block**: handled as regular fns. The outer wrapping block is glue.
- **Anonymous union / struct** (`struct { int x; }` inside a variable declaration): uncommon, and only named ones become units. Anonymous ones are glue.
- **typedef-wrapped struct/enum idiom** (`typedef struct { ... } Foo;`): the anonymous inner struct → glue. The named typedef alias is not captured. Review in HOTFIXES after dogfooding. See [HOTFIXES.md 2026-05-21 entry](../HOTFIXES.md).
- **Macro-heavy code** (Linux kernel, etc.): even a function-like macro such as `#define FOO(x) ...` is not recognized by the parser as a fn. It is handled as preprocessor glue, so no symbol is captured. This is intended (the parser performs no macro expansion).
- **`__attribute__((...))`** annotations: tree-sitter-c's attribute node is a sibling next to the declarator. It can be ignored and does not affect function-name extraction.
- **Fixture size**: sample.c is ~30 lines (top-level fn + struct + enum + preprocessor), sample.cpp is ~50 lines (nested namespace + class + method + template + free fn). Separate verification of the oversize fallback is already covered by 1A-2's long_section_snapshot pattern (add a separate fixture if needed).
- **Post-merge deviations** go in `tasks/HOTFIXES.md` as a dated log, with a cross-link in this spec's `Risks / notes`.


@@ -0,0 +1,121 @@
# p10-2 — Tier 2 resource-aware chunkers (k8s + Dockerfile + manifest)
**Status:** 🟡 In progress
**Contract sections:** §3.3 (chunker_version `k8s-manifest-resource-v1` + `dockerfile-file-v1` + `manifest-file-v1`), §3.4 (citation symbol: `<kind>/<namespace>/<name>` / `<dockerfile>` / `<manifest>`), §3.5 (additional code_lang mappings `xml` / `groovy` / `go-mod`), §6.1 (update `kebab-parse-code/src/lang.rs` + clean up the inline duplication in `kebab-source-fs/src/media.rs`), §6.2 (`kebab-chunk/src/{k8s_manifest_resource_v1,dockerfile_file_v1,manifest_file_v1}.rs`), §9.2 (Tier 2 definition), §10.1 (one line in the deactivation log).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1.2 (Phase 2) + §9.2.
**Plan:** [2026-05-20-p10-2-tier2-resource-aware.md](../../docs/superpowers/plans/2026-05-20-p10-2-tier2-resource-aware.md).
## Goal
On top of the p10-1A-2 / 1B / 1C infrastructure, activate the three Tier 2 resource-aware chunkers in a single PR. This is file/document-level chunking rather than AST-based, following the bundling pattern of 1B (Python+TS+JS). `.yaml` / `.yml` / `Dockerfile` / the 7 manifest kinds can be dogfooded as soon as it merges.
Non-k8s YAML (Helm values, CI yml, docker-compose, etc.) and invalid YAML are skipped in this phase; they are wired in automatically once p10-3's paragraph fallback merges.
## Frozen design decisions (finalized by this task)
### Common
- **The 3 chunkers are self-contained**. No Tier 2 extractor module is added to `kebab-parse-code`; only `code_lang_for_path` in lang.rs is updated. Since this is not AST work, the cost of the abstraction outweighs the code it would save.
- **`code_lang_for_path` = single source of truth** (design §3.5). The inline extension match in `kebab-source-fs/src/media.rs` is unified into a call to this function (a small refactor cleaning up duplication accumulated since 1A-1).
- **parser_version** = `"none-v1"` across the board: a sentinel stating explicitly that Tier 2 has no parse stage. Only the chunker_version cascade is meaningful.
- **Oversize fallback** = same policy as the AST chunkers (line-window split when `AST_CHUNK_MAX_LINES = 200` is exceeded). Guards against huge ConfigMaps / multi-stage Dockerfiles / aggregate POMs. Split chunks share the same symbol (only the line range differs).
- **Frozen design updates** (within this PR):
- Add 3 lines to the §3.5 `code_lang` mapping table:
- XML (`.xml`, `pom.xml`) → `xml`
- Groovy (`build.gradle`, `.gradle`) → `groovy`
- Go module (`go.mod`) → `go-mod`
- Add one line to the §10.1 deactivation log: "p10-2 activation: the 3 Tier 2 chunkers are active."
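The oversize fallback in the list above can be pictured as splitting a chunk's line range into windows. The sketch below is an assumption about the shape of that split (non-overlapping windows of at most 200 lines); `line_windows` is a hypothetical name and the real policy in `push_chunks_with_oversize` may differ, for example in window size or overlap.

```rust
/// Threshold named in the spec; reused here as the window size (assumption).
const AST_CHUNK_MAX_LINES: u32 = 200;

/// Splits an inclusive, 1-based line range into consecutive windows of at
/// most `AST_CHUNK_MAX_LINES` lines. A range within the limit yields one window.
fn line_windows(line_start: u32, line_end: u32) -> Vec<(u32, u32)> {
    let mut windows = Vec::new();
    let mut start = line_start;
    while start <= line_end {
        let end = line_end.min(start + AST_CHUNK_MAX_LINES - 1);
        windows.push((start, end));
        start = end + 1;
    }
    windows
}
```

Each window would become one chunk carrying the same symbol, so a 450-line ConfigMap would surface as three citations that differ only in their line range.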
### k8s-manifest-resource-v1
- **Trigger**: `MediaType::Code("yaml")` (= `.yaml` / `.yml`).
- **k8s identification**: a YAML document counts only if its top-level mapping has both `apiVersion: <string>` and `kind: <string>`. If either is missing or not a string, that document is skipped (not the whole file; other documents are processed normally).
- **Multi-document split implementation**: the multi-document iterator of `serde_yaml::Deserializer::from_str` does not expose line offsets, so the original text is pre-split on lines matching the regex `^---\s*$` and each slice is deserialized. line_start/line_end are tracked during the pre-split. The empty slice after a trailing `---` is skipped.
- **Symbol**: `<kind>/<metadata.namespace>/<metadata.name>` (when a namespace is present), `<kind>/<metadata.name>` (cluster-scoped), or `<kind>/<unnamed>` (name missing). Examples: `Deployment/prod/api-server`, `ClusterRole/cluster-admin`, `ConfigMap/<unnamed>`.
- **Chunk text**: the original text of the pre-split slice, verbatim (not the deserialized form; the original is preserved).
- **Citation**: `Citation::Code { path, line_start, line_end, symbol: Some(<above>), lang: Some("yaml") }`.
- **Failure modes**:
- Invalid YAML (any document fails to deserialize) → the whole file emits 0 chunks + warning log `invalid yaml: {path}`. Picked up by p10-3's paragraph fallback.
- 0 recognized documents (all non-k8s) → the whole file emits 0 chunks. Same fallback.
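The pre-split and symbol rules above can be sketched with the standard library alone. This is an illustration under assumptions: `DocSlice`, `pre_split`, `flush`, and `k8s_symbol` are hypothetical names, the separator test approximates `^---\s*$` without a regex crate, and YAML deserialization plus the `apiVersion`/`kind` check are left out. A missing name is mapped to `<kind>/<unnamed>` even when a namespace is present, which the spec does not spell out.

```rust
/// A raw YAML document slice with 1-based, inclusive line bounds.
#[derive(Debug, PartialEq)]
struct DocSlice {
    line_start: u32,
    line_end: u32,
    text: String,
}

/// Pushes the buffered lines as one slice, skipping blank-only slices
/// (for example the empty one after a trailing `---`).
fn flush(out: &mut Vec<DocSlice>, buf: &mut Vec<&str>, start: u32) {
    if buf.iter().any(|line| !line.trim().is_empty()) {
        out.push(DocSlice {
            line_start: start,
            line_end: start + buf.len() as u32 - 1,
            text: buf.join("\n"),
        });
    }
    buf.clear();
}

/// Splits on separator lines (`---` plus optional trailing whitespace)
/// while tracking where each slice starts and ends in the original file.
fn pre_split(source: &str) -> Vec<DocSlice> {
    let mut out = Vec::new();
    let mut buf: Vec<&str> = Vec::new();
    let mut start = 1u32;
    let mut line_no = 0u32;
    for line in source.lines() {
        line_no += 1;
        if line.starts_with("---") && line[3..].trim().is_empty() {
            flush(&mut out, &mut buf, start);
            start = line_no + 1;
        } else {
            buf.push(line);
        }
    }
    flush(&mut out, &mut buf, start);
    out
}

/// Formats the citation symbol from already-extracted metadata fields.
fn k8s_symbol(kind: &str, namespace: Option<&str>, name: Option<&str>) -> String {
    match (namespace, name) {
        (Some(ns), Some(n)) => format!("{kind}/{ns}/{n}"),
        (None, Some(n)) => format!("{kind}/{n}"),
        (_, None) => format!("{kind}/<unnamed>"),
    }
}
```

Since each slice keeps its own `line_start`, that value is also what the k8s chunker can pass as `base_split_key` to keep sibling resources' chunk ids distinct.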
### dockerfile-file-v1
- **Trigger**: `MediaType::Code("dockerfile")`: the file name is exactly `Dockerfile`, or has the prefix `Dockerfile.` (e.g. `Dockerfile.dev`), or has the extension `.dockerfile` (e.g. `myapp.dockerfile`).
- **Algorithm**: whole file text → emit 1 chunk.
- **Symbol**: always `<dockerfile>`.
- **Citation**: `Citation::Code { path, line_start: 1, line_end: <EOF>, symbol: Some("<dockerfile>"), lang: Some("dockerfile") }`.
### manifest-file-v1
- **Trigger**: the file name is one of the 7 kinds in design §9.2:
| basename | code_lang |
|----------------|-----------|
| `Cargo.toml` | `toml` |
| `pyproject.toml` | `toml` |
| `package.json` | `json` |
| `tsconfig.json`| `json` |
| `go.mod` | `go-mod` |
| `pom.xml` | `xml` |
| `build.gradle` | `groovy` |
- **Excluded**: `build.gradle.kts` is handled by 1C-JK's Kotlin AST chunker (code-kotlin-ast-v1), so it is not a target of this chunker.
- **Algorithm**: whole file text → emit 1 chunk.
- **Symbol**: always `<manifest>` (all 7 kinds). Manifest kinds are distinguished by `code_lang`: e.g. `--code-lang go-mod` matches only go.mod, `--code-lang toml` matches Cargo.toml + pyproject.toml.
- **Citation**: `Citation::Code { path, line_start: 1, line_end: <EOF>, symbol: Some("<manifest>"), lang: Some(<mapping above>) }`.
### Routing (kebab-app::ingest_one_code_asset)
Add Tier 2 branches alongside the existing 7-arm AST match:
```text
"rust" | "python" | "typescript" | "javascript"
| "go" | "java" | "kotlin" → existing AST chunker (1A-2 / 1B / 1C)
"yaml" → k8s_manifest_resource_v1
"dockerfile" → dockerfile_file_v1
"toml" | "json" | "xml"
| "groovy" | "go-mod" → manifest_file_v1
_ → skip (reserved for the p10-3 fallback)
```
Lookup order of `code_lang_for_path`: basename match first (`Cargo.toml` / `Dockerfile.*` / etc.) → extension fallback (`.yaml` / `.toml` / etc.).
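The basename-first lookup order can be sketched as follows. `code_lang_sketch` is a hypothetical stand-in, not the real `code_lang_for_path`: it covers only the Tier 2 names and a few extensions mentioned in this spec, and the actual function's signature and full mapping are not shown here.

```rust
use std::path::Path;

/// Hypothetical sketch of the lookup order: exact or prefixed basename
/// first, then the file extension.
fn code_lang_sketch(path: &Path) -> Option<&'static str> {
    let base = path.file_name()?.to_str()?;

    // 1. Basename match takes priority over the extension.
    let by_name = match base {
        "Cargo.toml" | "pyproject.toml" => Some("toml"),
        "package.json" | "tsconfig.json" => Some("json"),
        "go.mod" => Some("go-mod"),
        "pom.xml" => Some("xml"),
        "build.gradle" => Some("groovy"),
        "Dockerfile" => Some("dockerfile"),
        b if b.starts_with("Dockerfile.") => Some("dockerfile"),
        _ => None,
    };
    if by_name.is_some() {
        return by_name;
    }

    // 2. Extension fallback (partial list, for illustration).
    match path.extension()?.to_str()? {
        "yaml" | "yml" => Some("yaml"),
        "toml" => Some("toml"),
        "json" => Some("json"),
        "xml" => Some("xml"),
        "gradle" => Some("groovy"),
        "dockerfile" => Some("dockerfile"),
        "kt" | "kts" => Some("kotlin"),
        _ => None,
    }
}
```

Checking the basename first is what lets `go.mod` resolve to `go-mod` and `Dockerfile.dev` to `dockerfile` even though their extensions (`mod`, `dev`) mean nothing on their own, while `build.gradle.kts` falls through to the Kotlin mapping as the exclusion above requires.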
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` passes (memory-conscious: mostly per-crate runs; the full-suite gate runs once, right before the docs task).
- `cargo clippy --workspace --all-targets -- -D warnings` passes.
- Snapshot tests for each chunker are stable:
- `crates/kebab-chunk/tests/fixtures/sample.yaml`: 2 k8s docs (Deployment + Service) + 1 non-k8s doc (apiVersion missing) → 2 chunks emitted, non-k8s doc skipped.
- `crates/kebab-chunk/tests/fixtures/sample.dockerfile` → 1 chunk, symbol `<dockerfile>`.
- `crates/kebab-chunk/tests/fixtures/sample.Cargo.toml` + `sample.package.json` + `sample.pom.xml` + `sample.go.mod` (4 kinds) → 1 chunk each, symbol `<manifest>`, with the mapped code_lang.
- Unit test for `code_lang_for_path` basename-first matching + extension fallback.
- In an isolated TempDir KB containing a yaml file + Dockerfile + Cargo.toml, `kebab search --code-lang yaml --json` / `--code-lang dockerfile --json` / `--code-lang toml --json` each return `Citation::Code` (3 tests added to the existing `code_ingest_smoke.rs`, 12 tests total).
- `kebab schema --json | jq .stats.code_lang_breakdown` shows counts for `yaml` / `dockerfile` / `toml` / `json` / `xml` / `groovy` / `go-mod` (only those in use appear).
- README + HANDOFF + docs/ARCHITECTURE + docs/SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- Frozen design: 3 mapping lines in §3.5 + one activation line in §10.1.
- workspace `Cargo.toml` minor bump (0.13.0 → 0.14.0), gitea-release v0.14.0.
## Allowed dependencies
- `kebab-chunk`: 3 new modules (`k8s_manifest_resource_v1.rs` / `dockerfile_file_v1.rs` / `manifest_file_v1.rs`) plus dep entry `serde_yaml = { workspace = true }` (already present in the workspace). Existing deps (kebab-core / serde_json_canonicalizer / blake3 / anyhow / tracing) are kept.
- `kebab-parse-code`: only `lang.rs` is updated. No new extractor module, no new crate dep.
- `kebab-source-fs/src/media.rs`: the inline match is replaced by a call to `code_lang_for_path`. Existing deps are kept (kebab-parse-code is already a dependency).
- `kebab-app::ingest_one_code_asset`: match branches extended. No new crate dep.
## Forbidden dependencies
- `kebab-chunk` must not import store / embed / llm / rag / tree-sitter directly (boundary §6.3 is maintained).
- `kebab-parse-code` must not import store / embed / llm / rag directly.
- UI crates (`kebab-cli` / `kebab-mcp` / `kebab-tui` / `kebab-desktop`) must not import `kebab-parse-code` / `kebab-chunk` directly; they go through the `kebab-app` facade only.
## Risks / notes
- **serde_yaml provides no line offsets** → lines are tracked by splitting the original text on the regex `^---\s*$`. An empty slice after a trailing `---`, a first slice with no `---` prefix, and non-standard separators (e.g. `--- # comment`) are all verified with fixtures.
- **apiVersion / kind is not a string** (e.g. `kind: 42`): accepted only after a string check via `serde_yaml::Value::as_str()`. Non-string values are treated as non-k8s.
- **cluster-scoped resources** (Namespace, ClusterRole, ClusterRoleBinding, …): having no metadata.namespace is normal. symbol takes the form `<kind>/<name>`.
- **metadata.name missing**: abnormal, but must not panic. Fallback to `<kind>/<unnamed>` + warning log.
- **Huge ConfigMap / Helm-rendered manifest**: `AST_CHUNK_MAX_LINES = 200` oversize fallback. Split chunks share the same symbol → at search time they are either deduped or shown as two user-visible hits (same behavior as 1A-2's oversize).
- **YAML anchors / merge keys (`&`, `<<`, `*`)**: serde_yaml resolves them automatically. Under the original-text preservation policy, chunk text stays as the original (pre-resolve); parsing uses the resolved values.
- **Doc-purpose files such as `Dockerfile.example`**: caught by the extension/prefix match. This may diverge from user intent but is out of scope for this phase (skip policy is governed by 1A-1's size/built-in/generated policy). Whether to add a HOTFIXES entry is decided after dogfooding, based on the false-positive rate.
- **`pom.xml` aggregate parent POM**: very large (hundreds to thousands of lines). Split via the oversize fallback. Verified once with a large fixture.
- **`media.rs` cleanup**: the inline `match extension` duplication accumulated since 1A-1 is replaced by a call to `code_lang_for_path`. Existing unit-test behavior is preserved (the tests only check result values, so they should pass).
- **Post-merge deviations** go in the dated log in `tasks/HOTFIXES.md` + a one-line cross-link in this spec's `Risks / notes`.
- **[HOTFIXES 2026-05-21]** multi-resource k8s YAML (2+ documents) failed to ingest due to a `chunk_id` collision: the non-oversize branch of `push_chunks_with_oversize` hardcoded `split_key = None`. Fixed in PR #158 (v0.16.1) with the `base_split_key` parameter. See the 2026-05-21 entry in `tasks/HOTFIXES.md`.
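Why a missing `split_key` makes sibling resources collide can be shown with a toy id derivation. This sketch is only an analogy: the real `chunk_id` is derived with blake3 over a canonicalized policy, whereas here `std`'s `DefaultHasher` stands in, and `chunk_id` / `resource_chunk_ids` are hypothetical helpers, not the signatures in `tier2_shared`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for the id derivation: every input that is identical across
/// sibling chunks contributes nothing to uniqueness, so `split_key` is the
/// only field that can tell two resources of one document apart.
pub fn chunk_id(doc_id: &str, base_policy_hash: u64, split_key: Option<usize>) -> u64 {
    let mut hasher = DefaultHasher::new();
    doc_id.hash(&mut hasher);
    base_policy_hash.hash(&mut hasher);
    split_key.hash(&mut hasher);
    hasher.finish()
}

/// Derive one id per k8s resource in the same document.
/// `pass_split_key = false` models the pre-fix behavior (`split_key = None`);
/// `true` models passing the resource's starting line as the split key.
pub fn resource_chunk_ids(
    doc_id: &str,
    base_policy_hash: u64,
    resource_line_starts: &[usize],
    pass_split_key: bool,
) -> Vec<u64> {
    resource_line_starts
        .iter()
        .map(|&line_start| {
            let split_key = if pass_split_key { Some(line_start) } else { None };
            chunk_id(doc_id, base_policy_hash, split_key)
        })
        .collect()
}
```

With `pass_split_key = false`, a Deployment and a Service from one file hash identical inputs and get the same id, which is what surfaces as the UNIQUE constraint failure; single-chunk files have no sibling to collide with.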


@@ -0,0 +1,116 @@
# p10-3 — Tier 3 paragraph + line-window fallback chunker
**Status:** 🟡 in progress
**Contract sections:** §3.3 (chunker_version `code-text-paragraph-v1`), §3.5 (code_lang routing: `shell` activation + clarification of "unsupported / Tier 3 fallback"), §6.2 (`kebab-chunk/src/code_text_paragraph_v1.rs`), §6.3 (`tier2_shared::build_chunk` exposed as `pub(crate)`), §9.3 (Tier 3 definition), §10.1 (one deactivation log line).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1.3 (Phase 3) + §9.3.
**Plan:** [2026-05-20-p10-3-tier3-paragraph-fallback.md](../../docs/superpowers/plans/2026-05-20-p10-3-tier3-paragraph-fallback.md).
## Goal
Enable the Tier 3 paragraph fallback chunker on top of the p10-1A-2 / 1B / 1C / 1A-1 framework + the p10-2 Tier 2 infrastructure. Single PR. From the moment it merges:
- `.sh` / `.bash` / `.zsh` files are indexed per paragraph.
- 0-chunk results such as p10-2's non-k8s YAML / invalid YAML / Tier 1 AST extractor failures automatically fall back to Tier 3 and get indexed, so files that were previously skipped become searchable.
## Frozen design decisions (finalized by this task)
### chunker (`code-text-paragraph-v1`)
- **Input**: `Document` with single `Block::Code { text, lang, ... }`. Same shape as Tier 2's `synthesize_tier2_document`; the fallback wrapper reuses the same doc.
- **VERSION_LABEL**: `"code-text-paragraph-v1"`.
- **Paragraph splitting**: iterate `text.lines()`. A blank line (exactly empty or whitespace-only) is a paragraph boundary. The blank line itself belongs to no paragraph (it is not included in any chunk's line range). Empty paragraphs (all whitespace) are skipped.
- **Paragraph size rules** (design §9.3 defaults as-is, hardcoded):
- paragraph line count ≤ 80 → emit 1 chunk.
- paragraph line count > 80 → line-window split with window size 80 / overlap 20 (stride 60), i.e. lines 1-80, 61-140, 121-200, …; the last window runs up to EOF (≤ 80 lines).
- `FALLBACK_LINES_PER_CHUNK = 80` and `FALLBACK_LINES_OVERLAP = 20` are both hardcoded constants (same pattern as 1A-2's `AST_CHUNK_MAX_LINES = 200`: not exposed in user config; exposure to be reconsidered in a future HOTFIXES).
- **Citation**: `SourceSpan::Code { line_start, line_end, symbol: None, lang: <input lang> }`. `symbol = None` throughout (Tier 3 does not identify semantic units). `lang` preserves the input Document's `Block::Code.lang` as-is: shell → `"shell"`, k8s skip → `"yaml"`, Rust extractor failure → `"rust"`, etc.
- **chunk_id collision prevention**: when a single paragraph is line-window split, pass `window_start` as the `split_key` of `id_for_chunk` (same as the Tier 2 `#L{k}` pattern).
- **Edge cases**:
- Entire file is blank lines only → emit 0 chunks (no fallback of the fallback). `tracing::warn!`.
- Single paragraph + ≤ 80 lines → 1 chunk, line range 1..N.
- Huge file with no blank lines (one paragraph for the whole file) → line-window split.
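The splitting and windowing rules above can be sketched as a function that returns only line ranges. This is a minimal illustration, not the chunker itself: `paragraph_ranges` and `push_windows` are hypothetical names, and building the actual `Chunk` values (text, ids, citations) is left out.

```rust
pub const FALLBACK_LINES_PER_CHUNK: usize = 80;
pub const FALLBACK_LINES_OVERLAP: usize = 20;

/// Emit one range for a short paragraph, or overlapping windows
/// (size 80, stride 60) for a long one. Ranges are 1-based, inclusive.
fn push_windows(out: &mut Vec<(usize, usize)>, start: usize, end: usize) {
    let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
    let mut window_start = start;
    loop {
        let window_end = (window_start + FALLBACK_LINES_PER_CHUNK - 1).min(end);
        out.push((window_start, window_end));
        if window_end == end {
            break; // the last window stops at the paragraph's final line
        }
        window_start += stride;
    }
}

/// Split text into paragraphs on blank (empty or whitespace-only) lines and
/// return the line range of every chunk that would be emitted.
pub fn paragraph_ranges(text: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut para_start: Option<usize> = None;
    let mut last_line = 0;

    for (idx, line) in text.lines().enumerate() {
        let line_no = idx + 1;
        last_line = line_no;
        if line.trim().is_empty() {
            // A blank line closes the open paragraph and belongs to none.
            if let Some(start) = para_start.take() {
                push_windows(&mut out, start, line_no - 1);
            }
        } else if para_start.is_none() {
            para_start = Some(line_no);
        }
    }
    // Close a paragraph that runs to the end of the file.
    if let Some(start) = para_start {
        push_windows(&mut out, start, last_line);
    }
    out
}
```

In this sketch each window's first line would serve as the `split_key`, so windows of one paragraph get distinct ids.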
### Routing / fallback wrapper
- **`code_lang_for_path`**: unchanged (the shell mapping has existed since 1A-1).
- **`ingest_one_code_asset` allowlist** (`crates/kebab-app/src/lib.rs:953`): add `"shell"`.
- **4-arm match (parser_version / chunker_version / extract / chunks)**: add a `"shell"` arm:
- parser_version = `"none-v1"` (reuses the Tier 2 sentinel).
- chunker_version = `CodeTextParagraphV1Chunker.chunker_version()`.
- extract = `synthesize_tier2_document(asset, &bytes, "shell", &parser_version)?` (reused).
- chunks = `CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)?`.
- **Fallback wrapper** (the core new logic): post-processing right after the chunks match:
- If the Tier 1/2 lang result is `Err(_)` or `Ok(empty_vec)`, retry with Tier 3.
- On retry:
- swap `chunker_version` to `code-text-paragraph-v1` (for correct downstream stamping).
- swap `canonical.parser_version` to `"none-v1"` (Tier 1's `RUST_PARSER_VERSION` etc. would be misleading).
- run `CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)`.
- The failure reason is logged with `tracing::warn!("tier1/2 emitted 0 chunks or errored for {workspace_path} ({code_lang}); falling back to tier 3")`.
- If **Tier 3 itself yields 0 chunks or Err**, it fails/skips as-is (no fallback of the fallback).
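The control flow of the fallback wrapper can be sketched in isolation. Everything here is a simplified stand-in: `Chunk`, `IngestOutcome`, and `with_tier3_fallback` are hypothetical, errors are plain `String`s rather than `anyhow`, and the Tier 3 chunker is passed in as a closure so the sketch stays self-contained.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub text: String,
}

/// What gets stamped downstream after chunking. (Hypothetical type.)
#[derive(Debug, PartialEq)]
pub struct IngestOutcome {
    pub chunks: Vec<Chunk>,
    pub chunker_version: String,
    pub parser_version: String,
}

pub const TIER3_CHUNKER_VERSION: &str = "code-text-paragraph-v1";
pub const TIER3_PARSER_VERSION: &str = "none-v1";

/// Keep the primary (Tier 1/2) result when it produced chunks; otherwise
/// retry once with Tier 3 and swap both version stamps.
pub fn with_tier3_fallback<F>(
    primary: Result<Vec<Chunk>, String>,
    primary_chunker_version: &str,
    primary_parser_version: &str,
    tier3: F,
) -> Result<IngestOutcome, String>
where
    F: FnOnce() -> Result<Vec<Chunk>, String>,
{
    match primary {
        Ok(chunks) if !chunks.is_empty() => Ok(IngestOutcome {
            chunks,
            chunker_version: primary_chunker_version.to_string(),
            parser_version: primary_parser_version.to_string(),
        }),
        // Err(_) or Ok(empty): the real code logs a warning here.
        _ => {
            // A Tier 3 error propagates and an empty result is returned
            // as-is: there is no fallback of the fallback.
            let chunks = tier3()?;
            Ok(IngestOutcome {
                chunks,
                chunker_version: TIER3_CHUNKER_VERSION.to_string(),
                parser_version: TIER3_PARSER_VERSION.to_string(),
            })
        }
    }
}
```

Because the retry happens exactly once and never re-enters the wrapper, the loop-free property called out in the risks follows directly from the structure.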
### Exposing `tier2_shared::build_chunk`
- Currently a module-private `fn build_chunk`. Tier 3 calls it to build identical Chunks (consistent hash / token / policy_hash); only the visibility changes, to `pub(crate) fn build_chunk(...)`. Signature unchanged.
### Lang preservation policy
- A Tier 3 chunk's `Citation::Code.lang` = the input Document's `Block::Code.lang`, as-is. Explicitly, as a table:
| Source | input lang | Tier 3 output lang |
|--------|-----------|----------|
| shell direct | `"shell"` | `"shell"` |
| k8s 0-chunk fallback | `"yaml"` | `"yaml"` |
| Rust AST failure fallback | `"rust"` | `"rust"` |
| manifest 0-chunk (theoretical, almost never occurs) | `"toml"` etc. | preserved |
- At search time, `--code-lang shell` / `--code-lang yaml` etc. also match fallback chunks, so the search filter behaves naturally.
### Non-scope
- **Wiring of unsupported extensions**: `.txt` / `.log` / `.scala` / `.rb` etc. are outside this PR's scope. The `code_lang_for_path` mapping is unchanged. The Tier 3 chunker itself is built now and is picked up automatically when a new lang is added to `code_lang_for_path` in the future (1A-2 pattern).
- **Config exposure**: `FALLBACK_LINES_PER_CHUNK` / `FALLBACK_LINES_OVERLAP` are hardcoded. Not exposed in config.toml.
### Frozen design updates
- `docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md` §10.1: one activation log line.
- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` §10: one activation log line.
- The §3.5 wording "unsupported / Tier 3 fallback → null" is kept as-is (that wording is the exact meaning in this phase: a Tier 3 chunk's lang preserves the input lang, so "null" applies once unsupported extensions are wired).
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` PASS (memory-conscious `-j 1`).
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- 4 new unit tests in `crates/kebab-chunk/tests/code_text_paragraph_v1.rs`:
- `shell_multi_paragraph_splits_on_blank_lines`: 3-paragraph fixture → 3 chunks, symbol=None, lang=shell, contiguous (exclusive of blank lines).
- `single_long_paragraph_line_window_split`: 200+ line single paragraph → window split, distinct chunk_ids, expected line ranges (1-80, 61-140, 121-200, …).
- `empty_file_emits_zero_chunks`: empty text → `Ok(vec![])`.
- `lang_field_preserved_from_input_doc`: lang=yaml input → emitted chunk lang=yaml.
- 2 new integration tests in `crates/kebab-app/tests/code_ingest_smoke.rs`:
- `tier3_shell_ingest_searchable`: ingest a `.sh` file → search with `--code-lang shell` → `Citation::Code { symbol: None, lang: "shell" }`, `chunker_version: "code-text-paragraph-v1"`.
- `tier3_yaml_fallback_picks_up_non_k8s_yaml`: ingest a yaml without apiVersion+kind → fallback triggers → `Citation::Code { symbol: None, lang: "yaml" }`, chunker_version `code-text-paragraph-v1`.
- Existing 12 smoke tests + 2 new = 14 in the testing surface. (Tier 1 9 + Tier 2 3 + Tier 3 2.)
- `kebab schema --json | jq .stats.code_lang_breakdown` shows a `"shell"` count (after ingesting a .sh file). Non-k8s YAML also accumulates in the `"yaml"` count (Tier 2 and Tier 3 share the same lang).
- README + HANDOFF + docs/ARCHITECTURE + docs/SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- Frozen design §10.1 + §10: one activation log line each.
- workspace `Cargo.toml` minor bump (0.14.0 → 0.15.0), gitea-release v0.15.0.
## Allowed dependencies
- New module `code_text_paragraph_v1.rs` in `kebab-chunk`: kebab-core + anyhow + tracing. Calls tier2_shared's `build_chunk` (exposed with `pub(crate)` visibility). Does not use tree-sitter / serde_yaml.
- `kebab-app::ingest_one_code_asset`: 4-arm match + allowlist + fallback wrapper extended. No new crate dep.
- `kebab-parse-code`: unchanged (the shell mapping in lang.rs has existed since 1A-1).
- `kebab-source-fs`: unchanged (media.rs already delegates to `code_lang_for_path`).
## Forbidden dependencies
- `kebab-chunk` must not import store / embed / llm / rag / tree-sitter directly (boundary §6.3 is maintained).
- UI crates (`kebab-cli` / `kebab-mcp` / `kebab-tui` / `kebab-desktop`) must not import `kebab-parse-code` / `kebab-chunk` directly; they go through the `kebab-app` facade only.
## Risks / notes
- **Preventing a fallback infinite loop**: if Tier 3 itself yields 0 chunks or Err, it fails/skips as-is; there is no fallback of the fallback. Stated explicitly in the spec.
- **`try_skip_unchanged` consistency on chunker_version swap**: after a fallback fires, the stored chunker_version = `code-text-paragraph-v1`. On the next ingest of the same file, the lookup matches on the same chunker_version (skip works). If a Tier 1 chunker starts working in the future (e.g. a tree-sitter grammar fix), the cascade rule causes an incremental cache miss → automatic reprocessing works as normal.
- **lang preservation vs. fallback semantics**: because a fallback chunk keeps the original lang, the search filter `--code-lang yaml` matches both Tier 2 and Tier 3 chunks. This is intended: when a user searches for "yaml files", all yaml results are shown.
- **Meaning of the line-window overlap**: 80/20 (stride 60) is the design §9.3 default. The same algorithm applies to a huge paragraph (e.g. single-line minified JSON), but since that is a single line, no split occurs (the threshold is 80 lines). For minified input, a single chunk ends up holding very long text; this is an inherent limitation of the paragraph-splitting policy. To be reviewed in a future HOTFIXES.
- **Blank-line handling**: lines matching `^\s*$` (whitespace-only) are paragraph boundaries. Edge cases such as tab-only lines / CR-only lines are verified with fixtures.
- **shell line-comment handling**: a `# comment` line in a shell script is an ordinary line. It does not affect paragraph splitting (it is not a blank line) and is preserved as-is inside the chunk.
- **`canonical.parser_version` mutation in the fallback wrapper**: the Document's parser_version is swapped to `"none-v1"` on Tier 3 fallback. The CanonicalDocument must be bound as `mut`. It is already `let mut canonical = match ...`, so mutation is possible. To be verified at the plan stage.
- **Post-merge deviations** go in the dated log in `tasks/HOTFIXES.md` + a cross-link in this spec's `Risks / notes`.